CityNet - Deep Learning Tools for Urban Ecoacoustic Assessment

Cities support unique and valuable ecological communities, but understanding urban wildlife is limited due to the difficulties of assessing biodiversity. Ecoacoustic surveying is a useful way of assessing habitats, where biotic sound measured from audio recordings is used as a proxy for biodiversity. However, existing algorithms for measuring biotic sound have been shown to be biased by non-biotic sounds in recordings, typical of urban environments. We develop CityNet, a deep learning system using convolutional neural networks (CNNs), to measure audible biotic (CityBioNet) and anthropogenic (CityAnthroNet) acoustic activity in cities. The CNNs were trained on a large dataset of annotated audio recordings collected across Greater London, UK. Using a held-out test dataset, we compare the precision and recall of CityBioNet and CityAnthroNet separately to the best available alternative algorithms: four acoustic indices (AIs), namely the Acoustic Complexity Index, Acoustic Diversity Index, Bioacoustic Index and Normalised Difference Soundscape Index, and a state-of-the-art bird call detection CNN (bulbul). We also compare the effect of non-biotic sounds on the predictions of CityBioNet and bulbul. Finally, we apply CityNet to describe acoustic patterns of the urban soundscape at two sites along an urbanisation gradient. CityBioNet was the best performing algorithm for measuring biotic activity in terms of precision and recall, followed by bulbul, while the AIs performed worst. CityAnthroNet outperformed the Normalised Difference Soundscape Index, but by a smaller margin than CityBioNet achieved against the competing algorithms. The CityBioNet predictions were impacted by mechanical sounds, whereas air traffic and wind sounds influenced the bulbul predictions. Across an urbanisation gradient, we show that CityNet produced realistic daily patterns of biotic and anthropogenic acoustic activity from real-world urban audio data. Using CityNet, it is possible to automatically measure biotic and anthropogenic acoustic activity in cities from audio recordings. If embedded within an autonomous sensing system, CityNet could produce environmental data for cities at large scales and facilitate investigation of the impacts of anthropogenic activities on wildlife. The algorithms, code and pre-trained models are made freely available in combination with two expert-annotated urban audio datasets to facilitate automated environmental surveillance in cities.


INTRODUCTION

Machine learning (ML) is increasingly being applied to biodiversity assessment and monitoring because it facilitates the detection and classification of ecoacoustic signals in audio data (Acevedo et al. 2009; Walters et al. 2012; Stowell & Plumbley 2014). Using annotated audio datasets of soniferous species, a ML model can be trained to recognise biotic sounds based on multiple acoustic characteristics, or features, to associate these features with taxonomic classifications, and then to assign a probabilistic classification to sounds within recordings. Acoustic indices (AIs) use only a limited number of acoustic features in their calculations, such as spectral entropy within defined frequency bands (Boelman et al. 2007; Villanueva-Rivera et al. 2011; Kasten et al. 2012) or entropy changes over time (Pieretti, Farina & Morri 2011). Additionally, the relationship between the features and the algorithm outputs is chosen by a human, rather than learned automatically from an annotated dataset. In contrast, ML algorithms can utilise many more features in their calculations, and the relationship between inputs and outputs is determined automatically based on the annotated training data provided. Convolutional Neural Networks (CNNs), a deep learning approach (LeCun, Bengio & Hinton 2015), can even choose, based on the annotations in the training dataset, the features that best discriminate different classes without these being specified a priori, and can take advantage of large quantities of training data, as their ability to outperform human-defined algorithms increases as more labelled data become available.

Species-specific ML algorithms have been developed to automatically identify the sounds emitted by a range of soniferous organisms, including birds (Stowell & Plumbley 2014) and bats (Walters et al. 2012), and a state-of-the-art CNN exists for detecting bird sounds (Grill & Schlüter 2017), but these algorithms remain untested on noisy audio data from urban environments. There are currently no algorithms that produce whole-community measures of biotic sound that are known to be suitable for use in acoustically complex urban environments.

Here, we develop the CityNet acoustic analysis system, which uses two CNNs to measure audible (0-12 kHz) biotic (CityBioNet) and anthropogenic (CityAnthroNet) acoustic activity in audio recordings from urban environments. We use this frequency range as it contains the majority of sounds emitted by audible soniferous species in the urban environment (Fairbrass et al. 2017).
The CNNs were trained using CitySounds2017, an expert-annotated dataset of urban sounds collected across Greater London, UK, that we develop here. We evaluated the performance of CityNet on a held-out dataset by comparing the algorithms' precision and recall to those of four commonly used AIs: the Acoustic Complexity Index (ACI) (Pieretti, Farina & Morri 2011), Acoustic Diversity Index (ADI) (Villanueva-Rivera et al. 2011), Bioacoustic Index (BI) (Boelman et al. 2007) and Normalised Difference Soundscape Index (NDSI) (Kasten et al. 2012), and to bulbul, a state-of-the-art algorithm for detecting bird sounds in entire audio recordings in order to summarise avian acoustic activity (Grill & Schlüter 2017). As the main focus of the study was the development of algorithms for ecoacoustic assessment of biodiversity in cities, we conducted further analysis on the two best performing algorithms for measuring biotic sound, CityBioNet and bulbul, by investigating the effect of non-biotic sounds on the accuracy of the algorithms. Finally, we applied CityNet to investigate daily patterns of biotic and anthropogenic acoustic activity at two sites along an urbanisation gradient.

MATERIALS AND METHODS
We developed two CNN models, CityBioNet and CityAnthroNet, within the CityNet system to generate measures of biotic and anthropogenic sound, respectively. The CityNet pipeline (Figure 1) consisted of seven main steps, as follows (steps 2-5 are sketched in the code examples below):

(1) Record audio: audible-frequency (0-12 kHz) .wav audio recordings were made using a passive acoustic recorder.

(2) Audio conversion to Mel spectrogram: each audio file was automatically converted to a Mel spectrogram representation with 32 frequency bins, represented as rows in the spectrogram, using a temporal resolution of 21 columns per second of raw audio. Before use in the classifier, each spectrogram S was converted to a log-scale representation using the formula log(A + B * S). For biotic sound detection the parameters A = 0.001 and B = 10.0 were used, while for anthropogenic sound detection the parameters A = 0.025 and B = 2.0 were used.

(3) Extract window from spectrogram: a single input to the CNN comprised a short spectrogram chunk W_s, 21 columns in width, representing 1 second of audio.

(4) Apply different normalisation strategies: there are many different methods for pre-processing spectrograms before they are used in ML, for example whitening (Lee et al. 2009) and subtraction of mean values along each frequency bin (Aide et al. 2013). CNNs are able to accept inputs with multiple channels of data, for example the red, green and blue channels of a colour image. We exploited the multiple input channel capability of our CNN by providing as input four spectrograms, each pre-processed using a different normalisation strategy (see Supplementary Methods), which gave considerable improvements to network accuracy above any single normalisation scheme in isolation.

(5) Apply CNN classifier: as described above, classification was performed with a CNN whose parameters were learnt from training data.

(6) Predict acoustic activity: the CNN gives, at each 1-second time step, a prediction of the presence of biotic or anthropogenic acoustic activity.

(7) Aggregate predictions: these per-time-step measures can be aggregated to give summaries of acoustic activity over time or space.

Acoustic Training Dataset

To create our training dataset (CitySounds2017_train) we randomly selected twenty-five 1-minute recordings from 70% of the study sites (44 sites, 1100 recordings). A.F. manually annotated the spectrograms of each recording, computed as the log magnitude of a discrete Fourier transform (non-overlapping Hamming window, size = 720 samples = 10 ms), using AudioTagger (available at https://github.com/groakat/AudioTagger). Spectrograms were annotated by localising the time and frequency bands of discrete sounds, drawing bounding boxes as tightly as visually possible within spectrograms displayed on a Dell UltraSharp 61 cm LED monitor. Types of sound, such as "invertebrate", "rain" and "road traffic", were identified by looking for typical patterns in spectrograms (Figure S1), and by listening to the audio samples represented in the annotated parts of the spectrogram. Categories of sounds were then grouped into biotic, anthropogenic and geophonic classes following Pijanowski et al. (2011), where we define biotic as sounds generated by non-human biotic organisms, anthropogenic as sounds associated with human activities, and geophonic as non-biological ambient sounds, e.g. wind and rain.
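To make steps (2)-(4) concrete, here is a minimal sketch of the front-end processing, assuming the librosa library. The sample rate, the non-overlapping window step and the last two normalisation strategies are illustrative assumptions, not the authors' exact implementation (which follows the Supplementary Methods).

```python
# A minimal sketch of the CityNet front end (steps 2-4), assuming librosa.
import librosa
import numpy as np

def load_log_mel(path, sr=24000, n_mels=32, cols_per_sec=21, A=0.001, B=10.0):
    """Step (2): Mel spectrogram with 32 bins and ~21 columns per second,
    log-scaled as log(A + B * S). A = 0.001, B = 10.0 for biotic detection;
    A = 0.025, B = 2.0 for anthropogenic detection. A 24 kHz sample rate
    (an assumption) covers the 0-12 kHz band analysed by CityNet."""
    y, sr = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                       hop_length=sr // cols_per_sec)
    return np.log(A + B * S)

def windows(spec, width=21):
    """Step (3): 21-column chunks, each representing 1 second of audio."""
    return [spec[:, i:i + width]
            for i in range(0, spec.shape[1] - width + 1, width)]

def four_channel(w):
    """Step (4): stack four differently normalised copies of the window as
    input channels, exploiting the CNN's multi-channel input. The text names
    whitening and per-frequency-bin mean subtraction; the other two channels
    here are placeholders for the Supplementary Methods strategies."""
    per_bin = w - w.mean(axis=1, keepdims=True)             # per-bin mean subtraction
    standard = (w - w.mean()) / (w.std() + 1e-8)            # global standardisation
    whitened = standard / (np.abs(standard).max() + 1e-8)   # crude whitening stand-in
    return np.stack([w, per_bin, standard, whitened])       # shape (4, 32, 21)
```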

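For step (5), a minimal sketch of a CNN classifier that accepts the four-channel spectrogram windows, written in PyTorch. The layer sizes shown are illustrative placeholders; the actual CityNet architecture is specified in the Supplementary Methods.

```python
# An illustrative CNN for step (5), assuming PyTorch; not the authors' design.
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, padding=1),  # 4 normalisation channels in
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> (32, 16, 10)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                     # global pooling -> (64, 1, 1)
            nn.Flatten(),
            nn.Linear(64, 1),                            # single logit
        )

    def forward(self, x):                    # x: (batch, 4, 32, 21)
        return torch.sigmoid(self.net(x)).squeeze(1)  # probability of sound presence

model = AudioCNN()
window = torch.randn(1, 4, 32, 21)           # one 1-second four-channel window
print(model(window))                         # e.g. tensor([0.49])
```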
Acoustic Testing Dataset and Evaluation
To evaluate the performance of the CityNet algorithms, we created a testing dataset (CitySounds2017_test) by strategically selecting 40 recordings from CitySounds2017, containing a range of both biotic and anthropogenic acoustic activity, from the remaining 30% of sites (19 sites). CitySounds2017_test was sampled from different recording sites to CitySounds2017_train to demonstrate that the CityNet algorithms generalise to sounds recorded at new site locations (Figure 2, Table S1). To optimise the quality of the annotations in CitySounds2017_test, we selected five human labellers to separately annotate the sounds within the audio recordings (using the same methods as above) to create a single annotated test dataset. Conflicts were resolved using a majority rule, and in cases where there was no majority, we used our own judgement on the most suitable classification. Our CitySounds2017 annotated training and testing datasets are available at https://figshare.com/s/adab62c0591afaeafedd.

Using the CitySounds2017_test dataset, we separately assessed the performance of the two CityNet algorithms, CityBioNet and CityAnthroNet, using two measures: precision and recall. The CityBioNet and CityAnthroNet algorithms give a probabilistic estimate of the level of biotic or anthropogenic acoustic activity for each 1-second audio chunk as a number between 0 and 1. Different thresholds could be used to convert these probabilities into sound category assignments (e.g. 'sound present' or 'sound absent'). At each threshold, a value of precision and recall was computed, where precision was the fraction of 1-second chunks identified by the algorithm as containing the sound that were annotated as containing it in CitySounds2017_test, and recall was the fraction of 1-second chunks annotated as containing the sound which were retrieved by the algorithm at that threshold. As the threshold was swept between 0 and 1, the resulting values of precision and recall were plotted as a precision-recall curve (see the code sketch below).

The performance of the CityNet algorithms was compared to the four AIs. The NDSI is calculated as NDSI = (NDSI_bio - NDSI_anthro) / (NDSI_bio + NDSI_anthro), where NDSI_bio and NDSI_anthro are the total biotic and anthropogenic acoustic activity in each recording, respectively. Rather than compare CityNet to the NDSI, we compared the biotic (NDSI_bio) and anthropogenic (NDSI_anthro) elements of the NDSI to the measures produced by CityBioNet and CityAnthroNet, respectively, as these were more comparable. As the AIs are all designed to give a summary of acoustic activity for an entire file, they were analysed on the CitySounds2017_test dataset by treating each 1-second chunk of audio as a separate sound file, enabling direct comparison to CityNet. The AI measures do not have a natural threshold for classification into biotic/non-biotic sound, meaning we could not calculate confusion matrices. However, thresholds swept between each AI's lowest and highest values were used to compute precision and recall and thereby form precision-recall curves. All AIs were calculated in R v.3.4.1 (R Core Team 2017) using the 'seewave' v.1.7.6 package (Sueur, Aubin & Simonis 2008).

The precision and recall of CityBioNet were also compared to those of bulbul (Grill & Schlüter 2017), an algorithm for detecting bird sounds in entire audio recordings in order to summarise avian acoustic activity, which was the winning entry in the 2016-17 Bird Audio Detection challenge (Stowell et al. 2016). Like CityNet, bulbul is a CNN-based classifier which uses spectrograms as input.
However, it does not use the same normalisation strategies as CityNet, and it was not trained on data from noisy urban environments. Bulbul was applied to each second of audio data in CitySounds2017_test, using the pre-trained model provided by the authors together with their code.
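To illustrate the threshold-sweep evaluation described above, here is a minimal sketch assuming scikit-learn. The arrays and the recall-at-0.95-precision computation are illustrative stand-ins, not the authors' evaluation code.

```python
# A minimal sketch of the precision-recall evaluation, assuming scikit-learn.
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])    # annotation: biotic sound present?
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])  # per-second probability

# Sweeping the decision threshold yields one (precision, recall) pair per
# threshold; plotting the pairs gives the precision-recall curve. The same
# sweep works for AIs, since any monotonic score can be thresholded even
# though it is not a probability.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("average precision:", average_precision_score(y_true, y_score))

# "Recall at 0.95 precision" (our reading of the reported metric): the best
# recall achieved by any threshold whose precision is at least 0.95.
mask = precision >= 0.95
print("recall at 0.95 precision:", recall[mask].max())
```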

Impact of Non-Biotic Sounds
We conducted additional analysis of the non-biotic sounds that affect the predictions of CityBioNet and bulbul, as these were found to be the best performing algorithms for measuring biotic sound. To do this, we created subsets of the CitySounds2017_test dataset comprising all the seconds that contained particular non-biotic sounds, e.g. a road-traffic subset containing all of the seconds in CitySounds2017_test where the sound of road traffic was present. We then used a Chi-squared test to identify significant differences between the full and subset datasets in the proportion of seconds in which each algorithm correctly predicted the presence/absence of biotic sound at a threshold of 0.5, and Cramér's V statistic was used to assess the effect size of the differences (Cohen 1992). These analyses were conducted in R v.3.4.1 (R Core Team 2017).
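The authors ran this analysis in R; the following is an equivalent minimal sketch in Python using scipy, with made-up counts, showing the test and effect-size computation.

```python
# Chi-squared test plus Cramer's V effect size, assuming scipy; the 2x2 table
# compares how often an algorithm's presence/absence prediction (threshold 0.5)
# was correct on the full test set versus a non-biotic-sound subset.
import numpy as np
from scipy.stats import chi2_contingency

#                  correct  incorrect
table = np.array([[850,     150],     # full CitySounds2017_test (counts invented)
                  [120,      80]])    # e.g. road-traffic subset

chi2, p, dof, expected = chi2_contingency(table)

# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1))); >0.5 is a strong
# effect and 0.1-0.5 a moderate effect under the thresholds used in the text.
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2={chi2:.2f}, p={p:.4f}, Cramer's V={v:.3f}")
```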

Ecological Application
We used CityNet to generate daily average patterns of biotic and anthropogenic acoustic activity for two study sites across an urbanisation gradient (sites E29RR and IG62XL, with high and low urbanisation respectively; Table S1). To control for the date of recording, both sites were surveyed between May and June 2015. CityNet was run over the entire 7 days of recordings from each site to predict the presence/absence of biotic and anthropogenic sound for every 1-second audio chunk using a 0.5 probability threshold. Measures of biotic and anthropogenic activity were created for each half-hour window between midnight and midnight by averaging the predicted number of seconds containing biotic or anthropogenic sound within that window over the entire week (this aggregation is sketched in code below).

RESULTS

Algorithm Performance

CityBioNet was the best performing algorithm for measuring biotic acoustic activity, achieving the highest average precision and recall at 0.95 precision (Table 1, Figure 3). In comparison, the ACI, ADI, BI and NDSI_bio had lower average precision (0.663, 0.439, 0.516 and 0.503, respectively) and lower recall at 0.95 precision (all less than 0.01). CityBioNet also outperformed bulbul, which had an average precision of 0.872 and recall at 0.95 precision of 0.398 (Table 1). In comparison to CityAnthroNet, the NDSI_anthro had a lower average precision (0.975) and lower recall at 0.95 precision (0.815).

When biotic sound was present in recordings, CityBioNet correctly predicted the presence of biotic sound (True Positives) in a greater proportion of audio data than bulbul (33.2% in comparison with 18.5%, for CityBioNet and bulbul respectively) (Figure 4). However, CityBioNet failed to correctly predict the presence of biotic sound (False Negatives) in 1.7% of the audio data, in comparison with 16.4% incorrect predictions by bulbul. When biotic sound was absent from recordings, CityBioNet correctly predicted the absence of biotic sound (True Negatives) in 51.6% of the audio data in comparison with 52.6% for bulbul, and CityBioNet failed to correctly predict the absence of biotic sound (False Positives) in 13.5% of the audio data in comparison with 12.5% for bulbul.

Impacts of Non-Biotic Sounds

CityBioNet was strongly (Cramér's V effect size >0.5) negatively affected by mechanical sound: the presence/absence of biotic sound was correctly predicted in 28.60% less of the data when mechanical sounds were also present (Table 2). Bulbul was moderately (Cramér's V effect size 0.1-0.5) negatively affected by the sound of air traffic and wind: the presence/absence of biotic sound was correctly predicted in 5.34% and 6.93% less of the data when air traffic and wind sounds, respectively, were also present in recordings.
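To make the half-hour aggregation described in the Ecological Application methods concrete, here is a minimal sketch assuming pandas; the column names, dates and DataFrame layout are illustrative assumptions rather than the authors' implementation.

```python
# Aggregating per-second presence/absence predictions into a daily pattern
# of half-hour averages over a 7-day recording week, assuming pandas.
import pandas as pd

preds = pd.DataFrame({
    "time": pd.date_range("2015-05-18", periods=7 * 24 * 3600, freq="s"),
    "biotic": 0,   # fill with CityBioNet 0/1 predictions at the 0.5 threshold
})

# Sum the predicted seconds of biotic sound in each half-hour window, then
# average each time-of-day window across the 7 recording days.
half_hourly = preds.set_index("time")["biotic"].resample("30min").sum()
daily_pattern = half_hourly.groupby(half_hourly.index.time).mean()
print(daily_pattern)   # 48 values, midnight to midnight
```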

Ecological Application
CityNet produced realistic patterns of biotic and anthropogenic acoustic activity in the urban soundscape at two study sites of low and high urban intensity (Figure 2B and C). At both sites, biotic acoustic activity peaked just after sunrise and declined rapidly after sunset. A second peak of biotic acoustic activity was recorded at sunset at the low urban intensity site but not at the high urban intensity site. At both sites, anthropogenic acoustic activity rose sharply after sunrise, remained constant throughout the day and declined after sunset.

DISCUSSION

Retraining CityNet with labelled audio data from other cities would make it possible to use the system to monitor urban biotic and anthropogenic acoustic activity more widely. However, as London is a large and heterogeneous city, CityNet has been trained using a dataset containing sounds that characterise a wide range of urban environments. Our data collection was restricted to a single week at each study site, which limits our ability to assess the ability of the CityNet system to detect environmental changes. Future work should focus on the collection of longitudinal acoustic data to assess the sensitivity of the algorithms to environmental changes. Our use of human labellers may have introduced subjectivity and bias into our dataset. The task of annotating large audio datasets from acoustically complex urban environments is highly resource intensive, a problem which has recently been tackled with citizen scientists to create the UrbanSound and UrbanSound8K datasets using audio data from New York City, USA (Salamon, Jacoby & Bello 2014). These comprise short snippets of 10 different urban sounds, such as jackhammers, engines idling and gunshots. These datasets do not fully represent the characteristics of urban soundscapes, for three reasons. Firstly, they assume only one class of sound is present at each time, while in fact multiple sound types can be present at once (consider a bird singing while an aeroplane flies overhead). Secondly, they only include anthropogenic sounds, while CityNet also measures biotic sound. Thirdly, geophonic sounds and the absence of sound are important states which are not present in UrbanSound and UrbanSound8K. Due to these factors, these datasets are unsuitable for the purposes of this research, although recent work has overcome a few of these shortcomings using synthesised soundscape data (Salamon et al. 2017). This highlights the need for an internationally coordinated effort to create a consistently labelled audio dataset from cities to support the development of automated urban environmental assessment systems with international application.

Conclusions

The CityNet system for measuring biotic and anthropogenic acoustic activity in noisy urban audio data outperformed the state-of-the-art algorithms for measuring biotic and anthropogenic sound in entire audio recordings. Integrated into an IoT network for recording and analysing audio data in cities, it could facilitate urban environmental assessment at greater scales than has been possible to date using traditional methods of biodiversity assessment.

Figure 1. The CityNet analysis pipeline for measuring biotic and anthropogenic acoustic activity. Audio is recorded (1) and converted to a Mel spectrogram (2). A sliding window is run across the time dimension, and a window of the spectrogram extracted at each step (3). This spectrogram window is pre-processed with four different normalisation strategies, and the results concatenated (4).
This stack of spectrograms is passed through a CNN (5), which was trained on CitySounds2017_train. The CNN gives, at each 1-second time step, a prediction of the presence/absence of biotic or anthropogenic acoustic activity (6). Finally, these per-time-step measures can be aggregated to give summaries over time or space (7).

Figure S1. Examples of all sound types present in CitySounds2017. 'Animal' denotes biotic sounds that could not be taxonomically identified. Unidentified sounds are not shown, due to the wide range of sound types within this group. Data are represented in spectrograms (FFT, non-overlapping Hamming window, size = 1024) where blue to yellow corresponds to sound amplitude (dB). Frequency (kHz) and time (s) are represented on the y- and x-axes, respectively. Sounds are grouped into biotic (sounds generated by non-human biotic organisms), anthropogenic (sounds associated with human activities, including human speech) and geophonic sounds.