Cooperative Scene-Event Modelling for Acoustic Scene Classification

Hou, Y; Kang, B; Mitchell, A; Wang, W; Kang, J; Botteldooren, D; (2023) Cooperative Scene-Event Modelling for Acoustic Scene Classification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32, pp. 68-82. 10.1109/TASLP.2023.3323135. Green open access

Abstract

Acoustic scene classification (ASC) can be helpful for creating context awareness for intelligent robots. Humans naturally use the relations between acoustic scenes (AS) and audio events (AE) to understand and recognize their surrounding environments. However, in most previous works, ASC and audio event classification (AEC) are treated as independent tasks, with a focus primarily on audio features shared between scenes and events, but not their implicit relations. To address this limitation, we propose a cooperative scene-event modelling (cSEM) framework to automatically model the intricate scene-event relation by an adaptive coupling matrix to improve ASC. Compared with other scene-event modelling frameworks, the proposed cSEM offers the following advantages. First, it reduces the confusion between similar scenes by aligning the information of coarse-grained AS and fine-grained AE in the latent space, and reducing the redundant information between the AS and AE embeddings. Second, it exploits the relation information between AS and AE to improve ASC, which is shown to be beneficial, even if the information of AE is derived from unverified pseudo-labels. Third, it uses a regression-based loss function for cooperative modelling of scene-event relations, which is shown to be more effective than classification-based loss functions. Instantiated from four models based on either Transformer or convolutional neural networks, cSEM is evaluated on real-life and synthetic datasets. Experiments show that cSEM-based models work well in real-life scene-event analysis, offering competitive results on ASC as compared with other multi-feature or multi-model ensemble methods. The ASC accuracy achieved on the TUT2018, TAU2019, and JSSED datasets is 81.0%, 88.9% and 97.2%, respectively.
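
The abstract does not give the architecture or loss in detail, so the following is only a minimal sketch of the general idea of coupling event and scene predictions through a learnable matrix trained with a regression (mean-squared-error) loss. Everything in it is an assumption made for illustration: the function names, the plain linear mapping from event probabilities to a scene vector, the toy inputs, and the gradient-descent update are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scene_from_events(coupling, event_probs):
    """Map event probabilities to a scene vector with an
    (n_scenes x n_events) coupling matrix. Hypothetical linear form."""
    return [sum(w * p for w, p in zip(row, event_probs)) for row in coupling]

def mse(a, b):
    """Regression loss: mean squared error between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def coupling_step(coupling, event_probs, scene_probs, lr=0.5):
    """One gradient-descent step on the MSE between the event-inferred
    scene vector and the scene branch's own prediction."""
    inferred = scene_from_events(coupling, event_probs)
    n = len(scene_probs)
    updated = []
    for row, inf, target in zip(coupling, inferred, scene_probs):
        # dL/dW[s][k] = (2 / n) * (inferred[s] - target[s]) * event_probs[k]
        scale = 2.0 * (inf - target) / n
        updated.append([w - lr * scale * p for w, p in zip(row, event_probs)])
    return updated

# Toy inputs (invented): 3 audio events, 2 acoustic scenes.
event_probs = [0.7, 0.2, 0.1]        # stand-in for event-branch output
scene_probs = softmax([2.0, 0.5])    # stand-in for scene-branch output
coupling = [[0.0] * 3 for _ in range(2)]

before = mse(scene_from_events(coupling, event_probs), scene_probs)
for _ in range(200):
    coupling = coupling_step(coupling, event_probs, scene_probs)
after = mse(scene_from_events(coupling, event_probs), scene_probs)
```

In this toy setting the regression loss shrinks as the coupling matrix adapts, so `after` ends up smaller than `before`. In a real model the two branches and the matrix would be trained jointly on many clips rather than fitted to a single pair of vectors.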

Type: Article
Title: Cooperative Scene-Event Modelling for Acoustic Scene Classification
Open access status: An open access version is available from UCL Discovery
DOI: 10.1109/TASLP.2023.3323135
Publisher version: https://doi.org/10.1109/TASLP.2023.3323135
Language: English
Additional information: This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Keywords: Acoustic scene classification, audio event classification, scene-event relation, cooperative modelling
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of the Built Environment
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of the Built Environment > Bartlett School Env, Energy and Resources
URI: https://discovery.ucl.ac.uk/id/eprint/10181262