Frame Pairwise Distance Loss for Weakly-supervised Sound Event Detection
Weakly-supervised learning has emerged as a promising approach to leverage
limited labeled data in various domains by bridging the gap between fully
supervised methods and unsupervised techniques. Acquisition of strong
annotations for detecting sound events is prohibitively expensive, making
weakly supervised learning a more cost-effective and broadly applicable
alternative. To improve the recognition rate of weakly-supervised sound event
detection, we introduce a Frame Pairwise Distance (FPD) loss branch,
complemented with a minimal amount of synthesized
data. The corresponding sampling and label processing strategies are also
proposed. Two distinct distance metrics are employed to evaluate the proposed
approach. Finally, the method is validated on the DCASE 2023 task4 dataset. The
experimental results corroborate the efficacy of this approach.
Comment: Submitted to ICASSP 202
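As a rough illustration of what a frame pairwise distance objective can look like (the sampling strategy, margin value, and function names below are assumptions for the sketch, not the authors' implementation), a contrastive-style loss over sampled frame pairs with the two distance metrics mentioned in the abstract might be:

```python
import numpy as np

def frame_pairwise_distance_loss(emb, labels, margin=1.0, metric="euclidean", rng=None):
    """Contrastive-style loss over randomly sampled frame pairs.

    emb:    (N, D) per-frame embeddings
    labels: (N,)   integer frame labels (e.g. from synthesized strongly-labelled data)
    """
    rng = rng or np.random.default_rng(0)
    n = emb.shape[0]
    ia = rng.integers(0, n, size=n)          # sample random frame pairs
    ib = rng.integers(0, n, size=n)
    a, b = emb[ia], emb[ib]
    same = (labels[ia] == labels[ib]).astype(float)
    if metric == "euclidean":
        dist = np.linalg.norm(a - b, axis=1)
    else:  # cosine distance
        cos = np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
        dist = 1.0 - cos
    # pull same-class frames together, push different-class frames apart
    loss = same * dist ** 2 + (1 - same) * np.maximum(margin - dist, 0.0) ** 2
    return loss.mean()
```

In a real system this loss branch would be added to the usual clip-level weak-label loss with a weighting coefficient.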
Cooperative Scene-Event Modelling for Acoustic Scene Classification
Acoustic scene classification (ASC) can be helpful for creating context awareness for intelligent robots. Humans naturally use the relations between acoustic scenes (AS) and audio events (AE) to understand and recognize their surrounding environments. However, in most previous works, ASC and audio event classification (AEC) are treated as independent tasks, with a focus primarily on audio features shared between scenes and events, but not their implicit relations. To address this limitation, we propose a cooperative scene-event modelling (cSEM) framework to automatically model the intricate scene-event relation by an adaptive coupling matrix to improve ASC. Compared with other scene-event modelling frameworks, the proposed cSEM offers the following advantages. First, it reduces the confusion between similar scenes by aligning the information of coarse-grained AS and fine-grained AE in the latent space, and reducing the redundant information between the AS and AE embeddings. Second, it exploits the relation information between AS and AE to improve ASC, which is shown to be beneficial, even if the information of AE is derived from unverified pseudo-labels. Third, it uses a regression-based loss function for cooperative modelling of scene-event relations, which is shown to be more effective than classification-based loss functions. Instantiated from four models based on either Transformer or convolutional neural networks, cSEM is evaluated on real-life and synthetic datasets. Experiments show that cSEM-based models work well in real-life scene-event analysis, offering competitive results on ASC as compared with other multi-feature or multi-model ensemble methods. The ASC accuracy achieved on the TUT2018, TAU2019, and JSSED datasets is 81.0%, 88.9% and 97.2%, respectively.
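The core idea of coupling scene and event predictions through a matrix can be sketched in a few lines. This is an illustrative forward pass only, assuming a learnable coupling matrix of shape (scenes × events); it is not the paper's exact formulation, and the function names are invented for the sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coupled_scene_prediction(scene_logits, event_probs, coupling):
    """Refine scene logits with event evidence via a scene-event coupling matrix.

    scene_logits: (B, S) raw scene scores
    event_probs:  (B, E) per-clip event probabilities (may come from pseudo-labels)
    coupling:     (S, E) learnable matrix encoding how strongly each event
                  supports each scene
    """
    relation = event_probs @ coupling.T      # (B, S) event-derived scene evidence
    return softmax(scene_logits + relation)  # scene posterior using both cues
```

During training, `coupling` would be updated jointly with both classifiers, which is what lets the relation adapt rather than being fixed by hand.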
Sound Event Detection and Separation: a Benchmark on DESED Synthetic Soundscapes
We propose a benchmark of state-of-the-art sound event detection (SED)
systems. We designed synthetic evaluation sets to focus on specific sound event
detection challenges. We analyze the performance of the submissions to DCASE
2021 task 4 depending on time-related modifications (time position of an event
and length of clips) and we study the impact of non-target sound events and
reverberation. We show that the localization in time of sound events is still a
problem for SED systems. We also show that reverberation and non-target sound
events severely degrade the performance of SED systems. In the latter
case, sound separation seems like a promising solution.
An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection
Polyphonic sound event localization and detection (SELD), which jointly
performs sound event detection (SED) and direction-of-arrival (DoA) estimation,
detects the type and occurrence time of sound events as well as their
corresponding DoA angles simultaneously. We study the SELD task from a
multi-task learning perspective. Two open problems are addressed in this paper.
Firstly, to detect overlapping sound events of the same type but with different
DoAs, we propose to use a trackwise output format and solve the accompanying
track permutation problem with permutation-invariant training. Multi-head
self-attention is further used to separate tracks. Secondly, a previous finding
is that, by using hard parameter-sharing, SELD suffers from a performance loss
compared with learning the subtasks separately. This is solved by a soft
parameter-sharing scheme. We term the proposed method Event-Independent
Network V2 (EINV2), which is an improved version of our previously-proposed
method and an end-to-end network for SELD. We show that our proposed EINV2 for
joint SED and DoA estimation outperforms previous methods by a large margin,
and has comparable performance to state-of-the-art ensemble models.
Comment: 5 pages, 2021 IEEE International Conference on Acoustics, Speech and Signal Processing
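The track permutation problem mentioned above has a compact core: with a track-wise output format, the network's track ordering is arbitrary, so permutation-invariant training scores every assignment of predicted tracks to reference tracks and backpropagates through the cheapest one. A minimal sketch (squared error as a stand-in for the actual SELD loss, names invented for illustration):

```python
import numpy as np
from itertools import permutations

def pit_loss(pred, target):
    """Track-wise permutation-invariant loss.

    pred, target: (n_tracks, n_frames, dims) arrays; each track carries a
    sound event and its DoA trajectory.  Returns the loss under the best
    track assignment, so the model is not penalized for track ordering.
    """
    n = pred.shape[0]
    best = np.inf
    for perm in permutations(range(n)):       # n! assignments; n is small (2-3)
        loss = np.mean((pred[list(perm)] - target) ** 2)
        best = min(best, loss)
    return best
```

Brute-force enumeration is fine here because SELD systems typically use only two or three tracks; with many tracks one would switch to the Hungarian algorithm.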
An analysis of sound event detection under acoustic degradation using multi-resolution systems
The Sound Event Detection task aims to determine the temporal locations of acoustic events in audio clips. In recent years, the relevance of this field has risen due to the introduction of datasets such as Google AudioSet or DESED (Domestic Environment Sound Event Detection) and competitive evaluations like the DCASE Challenge (Detection and Classification of Acoustic Scenes and Events). In this paper, we analyze the performance of Sound Event Detection systems under diverse artificial acoustic conditions such as high- or low-pass filtering and clipping or dynamic range compression, as well as under a scenario of high overlap between events. For this purpose, the audio was obtained from the Evaluation subset of the DESED dataset, whereas the systems were trained in the context of the DCASE Challenge 2020 Task 4. Our systems are based upon the challenge baseline, which consists of a Convolutional-Recurrent Neural Network trained using the Mean Teacher method, and they employ a multi-resolution approach which is able to improve the Sound Event Detection performance through the use of several resolutions during the extraction of Mel-spectrogram features. We provide insights on the benefits of this multi-resolution approach in different acoustic settings, and compare the performance of the single-resolution systems in the aforementioned scenarios when using different resolutions. Furthermore, we complement the analysis of the performance in the high-overlap scenario by assessing the degree of overlap of each event category in sound event detection datasets.
This research and the APC were supported by project DSForSec (grant number RTI2018-098091-B-I00) funded by the Ministry of Science, Innovation and Universities of Spain and the European Regional Development Fund (ERDF).
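The multi-resolution idea rests on a classic time-frequency trade-off: short analysis windows localize events precisely in time, long windows resolve frequency finely. A minimal sketch of extracting spectrograms at several window lengths (window sizes and hop are assumptions; a mel filterbank would normally follow, omitted here for brevity):

```python
import numpy as np

def multi_resolution_spectrograms(audio, win_lengths=(512, 1024, 2048), hop=256):
    """Magnitude spectrograms of one signal at several analysis resolutions.

    Short windows -> fine time resolution; long windows -> fine frequency
    resolution.  Returns one (n_frames, win // 2 + 1) array per window length.
    """
    specs = []
    for win in win_lengths:
        frames = [audio[i:i + win] * np.hanning(win)        # windowed frames
                  for i in range(0, len(audio) - win + 1, hop)]
        spec = np.abs(np.fft.rfft(np.stack(frames), axis=1))  # magnitude FFT
        specs.append(spec)
    return specs
```

A multi-resolution SED system would then feed each resolution to its own model (or branch) and combine the predictions.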