Frame Pairwise Distance Loss for Weakly-supervised Sound Event Detection
Weakly-supervised learning has emerged as a promising approach to leverage
limited labeled data in various domains by bridging the gap between fully
supervised methods and unsupervised techniques. Acquisition of strong
annotations for detecting sound events is prohibitively expensive, making
weakly supervised learning a more cost-effective and broadly applicable
alternative. To improve the recognition rate of weakly-supervised sound event
detection, we introduce a Frame Pairwise Distance (FPD) loss branch,
complemented with a minimal amount of synthesized
data. The corresponding sampling and label processing strategies are also
proposed. Two distinct distance metrics are employed to evaluate the proposed
approach. Finally, the method is validated on the DCASE 2023 task4 dataset. The
experimental results corroborate the efficacy of this approach.
Comment: Submitted to ICASSP 202
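As a rough illustration of what a frame pairwise distance objective can look like (the sampling strategy, margin value, and function names below are assumptions for the sketch, not the authors' implementation), a contrastive-style loss over sampled frame pairs with the two distance metrics mentioned in the abstract might be:

```python
import numpy as np

def frame_pairwise_distance_loss(emb, labels, margin=1.0, metric="euclidean", rng=None):
    """Contrastive-style loss over randomly sampled frame pairs.

    emb:    (N, D) per-frame embeddings
    labels: (N,)   integer frame labels (e.g. from synthesized strongly-labelled data)
    """
    rng = rng or np.random.default_rng(0)
    n = emb.shape[0]
    ia = rng.integers(0, n, size=n)          # sample random frame pairs
    ib = rng.integers(0, n, size=n)
    a, b = emb[ia], emb[ib]
    same = (labels[ia] == labels[ib]).astype(float)
    if metric == "euclidean":
        dist = np.linalg.norm(a - b, axis=1)
    else:  # cosine distance
        cos = np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
        dist = 1.0 - cos
    # pull same-class frames together, push different-class frames apart
    loss = same * dist ** 2 + (1 - same) * np.maximum(margin - dist, 0.0) ** 2
    return loss.mean()
```

In a real system this loss branch would be added to the usual clip-level weak-label loss with a weighting coefficient.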
Cooperative Scene-Event Modelling for Acoustic Scene Classification
Acoustic scene classification (ASC) can be helpful for creating context awareness for intelligent robots. Humans naturally use the relations between acoustic scenes (AS) and audio events (AE) to understand and recognize their surrounding environments. However, in most previous works, ASC and audio event classification (AEC) are treated as independent tasks, with a focus primarily on audio features shared between scenes and events, but not their implicit relations. To address this limitation, we propose a cooperative scene-event modelling (cSEM) framework to automatically model the intricate scene-event relation by an adaptive coupling matrix to improve ASC. Compared with other scene-event modelling frameworks, the proposed cSEM offers the following advantages. First, it reduces the confusion between similar scenes by aligning the information of coarse-grained AS and fine-grained AE in the latent space, and reducing the redundant information between the AS and AE embeddings. Second, it exploits the relation information between AS and AE to improve ASC, which is shown to be beneficial, even if the information of AE is derived from unverified pseudo-labels. Third, it uses a regression-based loss function for cooperative modelling of scene-event relations, which is shown to be more effective than classification-based loss functions. Instantiated from four models based on either Transformer or convolutional neural networks, cSEM is evaluated on real-life and synthetic datasets. Experiments show that cSEM-based models work well in real-life scene-event analysis, offering competitive results on ASC as compared with other multi-feature or multi-model ensemble methods. The ASC accuracy achieved on the TUT2018, TAU2019, and JSSED datasets is 81.0%, 88.9% and 97.2%, respectively.
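The core idea of coupling scene and event predictions through a matrix can be sketched in a few lines. This is an illustrative forward pass only, assuming a learnable coupling matrix of shape (scenes × events); it is not the paper's exact formulation, and the function names are invented for the sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coupled_scene_prediction(scene_logits, event_probs, coupling):
    """Refine scene logits with event evidence via a scene-event coupling matrix.

    scene_logits: (B, S) raw scene scores
    event_probs:  (B, E) per-clip event probabilities (may come from pseudo-labels)
    coupling:     (S, E) learnable matrix encoding how strongly each event
                  supports each scene
    """
    relation = event_probs @ coupling.T      # (B, S) event-derived scene evidence
    return softmax(scene_logits + relation)  # scene posterior using both cues
```

During training, `coupling` would be updated jointly with both classifiers, which is what lets the relation adapt rather than being fixed by hand.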
Sound Event Detection and Separation: a Benchmark on DESED Synthetic Soundscapes
We propose a benchmark of state-of-the-art sound event detection (SED)
systems. We designed synthetic evaluation sets to focus on specific sound event
detection challenges. We analyze the performance of the submissions to DCASE
2021 task 4 depending on time-related modifications (time position of an event
and length of clips) and we study the impact of non-target sound events and
reverberation. We show that the localization in time of sound events is still a
problem for SED systems. We also show that reverberation and non-target sound
events severely degrade the performance of SED systems. In the latter
case, sound separation seems like a promising solution.
An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection
Polyphonic sound event localization and detection (SELD), which jointly
performs sound event detection (SED) and direction-of-arrival (DoA) estimation,
detects the type and occurrence time of sound events as well as their
corresponding DoA angles simultaneously. We study the SELD task from a
multi-task learning perspective. Two open problems are addressed in this paper.
Firstly, to detect overlapping sound events of the same type but with different
DoAs, we propose to use a trackwise output format and solve the accompanying
track permutation problem with permutation-invariant training. Multi-head
self-attention is further used to separate tracks. Secondly, a previous finding
is that, by using hard parameter-sharing, SELD suffers from a performance loss
compared with learning the subtasks separately. This is solved by a soft
parameter-sharing scheme. We term the proposed method Event-Independent
Network V2 (EINV2), which is an improved version of our previously-proposed
method and an end-to-end network for SELD. We show that our proposed EINV2 for
joint SED and DoA estimation outperforms previous methods by a large margin,
and has comparable performance to state-of-the-art ensemble models.
Comment: 5 pages, 2021 IEEE International Conference on Acoustics, Speech and Signal Processing
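The track permutation problem mentioned above has a compact core: with a track-wise output format, the network's track ordering is arbitrary, so permutation-invariant training scores every assignment of predicted tracks to reference tracks and backpropagates through the cheapest one. A minimal sketch (squared error as a stand-in for the actual SELD loss, names invented for illustration):

```python
import numpy as np
from itertools import permutations

def pit_loss(pred, target):
    """Track-wise permutation-invariant loss.

    pred, target: (n_tracks, n_frames, dims) arrays; each track carries a
    sound event and its DoA trajectory.  Returns the loss under the best
    track assignment, so the model is not penalized for track ordering.
    """
    n = pred.shape[0]
    best = np.inf
    for perm in permutations(range(n)):       # n! assignments; n is small (2-3)
        loss = np.mean((pred[list(perm)] - target) ** 2)
        best = min(best, loss)
    return best
```

Brute-force enumeration is fine here because SELD systems typically use only two or three tracks; with many tracks one would switch to the Hungarian algorithm.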
An analysis of sound event detection under acoustic degradation using multi-resolution systems
The Sound Event Detection task aims to determine the temporal locations of acoustic events in audio clips. In recent years, the relevance of this field has risen due to the introduction of datasets such as Google AudioSet or DESED (Domestic Environment Sound Event Detection) and competitive evaluations like the DCASE Challenge (Detection and Classification of Acoustic Scenes and Events). In this paper, we analyze the performance of Sound Event Detection systems under diverse artificial acoustic conditions such as high- or low-pass filtering and clipping or dynamic range compression, as well as under a scenario of high overlap between events. For this purpose, the audio was obtained from the Evaluation subset of the DESED dataset, whereas the systems were trained in the context of the DCASE Challenge 2020 Task 4. Our systems are based upon the challenge baseline, which consists of a Convolutional-Recurrent Neural Network trained using the Mean Teacher method, and they employ a multi-resolution approach which is able to improve the Sound Event Detection performance through the use of several resolutions during the extraction of Mel-spectrogram features. We provide insights on the benefits of this multi-resolution approach in different acoustic settings, and compare the performance of the single-resolution systems in the aforementioned scenarios when using different resolutions. Furthermore, we complement the analysis of the performance in the high-overlap scenario by assessing the degree of overlap of each event category in sound event detection datasets.
This research and the APC were supported by project DSForSec (grant number RTI2018-098091-B-I00) funded by the Ministry of Science, Innovation and Universities of Spain and the European Regional Development Fund (ERDF).
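The multi-resolution idea rests on a classic time-frequency trade-off: short analysis windows localize events precisely in time, long windows resolve frequency finely. A minimal sketch of extracting spectrograms at several window lengths (window sizes and hop are assumptions; a mel filterbank would normally follow, omitted here for brevity):

```python
import numpy as np

def multi_resolution_spectrograms(audio, win_lengths=(512, 1024, 2048), hop=256):
    """Magnitude spectrograms of one signal at several analysis resolutions.

    Short windows -> fine time resolution; long windows -> fine frequency
    resolution.  Returns one (n_frames, win // 2 + 1) array per window length.
    """
    specs = []
    for win in win_lengths:
        frames = [audio[i:i + win] * np.hanning(win)        # windowed frames
                  for i in range(0, len(audio) - win + 1, hop)]
        spec = np.abs(np.fft.rfft(np.stack(frames), axis=1))  # magnitude FFT
        specs.append(spec)
    return specs
```

A multi-resolution SED system would then feed each resolution to its own model (or branch) and combine the predictions.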