Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
Sound events often occur in unstructured environments where they exhibit wide
variations in their frequency content and temporal structure. Convolutional
neural networks (CNNs) are able to extract higher-level features that are
invariant to local spectral and temporal variations. Recurrent neural networks
(RNNs) are powerful in learning the longer-term temporal context in audio
signals. CNNs and RNNs as classifiers have recently shown improved performance
over established methods in various sound recognition tasks. We combine these
two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it
to a polyphonic sound event detection task. We compare the performance of the
proposed CRNN method with CNN, RNN, and other established methods, and observe
a considerable improvement for four different datasets consisting of everyday
sound events.
Comment: Accepted for IEEE Transactions on Audio, Speech and Language Processing, Special Issue on Sound Scene and Event Analysis.
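The CRNN idea described in the abstract above — convolutional layers for local spectro-temporal features followed by a recurrent layer for longer-term context and per-frame event probabilities — can be sketched roughly as follows. This is a minimal numpy illustration, not the paper's architecture: the shapes, the vanilla tanh-RNN cell, and the random weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_time(X, W):
    """Valid 1-D convolution along the time axis with ReLU.
    X: (T, F) spectrogram frames; W: (k, F, H) filters -> (T-k+1, H)."""
    k = W.shape[0]
    T = X.shape[0] - k + 1
    out = np.empty((T, W.shape[2]))
    for t in range(T):
        out[t] = np.einsum('kf,kfh->h', X[t:t + k], W)
    return np.maximum(out, 0.0)

def simple_rnn(H, Wx, Wh):
    """Vanilla RNN over time: h_t = tanh(Wx x_t + Wh h_{t-1})."""
    h = np.zeros(Wh.shape[0])
    outs = []
    for x in H:
        h = np.tanh(Wx @ x + Wh @ h)
        outs.append(h)
    return np.stack(outs)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 100 frames, 40 mel bands, 8 conv filters,
# 16 recurrent units, 5 event classes (all assumed, not from the paper).
X = rng.standard_normal((100, 40))          # input spectrogram
Wc = rng.standard_normal((3, 40, 8)) * 0.1  # conv filters (width 3)
Wx = rng.standard_normal((16, 8)) * 0.1
Wh = rng.standard_normal((16, 16)) * 0.1
Wo = rng.standard_normal((5, 16)) * 0.1

feats = conv1d_time(X, Wc)           # local spectro-temporal features
states = simple_rnn(feats, Wx, Wh)   # longer-term temporal context
probs = sigmoid(states @ Wo.T)       # per-frame, per-event activity (polyphonic)
```

Sigmoid (rather than softmax) outputs matter here: in polyphonic detection several events can be active in the same frame, so each class gets an independent probability.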
Adaptive Noise Reduction for Sound Event Detection Using Subband-Weighted NMF
Sound event detection in real-world environments suffers from the interference of non-stationary and time-varying noise. This paper presents an adaptive noise reduction method for sound event detection based on non-negative matrix factorization (NMF). First, a noise dictionary is learned from the input noisy signal via robust NMF, which allows adaptation to noise variations. The estimated noise dictionary is then used in a supervised source separation framework in combination with a pre-trained event dictionary. Second, to improve separation quality, we extend the basic NMF model to a weighted form, with the aim of varying the relative importance of the different components when separating a target sound event from noise. With properly designed weights, the separation process is forced to rely more on the dominant event components, while the noise is greatly suppressed. The proposed method is evaluated on the dataset of the rare sound event detection task of the DCASE 2017 challenge, and achieves results comparable to the top-ranking system based on convolutional recurrent neural networks (CRNNs). The proposed weighted NMF method shows an excellent noise reduction ability and improves the F-score by 5% compared to the unweighted approach.
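The weighted NMF extension in the abstract above can be illustrated with multiplicative updates in which a per-entry weight matrix scales each entry's contribution to the reconstruction error. This is a generic weighted Euclidean NMF sketch under assumed toy dimensions, not the paper's exact formulation or weight design.

```python
import numpy as np

rng = np.random.default_rng(1)

def weighted_nmf(V, S, rank, iters=200, eps=1e-9):
    """Approximate V >= 0 as W @ H by minimising || S * (V - W @ H) ||_F^2
    with multiplicative updates; S holds per-entry importance weights."""
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(iters):
        WH = W @ H
        H *= (W.T @ (S * V)) / (W.T @ (S * WH) + eps)   # update activations
        WH = W @ H
        W *= ((S * V) @ H.T) / ((S * WH) @ H.T + eps)   # update dictionary
    return W, H

# Toy magnitude spectrogram; uniform weights S = 1 recover plain NMF,
# while larger entries in S would emphasise dominant event components.
V = np.abs(rng.standard_normal((20, 50)))
S = np.ones_like(V)
W, H = weighted_nmf(V, S, rank=4)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The multiplicative form keeps W and H nonnegative throughout, since each update multiplies by a ratio of nonnegative terms.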
Advanced informatics for event detection and temporal localization
PhD Thesis
The primary objective of a Sound Event Detection (SED) system is to detect the presence of an acoustic event (i.e., audio tagging) and to return the onset and offset of the identified acoustic event within an audio clip (i.e., temporal localization). Such a system can be promising in wildlife and biodiversity monitoring, surveillance, and smart-home applications.
However, developing a system to be adept at both subtasks is not a trivial task. It can
be hindered by the need for a large amount of strongly labeled data, where the event tags
and the corresponding onsets and offsets are known with certainty. This is a limiting factor
as strongly labeled data is challenging to collect and is prone to annotation errors due to
the ambiguity in the perception of onsets and offsets.
In this thesis, we propose to address the lack of strongly labeled data by using pseudo strongly labeled data, where the event tags are known with certainty while the corresponding onsets and offsets are estimated. While Nonnegative Matrix Factorization can be used directly for SED, its accuracy is limited; we show instead that it is a useful tool for pseudo labeling. We further show that pseudo strongly labeled data estimated using our proposed methods can improve the accuracy of a SED system developed using deep learning approaches.
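The pseudo strong labeling described above turns clip-level tags into estimated onsets and offsets. A common final step in such pipelines — thresholding per-frame activity (e.g., from NMF activations) into event segments — can be sketched as follows; the threshold, hop size, and input values are illustrative assumptions, not the thesis's method.

```python
def frames_to_events(act, thr=0.5, hop=0.02):
    """Convert per-frame activity scores into (onset, offset) pairs in seconds
    by thresholding: an event spans each maximal run of frames with act >= thr."""
    events, onset = [], None
    for i, a in enumerate(act):
        if a >= thr and onset is None:
            onset = i                               # event starts
        elif a < thr and onset is not None:
            events.append((onset * hop, i * hop))   # event ends
            onset = None
    if onset is not None:                           # event still open at clip end
        events.append((onset * hop, len(act) * hop))
    return events

events = frames_to_events([0.1, 0.7, 0.9, 0.2, 0.1, 0.8, 0.8], thr=0.5, hop=0.02)
# two events, roughly (0.02, 0.06) and (0.10, 0.14) seconds
```

In practice such raw segments are usually post-processed (median filtering, minimum-duration constraints) before being used as pseudo strong labels.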
Subsequent work focused on improving the SED system as a whole rather than a single subtask. This led to the proposal of a novel student-teacher training framework that incorporates a noise-robust loss function, a new cyclic training scheme, an improved depthwise separable convolution, a triple instance-level temporal pooling approach, and an improved Transformer encoding layer. Together with synthetic strongly labeled data and a large corpus of unlabeled data, we show that a SED system developed using our proposed method achieves state-of-the-art performance.
A Closer Look at Weak Label Learning for Audio Events
Audio content analysis in terms of sound events is an important research problem for a variety of applications. Recently, the development of weak-labeling approaches for audio or sound event detection (AED) and the availability of large-scale weakly labeled datasets have finally opened up the possibility of large-scale AED. However, a deeper understanding of how weak labels affect learning for sound events is still missing from the literature. In this work, we first describe a CNN-based approach for weakly supervised training of audio events. The approach follows some basic design principles desirable in a learning method relying on weakly labeled audio. We then describe important characteristics which naturally arise in weakly supervised learning of sound events, and show how these aspects of weak labels affect the generalization of models. More specifically, we study how characteristics such as label density and corruption of labels affect weakly supervised training for audio events. We also study the feasibility of directly obtaining weakly labeled data from the web without any manual labeling, and compare it with a manually labeled dataset. The analysis and understanding of these factors should be taken into account in the development of future weak-label learning methods. AudioSet, a large-scale weakly labeled dataset for sound events, is used in our experiments.
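Weakly supervised training of the kind described above typically treats each clip as a bag of frames and pools frame-level predictions into one clip-level prediction, since only clip-level tags are available. A minimal sketch of that pooling step, under assumed shapes and values (the paper's exact pooling choice is not specified here):

```python
import numpy as np

def clip_probs(frame_probs, pool="max"):
    """Aggregate frame-level event probabilities of shape (T, C) into one
    clip-level prediction of shape (C,), multiple-instance-learning style."""
    if pool == "max":
        return frame_probs.max(axis=0)   # clip is positive if any frame is
    return frame_probs.mean(axis=0)      # mean pooling alternative

# Toy example: 3 frames, 2 event classes.
frames = np.array([[0.1, 0.2],
                   [0.9, 0.3],
                   [0.2, 0.1]])
print(clip_probs(frames))          # max pooling -> [0.9 0.3]
print(clip_probs(frames, "mean"))
```

The pooling choice interacts with the label-density issue the abstract studies: max pooling matches sparse events but tends to localize only the most salient frame, while mean pooling spreads gradient over all frames.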