Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
Sound events often occur in unstructured environments where they exhibit wide
variations in their frequency content and temporal structure. Convolutional
neural networks (CNN) are able to extract higher level features that are
invariant to local spectral and temporal variations. Recurrent neural networks
(RNNs) are powerful in learning the longer-term temporal context in audio
signals. Used as classifiers, CNNs and RNNs have recently shown improved performance
over established methods in various sound recognition tasks. We combine these
two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it
on a polyphonic sound event detection task. We compare the performance of the
proposed CRNN method with CNN, RNN, and other established methods, and observe
a considerable improvement for four different datasets consisting of everyday
sound events.
Comment: Accepted for IEEE Transactions on Audio, Speech and Language Processing, Special Issue on Sound Scene and Event Analysis.
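For illustration, a minimal PyTorch sketch of the CRNN idea described in this abstract, assuming log-mel spectrogram input of shape (batch, 1, time, mel bins) and frame-level multi-label output; the layer sizes, number of classes, and pooling choices are assumptions for demonstration, not the paper's exact configuration.

```python
# Minimal CRNN sketch for polyphonic sound event detection (illustrative sizes).
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=40, n_classes=6, rnn_hidden=32):
        super().__init__()
        # CNN front-end: local time-frequency features; pooling only along the
        # frequency axis so the frame rate of the predictions is preserved.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),
        )
        # RNN back-end: models the longer-term temporal context across frames.
        self.rnn = nn.GRU(32 * (n_mels // 4), rnn_hidden,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * rnn_hidden, n_classes)

    def forward(self, x):                         # x: (batch, 1, time, mels)
        h = self.cnn(x)                           # (batch, 32, time, mels // 4)
        h = h.permute(0, 2, 1, 3).flatten(2)      # (batch, time, 32 * mels // 4)
        h, _ = self.rnn(h)                        # (batch, time, 2 * rnn_hidden)
        return torch.sigmoid(self.classifier(h))  # per-frame event activities

# Toy usage: 2 clips, 100 frames, 40 mel bins, 6 event classes
model = CRNN()
probs = model(torch.randn(2, 1, 100, 40))
print(probs.shape)  # torch.Size([2, 100, 6])
```

Because the sigmoid outputs are per frame and per class, several events can be active at the same time, which is what makes the detection polyphonic.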
Automatic Environmental Sound Recognition: Performance versus Computational Cost
In the context of the Internet of Things (IoT), sound sensing applications
are required to run on embedded platforms where notions of product pricing and
form factor impose hard constraints on the available computing power. Whereas
Automatic Environmental Sound Recognition (AESR) algorithms are most often
developed with limited consideration for computational cost, this article seeks
to identify which AESR algorithm can make the most of a limited amount of computing
power by comparing sound classification performance as a function of
computational cost. Results suggest that Deep Neural Networks yield the best
ratio of sound classification accuracy across a range of computational costs,
while Gaussian Mixture Models offer a reasonable accuracy at a consistently
small cost, and Support Vector Machines stand between both in terms of
compromise between accuracy and computational cost.
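As a rough illustration of the accuracy-versus-cost comparison described above, the sketch below times three stand-in classifiers (a per-class GMM, an SVM, and a small neural network) on synthetic features; the data, model sizes, and timing loop are assumptions for demonstration, not the article's evaluation protocol.

```python
# Illustrative accuracy vs. prediction-cost comparison with sklearn stand-ins.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

class GMMClassifier:
    """One Gaussian mixture per class; predict by maximum log-likelihood."""
    def __init__(self, n_components=4):
        self.n_components = n_components

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = [GaussianMixture(self.n_components, random_state=0).fit(X[y == c])
                        for c in self.classes_]
        return self

    def predict(self, X):
        scores = np.stack([m.score_samples(X) for m in self.models_], axis=1)
        return self.classes_[scores.argmax(axis=1)]

X, y = make_classification(n_samples=2000, n_features=40, n_informative=20,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, clf in [("GMM", GMMClassifier()),
                  ("SVM", SVC()),
                  ("DNN", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                                        random_state=0))]:
    clf.fit(X_tr, y_tr)
    start = time.perf_counter()
    acc = accuracy_score(y_te, clf.predict(X_te))
    elapsed = time.perf_counter() - start
    print(f"{name}: accuracy = {acc:.3f}, prediction time = {elapsed * 1e3:.1f} ms")
```

On an embedded target, the prediction-time column would be replaced by the platform's actual compute budget, but the trade-off being measured is the same.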
Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging
Environmental audio tagging is a newly proposed task to predict the presence
or absence of a specific audio event in a chunk. Deep neural network (DNN)
based methods have been successfully adopted for predicting the audio tags in
the domestic audio scene. In this paper, we propose to use a convolutional
neural network (CNN) to extract robust features from mel-filter banks (MFBs),
spectrograms or even raw waveforms for audio tagging. Gated recurrent unit
(GRU) based recurrent neural networks (RNNs) are then cascaded to model the
long-term temporal structure of the audio signal. To complement the input
information, an auxiliary CNN is designed to learn on the spatial features of
stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging)
of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE
2016) challenge. Compared with our recent DNN-based method, the proposed
structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the
development set. The spatial features can further reduce the EER to 0.10. The
performance of the end-to-end learning on raw waveforms is also comparable.
Finally, on the evaluation set, we get the state-of-the-art performance with
0.12 EER while the performance of the best existing system is 0.15 EER.
Comment: Accepted to IJCNN2017, Anchorage, Alaska, US
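The equal error rates quoted above can be computed, for a single tag, from per-chunk scores via the ROC curve, as in the sketch below; the helper function and toy data are illustrative and are not the official DCASE 2016 scoring code.

```python
# Illustrative equal error rate (EER) computation for one audio tag.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, y_score):
    """EER: the operating point where false positive rate equals false negative rate."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))   # threshold where the two error rates cross
    return 0.5 * (fpr[idx] + fnr[idx])

# Toy example: presence labels for one tag and the network's predicted scores
labels = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9])
print(f"EER = {equal_error_rate(labels, scores):.2f}")
```

A lower EER is better: it is the error rate at the operating point where false positives and missed detections are equally likely.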
Learning Audio Sequence Representations for Acoustic Event Classification
Acoustic Event Classification (AEC) has become a significant task for
machines to perceive the surrounding auditory scene. However, extracting
effective representations that capture the underlying characteristics of the
acoustic events is still challenging. Previous methods mainly focused on
designing the audio features in a 'hand-crafted' manner. Interestingly, features
learnt from data have recently been reported to perform better. Up to now, however,
such learnt features have only been considered at the frame level. In this paper, we
propose an unsupervised learning framework to learn a vector representation of
an audio sequence for AEC. This framework consists of a Recurrent Neural
Network (RNN) encoder and an RNN decoder, which respectively transform the
variable-length audio sequence into a fixed-length vector and reconstruct the
input sequence from that vector. After training the encoder-decoder, we
feed the audio sequences to the encoder and then take the learnt vectors as the
audio sequence representations. Compared with previous methods, the proposed
method can not only handle audio streams of arbitrary length but also capture
the salient information of the sequence. Extensive evaluation on a large
acoustic event database is performed, and the empirical results demonstrate
that the learnt audio sequence representation outperforms other
state-of-the-art hand-crafted sequence features for AEC by a large margin.
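A minimal sketch of the RNN encoder-decoder idea described above, assuming sequences of frame-level feature vectors and taking the encoder's final hidden state as the fixed-length representation; the teacher-forced decoder and all layer sizes are simplifying assumptions rather than the authors' exact architecture.

```python
# Minimal RNN sequence-autoencoder sketch for learning audio sequence representations.
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, n_features=40, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.decoder = nn.GRU(n_features, hidden, batch_first=True)
        self.output = nn.Linear(hidden, n_features)

    def encode(self, x):                  # x: (batch, time, features)
        _, h = self.encoder(x)            # h: (1, batch, hidden)
        return h

    def forward(self, x):
        h = self.encode(x)
        # Reconstruct the input sequence conditioned on the learnt vector; for
        # simplicity the decoder re-reads the inputs (teacher forcing).
        out, _ = self.decoder(x, h)
        return self.output(out)

model = SeqAutoencoder()
x = torch.randn(2, 120, 40)                 # 2 sequences of 120 feature frames
recon = model(x)
loss = nn.functional.mse_loss(recon, x)     # reconstruction objective for training
embedding = model.encode(x).squeeze(0)      # (2, 64) fixed-length representations
print(embedding.shape, loss.item())
```

After training, only the encoder is kept: each variable-length recording is mapped to one fixed-length vector, which can then be fed to a standard classifier for AEC.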