Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset
Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken
communication. Machine learning models such as neural networks have already been proposed for audio signal
modeling, where recurrent structures can take advantage of temporal dependencies. This work aims to study the
implementation of several neural network-based systems for speech and music event detection over a collection of
77,937 10-second audio segments (216 h), selected from the Google AudioSet dataset. These segments belong to
YouTube videos and have been represented as mel-spectrograms. We propose and compare two approaches. The
first one is the training of two different neural networks, one for speech detection and another for music detection.
The second approach consists of training a single neural network to tackle both tasks at the same time. The studied
architectures include fully connected, convolutional and LSTM (long short-term memory) recurrent networks.
Comparative results are provided in terms of classification performance and model complexity. We would like to
highlight the performance of convolutional architectures, especially in combination with an LSTM stage. The hybrid
convolutional-LSTM models achieve the best overall results (85% accuracy) in the three proposed tasks. Furthermore,
a distractor analysis of the results has been carried out in order to identify which events in the ontology are the most
harmful for the performance of the models, showing some difficult scenarios for the detection of music and speech.
This work has been supported by project “DSSL: Redes Profundas y Modelos de Subespacios para Deteccion y Seguimiento de Locutor, Idioma y Enfermedades Degenerativas a partir de la Voz” (TEC2015-68172-C2-1-P), funded by the Ministry of Economy and Competitiveness of Spain and FEDER.
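As a concrete illustration of the second, joint approach, the following is a minimal PyTorch sketch of a hybrid convolutional-LSTM network that emits two sigmoid scores (speech and music) from a mel-spectrogram input. The layer sizes, the 64 mel bands, and the clip-level pooling are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class ConvLSTMDetector(nn.Module):
    """Hybrid convolutional-LSTM sketch for joint speech/music detection.

    Input: mel-spectrograms shaped (batch, 1, n_mels, n_frames).
    All sizes are illustrative, not the paper's configuration.
    """

    def __init__(self, n_mels=64, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),          # pool frequency, keep time resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        feat = 64 * (n_mels // 4)          # channels * reduced mel bins
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # two sigmoids: speech, music

    def forward(self, x):
        z = self.conv(x)                                   # (B, C, M', T)
        z = z.permute(0, 3, 1, 2).flatten(2)               # (B, T, C*M')
        out, _ = self.lstm(z)
        return torch.sigmoid(self.head(out[:, -1]))        # clip-level scores

model = ConvLSTMDetector()
scores = model(torch.randn(8, 1, 64, 500))  # e.g. 10-s clips, 500 frames each
```

Training the two heads with a shared trunk is what makes this a single-network solution; the alternative first approach would simply instantiate two such models, each with a one-unit head.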
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 PDF figures
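For readers unfamiliar with the log-mel representation the review singles out, here is a minimal sketch using librosa. The file name, 16 kHz sample rate, FFT size, and 64 mel bands are illustrative choices, not values prescribed by the article.

```python
import numpy as np
import librosa

# Load audio; resampling to 16 kHz is our choice, not the review's.
y, sr = librosa.load("clip.wav", sr=16000)

# Mel power spectrogram: STFT -> mel filterbank -> power.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=64
)

# Log compression yields the "log-mel spectra" the review refers to.
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, n_frames)
```

The resulting 2-D array is the typical input to the convolutional and recurrent models surveyed; raw-waveform models skip this step and learn the front end directly.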
Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
Sound events often occur in unstructured environments where they exhibit wide
variations in their frequency content and temporal structure. Convolutional
neural networks (CNNs) are able to extract higher-level features that are
invariant to local spectral and temporal variations. Recurrent neural networks
(RNNs) are powerful in learning the longer-term temporal context in audio
signals. CNNs and RNNs as classifiers have recently shown improved performance
over established methods in various sound recognition tasks. We combine these
two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it
to a polyphonic sound event detection task. We compare the performance of the
proposed CRNN method with CNN, RNN, and other established methods, and observe
a considerable improvement for four different datasets consisting of everyday
sound events.
Comment: Accepted for IEEE Transactions on Audio, Speech and Language Processing, Special Issue on Sound Scene and Event Analysis
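A minimal sketch of the CRNN idea, assuming a PyTorch implementation: convolutional blocks pool only the frequency axis so time resolution survives, a recurrent layer models temporal context, and per-frame sigmoids allow several events to be active at once, which is what "polyphonic" requires. Layer sizes and the class count are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CRNN sketch for polyphonic sound event detection.

    Emits per-frame sigmoid activity for each event class, so
    overlapping events can be detected simultaneously.
    """

    def __init__(self, n_mels=40, n_classes=6, hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((5, 1)),            # shrink frequency axis only
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),
        )
        self.rnn = nn.GRU(32 * (n_mels // 20), hidden,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                     # x: (B, 1, n_mels, T)
        z = self.cnn(x)                       # (B, C, M', T)
        z = z.permute(0, 3, 1, 2).flatten(2)  # (B, T, C*M')
        z, _ = self.rnn(z)
        return torch.sigmoid(self.out(z))     # (B, T, n_classes), framewise

net = CRNN()
activity = net(torch.randn(4, 1, 40, 256))   # 4 clips, 256 frames each
```

Thresholding each class activation per frame then yields the event onsets and offsets that the detection metrics are computed from.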