2,729 research outputs found
Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection
Speech Activity Detection (SAD) plays an important role in mobile communications and automatic speech recognition (ASR). Developing efficient SAD systems for real-world applications is a challenging task due to the presence of noise. We propose a new approach to SAD where we treat it as a two-dimensional multilabel image classification problem. To classify the audio segments, we compute their Short-time Fourier Transform spectrograms and classify them with a Convolutional Recurrent Neural Network (CRNN), traditionally used in image recognition. Our CRNN uses a sigmoid activation function, max-pooling in the frequency domain, and a convolutional operation as a moving average filter to remove misclassified spikes. On the development set of Task 1 of the 2019 Fearless Steps Challenge, our system achieved a decision cost function (DCF) of 2.89%, a 66.4% improvement over the baseline. Moreover, it achieved a DCF score of 3.318% on the evaluation dataset of the challenge, ranking first among all submissions
Sound Event Detection Using Spatial Features and Convolutional Recurrent Neural Network
This paper proposes to use low-level spatial features extracted from
multichannel audio for sound event detection. We extend the convolutional
recurrent neural network to handle more than one type of these multichannel
features by learning from each of them separately in the initial stages. We
show that instead of concatenating the features of each channel into a single
feature vector the network learns sound events in multichannel audio better
when they are presented as separate layers of a volume. Using the proposed
spatial features over monaural features on the same network gives an absolute
F-score improvement of 6.1% on the publicly available TUT-SED 2016 dataset and
2.7% on the TUT-SED 2009 dataset that is fifteen times larger.Comment: Accepted for IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP 2017
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.Comment: 15 pages, 2 pdf figure
Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network
Because of their effectiveness in broad practical applications, LSTM networks
have received a wealth of coverage in scientific journals, technical blogs, and
implementation guides. However, in most articles, the inference formulas for
the LSTM network and its parent, RNN, are stated axiomatically, while the
training formulas are omitted altogether. In addition, the technique of
"unrolling" an RNN is routinely presented without justification throughout the
literature. The goal of this paper is to explain the essential RNN and LSTM
fundamentals in a single document. Drawing from concepts in signal processing,
we formally derive the canonical RNN formulation from differential equations.
We then propose and prove a precise statement, which yields the RNN unrolling
technique. We also review the difficulties with training the standard RNN and
address them by transforming the RNN into the "Vanilla LSTM" network through a
series of logical arguments. We provide all equations pertaining to the LSTM
system together with detailed descriptions of its constituent entities. Albeit
unconventional, our choice of notation and the method for presenting the LSTM
system emphasizes ease of understanding. As part of the analysis, we identify
new opportunities to enrich the LSTM system and incorporate these extensions
into the Vanilla LSTM network, producing the most general LSTM variant to date.
The target reader has already been exposed to RNNs and LSTM networks through
numerous available resources and is open to an alternative pedagogical
approach. A Machine Learning practitioner seeking guidance for implementing our
new augmented LSTM model in software for experimentation and research will find
the insights and derivations in this tutorial valuable as well.Comment: 43 pages, 10 figures, 78 reference
- …