Improving large vocabulary continuous speech recognition by combining GMM-based and reservoir-based acoustic modeling
In earlier work we have shown that good phoneme recognition is possible with a so-called reservoir, a special type of recurrent neural network. In this paper, different architectures based on Reservoir Computing (RC) for large vocabulary continuous speech recognition are investigated. Besides experiments with HMM hybrids, it is shown that an RC-HMM tandem can achieve the same recognition accuracy as a classical HMM, which is a promising result for such a fairly new paradigm. It is also demonstrated that a state-level combination of the scores of the tandem and the baseline HMM leads to a significant improvement over the baseline: a relative word error rate reduction on the order of 20% is possible.
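To make the reservoir idea concrete, here is a minimal sketch of an echo state network used as a frame-level acoustic classifier. It is not the authors' system: the layer sizes, leak rate, spectral radius, and ridge penalty are all illustrative assumptions. The defining trait of RC is that only the linear readout is trained; the recurrent weights stay fixed and random.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_res, n_out = 39, 500, 40   # e.g. MFCC frames -> phoneme classes (assumed sizes)

# Fixed random input and recurrent weights; only the readout is trained.
W_in = rng.uniform(-0.1, 0.1, (n_res, n_in))
W = rng.standard_normal((n_res, n_res))
W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()   # rescale spectral radius below 1

def run_reservoir(frames, leak=0.3):
    """Collect leaky-integrator reservoir states for a (T, n_in) utterance."""
    x = np.zeros(n_res)
    states = np.empty((len(frames), n_res))
    for t, u in enumerate(frames):
        x = (1 - leak) * x + leak * np.tanh(W_in @ u + W @ x)
        states[t] = x
    return states

def train_readout(states, targets, ridge=1e-6):
    """Ridge-regression readout: states (N, n_res), one-hot targets (N, n_out)."""
    A = states.T @ states + ridge * np.eye(n_res)
    return np.linalg.solve(A, states.T @ targets)   # W_out, shape (n_res, n_out)

# Usage: scores = run_reservoir(utterance) @ W_out; frame labels via argmax.
```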
Convolutional Gated Recurrent Neural Network Incorporating Spatial Features for Audio Tagging
Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in an audio chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms, or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are then cascaded to model the long-term temporal structure of the audio signal. To complement the input information, an auxiliary CNN is designed to learn from the spatial features of stereo recordings. We evaluate our proposed methods on Task 4 (audio tagging) of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. Compared with our recent DNN-based method, the proposed structure can reduce the equal error rate (EER) from 0.13 to 0.11 on the development set, and the spatial features can further reduce the EER to 0.10. The performance of end-to-end learning on raw waveforms is also comparable. Finally, on the evaluation set, we achieve state-of-the-art performance with 0.12 EER, while the best existing system has an EER of 0.15.
Comment: Accepted to IJCNN2017, Anchorage, Alaska, US
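The architecture described above follows a common CNN-then-RNN pattern. The PyTorch sketch below shows one plausible instantiation; the channel counts, kernel and pooling sizes, and the seven-tag output are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CRNNTagger(nn.Module):
    """CNN front end over (time, mel) maps, GRU over time, sigmoid tag scores."""
    def __init__(self, n_mels=64, n_tags=7):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),            # pool only the mel axis, keep time
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        self.gru = nn.GRU(64 * (n_mels // 16), 128, batch_first=True,
                          bidirectional=True)
        self.out = nn.Linear(2 * 128, n_tags)

    def forward(self, x):                     # x: (batch, 1, time, mels)
        h = self.cnn(x)                       # (batch, 64, time, mels // 16)
        h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, time, features)
        h, _ = self.gru(h)                    # long-term temporal modeling
        return torch.sigmoid(self.out(h)).mean(dim=1)  # pool tag scores over time

tags = CRNNTagger()(torch.randn(2, 1, 240, 64))  # -> (2, 7) tag probabilities
```

Pooling only along the mel axis preserves the time resolution that the GRU needs; a second, parallel CNN over inter-channel features of the stereo signal could supply the spatial input stream the abstract mentions.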
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article provides a review of state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side by side in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and the potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, and more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, and generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
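Since log-mel spectra are singled out as a dominant feature representation, the short sketch below computes them with librosa; the sample rate, FFT size, hop length, and band count are typical choices rather than values prescribed by the article, and the file name is a placeholder.

```python
import librosa

# Load audio and compute a log-mel spectrogram (typical parameter choices).
y, sr = librosa.load("example.wav", sr=16000)          # placeholder input file
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=512, hop_length=160, n_mels=64)  # (n_mels, frames), power scale
log_mel = librosa.power_to_db(mel)                     # dB scale, i.e. log-compressed

# Networks usually consume (frames, n_mels) with per-band normalization.
features = (log_mel.T - log_mel.mean(axis=1)) / (log_mel.std(axis=1) + 1e-8)
```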
Very Deep Convolutional Neural Networks for Robust Speech Recognition
This paper describes the extension and optimization of our previous work on very deep convolutional neural networks (CNNs) for effective recognition of noisy speech in the Aurora 4 task. The appropriate number of convolutional layers, the sizes of the filters, the pooling operations, and the input feature maps are all modified: the filter and pooling sizes are reduced, and the dimensions of the input feature maps are extended to allow adding more convolutional layers. Furthermore, appropriate input padding and input feature map selection strategies are developed. In addition, an adaptation framework using joint training of the very deep CNN with auxiliary i-vector and fMLLR features is developed. These modifications give substantial word error rate reductions over the standard CNN used as the baseline. Finally, the very deep CNN is combined with an LSTM-RNN acoustic model, and it is shown that state-level weighted log-likelihood score combination in a joint acoustic model decoding scheme is very effective. On the Aurora 4 task, the very deep CNN achieves a WER of 8.81%, which improves to 7.99% with auxiliary feature joint training and to 7.09% with LSTM-RNN joint decoding.
Comment: accepted by SLT 2016
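As a rough illustration of the "smaller filters and pooling, more layers" recipe, here is a hedged PyTorch sketch of a very deep CNN over a fixed time-frequency input map; the depth, channel widths, input-map size, and senone count are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def vgg_block(c_in, c_out, n_convs=2):
    """Stack small 3x3 convolutions (padded so maps keep their size), then
    pool; smaller filters and pooling leave room to add more layers."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, 3, padding=1),
                   nn.ReLU()]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# Ten weight layers over an extended (time, freq) input map; channel widths
# and the senone count are illustrative assumptions.
very_deep_cnn = nn.Sequential(
    vgg_block(1, 64), vgg_block(64, 128),
    vgg_block(128, 256), vgg_block(256, 512),
    nn.Flatten(),
    nn.Linear(512 * 2 * 2, 2048), nn.ReLU(),
    nn.Linear(2048, 2000),                   # per-frame senone scores
)

scores = very_deep_cnn(torch.randn(8, 1, 32, 32))  # 32x32 context window -> (8, 2000)
```

Extending the input feature maps (wider context windows, more frequency bands) is what keeps the maps large enough to survive repeated pooling as depth grows, which is the trade-off the abstract describes.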