Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
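A common single-channel front-end of the kind surveyed above is mask-based enhancement. The following minimal sketch uses an oracle ideal ratio mask on magnitude features; in an actual system the mask would instead be predicted by a trained neural network from the noisy input, and all values and names here are illustrative assumptions.

```python
# Oracle mask-based enhancement sketch (illustrative assumptions only; a real
# front-end predicts the mask with a DNN rather than using clean references).
def ideal_ratio_mask(clean_mag, noise_mag):
    # Per-frequency-bin ratio mask in [0, 1]: clean energy share of the mixture.
    return [c / (c + n) for c, n in zip(clean_mag, noise_mag)]

def apply_mask(noisy_mag, mask):
    # Element-wise attenuation of the noisy magnitude spectrum.
    return [m * y for m, y in zip(mask, noisy_mag)]

clean = [1.0, 0.2, 0.8]                       # toy clean magnitudes
noise = [0.1, 0.6, 0.4]                       # toy additive-noise magnitudes
noisy = [c + n for c, n in zip(clean, noise)]  # additive degradation model
mask = ideal_ratio_mask(clean, noise)
enhanced = apply_mask(noisy, mask)
```

Because the magnitudes are assumed to add exactly, the oracle mask recovers the clean magnitudes; a learned mask only approximates this upper bound.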
Light Gated Recurrent Units for Speech Recognition
A field that has directly benefited from the recent advances in deep learning
is Automatic Speech Recognition (ASR). Despite the great achievements of the
past decades, however, a natural and robust human-machine speech interaction
still appears to be out of reach, especially in challenging environments
characterized by significant noise and reverberation. To improve robustness,
modern speech recognizers often employ acoustic models based on Recurrent
Neural Networks (RNNs), that are naturally able to exploit large time contexts
and long-term speech modulations. It is thus of great interest to continue the
study of proper techniques for improving the effectiveness of RNNs in
processing speech signals.
In this paper, we revise one of the most popular RNN models, namely Gated
Recurrent Units (GRUs), and propose a simplified architecture that turned out
to be very effective for ASR. The contribution of this work is two-fold: First,
we analyze the role played by the reset gate, showing that a significant
redundancy with the update gate occurs. As a result, we propose to remove the
former from the GRU design, leading to a more efficient and compact single-gate
model. Second, we propose to replace hyperbolic tangent with ReLU activations.
This variation couples well with batch normalization and could help the model
learn long-term dependencies without numerical issues.
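The resulting single-gate cell can be sketched as follows. This is a scalar toy version with hypothetical weights, and batch normalization is omitted for brevity; the point is the structure: no reset gate, and a ReLU candidate state in place of tanh.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def li_gru_step(x, h_prev, w):
    """One Li-GRU step for scalar input/hidden state (illustration only).

    Unlike a standard GRU, there is no reset gate: the candidate state sees
    h_prev directly, and tanh is replaced by ReLU (batch norm omitted here).
    """
    z = sigmoid(w["wz"] * x + w["uz"] * h_prev)    # update gate
    h_cand = relu(w["wh"] * x + w["uh"] * h_prev)  # ReLU candidate state
    return z * h_prev + (1.0 - z) * h_cand         # gated interpolation

# Hypothetical weights, for illustration only.
weights = {"wz": 0.5, "uz": 0.3, "wh": 0.8, "uh": 0.2}
h = 0.0
for x in [1.0, 0.5, -0.3]:
    h = li_gru_step(x, h, weights)
```

With the reset gate gone, each step needs one gate computation instead of two, which is where the reported per-epoch training-time savings come from.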
Results show that the proposed architecture, called Light GRU (Li-GRU), not
only reduces the per-epoch training time by more than 30% over a standard GRU,
but also consistently improves the recognition accuracy across different tasks,
input features, noisy conditions, as well as across different ASR paradigms,
ranging from standard DNN-HMM speech recognizers to end-to-end CTC models.
Comment: Copyright 2018 IEEE
Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition
We investigate the use of generative adversarial networks (GANs) in speech
dereverberation for robust speech recognition. GANs have been recently studied
for speech enhancement to remove additive noise, but work examining their
ability in speech dereverberation is still lacking, and the advantages of
using GANs have not been fully established. In this paper, we provide an
in-depth investigation of GAN-based dereverberation front-ends for ASR. First,
we study the effectiveness of different dereverberation networks (the generator
in GAN) and find that LSTM leads to a significant improvement compared with
feed-forward DNN and CNN in our dataset. Second, further adding residual
connections in the deep LSTMs can boost the performance as well. Finally, we
find that, for the success of the GAN, it is important to update the generator
and the discriminator using the same mini-batch data during training. Moreover,
using the reverberant spectrogram as a condition for the discriminator, as suggested in
previous studies, may degrade the performance. In summary, our GAN-based
dereverberation front-end achieves 14%-19% relative CER reduction as compared
to the baseline DNN dereverberation network when tested with a strong
multi-condition trained acoustic model.
Comment: Interspeech 201
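The same-mini-batch update schedule highlighted above can be sketched with the following dependency-free toy. The scalar "models" and finite-difference gradients are assumptions purely for illustration; the paper's generator is a residual deep LSTM and its discriminator operates on spectrogram features.

```python
# Toy adversarial training loop (illustrative assumptions only).
def generator(reverb_batch, g):
    # Maps reverberant features toward clean-looking features.
    return [g * x for x in reverb_batch]

def discriminator(batch, d):
    # Scores how "clean" a batch looks (higher = more clean-like).
    return sum(d * x for x in batch) / len(batch)

def train_step(reverb_batch, clean_batch, g, d, lr=0.05, eps=1e-4):
    # Both models are updated on the SAME mini-batch, which the abstract
    # reports as important for successful GAN dereverberation training.
    fake = generator(reverb_batch, g)
    # Discriminator step: raise clean scores, lower fake scores.
    d_loss = lambda p: discriminator(fake, p) - discriminator(clean_batch, p)
    d -= lr * (d_loss(d + eps) - d_loss(d - eps)) / (2 * eps)
    # Generator step on the same reverb_batch, against the updated discriminator.
    g_loss = lambda p: -discriminator(generator(reverb_batch, p), d)
    g -= lr * (g_loss(g + eps) - g_loss(g - eps)) / (2 * eps)
    return g, d

g, d = 0.5, 0.1                                # toy initial parameters
reverb, clean = [1.2, 0.9, 1.1], [1.0, 0.7, 0.8]
for _ in range(10):
    g, d = train_step(reverb, clean, g, d)
```

Splitting generator and discriminator updates across different mini-batches would break the pairing between the fake examples the discriminator just scored and the gradient signal the generator receives, which is the failure mode the same-batch schedule avoids.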