917 research outputs found
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that stills
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks
English Conversational Telephone Speech Recognition by Humans and Machines
One of the most difficult speech recognition tasks is accurate recognition of
human to human communication. Advances in deep learning over the last few years
have produced major speech recognition improvements on the representative
Switchboard conversational corpus. Word error rates that just a few years ago
were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now
believed to be within striking range of human performance. This then raises two
issues - what IS human performance, and how far down can we still drive speech
recognition error rates? A recent paper by Microsoft suggests that we have
already achieved human performance. In trying to verify this statement, we
performed an independent set of human performance measurements on two
conversational tasks and found that human performance may be considerably
better than what was earlier reported, giving the community a significantly
harder goal to achieve. We also report on our own efforts in this area,
presenting a set of acoustic and language modeling techniques that lowered the
word error rate of our own English conversational telephone LVCSR system to the
level of 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000
evaluation, which - at least at the writing of this paper - is a new
performance milestone (albeit not at what we measure to be human performance!).
On the acoustic side, we use a score fusion of three models: one LSTM with
multiple feature inputs, a second LSTM trained with speaker-adversarial
multi-task learning and a third residual net (ResNet) with 25 convolutional
layers and time-dilated convolutions. On the language modeling side, we use
word and character LSTMs and convolutional WaveNet-style language models
Robust Speech Recognition Using Generative Adversarial Networks
This paper describes a general, scalable, end-to-end framework that uses the
generative adversarial network (GAN) objective to enable robust speech
recognition. Encoders trained with the proposed approach enjoy improved
invariance by learning to map noisy audio to the same embedding space as that
of clean audio. Unlike previous methods, the new framework does not rely on
domain expertise or simplifying assumptions as are often needed in signal
processing, and directly encourages robustness in a data-driven way. We show
the new approach improves simulated far-field speech recognition of vanilla
sequence-to-sequence models without specialized front-ends or preprocessing
- …