Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic in automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
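
As an illustration of the mask-based, single-channel front-end techniques such
surveys cover, here is a minimal sketch assuming PyTorch; the signal model
y = x * h + n (convolutional channel h plus additive noise n) matches the
degradation types above, while the MaskEstimator name and all layer sizes are
illustrative assumptions, not any specific published system.

import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    # BLSTM mask estimator: noisy log-magnitude spectra in,
    # time-frequency mask in [0, 1] out.
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_freq)

    def forward(self, noisy_logmag):           # (batch, frames, n_freq)
        h, _ = self.blstm(noisy_logmag)
        return torch.sigmoid(self.proj(h))     # mask in [0, 1]

# Enhancement front-end: mask the noisy magnitude spectrogram and
# hand the result to the recognizer's feature pipeline.
model = MaskEstimator()
noisy_mag = torch.rand(1, 200, 257)            # |STFT| of y = x * h + n
enhanced = model(noisy_mag.log1p()) * noisy_mag

In a joint front-end and back-end training framework, the same mask estimator
would simply be optimized through the recognizer's loss rather than a separate
enhancement objective.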
A practical two-stage training strategy for multi-stream end-to-end speech recognition
The multi-stream paradigm of audio processing, in which several sources are
simultaneously considered, has been an active research area for information
fusion. Our previous study offered a promising direction for end-to-end
automatic speech recognition, in which parallel encoders capture diverse
information and a stream-level, attention-based fusion combines the different
views. However, since each additional stream brings its own encoder, the
previous approach can require substantial memory and massive amounts of
parallel data for joint
training. In this work, we propose a practical two-stage training scheme.
Stage-1 trains a Universal Feature Extractor (UFE): a single-stream model
trained on all data whose encoder outputs serve as features. Stage-2
formulates a multi-stream scheme that trains only the attention fusion module,
using the UFE features and pretrained components from Stage-1.
Experiments have been conducted on two datasets, DIRHA and AMI, each treated
as a multi-stream scenario. Compared with our previous method, this strategy
achieves relative word error rate reductions of 8.2--32.4%, while consistently
outperforming several conventional combination methods.
Comment: submitted to ICASSP 201
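
As an illustration of the two-stage idea, the following is a minimal sketch
assuming PyTorch; the StreamFusion module, the GRU standing in for the
pretrained UFE, and all dimensions are assumptions for illustration, not the
authors' implementation.

import torch
import torch.nn as nn

class StreamFusion(nn.Module):
    # Stream-level attention: score each stream's frames, softmax
    # across streams, and take the convex combination.
    def __init__(self, d_model=256):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, feats):                  # (streams, batch, T, d)
        w = torch.softmax(self.score(feats), dim=0)
        return (w * feats).sum(dim=0)          # fused (batch, T, d)

# Stage-1 (assumed done elsewhere): train one encoder on pooled data.
ufe = nn.GRU(80, 256, batch_first=True)        # stand-in for the UFE
for p in ufe.parameters():
    p.requires_grad = False                    # freeze pretrained UFE

# Stage-2: only the attention fusion module receives gradients.
fusion = StreamFusion(256)
optimizer = torch.optim.Adam(fusion.parameters(), lr=1e-3)

streams = [torch.randn(4, 100, 80) for _ in range(2)]  # two microphones
with torch.no_grad():
    feats = torch.stack([ufe(s)[0] for s in streams])  # per-stream features
fused = fusion(feats)                          # goes on to the decoder

Freezing the per-stream encoders is what removes the memory and parallel-data
burden described above: only the small fusion module is trained on the
multi-stream data.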