10 research outputs found
Improved Noisy Student Training for Automatic Speech Recognition
Recently, a semi-supervised learning method known as "noisy student training"
has been shown to improve image classification performance of deep networks
significantly. Noisy student training is an iterative self-training method that
leverages augmentation to improve network performance. In this work, we adapt
and improve noisy student training for automatic speech recognition, employing
(adaptive) SpecAugment as the augmentation method. We find effective methods to
filter, balance and augment the data generated in between self-training
iterations. By doing so, we are able to obtain word error rates (WERs)
4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h
subset of LibriSpeech as the supervised set and the rest (860h) as the
unlabeled set. Furthermore, we are able to achieve WERs 1.7%/3.4% on the
clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight
as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the
previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h
(4.74%/12.20%) and LibriSpeech (1.9%/4.1%).Comment: 5 pages, 5 figures, 4 tables; v2: minor revisions, reference adde
RWTH ASR Systems for LibriSpeech: Hybrid vs Attention -- w/o Data Augmentation
We present state-of-the-art automatic speech recognition (ASR) systems
employing a standard hybrid DNN/HMM architecture compared to an attention-based
encoder-decoder design for the LibriSpeech task. Detailed descriptions of the
system development, including model design, pretraining schemes, training
schedules, and optimization approaches are provided for both system
architectures. Both hybrid DNN/HMM and attention-based systems employ
bi-directional LSTMs for acoustic modeling/encoding. For language modeling, we
employ both LSTM and Transformer based architectures. All our systems are built
using RWTHs open-source toolkits RASR and RETURNN. To the best knowledge of the
authors, the results obtained when training on the full LibriSpeech training
set, are the best published currently, both for the hybrid DNN/HMM and the
attention-based systems. Our single hybrid system even outperforms previous
results obtained from combining eight single systems. Our comparison shows that
on the LibriSpeech 960h task, the hybrid DNN/HMM system outperforms the
attention-based system by 15% relative on the clean and 40% relative on the
other test sets in terms of word error rate. Moreover, experiments on a reduced
100h-subset of the LibriSpeech training corpus even show a more pronounced
margin between the hybrid DNN/HMM and attention-based architectures.Comment: Proceedings of INTERSPEECH 201
On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
In conventional speech recognition, phoneme-based models outperform
grapheme-based models for non-phonetic languages such as English. The
performance gap between the two typically reduces as the amount of training
data is increased. In this work, we examine the impact of the choice of
modeling unit for attention-based encoder-decoder models. We conduct
experiments on the LibriSpeech 100hr, 460hr, and 960hr tasks, using various
target units (phoneme, grapheme, and word-piece); across all tasks, we find
that grapheme or word-piece models consistently outperform phoneme-based
models, even though they are evaluated without a lexicon or an external
language model. We also investigate model complementarity: we find that we can
improve WERs by up to 9% relative by rescoring N-best lists generated from a
strong word-piece based baseline with either the phoneme or the grapheme model.
Rescoring an N-best list generated by the phonemic system, however, provides
limited improvements. Further analysis shows that the word-piece-based models
produce more diverse N-best hypotheses, and thus lower oracle WERs, than
phonemic models.Comment: To appear in the proceedings of INTERSPEECH 201
Language Modeling with Deep Transformers
We explore deep autoregressive Transformer models in language modeling for
speech recognition. We focus on two aspects. First, we revisit Transformer
model configurations specifically for language modeling. We show that well
configured Transformer models outperform our baseline models based on the
shallow stack of LSTM recurrent neural network layers. We carry out experiments
on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level
and 10K byte-pair encoding subword-level language modeling. We apply our
word-level models to conventional hybrid speech recognition by lattice
rescoring, and the subword-level models to attention based encoder-decoder
models by shallow fusion. Second, we show that deep Transformer language models
do not require positional encoding. The positional encoding is an essential
augmentation for the self-attention mechanism which is invariant to sequence
ordering. However, in autoregressive setup, as is the case for language
modeling, the amount of information increases along the position dimension,
which is a positional signal by its own. The analysis of attention weights
shows that deep autoregressive self-attention models can automatically make use
of such positional information. We find that removing the positional encoding
even slightly improves the performance of these models.Comment: To appear in the proceedings of INTERSPEECH 201