Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition
We propose a novel approach to semi-supervised automatic speech recognition
(ASR). We first exploit a large amount of unlabeled audio data via
representation learning, where we reconstruct a temporal slice of filterbank
features from past and future context frames. The resulting deep contextualized
acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end
ASR system using a smaller amount of labeled audio data. In our experiments, we
show that systems trained on DeCoAR consistently outperform ones trained on
conventional filterbank features, giving 42% and 19% relative improvement over
the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our
approach can drastically reduce the amount of labeled data required;
unsupervised training on LibriSpeech followed by supervision with 100 hours of labeled
data achieves performance on par with training on all 960 hours directly.
Pre-trained models and code will be released online.
Comment: Accepted to ICASSP 2020 (oral)
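The slice-reconstruction objective described in this abstract can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the use of a plain mean over context frames as a stand-in for a learned encoder, the random linear predictor, and the choice of an L1 loss.

```python
import numpy as np

def context_and_targets(feats, slice_len):
    """Split an utterance into (past summary, future summary, target slice) triples.

    feats: (T, D) array of filterbank frames.
    The mean over context frames is a hypothetical stand-in for a learned encoder.
    """
    T, _ = feats.shape
    past, future, targets = [], [], []
    # Leave at least one frame of context on each side of the slice.
    for start in range(1, T - slice_len):
        end = start + slice_len
        past.append(feats[:start].mean(axis=0))    # summary of frames before the slice
        future.append(feats[end:].mean(axis=0))    # summary of frames after the slice
        targets.append(feats[start:end])           # the slice to reconstruct
    return np.stack(past), np.stack(future), np.stack(targets)

def l1_reconstruction_loss(pred, targets):
    """Mean absolute error between predicted and true slices (assumed loss)."""
    return float(np.abs(pred - targets).mean())

rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 40))                  # 20 frames, 40 filterbank bins (toy data)
past, future, targets = context_and_targets(feats, slice_len=4)

# Hypothetical linear predictor mapping the two context summaries to a whole slice.
context = np.concatenate([past, future], axis=1)
W = rng.normal(scale=0.01, size=(context.shape[1], 4 * 40))
pred = (context @ W).reshape(targets.shape)
print(l1_reconstruction_loss(pred, targets))
```

In a real system the two context summaries would come from trained forward and backward sequence encoders, and the loss would be minimized over unlabeled audio before any labeled data is used.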
Stimulated training for automatic speech recognition and keyword search in limited resource conditions
© 2017 IEEE. Training neural network acoustic models on limited quantities of data is a challenging task. A number of techniques have been proposed to improve generalisation. This paper investigates one such technique, called stimulated training. It enables standard criteria such as cross-entropy to enforce spatial constraints on activations originating from different units. Having different regions active depending on the input unit may help the network discriminate better and, as a consequence, yield lower error rates. This paper investigates stimulated training for automatic speech recognition of a number of languages representing different families, alphabets, phone sets and vocabulary sizes. In particular, it looks at ensembles of stimulated networks to ensure that the improved generalisation withstands system combination effects. To assess stimulated training beyond 1-best transcription accuracy, this paper uses keyword search as a proxy for assessing the quality of lattices. Experiments are conducted on IARPA Babel program languages, including the surprise language of the OpenKWS 2016 competition.
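One possible form of a spatial constraint on activations is sketched below. The abstract does not give the formulation, so every detail here is an assumption: hidden units are laid out on a square grid, each input unit (for example a phone) is assigned a hypothetical grid position, and a KL-divergence penalty pulls the normalized activation pattern toward a Gaussian bump centred on that position. All function names are illustrative.

```python
import numpy as np

def grid_coords(side):
    """Coordinates of side*side hidden units arranged on a square grid."""
    ys, xs = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)

def target_pattern(coords, centre, sigma=1.0):
    """Normalized Gaussian bump centred on the grid position assigned to a unit."""
    d2 = ((coords - centre) ** 2).sum(axis=1)
    g = np.exp(-d2 / (2.0 * sigma ** 2))
    return g / g.sum()

def spatial_penalty(activations, target, eps=1e-12):
    """KL(target || normalized |activations|); small when the active region matches."""
    a = np.abs(activations)
    a = a / (a.sum() + eps)
    return float(np.sum(target * (np.log(target + eps) - np.log(a + eps))))

coords = grid_coords(4)                                   # 16 hidden units on a 4x4 grid
target = target_pattern(coords, centre=np.array([1.0, 2.0]))

uniform = np.ones(16)                                     # every unit equally active
print(spatial_penalty(uniform, target))                   # positive: pattern is not localized
print(spatial_penalty(target, target))                    # near zero: pattern matches target

# During training the penalty would be added to the usual criterion, e.g.
# total_loss = cross_entropy + weight * spatial_penalty(hidden, target),
# where `weight` is a hypothetical regularization strength.
```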
Training ASR models by Generation of Contextual Information
Supervised ASR models have reached unprecedented levels of accuracy, thanks
in part to ever-increasing amounts of labelled training data. However, in many
applications and locales, only moderate amounts of data are available, which
has led to a surge in semi- and weakly-supervised learning research. In this
paper, we conduct a large-scale study evaluating the effectiveness of
weakly-supervised learning for speech recognition by using loosely related
contextual information as a surrogate for ground-truth labels. For weakly
supervised training, we use 50k hours of public English social media videos
along with their respective titles and post text to train an encoder-decoder
transformer model. Our best encoder-decoder models achieve an average of 20.8%
WER reduction over a 1000-hour supervised baseline, and an average of 13.4%
WER reduction when using only the weakly supervised encoder for CTC
fine-tuning. Our results show that our setup for weak supervision improved both
the encoder acoustic representations as well as the decoder language generation
abilities.
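The abstracts above report results as relative WER reductions. A short sketch of how word error rate and a relative reduction are commonly computed follows. The function names and the example numbers are illustrative assumptions and do not reproduce any figures from the papers.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))            # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer

# Toy example: one deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))

# Hypothetical WERs: going from 10.0% to 8.0% is a 20% relative reduction,
# even though the absolute difference is only 2 points.
print(relative_reduction(10.0, 8.0))
```

Relative figures depend strongly on the baseline, so a given percentage reduction on one test set is not directly comparable to the same percentage on another.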