Training ASR models by Generation of Contextual Information
Supervised ASR models have reached unprecedented levels of accuracy, thanks
in part to ever-increasing amounts of labelled training data. However, in many
applications and locales, only moderate amounts of data are available, which
has led to a surge in semi- and weakly-supervised learning research. In this
paper, we conduct a large-scale study evaluating the effectiveness of
weakly-supervised learning for speech recognition by using loosely related
contextual information as a surrogate for ground-truth labels. For weakly
supervised training, we use 50k hours of public English social media videos
along with their respective titles and post text to train an encoder-decoder
transformer model. Our best encoder-decoder models achieve an average of 20.8%
WER reduction over a 1,000-hour supervised baseline, and an average of 13.4%
WER reduction when using only the weakly supervised encoder for CTC
fine-tuning. Our results show that our weak-supervision setup improves both the
encoder's acoustic representations and the decoder's language generation
abilities.
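
The two-stage recipe the abstract describes can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the paper's implementation: the module names (SpeechEncoder, lm_head, ctc_head), model sizes, and toy tensors are all hypothetical, chosen only to show contextual text standing in for transcripts in stage 1 and the pre-trained encoder being reused for CTC in stage 2.

```python
# Hypothetical sketch of the two training stages; names and sizes are
# assumptions for illustration, not the paper's actual configuration.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Transformer encoder over acoustic features (e.g. log-mel filterbanks)."""
    def __init__(self, feat_dim=80, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, feats):  # feats: (batch, time, feat_dim)
        return self.encoder(self.proj(feats))

vocab_size, d_model = 10_000, 512
encoder = SpeechEncoder(d_model=d_model)

# --- Stage 1: weakly supervised pre-training --------------------------------
# The decoder learns to generate the video's title and post text from the
# audio, i.e. loosely related context is the surrogate for the transcript.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, 8, batch_first=True), num_layers=6)
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)
xent = nn.CrossEntropyLoss(ignore_index=0)            # id 0 reserved for padding

feats = torch.randn(4, 200, 80)                       # toy batch of audio features
context_ids = torch.randint(1, vocab_size, (4, 30))   # title + post text tokens

memory = encoder(feats)
tgt = embed(context_ids[:, :-1])                      # teacher forcing
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
dec_out = decoder(tgt, memory, tgt_mask=tgt_mask)
weak_loss = xent(lm_head(dec_out).transpose(1, 2), context_ids[:, 1:])

# --- Stage 2: CTC fine-tuning of the pre-trained encoder --------------------
# The decoder is discarded; a CTC head on top of the weakly supervised
# encoder is fine-tuned on the much smaller labelled set.
ctc_head = nn.Linear(d_model, vocab_size)
ctc = nn.CTCLoss(blank=0)

enc_out = encoder(feats)                              # reuses pre-trained weights
log_probs = ctc_head(enc_out).log_softmax(-1).transpose(0, 1)  # (T, B, V)
transcripts = torch.randint(1, vocab_size, (4, 25))   # ground-truth labels
in_lens = torch.full((4,), 200, dtype=torch.long)
tgt_lens = torch.full((4,), 25, dtype=torch.long)
ctc_loss = ctc(log_probs, transcripts, in_lens, tgt_lens)
```

The sketch reflects the design choice the abstract reports on: the decoder is useful both as an end model (the 20.8% average WER reduction of the best encoder-decoder systems) and as a pre-training device whose encoder alone is kept for CTC fine-tuning (the 13.4% average reduction).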