Self-Training for End-to-End Speech Recognition
We revisit self-training in the context of end-to-end speech recognition. We
demonstrate that training with pseudo-labels can substantially improve the
accuracy of a baseline model. Key to our approach are a strong baseline
acoustic and language model used to generate the pseudo-labels, filtering
mechanisms tailored to common errors from sequence-to-sequence models, and a
novel ensemble approach to increase pseudo-label diversity. Experiments on the
LibriSpeech corpus show that with an ensemble of four models and label
filtering, self-training yields a 33.9% relative improvement in WER compared
with a baseline trained on 100 hours of labelled data in the noisy speech
setting. In the clean speech setting, self-training recovers 59.3% of the gap
between the baseline and an oracle model, which is at least 93.8% relatively
higher than what previous approaches can achieve. Comment: To be published in the 45th IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP) 2020
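The two headline numbers above use standard ASR reporting conventions: a relative WER improvement over the baseline, and the fraction of the baseline-to-oracle WER gap that self-training closes. A minimal sketch of both computations, using hypothetical WER values purely for illustration (not the paper's actual results):

```python
def relative_improvement(wer_baseline: float, wer_new: float) -> float:
    """Relative WER reduction over the baseline, in percent."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline

def gap_recovered(wer_baseline: float, wer_new: float, wer_oracle: float) -> float:
    """Percent of the baseline-to-oracle WER gap closed by the new model."""
    return 100.0 * (wer_baseline - wer_new) / (wer_baseline - wer_oracle)

# Illustrative WERs (hypothetical, not from the paper):
base, self_trained, oracle = 20.0, 14.0, 10.0
print(round(relative_improvement(base, self_trained), 1))  # 30.0
print(round(gap_recovered(base, self_trained, oracle), 1))  # 60.0
```

With these conventions, the reported 33.9% relative improvement and 59.3% gap recovery can be read directly from baseline, self-trained, and oracle WERs.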
On semi-supervised LF-MMI training of acoustic models with limited data
This work investigates semi-supervised training of acoustic models (AM) with the lattice-free maximum mutual information (LF-MMI) objective in practically relevant scenarios with a limited amount of labeled in-domain data. An error detection driven semi-supervised AM training approach is proposed, in which an error detector controls the hypothesized transcriptions or lattices used as LF-MMI training targets on additional unlabeled data. Under this approach, our first method uses a single error-tagged hypothesis, whereas our second method uses a modified supervision lattice. These methods are evaluated and compared with existing semi-supervised AM training methods in three different matched or mismatched, limited-data setups. Word error recovery rates of 28 to 89% are reported.
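The core idea of error-detection-driven training is to admit a hypothesized transcription as a training target only when the detector judges it reliable enough. A minimal sketch of this filtering step, with a toy detector and illustrative names (none of these identifiers or thresholds come from the paper):

```python
# Hypothetical sketch of error-detection-driven pseudo-label filtering.
def filter_pseudo_labels(hypotheses, error_detector, max_error_rate=0.2):
    """Keep only hypothesized transcriptions whose predicted error rate
    is at or below a threshold; these become the training targets
    on the unlabeled data."""
    kept = []
    for utt_id, transcript in hypotheses:
        if error_detector(transcript) <= max_error_rate:
            kept.append((utt_id, transcript))
    return kept

# Toy "detector": flags transcripts containing an <unk> token as erroneous.
def toy_detector(transcript):
    return 1.0 if "<unk>" in transcript else 0.0

hyps = [("utt1", "hello world"), ("utt2", "the <unk> ran")]
print(filter_pseudo_labels(hyps, toy_detector))  # [('utt1', 'hello world')]
```

The paper's two variants refine this idea: rather than an accept/reject decision per utterance, the first keeps a single error-tagged hypothesis and the second modifies the supervision lattice so that uncertain regions contribute differently to the LF-MMI objective.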