Improved Noisy Student Training for Automatic Speech Recognition
Recently, a semi-supervised learning method known as "noisy student training"
has been shown to improve image classification performance of deep networks
significantly. Noisy student training is an iterative self-training method that
leverages augmentation to improve network performance. In this work, we adapt
and improve noisy student training for automatic speech recognition, employing
(adaptive) SpecAugment as the augmentation method. We find effective methods to
filter, balance and augment the data generated in between self-training
iterations. By doing so, we are able to obtain word error rates (WERs) of
4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h
subset of LibriSpeech as the supervised set and the rest (860h) as the
unlabeled set. Furthermore, we are able to achieve WERs of 1.7%/3.4% on the
clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight
as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the
previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h
(4.74%/12.20%) and LibriSpeech (1.9%/4.1%).
Comment: 5 pages, 5 figures, 4 tables; v2: minor revisions, reference added
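The improvements quoted in the abstract above can be expressed as relative WER reductions. The following is a minimal arithmetic sketch using only the numbers stated in that abstract; the function name `relative_wer_reduction` and the dictionary labels are illustrative and do not come from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# (previous state of the art, reported result) in % WER, as quoted in the abstract.
results = {
    "LibriSpeech 100h, clean": (4.74, 4.2),
    "LibriSpeech 100h, noisy": (12.20, 8.6),
    "LibriSpeech 960h, clean": (1.9, 1.7),
    "LibriSpeech 960h, noisy": (4.1, 3.4),
}

for name, (previous, reported) in results.items():
    reduction = relative_wer_reduction(previous, reported)
    print(f"{name}: {reduction:.1f}% relative reduction")
```

Under this definition, the reported numbers correspond to relative reductions of roughly 11.4% and 29.5% on the 100h clean/noisy test sets, and roughly 10.5% and 17.1% on the 960h clean/noisy test sets.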
PM-MMUT: Boosted Phone-Mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition
Consonant and vowel reduction are often encountered in speech, which might
cause performance degradation in automatic speech recognition (ASR). Our
recently proposed learning strategy based on masking, Phone Masking Training
(PMT), alleviates the impact of such phenomena in Uyghur ASR. Although PMT
achieves remarkable improvements, there is still room for further gains due
to the granularity mismatch between the masking unit of PMT (phoneme) and the
modeling unit (word-piece). To boost the performance of PMT, we propose
fusing a multi-modeling unit training (MMUT) architecture with PMT (PM-MMUT). The
idea of the MMUT framework is to split the encoder into two parts:
acoustic feature sequences to phoneme-level representation (AF-to-PLR) and
phoneme-level representation to word-piece-level representation (PLR-to-WPLR).
It allows AF-to-PLR to be optimized by an intermediate phoneme-based CTC loss
to learn the rich phoneme-level context information brought by PMT.
Experimental results on Uyghur ASR show that the proposed approaches clearly
outperform pure PMT. We also conduct experiments on the 960-hour LibriSpeech
benchmark using ESPnet1, achieving about 10% relative WER reduction on all
test sets without LM fusion compared with the latest official ESPnet1
pre-trained model.
Comment: Accepted to INTERSPEECH 202
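The data flow described in the abstract above (an encoder split into AF-to-PLR and PLR-to-WPLR, with an intermediate phoneme-based CTC loss on the first part) might be sketched as follows. This is a framework-free illustration, not the authors' implementation: all function and parameter names are hypothetical, and the linear interpolation with `phoneme_weight` is an assumption, since the abstract does not state how the two losses are combined.

```python
def mmut_training_loss(features, af_to_plr, plr_to_wplr,
                       phoneme_ctc_loss, wordpiece_loss,
                       phoneme_weight=0.3):
    """Combine a word-piece loss with an intermediate phoneme-level loss.

    af_to_plr, plr_to_wplr: callables standing in for the two encoder parts.
    phoneme_ctc_loss, wordpiece_loss: callables returning a scalar loss.
    phoneme_weight: assumed interpolation weight (not given in the abstract).
    """
    # Lower encoder part: acoustic features -> phoneme-level representation.
    plr = af_to_plr(features)
    # Upper encoder part: phoneme-level -> word-piece-level representation.
    wplr = plr_to_wplr(plr)
    # The intermediate loss is computed on the lower part's output only.
    loss_phoneme = phoneme_ctc_loss(plr)
    loss_wordpiece = wordpiece_loss(wplr)
    return (1.0 - phoneme_weight) * loss_wordpiece + phoneme_weight * loss_phoneme


# Toy usage with stand-in callables; a real system would use neural
# network layers and actual CTC / attention losses here.
toy_loss = mmut_training_loss(
    [1.0, 2.0],
    af_to_plr=lambda xs: [2.0 * x for x in xs],
    plr_to_wplr=lambda hs: [h + 1.0 for h in hs],
    phoneme_ctc_loss=sum,
    wordpiece_loss=sum,
    phoneme_weight=0.5,
)
print(toy_loss)
```

Because the phoneme-level loss depends only on the output of the first encoder part, it supervises that part directly, while the word-piece loss still trains both parts end to end.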