75 research outputs found
MUST: A Multilingual Student-Teacher Learning approach for low-resource speech recognition
Student-teacher learning, or knowledge distillation (KD), has previously been used to address the data scarcity issue in training automatic speech recognition (ASR) systems. However, a limitation of KD training is that the student model's classes must be a subset of the teacher model's classes. This prevents distillation even between acoustically similar languages if their character sets are not the same. In this work, the aforementioned limitation is addressed by proposing MUltilingual Student-Teacher (MUST) learning, which exploits a posterior mapping approach. A pre-trained mapping model is used to map posteriors from a teacher language to the student-language ASR. These mapped posteriors are used as soft labels for KD learning. Various teacher ensemble schemes are explored to train ASR models for low-resource languages. A model trained with MUST learning reduces character error rate (CER) by up to 9.5% relative to a baseline monolingual ASR.
Comment: Accepted for IEEE ASRU 202
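The core idea above can be sketched in a few lines. This is a minimal, illustrative reading of posterior mapping for KD, not the paper's implementation: the pre-trained mapping model is reduced here to a single stochastic matrix that converts teacher-language posteriors into soft labels over the student language's character set, which then drive a standard soft-label cross-entropy loss. All names, shapes, and the toy dimensions are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def map_posteriors(teacher_post, mapping):
    """Map teacher posteriors (T, C_teacher) to the student's class set.

    `mapping` (C_teacher, C_student) stands in for the pre-trained
    mapping model; rows are renormalized to stay valid distributions.
    """
    mapped = teacher_post @ mapping
    return mapped / mapped.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, soft_labels):
    """Frame-averaged cross-entropy against the mapped soft labels."""
    log_probs = np.log(softmax(student_logits))
    return float(-(soft_labels * log_probs).sum(axis=-1).mean())

# Toy example: 4 frames, 5 teacher classes, 3 student classes.
rng = np.random.default_rng(0)
teacher_post = softmax(rng.normal(size=(4, 5)))
mapping = softmax(rng.normal(size=(5, 3)))
soft = map_posteriors(teacher_post, mapping)
loss = kd_loss(rng.normal(size=(4, 3)), soft)
```

In a full system the mapping matrix would be a trained network and the KD term would be combined with the usual hard-label loss; the sketch only shows how mapped posteriors slot into KD as soft targets.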
Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers
Lip reading has witnessed unparalleled development in recent years thanks to deep learning and the availability of large-scale datasets. Despite the encouraging results achieved, the performance of lip reading unfortunately remains inferior to that of its counterpart, speech recognition, because the ambiguous nature of lip actuations makes it challenging to extract discriminative features from lip movement videos. In this paper, we propose a new method, termed Lip by Speech (LIBS), whose goal is to strengthen lip reading by learning from speech recognizers. The rationale behind our approach is that features extracted from speech recognizers may provide complementary and discriminative clues that are difficult to obtain from the subtle movements of the lips, and consequently facilitate the training of lip readers. This is achieved by distilling multi-granularity knowledge from speech recognizers to lip readers. To conduct this cross-modal knowledge distillation, we use an effective alignment scheme to handle the inconsistent lengths of the audio and video, as well as a filtering strategy to refine the speech recognizer's predictions. The proposed method achieves new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by margins of 7.66% and 2.75% in character error rate, respectively.
Comment: AAAI 202
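The length-mismatch problem mentioned above can be made concrete with a small sketch. The alignment scheme here is a hedged assumption (simple linear interpolation to the video frame rate), not LIBS's actual method: audio-derived teacher features are resampled to the number of video frames so that a frame-wise distillation loss can be applied to the lip reader's features.

```python
import numpy as np

def align_to_video(audio_feats, n_video_frames):
    """Linearly interpolate (T_audio, D) features to (T_video, D)."""
    t_audio, _ = audio_feats.shape
    src = np.linspace(0.0, t_audio - 1, num=n_video_frames)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, t_audio - 1)
    w = (src - lo)[:, None]
    return (1 - w) * audio_feats[lo] + w * audio_feats[hi]

def distill_loss(video_feats, audio_feats):
    """Frame-wise MSE between lip-reader and aligned speech features."""
    aligned = align_to_video(audio_feats, video_feats.shape[0])
    return float(np.mean((video_feats - aligned) ** 2))

# Toy example: 25 video frames distilled from 100 audio frames.
rng = np.random.default_rng(1)
loss = distill_loss(rng.normal(size=(25, 8)), rng.normal(size=(100, 8)))
```

Any monotonic audio-to-video alignment (learned attention, DTW, or the paper's own scheme) could replace the interpolation step; the point is that the teacher's features must be brought to the student's time axis before a frame-level loss makes sense.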
Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the elaborate attention calculation process is not robust against long-form speech utterances. Moreover, the sequence-level training objective and time-restricted streaming encoder cause a nonnegligible delay in token emission during inference. To address these problems, we propose CTC synchronous training (CTC-ST), in which CTC alignments are leveraged as a reference for token boundaries to enable a MoChA model to learn optimal monotonic input-output alignments. We formulate a purely end-to-end training objective that synchronizes the boundaries of MoChA to those of CTC. The CTC model shares an encoder with the MoChA model to enhance the encoder representation. Moreover, the proposed method provides alignment information learned in the CTC branch to the attention-based decoder. Therefore, CTC-ST can be regarded as self-distillation of alignment knowledge from CTC to MoChA. Experimental evaluations on a variety of benchmark datasets show that the proposed method significantly reduces recognition errors and emission latency simultaneously. The robustness to long-form and noisy speech is also demonstrated. We compare CTC-ST with several methods that distill alignment knowledge from a hybrid ASR system and show that CTC-ST can achieve a comparable trade-off between accuracy and latency without relying on external alignment information.
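The boundary-synchronization idea described above can be sketched compactly. The boundary-extraction rule and the quadratic synchronization loss below are illustrative assumptions based on the abstract, not the paper's code: token boundaries are read off a CTC alignment (blank = 0, taking the last frame of each repeated non-blank run), and MoChA's predicted attention endpoints are pulled toward them.

```python
def ctc_boundaries(alignment, blank=0):
    """Frame indices where each non-blank token ends in a CTC path."""
    bounds = []
    for t, label in enumerate(alignment):
        nxt = alignment[t + 1] if t + 1 < len(alignment) else blank
        if label != blank and nxt != label:
            bounds.append(t)
    return bounds

def sync_loss(mocha_boundaries, ctc_alignment, blank=0):
    """Mean squared frame distance between MoChA and CTC token boundaries."""
    ref = ctc_boundaries(ctc_alignment, blank)
    n = min(len(ref), len(mocha_boundaries))
    diffs = [(mocha_boundaries[i] - ref[i]) ** 2 for i in range(n)]
    return sum(diffs) / max(n, 1)

# CTC path for a two-token utterance: boundaries fall at frames 2 and 5.
path = [0, 1, 1, 0, 2, 2, 0]
```

In training, the MoChA boundaries would be differentiable expectations of the monotonic attention distribution rather than integers, and this loss would be added to the end-to-end objective; the sketch only shows where the CTC alignment enters as the reference.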