Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation
Conventional automatic speech recognition (ASR) systems trained from
frame-level alignments can easily leverage posterior fusion to improve ASR
accuracy and build a better single model with knowledge distillation.
End-to-end ASR systems trained using the Connectionist Temporal Classification
(CTC) loss do not require frame-level alignment and hence simplify model
training. However, sparse and arbitrary posterior spike timings from CTC models
pose a new set of challenges in posterior fusion from multiple models and
knowledge distillation between CTC models. We propose a method to train a CTC
model so that its spike timings are guided to align with those of a pre-trained
guiding CTC model. As a result, all models that share the same guiding model
have aligned spike timings. We show the advantage of our method in various
scenarios including posterior fusion of CTC models and knowledge distillation
between CTC models with different architectures. With the 300-hour Switchboard
training data, the single word CTC model distilled from multiple models
improved the word error rates to 13.7%/23.1% from 14.9%/24.1% on the Hub5 2000
Switchboard/CallHome test sets without using any data augmentation, language
model, or complex decoder.
Comment: Accepted to Interspeech 201
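The guiding idea above can be sketched as an auxiliary frame-level divergence between the student's posteriors and those of a frozen guiding CTC model. The following numpy sketch is a hypothetical illustration, not the paper's exact formulation; the function name and the choice of KL direction are assumptions:

```python
import numpy as np

def guided_ctc_penalty(guide_post, student_post, eps=1e-12):
    """Mean per-frame KL(guide || student) over (T, V) posterior grids.

    Hypothetical auxiliary term: added to the student's CTC loss, it
    pulls the student's spike timings toward the guiding model's, so
    all models sharing one guiding model end up with aligned spikes.
    """
    g = np.clip(guide_post, eps, 1.0)
    s = np.clip(student_post, eps, 1.0)
    return float(np.mean(np.sum(g * (np.log(g) - np.log(s)), axis=-1)))
```

With aligned spike timings, posteriors from multiple such students can be averaged frame-by-frame for fusion, which is what arbitrary CTC spike timings would otherwise prevent.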
Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers
Lip reading has witnessed unparalleled development in recent years thanks to
deep learning and the availability of large-scale datasets. Despite the
encouraging results achieved, the performance of lip reading unfortunately
remains inferior to that of its audio counterpart, speech recognition, because
the ambiguous nature of lip actuations makes it challenging to extract
discriminative features from lip-movement videos. In this paper, we propose a
new method, termed Lip by Speech (LIBS), whose goal is to strengthen lip
reading by learning from speech recognizers. The rationale behind our approach
is that the features extracted by speech recognizers may provide complementary
and discriminative clues that are difficult to obtain from the subtle
movements of the lips, and can consequently facilitate the training of lip
readers. Specifically, this is achieved by distilling multi-granularity
knowledge from speech recognizers to lip readers. To conduct this cross-modal
knowledge distillation, we utilize an effective alignment scheme to handle the
mismatched lengths of the audio and video sequences, as well as an innovative
filtering strategy to refine the speech recognizer's predictions. The proposed
method achieves new state-of-the-art performance on the CMLR and LRS2
datasets, outperforming the baseline by margins of 7.66% and 2.75% in
character error rate, respectively.
Comment: AAAI 202
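The core difficulty named above is that audio and video feature sequences have different lengths. A minimal sketch of one way to reconcile them, using linear interpolation as an assumed stand-in for the paper's alignment scheme (function names and the L2 distillation term are hypothetical):

```python
import numpy as np

def align_lengths(audio_feats, target_len):
    """Resample a (T_a, D) audio feature sequence to target_len frames
    by linear interpolation, so it can be compared frame-by-frame with
    the video features."""
    t_a, _ = audio_feats.shape
    src = np.linspace(0.0, t_a - 1, target_len)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, t_a - 1)
    w = (src - lo)[:, None]
    return (1.0 - w) * audio_feats[lo] + w * audio_feats[hi]

def feature_distill_loss(video_feats, audio_feats):
    """L2 distance between video features and length-aligned audio
    features -- one possible feature-level distillation term."""
    aligned = align_lengths(audio_feats, video_feats.shape[0])
    return float(np.mean((video_feats - aligned) ** 2))
```

In practice a learned attention-based alignment would likely replace the fixed interpolation, but the length-matching step it must perform is the same.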
Inter-KD: Intermediate Knowledge Distillation for CTC-Based Automatic Speech Recognition
Recently, advances in deep learning have brought considerable improvements to
end-to-end speech recognition, simplifying the traditional pipeline while
producing promising results. Among end-to-end models, the connectionist
temporal classification (CTC)-based model has attracted research interest due
to its non-autoregressive nature. However, such CTC models incur a heavy
computational cost to achieve outstanding performance. To mitigate this
computational burden, we propose a simple yet effective knowledge distillation
(KD) method for the CTC framework, namely Inter-KD, which additionally
transfers the teacher's knowledge to the intermediate CTC layers of the
student network. Experimental results on LibriSpeech verify that Inter-KD
outperforms conventional KD methods. Without using any language model (LM) or
data augmentation, Inter-KD improves the word error rate (WER) from 8.85% to
6.30% on test-clean.
Comment: Accepted by 2022 SLT Workshop
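The distinguishing feature of Inter-KD as described above is that KD terms attach not only to the student's output but also to its intermediate CTC layers. A hypothetical numpy sketch of such a combined objective (the frame-level KL form and the `beta` weight are assumptions, not the paper's exact loss):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Mean per-frame KL(p || q) over (T, V) posterior grids."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

def inter_kd_loss(ctc_loss, teacher_post, intermediate_posts,
                  final_post, beta=1.0):
    """Total loss: the student's CTC loss plus KD terms pushing the
    final layer AND every intermediate CTC layer toward the teacher's
    posteriors."""
    kd = kl(teacher_post, final_post)
    kd += sum(kl(teacher_post, p) for p in intermediate_posts)
    return ctc_loss + beta * kd
```

Supervising intermediate layers this way gives shallow parts of the student a direct training signal, which is one plausible reason the method helps small, cheap student networks.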
Mutual-learning sequence-level knowledge distillation for automatic speech recognition
Automatic speech recognition (ASR) is a crucial technology for human-machine interaction. End-to-end deep learning models for ASR have been studied extensively in recent years. However, these models are ill-suited to practical deployment because of their large model sizes and computational costs. To address this issue, we propose a novel mutual-learning sequence-level knowledge distillation framework with distinct student structures for ASR. Trained mutually and simultaneously, each student learns not only from the pre-trained teacher but also from its distinct peers, which improves the generalization capability of the whole network by making up for the insufficiency of each student and bridging the gap between each student and the teacher. Extensive experiments on the TIMIT and LibriSpeech corpora show that, compared with state-of-the-art methods, the proposed method achieves an excellent balance between recognition accuracy and model compression.
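The mutual-learning objective described above gives each student three signals: the ground truth, the teacher, and its peers. A hypothetical numpy sketch of that per-student combination; for brevity it uses a frame-level divergence in place of the paper's sequence-level criterion, and all names and weights (`alpha`, `beta`) are assumptions:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Mean per-step KL(p || q) over (T, V) posterior grids."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

def mutual_learning_losses(teacher_post, student_posts, hard_losses,
                           alpha=0.5, beta=0.5):
    """Per-student objective: supervised loss + KD from the teacher
    + average KD from the other students (the peers)."""
    total = []
    for i, s in enumerate(student_posts):
        peer_kls = [kl(p, s) for j, p in enumerate(student_posts) if j != i]
        peer_term = sum(peer_kls) / len(peer_kls) if peer_kls else 0.0
        total.append(hard_losses[i] + alpha * kl(teacher_post, s)
                     + beta * peer_term)
    return total
```

Because the peer term changes as every student improves, the students regularize one another during training, which is the claimed source of the improved generalization.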