Improved training for online end-to-end speech recognition systems
Achieving high accuracy with end-to-end speech recognizers requires careful
parameter initialization prior to training. Otherwise, the networks may fail to
find a good local optimum. This is particularly true for online networks, such
as unidirectional LSTMs. Currently, the best strategy to train such systems is
to bootstrap the training from a tied-triphone system. However, this is
time-consuming and, more importantly, impossible for languages without a
high-quality pronunciation lexicon. In this work, we propose an initialization
strategy that uses teacher-student learning to transfer knowledge from a large,
well-trained, offline end-to-end speech recognition model to an online
end-to-end model, eliminating the need for a lexicon or any other linguistic
resources. We also explore curriculum learning and label smoothing and show how
they can be combined with the proposed teacher-student learning for further
improvements. We evaluate our methods on a Microsoft Cortana personal assistant
task and show that the proposed method results in a 19% relative improvement
in word error rate compared to a randomly-initialized baseline system.
Comment: Interspeech 201
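The teacher-student transfer described above is typically realized by matching the student's frame-level posteriors to the teacher's temperature-softened outputs. A minimal NumPy sketch of such a frame-level distillation loss (the function names and the temperature value are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Frame-level teacher-student loss: cross-entropy between the
    teacher's softened posteriors and the student's, averaged over frames."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())
```

By Gibbs' inequality this loss is minimized when the student's softened posteriors exactly match the teacher's, which is the sense in which the offline model's knowledge is transferred to the online one.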
Mutual-learning sequence-level knowledge distillation for automatic speech recognition
Automatic speech recognition (ASR) is a crucial technology for human-machine interaction. End-to-end deep learning models for ASR have been studied extensively in recent years. However, their large model sizes and computation costs make them unsuitable for many practical ASR applications. To address this issue, we propose a novel mutual-learning sequence-level knowledge distillation framework with structurally distinct students for ASR. Trained mutually and simultaneously, each student learns not only from the pre-trained teacher but also from its distinct peers, which improves the generalization capability of the whole network by compensating for the weaknesses of each student and bridging the gap between each student and the teacher. Extensive experiments on the TIMIT and large LibriSpeech corpora show that, compared with state-of-the-art methods, the proposed method achieves an excellent balance between recognition accuracy and model compression.
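The mutual-learning objective described above can be pictured as each student minimizing a distance to the fixed teacher plus a distance to its peers. A hedged NumPy sketch, assuming KL-divergence terms and illustrative weights `alpha` and `beta` (the paper's actual sequence-level formulation and weighting may differ):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    """KL(p || q), averaged over frames."""
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1).mean())

def mutual_kd_losses(student_logits, teacher_logits, alpha=0.5, beta=0.5):
    """One scalar loss per student: distance to the fixed teacher plus
    the mean distance to all peer students (illustrative weighting)."""
    probs = [softmax(s) for s in student_logits]
    p_teacher = softmax(teacher_logits)
    losses = []
    for i, p_i in enumerate(probs):
        teacher_term = kl(p_teacher, p_i)
        peers = [kl(probs[j], p_i) for j in range(len(probs)) if j != i]
        peer_term = float(np.mean(peers)) if peers else 0.0
        losses.append(alpha * teacher_term + beta * peer_term)
    return losses
```

The peer term is what distinguishes mutual learning from plain distillation: even when a student is far from the teacher, its peers provide additional, easier-to-match targets.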
EM-Network: Oracle Guided Self-distillation for Sequence Learning
We introduce EM-Network, a novel self-distillation approach that effectively
leverages target information for supervised sequence-to-sequence (seq2seq)
learning. In contrast to conventional methods, it is trained with oracle
guidance, which is derived from the target sequence. Since the oracle guidance
compactly represents the target-side context that can assist the sequence model
in solving the task, the EM-Network achieves a better prediction compared to
using only the source input. To allow the sequence model to inherit the
promising capability of the EM-Network, we propose a new self-distillation
strategy, where the original sequence model can benefit from the knowledge of
the EM-Network in a one-stage manner. We conduct comprehensive experiments on
two types of seq2seq models: connectionist temporal classification (CTC) for
speech recognition and attention-based encoder-decoder (AED) for machine
translation. Experimental results demonstrate that the EM-Network significantly
advances the current state-of-the-art approaches, improving over the best prior
work on speech recognition and establishing state-of-the-art performance on
WMT'14 and IWSLT'14.
Comment: ICML 202
ASR is all you need: cross-modal distillation for lip reading
The goal of this work is to train strong models for visual speech recognition
without requiring human annotated ground truth data. We achieve this by
distilling from an Automatic Speech Recognition (ASR) model that has been
trained on a large-scale audio-only corpus. We use a cross-modal distillation
method that combines Connectionist Temporal Classification (CTC) with a
frame-wise cross-entropy loss. Our contributions are fourfold: (i) we show that
ground truth transcriptions are not necessary to train a lip reading system;
(ii) we show how arbitrary amounts of unlabelled video data can be leveraged to
improve performance; (iii) we demonstrate that distillation significantly
speeds up training; and, (iv) we obtain state-of-the-art results on the
challenging LRS2 and LRS3 datasets for training only on publicly available
data.
Comment: ICASSP 202
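The cross-modal objective above combines a CTC term with a frame-wise cross-entropy against the audio teacher's posteriors. A minimal NumPy sketch of the frame-wise part and the weighted combination; the CTC term is assumed to be computed elsewhere (in practice by a standard library routine such as PyTorch's CTC loss), and the weight `lam` is illustrative, not from the paper:

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def frame_ce(student_log_probs, teacher_probs):
    """Frame-wise cross-entropy of the lip-reading student against the
    audio teacher's per-frame posteriors (no transcriptions needed)."""
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

def cross_modal_loss(ctc_term, student_logits, teacher_probs, lam=0.5):
    """Weighted sum of a precomputed CTC loss and the frame-wise
    distillation term."""
    return lam * ctc_term + (1.0 - lam) * frame_ce(log_softmax(student_logits), teacher_probs)
```

Because the supervision signal is the teacher's posterior distribution rather than a transcript, this loss can be applied to arbitrary amounts of unlabelled video.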
Inter-KD: Intermediate Knowledge Distillation for CTC-Based Automatic Speech Recognition
Recently, the advance in deep learning has brought a considerable improvement
in the end-to-end speech recognition field, simplifying the traditional
pipeline while producing promising results. Among the end-to-end models, the
connectionist temporal classification (CTC)-based model has attracted research
interest due to its non-autoregressive nature. However, such CTC models incur
a heavy computational cost to achieve outstanding performance. To mitigate the
computational burden, we propose a simple yet effective knowledge distillation
(KD) for the CTC framework, namely Inter-KD, that additionally transfers the
teacher's knowledge to the intermediate CTC layers of the student network. From
the experimental results on LibriSpeech, we verify that Inter-KD achieves
better results than conventional KD methods. Without using any
language model (LM) or data augmentation, Inter-KD improves the word error
rate (WER) from 8.85% to 6.30% on test-clean.
Comment: Accepted by 2022 SLT Workshop
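The idea of also distilling into intermediate CTC branches can be sketched as a final-layer distillation term plus an auxiliary term averaged over the intermediate branches. A NumPy sketch under that assumption (`aux_weight` and the averaging scheme are illustrative, not the paper's exact formulation):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def inter_kd_loss(intermediate_logits, final_logits, teacher_logits, aux_weight=0.3):
    """Distillation cross-entropy at the final CTC layer plus an auxiliary
    term averaged over the student's intermediate CTC branches."""
    p_teacher = softmax(teacher_logits)

    def ce(logits):
        # cross-entropy of a student branch against the teacher's posteriors
        return float(-(p_teacher * np.log(softmax(logits) + 1e-12)).sum(axis=-1).mean())

    aux = sum(ce(z) for z in intermediate_logits) / max(len(intermediate_logits), 1)
    return ce(final_logits) + aux_weight * aux
```

Supervising the intermediate branches directly gives the lower layers a training signal of their own instead of relying solely on gradients from the final CTC layer.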