38,672 research outputs found

    f-Divergence Minimization for Sequence-Level Knowledge Distillation

    Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose an f-DISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that existing SeqKD and ENGINE approaches are approximations of our f-DISTILL methods. We further derive a step-wise decomposition for f-DISTILL, reducing the intractable sequence-level divergence to word-level losses that can be computed in a tractable manner. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses can better force the student to learn from the teacher distribution. Comment: Accepted by ACL 2023
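    The practical core of the abstract is the step-wise decomposition: the intractable sequence-level f-divergence is replaced by per-token divergences between the teacher's and student's next-token distributions. The sketch below is my own PyTorch illustration of such word-level losses (forward KL, reverse KL, and a symmetric Jensen-Shannon variant), not the authors' released code; tensor shapes and the masking scheme are assumptions.

```python
# Minimal sketch (not the paper's implementation): word-level divergence losses
# approximating sequence-level KD. Assumes teacher/student logits are aligned
# per target position, shape (batch, seq_len, vocab), and mask marks non-pad tokens.
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, mask, kind="kl"):
    """kind: 'kl' (forward KL), 'rkl' (reverse KL), or 'js' (Jensen-Shannon)."""
    p = F.softmax(teacher_logits, dim=-1)          # teacher distribution
    q = F.softmax(student_logits, dim=-1)          # student distribution
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)

    if kind == "kl":          # KL(p || q): student must cover teacher modes
        per_token = (p * (log_p - log_q)).sum(-1)
    elif kind == "rkl":       # KL(q || p): mode-seeking student
        per_token = (q * (log_q - log_p)).sum(-1)
    elif kind == "js":        # symmetric Jensen-Shannon divergence
        m = 0.5 * (p + q)
        log_m = torch.log(m.clamp_min(1e-9))
        per_token = 0.5 * (p * (log_p - log_m)).sum(-1) \
                  + 0.5 * (q * (log_q - log_m)).sum(-1)
    else:
        raise ValueError(kind)

    # average over non-padding target positions
    return (per_token * mask).sum() / mask.sum()
```

    The symmetric (JS-style) option corresponds in spirit to the symmetric distilling losses the abstract highlights, while the forward/reverse KL variants recover the more familiar mean-seeking and mode-seeking behaviours.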

    Mutual-learning sequence-level knowledge distillation for automatic speech recognition

    Automatic speech recognition (ASR) is a crucial technology for human-machine interaction. End-to-end deep learning models for ASR have been studied extensively in recent years. However, their large model sizes and computation costs make them unsuitable for practical deployment. To address this issue, we propose a novel mutual-learning sequence-level knowledge distillation framework with distinct student structures for ASR. Trained mutually and simultaneously, each student learns not only from the pre-trained teacher but also from its distinct peers, which improves the generalization capability of the whole network by compensating for each student's weaknesses and bridging the gap between each student and the teacher. Extensive experiments on the TIMIT and large LibriSpeech corpora show that, compared with state-of-the-art methods, the proposed method achieves an excellent balance between recognition accuracy and model compression.
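    As a rough illustration of the training signal described above, each student's objective can combine a supervised term, distillation from the frozen teacher, and mutual distillation from its peers. This is a sketch under assumptions: the loss weights, temperature, and frame-level target format below are placeholders, not the paper's settings.

```python
# Illustrative sketch only (not the paper's code): one student's loss in a
# mutual-learning KD setup, combining supervised cross-entropy, teacher KD,
# and peer-to-peer KD on frame-level posteriors.
import torch
import torch.nn.functional as F

def mutual_kd_loss(student_logits, peer_logits_list, teacher_logits,
                   targets, alpha=0.5, beta=0.3, temperature=2.0):
    # supervised loss on hard labels; assumes frame-level targets (batch, time)
    ce = F.cross_entropy(student_logits.transpose(1, 2), targets)

    def soft_kl(s_logits, t_logits):
        t = temperature
        return F.kl_div(F.log_softmax(s_logits / t, dim=-1),
                        F.softmax(t_logits / t, dim=-1),
                        reduction="batchmean") * t * t

    # distillation from the pre-trained teacher (gradients stopped on targets)
    teacher_term = soft_kl(student_logits, teacher_logits.detach())
    # mutual distillation: average divergence toward each peer student
    peer_term = sum(soft_kl(student_logits, p.detach())
                    for p in peer_logits_list) / max(len(peer_logits_list), 1)

    return ce + alpha * teacher_term + beta * peer_term
```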

    Selective Knowledge Distillation for Non-Autoregressive Neural Machine Translation

    Benefiting from sequence-level knowledge distillation, the Non-Autoregressive Transformer (NAT) achieves great success in neural machine translation tasks. However, existing knowledge distillation has side effects, such as propagating errors from the teacher to NAT students, which may limit further improvements of NAT models and are rarely discussed in existing research. In this paper, we introduce selective knowledge distillation, which uses an NAT evaluator to select NAT-friendly targets that are of high quality and easy to learn. In addition, we introduce a simple yet effective progressive distillation method to boost NAT performance. Experimental results on multiple WMT language directions and several representative NAT models show that our approach can realize a flexible trade-off between the quality and complexity of training data for NAT models, achieving strong performance. Further analysis shows that distilling only 5% of the raw translations can help an NAT model outperform its counterpart trained on raw data by about 2.4 BLEU.
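    In spirit, the selection step amounts to ranking teacher-distilled targets with an evaluator and keeping a progressively growing NAT-friendly subset. The sketch below is my own reading of that idea; the scoring function, keep-ratio schedule, and all names here are assumptions rather than the paper's exact recipe.

```python
# Hedged sketch (selection criterion and schedule are assumptions): score each
# teacher-distilled target with a NAT "evaluator" and keep only the most
# NAT-friendly fraction, expanding that fraction progressively during training.
from typing import Callable, List, Tuple

def select_distilled_targets(
    pairs: List[Tuple[str, str]],            # (source, teacher translation)
    nat_score: Callable[[str, str], float],  # higher = higher quality / easier to learn
    keep_ratio: float,
) -> List[Tuple[str, str]]:
    scored = sorted(pairs, key=lambda p: nat_score(*p), reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    return scored[:k]

def progressive_keep_ratio(epoch: int, start: float = 0.05,
                           end: float = 1.0, total_epochs: int = 20) -> float:
    # linearly expand the selected subset over training (illustrative schedule)
    frac = min(1.0, epoch / max(total_epochs - 1, 1))
    return start + (end - start) * frac
```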

    Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation

    Conventional automatic speech recognition (ASR) systems trained from frame-level alignments can easily leverage posterior fusion to improve ASR accuracy and build a better single model with knowledge distillation. End-to-end ASR systems trained using the Connectionist Temporal Classification (CTC) loss do not require frame-level alignment and hence simplify model training. However, sparse and arbitrary posterior spike timings from CTC models pose a new set of challenges in posterior fusion from multiple models and knowledge distillation between CTC models. We propose a method to train a CTC model so that its spike timings are guided to align with those of a pre-trained guiding CTC model. As a result, all models that share the same guiding model have aligned spike timings. We show the advantage of our method in various scenarios, including posterior fusion of CTC models and knowledge distillation between CTC models with different architectures. With the 300-hour Switchboard training data, a single word-level CTC model distilled from multiple models improved the word error rates to 13.7%/23.1% from 14.9%/24.1% on the Hub5 2000 Switchboard/CallHome test sets without using any data augmentation, language model, or complex decoder. Comment: Accepted to Interspeech 2019
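    One way to picture the guiding idea is a standard CTC objective augmented with a frame-level divergence toward a frozen guiding model, which nudges posterior spikes to occur at the same frames. The sketch below is an assumption-laden illustration (loss weight, blank index, and normalization are my choices), not the paper's exact guiding loss.

```python
# Rough sketch under assumptions: CTC loss plus a frame-wise KL term toward a
# frozen guiding CTC model, encouraging aligned posterior spike timings.
import torch
import torch.nn.functional as F

def guided_ctc_loss(log_probs, guide_log_probs, targets,
                    input_lengths, target_lengths, guide_weight=0.5):
    # log_probs, guide_log_probs: (time, batch, vocab) log-softmax outputs
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # frame-wise KL(guide || student): penalizes spikes at frames where the
    # guiding model puts mass elsewhere
    kl = F.kl_div(log_probs, guide_log_probs.detach().exp(),
                  reduction="batchmean")
    return ctc + guide_weight * kl
```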