GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models

Abstract

Knowledge distillation is commonly used for compressing neural networks to reduce their inference cost and memory footprint. However, current distillation methods for auto-regressive models, such as generative language models (LMs), suffer from two key issues: (1) distribution mismatch between the output sequences seen during training and the sequences generated by the student during deployment, and (2) model under-specification, where the student model may not be expressive enough to fit the teacher's distribution. To address these issues, we propose Generalized Knowledge Distillation (GKD). GKD mitigates distribution mismatch by sampling output sequences from the student during training. Furthermore, GKD handles model under-specification by optimizing alternative divergences, such as reverse KL, that focus on generating samples from the student that are likely under the teacher's distribution. We demonstrate that GKD outperforms commonly-used approaches for distilling LLMs on summarization, machine translation, and arithmetic reasoning tasks.

Comment: First two authors contributed equally.
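
As a rough illustration of the on-policy ingredient described in the abstract, the sketch below samples output sequences from the student and minimizes the reverse KL to the teacher on those samples. The models, tokenizer, and hyperparameters are assumptions (e.g. Hugging Face-style causal LMs passed in by the caller), not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def on_policy_reverse_kl_step(student, teacher, tokenizer, prompts,
                              max_new_tokens=64):
    """One hypothetical on-policy distillation step in the spirit of GKD."""
    device = next(student.parameters()).device
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)

    # 1) Sample output sequences from the *student*, so training sees the same
    #    distribution of sequences the student produces at deployment time.
    with torch.no_grad():
        generated = student.generate(**inputs, do_sample=True,
                                     max_new_tokens=max_new_tokens)

    # 2) Score the sampled sequences under both models; gradients flow only
    #    through the student's scoring pass, not through the sampling step.
    student_logits = student(generated).logits
    with torch.no_grad():
        teacher_logits = teacher(generated).logits

    # 3) Reverse KL(student || teacher) per token, averaged over positions.
    #    (For brevity this averages over all positions; a fuller implementation
    #    would mask out prompt and padding tokens.)
    s_logp = F.log_softmax(student_logits[:, :-1], dim=-1)
    t_logp = F.log_softmax(teacher_logits[:, :-1], dim=-1)
    reverse_kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()

    reverse_kl.backward()
    return reverse_kl.item()
```

Using the reverse KL here penalizes the student for placing probability mass where the teacher assigns little, which is the mode-seeking behavior the abstract points to for under-specified students; swapping in the forward KL or a mixture of the two divergences would change that trade-off.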
