Large pretrained language models have achieved state-of-the-art results on a
variety of downstream tasks. Knowledge Distillation (KD) into a smaller student
model addresses their inefficiency, allowing for deployment in
resource-constrained environments. However, KD can be ineffective when the
student is manually selected from a set of existing options, since it can be a
sub-optimal choice within the space of all possible student architectures. We
develop multilingual KD-NAS, the use of Neural Architecture Search (NAS) guided
by KD to find the optimal student architecture for task-agnostic distillation
from a multilingual teacher. In each episode of the search process, a NAS
controller predicts a reward based on the distillation loss and latency of
inference. The top candidate architectures are then distilled from the teacher
on a small proxy set. Finally, the architecture(s) with the highest reward are
selected and distilled on the full training corpus. KD-NAS automatically trades
off efficiency and effectiveness, recommending architectures suited to various
latency budgets. Using our multi-layer hidden state distillation
process, our KD-NAS student model achieves a 7x speedup on CPU inference (2x on
GPU) compared to an XLM-RoBERTa Base teacher, while maintaining 90% of its performance,
and has been deployed in 3 software offerings requiring high throughput, low
latency, and CPU deployment.
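
A minimal sketch of one search episode as described above, assuming a simple weighted reward over distillation loss and inference latency; the exact reward form, the controller interface (predict_reward, update), and the distill_on_proxy helper are hypothetical names, not taken from the paper:

def combine_reward(distill_loss, latency, alpha=0.5):
    # Assumed reward: lower distillation loss and lower latency give a higher
    # reward; alpha weights the two terms (hypothetical form).
    return -(alpha * distill_loss + (1.0 - alpha) * latency)

def kd_nas_episode(controller, candidate_archs, teacher, proxy_set, top_k=3):
    # The NAS controller predicts a reward for every candidate architecture.
    scored = [(arch, controller.predict_reward(arch)) for arch in candidate_archs]
    # Keep the top-k candidates by predicted reward.
    top = sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

    observed = []
    for arch, _ in top:
        # Distill each shortlisted student from the teacher on the small proxy
        # set and measure its distillation loss and inference latency.
        distill_loss, latency = distill_on_proxy(arch, teacher, proxy_set)
        observed.append((arch, combine_reward(distill_loss, latency)))

    # Observed rewards update the controller; after the final episode the
    # highest-reward architecture is distilled on the full training corpus.
    controller.update(observed)
    return max(observed, key=lambda pair: pair[1])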