Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains
Pre-trained language models have been applied to various NLP tasks with
considerable performance gains. However, the large model sizes, together with
the long inference time, limit the deployment of such models in real-time
applications. One line of model compression approaches considers knowledge
distillation to distill large teacher models into small student models. Most of
these studies focus on a single domain only, ignoring transferable knowledge
from other domains. We observe that a teacher trained to digest transferable
knowledge across domains generalizes better and thus provides stronger guidance
for knowledge distillation. Hence we propose a
Meta-Knowledge Distillation (Meta-KD) framework to build a meta-teacher model
that captures transferable knowledge across domains and passes such knowledge
to students. Specifically, we explicitly force the meta-teacher to capture
transferable knowledge from multiple domains at both the instance level and the
feature level, and then propose a meta-distillation algorithm to learn single-domain
student models with guidance from the meta-teacher. Experiments on public
multi-domain NLP tasks show the effectiveness and superiority of the proposed
Meta-KD framework. Further, we demonstrate the capability of Meta-KD in
settings where training data is scarce.
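As a rough illustration of the two transfer signals described above, the sketch
below shows what a single meta-distillation training step could look like,
assuming a PyTorch-style setup. The loss weights (alpha, beta), temperature T,
and the simple feature-matching term are illustrative assumptions, not the
paper's exact objective, which includes further components not shown here.

```python
# A minimal sketch of a meta-distillation step, assuming PyTorch and that
# both models return (logits, features). All hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def meta_distillation_step(student, meta_teacher, batch, optimizer,
                           T=2.0, alpha=0.5, beta=0.1):
    """One training step for a single-domain student guided by a
    cross-domain meta-teacher (illustrative, not the paper's exact loss)."""
    inputs, labels = batch
    with torch.no_grad():
        # Meta-teacher provides soft targets and intermediate features.
        t_logits, t_feats = meta_teacher(inputs)
    s_logits, s_feats = student(inputs)

    # Supervised task loss on the student's own domain labels.
    task_loss = F.cross_entropy(s_logits, labels)

    # Instance-level transfer: match softened output distributions.
    kd_loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                       F.softmax(t_logits / T, dim=-1),
                       reduction="batchmean") * (T * T)

    # Feature-level transfer: align intermediate representations.
    feat_loss = F.mse_loss(s_feats, t_feats)

    loss = task_loss + alpha * kd_loss + beta * feat_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```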