Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation
We explore best practices for training small, memory-efficient machine
translation models with sequence-level knowledge distillation in the domain
adaptation setting. While both domain adaptation and knowledge distillation are
widely used, their interaction remains little understood. Our large-scale
empirical results in machine translation (on three language pairs with three
domains each) suggest distilling twice for best performance: once using
general-domain data and again using in-domain data with an adapted teacher.

Comment: Accepted to the WNGT 2020 Workshop at the ACL 2020 Conference. Code is at
http://github.com/mitchellgordon95/kd-au
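
The abstract outlines a two-stage sequence-level distillation recipe. The sketch below is a minimal, framework-agnostic outline of that recipe; the helpers (`train`, `finetune`, `translate`) are hypothetical placeholders for an actual NMT toolkit, not functions from the paper's released code, and only the ordering of steps is taken from the abstract.

```python
# Hypothetical sketch of the "Distill, Adapt, Distill" recipe described in the
# abstract. The helpers below stand in for an NMT toolkit's training and
# beam-search decoding; they are placeholders, not the authors' implementation.

def train(parallel_data):
    """Train a model from scratch on (source, target) sentence pairs. Placeholder."""
    raise NotImplementedError

def finetune(model, parallel_data):
    """Continue training an existing model on new parallel data. Placeholder."""
    raise NotImplementedError

def translate(model, sources):
    """Decode source sentences with the model (e.g. beam search). Placeholder."""
    raise NotImplementedError

def distill_adapt_distill(general_data, in_domain_data):
    # 1. Train a large teacher on general-domain parallel data.
    teacher = train(general_data)

    # 2. First distillation: train a small student on teacher translations
    #    of the general-domain source side (sequence-level KD).
    general_sources = [src for src, _ in general_data]
    distilled_general = list(zip(general_sources, translate(teacher, general_sources)))
    student = train(distilled_general)

    # 3. Adapt the teacher to the target domain with continued training.
    adapted_teacher = finetune(teacher, in_domain_data)

    # 4. Second distillation: fine-tune the student on adapted-teacher
    #    translations of the in-domain source side.
    in_domain_sources = [src for src, _ in in_domain_data]
    distilled_in_domain = list(zip(in_domain_sources,
                                   translate(adapted_teacher, in_domain_sources)))
    student = finetune(student, distilled_in_domain)

    return student
```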