Cross-lingual Distillation for Text Classification
Cross-lingual text classification (CLTC) is the task of classifying documents
written in different languages into the same taxonomy of categories. This paper
presents a novel approach to CLTC that builds on model distillation, which
adapts and extends a framework originally proposed for model compression. Using
soft probabilistic predictions for the documents in a label-rich language as
the (induced) supervisory labels in a parallel corpus of documents, we train
classifiers successfully for new languages in which labeled training data are
not available. An adversarial feature adaptation technique is also applied
during the model training to reduce distribution mismatch. We conducted
experiments on two benchmark CLTC datasets, treating English as the source
language and German, French, Japanese, and Chinese as the unlabeled target
languages. The proposed approach achieved advantageous or comparable performance
relative to other state-of-the-art methods.
Comment: Accepted at ACL 2017; code available at
https://github.com/xrc10/cross-distil
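The soft-label transfer described in this abstract can be summarized in a few lines. Below is a minimal sketch, assuming a PyTorch-style teacher and student classifier and a batch of parallel source/target documents; the adversarial feature adaptation step is omitted, and all names are illustrative assumptions rather than the authors' released code (see the repository above for that).

```python
# Sketch of cross-lingual distillation over a parallel corpus (illustrative only).
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, src_docs, tgt_docs, optimizer, T=2.0):
    """One training step: the source-language teacher produces soft labels on the
    source side of a parallel pair; the target-language student is trained to
    match them on the aligned target side."""
    with torch.no_grad():
        # Soft probabilistic predictions from the label-rich (source) language.
        soft_labels = F.softmax(teacher(src_docs) / T, dim=-1)

    # Student predictions on the parallel target-language documents.
    student_log_probs = F.log_softmax(student(tgt_docs) / T, dim=-1)

    # KL divergence between teacher soft labels and student predictions,
    # scaled by T^2 as is conventional for temperature-based distillation.
    loss = F.kl_div(student_log_probs, soft_labels, reduction="batchmean") * (T * T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, no target-language labels are needed: supervision comes entirely from the teacher's soft predictions projected across the parallel corpus.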
An Empirical Study of Leveraging Knowledge Distillation for Compressing Multilingual Neural Machine Translation Models
Knowledge distillation (KD) is a well-known method for compressing neural
models. However, works focusing on distilling knowledge from large multilingual
neural machine translation (MNMT) models into smaller ones are practically
nonexistent, despite the popularity and superiority of MNMT. This paper bridges
this gap by presenting an empirical investigation of knowledge distillation for
compressing MNMT models. We take Indic to English translation as a case study
and demonstrate that commonly used language-agnostic and language-aware KD
approaches yield models that are 4-5x smaller but also suffer from performance
drops of up to 3.5 BLEU. To mitigate this, we then experiment with design
considerations such as shallower versus deeper models, heavy parameter sharing,
multi-stage training, and adapters. We observe that deeper compact models tend
to be as good as shallower non-compact ones, and that fine-tuning a distilled
model on a high-quality subset slightly boosts translation quality. Overall, we
conclude that compressing MNMT models via KD is challenging, indicating immense
scope for further research.
Comment: Accepted at EAMT 202
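As a rough illustration of the kind of objective used when distilling a large MNMT teacher into a smaller student, here is a minimal sketch of a word-level KD loss, assuming PyTorch; the mixing weight, temperature, and function names are illustrative assumptions, not the paper's implementation (which additionally studies language-agnostic vs. language-aware variants, adapters, and multi-stage training).

```python
# Sketch of word-level knowledge distillation for NMT compression (illustrative only).
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, gold_ids, pad_id,
                       alpha=0.5, T=1.0):
    """Mix cross-entropy on the gold target tokens with a KL term that pushes the
    compact student decoder toward the large teacher's per-token distributions.

    student_logits, teacher_logits: (batch, tgt_len, vocab)
    gold_ids: (batch, tgt_len) reference target token ids
    """
    mask = gold_ids.ne(pad_id)  # ignore padding positions

    # Standard translation cross-entropy against the references.
    ce = F.cross_entropy(
        student_logits.transpose(1, 2), gold_ids, ignore_index=pad_id
    )

    # Per-token KL divergence toward the teacher's softened distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="none",
    ).sum(-1)
    kd = (kd * mask).sum() / mask.sum() * (T * T)  # average over real tokens

    return alpha * kd + (1.0 - alpha) * ce
```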