Structure-Level Knowledge Distillation For Multilingual Sequence Labeling
Multilingual sequence labeling is the task of predicting label sequences for
multiple languages with a single unified model. Compared with relying on
multiple monolingual models, a multilingual model offers a smaller model size,
easier online serving, and better generalizability to low-resource languages.
However, current multilingual models still significantly underperform
individual monolingual models due to limited model capacity.
In this paper, we propose to reduce the gap between monolingual models and the
unified multilingual model by distilling the structural knowledge of several
monolingual models (teachers) to the unified multilingual model (student). We
propose two novel knowledge distillation (KD) methods based on structure-level
information: (1) approximately minimizing the distance between the student's
and the teachers' structure-level probability distributions, and (2)
aggregating the structure-level knowledge into local distributions and
minimizing the distance between the two local probability distributions. Our
experiments on 4 multilingual tasks with 25 datasets show that our approaches
outperform several strong baselines and have stronger zero-shot
generalizability than both the baseline model and the teacher models.
Comment: Accepted to ACL 2020, camera-ready. 14 pages.
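To make the second method concrete, below is a minimal PyTorch sketch of a token-level distillation loss, assuming the teachers' structure-level knowledge has already been aggregated into per-token marginal distributions (e.g., via CRF forward-backward); the function and argument names are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def local_kd_loss(student_logits, teacher_marginals, mask):
    """Token-level KD: match the student's local label distributions to
    the teachers' aggregated marginal distributions, averaged over
    non-padding tokens.

    student_logits:    (batch, seq_len, num_labels) raw student scores
    teacher_marginals: (batch, seq_len, num_labels) probabilities, e.g.
                       CRF posterior marginals from forward-backward
    mask:              (batch, seq_len) 1 for real tokens, 0 for padding
    """
    log_p = F.log_softmax(student_logits, dim=-1)
    # Cross-entropy against soft teacher targets; this equals the KL
    # divergence up to the teacher-entropy term, which is constant with
    # respect to the student's parameters.
    token_loss = -(teacher_marginals * log_p).sum(dim=-1)
    return (token_loss * mask).sum() / mask.sum()
```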
Sources of Transfer in Multilingual Named Entity Recognition
Named entities are inherently multilingual, and annotations in any given
language may be limited. This motivates us to consider polyglot named-entity
recognition (NER), where one model is trained using annotated data drawn from
more than one language. However, a straightforward implementation of this
simple idea does not always work in practice: naive training of NER models
using annotated data drawn from multiple languages consistently underperforms
models trained on monolingual data alone, despite having access to more
training data. The starting point of this paper is a simple solution to this
problem, in which polyglot models are fine-tuned on monolingual data to
consistently and significantly outperform their monolingual counterparts. To
explain this phenomenon, we explore the sources of multilingual transfer in
polyglot NER models and examine the weight structure of polyglot models
compared to their monolingual counterparts. We find that polyglot models
efficiently share many parameters across languages and that fine-tuning may
utilize a large number of those parameters.
Comment: ACL 2020.
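The two-stage recipe itself is straightforward; the following is a schematic, runnable sketch with toy stand-ins for the model and data, not the paper's setup.

```python
import copy
import torch

torch.manual_seed(0)
num_labels, dim = 9, 32  # e.g., BIO tags over 4 entity types

def make_batches(n=4):
    # Toy stand-in for one language's annotated NER batches.
    return [(torch.randn(8, dim), torch.randint(0, num_labels, (8,)))
            for _ in range(n)]

def train(model, batches, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for x, y in batches:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

langs = {"en": make_batches(), "de": make_batches(), "es": make_batches()}

# Stage 1: polyglot training on pooled data from all languages.
polyglot = torch.nn.Linear(dim, num_labels)
train(polyglot, [b for batches in langs.values() for b in batches])

# Stage 2: fine-tune a copy on each language's monolingual data alone,
# starting from the shared polyglot weights.
finetuned = {}
for lang, batches in langs.items():
    model = copy.deepcopy(polyglot)
    train(model, batches, lr=1e-4)  # smaller LR for fine-tuning
    finetuned[lang] = model
```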
Coupling Distant Annotation and Adversarial Training for Cross-Domain Chinese Word Segmentation
Fully supervised neural approaches have achieved significant progress in the
task of Chinese word segmentation (CWS). Nevertheless, the performance of
supervised models tends to drop dramatically when they are applied to
out-of-domain data. Performance degradation is caused by the distribution gap
across domains and the out-of-vocabulary (OOV) problem. In order to
simultaneously alleviate these two issues, this paper proposes to couple
distant annotation and adversarial training for cross-domain CWS. For distant
annotation, we rethink the essence of "Chinese words" and design an automatic
distant annotation mechanism that does not need any supervision or pre-defined
dictionaries from the target domain. This mechanism can effectively discover
domain-specific words and distantly annotate raw text in the target domain.
For adversarial training, we develop a sentence-level training procedure that
reduces noise and makes maximal use of source-domain information. Experiments
on multiple real-world datasets across various domains show the superiority
and robustness of our model, which significantly outperforms previous
state-of-the-art cross-domain CWS methods.
Comment: Accepted by ACL 2020.
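The paper's sentence-level procedure has its own specifics, but a common building block for this style of adversarial domain training is a gradient reversal layer; the PyTorch sketch below is an illustration of that block, not the authors' implementation.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in
    the backward pass, so a domain discriminator trained on top pushes
    the shared encoder toward domain-invariant features."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient w.r.t. x is negated and scaled; lambd gets no gradient.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: features from a shared encoder feed both the segmentation
# tagger and, through grad_reverse, a source/target domain classifier:
#   domain_logits = discriminator(grad_reverse(encoder_out, lambd=0.1))
```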
Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language
To better tackle the named entity recognition (NER) problem for languages with
little or no labeled data, cross-lingual NER must effectively leverage knowledge
learned from source languages with rich labeled data. Previous works on
cross-lingual NER are mostly based on label projection with pairwise texts or
direct model transfer. However, such methods are either inapplicable when
labeled data in the source languages is unavailable, or fail to leverage the
information contained in unlabeled data in the target language. In this paper,
we propose a teacher-student learning method to address these limitations, in which
NER models in the source languages are used as teachers to train a student
model on unlabeled data in the target language. The proposed method works for
both single-source and multi-source cross-lingual NER. For the latter, we
further propose a similarity measuring method to better weight the supervision
from different teacher models. Extensive experiments on benchmark datasets for
3 target languages demonstrate that our method outperforms existing
state-of-the-art methods in both single-source and multi-source cross-lingual
NER.
Comment: Accepted by ACL 2020. Code is available at
https://github.com/microsoft/vert-papers/tree/master/papers/SingleMulti-T
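As a rough sketch of the multi-source case: the student is trained on unlabeled target-language text to match a weighted mixture of the teachers' soft label distributions. Here the per-teacher weights are plain inputs standing in for the paper's similarity measure, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def multi_teacher_loss(student_logits, teacher_probs_list, weights, mask):
    """Distill from several source-language teachers into one student
    on unlabeled target-language tokens.

    student_logits:     (batch, seq_len, num_labels) raw student scores
    teacher_probs_list: list of (batch, seq_len, num_labels) label
                        probabilities, one tensor per teacher
    weights:            per-teacher weights summing to 1 (a stand-in for
                        the paper's similarity-based weighting)
    mask:               (batch, seq_len) 1 for real tokens, 0 for padding
    """
    # Weighted mixture of the teachers' soft labels.
    combined = sum(w * p for w, p in zip(weights, teacher_probs_list))
    log_p = F.log_softmax(student_logits, dim=-1)
    # Soft cross-entropy against the mixture, masked mean over tokens.
    token_loss = -(combined * log_p).sum(dim=-1)
    return (token_loss * mask).sum() / mask.sum()
```

With a single teacher and weight 1.0, this reduces to the single-source case described in the abstract.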