Isomorphic Transfer of Syntactic Structures in Cross-Lingual NLP
The transfer or sharing of knowledge between languages is a popular solution to resource scarcity in NLP. However, the effectiveness of cross-lingual transfer can be challenged by variation in syntactic structures. Frameworks such as Universal Dependencies (UD) are designed to be cross-lingually consistent, but even in carefully designed resources, trees representing equivalent sentences may not always overlap. In this paper, we measure cross-lingual syntactic variation, or anisomorphism, in the UD treebank collection, considering both morphological and structural properties. We show that reducing the level of anisomorphism yields consistent gains in cross-lingual transfer tasks. We introduce a source language selection procedure that facilitates effective cross-lingual parser transfer, and propose a typologically driven method for syntactic tree processing which reduces anisomorphism. Our results show the effectiveness of this method for both machine translation and cross-lingual sentence similarity, demonstrating the importance of syntactic structure compatibility for boosting cross-lingual transfer in NLP.
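The abstract does not spell out the anisomorphism metric itself; as a rough illustration, structural divergence between a pair of word-aligned dependency trees can be approximated by the fraction of source arcs that survive the word alignment. A minimal sketch in Python, where the function name and the toy alignment are hypothetical, not the paper's actual measure:

```python
# Illustrative (not the paper's) measure of tree isomorphism: the share of
# source dependency arcs that map onto target arcs under a 1:1 word alignment.

def arc_overlap(src_heads, tgt_heads, alignment):
    """src_heads/tgt_heads: head index per token (0 = root), 1-based positions.
    alignment: dict mapping source token positions to target positions."""
    matched = 0
    total = 0
    for dep, head in enumerate(src_heads, start=1):
        if dep not in alignment:
            continue  # unaligned tokens contribute no evidence
        total += 1
        tgt_dep = alignment[dep]
        # Map the source head through the alignment; roots map to roots.
        tgt_head = 0 if head == 0 else alignment.get(head)
        if tgt_head is not None and tgt_heads[tgt_dep - 1] == tgt_head:
            matched += 1
    return matched / total if total else 0.0

# Example: two three-token sentences with identical structure.
src = [2, 3, 0]            # token1 -> token2, token2 -> token3, token3 -> ROOT
tgt = [2, 3, 0]
align = {1: 1, 2: 2, 3: 3}
print(arc_overlap(src, tgt, align))  # 1.0 => fully isomorphic by this measure
```

A score below 1.0 would then indicate anisomorphism that the paper's tree-processing method aims to reduce.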
Cross-lingual alignments of ELMo contextual embeddings
Building machine learning prediction models for a specific NLP task requires
sufficient training data, which can be difficult to obtain for less-resourced
languages. Cross-lingual embeddings map word embeddings from a less-resourced
language to a resource-rich language so that a prediction model trained on data
from the resource-rich language can also be used in the less-resourced
language. To produce cross-lingual mappings of recent contextual embeddings,
anchor points between the embedding spaces have to be words in the same
context. We address this issue with a novel method for creating cross-lingual
contextual alignment datasets. Based on that, we propose several cross-lingual
mapping methods for ELMo embeddings. The proposed linear mapping methods use
existing Vecmap and MUSE alignments on contextual ELMo embeddings. Novel
nonlinear ELMoGAN mapping methods are based on GANs and do not assume
isomorphic embedding spaces. We evaluate the proposed mapping methods on nine
languages, using four downstream tasks: named entity recognition (NER),
dependency parsing (DP), terminology alignment, and sentiment analysis. The
ELMoGAN methods perform very well on the NER and terminology alignment tasks,
with a lower cross-lingual loss for NER than direct training on some
languages. In DP and sentiment analysis, linear contextual alignment variants
are more successful.
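For the linear variants, the underlying operation is the standard supervised alignment step used by Vecmap and MUSE: solving an orthogonal Procrustes problem over anchor pairs. A minimal sketch, assuming matrices of paired contextual vectors; the dimensions and synthetic data are illustrative, not the paper's setup:

```python
# Orthogonal Procrustes alignment over contextual anchor pairs (a sketch).
import numpy as np

def procrustes_map(src, tgt):
    """src, tgt: (n_anchors, dim) matrices of aligned contextual vectors.
    Returns the orthogonal W minimizing ||src @ W - tgt||_F."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

rng = np.random.default_rng(0)
dim = 64
true_rot = np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # hidden rotation
src = rng.normal(size=(500, dim))                        # synthetic anchors
tgt = src @ true_rot                                     # perfectly aligned pairs
W = procrustes_map(src, tgt)
print(np.allclose(src @ W, tgt, atol=1e-6))              # True: rotation recovered
```

The GAN-based ELMoGAN methods drop the orthogonality constraint implicit in this solution, which is what lets them handle non-isomorphic embedding spaces.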
Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
Linguistic typology aims to capture structural and semantic variation across
the world's languages. A large-scale typology could provide excellent guidance
for multilingual Natural Language Processing (NLP), particularly for languages
that suffer from a lack of human-labeled resources. We present an extensive
literature survey on the use of typological information in the development of
NLP techniques. Our survey demonstrates that to date, the use of information in
existing typological databases has resulted in consistent but modest
improvements in system performance. We show that this is due to both intrinsic
limitations of databases (in terms of coverage and feature granularity) and
under-employment of the typological features included in them. We advocate for
a new approach that adapts the broad and discrete nature of typological
categories to the contextual and continuous nature of machine learning
algorithms used in contemporary NLP. In particular, we suggest that such an
approach could be facilitated by recent developments in data-driven induction
of typological knowledge.
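As an illustration of the adaptation the survey advocates, discrete typological category values can be mapped to learned continuous embeddings that a neural model consumes. A minimal sketch, where the feature names and values are hypothetical examples, not entries from an actual typological database:

```python
# Sketch: embed discrete typological category values into a continuous space.
import torch
import torch.nn as nn

features = ["word_order", "adposition_order", "case_marking"]
values = {"word_order": ["SOV", "SVO", "VSO"],
          "adposition_order": ["prepositions", "postpositions"],
          "case_marking": ["none", "dependent", "head"]}

# One embedding row per (feature, value) pair.
pairs = [(f, v) for f in features for v in values[f]]
pair_to_idx = {p: i for i, p in enumerate(pairs)}
emb = nn.Embedding(len(pairs), 16)  # 16-dim continuous code per category value

def typology_vector(profile):
    """profile: dict feature -> discrete value; returns a dense vector
    a downstream model can consume alongside its usual inputs."""
    idx = torch.tensor([pair_to_idx[(f, profile[f])] for f in features])
    return emb(idx).mean(dim=0)  # pool the per-feature embeddings

profile = {"word_order": "SOV", "adposition_order": "postpositions",
           "case_marking": "dependent"}
print(typology_vector(profile).shape)  # torch.Size([16])
```

Trained end to end with the rest of a model, such embeddings replace hard categorical lookups with representations the learner can interpolate between.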
Is Supervised Syntactic Parsing Beneficial for Language Understanding? An Empirical Investigation
Traditional NLP has long held (supervised) syntactic parsing to be necessary for
successful higher-level language understanding. The recent advent of end-to-end
neural language learning, self-supervised via language modeling (LM), and its
success on a wide range of language understanding tasks, however, question
this belief. In this work, we empirically investigate the usefulness of
supervised parsing for semantic language understanding in the context of
LM-pretrained transformer networks. Relying on the established fine-tuning
paradigm, we first couple a pretrained transformer with a biaffine parsing
head, aiming to infuse explicit syntactic knowledge from Universal Dependencies
(UD) treebanks into the transformer. We then fine-tune the model for language
understanding (LU) tasks and measure the effect of the intermediate parsing
training (IPT) on downstream LU performance. Results from both monolingual
English and zero-shot language transfer experiments (with intermediate
target-language parsing) show that explicit formalized syntax, injected into
transformers through intermediate supervised parsing, has very limited and
inconsistent effect on downstream LU performance. Our results, coupled with our
analysis of transformers' representation spaces before and after intermediate
parsing, take a significant step towards answering an essential
question: how (un)availing is supervised parsing for high-level semantic
language understanding in the era of large neural models?
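The parsing head the paper couples to the transformer is of the biaffine attention family (Dozat and Manning style); a minimal sketch of the arc-scoring component, with dimensions and the random stand-in encoder output as illustrative assumptions:

```python
# Sketch of a biaffine arc-scoring head over transformer encoder states.
import torch
import torch.nn as nn

class BiaffineArcHead(nn.Module):
    def __init__(self, enc_dim=768, arc_dim=256):
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.W = nn.Parameter(torch.empty(arc_dim, arc_dim))
        self.b = nn.Parameter(torch.zeros(arc_dim))  # head-only bias term
        nn.init.xavier_uniform_(self.W)

    def forward(self, enc):                      # enc: (batch, seq, enc_dim)
        h = self.head_mlp(enc)                   # candidate-head representations
        d = self.dep_mlp(enc)                    # dependent representations
        # score[b, i, j] = d_i^T W h_j + b^T h_j: "token j heads token i"
        return d @ self.W @ h.transpose(1, 2) + (h @ self.b).unsqueeze(1)

enc = torch.randn(2, 10, 768)                    # stand-in for transformer output
scores = BiaffineArcHead()(enc)                  # (2, 10, 10) head scores per token
loss = nn.functional.cross_entropy(
    scores.reshape(-1, 10), torch.randint(10, (20,)))  # toy gold heads
print(scores.shape, loss.item() > 0)
```

In the intermediate parsing training the paper describes, a head of this kind is trained on UD treebanks on top of the pretrained encoder before the model is fine-tuned for the language understanding tasks.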