Continuous multilinguality with language vectors
Learning Language Representations for Typology Prediction
One central mystery of neural NLP is what neural models "know" about their
subject matter. When a neural machine translation system learns to translate
from one language to another, does it learn the syntax or semantics of the
languages? Can this knowledge be extracted from the system to fill holes in
human scientific knowledge? Existing typological databases contain relatively
full feature specifications for only a few hundred languages. Exploiting the
existence of parallel texts in more than a thousand languages, we build a
massive many-to-one neural machine translation (NMT) system from 1017 languages
into English, and use this to predict information missing from typological
databases. Experiments show that the proposed method is able to infer not only
syntactic, but also phonological and phonetic inventory features, and improves
over a baseline that has access to information about the languages' geographic
and phylogenetic neighbors. Comment: EMNLP 2017
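The downstream use described above can be sketched in a few lines: once per-language representations have been extracted from the NMT system, a missing typological feature can be predicted for an unseen language from the embeddings of languages whose value is known. The k-nearest-neighbour classifier and toy vectors below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def predict_feature(lang_vecs, known_labels, query_langs, k=3):
    """Predict a binary typological feature for languages whose value
    is missing, via k-nearest neighbours in language-embedding space.

    lang_vecs:    dict mapping language code -> embedding vector
    known_labels: dict mapping language code -> 0/1 feature value
    query_langs:  languages to predict (must appear in lang_vecs)
    """
    known = list(known_labels)
    X = np.stack([lang_vecs[l] for l in known])
    y = np.array([known_labels[l] for l in known])
    preds = {}
    for q in query_langs:
        dists = np.linalg.norm(X - lang_vecs[q], axis=1)
        nearest = np.argsort(dists)[:k]           # k closest known languages
        preds[q] = int(round(y[nearest].mean()))  # majority vote
    return preds
```

In practice the vectors would come from the NMT encoder's language embeddings, and a stronger classifier (or the geographic/phylogenetic baseline mentioned above) could replace the nearest-neighbour vote.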
Does Syntactic Knowledge in Multilingual Language Models Transfer Across Languages?
Recent work has shown that neural models can be successfully trained on multiple languages simultaneously. We investigate whether such models learn to share and exploit common syntactic knowledge among the languages on which they are trained. This extended abstract presents our preliminary results.
In search of isoglosses: continuous and discrete language embeddings in Slavic historical phonology
This paper investigates the ability of neural network architectures to
effectively learn diachronic phonological generalizations in a multilingual
setting. We employ models using three different types of language embedding
(dense, sigmoid, and straight-through). We find that the Straight-Through model
outperforms the other two in terms of accuracy, but the Sigmoid model's
language embeddings show the strongest agreement with the traditional
subgrouping of the Slavic languages. We find that the Straight-Through model
has learned coherent, semi-interpretable information about sound change, and
outline directions for future research.
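The straight-through language embedding mentioned above can be sketched as follows: the forward pass binarises a sigmoid activation into hard 0/1 features, while backpropagation treats the threshold as the identity so the gradient flows "straight through". This is a generic illustration of the estimator, not the paper's exact model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def straight_through(logits):
    """Forward pass of a straight-through language embedding.

    The sigmoid activation is thresholded to hard 0/1 features. In
    backpropagation the threshold is treated as the identity, so the
    gradient reaches `logits` unchanged; autodiff frameworks express
    this with a stop-gradient, e.g. hard + probs - detach(probs).
    Only the forward pass is shown here.
    """
    probs = sigmoid(logits)
    return (probs > 0.5).astype(float)
```

The hard 0/1 features are what make the embeddings resemble discrete isoglosses, while the sigmoid model keeps the continuous probabilities instead.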
Multi-Domain Neural Machine Translation
We present an approach to neural machine translation (NMT) that supports
multiple domains in a single model and allows switching between the domains
when translating. The core idea is to treat text domains as distinct languages
and use multilingual NMT methods to create multi-domain translation systems. We
show that this approach results in significant translation quality gains over
fine-tuning. We also explore whether knowledge of pre-specified text domains is
necessary; it turns out that it is, but also that quite high translation
quality can be reached even when the domain is not known. Comment: Accepted to
EAMT'2018, in Proceedings of the 21st Annual Conference of the European
Association for Machine Translation (EAMT'2018).
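Treating domains as languages amounts to marking each source sentence with its domain, much like the target-language tokens used in multilingual NMT. A minimal sketch of the preprocessing step, where the `<2domain>` token format is an illustrative assumption rather than the paper's exact scheme:

```python
def tag_source(sentence: str, domain: str) -> str:
    """Prepend a domain token to a source sentence, analogous to the
    target-language tokens of multilingual NMT. The <2domain> token
    format is illustrative; any reserved vocabulary symbol works, as
    long as the NMT model sees it consistently at training time."""
    return f"<2{domain}> {sentence}"
```

For example, `tag_source("the patient was discharged", "medical")` returns `"<2medical> the patient was discharged"`; at inference time, switching the token switches the output domain.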
A Probabilistic Generative Model of Linguistic Typology
In the principles-and-parameters framework, the structural features of
languages depend on parameters that may be toggled on or off, with a single
parameter often dictating the status of multiple features. The implied
covariance between features inspires our probabilisation of this line of
linguistic inquiry---we develop a generative model of language based on
exponential-family matrix factorisation. By modelling all languages and
features within the same architecture, we show how structural similarities
between languages can be exploited to predict typological features with
near-perfect accuracy, outperforming several baselines on the task of
predicting held-out features. Furthermore, we show that language embeddings
pre-trained on monolingual text allow for generalisation to unobserved
languages. This finding has clear practical and also theoretical implications:
the results confirm what linguists have hypothesised, i.e.~that there are
significant correlations between typological features and languages. Comment: NAACL 2019, 12 pages.
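For binary features, exponential-family matrix factorisation can be illustrated as logistic matrix factorisation: each observed entry of the language-by-feature matrix is modelled as Bernoulli with probability sigmoid(u_l · v_f), and held-out entries are read off the reconstructed matrix. The toy optimiser below (plain gradient descent, no regularisation, made-up dimensions) is a sketch under those assumptions, not the paper's model.

```python
import numpy as np

def factorise(M, mask, rank=2, lr=0.1, epochs=2000, seed=0):
    """Logistic matrix factorisation of a binary language-by-feature
    matrix M. Entries with mask == 1 are observed and fitted as
    sigmoid(U @ V.T); entries with mask == 0 are held out and can be
    predicted from the returned probability matrix."""
    rng = np.random.default_rng(seed)
    n_langs, n_feats = M.shape
    U = 0.1 * rng.standard_normal((n_langs, rank))   # language embeddings
    V = 0.1 * rng.standard_normal((n_feats, rank))   # feature embeddings
    for _ in range(epochs):
        P = 1.0 / (1.0 + np.exp(-U @ V.T))
        G = mask * (P - M)  # gradient of the logistic loss, observed cells only
        U, V = U - lr * G @ V, V - lr * G.T @ U
    return 1.0 / (1.0 + np.exp(-U @ V.T))
```

Because all languages share the feature embeddings V, a held-out cell for one language is predicted from the structurally similar languages that do have the feature observed, which is the covariance-exploiting behaviour the abstract describes.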