Meemi: A Simple Method for Post-processing and Integrating Cross-lingual Word Embeddings
Word embeddings have become a standard resource in the toolset of any Natural
Language Processing practitioner. While monolingual word embeddings encode
information about words in the context of a particular language, cross-lingual
embeddings define a multilingual space where word embeddings from two or more
languages are integrated. Current state-of-the-art approaches learn
these embeddings by aligning two disjoint monolingual vector spaces through an
orthogonal transformation which preserves the structure of the monolingual
counterparts. In this work, we propose to apply an additional transformation
after this initial alignment step, which aims to bring the vector
representations of a given word and its translations closer to their average.
Since this additional transformation is non-orthogonal, it also affects the
structure of the monolingual spaces. We show that our approach improves both
the integration of the monolingual spaces and the quality of the monolingual
spaces themselves. Furthermore, because our transformation can be
applied to an arbitrary number of languages, we are able to effectively obtain
a truly multilingual space. The resulting (monolingual and multilingual) spaces
show consistent gains over the current state-of-the-art in standard intrinsic
tasks, namely dictionary induction and word similarity, as well as in extrinsic
tasks such as cross-lingual hypernym discovery and cross-lingual natural
language inference.
Comment: 22 pages, 2 figures, 9 tables. Preprint submitted to Natural Language Engineering.
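The averaging step described above lends itself to a compact implementation. The following is a minimal sketch of the idea rather than the authors' code: given two spaces that have already been aligned with an orthogonal map, each language learns an unconstrained least-squares map toward the average of every word and its translation. The function name and toy data are illustrative.

```python
import numpy as np

def meemi_like_transform(X, Y):
    """Post-process two aligned embedding spaces by mapping each one
    toward the average of every word and its translation.

    X, Y: (n, d) arrays whose i-th rows are the vectors of a translation
    pair, already aligned by an orthogonal (e.g., Procrustes) mapping.
    Returns one unconstrained linear map per language.
    """
    avg = (X + Y) / 2.0
    # Least-squares fit of a (non-orthogonal) linear map per language:
    # W_x minimizes ||X @ W_x - avg||_F, and likewise for W_y.
    W_x, *_ = np.linalg.lstsq(X, avg, rcond=None)
    W_y, *_ = np.linalg.lstsq(Y, avg, rcond=None)
    return W_x, W_y

# Toy usage with random vectors standing in for aligned embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
Y = X + 0.1 * rng.normal(size=(1000, 50))  # noisy "translations"
W_x, W_y = meemi_like_transform(X, Y)
new_X, new_Y = X @ W_x, Y @ W_y  # the maps apply to the full vocabularies
```

Because the fitted maps are least-squares rather than orthogonal solutions, applying them also reshapes the monolingual spaces themselves, which is the effect the abstract highlights.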
Density Matching for Bilingual Word Embedding
Recent approaches to cross-lingual word embedding have generally been based
on linear transformations between the sets of embedding vectors in the two
languages. In this paper, we propose an approach that instead expresses the two
monolingual embedding spaces as probability densities defined by a Gaussian
mixture model, and matches the two densities using a method called normalizing
flow. The method requires no explicit supervision, and can be learned with only
a seed dictionary of words that have identical strings. We argue that this
formulation has several intuitively attractive properties, particularly with
respect to improving robustness and generalization to mappings between
difficult language pairs or word pairs. On a benchmark data set of bilingual
lexicon induction and cross-lingual word similarity, our approach can achieve
competitive or superior performance compared to state-of-the-art published
results, with particularly strong results being found on etymologically distant
and/or morphologically rich languages.
Comment: Accepted by NAACL-HLT 2019.
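A heavily simplified sketch of the density-matching idea follows; it is not the paper's model. The normalizing flow is replaced here by a single trainable linear map, the target space is modeled with a diagonal-covariance Gaussian mixture fit by scikit-learn, and the map is trained to maximize the log-density of mapped source vectors under that mixture, including the change-of-variables term. All names and the toy data are illustrative.

```python
import numpy as np
import torch
from sklearn.mixture import GaussianMixture

def fit_target_gmm(Y, n_components=8):
    """Fit a diagonal-covariance GMM to the target embeddings Y (n, d)."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag").fit(Y)
    return (torch.tensor(gmm.weights_, dtype=torch.float32),
            torch.tensor(gmm.means_, dtype=torch.float32),
            torch.tensor(gmm.covariances_, dtype=torch.float32))

def gmm_log_prob(Z, weights, means, variances):
    """Log-density of each row of Z under the diagonal GMM."""
    diff = Z.unsqueeze(1) - means.unsqueeze(0)          # (n, k, d)
    log_comp = -0.5 * ((diff ** 2 / variances).sum(-1)
                       + torch.log(2 * torch.pi * variances).sum(-1))
    return torch.logsumexp(torch.log(weights) + log_comp, dim=1)

# Toy data: X is the source space, Y a linearly related target space.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50)).astype("float32")
Y = X @ rng.normal(scale=0.2, size=(50, 50)).astype("float32") + 1.0

params = fit_target_gmm(Y)
W = torch.eye(50, requires_grad=True)  # linear map standing in for a flow
opt = torch.optim.Adam([W], lr=1e-2)
X_t = torch.tensor(X)
for _ in range(200):
    opt.zero_grad()
    # Change of variables: log p(x) = log p_gmm(x @ W) + log|det W|.
    loss = -(gmm_log_prob(X_t @ W, *params).mean()
             + torch.linalg.slogdet(W).logabsdet)
    loss.backward()
    opt.step()
```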
Baselines and test data for cross-lingual inference
Recent years have seen a revival of interest in textual entailment,
sparked by i) the emergence of powerful deep neural network learners for
natural language processing and ii) the timely development of large-scale
evaluation datasets such as SNLI. Recast as natural language inference, the
problem now amounts to detecting the relation between pairs of statements: they
either contradict or entail one another, or they are mutually neutral. Current
research in natural language inference is effectively exclusive to English. In
this paper, we propose to advance the research in SNLI-style natural language
inference toward multilingual evaluation. To that end, we provide test data for
four major languages: Arabic, French, Spanish, and Russian. We experiment with
a set of baselines. Our systems are based on cross-lingual word embeddings and
machine translation. While our best system scores an average accuracy of just
over 75%, we focus largely on enabling further research in multilingual
inference.
Comment: To appear at LREC 2018.
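As one concrete illustration of the kind of embedding-based baseline described, consider the following sketch (a simplified stand-in, not the authors' system): represent premise and hypothesis as averaged vectors in the shared cross-lingual space, build standard sentence-pair features, and train a three-way classifier. Because the word vectors live in one space, a classifier trained on English pairs can then be applied to, say, French or Russian test pairs. All names and the toy data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_vector(tokens, embeddings, dim=300):
    """Average the cross-lingual vectors of a sentence's tokens."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def pair_features(premise, hypothesis, embeddings):
    """A common sentence-pair scheme: [u; v; |u - v|; u * v]."""
    u = sentence_vector(premise, embeddings)
    v = sentence_vector(hypothesis, embeddings)
    return np.concatenate([u, v, np.abs(u - v), u * v])

# Toy usage with a stand-in shared embedding table and labeled pairs.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=300)
              for w in ["a", "cat", "sleeps", "animal"]}
pairs = [(["a", "cat", "sleeps"], ["a", "animal", "sleeps"], "entailment"),
         (["a", "cat", "sleeps"], ["a", "cat"], "neutral")]
X = np.array([pair_features(p, h, embeddings) for p, h, _ in pairs])
y = [label for _, _, label in pairs]
clf = LogisticRegression(max_iter=1000).fit(X, y)
```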
Do we really need fully unsupervised cross-lingual embeddings?
Recent efforts in cross-lingual word embedding (CLWE) learning have predominantly focused on fully unsupervised approaches that project monolingual embeddings into a shared cross-lingual space without any cross-lingual signal. The lack of any supervision makes such approaches conceptually attractive. Yet, their only core difference from (weakly) supervised projection-based CLWE methods is in the way they obtain a seed dictionary used to initialize an iterative self-learning procedure. The fully unsupervised methods have arguably become more robust, and their primary use case is CLWE induction for pairs of resource-poor and distant languages. In this paper, we question the ability of even the most robust unsupervised CLWE approaches to induce meaningful CLWEs in these more challenging settings. A series of bilingual lexicon induction (BLI) experiments with 15 diverse languages (210 language pairs) shows that fully unsupervised CLWE methods still fail for a large number of language pairs (e.g., they yield zero BLI performance for 87/210 pairs). Even when they succeed, they never surpass the performance of weakly supervised methods (seeded with 500-1,000 translation pairs) using the same self-learning procedure in any BLI setup, and the gaps are often substantial. These findings call for revisiting the main motivations behind fully unsupervised CLWE methods.
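For context, the weakly supervised self-learning loop the abstract contrasts with fully unsupervised induction can be sketched as follows: solve orthogonal Procrustes on the current dictionary, re-induce a dictionary from nearest neighbors in the mapped space, and repeat. This is a simplified version (plain cosine retrieval instead of CSLS, no frequency cutoffs), with illustrative names and toy data; it assumes row-normalized embedding matrices.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal map W minimizing ||X @ W - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def self_learning(X, Y, seed_pairs, iters=5):
    """X, Y: (n, d) row-normalized embeddings.
    seed_pairs: (source_index, target_index) tuples, e.g. 500-1,000 of them."""
    pairs = list(seed_pairs)
    for _ in range(iters):
        src, tgt = map(list, zip(*pairs))
        W = procrustes(X[src], Y[tgt])
        # Re-induce a dictionary: nearest target neighbor of each mapped
        # source word (real systems use CSLS rather than plain cosine).
        sims = (X @ W) @ Y.T
        pairs = list(enumerate(sims.argmax(axis=1)))
    return W

# Toy usage: the target space is a rotated copy of the source space,
# seeded with 50 known translation pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Q, _ = np.linalg.qr(rng.normal(size=(40, 40)))
Y = X @ Q
W = self_learning(X, Y, [(i, i) for i in range(50)])
```

The fully unsupervised variants differ only in how the initial pairs are obtained, which is exactly the distinction the abstract draws.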
A survey of cross-lingual word embedding models
Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.