Analyzing the Limitations of Cross-lingual Word Embedding Mappings
Recent research in cross-lingual word embeddings has almost exclusively
focused on offline methods, which independently train word embeddings in
different languages and map them to a shared space through linear
transformations. While several authors have questioned the underlying
isomorphism assumption, which states that word embeddings in different
languages have approximately the same structure, it is not clear whether this
is an inherent limitation of mapping approaches or a more general issue when
learning cross-lingual embeddings. To answer this question, we experiment
with parallel corpora, which allow us to compare offline mapping to an
extension of skip-gram that jointly learns both embedding spaces. We observe
that, under these ideal conditions, joint learning yields more isomorphic
embeddings, is less sensitive to hubness, and obtains stronger results in
bilingual lexicon induction. We thus conclude that current mapping methods do
have strong limitations, calling for further research to jointly learn
cross-lingual embeddings with a weaker cross-lingual signal.
Comment: ACL 2019
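As an illustration of the offline mapping family this abstract critiques, the following minimal sketch learns an orthogonal source-to-target map with the Procrustes solution. The dimensions and seed pairs are toy data, not the paper's setup; the point is that when the two spaces really are isomorphic, the linear map is recovered exactly.

```python
# Minimal sketch of offline mapping via orthogonal Procrustes.
# Toy data only: 5 seed-dictionary pairs of 4-dimensional embeddings.
import numpy as np

def procrustes_map(X, Y):
    """Learn an orthogonal W minimizing ||XW - Y||_F, where rows of X and Y
    are source/target embeddings of seed-dictionary word pairs."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # orthogonal map from the source space to the target space

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                         # source seed embeddings
W_true = np.linalg.qr(rng.normal(size=(4, 4)))[0]   # a random orthogonal map
Y = X @ W_true                                      # perfectly isomorphic target space
W = procrustes_map(X, Y)
print(np.allclose(X @ W, Y))  # True: recovery is exact under isomorphism
```

When the isomorphism assumption is violated, as the paper argues happens in practice, no such W fits both spaces well, which is the limitation the experiments probe.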
Bilingual Lexicon Induction through Unsupervised Machine Translation
A recent research line has obtained strong results on bilingual lexicon
induction by aligning independently trained word embeddings in two languages
and using the resulting cross-lingual embeddings to induce word translation
pairs through nearest neighbor or related retrieval methods. In this paper, we
propose an alternative approach to this problem that builds on the recent work
on unsupervised machine translation. This way, instead of directly inducing a
bilingual lexicon from cross-lingual embeddings, we use them to build a
phrase-table, combine it with a language model, and use the resulting machine
translation system to generate a synthetic parallel corpus, from which we
extract the bilingual lexicon using statistical word alignment techniques. As
such, our method can work with any word embedding and cross-lingual mapping
technique, and it does not require any additional resource besides the
monolingual corpus used to train the embeddings. When evaluated on the exact
same cross-lingual embeddings, our proposed method obtains an average
improvement of 6 accuracy points over nearest neighbor and 4 points over CSLS
retrieval, establishing a new state of the art on the standard MUSE dataset.
Comment: ACL 2019
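For context, here is a minimal sketch of the two retrieval baselines the abstract compares against, nearest-neighbor and CSLS, operating on embeddings that have already been mapped to a shared space. The arrays are toy data; the paper's own MT-based pipeline (phrase table, language model, synthetic corpus, word alignment) is not shown.

```python
# Sketch of nearest-neighbor vs. CSLS retrieval over mapped embeddings.
import numpy as np

def normalize(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def nn_translate(src, tgt):
    """For each source word, return the index of its cosine nearest neighbor."""
    sims = normalize(src) @ normalize(tgt).T
    return sims.argmax(axis=1)

def csls_translate(src, tgt, k=10):
    """CSLS retrieval: discount target 'hubs' by subtracting each word's mean
    similarity to its k nearest neighbors in the other language."""
    sims = normalize(src) @ normalize(tgt).T
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return (2 * sims - r_src - r_tgt).argmax(axis=1)

rng = np.random.default_rng(0)
src, tgt = rng.normal(size=(6, 4)), rng.normal(size=(8, 4))
print(nn_translate(src, tgt), csls_translate(src, tgt, k=3))
```

The hubness penalty is what separates CSLS from plain nearest neighbor; the abstract's reported gains are measured against both.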
Density Matching for Bilingual Word Embedding
Recent approaches to cross-lingual word embedding have generally been based
on linear transformations between the sets of embedding vectors in the two
languages. In this paper, we propose an approach that instead expresses the two
monolingual embedding spaces as probability densities defined by a Gaussian
mixture model, and matches the two densities using a method called normalizing
flow. The method requires no explicit supervision, and can be learned with only
a seed dictionary of words that have identical strings. We argue that this
formulation has several intuitively attractive properties, particularly with
respect to improving robustness and generalization to mappings between
difficult language pairs or word pairs. On a benchmark data set of bilingual
lexicon induction and cross-lingual word similarity, our approach can achieve
competitive or superior performance compared to state-of-the-art published
results, with particularly strong results being found on etymologically distant
and/or morphologically rich languages.
Comment: Accepted by NAACL-HLT 2019
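The following heavily simplified sketch conveys the density-matching idea: fit a density to the target embeddings, then train an invertible map so that transformed source embeddings become likely under it, with a change-of-variables term keeping the densities consistent. The paper uses Gaussian mixtures and richer normalizing flows; the single Gaussian, one affine layer, and synthetic data here are simplifications for illustration.

```python
# Simplified density matching: an invertible affine "flow" f(x) = xW + b is
# trained so that flowed source embeddings are likely under a density fit to
# the target embeddings. Toy data; a one-component stand-in for the paper's
# Gaussian mixture and deeper flows.
import torch

torch.manual_seed(0)
d, n = 4, 256
tgt = torch.randn(n, d) @ torch.randn(d, d)           # toy target embeddings
src = tgt @ torch.linalg.qr(torch.randn(d, d))[0].T   # rotated copy as "source"

# Target density: a single Gaussian fit by moment matching.
mu, cov = tgt.mean(0), torch.cov(tgt.T) + 1e-3 * torch.eye(d)
target_density = torch.distributions.MultivariateNormal(mu, cov)

W = torch.nn.Parameter(torch.eye(d))
b = torch.nn.Parameter(torch.zeros(d))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(500):
    z = src @ W + b
    log_det = torch.slogdet(W).logabsdet  # change-of-variables correction
    loss = -(target_density.log_prob(z) + log_det).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final NLL: {loss.item():.3f}")
```

Unlike the point-to-point Procrustes objective, nothing here requires matched word pairs: only the two distributions are aligned, which is why the approach can be trained from an identical-strings seed dictionary or weaker signal.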