Bilingual distributed word representations from document-aligned comparable data
We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs, which relied heavily on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, this article shows that BWEs may be learned solely on the basis of document-aligned comparable data, without any additional lexical resources or syntactic information. We compare our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction, and (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models from comparable data, as well as prior BWE-based models, and achieve the best reported results on both tasks for all three tested language pairs.

This work was done while Ivan Vulić was a postdoctoral researcher at the Department of Computer Science, KU Leuven, supported by the PDM Kort fellowship (PDMK/14/117). The work was also supported by the SCATE project (IWT-SBO 130041) and the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (648909).
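One published way to learn BWEs from document-aligned data alone, in the spirit of the model above, is to merge each aligned document pair into a single pseudo-bilingual document and then train a standard monolingual embedding model on the merged corpus, so that words from both languages appear in shared contexts. A minimal sketch of the merging step, under simplifying assumptions: the published model shuffles according to the documents' length ratio rather than uniformly, and `merge_and_shuffle` and the toy word lists are hypothetical.

```python
import random

def merge_and_shuffle(doc_a, doc_b, seed=0):
    """Merge two document-aligned token lists into one pseudo-bilingual
    document by randomly interleaving their words. Training an ordinary
    skip-gram model on such merged documents places words from both
    languages in shared contexts, yielding bilingual embeddings."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    merged = list(doc_a) + list(doc_b)
    rng.shuffle(merged)
    return merged

# Toy aligned document pair (hypothetical data).
en = ["house", "garden", "tree"]
it = ["casa", "giardino", "albero"]
pseudo_doc = merge_and_shuffle(en, it)
```

The merged documents would then be concatenated into one corpus and fed to any off-the-shelf word-embedding trainer.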
Bilingual Lexicon Induction through Unsupervised Machine Translation
A recent research line has obtained strong results on bilingual lexicon
induction by aligning independently trained word embeddings in two languages
and using the resulting cross-lingual embeddings to induce word translation
pairs through nearest neighbor or related retrieval methods. In this paper, we
propose an alternative approach to this problem that builds on the recent work
on unsupervised machine translation. This way, instead of directly inducing a
bilingual lexicon from cross-lingual embeddings, we use them to build a
phrase-table, combine it with a language model, and use the resulting machine
translation system to generate a synthetic parallel corpus, from which we
extract the bilingual lexicon using statistical word alignment techniques. As
such, our method can work with any word embedding and cross-lingual mapping
technique, and it does not require any additional resource besides the
monolingual corpus used to train the embeddings. When evaluated on the exact
same cross-lingual embeddings, our proposed method obtains an average
improvement of 6 accuracy points over nearest neighbor and 4 points over CSLS
retrieval, establishing a new state of the art on the standard MUSE dataset.

Comment: ACL 2019
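The two retrieval baselines the abstract compares against can be made concrete. CSLS (Cross-domain Similarity Local Scaling) discounts the cosine similarity of a candidate pair by how similar each word is to its k nearest neighbours on the other side, which mitigates hubness. A sketch with numpy, assuming row-normalized embedding matrices and a toy k (the literature typically uses k=10):

```python
import numpy as np

def csls_scores(src, tgt, k=2):
    """CSLS: 2*cos(x, y) - r_T(x) - r_S(y), where r_T(x) is the mean
    cosine of x to its k nearest target words (and r_S symmetrically).
    src: (n, d) and tgt: (m, d) row-normalized embedding matrices."""
    sims = src @ tgt.T                                     # cosine sims
    # mean similarity of each source word to its k nearest target words
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    # mean similarity of each target word to its k nearest source words
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sims - r_src - r_tgt

# Toy random embeddings (illustrative only), rows normalized to unit length.
rng = np.random.default_rng(0)
S = rng.normal(size=(5, 4)); S /= np.linalg.norm(S, axis=1, keepdims=True)
T = rng.normal(size=(6, 4)); T /= np.linalg.norm(T, axis=1, keepdims=True)
scores = csls_scores(S, T)
translations = scores.argmax(axis=1)   # best target index per source word
```

Plain nearest-neighbour retrieval is the same pipeline with `(S @ T.T).argmax(axis=1)`; the paper above replaces both with a full unsupervised-MT round trip.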
A survey of cross-lingual word embedding models
Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.
A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
Recent work has managed to learn cross-lingual word embeddings without
parallel data by mapping monolingual embeddings to a shared space through
adversarial training. However, their evaluation has focused on favorable
conditions, using comparable corpora or closely-related languages, and we show
that they often fail in more realistic scenarios. This work proposes an
alternative approach based on a fully unsupervised initialization that
explicitly exploits the structural similarity of the embeddings, and a robust
self-learning algorithm that iteratively improves this solution. Our method
succeeds in all tested scenarios and obtains the best published results in
standard datasets, even surpassing previous supervised systems. Our
implementation is released as an open source project at
https://github.com/artetxem/vecmap

Comment: ACL 2018
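The core self-learning loop described above alternates two steps: solve an orthogonal Procrustes problem on the current dictionary, then re-induce the dictionary by nearest-neighbour retrieval in the mapped space. A toy sketch of just that loop, assuming a known rotation and a small seed dictionary (the released vecmap system additionally uses an unsupervised structural initialization, CSLS retrieval, stochastic dictionary induction, and re-weighting, none of which are shown here):

```python
import numpy as np

def self_learning(X, Z, seed_pairs, iters=5):
    """Iterative self-learning sketch: (1) fit an orthogonal map W on the
    current dictionary via Procrustes, (2) re-induce the dictionary as the
    nearest target for every source word, and repeat.
    X, Z: row-normalized source/target embeddings;
    seed_pairs: initial (src_idx, tgt_idx) dictionary."""
    pairs = list(seed_pairs)
    for _ in range(iters):
        s, t = zip(*pairs)
        # orthogonal Procrustes: W = U V^T from the SVD of X_s^T Z_t
        U, _, Vt = np.linalg.svd(X[list(s)].T @ Z[list(t)])
        W = U @ Vt
        # re-induce the dictionary by nearest-neighbour retrieval
        nn = (X @ W @ Z.T).argmax(axis=1)
        pairs = list(enumerate(nn))
    return W, pairs

# Toy setup: target space is an exact rotation of the source space.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5)); X /= np.linalg.norm(X, axis=1, keepdims=True)
R, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
Z = X @ R                                     # rotation preserves unit norms
W, pairs = self_learning(X, Z, seed_pairs=[(i, i) for i in range(6)])
```

With an exact rotation and a full-rank seed, the loop recovers the rotation and the identity dictionary in one pass; the paper's contribution is making this loop robust when no such clean seed exists.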
On the limitations of unsupervised bilingual dictionary induction
Unsupervised machine translation (i.e., translation without any cross-lingual supervision signal, whether a dictionary, translations, or comparable corpora) seems impossible, but nevertheless, Conneau et al. (2018) recently proposed a fully unsupervised machine translation (MT) model. The model relies heavily on an adversarial, unsupervised alignment of word embedding spaces for bilingual dictionary induction, which we examine here. Our results identify the limitations of current unsupervised MT: unsupervised bilingual dictionary induction performs much worse for morphologically rich languages that are not dependent-marking, when the monolingual corpora come from different domains, or when different embedding algorithms are used. We show that a simple trick, exploiting a weak supervision signal from identical words, enables more robust induction, and we establish a near-perfect correlation between unsupervised bilingual dictionary induction performance and a previously unexplored graph similarity metric.
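The "simple trick" mentioned above exploits strings that are spelled identically in both languages (numerals, names, loanwords) as a weak seed dictionary. The harvesting step is trivial to sketch; the function name and toy vocabularies here are hypothetical:

```python
def identical_word_seed(vocab_src, vocab_tgt):
    """Weak supervision from identical spellings: words that appear
    verbatim in both vocabularies form a seed dictionary for
    bilingual dictionary induction."""
    shared = set(vocab_src) & set(vocab_tgt)
    return sorted((w, w) for w in shared)

seed = identical_word_seed(
    ["berlin", "haus", "2018", "computer"],
    ["berlin", "house", "2018", "ordinateur"],
)
# seed now pairs "2018" and "berlin" with themselves
```

Such pairs can then replace the adversarial initialization in the alignment pipeline, which the paper shows yields more robust induction.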
Bilingual lexicon induction by learning to combine word-level and character-level representations
We study the problem of bilingual lexicon induction (BLI) in a setting where some translation resources are available, but unknown translations are sought for certain, possibly domain-specific, terminology. We frame BLI as a classification problem, for which we design a neural-network-based classification architecture composed of recurrent long short-term memory (LSTM) and deep feed-forward networks. The results show that word-level and character-level representations each improve state-of-the-art results for BLI, and the best results are obtained by exploiting the synergy between these word- and character-level representations in the classification model.
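To make the character-level signal concrete: surface-form overlap is informative for exactly the domain-specific terminology this setting targets, because technical terms are often cognates. One simple character-level feature a BLI classifier could consume is the Dice overlap of character bigrams; this is an illustrative feature, not the LSTM-based representation the paper actually learns:

```python
def char_ngrams(word, n=2):
    """Set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_similarity(a, b, n=2):
    """Character-level signal for BLI: Dice coefficient over character
    bigrams, high for cognates and shared technical terms."""
    A, B = char_ngrams(a, n), char_ngrams(b, n)
    if not A or not B:
        return 0.0
    return 2 * len(A & B) / (len(A) + len(B))

score = dice_similarity("bacterium", "bacterie")  # cognate pair scores high
```

A classifier can combine such character-level scores with word-embedding similarity, which is the synergy the abstract refers to.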
Itzulpen automatiko gainbegiratu gabea (Unsupervised Machine Translation)
192 p. Modern machine translation relies on strong supervision in the form of parallel corpora. Such a requirement greatly departs from the way in which humans acquire language, and poses a major practical problem for low-resource language pairs. In this thesis, we develop a new paradigm that removes the dependency on parallel data altogether, relying on nothing but monolingual corpora to train unsupervised machine translation systems. For that purpose, our approach first aligns separately trained word representations in different languages based on their structural similarity, and uses them to initialize either a neural or a statistical machine translation system, which is further trained through iterative back-translation. While previous attempts at learning machine translation systems from monolingual corpora had strong limitations, our work, along with other contemporaneous developments, is the first to report positive results in standard, large-scale settings, establishing the foundations of unsupervised machine translation and opening exciting opportunities for future research.
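The iterative back-translation loop described above can be sketched as a skeleton: each direction translates the other side's monolingual data into synthetic parallel pairs, on which the reverse model is retrained. Everything below is a hypothetical placeholder; `s2t`/`t2s` stand in for real translation systems and `train` for a real training procedure, with a toy word-level "language pair" where translation is just a case change:

```python
def iterative_back_translation(mono_src, mono_tgt, s2t, t2s, train, rounds=2):
    """Skeleton of the unsupervised MT loop: back-translate monolingual
    data into synthetic parallel pairs, retrain the reverse direction,
    and alternate. `train` builds a model from (input, output) pairs."""
    for _ in range(rounds):
        # source monolingual data -> (synthetic target, real source) pairs
        t2s = train([(s2t(x), x) for x in mono_src])
        # target monolingual data -> (synthetic source, real target) pairs
        s2t = train([(t2s(y), y) for y in mono_tgt])
    return s2t, t2s

def word_trainer(pairs):
    """Toy 'training': memorize a word-for-word translation table."""
    table = dict(pairs)
    return lambda w: table.get(w, w)

# Toy demo: the 'translation' between the two languages is a case change.
s2t, t2s = iterative_back_translation(
    ["casa", "albero"], ["CASA", "ALBERO"],
    s2t=str.upper, t2s=str.lower, train=word_trainer)
```

In the thesis, the initial `s2t`/`t2s` come from cross-lingually aligned word embeddings, and `train` is neural or statistical MT training rather than table memorization.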