Duality Regularization for Unsupervised Bilingual Lexicon Induction
Unsupervised bilingual lexicon induction naturally exhibits duality, which
results from symmetry in back-translation. For example, EN-IT and IT-EN
induction form a primal-dual pair of problems. Current state-of-the-art
methods, however, consider the two tasks independently. In this paper, we
propose to train primal and dual models jointly, using regularizers to
encourage consistency in back-translation cycles. Experiments across 6 language
pairs show that the proposed method significantly outperforms competitive
baselines, obtaining the best published results on a standard benchmark.
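A minimal sketch of how such a cycle-consistency regularizer could be wired up (the row-stochastic translation matrices, the identity target, and the loss weighting are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def back_translation_cycle_loss(P_fwd, P_bwd):
    """Hypothetical duality regularizer.

    P_fwd: (V_en, V_it) row-stochastic EN->IT translation probabilities (primal)
    P_bwd: (V_it, V_en) row-stochastic IT->EN translation probabilities (dual)
    Encourages the EN->IT->EN round trip to map each word back to itself.
    """
    round_trip = P_fwd @ P_bwd                 # (V_en, V_en) cycle probabilities
    identity = torch.eye(round_trip.size(0))   # ideal cycle: each word returns to itself
    return F.mse_loss(round_trip, identity)

# Joint training objective (sketch): primal loss + dual loss + cycle penalty
# loss = loss_en_it + loss_it_en + lam * back_translation_cycle_loss(P_fwd, P_bwd)
```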
Bilingual Lexicon Induction through Unsupervised Machine Translation
A recent research line has obtained strong results on bilingual lexicon
induction by aligning independently trained word embeddings in two languages
and using the resulting cross-lingual embeddings to induce word translation
pairs through nearest neighbor or related retrieval methods. In this paper, we
propose an alternative approach to this problem that builds on the recent work
on unsupervised machine translation. This way, instead of directly inducing a
bilingual lexicon from cross-lingual embeddings, we use them to build a
phrase-table, combine it with a language model, and use the resulting machine
translation system to generate a synthetic parallel corpus, from which we
extract the bilingual lexicon using statistical word alignment techniques. As
such, our method can work with any word embedding and cross-lingual mapping
technique, and it does not require any additional resources besides the
monolingual corpus used to train the embeddings. When evaluated on the exact
same cross-lingual embeddings, our proposed method obtains an average
improvement of 6 accuracy points over nearest neighbor and 4 points over CSLS
retrieval, establishing a new state of the art on the standard MUSE dataset.
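For reference, CSLS (the stronger retrieval baseline that the 4-point gain is measured against) rescales cosine similarity by each word's average similarity to its k nearest cross-lingual neighbors, which counteracts hubness. A NumPy sketch under the usual definition (variable names and k=10 are our choices, not taken from the paper):

```python
import numpy as np

def csls_scores(X_src, Y_tgt, k=10):
    """CSLS retrieval scores between L2-normalized embedding matrices.

    X_src: (n, d) source-language embeddings mapped into the shared space
    Y_tgt: (m, d) target-language embeddings
    Returns an (n, m) score matrix; argmax over each row induces a translation.
    """
    sims = X_src @ Y_tgt.T                               # cosine similarities
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # mean sim. of k nearest targets
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # mean sim. of k nearest sources
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

# translations = csls_scores(X_src, Y_tgt).argmax(axis=1)
```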
Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents
Automatically highlighting words that cause semantic differences between two documents could be useful for a wide range of applications. We formulate recognizing semantic differences (RSD) as a token-level regression task and study three unsupervised approaches that rely on a masked language model. To assess the approaches, we begin with basic English sentences and gradually move to more complex, cross-lingual document pairs. Our results show that an approach based on word alignment and sentence-level contrastive learning correlates robustly with gold labels. However, all unsupervised approaches still leave a large margin for improvement.
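One way to picture the alignment-based variant: embed every token of both documents with a masked language model, align tokens across documents by cosine similarity, and flag tokens that align poorly. A hedged sketch using Hugging Face transformers (the model name and the 1 - max-similarity scoring rule are our illustration, not the paper's exact method):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "xlm-roberta-base"  # any multilingual masked LM would do here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def difference_scores(doc_a: str, doc_b: str):
    """Per-token scores for doc_a: 1 - max cosine similarity to any
    token of doc_b (higher = more likely a semantic difference)."""
    embs = []
    for text in (doc_a, doc_b):
        batch = tok(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state[0]  # (seq_len, dim)
        embs.append(F.normalize(hidden, dim=-1))
    sims = embs[0] @ embs[1].T                            # token-to-token cosines
    return (1.0 - sims.max(dim=1).values).tolist()
```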
A Theory of Unsupervised Translation Motivated by Understanding Animal Communication
Recent years have seen breakthroughs in neural language models that capture
nuances of language, culture, and knowledge. Neural networks are capable of
translating between languages -- in some cases even between two languages where
there is little or no access to parallel translations, in what is known as
Unsupervised Machine Translation (UMT). Given this progress, it is intriguing
to ask whether machine learning tools can ultimately enable understanding
animal communication, particularly that of highly intelligent animals. Our work
is motivated by an ambitious interdisciplinary initiative, Project CETI, which
is collecting a large corpus of sperm whale communications for machine
analysis.
We propose a theoretical framework for analyzing UMT when no parallel data
are available and when it cannot be assumed that the source and target corpora
address related subject domains or possess similar linguistic structure. The
framework requires access to a prior probability distribution that should
assign non-zero probability to possible translations. We instantiate our
framework with two models of language. Our analysis suggests that the accuracy
of translation depends on the complexity of the source language and the amount
of "common ground" between the source language and the target prior.
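In symbols, the setup can be pictured roughly as follows (our notation, not the paper's formal definitions): choose the translator whose outputs are most plausible under the target-language prior,

```latex
% Hedged sketch: maximize the expected log-probability that the prior \mu
% assigns to translated source text.
\hat{\theta} \;=\; \operatorname*{arg\,max}_{\theta \in \Theta}\;
  \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{src}}}
  \left[ \log \mu\big(f_\theta(x)\big) \right]
```

where $f_\theta$ maps source text to candidate target-language text; the requirement that the prior $\mu$ assign non-zero probability to possible translations is what keeps such an objective well defined.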
We also prove upper bounds on the amount of data required from the source
language in the unsupervised setting as a function of the amount of data
required in a hypothetical supervised setting. Surprisingly, our bounds suggest
that the amount of source data required for unsupervised translation is
comparable to the supervised setting. For one of the language models we
analyze, we also prove a nearly matching lower bound.
Our analysis is purely information-theoretic and, as such, can inform how much
source data needs to be collected, but it does not yield a computationally
efficient procedure.