5 research outputs found
A Multilingual View of Unsupervised Machine Translation
We present a probabilistic framework for multilingual neural machine
translation that encompasses supervised and unsupervised setups, focusing on
unsupervised translation. In addition to studying the vanilla case where there
is only monolingual data available, we propose a novel setup where one language
in the (source, target) pair is not associated with any parallel data, but
there may exist auxiliary parallel data that contains the other. This auxiliary
data can naturally be utilized in our probabilistic framework via a novel
cross-translation loss term. Empirically, we show that our approach results in
higher BLEU scores over state-of-the-art unsupervised models on the WMT'14
English-French, WMT'16 English-German, and WMT'16 English-Romanian datasets in
most directions. In particular, we obtain a +1.65 BLEU advantage over the
best-performing unsupervised model in the Romanian-English direction.
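To make the cross-translation loss concrete, here is a minimal Python sketch of one such training step, using a toy stand-in model; the class and method names are illustrative, not the paper's actual API. The assumed setup: we want source-to-target translation, have no parallel data containing the source language, but do have auxiliary (aux, target) parallel pairs.

    # Hypothetical sketch of a cross-translation training step.
    class ToyMultilingualModel:
        """Stand-in for a multilingual NMT model."""
        def translate(self, sentence, src_lang, tgt_lang):
            # Placeholder: a real model would decode here.
            return f"<{tgt_lang}> " + sentence

        def nll_loss(self, source, target, src_lang, tgt_lang):
            # Placeholder: a real model returns -log p(target | source).
            return 0.01 * len(target)

    def cross_translation_step(model, aux_sentence, tgt_sentence,
                               aux_lang="de", src_lang="ro", tgt_lang="en"):
        # 1. Translate the auxiliary sentence into the zero-resource
        #    source language with the current model.
        pseudo_src = model.translate(aux_sentence, aux_lang, src_lang)
        # 2. Use the synthesized (pseudo_src, tgt_sentence) pair to
        #    supervise the src->tgt direction: the cross-translation loss.
        return model.nll_loss(pseudo_src, tgt_sentence, src_lang, tgt_lang)

    model = ToyMultilingualModel()
    loss = cross_translation_step(model, "Ein Beispielsatz.", "An example sentence.")
    print(f"cross-translation loss: {loss:.4f}")

A real implementation would backpropagate this negative log-likelihood through the shared multilingual parameters; only the structure of the step is sketched here.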
Cross-lingual Retrieval for Iterative Self-Supervised Training
Recent studies have demonstrated the cross-lingual alignment ability of
multilingual pretrained language models. In this work, we found that the
cross-lingual alignment can be further improved by training seq2seq models on
sentence pairs mined using their own encoder outputs. We utilized these
findings to develop a new approach -- cross-lingual retrieval for iterative
self-supervised training (CRISS), where mining and training processes are
applied iteratively, improving cross-lingual alignment and translation ability
at the same time. Using this method, we achieved state-of-the-art unsupervised
machine translation results on 9 language directions with an average
improvement of 2.4 BLEU, and on the Tatoeba sentence retrieval task in the
XTREME benchmark on 16 languages with an average improvement of 21.5% in
absolute accuracy. Furthermore, CRISS also brings an additional 1.8 BLEU
improvement on average compared to mBART, when finetuned on supervised machine
translation downstream tasks.
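The mining half of a CRISS-style iteration can be illustrated with a few lines of numpy, assuming L2-normalized encoder outputs and margin (ratio) scoring against nearest neighbors; the exact scoring in the paper may differ, so treat this as a plausible simplification.

    import numpy as np

    def mine_pairs(src_embs, tgt_embs, k=4, threshold=1.0):
        """Score each candidate pair by cosine similarity divided by the
        mean similarity to the k nearest neighbors on both sides, and
        keep each source sentence's best match if it clears threshold."""
        sim = src_embs @ tgt_embs.T                      # cosine (normalized)
        knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
        knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)
        margin = sim / (0.5 * (knn_src[:, None] + knn_tgt[None, :]))
        pairs = []
        for i in range(sim.shape[0]):
            j = int(np.argmax(margin[i]))
            if margin[i, j] > threshold:
                pairs.append((i, j, float(margin[i, j])))
        return pairs

    # Toy data: rows are sentence embeddings in two languages.
    rng = np.random.default_rng(0)
    src = rng.normal(size=(8, 16))
    src /= np.linalg.norm(src, axis=1, keepdims=True)
    tgt = src + 0.05 * rng.normal(size=src.shape)        # noisy "translations"
    tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
    for i, j, score in mine_pairs(src, tgt):
        print(f"mined src[{i}] <-> tgt[{j}] (margin {score:.3f})")

A full CRISS loop would then fine-tune the seq2seq model on the mined pairs, re-encode the corpora with the updated encoder, and repeat.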
Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages
Unsupervised translation has reached impressive performance on resource-rich
language pairs such as English-French and English-German. However, early
studies have shown that in more realistic settings involving low-resource, rare
languages, unsupervised translation performs poorly, achieving less than 3.0
BLEU. In this work, we show that multilinguality is critical to making
unsupervised systems practical for low-resource settings. In particular, we
present a single model for 5 low-resource languages (Gujarati, Kazakh, Nepali,
Sinhala, and Turkish) translating to and from English, which leverages
monolingual and auxiliary parallel data from other high-resource language pairs
via a three-stage training scheme. We outperform all current state-of-the-art
unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU.
Additionally, we outperform a large collection of supervised WMT submissions
for various language pairs as well as match the performance of the current
state-of-the-art supervised model for Nepali-English. We conduct a series of
ablation studies to establish the robustness of our model under different
degrees of data quality, as well as to analyze the factors which led to the
superior performance of the proposed approach over traditional unsupervised
models.
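The abstract names a three-stage training scheme without spelling out the stages, so the skeleton below is a hypothetical decomposition, not the paper's confirmed recipe: denoising pretraining on monolingual data, supervised training on auxiliary high-resource parallel data, then iterative back-translation for the low-resource pairs. All names and corpus sizes are illustrative.

    LOW_RESOURCE = ["gu", "kk", "ne", "si", "tr"]  # the paper's 5 languages

    def stage1_pretrain(log, monolingual):
        """Denoising pretraining on monolingual corpora of all languages."""
        for lang, size in monolingual.items():
            log.append(f"stage 1: pretrain on {size} monolingual {lang} sentences")

    def stage2_auxiliary(log, aux_parallel):
        """Supervised training on auxiliary high-resource pairs,
        transferring translation ability into the single shared model."""
        for pair, size in aux_parallel.items():
            log.append(f"stage 2: train on {size} auxiliary {pair} pairs")

    def stage3_back_translation(log, rounds=2):
        """Iterative back-translation between English and each low-resource
        language, training on the model's own synthetic translations."""
        for r in range(1, rounds + 1):
            for lang in LOW_RESOURCE:
                log.append(f"stage 3, round {r}: back-translate en<->{lang}")

    log = []
    stage1_pretrain(log, {"en": 1_000_000, "ne": 50_000, "si": 40_000})
    stage2_auxiliary(log, {"en-hi": 200_000, "en-ru": 300_000})
    stage3_back_translation(log)
    print("\n".join(log))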
Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information
We investigate the following question for machine translation (MT): can we
develop a single universal MT model to serve as the common seed and obtain
derivative and improved models on arbitrary language pairs? We propose mRASP,
an approach to pre-train a universal multilingual neural machine translation
model. Our key idea in mRASP is its novel technique of random aligned
substitution, which brings words and phrases with similar meanings across
multiple languages closer in the representation space. We pre-train an mRASP
model on 32 language pairs jointly with only public datasets. The model is then
fine-tuned on downstream language pairs to obtain specialized MT models. We
carry out extensive experiments on 42 translation directions across diverse
settings, including low-, medium-, and rich-resource pairs, as well as transfer
to exotic language pairs. Experimental results demonstrate that mRASP achieves
significant performance improvements compared to directly training on those
target pairs. To our knowledge, this is the first work to verify that multiple
low-resource language pairs can be utilized to improve rich-resource MT.
Surprisingly, mRASP
is even able to improve the translation quality on exotic languages that never
occur in the pre-training corpus. Code, data, and pre-trained models are
available at https://github.com/linzehui/mRASP.
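Random aligned substitution itself is simple enough to sketch directly: with some probability, replace a word with a same-meaning word from another language, so that cross-lingual synonyms land near each other in representation space. The tiny dictionary and the substitution rate below are illustrative stand-ins for the large bilingual lexicons mRASP actually uses.

    import random

    # Illustrative miniature of a bilingual synonym dictionary.
    SYNONYM_DICT = {
        "hello": {"fr": "bonjour", "de": "hallo"},
        "world": {"fr": "monde", "de": "welt"},
        "good": {"fr": "bon", "de": "gut"},
    }

    def random_aligned_substitution(tokens, p=0.3, rng=random):
        """Replace each dictionary word with a random cross-lingual
        synonym with probability p; leave other tokens untouched."""
        out = []
        for tok in tokens:
            translations = SYNONYM_DICT.get(tok.lower())
            if translations and rng.random() < p:
                lang = rng.choice(sorted(translations))
                out.append(translations[lang])
            else:
                out.append(tok)
        return out

    random.seed(7)
    sentence = "hello world this is a good day".split()
    print(" ".join(random_aligned_substitution(sentence, p=0.5)))

During pre-training, the substituted sentence replaces the original source side, so the model sees, say, "bonjour world" aligned to the same target as "hello world".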
Reference Language based Unsupervised Neural Machine Translation
Exploiting a common language as an auxiliary signal has a long tradition in
machine translation: when a parallel corpus between the source and target
languages is unavailable, supervised systems can still benefit from a
well-chosen pivot language. The rise of unsupervised neural machine translation
(UNMT) seems to lift the parallel-corpus requirement entirely, yet its
performance remains unsatisfactory because its core back-translation training
relies on weak signals. Extending the pivot idea by dropping the requirement of
parallel data with the target, we propose a new reference-language-based UNMT
framework in which the reference language shares a parallel corpus only with
the source; this parallel data provides a signal clear enough to guide UNMT's
reconstruction training through a proposed reference agreement mechanism.
Experimental results show that our method improves UNMT quality over a strong
baseline while using only one auxiliary language, demonstrating the usefulness
of the proposed reference-language-based UNMT.