83,916 research outputs found
Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation
There are several approaches for improving neural machine translation for low-resource languages: monolingual data can be exploited via pretraining or data augmentation; parallel corpora on related language pairs can be used via parameter sharing or transfer learning in multilingual models; subword segmentation and regularization techniques can be applied to ensure high coverage of the vocabulary. We review these approaches in the context of an asymmetric-resource one-to-many translation task, in which the pair of target languages are related, with one being a very low-resource and the other a higher-resource language. We test various methods on three artificially restricted translation tasks—English to Estonian (low-resource) and Finnish (high-resource), English to Slovak and Czech, English to Danish and Swedish—and one real-world task, Norwegian to North Sámi and Finnish. The experiments show positive effects especially for scheduled multi-task learning, denoising autoencoder, and subword sampling.
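The subword sampling mentioned above can be illustrated with a toy segmenter. Real systems use a trained subword model (for example SentencePiece with sampling enabled); the function below is only a minimal, stdlib-only sketch of the core idea that the same word should receive varying segmentations across training, and all names are illustrative.

```python
import random

def sample_segmentation(word, p_split=0.3, rng=None):
    # With probability p_split, cut at each character boundary, so
    # repeated calls yield different subword segmentations of the
    # same word -- the core idea behind subword regularization.
    rng = rng or random.Random()
    pieces, current = [], word[0]
    for ch in word[1:]:
        if rng.random() < p_split:
            pieces.append(current)
            current = ch
        else:
            current += ch
    pieces.append(current)
    return pieces

# Deterministic behavior at the extremes:
print(sample_segmentation("translation", p_split=0.0))  # ['translation']
print(sample_segmentation("abc", p_split=1.0))          # ['a', 'b', 'c']
```

In a real pipeline the sampled pieces would be looked up in a fixed subword vocabulary; here every substring is accepted, which is enough to show why the model sees varied segmentations of one surface form.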
On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator Loss
Although unsupervised neural machine translation (UNMT) has achieved success in many language pairs, the copying problem, i.e., directly copying some parts of the input sentence as the translation, is common among distant language pairs, especially when low-resource languages are involved. We find this issue is closely related to an unexpected copying behavior during online back-translation (BT). In this work, we propose a simple but effective training schedule that incorporates a language discriminator loss. The loss imposes constraints on the intermediate translation so that the translation is in the desired language. By conducting extensive experiments on different language pairs, including similar and distant, high- and low-resource languages, we find that our method alleviates the copying problem, thus improving the translation performance on low-resource languages.
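A minimal sketch of how such a discriminator loss might be folded into the back-translation objective. The linear warmup schedule, the weights, and the function names here are assumptions for illustration, not the paper's actual formulation:

```python
def discriminator_weight(step, warmup_steps=2000, max_weight=1.0):
    # Linearly ramp the weight of the language-discriminator loss
    # over the first warmup_steps updates, then hold it constant.
    return max_weight * min(1.0, step / warmup_steps)

def training_loss(bt_loss, disc_loss, step):
    # Total objective: online back-translation loss plus the weighted
    # discriminator penalty, which pushes the intermediate translation
    # toward the desired target language instead of a copy of the input.
    return bt_loss + discriminator_weight(step) * disc_loss

print(training_loss(2.0, 0.5, step=0))     # 2.0: discriminator off at start
print(training_loss(2.0, 0.5, step=2000))  # 2.5: full weight after warmup
```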
An Automatic Evaluation of the WMT22 General Machine Translation Task
This report presents an automatic evaluation of the general machine translation task of the Seventh Conference on Machine Translation (WMT22). It evaluates a total of 185 systems for 21 translation directions, including high-resource to low-resource language pairs and from closely related to distant languages. This large-scale automatic evaluation highlights some of the current limits of state-of-the-art machine translation systems. It also shows how automatic metrics, namely chrF, BLEU, and COMET, can complement one another to mitigate their own limits in terms of interpretability and accuracy.
Comment: Update: correction, the fr->de and de->fr tables were switched.
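Of the three metrics named, chrF is simple enough to sketch from scratch. The following is a simplified character-n-gram F-score in the spirit of chrF (whitespace stripped, no word n-grams, no smoothing); it is an illustration of the mechanism, not a substitute for a reference implementation such as sacreBLEU:

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams over the whitespace-stripped string.
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hyp, ref, max_n=6, beta=2.0):
    # Simplified chrF: average F-beta over character n-gram
    # precision and recall for n = 1..max_n.
    scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h and not r:
            continue  # both strings too short for this n
        overlap = sum((h & r).values())  # clipped n-gram matches
        prec = overlap / max(sum(h.values()), 1)
        rec = overlap / max(sum(r.values()), 1)
        if prec + rec == 0:
            scores.append(0.0)
        else:
            b2 = beta * beta
            scores.append((1 + b2) * prec * rec / (b2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0

print(chrf("the cat sat", "the cat sat"))  # 1.0 for an exact match
```

Because it operates on characters rather than words, chrF degrades gracefully on morphologically rich, low-resource languages, which is one reason it complements BLEU and the learned metric COMET.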
Unsupervised Machine Translation On Dravidian Languages
Unsupervised neural machine translation (UNMT) is beneficial especially for low-resource languages such as those from the Dravidian family. However, UNMT systems tend to fail in realistic scenarios involving actual low-resource languages. Recent works propose to utilize auxiliary parallel data and have achieved state-of-the-art results. In this work, we focus on unsupervised translation between English and Kannada, a low-resource Dravidian language. We additionally utilize a limited amount of auxiliary data between English and other related Dravidian languages. We show that unifying the writing systems is essential in unsupervised translation between the Dravidian languages. We explore several model architectures that use the auxiliary data in order to maximize knowledge sharing and enable UNMT for distant language pairs. Our experiments demonstrate that it is crucial to include auxiliary languages that are similar to our focal language, Kannada. Furthermore, we propose a metric to measure language similarity and show that it serves as a good indicator for selecting the auxiliary languages.
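The abstract does not specify the proposed similarity metric, so the following is a hypothetical stand-in: a crude character-n-gram Jaccard score between small corpora. Because it is script-sensitive, two related languages written in different scripts score near zero, which echoes the paper's point about unifying writing systems before measuring or exploiting similarity:

```python
def char_ngram_set(text, n=3):
    # Inventory of character n-grams, ignoring spaces.
    s = text.replace(" ", "")
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_jaccard(corpus_a, corpus_b, n=3):
    # Jaccard overlap of character n-gram inventories: a rough,
    # script-sensitive proxy for surface similarity of two languages.
    a, b = char_ngram_set(corpus_a, n), char_ngram_set(corpus_b, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(ngram_jaccard("hello world", "hello world"))  # 1.0: identical text
```

On real data one would compute this over large monolingual samples per language and rank candidate auxiliary languages by their score against the focal language.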
Improving Machine Translation Quality with Denoising Autoencoder and Pre-Ordering
The problems in machine translation are related to the characteristics of a family of languages, especially syntactic divergences between languages. In a translation task, having both source and target languages in the same language family is a luxury that cannot be relied upon. The trained models must overcome such differences either through manual augmentations or automatically inferred capacity built into the model design. In this work, we investigated the impact of several methods of handling differing word orders during translation, and further experimented with assimilating the source language's syntax to the target word order using pre-ordering. We focused on extremely low-resource scenarios. We also conducted experiments on practical data augmentation techniques that support the reordering capacity of the models by varying the training objectives, adding the secondary goal of removing noise from or reordering broken input sequences. In particular, we propose methods to improve translation quality with a denoising autoencoder in Neural Machine Translation (NMT) and a pre-ordering method in Phrase-based Statistical Machine Translation (PBSMT). Experiments on a number of English-Vietnamese pairs show improvements in BLEU scores compared to both the baseline NMT and SMT systems.
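A denoising-autoencoder objective of the kind described above is typically built by corrupting the target-side sentence and training the model to reconstruct the clean sequence. Below is a minimal, stdlib-only sketch of one common corruption scheme (token dropout plus local shuffling); the parameter names and values are illustrative, not the paper's exact settings:

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_k=3.0, rng=None):
    # Corrupt a token sequence for a denoising objective:
    # 1) randomly drop tokens with probability drop_prob;
    # 2) locally shuffle survivors by adding uniform jitter in
    #    [0, shuffle_k) to each position index before sorting,
    #    so tokens can only move a bounded distance.
    rng = rng or random.Random()
    kept = [t for t in tokens if rng.random() >= drop_prob]
    keys = [i + rng.uniform(0, shuffle_k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

sent = "the model learns to restore order".split()
# With both noise sources disabled the sentence is unchanged:
print(add_noise(sent, drop_prob=0.0, shuffle_k=0.0))
# With noise enabled, output is a corrupted variant (differs per call):
print(add_noise(sent, rng=random.Random(0)))
```

During training, the pair (noisy sentence, original sentence) serves as input and target, so the decoder learns to reorder and repair sequences, which is the capacity the reordering experiments rely on.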