A Large-Scale Study of Machine Translation in Turkic Languages
Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including i) a large parallel corpus covering 22 Turkic languages consisting of common public datasets in combination with new datasets of approximately 1.4 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains and iv) human evaluation scores. All models, scripts, and data will be released to the public. Peer reviewed.
Evaluating Multiway Multilingual NMT in the Turkic Languages
Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people that speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and the evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines in the out-of-domain test sets and that finetuning the model on a downstream task of a single pair also results in a huge performance boost in both low- and high-resource scenarios. Our attentive analysis of evaluation criteria for MT models in Turkic languages also points to the necessity for further research in this direction. We release the corpus splits, test sets as well as models to the public. Peer reviewed.
A MT System from Turkmen to Turkish employing finite state and statistical methods
In this work, we present an MT system from Turkmen to Turkish. Our system exploits the similarity of the languages by using a modified version of the direct translation method. However, the complex inflectional and derivational morphology of the Turkic languages necessitates special treatment for the word-by-word translation model. We also employ morphology-aware multi-word processing and statistical disambiguation processes in our system. We believe that this approach is valid for most of the Turkic languages and that the architecture implemented using FSTs can be easily extended to those languages.
The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection
The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual
analysis in morphology examined transfer learning of inflection between 100
language pairs, as well as contextual lemmatization and morphosyntactic
description in 66 languages. The first task evolves past years' inflection
tasks by examining transfer of morphological inflection knowledge from a
high-resource language to a low-resource language. This year also presents a
new second challenge on lemmatization and morphological feature analysis in
context. All submissions featured a neural component and built on either this
year's strong baselines or highly ranked systems from previous years' shared
tasks. Every participating team improved in accuracy over the baselines for the
inflection task (though not Levenshtein distance), and every team in the
contextual analysis task improved on both state-of-the-art neural and
non-neural baselines. Comment: Presented at SIGMORPHON 2019
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
We introduce an architecture to learn joint multilingual sentence
representations for 93 languages, belonging to more than 30 different language
families and written in 28 different scripts. Our system uses a single BiLSTM
encoder with a shared BPE vocabulary for all languages, which is coupled with
an auxiliary decoder and trained on publicly available parallel corpora. This
enables us to learn a classifier on top of the resulting sentence embeddings
using English annotated data only, and transfer it to any of the 93 languages
without any modification. Our approach sets a new state-of-the-art on zero-shot
cross-lingual natural language inference for all the 14 languages in the XNLI
dataset but one. We also achieve very competitive results in cross-lingual
document classification (MLDoc dataset). Our sentence embeddings are also
strong at parallel corpus mining, establishing a new state-of-the-art in the
BUCC shared task for 3 of its 4 language pairs. Finally, we introduce a new
test set of aligned sentences in 122 languages based on the Tatoeba corpus, and
show that our sentence embeddings obtain strong results in multilingual
similarity search even for low-resource languages. Our PyTorch implementation,
pre-trained encoder and the multilingual test set will be freely available.
Using Comparable Corpora to Augment Statistical Machine Translation Models in Low Resource Settings
Previously, statistical machine translation (SMT) models have been estimated from parallel corpora, or pairs of translated sentences. In this thesis, we directly incorporate comparable corpora into the estimation of end-to-end SMT models. In contrast to parallel corpora, comparable corpora are pairs of monolingual corpora that have some cross-lingual similarities, for example topic or publication date, but that do not necessarily contain any direct translations. Comparable corpora are more readily available in large quantities than parallel corpora, which require significant human effort to compile. We use comparable corpora to estimate machine translation model parameters and show that doing so improves performance in settings where a limited amount of parallel data is available for training. The major contributions of this thesis are the following:
* We release "language packs" for 151 human languages, which include bilingual dictionaries, comparable corpora of Wikipedia document pairs, comparable corpora of time-stamped news text that we harvested from the web, and, for non-Roman script languages, dictionaries of name pairs, which are likely to be transliterations.
* We present a novel technique for using a small number of example word translations to learn a supervised model for bilingual lexicon induction which takes advantage of a wide variety of signals of translation equivalence that can be estimated over comparable corpora.
* We show that using comparable corpora to induce new translations and estimate new phrase table feature functions improves end-to-end statistical machine translation performance for low resource language pairs as well as domains.
* We present a novel algorithm for composing multiword phrase translations from multiple unigram translations and then use comparable corpora to prune the large space of hypothesis translations. We show that these induced phrase translations improve machine translation performance beyond that of component unigrams.
This thesis focuses on critical low resource machine translation settings, where insufficient parallel corpora exist for training statistical models. We experiment with both low resource language pairs and low resource domains of text. We present results from our novel error analysis methodology, which show that most translation errors in low resource settings are due to unseen source language words and phrases and unseen target language translations.
We also find room for fixing errors due to how different translations are weighted, or scored, in the models. We target both error types; we use comparable corpora to induce new word and phrase translations and estimate novel translation feature scores. Our experiments show that augmenting baseline SMT systems with new translations and features estimated over comparable corpora improves translation performance significantly. Additionally, our techniques expand the applicability of statistical machine translation to those language pairs for which zero parallel text is available.
Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation
Neural machine translation (NMT) has progressed rapidly over the past several
years, and modern models are able to achieve relatively high quality using only
monolingual text data, an approach dubbed Unsupervised Machine Translation
(UNMT). However, these models still struggle in a variety of ways, including
aspects of translation that for a human are the easiest - for instance,
correctly translating common nouns. This work explores a cheap and abundant
resource to combat this problem: bilingual lexica. We test the efficacy of
bilingual lexica in a real-world set-up, on 200-language translation models
trained on web-crawled text. We present several findings: (1) using lexical
data augmentation, we demonstrate sizable performance gains for unsupervised
translation; (2) we compare several families of data augmentation,
demonstrating that they yield similar improvements, and can be combined for
even greater improvements; (3) we demonstrate the importance of carefully
curated lexica over larger, noisier ones, especially with larger models; and
(4) we compare the efficacy of multilingual lexicon data versus
human-translated parallel data. Finally, we open-source GATITOS (available at
https://github.com/google-research/url-nlp/tree/main/gatitos), a new
multilingual lexicon for 26 low-resource languages, which had the highest
performance among lexica in our experiments.