Building the Tatar-Russian NMT system based on re-translation of multilingual data
© Springer Nature Switzerland AG 2018. This paper assesses the possibility of combining rule-based and neural network approaches to building a machine translation system for the Tatar-Russian language pair. We propose a rule-based system that allows using parallel data from a group of six Turkic languages (Tatar, Kazakh, Kyrgyz, Crimean Tatar, Uzbek, Turkish) and Russian to overcome the problem of limited Tatar-Russian data. We incorporate modern approaches to data augmentation, neural network training, and linguistically motivated rule-based methods. The main results of the work are the creation of the first neural Tatar-Russian translation system and the improvement of translation quality in this language pair, in terms of BLEU scores, from 12 to 39 and from 17 to 45 for the two translation directions (compared to the existing translation system). Translation between any of the Tatar, Kazakh, Kyrgyz, Crimean Tatar, Uzbek, and Turkish languages also becomes possible, which makes it possible to translate from all of these Turkic languages into Russian using Tatar as an intermediate language.
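The pivot idea from the abstract above — reaching Russian from any of the Turkic languages by going through Tatar — can be sketched as simple function composition. This is a minimal illustrative sketch, not the authors' system: the lookup-table "translators" and the function names are hypothetical stand-ins for real MT models.

```python
# Illustrative sketch of pivot (bridge) translation: routing a Turkic
# source language into Russian via Tatar as the intermediate language.
# The tiny lookup tables are hypothetical stand-ins for trained MT models.

def make_lookup_translator(table):
    """Return a toy translator that maps sentences via a lookup table."""
    return lambda sentence: table[sentence]

# Toy stand-in models: Kazakh -> Tatar and Tatar -> Russian.
kk_to_tt = make_lookup_translator({"Salem": "Isänmesez"})
tt_to_ru = make_lookup_translator({"Isänmesez": "Здравствуйте"})

def pivot_translate(sentence, src_to_pivot, pivot_to_tgt):
    """Compose two translation steps through the pivot language."""
    return pivot_to_tgt(src_to_pivot(sentence))

# Kazakh -> Russian with no direct Kazakh-Russian model:
print(pivot_translate("Salem", kk_to_tt, tt_to_ru))  # Здравствуйте
```

The design point is that only one model per language pair with the pivot is needed, rather than a model for every source-target combination.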
Evaluating Multiway Multilingual NMT in the Turkic Languages
Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been constrained by the lack of high-quality parallel corpora as well as by limited engagement with the people who speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines on the out-of-domain test sets, and that fine-tuning the model on a downstream task for a single pair also results in a huge performance boost in both low- and high-resource scenarios. Our attentive analysis of evaluation criteria for MT models in Turkic languages also points to the necessity for further research in this direction. We release the corpus splits, test sets, and models to the public.
A Large-Scale Study of Machine Translation in Turkic Languages
Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, a large number of languages are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems and ameliorating data scarcity, our study has several key contributions: i) a large parallel corpus covering 22 Turkic languages, consisting of common public datasets in combination with new datasets of approximately 1.4 million parallel sentences; ii) bilingual baselines for 26 language pairs; iii) novel high-quality test sets in three different translation domains; and iv) human evaluation scores. All models, scripts, and data will be released to the public.
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
We introduce an architecture to learn joint multilingual sentence
representations for 93 languages, belonging to more than 30 different language
families and written in 28 different scripts. Our system uses a single BiLSTM
encoder with a shared BPE vocabulary for all languages, which is coupled with
an auxiliary decoder and trained on publicly available parallel corpora. This
enables us to learn a classifier on top of the resulting sentence embeddings
using English annotated data only, and transfer it to any of the 93 languages
without any modification. Our approach sets a new state-of-the-art on zero-shot
cross-lingual natural language inference for all the 14 languages in the XNLI
dataset but one. We also achieve very competitive results in cross-lingual
document classification (MLDoc dataset). Our sentence embeddings are also
strong at parallel corpus mining, establishing a new state-of-the-art in the
BUCC shared task for 3 of its 4 language pairs. Finally, we introduce a new
test set of aligned sentences in 122 languages based on the Tatoeba corpus, and
show that our sentence embeddings obtain strong results in multilingual
similarity search even for low-resource languages. Our PyTorch implementation,
pre-trained encoder and the multilingual test set will be freely available.
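The parallel-corpus-mining use case described above boils down to nearest-neighbor search in the shared embedding space: for each source sentence, find the target sentence whose embedding points in the most similar direction. A minimal sketch, using toy vectors in place of real encoder outputs:

```python
# Minimal sketch of similarity search over multilingual sentence
# embeddings, as used for parallel corpus mining. The toy vectors below
# stand in for real encoder outputs; real systems also use margin-based
# scoring and approximate nearest-neighbor indexes for scale.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(query, candidates):
    """Index of the candidate embedding most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))

# Toy embeddings: one source sentence, three target-side candidates.
src = [0.9, 0.1, 0.0]
tgt = [[0.0, 1.0, 0.0], [0.8, 0.2, 0.1], [0.1, 0.0, 1.0]]
print(nearest(src, tgt))  # 1 — the candidate closest in direction to src
```

Because the encoder maps all languages into one space, the same search works regardless of which languages the source and candidates are written in.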
Lego-MT: Towards Detachable Models in Massively Multilingual Machine Translation
Multilingual neural machine translation (MNMT) aims to build a unified model
for many language directions. Existing monolithic models for MNMT encounter two
challenges: parameter interference among languages and inefficient inference
for large models. In this paper, we revisit the classic multi-way structures
and develop a detachable model by assigning each language (or group of
languages) to an individual branch that supports plug-and-play training and
inference. To address the needs of learning representations for all languages
in a unified space, we propose a novel efficient training recipe, upon which we
build an effective detachable model, Lego-MT. For a fair comparison, we collect
data from OPUS and build a translation benchmark covering 433 languages and
1.3B parallel data. Experiments show that Lego-MT with 1.2B parameters brings
an average gain of 3.2 spBLEU. It even outperforms M2M-100 with 12B parameters.
The proposed training recipe brings a 28.2× speedup over the conventional multi-way training method (https://github.com/CONE-MT/Lego-MT). Comment: ACL 2023 Findings.
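The plug-and-play branch structure described in this abstract can be illustrated with a small sketch. This is a hedged architectural illustration only, not Lego-MT's actual implementation: the class and the lambda "encoders" are hypothetical stand-ins for neural modules sharing a unified representation space.

```python
# Hedged sketch of the "detachable branch" idea: each language (or group
# of languages) owns its own encoder branch that can be attached or
# detached independently, avoiding parameter interference and letting
# inference load only the branches it needs.

class DetachableMT:
    def __init__(self):
        self.branches = {}  # language code -> encoder branch

    def attach(self, lang, branch):
        """Plug in a branch for one language without touching the others."""
        self.branches[lang] = branch

    def detach(self, lang):
        """Unplug a branch; remaining languages are unaffected."""
        self.branches.pop(lang, None)

    def encode(self, lang, sentence):
        """Route the input through its language-specific branch."""
        return self.branches[lang](sentence)

model = DetachableMT()
model.attach("en", lambda s: f"enc_en({s})")
model.attach("tt", lambda s: f"enc_tt({s})")
print(model.encode("tt", "Isänmesez"))  # enc_tt(Isänmesez)
model.detach("en")  # English branch unplugged; Tatar still works
print(model.encode("tt", "Isänmesez"))  # enc_tt(Isänmesez)
```

In a monolithic model, all languages share one parameter set, so this kind of selective loading and removal is not possible; the branch design trades some sharing for modularity.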