8,284 research outputs found

    Multilingual Neural Machine Translation for Low-Resource Languages

    Get PDF
    Neural Machine Translation (NMT) has been shown to be more effective in translation tasks compared to phrase-based statistical machine translation. However, NMT systems are limited in translating low-resourced languages, due to the significant amount of parallel data that is required to learn useful mappings between languages. In this work, we show how so-called multilingual NMT can help to tackle the challenges associated with low-resourced language translation. The principle of multilingual NMT is to force the creation of hidden representations of words in a shared semantic space across multiple languages, thus enabling a positive parameter transfer across languages. In particular, we present multilingual translation experiments with three languages (English, Italian, Romanian) covering six translation directions, utilizing both recurrent neural networks and transformer (or self-attentive) neural networks. We then focus on the zero-shot translation problem, that is, how to leverage multilingual data in order to learn translation directions not covered by the data. Hence, we introduce our recently proposed iterative self-training method, which incrementally improves a multilingual NMT system on a zero-shot direction by relying only on monolingual data. Our results on TED talks data show that multilingual NMT outperforms conventional bilingual NMT, that the transformer NMT outperforms recurrent NMT, and that zero-shot NMT outperforms conventional pivoting methods and even matches the performance of a fully-trained bilingual system.
    Surafel M. Lakew, Marcello Federico, Matteo Negri, Marco Turchi

    Multilingual Neural Machine Translation for Low-Resource Languages

    Get PDF
    In recent years, Neural Machine Translation (NMT) has been shown to be more effective than phrase-based statistical methods, thus quickly becoming the state of the art in machine translation (MT). However, NMT systems are limited in translating low-resourced languages, due to the significant amount of parallel data that is required to learn useful mappings between languages. In this work, we show how the so-called multilingual NMT can help to tackle the challenges associated with low-resourced language translation. The underlying principle of multilingual NMT is to force the creation of hidden representations of words in a shared semantic space across multiple languages, thus enabling a positive parameter transfer across languages. Along this direction, we present multilingual translation experiments with three languages (English, Italian, Romanian) covering six translation directions, utilizing both recurrent neural networks and transformer (or self-attentive) neural networks. We then focus on the zero-shot translation problem, that is, how to leverage multilingual data in order to learn translation directions that are not covered by the available training material. To this aim, we introduce our recently proposed iterative self-training method, which incrementally improves a multilingual NMT on a zero-shot direction by just relying on monolingual data. Our results on TED talks data show that multilingual NMT outperforms conventional bilingual NMT, that the transformer NMT outperforms recurrent NMT, and that zero-shot NMT outperforms conventional pivoting methods and even matches the performance of a fully-trained bilingual system.
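
    The shared semantic space described above is typically obtained by training one encoder-decoder on the concatenation of all language pairs, with a token on the source side indicating the requested target language; tagging a direction never seen in training then requests a zero-shot translation. The sketch below illustrates that data-preparation step; the corpus contents and the tag format ("<2it>", "<2ro>", ...) are illustrative assumptions, not details taken from the paper.

    # Minimal sketch: building a joint training corpus for multilingual NMT
    # by prepending a target-language flag to every source sentence.
    def tag_parallel_corpus(pairs, tgt_lang):
        """Prepend a target-language token to every source sentence."""
        tag = f"<2{tgt_lang}>"
        return [(f"{tag} {src}", tgt) for src, tgt in pairs]

    def build_multilingual_corpus(corpora):
        """corpora: dict mapping (src_lang, tgt_lang) -> list of (src, tgt) pairs.
        A single encoder-decoder trained on the merged corpus shares its
        parameters across all translation directions."""
        merged = []
        for (src_lang, tgt_lang), pairs in corpora.items():
            merged.extend(tag_parallel_corpus(pairs, tgt_lang))
        return merged

    if __name__ == "__main__":
        corpora = {
            ("en", "it"): [("good morning", "buongiorno")],
            ("en", "ro"): [("good morning", "bună dimineața")],
            ("it", "en"): [("buongiorno", "good morning")],
        }
        for src, tgt in build_multilingual_corpus(corpora):
            print(src, "=>", tgt)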

    Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation

    Get PDF
    There are several approaches for improving neural machine translation for low-resource languages: monolingual data can be exploited via pretraining or data augmentation; parallel corpora on related language pairs can be used via parameter sharing or transfer learning in multilingual models; subword segmentation and regularization techniques can be applied to ensure high coverage of the vocabulary. We review these approaches in the context of an asymmetric-resource one-to-many translation task, in which the pair of target languages are related, with one being a very low-resource and the other a higher-resource language. We test various methods on three artificially restricted translation tasks: English to Estonian (low-resource) and Finnish (high-resource), English to Slovak and Czech, and English to Danish and Swedish; and on one real-world task, Norwegian to North Sámi and Finnish. The experiments show positive effects especially for scheduled multi-task learning, denoising autoencoder, and subword sampling.
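
    Of the techniques listed above, subword sampling is the easiest to make concrete: instead of always using the single best segmentation, each training epoch sees a freshly sampled subword split of the same sentence. The sketch below uses the SentencePiece unigram model for this; the file names, vocabulary size, and sampling hyperparameters (alpha, nbest_size) are illustrative assumptions rather than the settings used in the paper.

    # Sketch: subword regularization via sampled segmentations with SentencePiece.
    import sentencepiece as spm

    # One-time training of a unigram subword model (paths and vocab size are
    # placeholders for the actual target-side training data):
    # spm.SentencePieceTrainer.train(input="train.tgt", model_prefix="subword",
    #                                vocab_size=8000, model_type="unigram")

    sp = spm.SentencePieceProcessor(model_file="subword.model")
    sentence = "neural machine translation for low-resource languages"

    # Deterministic (best) segmentation, as used at inference time.
    print(sp.encode(sentence, out_type=str))

    # Sampled segmentations: each call may return a different split, which acts
    # as a regularizer when parallel data is scarce.
    for _ in range(3):
        print(sp.encode(sentence, out_type=str,
                        enable_sampling=True, alpha=0.1, nbest_size=-1))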

    Towards a Deep Understanding of Multilingual End-to-End Speech Translation

    Full text link
    In this paper, we employ Singular Value Canonical Correlation Analysis (SVCCA) to analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages. SVCCA enables us to estimate representational similarity across languages and layers, enhancing our understanding of the functionality of multilingual speech translation and its potential connection to multilingual neural machine translation. The multilingual speech translation model is trained on the CoVoST 2 dataset in all possible directions, and we utilize LASER to extract parallel bitext data for SVCCA analysis. We derive three major findings from our analysis: (I) Linguistic similarity loses its efficacy in multilingual speech translation when the training data for a specific language is limited. (II) Enhanced encoder representations and well-aligned audio-text data significantly improve translation quality, surpassing the bilingual counterparts when the training data is not compromised. (III) The encoder representations of multilingual speech translation demonstrate superior performance in predicting phonetic features in linguistic typology prediction. With these findings, we propose that releasing the constraint of limited data for low-resource languages and subsequently combining them with linguistically related high-resource languages could offer a more effective approach for multilingual end-to-end speech translation. Comment: Accepted to Findings of EMNLP 2023
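
    SVCCA itself is a small, model-agnostic computation: denoise each set of activations with an SVD, then measure how well the two denoised subspaces align through canonical correlations. The sketch below is a generic NumPy implementation of that recipe, assuming the rows of X and Y are representations of the same LASER-aligned sentence pairs; it is not the authors' exact analysis pipeline.

    # Sketch: SVCCA similarity between two sets of sample-aligned representations.
    import numpy as np

    def svcca_similarity(X, Y, var_kept=0.99, eps=1e-10):
        """X, Y: (n_samples, dim) activation matrices for the same n inputs,
        e.g. mean-pooled encoder states of two languages or two layers.
        Returns the mean canonical correlation after SVD-based denoising."""
        def svd_reduce(A):
            A = A - A.mean(axis=0, keepdims=True)
            U, s, _ = np.linalg.svd(A, full_matrices=False)
            # keep enough singular directions to explain var_kept of the variance
            k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), var_kept)) + 1
            return U[:, :k] * s[:k]

        def orthonormal_basis(A):
            U, s, _ = np.linalg.svd(A, full_matrices=False)
            return U[:, s > eps]

        Qx = orthonormal_basis(svd_reduce(X))
        Qy = orthonormal_basis(svd_reduce(Y))
        # canonical correlations are the singular values of Qx^T Qy
        rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
        return float(rho.mean())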

    Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation

    Full text link
    The recently proposed massively multilingual neural machine translation (NMT) system has been shown to be capable of translating over 100 languages to and from English within a single model. Its improved translation performance on low resource languages hints at potential cross-lingual transfer capability for downstream tasks. In this paper, we evaluate the cross-lingual effectiveness of representations from the encoder of a massively multilingual NMT model on 5 downstream classification and sequence labeling tasks covering a diverse set of over 50 languages. We compare against a strong baseline, multilingual BERT (mBERT), in different cross-lingual transfer learning scenarios and show gains in zero-shot transfer in 4 out of these 5 tasks.
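
    The transfer setup behind these results is worth spelling out: the NMT encoder is frozen, sentence-level features are read off it, a light classifier is fit on English task data only, and accuracy is then measured directly on other languages. The sketch below shows that protocol; encode_sentences is a hypothetical placeholder for mean-pooled encoder states (here it returns random vectors so the snippet runs standalone) and is not an API of the actual system.

    # Sketch: zero-shot cross-lingual evaluation of frozen encoder features.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def encode_sentences(sentences, dim=512):
        # Placeholder: substitute mean-pooled hidden states of the multilingual
        # NMT encoder for these random features.
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(sentences), dim))

    def zero_shot_transfer(en_train, en_labels, xx_test, xx_labels):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(encode_sentences(en_train), en_labels)          # train on English only
        return clf.score(encode_sentences(xx_test), xx_labels)  # test on another language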

    Massively Multilingual Neural Machine Translation

    Full text link
    Multilingual neural machine translation (NMT) enables training a single model that supports translation from multiple source languages into multiple target languages. In this paper, we push the limits of multilingual NMT in terms of number of languages being used. We perform extensive experiments in training massively multilingual NMT models, translating up to 102 languages to and from English within a single model. We explore different setups for training such models and analyze the trade-offs between translation quality and various modeling decisions. We report results on the publicly available TED talks multilingual corpus where we show that massively multilingual many-to-many models are effective in low resource settings, outperforming the previous state-of-the-art while supporting up to 59 languages. Our experiments on a large-scale dataset with 102 languages to and from English and up to one million examples per direction also show promising results, surpassing strong bilingual baselines and encouraging future work on massively multilingual NMT. Comment: Accepted as a long paper in NAACL 2019
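
    One of the modeling decisions such a many-to-many setup has to make is how to balance directions whose corpora differ by orders of magnitude. A common recipe, shown below, is temperature-based sampling of language pairs; it is named here as a general technique, not as the specific strategy used in this paper, and the corpus sizes are illustrative assumptions.

    # Sketch: temperature-based sampling probabilities over language pairs.
    # T = 1 reproduces the natural data distribution; larger T upsamples
    # low-resource pairs toward uniform.
    import numpy as np

    def sampling_probs(sizes, temperature=5.0):
        """sizes: dict mapping a language pair to its number of training examples."""
        pairs = list(sizes)
        p = np.array([sizes[k] for k in pairs], dtype=float)
        p = (p / p.sum()) ** (1.0 / temperature)
        p /= p.sum()
        return dict(zip(pairs, p))

    print(sampling_probs({"en-fr": 5_000_000, "en-ro": 200_000, "en-gl": 10_000}))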

    Meta-Learning for Low-Resource Neural Machine Translation

    Full text link
    In this paper, we propose to extend the recently introduced model-agnostic meta-learning algorithm (MAML) for low-resource neural machine translation (NMT). We frame low-resource translation as a meta-learning problem, and we learn to adapt to low-resource languages based on multilingual high-resource language tasks. We use the universal lexical representation (Gu et al., 2018) to overcome the input-output mismatch across different languages. We evaluate the proposed meta-learning strategy using eighteen European languages (Bg, Cs, Da, De, El, Es, Et, Fr, Hu, It, Lt, Nl, Pl, Pt, Sk, Sl, Sv and Ru) as source tasks and five diverse languages (Ro, Lv, Fi, Tr and Ko) as target tasks. We show that the proposed approach significantly outperforms the multilingual, transfer learning based approach (Zoph et al., 2016) and enables us to train a competitive NMT system with only a fraction of training examples. For instance, the proposed approach can achieve as high as 22.04 BLEU on Romanian-English WMT'16 by seeing only 16,000 translated words (roughly 600 parallel sentences). Comment: Accepted as a full paper at EMNLP 2018
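
    The core of the approach is the MAML update structure: adapt a copy of the shared parameters to each task on a small support set, then move the shared parameters in the direction that makes such adaptation work well on held-out query data. The sketch below is a first-order approximation of that loop on a toy linear model (the full algorithm differentiates through the inner loop, and the paper applies it to a full NMT system); all dimensions, learning rates, and task data are illustrative assumptions.

    # Sketch: first-order MAML on a toy regression model.
    import torch

    def fomaml(tasks, dim=8, inner_lr=0.05, meta_lr=0.01, inner_steps=3, meta_epochs=100):
        w = torch.zeros(dim)                                 # meta-parameters

        def loss(params, x, y):
            return ((x @ params - y) ** 2).mean()

        for _ in range(meta_epochs):
            meta_grad = torch.zeros_like(w)
            for (xs, ys), (xq, yq) in tasks:                 # support / query split per task
                fast = w.clone().requires_grad_(True)
                for _ in range(inner_steps):                 # inner loop: adapt to the task
                    g, = torch.autograd.grad(loss(fast, xs, ys), fast)
                    fast = (fast - inner_lr * g).detach().requires_grad_(True)
                gq, = torch.autograd.grad(loss(fast, xq, yq), fast)
                meta_grad += gq / len(tasks)                 # first-order approximation
            w -= meta_lr * meta_grad                         # outer loop: meta-update
        return w

    if __name__ == "__main__":
        torch.manual_seed(0)
        def make_task(true_w, n=32):
            x = torch.randn(n, 8)
            return (x[:16], x[:16] @ true_w), (x[16:], x[16:] @ true_w)
        tasks = [make_task(torch.randn(8)) for _ in range(4)]
        print(fomaml(tasks))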

    An Efficient Method for Generating Synthetic Data for Low-Resource Machine Translation – An empirical study of Chinese, Japanese to Vietnamese Neural Machine Translation

    Get PDF
    Data sparsity is one of the challenges for low-resource language pairs in Neural Machine Translation (NMT). Previous works have presented different approaches for data augmentation, but they mostly require additional resources and tend to produce low-quality synthetic data in low-resource settings. This paper proposes a simple and effective novel method for generating synthetic bilingual data without using external resources, unlike previous approaches. Moreover, recent work has shown that multilingual translation or transfer learning can boost translation quality in low-resource situations. However, for logographic languages such as Chinese or Japanese, this approach is still limited due to differences in the translation units of the vocabularies. Although Japanese texts contain Kanji characters that are derived from Chinese characters, and the two are quite homologous in shape and meaning, the word order of these languages diverges considerably. Our study investigates these effects on machine translation. In addition, a combined pre-trained model is also leveraged to demonstrate the efficacy of translation tasks in a higher-resource scenario. Our experiments show performance improvements of up to +6.2 and +7.8 BLEU over bilingual baseline systems on two low-resource translation tasks, Chinese to Vietnamese and Japanese to Vietnamese.