Iterative Data Augmentation for Neural Machine Translation: a Low Resource Case Study for English–Telugu

Dandapat, Sandipan; Federmann, Christian

Authors: Sandipan Dandapat, Christian Federmann
Publication date: 1 January 2018
Publisher: European Association for Machine Translation

Abstract

Telugu is the fifteenth most commonly spoken language in the world, with an estimated reach of 75 million people in the Indian subcontinent. At the same time, it is a severely low-resourced language. In this paper, we present work on English–Telugu general-domain machine translation (MT) systems using small amounts of parallel data. The baseline statistical MT (SMT) and neural MT (NMT) systems do not yield acceptable translation quality, mostly due to limited resources. However, the use of synthetic parallel data (generated using back translation, based on an NMT engine) significantly improves translation quality and allows NMT to outperform SMT. We extend back translation and propose a new, iterative data augmentation (IDA) method. Filtering of synthetic data and IDA both further boost the translation quality of our final NMT systems, as measured by BLEU scores on all test sets and by state-of-the-art human evaluation.
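The abstract does not spell out the IDA algorithm, so the following is only a minimal sketch of a generic iterative back-translation loop with filtering, not the authors' method. The names `iterative_augmentation`, `toy_train`, and `non_empty` are hypothetical, the dictionary-lookup "model" stands in for a real NMT engine, and the filter criterion is an assumption chosen for illustration.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (English source, Telugu target)
Translator = Callable[[str], str]


def iterative_augmentation(
    parallel: List[Pair],
    mono_target: List[str],
    train: Callable[[List[Pair]], Translator],
    keep: Callable[[Pair], bool],
    rounds: int = 2,
) -> List[Pair]:
    """Grow a training set with filtered back-translated pairs (hypothetical loop)."""
    data = list(parallel)
    for _ in range(rounds):
        # Train a target-to-source model on the current data (pairs are flipped).
        back_model = train([(tgt, src) for src, tgt in data])
        # Back-translate monolingual target text into synthetic source text.
        synthetic = [(back_model(tgt), tgt) for tgt in mono_target]
        # Keep the authentic pairs, then add synthetic pairs that pass the filter.
        data = list(parallel) + [pair for pair in synthetic if keep(pair)]
    return data


def toy_train(pairs: List[Pair]) -> Translator:
    """Stand-in for NMT training: a lookup table that returns '' when unsure."""
    table = dict(pairs)
    return lambda sentence: table.get(sentence, "")


def non_empty(pair: Pair) -> bool:
    """Assumed filter: drop synthetic pairs whose source side is empty."""
    return bool(pair[0])


if __name__ == "__main__":
    seed = [("hello", "namaskaram")]
    mono = ["namaskaram", "dhanyavadalu"]
    for src, tgt in iterative_augmentation(seed, mono, toy_train, non_empty):
        print(src, "|", tgt)
```

In a real setup, `train` would fit an NMT model and `keep` would apply whatever quality criterion is chosen for the synthetic data; each round retrains on the enlarged set so that later back translations can benefit from earlier ones.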
