Revisiting Low Resource Status of Indian Languages in Machine
  Translation

Arora Sanjeev; Barrault Loïc; Bañón Marta; Dabre Raj; Goyal Vikrant; Jha Girish Nath; Koehn Philipp; Kudo Taku; Kunchukuttan Anoop; Nakazawa Toshiaki; Papineni Kishore; Parida Shantipriya; Post Matt; Ramasamy Loganathan; Rudrabha Mukhopadhyay Prajwal KR; Schwenk Holger; Sennrich Rico; Siripragada Shashank

Publication date: 4 November 2020
Publisher: 'Association for Computing Machinery (ACM)'

Abstract

Indian language machine translation performance is hampered by the lack of large-scale, multilingual, sentence-aligned corpora and robust benchmarks. In this paper, we present and analyse an automated framework for obtaining such a corpus for Indian language neural machine translation (NMT) systems. Our pipeline consists of a baseline NMT system, a retrieval module, and an alignment module, and it is applied to publicly available websites such as government press releases. Our main contribution is an incremental method that uses this pipeline to iteratively grow the corpus while also improving each component of the system. We also evaluate design choices such as the choice of pivot language and the effect of iteratively increasing the corpus size. In addition to the automated framework, our work yields a corpus that is larger than the existing corpora available for Indian languages. This corpus helps us obtain substantially improved results on the publicly available WAT evaluation benchmark and other standard evaluation benchmarks.

Comment: 10 pages, few figures, Preprint under review
