A Multilingual Parallel Corpora Collection Effort for Indian Languages

Author: Jawahar C V
Namboodiri Vinay P.
Philip Jerin
Siripragada Shashank
Publication date: 11/05/2020

We present sentence-aligned parallel corpora across 10 Indian languages - Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi - and English, many of which are categorized as low resource. The corpora are compiled from online sources that share content across languages. They significantly extend present resources, which are either not large enough or restricted to a specific domain (such as health). We also provide a separate test corpus, compiled from an independent online source, that can be used independently to validate performance in the 10 Indian languages. We also report on methods for constructing such corpora using tools enabled by recent advances in machine translation and cross-lingual retrieval based on deep neural networks. Comment: 9 pages. Accepted in LREC 202

arXiv.org e-Print Archive

Edinburgh Research Explorer

Translation Quality Estimation for Indian Languages

Author: Gupta Manish
Jhaveri Nisarg
Varma Vasudeva
Publication venue: European Association for Machine Translation
Publication date: 01/01/2018

Translation Quality Estimation (QE) aims to estimate the quality of machine translation (MT) output without any human intervention or reference translation. With the increasing use of MT systems in cross-lingual applications, the need for and applicability of QE systems are increasing. We study existing approaches and propose multiple neural network approaches for sentence-level QE, with a focus on MT outputs in Indian languages. For this, we also introduce five new datasets for four language pairs: two for English–Gujarati, and one each for English–Hindi, English–Telugu and English–Bengali. These include one manually post-edited dataset for English–Gujarati. These Indian languages are spoken by around 689M speakers worldwide. We compare results obtained using our proposed models with multiple state-of-the-art systems, including the winning system in the WMT17 shared task on QE, and show that our proposed neural model, which combines the discriminative power of carefully chosen features with Siamese Convolutional Neural Networks (CNNs), works best for all Indian language datasets.

Repositorio Institucional de la Universidad de Alicante

Revisiting Low Resource Status of Indian Languages in Machine Translation

Author: Arora Sanjeev
Barrault Loïc
Bañón Marta
Dabre Raj
Goyal Vikrant
Jha Girish Nath
Koehn Philipp
Kudo Taku
Kunchukuttan Anoop
Nakazawa Toshiaki
Papineni Kishore
Parida Shantipriya
Post Matt
Ramasamy Loganathan
Rudrabha Mukhopadhyay Prajwal KR
Schwenk Holger
Sennrich Rico
Siripragada Shashank
Publication venue: Association for Computing Machinery (ACM)
Publication date: 04/11/2020

Indian language machine translation performance is hampered by the lack of large-scale, multilingual, sentence-aligned corpora and robust benchmarks. In this paper, we provide and analyse an automated framework for obtaining such a corpus for Indian language neural machine translation (NMT) systems. Our pipeline consists of a baseline NMT system, a retrieval module, and an alignment module, and is applied to publicly available websites such as government press releases. Our main contribution is an incremental method that uses this pipeline to iteratively increase the size of the corpus and improve each component of our system. We also evaluate design choices such as the choice of pivot language and the effect of iteratively increasing the corpus size. In addition to providing an automated framework, our work yields a larger corpus than existing corpora available for Indian languages. This corpus helps us obtain substantially improved results on the publicly available WAT evaluation benchmark and other standard evaluation benchmarks. Comment: 10 pages, few figures, Preprint under revie

arXiv.org e-Print Archive

Crossref