Revisiting Low Resource Status of Indian Languages in Machine Translation
Machine translation performance for Indian languages is hampered by the lack of large-scale, multilingual, sentence-aligned corpora and robust benchmarks. In this paper, we present and analyse an automated framework for obtaining such a corpus for Indian-language neural machine translation (NMT) systems. Our pipeline consists of a baseline NMT system, a retrieval module, and an alignment module, and operates on publicly available sources such as government press releases. The main contribution of this effort is an incremental method that uses this pipeline to iteratively grow the corpus while improving each of the components of our system. We also evaluate design choices such as the choice of pivot language and the effect of iteratively increasing the corpus size. Besides providing an automated framework, our work yields a corpus that is larger than the existing corpora available for Indian languages. This corpus helps us obtain substantially improved results on the publicly available WAT evaluation benchmark and other standard evaluation benchmarks.
Comment: 10 pages, few figures, preprint under review
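As a rough illustration of the iterative scheme the abstract describes, the sketch below shows one way such a mine-and-retrain loop could be structured. The helper functions (train_nmt, translate, retrieve_candidates, align_pairs) and the acceptance threshold are hypothetical placeholders standing in for the paper's components, not the authors' code.

```python
# Hypothetical sketch of the iterative corpus-mining loop. train_nmt,
# translate, retrieve_candidates, and align_pairs are placeholder
# stand-ins for the paper's NMT, retrieval, and alignment modules.

def mine_corpus(seed_corpus, document_pairs, n_rounds=3, threshold=0.8):
    """Grow a parallel corpus by alternating mining and re-training."""
    corpus = list(seed_corpus)
    for _ in range(n_rounds):
        # Re-train the baseline NMT system on the corpus mined so far.
        model = train_nmt(corpus)
        for src_doc, tgt_doc in document_pairs:
            # Retrieval module: translate source sentences, then use the
            # translations to retrieve candidate target sentences.
            candidates = retrieve_candidates(
                [translate(model, s) for s in src_doc], tgt_doc
            )
            # Alignment module: keep only pairs whose alignment score
            # clears the acceptance threshold.
            corpus.extend(
                (src, tgt)
                for src, tgt, score in align_pairs(src_doc, candidates)
                if score >= threshold
            )
    return corpus
```

Each round both enlarges the corpus and, by re-training on it, strengthens the very components doing the mining, which is the incremental effect the abstract highlights.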
"A Little is Enough": Few-Shot Quality Estimation based Corpus Filtering improves Machine Translation
Quality Estimation (QE) is the task of evaluating the quality of a translation when a reference translation is not available. The goal of QE aligns with the task of corpus filtering, where we assign a quality score to each sentence pair in a pseudo-parallel corpus. We propose a Quality Estimation based filtering approach to extract high-quality parallel data from a pseudo-parallel corpus. To the best of our knowledge, this is a novel adaptation of the QE framework to extracting a quality parallel corpus from a pseudo-parallel corpus. Training on this filtered corpus improves the Machine Translation (MT) system's performance by up to 1.8 BLEU points over the baseline model for the English-Marathi, Chinese-English, and Hindi-Bengali language pairs, where the baseline model is trained on the whole pseudo-parallel corpus. Our few-shot QE model, transfer-learned from the English-Marathi QE model and fine-tuned on only 500 Hindi-Bengali training instances, shows an improvement of up to 0.6 BLEU points for the Hindi-Bengali language pair compared to the baseline model. This demonstrates the promise of transfer learning in the few-shot setting. QE systems typically require on the order of 7K-25K training instances; our Hindi-Bengali QE model is trained on only 500 instances, about 1/40th of the typical requirement, yet achieves comparable performance. All the scripts and datasets used in this study will be made publicly available.
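The filtering step itself is straightforward once a scorer is available. The paper scores pairs with a trained (few-shot) QE model; as a stand-in, the sketch below scores each pseudo-parallel pair with LaBSE embedding cosine similarity via the sentence-transformers library and keeps pairs above a threshold. The scorer choice and the threshold value are illustrative assumptions, not the paper's setup.

```python
# Sketch of QE-based corpus filtering. The paper uses a trained QE model
# to score pairs; LaBSE cosine similarity is an illustrative stand-in
# scorer here (an assumption, not the paper's setup).
from sentence_transformers import SentenceTransformer, util

def filter_pseudo_parallel(pairs, threshold=0.75):
    """Keep (src, tgt) pairs whose quality score clears the threshold."""
    model = SentenceTransformer("sentence-transformers/LaBSE")
    src_emb = model.encode([s for s, _ in pairs], convert_to_tensor=True)
    tgt_emb = model.encode([t for _, t in pairs], convert_to_tensor=True)
    # Cosine similarity of each aligned pair serves as its quality score.
    scores = util.cos_sim(src_emb, tgt_emb).diagonal()
    return [pair for pair, s in zip(pairs, scores) if s.item() >= threshold]

noisy_corpus = [
    ("A little is enough.", "थोड़ा ही काफ़ी है।"),        # well-aligned pair
    ("The weather is nice.", "यह वाक्य असंबंधित है।"),  # misaligned pair
]
print(filter_pseudo_parallel(noisy_corpus, threshold=0.6))
```

Only the scorer changes between this stand-in and the paper's approach; the filtering logic (score every pair, keep the high-scoring ones, retrain MT on the survivors) stays the same.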
IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
India has a rich linguistic landscape with languages from 4 major language
families spoken by over a billion people. The 22 of these languages that are listed in the Constitution of India (referred to as scheduled languages) are the focus of
this work. Given the linguistic diversity, high-quality and accessible Machine
Translation (MT) systems are essential in a country like India. Prior to this work, there were (i) no parallel training data spanning all 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models that support all 22 scheduled languages of India. In this work, we aim to address this gap
by focusing on the missing pieces required for enabling wide, easy, and open
access to good machine translation systems for all 22 scheduled Indian
languages. We identify four key areas of improvement: curating and creating
larger training datasets, creating diverse and high-quality benchmarks,
training multilingual models, and releasing models with open access. Our first
contribution is the release of the Bharat Parallel Corpus Collection (BPCC),
the largest publicly available collection of parallel corpora for Indic languages. BPCC contains a total of 230M bitext pairs, of which 126M are newly added, including 644K manually translated sentence pairs created as part of
this work. Our second contribution is the release of the first n-way parallel
benchmark covering all 22 Indian languages, featuring diverse domains,
Indian-origin content, and source-original test sets. Next, we present
IndicTrans2, the first model to support all 22 languages, surpassing existing
models on multiple existing benchmarks as well as on the new benchmarks created as part of this work.
Lastly, to promote accessibility and collaboration, we release our models and
associated data with permissive licenses at
https://github.com/ai4bharat/IndicTrans2
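For readers who want to try the released models, a minimal inference sketch follows. It is based on the Hugging Face interface and the IndicTransToolkit preprocessor documented in the linked repository; the checkpoint name, the trust_remote_code requirement, and the processor API are assumptions that should be checked against the repository's current README.

```python
# Minimal inference sketch for IndicTrans2, assuming the Hugging Face
# checkpoints and the IndicTransToolkit preprocessor described in the
# repository README; verify names against the current repo before use.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # pip install IndicTransToolkit

ckpt = "ai4bharat/indictrans2-en-indic-1B"  # En -> Indic direction
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt, trust_remote_code=True)

ip = IndicProcessor(inference=True)
sentences = ["India has a rich linguistic landscape."]
# Preprocessing adds the source/target language tags the model expects.
batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva")
inputs = tokenizer(batch, padding="longest", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_length=256)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(ip.postprocess_batch(decoded, lang="hin_Deva"))
```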