3 research outputs found
Revisiting Low Resource Status of Indian Languages in Machine Translation
Indian language machine translation performance is hampered by the lack
of large-scale multilingual sentence-aligned corpora and robust benchmarks.
In this paper, we present and analyse an automated framework for obtaining
such a corpus for Indian language neural machine translation (NMT) systems. Our
pipeline consists of a baseline NMT system, a retrieval module, and an
alignment module, and it operates on publicly available web sources such as
government press releases. Our main contribution is an incremental method that
uses this pipeline to iteratively increase the size of the corpus while also
improving each component of the system. We also evaluate design choices such as
the choice of pivot language and the effect of iteratively increasing the
corpus size. Besides providing an automated framework, our work yields a
corpus that is larger than the existing corpora available for Indian languages.
This corpus helps us obtain
substantially improved results on the publicly available WAT evaluation
benchmark and other standard evaluation benchmarks.
Comment: 10 pages, few figures, preprint under review
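The incremental loop described in this abstract (translate, retrieve, align, grow the corpus, retrain) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the "NMT system" is replaced by a word lexicon built from equal-length sentence pairs, retrieval and alignment are collapsed into a Jaccard token-overlap match, and the names `train_toy_translator`, `mine_pairs`, and `iterative_mining` are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def train_toy_translator(corpus: List[Pair]) -> Dict[str, str]:
    """Stand-in for NMT training (assumption): build a word lexicon
    from sentence pairs that have the same number of tokens."""
    lexicon: Dict[str, str] = {}
    for src, tgt in corpus:
        src_tokens, tgt_tokens = src.split(), tgt.split()
        if len(src_tokens) == len(tgt_tokens):
            for s, t in zip(src_tokens, tgt_tokens):
                lexicon.setdefault(s, t)
    return lexicon


def translate(lexicon: Dict[str, str], sentence: str) -> str:
    """Translate word by word; unknown words are copied unchanged."""
    return " ".join(lexicon.get(w, w) for w in sentence.split())


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two sentences."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def mine_pairs(lexicon: Dict[str, str], src_sents: List[str],
               tgt_sents: List[str], threshold: float) -> List[Pair]:
    """Retrieval + alignment stand-in: pair each source sentence with the
    target sentence closest to its translation, if similar enough."""
    mined: List[Pair] = []
    for src in src_sents:
        hyp = translate(lexicon, src)
        best = max(tgt_sents, key=lambda t: jaccard(hyp, t))
        if jaccard(hyp, best) >= threshold:
            mined.append((src, best))
    return mined


def iterative_mining(seed: List[Pair], src_sents: List[str],
                     tgt_sents: List[str], threshold: float = 0.5,
                     max_rounds: int = 5) -> List[Pair]:
    """Grow the corpus round by round, retraining the toy translator
    on the enlarged corpus before each mining pass."""
    corpus = list(seed)
    for _ in range(max_rounds):
        lexicon = train_toy_translator(corpus)
        known = set(corpus)
        new_pairs = [p for p in mine_pairs(lexicon, src_sents, tgt_sents, threshold)
                     if p not in known]
        if not new_pairs:
            break  # no further growth: stop iterating
        corpus.extend(new_pairs)
    return corpus
```

With a one-pair seed such as `[("a b", "x y")]`, source sentences `["b a c", "c a"]`, and target sentences `["y x z", "z x"]`, the second pair is only mined after the first mined pair has been added to the corpus, which is the kind of effect an incremental scheme relies on. A real system would replace each stand-in with a trained model and a proper sentence aligner.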
Hindi to English: Transformer-Based Neural Machine Translation
Machine Translation (MT) is one of the most prominent tasks in Natural
Language Processing (NLP). It involves automatically converting text from
one natural language to another while preserving meaning and fluency.
Although research in machine translation has been ongoing for several
decades, the newer approach of integrating deep learning techniques into natural
language processing has led to significant improvements in translation
quality. In this paper, we have developed a Neural Machine Translation (NMT)
system by training the Transformer model to translate text from Hindi, an
Indian language, to English. Because Hindi is a low-resource language, it has
been difficult for neural networks to model, which has slowed the development
of neural machine translation systems for it. To address this gap, we
implemented back-translation to augment the training data. To create the
vocabulary, we experimented with both word-level and subword-level
tokenization using Byte Pair Encoding (BPE), training the Transformer in 10
different configurations in total. One of these configurations achieved a
state-of-the-art BLEU score of 24.53 on the test set of the IIT Bombay
English-Hindi Corpus.
Comment: 10 pages, 2 figures
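The subword tokenization mentioned in this abstract relies on Byte Pair Encoding. As a minimal sketch of how BPE merge rules are commonly learned (not the authors' implementation; the function name `learn_bpe`, the `</w>` end-of-word marker, and the toy word counts are assumptions for illustration), the most frequent adjacent symbol pair is merged repeatedly:

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` BPE merge rules from word frequencies."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab: Dict[Tuple[str, ...], int] = {
        tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()
    }
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Ties are broken by first occurrence (dict insertion order).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab: Dict[Tuple[str, ...], int] = {}
        for symbols, freq in vocab.items():
            merged: List[str] = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges
```

For example, `learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)` learns merges that build up the shared suffix `est</w>`. The number of merges controls the subword vocabulary size, which is one way configurations like those described above can differ.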