Revisiting Low Resource Status of Indian Languages in Machine Translation
Indian language machine translation performance is hampered by the lack
of large-scale, multilingual, sentence-aligned corpora and robust benchmarks.
Through this paper, we provide and analyse an automated framework to obtain
such a corpus for Indian language neural machine translation (NMT) systems. Our
pipeline consists of a baseline NMT system, a retrieval module, and an
alignment module that works with publicly available web sources such as
government press releases. The main contribution of this effort is
an incremental method that uses the above pipeline to iteratively
grow the corpus and improve each of the components of our
system. We also evaluate design choices such as the
choice of pivoting language and the effect of iterative incremental increase in
corpus size. In addition to providing an automated framework, our work
yields a corpus larger than the existing
corpora available for Indian languages. This corpus helps us obtain
substantially improved results on the publicly available WAT evaluation
benchmark and other standard evaluation benchmarks.
Comment: 10 pages, few figures, Preprint under review
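The incremental loop described in this abstract (train a baseline, retrieve candidates, align, add the new pairs, repeat) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: a word-level dictionary stands in for the NMT system, token overlap stands in for the retrieval and alignment modules, and the names `train_toy_model`, `mine_round`, `grow_corpus`, and the toy sentences are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)
Translator = Callable[[str], str]


def overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two sentences."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def train_toy_model(corpus: List[Pair]) -> Translator:
    """Stand-in for NMT training (assumption): a word dictionary
    built from sentence pairs that have the same number of tokens."""
    table: Dict[str, str] = {}
    for src, tgt in corpus:
        s, t = src.split(), tgt.split()
        if len(s) == len(t):
            table.update(zip(s, t))
    return lambda sent: " ".join(table.get(w, w) for w in sent.split())


def mine_round(translate: Translator, src_sents: List[str],
               tgt_sents: List[str], threshold: float) -> List[Pair]:
    """Retrieve the closest target sentence for each translated source
    sentence and keep the pair if the overlap clears the threshold."""
    pairs: List[Pair] = []
    if not tgt_sents:
        return pairs
    for s in src_sents:
        hyp = translate(s)
        best = max(tgt_sents, key=lambda t: overlap(hyp, t))
        if overlap(hyp, best) >= threshold:
            pairs.append((s, best))
    return pairs


def grow_corpus(seed: List[Pair], src_sents: List[str], tgt_sents: List[str],
                threshold: float = 0.3, max_rounds: int = 5) -> List[Pair]:
    """Retrain on the current corpus, mine new pairs, and repeat
    until a round adds nothing new."""
    corpus = list(seed)
    for _ in range(max_rounds):
        translate = train_toy_model(corpus)
        mined = mine_round(translate, src_sents, tgt_sents, threshold)
        new = [p for p in mined if p not in corpus]
        if not new:
            break
        corpus.extend(new)
    return corpus


# Made-up toy data: one seed pair plus unaligned sentences on each side.
seed = [("ek do", "one two")]
src_sents = ["ek do teen", "teen chaar"]
tgt_sents = ["one two three", "three four"]
corpus = grow_corpus(seed, src_sents, tgt_sents)
```

In this toy run the pair for "teen chaar" is only recovered in the second round, after the first round has added a pair that teaches the stand-in model the word "teen". That is the sense in which each round can improve both the corpus and the components that mine it.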
A Correlational Encoder Decoder Architecture for Pivot Based Sequence Generation
Interlingua-based Machine Translation (MT) aims to encode multiple languages
into a common linguistic representation and then decode sentences in multiple
target languages from this representation. In this work, we explore this idea in
the context of neural encoder-decoder architectures, albeit on a smaller scale
and without MT as the end goal. Specifically, we consider the case of three
languages or modalities X, Z and Y wherein we are interested in generating
sequences in Y starting from information available in X. However, no
parallel training data is available between X and Y, although training data is
available between X and Z and between Z and Y (as is often the case in real-world
applications). Z thus acts as a pivot/bridge. An obvious solution, which is
perhaps less elegant but works very well in practice, is to train a two-stage
model which first converts from X to Z and then from Z to Y. Instead, we explore
an interlingua-inspired solution which jointly learns to (i)
encode X and Z to a common representation and (ii) decode Y from this common
representation. We evaluate our model on two tasks: (i) bridge transliteration
and (ii) bridge captioning. We report promising results in both these
applications and believe that this is a step in the right direction towards truly
interlingua-inspired encoder-decoder architectures.
Comment: 10 pages
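The two-stage baseline mentioned in this abstract (convert X to Z, then Z to Y) amounts to composing two independently trained models. A minimal sketch follows, assuming character lookup tables as stand-ins for the trained models; `table_model`, `two_stage`, and the toy symbol tables are hypothetical names and data, not taken from the paper. The jointly trained model the authors propose is not shown here.

```python
from typing import Callable, Dict

Model = Callable[[str], str]


def table_model(table: Dict[str, str]) -> Model:
    """Stand-in for a trained sequence model (assumption): maps each
    symbol through a lookup table and leaves unknown symbols unchanged."""
    def convert(seq: str) -> str:
        return "".join(table.get(ch, ch) for ch in seq)
    return convert


def two_stage(x_to_z: Model, z_to_y: Model) -> Model:
    """Pivot baseline: feed the X-to-Z output into the Z-to-Y model."""
    return lambda x: z_to_y(x_to_z(x))


# Toy symbol tables; no X-to-Y table exists, only X-to-Z and Z-to-Y.
x_to_z = table_model({"a": "A", "b": "B"})
z_to_y = table_model({"A": "1", "B": "2"})
bridge = two_stage(x_to_z, z_to_y)

result = bridge("abba")  # "1221"
```

Because the second stage only ever sees the first stage's output, any error made when converting X to Z is passed on to the Z to Y step.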