A Correlational Encoder Decoder Architecture for Pivot Based Sequence Generation
Interlingua-based Machine Translation (MT) aims to encode multiple languages
into a common linguistic representation and then decode sentences in multiple
target languages from this representation. In this work we explore this idea in
the context of neural encoder-decoder architectures, albeit on a smaller scale
and without MT as the end goal. Specifically, we consider the case of three
languages or modalities X, Z and Y, where we are interested in generating
sequences in Y starting from information available in X. However, no parallel
training data is available between X and Y; training data is available only
between X & Z and between Z & Y (as is often the case in many real-world
applications). Z thus acts as a pivot/bridge. An obvious solution, which is
perhaps less elegant but works very well in practice, is to train a two-stage
model which first converts from X to Z and then from Z to Y. Instead, we explore
an interlingua-inspired solution which jointly learns to (i) encode X and Z
into a common representation and (ii) decode Y from this common representation.
We evaluate our model on two tasks: (i) bridge transliteration and (ii) bridge
captioning. We report promising results in both applications and believe this
is a step in the right direction towards truly interlingua-inspired
encoder-decoder architectures.
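
To make the joint objective concrete, here is a minimal PyTorch sketch of the idea: two encoders map X and Z into a shared representation, a decoder generates Y from that space, and an alignment term ties the two encodings together. The L2 alignment term below stands in for the paper's correlational loss, and the module names, GRU choice, and sizes are illustrative assumptions, not the authors' exact architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class PivotEncoderDecoder(nn.Module):
    def __init__(self, vocab_x, vocab_z, vocab_y, d=256):
        super().__init__()
        self.emb_x = nn.Embedding(vocab_x, d)
        self.emb_z = nn.Embedding(vocab_z, d)
        self.emb_y = nn.Embedding(vocab_y, d)
        self.enc_x = nn.GRU(d, d, batch_first=True)  # encoder for X
        self.enc_z = nn.GRU(d, d, batch_first=True)  # encoder for Z
        self.dec = nn.GRU(d, d, batch_first=True)    # decoder for Y
        self.out = nn.Linear(d, vocab_y)

    def encode_x(self, x):
        _, h = self.enc_x(self.emb_x(x))   # h: (1, B, d) final hidden state
        return h

    def encode_z(self, z):
        _, h = self.enc_z(self.emb_z(z))
        return h

    def decode_y(self, h, y_in):
        out, _ = self.dec(self.emb_y(y_in), h)
        return self.out(out)               # (B, T, vocab_y) logits

def alignment_loss(model, x, z):
    # Computed on X-Z parallel batches: pull both encodings together
    # in the common space (L2 stand-in for a correlational loss).
    hx, hz = model.encode_x(x), model.encode_z(z)
    return ((hx - hz) ** 2).mean()

def generation_loss(model, z, y_in, y_out):
    # Computed on Z-Y parallel batches: decode Y from the Z-side encoding.
    logits = model.decode_y(model.encode_z(z), y_in)
    return F.cross_entropy(logits.flatten(0, 1), y_out.flatten())
```

Because the alignment loss only needs X-Z pairs and the generation loss only needs Z-Y pairs, the model never requires X-Y parallel data, which is the point of the pivot setup.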
Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages
Under-resourced languages pose a significant challenge for statistical approaches to machine translation, and it has recently been shown that using training data from closely related languages can improve machine translation quality for these languages. While languages within the same family share many properties, many under-resourced languages are written in their own native scripts, which makes exploiting these similarities difficult. In this paper, we propose to alleviate the problem of different scripts by transcribing the native script into a common representation, i.e. the Latin script or the International Phonetic Alphabet (IPA). In particular, we compare coarse-grained transliteration into the Latin script with fine-grained IPA transcription. We performed experiments on the English-Tamil, English-Telugu, and English-Kannada translation tasks. Our results show improvements in BLEU, METEOR and chrF scores from transliteration, and we find that transliteration into the Latin script outperforms the fine-grained IPA transcription.
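
As a toy illustration of the coarse-grained variant, the sketch below romanizes Tamil text with a hand-written character table. The tiny table and the inherent-vowel handling are assumptions made for illustration only; a real preprocessing pipeline would use a complete transliteration table (or a phonetic resource for the fine-grained IPA variant).

```python
# Illustrative fragment of a Tamil-to-Latin mapping (not a full table).
CONSONANTS = {"த": "t", "ம": "m", "ழ": "zh"}
VOWEL_SIGNS = {"ி": "i", "ா": "aa"}
VIRAMA = "்"  # suppresses the consonant's inherent vowel

def romanize(text: str) -> str:
    out = []
    for i, ch in enumerate(text):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = text[i + 1] if i + 1 < len(text) else ""
            # Brahmic consonants carry an inherent 'a' unless a vowel
            # sign or virama follows.
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch == VIRAMA:
            pass  # inherent vowel already suppressed above
        else:
            out.append(ch)  # pass through anything unmapped
    return "".join(out)

print(romanize("தமிழ்"))  # -> "tamizh"
```

Once both sides of a related-language pair are mapped into the same Latin representation like this, their shared vocabulary becomes visible to the MT system.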
Multilingual Neural Machine Translation System for Indic to Indic Languages
This paper presents an Indic-to-Indic (IL-IL) MNMT baseline model for 11 ILs,
trained on the Samanantar corpus and evaluated on the Flores-200 corpus. All
models are evaluated using the BLEU score. In addition, the languages are
classified into three groups, namely East Indo-Aryan (EI), Dravidian (DR), and
West Indo-Aryan (WI), and the effect of language relatedness on MNMT model
performance is studied. Owing to the availability of large corpora from English
(EN) to ILs, MNMT IL-IL models using EN as a pivot are also built and examined.
To this end, English-Indic (EN-IL) models are also developed, with and without
the use of related languages. Results reveal that using related languages is
beneficial for the WI group only, is detrimental for the EI group, and has an
inconclusive effect on the DR group, but it is useful for EN-IL models. Thus,
related language groups are used to develop the pivot MNMT models. Furthermore,
the IL corpora are transliterated from their respective scripts into a modified
ITRANS script, and the best MNMT models from the previous approaches are
retrained on the transliterated corpus. It is observed that pivot models
greatly improve over the MNMT baselines, with AS-TA achieving the minimum BLEU
score and PA-HI achieving the maximum. Among the languages, AS, ML, and TA
achieve the lowest BLEU scores, whereas HI, PA, and GU perform best.
Transliteration also helps the models, with a few exceptions. Across all
languages, the largest score increments are observed for ML, TA, and BN, and
the smallest average increments for KN, HI, and PA. The best model obtained is
the PA-HI language pair trained on the PAWI transliterated corpus, which
achieves a BLEU score of 24.29.
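
At inference time, the pivot setup described above reduces to chaining two trained models through English. The sketch below is a minimal illustration under that assumption; the translate callables, their names, and the optional transliteration hook are placeholders for the paper's trained models, not the models themselves.

```python
from typing import Callable, Optional

def pivot_translate(
    src_sentence: str,
    src_to_en: Callable[[str], str],
    en_to_tgt: Callable[[str], str],
    transliterate: Optional[Callable[[str], str]] = None,
) -> str:
    # Optionally normalize the source into a shared script
    # (e.g. the modified ITRANS representation) before translating.
    text = transliterate(src_sentence) if transliterate else src_sentence
    english = src_to_en(text)   # stage 1: source IL -> English pivot
    return en_to_tgt(english)   # stage 2: English -> target IL

# Demonstration with stand-in models (real NMT models would go here):
demo = pivot_translate(
    "as_source_sentence",
    src_to_en=lambda s: f"EN({s})",
    en_to_tgt=lambda s: f"TA({s})",
)
```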