Role of Morphology Injection in Statistical Machine Translation
Phrase-based statistical models are the most commonly used because they perform
well in terms of both translation quality and system complexity.
Hindi, and Indian languages in general, are morphologically richer than
English. Hence, even though phrase-based systems perform very well for less
divergent language pairs, English to Indian language translation needs
more linguistic information (such as morphology, parse trees, and part-of-speech
tags) on the source side. Factored models seem useful in this case,
as they treat each word as a vector of factors. These factors can
carry any information about the surface word, which the model uses while translating.
Hence, the objective of this work is to handle morphological inflections in
Hindi and Marathi using Factored translation models while translating from
English. SMT approaches face the problem of data sparsity while translating
into a morphologically rich language. It is very unlikely for a parallel corpus
to contain all morphological forms of words. We propose a solution to generate
these unseen morphological forms and inject them into original training
corpora. In this paper, we study factored models and the problem of sparseness
in the context of translation into morphologically rich languages. We propose a
simple and effective solution which is based on enriching the input with
various morphological forms of words. We observe that morphology injection
improves the quality of translation in terms of both adequacy and fluency. We
verify this with experiments on two morphologically rich languages: Hindi
and Marathi, while translating from English.
Comment: 36 pages, 12 figures, 15 tables. Modified version published in: ACM
Transactions on Asian and Low-Resource Language Information Processing
(TALLIP), Volume 17, Issue 1, September 2017, Issue-in-Progress, Article No.
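
The two ideas in the abstract, a word represented as a vector of factors and the injection of unseen morphological forms into the training corpus, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `FactoredWord` class, the `TOY_SUFFIXES` table (placeholder suffixes, not real Hindi or Marathi morphology), the choice of factors, and the `generate_pairs` and `inject` helpers. The pipe-delimited output follows the convention that factored SMT toolkits such as Moses use for factored tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactoredWord:
    """A source word seen as a vector of factors (hypothetical factor set)."""
    surface: str
    lemma: str
    pos: str
    morph: str  # hypothetical morphological tag, e.g. "pl.obl"

    def to_factored(self) -> str:
        # Join the factors with "|" to form one factored token.
        return "|".join((self.surface, self.lemma, self.pos, self.morph))


# Placeholder suffixes keyed by (number, case). These are illustrative
# stand-ins, NOT real Hindi or Marathi inflections.
TOY_SUFFIXES = {
    ("sg", "dir"): "",
    ("sg", "obl"): "+OBL",
    ("pl", "dir"): "+PL",
    ("pl", "obl"): "+PL+OBL",
}


def generate_pairs(src_lemma, tgt_stem):
    """Generate (factored source token, inflected target form) pairs."""
    pairs = []
    for (number, case), suffix in TOY_SUFFIXES.items():
        src = FactoredWord(src_lemma, src_lemma, "NN", f"{number}.{case}")
        pairs.append((src.to_factored(), tgt_stem + suffix))
    return pairs


def inject(corpus, lexicon):
    """Append generated pairs that the corpus does not already contain.

    corpus:  list of (source, target) string pairs
    lexicon: dict mapping a source lemma to a target stem
    """
    seen = set(corpus)
    augmented = list(corpus)
    for src_lemma, tgt_stem in lexicon.items():
        for pair in generate_pairs(src_lemma, tgt_stem):
            if pair not in seen:
                seen.add(pair)
                augmented.append(pair)
    return augmented
```

For example, calling `inject` on a corpus that contains only the singular direct form of one noun appends the three remaining forms from the toy table, so the augmented corpus covers every (number, case) combination for that noun.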