341 research outputs found
Modeling Target-Side Inflection in Neural Machine Translation
NMT systems have problems with large vocabulary sizes. Byte-pair encoding
(BPE) is a popular approach to solving this problem, but while BPE allows the
system to generate any target-side word, it does not enable effective
generalization over the rich vocabulary in morphologically rich languages with
strong inflectional phenomena. We introduce a simple approach to overcome this
problem by training a system to produce the lemma of a word and its
morphologically rich POS tag, which is then followed by a deterministic
generation step. We apply this strategy for English-Czech and English-German
translation scenarios, obtaining improvements in both settings. We furthermore
show that the improvement is not due to only adding explicit morphological
information.Comment: Accepted as a research paper at WMT17. (Updated version with
corrected references.
Example-based machine translation of the Basque language
Basque is both a minority and a highly inflected language with free order of sentence constituents. Machine Translation of Basque is thus both a real need and a test bed for MT techniques. In this paper, we present a modular Data-Driven MT system which includes different chunkers as well as chunk aligners which can deal with the free order of sentence constituents of Basque. We conducted Basque to English translation experiments, evaluated on a large corpus
(270, 000 sentence pairs). The experimental results show that our system significantly outperforms state-of-the-art
approaches according to several common automatic evaluation metrics
Ongoing study for enhancing chinese-spanish translation with morphology strategies
Chinese and Spanish have different morphology structures, which poses a big challenge for translating between this pair of languages. In this paper, we analyze several strategies to better generalize from the Chinese non-morphology-based language to the Spanish rich morphologybased language. Strategies use a first-step of Spanish morphology-based simplifications and a second-step of fullform generation. The latter can be done using a translation system or classification methods. Finally, both steps are combined either by concatenation in cascade or integration using a factored-based style. Ongoing experiments (based on the United Nations corpus) and their results are described.Postprint (published version
Hybrid Arabic–French machine translation using syntactic re-ordering and morphological pre-processing
This is an accepted manuscript of an article published by Elsevier BV in Computer Speech & Language on 08/11/2014, available online: https://doi.org/10.1016/j.csl.2014.10.007
The accepted version of the publication may differ from the final published version.Arabic is a highly inflected language and a morpho-syntactically complex language with many differences compared to several languages that are heavily studied. It may thus require good pre-processing as it presents significant challenges for Natural Language Processing (NLP), specifically for Machine Translation (MT). This paper aims to examine how Statistical Machine Translation (SMT) can be improved using rule-based pre-processing and language analysis. We describe a hybrid translation approach coupling an Arabic–French statistical machine translation system using the Moses decoder with additional morphological rules that reduce the morphology of the source language (Arabic) to a level that makes it closer to that of the target language (French). Moreover, we introduce additional swapping rules for a structural matching between the source language and the target language. Two structural changes involving the positions of the pronouns and verbs in both the source and target languages have been attempted. The results show an improvement in the quality of translation and a gain in terms of BLEU score after introducing a pre-processing scheme for Arabic and applying these rules based on morphological variations and verb re-ordering (VS into SV constructions) in the source language (Arabic) according to their positions in the target language (French). Furthermore, a learning curve shows the improvement in terms on BLEU score under scarce- and large-resources conditions. The proposed approach is completed without increasing the amount of training data or radically changing the algorithms that can affect the translation or training engines.This paper is based upon work supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant number 356097-08.Published versio
A Nested Attention Neural Hybrid Model for Grammatical Error Correction
Grammatical error correction (GEC) systems strive to correct both global
errors in word order and usage, and local errors in spelling and inflection.
Further developing upon recent work on neural machine translation, we propose a
new hybrid neural model with nested attention layers for GEC. Experiments show
that the new model can effectively correct errors of both types by
incorporating word and character-level information,and that the model
significantly outperforms previous neural models for GEC as measured on the
standard CoNLL-14 benchmark dataset. Further analysis also shows that the
superiority of the proposed model can be largely attributed to the use of the
nested attention mechanism, which has proven particularly effective in
correcting local errors that involve small edits in orthography
- …