257 research outputs found
Towards String-to-Tree Neural Machine Translation
We present a simple method to incorporate syntactic information about the
target language in a neural machine translation system by translating into
linearized, lexicalized constituency trees. An experiment on the WMT16
German-English news translation task resulted in an improved BLEU score when
compared to a syntax-agnostic NMT baseline trained on the same dataset. An
analysis of the translations from the syntax-aware system shows that it
performs more reordering during translation in comparison to the baseline. A
small-scale human evaluation also showed an advantage to the syntax-aware
system.Comment: Accepted as a short paper in ACL 201
Coupling hierarchical word reordering and decoding in phrase-based statistical machine translation
In this paper, we start with the existing idea of taking reordering rules automatically derived from syntactic representations, and applying them in a preprocessing step before translation to make the source sentence structurally more like the target; and we propose a new approach to hierarchically extracting these rules. We evaluate this, combined with a lattice-based decoding, and show improvements over stateof-the-art distortion models.Postprint (published version
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.Comment: 44 pages, to appear in Computational Linguistic
Source side pre-ordering using recurrent neural networks for English-Myanmar machine translation
Word reordering has remained one of the challenging problems for machine translation when translating between language pairs with different word orders e.g. English and Myanmar. Without reordering between these languages, a source sentence may be translated directly with similar word order and translation can not be meaningful. Myanmar is a subject-objectverb (SOV) language and an effective reordering is essential for translation. In this paper, we applied a pre-ordering approach using recurrent neural networks to pre-order words of the source Myanmar sentence into target English’s word order. This neural pre-ordering model is automatically derived from parallel word-aligned data with syntactic and lexical features based on dependency parse trees of the source sentences. This can generate arbitrary permutations that may be non-local on the sentence and can be combined into English-Myanmar machine translation. We exploited the model to reorder English sentences into Myanmar-like word order as a preprocessing stage for machine translation, obtaining improvements quality comparable to baseline rule-based pre-ordering approach on asian language treebank (ALT) corpus
Linguistic Structure in Statistical Machine Translation
This thesis investigates the influence of linguistic structure in statistical machine translation. We develop a word reordering model based on syntactic parse trees and address the issues of pronouns and morphological agreement with a source discriminative word lexicon predicting the translation for individual words using structural features. When used in phrase-based machine translation, the models improve the translation for language pairs with different word order and morphological variation
English-Latvian SMT: the challenge of translating into a free word order language
This paper presents a comparative study of two approaches to
statistical machine translation (SMT) and their application to
a task of English-to-Latvian translation, which is still an open
research line in the field of automatic translation.
We consider a state-of-the-art phrase-based SMT and an
alternative N-gram-based SMT systems. The major differences
between these two approaches lie in the distinct representations
of bilingual units, which are the components of the
bilingual model driving translation process and in the statistical
modeling of the translation context.
Latvian being a rather free word order language implies
additional difficulties to the translation process. We contrast
different reordering models and investigate how well they
deal with the word ordering issue.
Moving beyond automatic scores of translation quality
that are classically presented in MT research papers, we contribute
presenting a manual error analysis of MT systems output
that helps to shed light on advantages and disadvantages
of the SMT systems under consideration and identify the most
prominent source of errors typical for both SMT systems.Postprint (published version
EUSMT: incorporating linguistic information to SMT for a morphologically rich language. Its use in SMT-RBMT-EBMT hybridation
148 p.: graf.This thesis is defined in the framework of machine translation for Basque. Having developed a Rule-Based Machine Translation (RBMT) system for Basque in the IXA group (Mayor, 2007), we decided to tackle the Statistical Machine Translation (SMT) approach and experiment on how we could adapt it to the peculiarities of the Basque language.
First, we analyzed the impact of the agglutinative nature of Basque and the best way to deal with it. In order to deal with the problems presented above, we have split up Basque words into the lemma and some tags which represent the morphological information expressed by the inflection. By dividing each Basque word in this way, we aim to reduce the sparseness produced by the agglutinative nature of Basque and the small amount of training data.
Similarly, we also studied the differences in word order between Spanish and Basque, examining different techniques for dealing with them. we confirm the weakness of the basic SMT in dealing with great word order differences in the source and target languages. Distance-based reordering, which is the technique used by the baseline system, does not have enough information to properly handle great word order differences, so any of the techniques tested in this work (based on both statistics and manually generated rules) outperforms the baseline.
Once we had obtained a more accurate SMT system, we started the first attempts to combine different MT systems into a hybrid one that would allow us to get the best of the different paradigms. The hybridization attempts carried out in this PhD dissertation are preliminaries, but, even so, this work can help us to determine the ongoing steps.
This thesis is defined in the framework of machine translation for Basque. Having developed a Rule-Based Machine Translation (RBMT) system for Basque in the IXA group (Mayor, 2007), we decided to tackle the Statistical Machine Translation (SMT) approach and experiment on how we could adapt it to the peculiarities of the Basque language.
First, we analyzed the impact of the agglutinative nature of Basque and the best way to deal with it. In order to deal with the problems presented above, we have split up Basque words into the lemma and some tags which represent the morphological information expressed by the inflection. By dividing each Basque word in this way, we aim to reduce the sparseness produced by the agglutinative nature of Basque and the small amount of training data.
Similarly, we also studied the differences in word order between Spanish and Basque, examining different techniques for dealing with them. we confirm the weakness of the basic SMT in dealing with great word order differences in the source and target languages. Distance-based reordering, which is the technique used by the baseline system, does not have enough information to properly handle great word order differences, so any of the techniques tested in this work (based on both statistics and manually generated rules) outperforms the baseline.
Once we had obtained a more accurate SMT system, we started the first attempts to combine different MT systems into a hybrid one that would allow us to get the best of the different paradigms. The hybridization attempts carried out in this PhD dissertation are preliminaries, but, even so, this work can help us to determine the ongoing steps.Eusko Jaurlaritzaren ikertzaileak prestatzeko beka batekin (BFI05.326)eginda
Towards improving English-Latvian translation: a system comparison and a new rescoring feature
This paper presents a comparative study of two alternative approaches to statistical machine translation (SMT) and their application to
a task of English-to-Latvian translation. Furthermore, a novel feature intending to reflect the relatively free word order scheme of the
Latvian language is proposed and successfully applied on the n-best list rescoring step. Moving beyond classical automatic scores of
translation quality that are classically presented in MT research papers, we contribute presenting a manual error analysis of MT systems
output that helps to shed light on advantages and disadvantages of the SMT systems under consideration.Postprint (published version
- …