Modeling Target-Side Inflection in Neural Machine Translation
NMT systems have problems with large vocabulary sizes. Byte-pair encoding
(BPE) is a popular approach to solving this problem, but while BPE allows the
system to generate any target-side word, it does not enable effective
generalization over the rich vocabulary in morphologically rich languages with
strong inflectional phenomena. We introduce a simple approach to overcome this
problem by training a system to produce the lemma of a word and its
morphologically rich POS tag, which is then followed by a deterministic
generation step. We apply this strategy for English-Czech and English-German
translation scenarios, obtaining improvements in both settings. We furthermore
show that the improvement is not due to only adding explicit morphological
information.
Comment: Accepted as a research paper at WMT17. (Updated version with
corrected references.)
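The two-step scheme described in this abstract (predict a lemma plus a rich morphological tag, then generate the surface form deterministically) can be illustrated with a minimal sketch. The abstract does not say how the generation step is implemented, so everything here is an assumption made for illustration: the tag strings, the toy lookup table `FORMS`, and the function `generate_surface` are hypothetical.

```python
# Hypothetical toy lexicon: (lemma, morphological tag) -> inflected surface form.
# A real system would use a full morphological generator, not a hand-written table.
FORMS = {
    ("Haus", "NN.Nom.Pl"): "Häuser",
    ("Haus", "NN.Dat.Pl"): "Häusern",
    ("gehen", "VVFIN.3.Sg.Pres"): "geht",
}


def generate_surface(pairs):
    """Map predicted (lemma, tag) pairs to surface words.

    The mapping is deterministic: the same pair always yields the same form.
    Pairs missing from the table fall back to the bare lemma (an assumption).
    """
    return [FORMS.get((lemma, tag), lemma) for lemma, tag in pairs]


# Pretend the translation model emitted these lemma/tag pairs.
predicted = [("Haus", "NN.Dat.Pl"), ("gehen", "VVFIN.3.Sg.Pres")]
print(generate_surface(predicted))
```

Because the model only has to predict the lemma and the tag, the inflected forms of one lemma share statistics, which is the kind of generalization the abstract says BPE alone does not provide.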
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.
Comment: 44 pages, to appear in Computational Linguistics
Annotating tense, mood and voice for English, French and German
We present the first open-source tool for annotating morphosyntactic tense, mood and voice for English, French and German verbal complexes. The annotation is based on a set of language-specific rules, which are applied to dependency trees and leverage information about the lemmas, morphological properties and POS tags of the verbs. Our tool has an average accuracy of about 76%. The tense, mood and voice features are useful both as features in computational modeling and for corpus-linguistic research.
EUSMT: incorporating linguistic information to SMT for a morphologically rich language. Its use in SMT-RBMT-EBMT hybridation
148 p.: graf.
This thesis is defined in the framework of machine translation for Basque. Having developed a Rule-Based Machine Translation (RBMT) system for Basque in the IXA group (Mayor, 2007), we decided to tackle the Statistical Machine Translation (SMT) approach and experiment on how we could adapt it to the peculiarities of the Basque language.
First, we analyzed the impact of the agglutinative nature of Basque and the best way to deal with it. In order to deal with the problems presented above, we have split up Basque words into the lemma and some tags which represent the morphological information expressed by the inflection. By dividing each Basque word in this way, we aim to reduce the sparseness produced by the agglutinative nature of Basque and the small amount of training data.
Similarly, we also studied the differences in word order between Spanish and Basque, examining different techniques for dealing with them. We confirm the weakness of basic SMT in dealing with large word order differences between the source and target languages. Distance-based reordering, which is the technique used by the baseline system, does not have enough information to properly handle large word order differences, so every technique tested in this work (based on both statistics and manually generated rules) outperforms the baseline.
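To make the limitation of distance-based reordering concrete, here is a minimal sketch of the distortion cost commonly used in phrase-based SMT, where each jump between consecutively translated source phrases is penalized by its length. That the thesis baseline uses exactly this formulation is an assumption; the function name `distortion_cost` and the span encoding (0-indexed, inclusive source positions) are hypothetical.

```python
def distortion_cost(spans):
    """Sum of jump distances for source phrases taken in translation order.

    `spans` lists (start, end) source positions, 0-indexed and inclusive,
    in the order the phrases are translated. Each step costs
    |start_i - end_{i-1} - 1|, with the previous end initialized to -1.
    """
    prev_end = -1
    total = 0
    for start, end in spans:
        total += abs(start - prev_end - 1)
        prev_end = end
    return total


# Monotone translation: every phrase starts right after the previous one.
print(distortion_cost([(0, 1), (2, 2), (3, 5)]))

# Translating the last phrase first forces two long jumps.
print(distortion_cost([(3, 5), (0, 1), (2, 2)]))
```

The cost depends only on how far the decoder jumps, not on which words or constituents are moved, so it cannot distinguish a linguistically required long-distance movement from an arbitrary one.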
Once we had obtained a more accurate SMT system, we made the first attempts to combine different MT systems into a hybrid one that would allow us to get the best of the different paradigms. The hybridization attempts carried out in this PhD dissertation are preliminary but, even so, this work can help us determine the next steps.
Carried out with a researcher-training grant from the Basque Government (Eusko Jaurlaritza), BFI05.326.
Unification-based constraints for statistical machine translation
Morphology and syntax have both received attention in statistical machine translation
research, but they are usually treated independently and the historical emphasis on
translation into English has meant that many morphosyntactic issues remain under-researched.
Languages with richer morphologies pose additional problems and conventional
approaches tend to perform poorly when either source or target language has
rich morphology.
In both computational and theoretical linguistics, feature structures together with
the associated operation of unification have proven a powerful tool for modelling many
morphosyntactic aspects of natural language. In this thesis, we propose a framework
that extends a state-of-the-art syntax-based model with a feature structure lexicon and
unification-based constraints on the target side of the synchronous grammar. Whilst
our framework is language-independent, we focus on problems in the translation of
English to German, a language pair that has a high degree of syntactic reordering and
rich target-side morphology.
We first apply our approach to modelling agreement and case government phenomena.
We use the lexicon to link surface form words with grammatical feature
values, such as case, gender, and number, and we use constraints to enforce feature
value identity for the words in agreement and government relations. We demonstrate
improvements in translation quality of up to 0.5 BLEU over a strong baseline model.
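The agreement mechanism described above can be sketched with flat feature structures and a simple unification function. This is an illustration under stated assumptions, not the thesis implementation: the toy `LEXICON` gives each word a single reading (real German determiners are ambiguous and would need several), and the names `unify` and `agrees` are hypothetical.

```python
def unify(a, b):
    """Unify two flat feature structures (dicts of feature -> value).

    Returns the merged structure, or None if any shared feature clashes.
    """
    result = dict(a)
    for feature, value in b.items():
        if feature in result and result[feature] != value:
            return None  # constraint failure: incompatible values
        result[feature] = value
    return result


# Hypothetical toy lexicon linking surface forms to grammatical feature values.
# Simplification: one reading per word.
LEXICON = {
    "den": {"case": "acc", "gender": "masc", "number": "sg"},
    "die": {"case": "acc", "gender": "fem", "number": "sg"},
    "Hund": {"gender": "masc", "number": "sg"},
    "Katze": {"gender": "fem", "number": "sg"},
}


def agrees(determiner, noun):
    """True if the two words' feature structures unify."""
    return unify(LEXICON[determiner], LEXICON[noun]) is not None


print(agrees("den", "Hund"))   # compatible gender and number
print(agrees("den", "Katze"))  # gender clash, so unification fails
```

In a decoder, a failed unification of this kind would let a target-side constraint reject or penalize a hypothesis whose words disagree in case, gender, or number.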
We then examine verbal complex production, another aspect of translation that
requires the coordination of linguistic features over multiple words, often with long-range
discontinuities. We develop a feature structure representation of verbal complex
types, using constraint failure as an indicator of translation error and use this to automatically
identify and quantify errors that occur in our baseline system. A manual
analysis and classification of errors informs an extended version of the model that incorporates
information derived from a parse of the source. We identify clause spans
and use model features to encourage the generation of complete verbal complex types.
We are able to improve accuracy as measured using precision and recall against values
extracted from the reference test sets.
Our framework allows for the incorporation of rich linguistic information and we
present sketches of further applications that could be explored in future work.
- …