Accuracy-based scoring for DOT: towards direct error minimization for data-oriented translation
In this work we present a novel technique to rescore fragments in the Data-Oriented Translation model based on their contribution to translation accuracy. We describe
three new rescoring methods, and present the initial results of a pilot experiment on a small subset of the Europarl corpus. This work is a proof-of-concept, and
is the first step in directly optimizing translation
decisions solely on the hypothesized accuracy of potential translations resulting from those decisions.
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them. Comment: 44 pages, to appear in Computational Linguistics
Linguistic Structure in Statistical Machine Translation
This thesis investigates the influence of linguistic structure in statistical machine translation. We develop a word reordering model based on syntactic parse trees and address the issues of pronouns and morphological agreement with a source discriminative word lexicon predicting the translation for individual words using structural features. When used in phrase-based machine translation, the models improve the translation for language pairs with different word order and morphological variation.
Reassessing the proper place of man and machine in translation: a pre-translation scenario
Traditionally, human-machine interaction to reach an improved machine translation (MT) output takes place ex-post and consists of correcting this output. In this work, we investigate other modes of intervention in the MT process. We propose a Pre-Edition protocol that involves: (a) the detection of MT translation difficulties; (b) the resolution of those difficulties by a human translator, who provides their translations (pre-translation); and (c) the integration of the obtained information prior to the automatic translation. This approach can meet individual interaction preferences of certain translators and can be particularly useful for production environments, where more control over output quality is needed. Early resolution of translation difficulties can prevent downstream errors, thus improving the final translation quality "for free". We show that translation difficulty can be reliably predicted for English for various source units. We demonstrate that the pre-translation information can be successfully exploited by an MT system and that the indirect effects are genuine, accounting for around 16% of the total improvement. We also provide a study of the human effort involved in the resolution process.
Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation
Recent studies have revealed a number of pathologies of neural machine
translation (NMT) systems. Hypotheses explaining these mostly suggest that
there is something fundamentally wrong with NMT as a model or its training
algorithm, maximum likelihood estimation (MLE). Most of this evidence was
gathered using maximum a posteriori (MAP) decoding, a decision rule aimed at
identifying the highest-scoring translation, i.e. the mode, under the model
distribution. We argue that the evidence corroborates the inadequacy of MAP
decoding more than casts doubt on the model and its training algorithm. In this
work, we criticise NMT models probabilistically showing that stochastic samples
following the model's own generative story do reproduce various statistics of
the training data well, but that it is beam search that strays from such
statistics. We show that some of the known pathologies of NMT are due to MAP
decoding and not to NMT's statistical assumptions nor MLE. In particular, we
show that the most likely translations under the model accumulate so little
probability mass that the mode can be considered essentially arbitrary. We
therefore advocate for the use of decision rules that take into account
statistics gathered from the model distribution holistically. As a proof of
concept we show that a straightforward implementation of minimum Bayes risk
decoding gives good results outperforming beam search using as little as 30
samples, confirming that MLE-trained NMT models do capture important aspects of
translation well in expectation.
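The sampling-based minimum Bayes risk decision rule mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mbr_decode` and `unigram_f1` are hypothetical, the unigram F1 utility is an assumption chosen for simplicity (the abstract does not say which utility was used), and the samples are assumed to have been drawn from the model beforehand.

```python
from collections import Counter
from typing import Callable, Sequence


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility (an assumption): F1 over whitespace-separated tokens."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(
    samples: Sequence[str],
    utility: Callable[[str, str], float] = unigram_f1,
) -> str:
    """Return the sample with the highest average utility against all samples.

    The samples serve both as the candidate set and as a Monte Carlo
    estimate of the model distribution.
    """
    if not samples:
        raise ValueError("need at least one sample")

    def expected_utility(candidate: str) -> float:
        return sum(utility(candidate, s) for s in samples) / len(samples)

    return max(samples, key=expected_utility)


# Hypothetical samples standing in for draws from an NMT model.
samples = ["the cat sat", "the cat sat down", "a dog ran", "the cat sat"]
best = mbr_decode(samples)
```

Unlike picking the single highest-scoring hypothesis, this rule favours a translation that agrees with many other samples, which is the sense in which it uses the model distribution holistically. The cost is quadratic in the number of samples, so small sample sets keep it cheap.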