Using linear interpolation and weighted reordering hypotheses in the Moses system
This paper introduces a novel reordering model into the open-source Moses toolkit. The main idea is to provide
weighted reordering hypotheses to the SMT decoder. These hypotheses are built by a first-step Ngram-based SMT
translation from the source language into a third representation, called the reordered source language. Each hypothesis
carries its own weight, provided by the Ngram-based decoder. The proposed reordering technique yields better and more
efficient translation than both distance-based and lexicalized reordering. In addition to this reordering
approach, the paper describes a domain adaptation technique based on a linear combination of a specific in-domain
translation model and an additional out-of-domain translation model. Results for both approaches are reported on the
Arabic-to-English 2008 IWSLT task. When the weighted reordering hypotheses and the domain adaptation technique are
implemented in the final translation system, translation results improve by up to 2.5 BLEU over a standard
state-of-the-art Moses baseline system.
Postprint (published version)
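The domain adaptation idea in the abstract above, a linear combination of an in-domain and an out-of-domain translation model, can be illustrated with a minimal sketch. The abstract does not give the formula or the weights, so everything below is an assumption for illustration only: the function name `interpolate`, the dictionary representation of a phrase table, the single weight `lam`, and the toy probabilities are all hypothetical and are not taken from the paper or from the Moses API.

```python
from typing import Dict, Tuple

# Hypothetical representation: (source phrase, target phrase) -> probability.
PhraseTable = Dict[Tuple[str, str], float]


def interpolate(in_domain: PhraseTable, out_domain: PhraseTable, lam: float) -> PhraseTable:
    """Linearly interpolate two phrase tables.

    p(e|f) = lam * p_in(e|f) + (1 - lam) * p_out(e|f)

    A pair missing from one table contributes probability 0 from that table.
    The weight `lam` is an assumed free parameter, not a value from the paper.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    pairs = set(in_domain) | set(out_domain)
    return {
        pair: lam * in_domain.get(pair, 0.0) + (1.0 - lam) * out_domain.get(pair, 0.0)
        for pair in pairs
    }


if __name__ == "__main__":
    # Toy probabilities, invented for illustration only.
    p_in = {("kitab", "book"): 0.8, ("kitab", "writing"): 0.2}
    p_out = {("kitab", "book"): 0.5, ("kitab", "volume"): 0.5}
    for pair, prob in sorted(interpolate(p_in, p_out, lam=0.75).items()):
        print(pair, round(prob, 3))
```

Because each input distribution sums to one for a given source phrase, the interpolated scores also sum to one, so the combined table can be used wherever a single translation model was used before.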
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.
Comment: 44 pages, to appear in Computational Linguistics
Translation quality and productivity: a study on rich morphology languages
© 2017 The Authors. Published by Asia-Pacific Association for Machine Translation. This is an open access article available under a Creative Commons licence.
The published version can be accessed at the following link on the publisher’s website: http://aamt.info/app-def/S-102/mtsummit/2017/wp-content/uploads/sites/2/2017/09/MTSummitXVI_ResearchTrack.pdf
Specia, L., Blain, F., Harris, K., Burchardt, A. et al. (2017) Translation quality and productivity: a study on rich morphology languages. In, Machine Translation Summit XVI, Vol 1. MT Research Track, Kurohashi, S., and Fung, P., Nagoya, Aichi, Japan: Asia-Pacific Association for Machine Translation, pp. 55-71.
This work was supported by the QT21 project (H2020 No. 645452)
Novel statistical approaches to text classification, machine translation and computer-assisted translation
This thesis presents several contributions in the fields of automatic
text classification, machine translation and computer-assisted
translation under the statistical framework.
In automatic text classification, a new application called bilingual
text classification is proposed, together with a series of models
designed to capture this bilingual information. Two approaches to this
application are presented: the first rests on a naive assumption of
independence between the two languages involved, while the second,
more sophisticated, considers the existence of a correlation between
words in different languages. The first approach led to the
development of five models based on unigram models and smoothed
n-gram models. These models were evaluated on three tasks of
increasing complexity, the most complex of which was analysed from the
point of view of a document indexing assistance system. The second
approach is characterised by translation models capable of capturing
correlation between words in different languages. In our case, the
translation model chosen was the M1 model together with a unigram
model. This model was evaluated on two of the simplest tasks,
outperforming the naive approach, which assumes independence between
words in different languages drawn from bilingual texts.
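The naive approach described above, which treats the two languages as independent, can be sketched as a class prior multiplied by one unigram model per language. This is only an illustrative reading of the abstract, not the thesis's actual models: the class name `NaiveBilingualClassifier`, the add-one smoothing, the whitespace tokenisation, and the toy training pairs are all assumptions introduced here.

```python
import math
from collections import Counter, defaultdict


class NaiveBilingualClassifier:
    """Scores a class c for a bilingual pair (x, y) as
    log P(c) + log P(x | c) + log P(y | c),
    with one add-one-smoothed unigram model per language and class.
    """

    def __init__(self):
        self.class_counts = Counter()
        # side -> class label -> token counts
        self.word_counts = {"src": defaultdict(Counter), "tgt": defaultdict(Counter)}
        self.vocab = {"src": set(), "tgt": set()}

    def fit(self, examples):
        """examples: iterable of (source_text, target_text, label) triples."""
        for src, tgt, label in examples:
            self.class_counts[label] += 1
            for side, text in (("src", src), ("tgt", tgt)):
                tokens = text.split()
                self.word_counts[side][label].update(tokens)
                self.vocab[side].update(tokens)
        return self

    def _log_unigram(self, side, tokens, label):
        counts = self.word_counts[side][label]
        denom = sum(counts.values()) + len(self.vocab[side])
        # Add-one smoothing (an assumption) keeps unseen tokens from zeroing the score.
        return sum(math.log((counts[t] + 1) / denom) for t in tokens)

    def predict(self, src, tgt):
        total = sum(self.class_counts.values())

        def score(label):
            return (
                math.log(self.class_counts[label] / total)
                + self._log_unigram("src", src.split(), label)
                + self._log_unigram("tgt", tgt.split(), label)
            )

        return max(self.class_counts, key=score)


if __name__ == "__main__":
    # Toy bilingual pairs, invented for illustration only.
    data = [
        ("the match ended", "el partido terminó", "sports"),
        ("the bank raised rates", "el banco subió los tipos", "finance"),
    ]
    clf = NaiveBilingualClassifier().fit(data)
    print(clf.predict("the match", "el partido"))
```

Because the two language models are simply multiplied, this sketch cannot capture the cross-language word correlations that the second approach in the abstract addresses with the M1 translation model.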
In machine translation, the word-based statistical translation models
M1, M2 and HMM are extended under the framework of mixture modelling,
with the aim of defining context-dependent translation models. In
addition, an iterative search algorithm based on dynamic programming,
originally designed for the M2 model, is extended to the case of
mixtures of M2 models. This search algorithm …
Civera Saiz, J. (2008). Novel statistical approaches to text classification, machine translation and computer-assisted translation [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2502