247 research outputs found
A Chunk-Based Reordering Model for Phrase-Based SMT Systems
This paper proposes a novel reordering model based on reordering source-language chunks. The model is used as a preprocessing step for phrase-based translation models and integrates well with them. Because it is chunk-based, syntactic information can be taken into account during reordering without requiring a full parse of the source sentence. Two experiments were carried out, and the results showed that the proposed model greatly improves the performance of a phrase-based statistical machine translation (SMT) system.
Reordering in statistical machine translation
PhD
Machine translation is a challenging task whose difficulties arise from several characteristics
of natural language. The main focus of this work is on reordering as one of
the major problems in MT and in statistical MT, which is the method investigated in this
research. The reordering problem in SMT originates from the fact that not all the words
in a sentence can be consecutively translated. This means words must be skipped and
translated out of their source order to produce a fluent and grammatically
correct sentence in the target language. The main reason that reordering is
needed is the fundamental difference in word order between languages. Reordering
therefore becomes a more dominant issue the more the source and target languages
differ structurally.
The aim of this thesis is to study the reordering phenomenon by proposing new methods
of dealing with reordering in SMT decoders and evaluating the effectiveness of
the methods and the importance of reordering in the context of natural language processing
tasks. In other words, we propose novel ways of performing the decoding to
improve the reordering capabilities of the SMT decoder and in addition we explore
the effect of improving the reordering on the quality of specific NLP tasks, namely
named entity recognition and cross-lingual text association. Meanwhile, we go beyond
reordering in text association and present a method to perform cross-lingual text fragment
alignment, based on models of divergence from randomness.
The main contribution of this thesis is a novel method named dynamic distortion,
which is designed to improve the ability of the phrase-based decoder in performing
reordering by adjusting the distortion parameter based on the translation context. The
model employs a discriminative reordering model, which combines several features,
including lexical and syntactic ones, to predict the necessary distortion limit for each
sentence and each hypothesis expansion. The discriminative reordering model is also
integrated into the decoder as an extra feature. The method achieves substantial improvements
over the baseline without increasing decoding time, because it avoids reordering
in unnecessary positions.
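To illustrate what a distortion limit does in a phrase-based decoder, here is a minimal Python sketch. It is not the thesis's implementation. It assumes half-open source spans `(start, end)`, a boolean coverage vector, and a hypothetical function name `allowed_expansions`. In the dynamic distortion method the limit would be predicted per sentence and per hypothesis expansion by the discriminative model; in this sketch it is simply passed in as a parameter.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open source span [start, end)


def allowed_expansions(
    covered: List[bool],
    last_end: int,
    candidates: List[Span],
    distortion_limit: int,
) -> List[Span]:
    """Return the candidate spans a hypothesis may translate next.

    covered          -- covered[i] is True if source word i is already translated
    last_end         -- end index (exclusive) of the most recently translated span
    distortion_limit -- maximum allowed jump, in source words (assumed given here)
    """
    allowed = []
    for start, end in candidates:
        if any(covered[start:end]):
            continue  # span overlaps words that are already translated
        jump = abs(start - last_end)  # distance skipped from the previous span
        if jump <= distortion_limit:
            allowed.append((start, end))
    return allowed


if __name__ == "__main__":
    covered = [True, True, False, False, False, False]
    candidates = [(2, 3), (3, 5), (5, 6), (0, 1)]
    # A limit of 0 forces monotone translation; a larger limit admits jumps.
    print(allowed_expansions(covered, 2, candidates, 0))
    print(allowed_expansions(covered, 2, candidates, 3))
```

A small limit keeps the search space small and the output close to monotone, while a large limit permits long-distance reordering at the cost of more hypotheses, which is why choosing the limit per context is attractive.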
Another novel method is also presented to extend the phrase-based decoder to dynamically
chunk, reorder, and apply phrase translations in tandem. Words inside the chunks
are moved together to enable the decoder to make long-distance reorderings to capture
the word order differences between languages with different sentence structures.
Another aspect of this work is the task-based evaluation of the reordering methods and
other translation algorithms used in the phrase-based SMT systems. With more successful
SMT systems, performing multi-lingual and cross-lingual tasks through translating
becomes more feasible. We have devised a method to evaluate the performance
of state-of-the-art named entity recognisers on text translated by an SMT decoder.
Specifically, we investigated the effect of word reordering and incorporating reordering
models in improving the quality of named entity extraction.
In addition to empirically investigating the effect of translation in the context of cross-lingual
document association, we have described a text fragment alignment algorithm
that finds sections of two documents in different languages that are related in content.
The algorithm uses similarity measures based on divergence from randomness
and word-based translation models to perform text fragment alignment on a collection
of documents in two different languages.
All the methods proposed in this thesis are extensively empirically examined. We have
tested all the algorithms on common translation collections used in different evaluation
campaigns. Well-known automatic evaluation metrics are used to compare the
suggested methods to a state-of-the-art baseline, and the results are analysed and discussed.
Source-side syntactic reordering patterns with functional words for improved phrase-based SMT
Inspired by previous source-side syntactic reordering methods for SMT, this paper focuses on using automatically learned syntactic reordering patterns with functional words, which indicate structural reorderings between the source and target language. This approach takes advantage of phrase alignments and source-side parse trees for pattern extraction, and then filters out those patterns without functional words. Word lattices transformed by the generated patterns are fed into PBSMT systems to incorporate potential reorderings from the inputs. Experiments are carried out on a medium-sized corpus for a Chinese–English SMT task. The proposed method outperforms the baseline system by 1.38% relative on a randomly selected test set and 10.45% relative on the NIST 2008 test set in terms of BLEU score. Furthermore, a system with just 61.88% of the patterns filtered by functional words obtains performance comparable to the unfiltered one on the randomly selected test set, and achieves a 1.74% relative improvement on the NIST 2008 test set.
Combining data-driven MT systems for improved sign language translation
In this paper, we investigate the feasibility of combining two data-driven machine translation (MT) systems for the translation of sign languages (SLs). We take the MT systems of two prominent data-driven research groups, the MaTrEx system developed at DCU and the Statistical Machine Translation (SMT) system developed at RWTH Aachen University, and apply their respective approaches to the task of translating Irish Sign Language and German Sign Language into English and German. In a set of experiments supported by automatic evaluation results, we show that there is definite value in the prospective merging of MaTrEx’s Example-Based MT chunks and distortion limit increase with RWTH’s constraint reordering.
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.
Comment: 44 pages, to appear in Computational Linguistic
OpenMaTrEx: a free/open-source marker-driven example-based machine translation system
We describe OpenMaTrEx, a free/open-source example-based machine translation (EBMT) system based on the marker hypothesis, comprising a marker-driven chunker, a collection of chunk aligners, and two engines: one based on a simple proof-of-concept monotone EBMT recombinator and one based on a Moses-based statistical decoder. OpenMaTrEx is a free/open-source release of the basic components of MaTrEx, the Dublin City University machine translation system.
Improved phrase-based SMT with syntactic reordering patterns learned from lattice scoring
In this paper, we present a novel approach to incorporating source-side syntactic reordering patterns into phrase-based SMT. The main contribution of this work is to use the lattice scoring approach to exploit reordering information that is favoured by the baseline PBSMT system. By referring to the parse trees of the training corpus, we represent the observed reorderings with source-side syntactic patterns. The extracted patterns are then used to convert the parsed inputs into word lattices, which contain both the original source sentences and their potential reorderings. Weights of the word lattices are estimated from the observations of the syntactic reordering patterns in the training corpus. Finally, the PBSMT system is tuned and tested on the generated word lattices to show the benefits of adding potential source-side reorderings to the inputs. We confirmed the effectiveness of our proposed method on a medium-sized corpus for a Chinese-English machine translation task. Our method outperformed the baseline system by 1.67% relative on a randomly selected test set and 8.56% relative on the NIST 2008 test set in terms of BLEU score.
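The idea of turning an input sentence into a set of weighted alternatives can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's method: a pattern is reduced to a hypothetical tuple `(start, mid, end, weight)` that swaps two adjacent token spans, the weights are made-up inputs rather than estimates from a training corpus, and the alternatives are enumerated as flat paths instead of being packed into a compact lattice.

```python
from typing import List, Tuple

# Hypothetical pattern format: swap tokens[start:mid] with tokens[mid:end].
Pattern = Tuple[int, int, int, float]
Path = Tuple[List[str], float]


def reordering_paths(
    tokens: List[str],
    patterns: List[Pattern],
    original_weight: float = 1.0,
) -> List[Path]:
    """Return the original sentence plus one reordered path per pattern."""
    paths: List[Path] = [(list(tokens), original_weight)]
    for start, mid, end, weight in patterns:
        # Move the second span in front of the first; keep the rest in place.
        reordered = (
            tokens[:start] + tokens[mid:end] + tokens[start:mid] + tokens[end:]
        )
        paths.append((reordered, weight))
    return paths


if __name__ == "__main__":
    sentence = ["a", "b", "c", "d"]
    for path, weight in reordering_paths(sentence, [(1, 2, 4, 0.6)]):
        print(weight, " ".join(path))
```

Keeping the original word order as one of the paths means the decoder can always fall back to the unreordered input when no pattern helps.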
Translating Phrases in Neural Machine Translation
Phrases play an important role in natural language understanding and machine
translation (Sag et al., 2002; Villavicencio et al., 2005). However, it is
difficult to integrate them into current neural machine translation (NMT) which
reads and generates sentences word by word. In this work, we propose a method
to translate phrases in NMT by integrating a phrase memory storing target
phrases from a phrase-based statistical machine translation (SMT) system into
the encoder-decoder architecture of NMT. At each decoding step, the phrase
memory is first re-written by the SMT model, which dynamically generates
relevant target phrases with contextual information provided by the NMT model.
Then the proposed model reads the phrase memory to make probability estimations
for all phrases in the phrase memory. If phrase generation is chosen, the
NMT decoder selects an appropriate phrase from the memory to perform phrase
translation and updates its decoding state by consuming the words in the
selected phrase. Otherwise, the NMT decoder generates a word from the
vocabulary as the general NMT decoder does. Experimental results on
Chinese-to-English translation show that the proposed model achieves significant
improvements over the baseline on various test sets.
Comment: Accepted by EMNLP 201
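The per-step choice between emitting a phrase from the phrase memory and emitting a single word can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the function name `decode_step`, the scalar `phrase_gate`, the 0.5 threshold, and the greedy argmax selection are not taken from the paper, and the probabilities are passed in as plain dictionaries rather than computed by a neural model.

```python
from typing import Dict, List


def decode_step(
    word_probs: Dict[str, float],
    phrase_probs: Dict[str, float],
    phrase_gate: float,
    threshold: float = 0.5,  # assumed cut-off for choosing phrase generation
) -> List[str]:
    """Return the tokens emitted at one decoding step.

    If the gate favours phrase generation and the phrase memory is not
    empty, the most probable phrase is emitted as several tokens at once.
    Otherwise a single word is emitted, as in a word-by-word decoder.
    """
    if phrase_gate > threshold and phrase_probs:
        best_phrase = max(phrase_probs, key=phrase_probs.get)
        return best_phrase.split()
    best_word = max(word_probs, key=word_probs.get)
    return [best_word]


if __name__ == "__main__":
    words = {"the": 0.6, "a": 0.4}
    phrases = {"machine translation": 0.7, "translation system": 0.3}
    print(decode_step(words, phrases, phrase_gate=0.8))
    print(decode_step(words, phrases, phrase_gate=0.2))
```

Returning a list of tokens in both branches lets the caller update the decoder state by consuming each emitted token in turn, whether one word or a whole phrase was produced.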
MaTrEx: the DCU machine translation system for ICON 2008
In this paper, we describe the machine translation system developed at DCU that was used for our participation in the NLP Tools Contest of the International Conference on Natural Language Processing (ICON 2008). This was our first ever attempt at working on any Indian language. In this participation, we focus on various techniques for word and phrase alignment to improve system quality. For the English-Hindi translation task we exploit source-language reordering. We also carried out experiments combining both in-domain and out-of-domain data to improve the system performance and, as a post-processing step, we transliterate out-of-vocabulary items.