1,189 research outputs found
Bayesian reordering model with feature selection
In phrase-based statistical machine translation systems, variation in grammatical structures between source and target languages can cause large movements of phrases. Modeling such movements is crucial in achieving translations of long sentences that appear natural in the target language. We explore generative learning approach to phrase reordering in Arabic to English. Formulating the reordering problem as a classification problem and using naive Bayes with feature selection, we achieve an improvement in the BLEU score over a lexicalized reordering model. The proposed model is compact, fast and scalable to a large corpus
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.Comment: 44 pages, to appear in Computational Linguistic
Dependency relations as source context in phrase-based SMT
The Phrase-Based Statistical Machine Translation (PB-SMT) model has recently begun to include source context modeling, under the assumption that the proper lexical
choice of an ambiguous word can be determined from the context in which it appears. Various types of lexical and syntactic features such as words, parts-of-speech, and
supertags have been explored as effective source context in SMT. In this paper, we show that position-independent syntactic dependency relations of the head of a source phrase can be modeled as useful source context to improve target phrase selection and thereby improve overall performance of PB-SMT. On a DutchβEnglish translation task, by combining dependency relations and syntactic contextual features (part-of-speech), we achieved a 1.0 BLEU (Papineni et al., 2002) point improvement (3.1% relative) over the baseline
Improved phrase-based SMT with syntactic reordering patterns learned from lattice scoring
In this paper, we present a novel approach to incorporate source-side syntactic reordering patterns into phrase-based SMT. The main contribution of this work is to use the lattice scoring approach to exploit and utilize reordering
information that is favoured by the baseline PBSMT system. By referring to the parse trees of the training corpus, we represent the observed reorderings with source-side
syntactic patterns. The extracted patterns are then used to convert the parsed inputs into word lattices, which contain both the original source sentences and their potential reorderings. Weights of the word lattices are estimated from the observations of the syntactic reordering patterns in the training corpus. Finally, the PBSMT system is tuned
and tested on the generated word lattices to show the benefits of adding potential sourceside reorderings in the inputs. We confirmed the effectiveness of our proposed method on a medium-sized corpus for Chinese-English
machine translation task. Our method outperformed the baseline system by 1.67% relative on a randomly selected testset and 8.56% relative on the NIST 2008 testset in terms of BLEU score
Discriminative Reordering Models for Statistical Machine Translation
We present discriminative reordering models for phrase-based statistical machine translation. The models are trained using the maximum entropy principle. We use several types of features: based on words, based on word classes, based on the local context. We evaluate the overall performance of the reordering models as well as the contribution of the individual feature types on a word-aligned corpus. Additionally, we show improved translation performance using these reordering models compared to a state-of-the-art baseline system.
BattRAE: Bidimensional Attention-Based Recursive Autoencoders for Learning Bilingual Phrase Embeddings
In this paper, we propose a bidimensional attention based recursive
autoencoder (BattRAE) to integrate clues and sourcetarget interactions at
multiple levels of granularity into bilingual phrase representations. We employ
recursive autoencoders to generate tree structures of phrases with embeddings
at different levels of granularity (e.g., words, sub-phrases and phrases). Over
these embeddings on the source and target side, we introduce a bidimensional
attention network to learn their interactions encoded in a bidimensional
attention matrix, from which we extract two soft attention weight distributions
simultaneously. These weight distributions enable BattRAE to generate
compositive phrase representations via convolution. Based on the learned phrase
representations, we further use a bilinear neural model, trained via a
max-margin method, to measure bilingual semantic similarity. To evaluate the
effectiveness of BattRAE, we incorporate this semantic similarity as an
additional feature into a state-of-the-art SMT system. Extensive experiments on
NIST Chinese-English test sets show that our model achieves a substantial
improvement of up to 1.63 BLEU points on average over the baseline.Comment: 7 pages, accepted by AAAI 201
Reordering in statistical machine translation
PhDMachine translation is a challenging task that its difficulties arise from several characteristics
of natural language. The main focus of this work is on reordering as one of
the major problems in MT and statistical MT, which is the method investigated in this
research. The reordering problem in SMT originates from the fact that not all the words
in a sentence can be consecutively translated. This means words must be skipped and
be translated out of their order in the source sentence to produce a fluent and grammatically
correct sentence in the target language. The main reason that reordering is
needed is the fundamental word order differences between languages. Therefore, reordering
becomes a more dominant issue, the more source and target languages are
structurally different.
The aim of this thesis is to study the reordering phenomenon by proposing new methods
of dealing with reordering in SMT decoders and evaluating the effectiveness of
the methods and the importance of reordering in the context of natural language processing
tasks. In other words, we propose novel ways of performing the decoding to
improve the reordering capabilities of the SMT decoder and in addition we explore
the effect of improving the reordering on the quality of specific NLP tasks, namely
named entity recognition and cross-lingual text association. Meanwhile, we go beyond
reordering in text association and present a method to perform cross-lingual text fragment
alignment, based on models of divergence from randomness.
The main contribution of this thesis is a novel method named dynamic distortion,
which is designed to improve the ability of the phrase-based decoder in performing
reordering by adjusting the distortion parameter based on the translation context. The
model employs a discriminative reordering model, which is combining several fea-
2
tures including lexical and syntactic, to predict the necessary distortion limit for each
sentence and each hypothesis expansion. The discriminative reordering model is also
integrated into the decoder as an extra feature. The method achieves substantial improvements
over the baseline without increase in the decoding time by avoiding reordering
in unnecessary positions.
Another novel method is also presented to extend the phrase-based decoder to dynamically
chunk, reorder, and apply phrase translations in tandem. Words inside the chunks
are moved together to enable the decoder to make long-distance reorderings to capture
the word order differences between languages with different sentence structures.
Another aspect of this work is the task-based evaluation of the reordering methods and
other translation algorithms used in the phrase-based SMT systems. With more successful
SMT systems, performing multi-lingual and cross-lingual tasks through translating
becomes more feasible. We have devised a method to evaluate the performance
of state-of-the art named entity recognisers on the text translated by a SMT decoder.
Specifically, we investigated the effect of word reordering and incorporating reordering
models in improving the quality of named entity extraction.
In addition to empirically investigating the effect of translation in the context of crosslingual
document association, we have described a text fragment alignment algorithm
to find sections of the two documents in different languages, that are content-wise related.
The algorithm uses similarity measures based on divergence from randomness
and word-based translation models to perform text fragment alignment on a collection
of documents in two different languages.
All the methods proposed in this thesis are extensively empirically examined. We have
tested all the algorithms on common translation collections used in different evaluation
campaigns. Well known automatic evaluation metrics are used to compare the
suggested methods to a state-of-the art baseline and results are analysed and discussed
- β¦