Syntactic reordering in statistical machine translation
Reordering has been an important topic in statistical machine translation
(SMT) as long as SMT has been around. State-of-the-art SMT systems such
as Pharaoh (Koehn, 2004a) still employ a simplistic model of the reordering
process to do non-local reordering. This model penalizes any reordering regardless
of the words involved. The reordering is only selected if it leads to a translation
that looks like a much better sentence than the alternative.
Recent developments have, however, seen improvements in translation
quality following from syntax-based reordering. One such development
is the pre-translation approach that adjusts the source sentence to resemble
target language word order prior to translation. This is done based on
rules that are either manually created or automatically learned from word
aligned parallel corpora.
We introduce a novel approach to syntactic reordering. This approach
provides better exploitation of the information in the reordering rules and
eliminates problematic biases of previous approaches. Although the approach
is examined within a pre-translation reordering framework, it easily
extends to other frameworks. Our approach significantly outperforms a
state-of-the-art phrase-based SMT system and previous approaches to pre-translation
reordering, including (Li et al., 2007; Zhang et al., 2007b; Crego
& Mariño, 2007). This is consistent both for a very close language pair,
English-Danish, and a very distant language pair, English-Arabic.
We also propose automatic reordering rule learning based on a rich set
of linguistic information. As opposed to most previous approaches that
extract a large set of rules, our approach produces a small set of predominantly
general rules. These provide a good reflection of the main reordering
issues of a given language pair. We examine several parameters
that may influence the quality of the rules learned.
Finally, we provide a new approach for improving automatic word alignment.
This word alignment is used in the above task of automatically learning
reordering rules. Our approach learns from hand aligned data how to
combine several automatic word alignments into one superior word alignment.
The automatic word alignments are created from the same data that
has been preprocessed with different tokenization schemes, thus utilizing
the different strengths that different tokenization schemes exhibit in word
alignment. We achieve a 38% error reduction for the automatic word alignment.
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.
Comment: 44 pages, to appear in Computational Linguistics
The impact of source-side syntactic reordering on hierarchical phrase-based SMT
Syntactic reordering has been demonstrated
to be helpful and effective for handling
different word orders between source
and target languages in SMT. However, in
terms of hierarchical PB-SMT (HPB), does
syntactic reordering still have a significant
impact on performance? This
paper introduces a reordering approach
which explores the 的 (DE) grammatical
structure in Chinese. We employ
the Stanford DE classifier to recognise
the DE structures in both training and
test sentences of Chinese, and then perform
word reordering to make the Chinese
sentences better match the word order
of English. The annotated and reordered
training data and test data are applied
to a re-implemented HPB system and
the impact of the DE construction is examined.
The experiments are conducted
on the NIST 2008 evaluation data and experimental
results show that the BLEU
and METEOR scores are significantly improved
by 1.83/8.91 and 1.17/2.73 absolute/relative
points, respectively.
Description of the Chinese-to-Spanish rule-based machine translation system developed with a hybrid combination of human annotation and statistical techniques
Two of the most popular Machine Translation (MT) paradigms are rule-based (RBMT) and corpus-based, the latter including statistical systems (SMT). When parallel corpora are scarce, RBMT becomes particularly attractive. This is the case for the Chinese-Spanish language pair.
This article presents the first RBMT system for Chinese to Spanish. We describe a hybrid method for constructing this system taking advantage of available resources such as parallel corpora that are used to extract dictionaries and lexical and structural transfer rules.
The final system is freely available online and open source. Although performance lags behind standard SMT systems for an in-domain test set, the results show that the RBMT's coverage is competitive and it outperforms the SMT system in an out-of-domain test set. This RBMT system is available to the general public, it can be further enhanced, and it opens up the possibility of creating future hybrid MT systems.
Peer Reviewed. Postprint (author's final draft)
Exploiting alignment techniques in MATREX: the DCU machine translation system for IWSLT 2008
In this paper, we give a description of the machine translation (MT) system developed at DCU that was used for our third participation in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT 2008). In this participation, we focus on various techniques for word and phrase alignment to improve system quality. Specifically, we try out our word packing and syntax-enhanced word alignment techniques for the Chinese-English task and for the English-Chinese task for the first time. For all translation tasks except Arabic-English, we exploit linguistically motivated bilingual phrase pairs extracted from parallel treebanks. We smooth our translation tables with out-of-domain word translations for the Arabic-English and Chinese-English tasks in order to solve the problem of the high number of out-of-vocabulary items. We also carried out experiments combining both in-domain and out-of-domain data to improve system performance and, finally, we deploy a majority voting procedure combining a language-model-based method and a translation-based method for case and punctuation restoration. We participated in all the translation tasks and translated both the single-best ASR hypotheses and the correct recognition results. The translation results confirm that our new word and phrase alignment techniques are often helpful in improving translation quality, and the data combination method we proposed can significantly improve system performance.
Improving the translation environment for professional translators
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological one.
This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.