A detailed analysis of phrase-based and syntax-based machine translation: the search for systematic differences
This paper describes a range of automatic and manual comparisons of phrase-based and syntax-based statistical machine translation methods applied to English-German and English-French translation of user-generated content. The syntax-based methods underperform the phrase-based models and the relaxation of syntactic constraints to broaden translation rule coverage means that these models do not necessarily generate output which is more grammatical than the output produced by the phrase-based models. Although the systems generate different output and can potentially be fruitfully combined, the lack of systematic difference between these models makes the combination task more challenging.
CCG-augmented hierarchical phrase-based statistical machine translation
Augmenting Statistical Machine Translation (SMT) systems with syntactic information aims at improving translation quality. Hierarchical Phrase-Based (HPB) SMT takes a step toward incorporating syntax in Phrase-Based (PB) SMT by modelling one aspect of language syntax, namely the hierarchical structure of phrases. Syntax Augmented Machine Translation (SAMT) further incorporates syntactic information extracted using context free phrase structure grammar (CF-PSG) in the HPB SMT model. One of the main challenges facing CF-PSG-based augmentation approaches for SMT systems emerges from the difference in the definition of the constituent in CF-PSG and the "phrase" in SMT systems, which hinders the ability of CF-PSG to express the syntactic function of many SMT phrases. Although the SAMT approach to solving this problem using "CCG-like" operators to combine constituent labels improves syntactic constraint coverage, it significantly increases their sparsity, which restricts translation and negatively affects its quality.
In this thesis, we address the problems of sparsity and limited coverage of syntactic constraints facing the CF-PSG-based syntax augmentation approaches for HPB SMT using Combinatory Categorial Grammar (CCG). We demonstrate that CCG's flexible structures and rich syntactic descriptors help to extract richer, more expressive and less sparse syntactic constraints with better coverage than CF-PSG, which enables our CCG-augmented HPB system to outperform the SAMT system. We also try to soften the syntactic constraints imposed by CCG category nonterminal labels by extracting less fine-grained CCG-based labels. We demonstrate that CCG label simplification helps to significantly improve the performance of our CCG category HPB system. Finally, we identify the factors which limit the coverage of the syntactic constraints in our CCG-augmented HPB model. We then try to tackle these factors by extending the definition of the nonterminal label to be composed of a sequence of CCG categories and augmenting the glue grammar with CCG combinatory rules. We demonstrate that our extension approaches help to significantly increase the scope of the syntactic constraints applied in our CCG-augmented HPB model and achieve significant improvements over the HPB SMT baseline.
Statistical Machine Translation Using the Self-Organizing Map
The paper describes a contextual environment built on the Self-Organizing Map, which can model a semantic agent (SOMAgent) that learns the correct meaning of a word used in context. The agent is intended to deal with specific phenomena such as ambiguity and to generate more precise alignments, which can improve the first choice of the Statistical Machine Translation system by supplying linguistic knowledge.
Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor
We present progress on Joshua, an open-source decoder for hierarchical and syntax-based machine translation. The main focus is describing Thrax, a flexible, open-source synchronous context-free grammar extractor. Thrax extracts both hierarchical (Chiang, 2007) and syntax-augmented machine translation (Zollmann and Venugopal, 2006) grammars. It is built on Apache Hadoop for efficient distributed performance, and can easily be extended with support for new grammars, feature functions, and output formats.
A tree does not make a well-formed sentence: Improving syntactic string-to-tree statistical machine translation with more linguistic knowledge
Synchronous context-free grammars (SCFGs) can be learned from parallel texts that are annotated with target-side syntax, and can produce translations by building target-side syntactic trees from source strings. Ideally, producing syntactic trees would entail that the translation is grammatically well-formed, but in reality, this is often not the case. Focusing on translation into German, we discuss various ways in which string-to-tree translation models over- or undergeneralise. We show how these problems can be addressed by choosing a suitable parser and modifying its output, by introducing linguistic constraints that enforce morphological agreement and constrain subcategorisation, and by modelling the productive generation of German compounds.
The role of syntax and semantics in machine translation and quality estimation of machine-translated user-generated content
The availability of the Internet has led to a steady increase in the volume of online user-generated content, the majority of which is in English. Machine-translating this content to other languages can help disseminate the information contained in it to a broader audience. However, reliably publishing these translations requires a prior estimate of their quality. This thesis is concerned with the statistical machine translation of Symantec's Norton forum content, focusing in particular on its quality estimation (QE) using syntactic and semantic information. We compare the output of phrase-based and syntax-based English-to-French and English-to-German machine translation (MT) systems automatically and manually, and find that the syntax-based methods do not necessarily handle grammar-related phenomena in translation better than the phrase-based methods. Although these systems generate sufficiently different outputs, the apparent lack of a systematic difference between these outputs impedes its utilisation in a combination framework. To investigate the role of syntax and semantics in quality estimation of machine translation, we create SymForum, a data set containing French machine translations of English sentences from Norton forum content, their post-edits and their adequacy and fluency scores. We use syntax in quality estimation via tree kernels, hand-crafted features and their combination, and find it useful both alone and in combination with surface-driven features. Our analyses show that neither the accuracy of the syntactic parses used by these systems nor the parsing quality of the MT output affect QE performance. We also find that adding more structure to French Treebank parse trees can be useful for syntax-based QE. We use semantic role labelling (SRL) for our semantic-based QE experiments. We experiment with the limited resources that are available for French and find that a small manually annotated training set is substantially more useful than a much larger artificially created set. We use SRL in quality estimation using tree kernels, hand-crafted features and their combination. Additionally, we introduce PAM, a QE metric based on the predicate-argument structure match between source and target. We find that the SRL quality, especially on the target side, is the major factor negatively affecting the performance of the semantic-based QE. Finally, we annotate English and French Norton forum sentences with their phrase structure syntax using an annotation strategy adapted for user-generated text. We find that user errors occur in only a small fraction of the data, but their correction does improve parsing performance. These treebanks (Foreebank) prove to be useful as supplementary training data in adapting the parsers to the forum text. The improved parses ultimately increase the performance of the semantic-based QE. However, a reliable semantic-based QE system requires further improvements in the quality of the underlying semantic role labelling.
Unification-based constraints for statistical machine translation
Morphology and syntax have both received attention in statistical machine translation
research, but they are usually treated independently and the historical emphasis on
translation into English has meant that many morphosyntactic issues remain under-researched.
Languages with richer morphologies pose additional problems and conventional
approaches tend to perform poorly when either source or target language has
rich morphology.
In both computational and theoretical linguistics, feature structures together with
the associated operation of unification have proven a powerful tool for modelling many
morphosyntactic aspects of natural language. In this thesis, we propose a framework
that extends a state-of-the-art syntax-based model with a feature structure lexicon and
unification-based constraints on the target-side of the synchronous grammar. Whilst
our framework is language-independent, we focus on problems in the translation of
English to German, a language pair that has a high degree of syntactic reordering and
rich target-side morphology.
We first apply our approach to modelling agreement and case government phenomena.
We use the lexicon to link surface form words with grammatical feature
values, such as case, gender, and number, and we use constraints to enforce feature
value identity for the words in agreement and government relations. We demonstrate
improvements in translation quality of up to 0.5 BLEU over a strong baseline model.
We then examine verbal complex production, another aspect of translation that
requires the coordination of linguistic features over multiple words, often with long-range
discontinuities. We develop a feature structure representation of verbal complex
types, using constraint failure as an indicator of translation error and use this to automatically
identify and quantify errors that occur in our baseline system. A manual
analysis and classification of errors informs an extended version of the model that incorporates
information derived from a parse of the source. We identify clause spans
and use model features to encourage the generation of complete verbal complex types.
We are able to improve accuracy as measured using precision and recall against values
extracted from the reference test sets.
Our framework allows for the incorporation of rich linguistic information and we
present sketches of further applications that could be explored in future work.
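
The agreement constraints described in this abstract (a lexicon linking surface forms to case, gender and number values, plus constraints enforcing feature value identity) can be illustrated with a minimal sketch of flat feature structure unification. This is not the thesis's implementation: the names `unify`, `agree` and `LEXICON`, the flat dictionary representation, and the toy German lexicon entries are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional

FeatureStructure = Dict[str, str]


def unify(a: FeatureStructure, b: FeatureStructure) -> Optional[FeatureStructure]:
    """Unify two flat feature structures; return None if any feature value clashes."""
    merged = dict(a)
    for feature, value in b.items():
        if feature in merged and merged[feature] != value:
            return None  # constraint failure: conflicting values for one feature
        merged[feature] = value
    return merged


# Hypothetical toy lexicon: each surface form maps to its possible analyses.
# Real lexicon entries would be far more numerous; these are illustrative only.
LEXICON: Dict[str, List[FeatureStructure]] = {
    "der": [
        {"case": "nom", "gender": "masc", "number": "sg"},
        {"case": "dat", "gender": "fem", "number": "sg"},
    ],
    "das": [
        {"case": "nom", "gender": "neut", "number": "sg"},
        {"case": "acc", "gender": "neut", "number": "sg"},
    ],
    "Hund": [
        {"case": "nom", "gender": "masc", "number": "sg"},
        {"case": "dat", "gender": "masc", "number": "sg"},
        {"case": "acc", "gender": "masc", "number": "sg"},
    ],
    "Frau": [
        {"case": "nom", "gender": "fem", "number": "sg"},
        {"case": "dat", "gender": "fem", "number": "sg"},
    ],
}


def agree(words: List[str]) -> List[FeatureStructure]:
    """Return every analysis on which all words in an agreement relation can unify."""
    candidates: List[FeatureStructure] = [{}]
    for word in words:
        surviving: List[FeatureStructure] = []
        for candidate in candidates:
            for entry in LEXICON.get(word, []):
                unified = unify(candidate, entry)
                if unified is not None:
                    surviving.append(unified)
        candidates = surviving
    return candidates
```

Under these toy entries, `agree(["der", "Hund"])` leaves a single nominative masculine singular analysis, while `agree(["das", "Hund"])` returns an empty list, which is the kind of constraint failure that could be used to rule out a target-side hypothesis.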