159 research outputs found
A syntactic skeleton for statistical machine translation
We present a method for improving statistical machine translation performance by using linguistically motivated syntactic information. Our algorithm recursively decomposes source language sentences into syntactically simpler and shorter chunks, and recomposes their translation to form target language sentences. This improves both the word order and lexical selection of the translation. We report statistically significant relative improvementsof 3.3% BLEU score in an experiment (English!Spanish) carried out on
an 800-sentence test set extracted from the Europarl corpus
TMX markup: a challenge when adapting SMT to the localisation environment
Translation memory (TM) plays an important role in localisation workflows and is used as an efficient and fundamental tool to carry out translation. In recent years, statistical machine translation (SMT) techniques have been rapidly developed, and the translation quality and speed have been significantly improved as well. However,when applying SMT technique to facilitate post-editing in the localisation industry, we need to adapt SMT to the TM data which is formatted with special mark-up. In this paper, we explore some issues when adapting SMT to Symantec formatted TM data.
Three different methods are proposed to handle the Translation Memory eXchange (TMX) markup and a comparative study is carried out between them. Furthermore, we also compare the TMX-based SMT systems with a customised SYSTRAN system through human evaluation and automatic evaluation metrics. The experimental results conducted on the French and English language pair show that the SMT can perform well using TMX as input format either during training or at runtime
Initial explorations in English to Turkish statistical machine translation
This paper presents some very preliminary results for and problems in developing a statistical machine translation system from English to Turkish. Starting with a baseline word model trained from about 20K aligned sentences, we explore various ways of exploiting morphological structure to improve upon the baseline system. As Turkish is a language with complex agglutinative word structures, we experiment withmorphologically segmented and disambiguated versions of the parallel texts in order to also uncover relations between morphemes and function words in one language with morphemes and functions words in the other, in addition to relations between open class content words. Morphological segmentation on the Turkish side also conflates the statistics from allomorphs so that sparseness can be alleviated to a certain extent. We find that this approach coupled with a simple grouping of most frequent morphemes and function words on both sides improve the BLEU score from the baseline of 0.0752 to 0.0913 with the small training data. We close with a discussion on why one should not expect distortion parameters to model word-local morpheme ordering and that a new approach to handling complex morphotactics is needed
Towards String-to-Tree Neural Machine Translation
We present a simple method to incorporate syntactic information about the
target language in a neural machine translation system by translating into
linearized, lexicalized constituency trees. An experiment on the WMT16
German-English news translation task resulted in an improved BLEU score when
compared to a syntax-agnostic NMT baseline trained on the same dataset. An
analysis of the translations from the syntax-aware system shows that it
performs more reordering during translation in comparison to the baseline. A
small-scale human evaluation also showed an advantage to the syntax-aware
system.Comment: Accepted as a short paper in ACL 201
Parallel Treebanks in Phrase-Based Statistical Machine Translation
Given much recent discussion and the shift in focus of the field, it is becoming apparent that the incorporation of syntax is the way forward for the current state-of-the-art in machine translation (MT). Parallel treebanks are a relatively recent innovation and appear to be ideal candidates for MT training material. However, until recently there has been no other means to build them than by
hand. In this paper, we describe how we make use of new tools to automatically build a large parallel treebank and extract a set of linguistically motivated phrase pairs from it. We show that adding these phrase pairs to the translation model of a baseline phrase-based statistical MT (PBSMT) system leads to significant improvements in translation quality. We describe further experiments on incorporating parallel treebank information into PBSMT, such as word alignments. We investigate the conditions under which the incorporation of parallel treebank data performs optimally. Finally, we discuss the potential of parallel treebanks in other paradigms of MT
Top-Rank Enhanced Listwise Optimization for Statistical Machine Translation
Pairwise ranking methods are the basis of many widely used discriminative
training approaches for structure prediction problems in natural language
processing(NLP). Decomposing the problem of ranking hypotheses into pairwise
comparisons enables simple and efficient solutions. However, neglecting the
global ordering of the hypothesis list may hinder learning. We propose a
listwise learning framework for structure prediction problems such as machine
translation. Our framework directly models the entire translation list's
ordering to learn parameters which may better fit the given listwise samples.
Furthermore, we propose top-rank enhanced loss functions, which are more
sensitive to ranking errors at higher positions. Experiments on a large-scale
Chinese-English translation task show that both our listwise learning framework
and top-rank enhanced listwise losses lead to significant improvements in
translation quality.Comment: Accepted to CONLL 201
A detailed analysis of phrase-based and syntax-based machine translation: the search for systematic differences
This paper describes a range of automatic and manual comparisons of phrase-based and syntax-based statistical machine translation methods applied to English-German and
English-French translation of user-generated content. The syntax-based methods underperform the phrase-based models and the relaxation of syntactic constraints to broaden translation rule coverage means that these models do not necessarily generate output which is more grammatical than the output produced by the phrase-based models. Although the
systems generate different output and can potentially
be fruitfully combined, the lack of systematic difference between these models makes the combination task more challenging
Accuracy-based scoring for DOT: towards direct error minimization for data-oriented translation
In this work we present a novel technique to rescore fragments in the Data-Oriented Translation model based on their contribution to translation accuracy. We describe
three new rescoring methods, and present the initial results of a pilot experiment on a small subset of the Europarl corpus. This work is a proof-of-concept, and
is the first step in directly optimizing translation
decisions solely on the hypothesized accuracy of potential translations resulting from those decisions
- …