Initial explorations in English to Turkish statistical machine translation
This paper presents some very preliminary results for, and problems in, developing a statistical machine translation system from English to Turkish. Starting with a baseline word model trained from about 20K aligned sentences, we explore various ways of exploiting morphological structure to improve upon the baseline system. As Turkish is a language with complex agglutinative word structures, we experiment with morphologically segmented and disambiguated versions of the parallel texts in order to also uncover relations between morphemes and function words in one language with morphemes and function words in the other, in addition to relations between open class content words. Morphological segmentation on the Turkish side also conflates the statistics from allomorphs so that sparseness can be alleviated to a certain extent. We find that this approach, coupled with a simple grouping of the most frequent morphemes and function words on both sides, improves the BLEU score from the baseline of 0.0752 to 0.0913 with the small training data. We close with a discussion on why one should not expect distortion parameters to model word-local morpheme ordering and why a new approach to handling complex morphotactics is needed.
Example-based machine translation of the Basque language
Basque is both a minority and a highly inflected language with free order of sentence constituents. Machine Translation of Basque is thus both a real need and a test bed for MT techniques. In this paper, we present a modular Data-Driven MT system which includes different chunkers as well as chunk aligners which can deal with the free order of sentence constituents of Basque. We conducted Basque to English translation experiments, evaluated on a large corpus (270,000 sentence pairs). The experimental results show that our system significantly outperforms state-of-the-art approaches according to several common automatic evaluation metrics.
Incorporating Human Translator Style into English-Turkish Literary Machine Translation
Although machine translation systems are mostly designed to serve in the
general domain, there is a growing tendency to adapt these systems to other
domains like literary translation. In this paper, we focus on English-Turkish
literary translation and develop machine translation models that take into
account the stylistic features of translators. We fine-tune a pre-trained
machine translation model on the manually aligned works of a particular
translator. We make a detailed analysis of the effects of manual and automatic
alignments, data augmentation methods, and corpus size on the translations. We
propose an approach based on stylistic features to evaluate the style of a
translator in the output translations. We show that the human translator's style
can be largely recreated in the target machine translations by adapting the
models to the style of the translator.
The Swedish-Turkish Parallel Corpus and Tools for its Creation
Proceedings of the 16th Nordic Conference
of Computational Linguistics NODALIDA-2007.
Editors: Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit.
University of Tartu, Tartu, 2007.
ISBN 978-9985-4-0513-0 (online)
ISBN 978-9985-4-0514-7 (CD-ROM)
pp. 136-143
SilverAlign: MT-Based Silver Data Algorithm For Evaluating Word Alignment
Word alignments are essential for a variety of NLP tasks. Therefore, choosing
the best approaches for their creation is crucial. However, the scarce
availability of gold evaluation data makes the choice difficult. We propose
SilverAlign, a new method to automatically create silver data for the
evaluation of word aligners by exploiting machine translation and minimal
pairs. We show that performance on our silver data correlates well with gold
benchmarks for 9 language pairs, making our approach a valid resource for
evaluation of different domains and languages when gold data are not available.
This addresses the important scenario of missing gold data alignments for
low-resource languages.
Comparative evaluation of research vs. Online MT systems
This paper reports MT evaluation experiments that were conducted at the end of year 1 of the EU-funded CoSyne project for three language combinations, considering translations from German, Italian and Dutch into English. We present a comparative evaluation of the MT software developed within the project against four of the leading free web-based MT systems across a range of state-of-the-art automatic evaluation metrics. The data sets from the news domain that were created and used for training purposes and also for this evaluation exercise, which are available to the research community, are also described. The evaluation results for the news domain are very encouraging: the CoSyne MT software consistently beats the rule-based MT systems, and for translations from Italian and Dutch into English in particular the scores given by some of the standard automatic evaluation metrics are not too distant from those obtained by well-established statistical online MT systems.
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.
Comment: 44 pages, to appear in Computational Linguistics