DCEP - Digital Corpus of the European Parliament
The paper presents a new highly multilingual sentence-aligned parallel corpus consisting of various document types and covering a wide range of subject domains. With a total of 1.37 billion words in 23 languages (253 language pairs), gathered in the course of ten years, this is the largest single release of documents by a European Union institution. Corpus statistics, required preprocessing, sentence alignment, and possible gains in statistical machine translation when adding this corpus to the previously existing ones are also considered. JRC.G.2-Global security and crisis managemen
Automatic Evaluation of Parallel Bilingual Data Quality
Statistical machine translation depends heavily on large amounts of parallel bilingual data, which are used to train a translation model. The translation model replaces rule-based transfer, and in some systems lexical transfer as well. It is generally believed that translation quality improves with more training data. I took the opposite approach: I reduced the training data and observed how the translation score changed. I selected the sentence pairs to keep in the corpus according to a key: first randomly, then by sentence length ratio, and finally by the number of word pairs that a dictionary lists as translations of each other. I show that selecting by a suitable criterion slows the decline of the NIST and BLEU scores as the corpus shrinks, and in some cases may even yield a better score. Reducing the corpus size also leads to faster evaluation and lower storage requirements. This may be useful when implementing a machine translation system on small devices with limited system resources.
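The two non-random selection keys named in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names, whitespace tokenization, the shorter-to-longer form of the length ratio, and the dictionary format (a source word mapped to a set of target words) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def length_ratio(pair: SentencePair) -> float:
    """Shorter-to-longer token count ratio; 1.0 means equal length.

    Assumption: whitespace tokenization and this particular ratio form.
    """
    src_len, tgt_len = len(pair[0].split()), len(pair[1].split())
    if src_len == 0 or tgt_len == 0:
        return 0.0
    return min(src_len, tgt_len) / max(src_len, tgt_len)


def make_dictionary_key(
    dictionary: Dict[str, Set[str]]
) -> Callable[[SentencePair], float]:
    """Build a key that counts source tokens with a known translation
    present in the target sentence (hypothetical dictionary format)."""

    def key(pair: SentencePair) -> float:
        tgt_tokens = set(pair[1].lower().split())
        return float(
            sum(
                1
                for word in pair[0].lower().split()
                if dictionary.get(word, set()) & tgt_tokens
            )
        )

    return key


def select_pairs(
    pairs: List[SentencePair],
    keep: int,
    key: Callable[[SentencePair], float],
) -> List[SentencePair]:
    """Keep the `keep` highest-scoring pairs under the given key."""
    return sorted(pairs, key=key, reverse=True)[:keep]
```

Calling `select_pairs(pairs, keep, length_ratio)` or `select_pairs(pairs, keep, make_dictionary_key(dictionary))` shrinks a corpus to `keep` sentence pairs under the respective criterion; the random baseline could be obtained with `random.sample(pairs, keep)`.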
MORFO
The MORFO system for morphological analysis of Czech consists of four units: the analyzer, the generator, the dictionary editor, and the library with the shared source code for handling dictionary objects.
Statistical Machine Translation between Related and Unrelated Languages
In this paper we compare how the relatedness of languages can influence the performance of statistical machine translation (SMT). We apply the Moses toolkit to the Czech-English-Russian corpus UMC 0.1 to train two translation systems: Russian-Czech and English-Czech. Translation quality is evaluated on an independent test set of 1000 sentences parallel in all three languages, using an automatic metric (BLEU score) as well as manual judgments. We examine whether the quality of Russian-Czech translation is better thanks to the relatedness of the languages and their similar word order and morphological richness. Additionally, we present and discuss the most frequent translation errors for both language pairs.
UMC003: Czech-English-Russian Tri-parallel Test Set for MT
UMC003 is a cleaned, tokenized development and test set that accompanies the UMC 0.1 training data and is aimed at Czech-English-Russian machine translation. For more information about the test set, see the README file in the UMC003 package.