9 research outputs found

DCEP - Digital Corpus of the European Parliament

Author: HAJLAOUI Najeh
KOLOVRATNÍK David
STEINBERGER Ralf
VAEYRYNEN JAAKKO
VARGA Dániel
Publication venue: European Language Resources Association (ELRA)
Publication date: 12/12/2013

The paper presents a new highly multilingual sentence-aligned parallel corpus consisting of various document types and covering a wide range of subject domains. With a total of 1.37 billion words in 23 languages (253 language pairs), gathered in the course of ten years, this is the largest single release of documents by a European Union institution. Corpus statistics, required preprocessing, sentence alignment, and possible gains in statistical machine translation when adding this corpus to the previously existing ones are also considered.

JRC.G.2-Global security and crisis management

JRC Publications Repository

Automatic Evaluation of Parallel Bilingual Data Quality

Author: Kolovratník David
Publication date: 01/01/2007

Statistical machine translation depends heavily on large amounts of parallel bilingual data, which are used to train a translation model. The translation model takes the place of rule-based transfer, and in some systems of lexical transfer as well. It is generally believed that translation quality improves with more training data. I tried the opposite: giving the system less data and observing how the translation score changes. I selected the sentence pairs to keep in the corpus according to some key: first randomly, then by sentence length ratio, and finally by the number of word pairs that a dictionary lists as translations. I show that selection by a suitable criterion slows the decline of the NIST and BLEU scores as the corpus shrinks, and in some cases may even lead to a better score. Reducing the corpus size also leads to faster evaluation and lower space requirements. This may be useful when implementing a machine translation system on small devices with limited system resources.

National Repository of Grey Literature

MORFO

Author: Kolovratník David
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 02/11/2009

The MORFO system for morphological analysis of Czech consists of four units: the analyzer, the generator, the dictionary editor, and the library with the shared source code for handling dictionary objects.

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Automatic Evaluation of Parallel Bilingual Data Quality

Author: Kolovratník David
Publication date: 01/01/2007

CU Digital Repository

National Repository of Grey Literature

Statistical Machine Translation between Related and Unrelated Languages

Author: Bojar Ondřej
Klyueva Natalia
Kolovratník David
Publication date: 01/01/2009

In this paper we describe an attempt to compare how the relatedness of languages can influence the performance of statistical machine translation (SMT). We apply the Moses toolkit to the Czech-English-Russian corpus UMC 0.1 in order to train two translation systems: Russian-Czech and English-Czech. The quality of the translation is evaluated on an independent test set of 1000 sentences parallel in all three languages, using an automatic metric (BLEU score) as well as manual judgments. We examine whether the quality of Russian-Czech translation is better thanks to the relatedness of the languages and their similar characteristics of word order and morphological richness. Additionally, we present and discuss the most frequent translation errors for both language pairs.

Biblio at Institute of Formal and Applied Linguistics

UMC003: Czech-English-Russian Tri-parallel Test Set for MT

Author: Bojar Ondřej
Klyueva Natalia
Kolovratník David
Publication venue: Charles University in Prague, Karolinum Press
Publication date: 01/01/2009

UMC003 is a cleaned, tokenized development and test set to accompany the training data in UMC 0.1, aimed at Czech-English-Russian machine translation. For more information about the test set, see the README file in the UMC003 package.

Biblio at Institute of Formal and Applied Linguistics