Automatic Evaluation of Parallel Bilingual Data Quality

Kolovratník, David

Authors: David Kolovratník
Publication date: 1 January 2007
Abstract

Statistical machine translation depends heavily on large amounts of parallel bilingual data, which are used to train a translation model. The translation model takes the place of rule-based transfer, and in some systems of lexical transfer as well. It is generally believed that translation quality improves with more training data. I tried the opposite: I supplied less data and observed how the translation score changed. I selected the sentence pairs that remain in the corpus according to a key: first randomly, then according to the sentence length ratio, and finally according to the number of word pairs that a dictionary lists as translation pairs. I show that selection according to a suitable criterion slows the decline of the NIST and BLEU scores as the corpus shrinks, and in some cases may even lead to a better score. Reducing the corpus size also leads to faster evaluation and lower storage requirements. This may be useful when implementing a machine translation system on small devices with limited system resources.
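
The two non-random selection keys named in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the whitespace tokenization, the exact scoring formulas, and the toy English-Czech data are all assumptions made for illustration, and the abstract does not state which language pair or dictionary was used.

```python
def length_ratio_score(src, tgt):
    """Ratio of the shorter to the longer sentence, in tokens (1.0 means equal length).

    Assumption: whitespace tokenization and a symmetric ratio; the source
    does not specify how the length ratio was computed.
    """
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0:
        return 0.0
    return min(s, t) / max(s, t)


def dictionary_score(src, tgt, dictionary):
    """Count source words that have a dictionary translation present in the target.

    `dictionary` is a hypothetical mapping from a source word to a set of
    target-language translations.
    """
    tgt_words = set(tgt.lower().split())
    return sum(
        1
        for word in src.lower().split()
        if dictionary.get(word, set()) & tgt_words
    )


def select_pairs(pairs, key, keep):
    """Keep the `keep` highest-scoring sentence pairs under the scoring function `key`."""
    ranked = sorted(pairs, key=lambda pair: key(*pair), reverse=True)
    return ranked[:keep]


# Invented toy data, for illustration only.
toy_dictionary = {"dog": {"pes"}, "cat": {"kočka"}, "barks": {"štěká"}}
toy_corpus = [
    ("the dog barks", "pes štěká"),
    ("the cat", "zcela nesouvisející dlouhá věta o něčem jiném"),
    ("hello", "ahoj"),
]

by_length = select_pairs(toy_corpus, length_ratio_score, keep=2)
by_dictionary = select_pairs(
    toy_corpus,
    lambda s, t: dictionary_score(s, t, toy_dictionary),
    keep=1,
)
```

In this sketch the reduced corpus is whatever `select_pairs` returns; a translation model would then be retrained on it and scored with NIST and BLEU to compare against the full corpus and a random subset of the same size.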

Available Versions

National Repository of Grey Literature

oai:invenio.nusl.cz:472415

Last time updated on 29/07/2022