Nowadays, large amounts of data are available to train statistical machine translation systems. However, it is not clear whether all of this training data actually helps: a system trained on a subset of such huge bilingual corpora might outperform a system trained on all of the bilingual data. This paper studies this issue by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on the occurrence of infrequent n-grams.
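As a rough illustration of the two selection criteria, the following Python sketch scores a candidate sentence (a) by its probability under a simple unigram model of the in-domain corpus and (b) by how many of its n-grams are still infrequent in the data selected so far. The unigram model, the add-one smoothing, the threshold, and all function names are illustrative assumptions, not the exact formulation used in the paper.

```python
import math
from collections import Counter

def train_unigram_lm(sentences):
    # Illustrative stand-in for the in-domain model: an add-one-smoothed
    # unigram language model estimated from the in-domain corpus.
    counts = Counter(tok for sent in sentences for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def in_domain_score(sentence, p_in):
    # Average token log-probability under the in-domain model: candidates
    # that resemble the in-domain corpus score higher.
    toks = sentence.split()
    return sum(math.log(p_in(t)) for t in toks) / max(len(toks), 1)

def infrequent_ngram_score(sentence, selected_counts, n=3, threshold=10):
    # Number of n-grams in the sentence that are still "infrequent", i.e.
    # seen fewer than `threshold` times in the data selected so far
    # (`selected_counts` is a Counter of n-gram tuples). Both n and the
    # threshold are assumed values for illustration.
    toks = sentence.split()
    ngrams = (tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return sum(1 for g in ngrams if selected_counts[g] < threshold)
```

In a greedy selection loop one would repeatedly pick the highest-scoring candidate, add it to the selected set, and update `selected_counts`, so that n-grams already covered stop attracting further sentences.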
Experimental results report not only significant improvements over random sentence selection, but also an improvement over a system trained with the whole of the available data. Surprisingly, these improvements are obtained with just a small fraction of the data, accounting for less than 0.5% of the sentences. Finally, we show that much larger room for improvement exists, although under unrealistic conditions.

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under
grant agreement nr. 287755. This work was also supported by the Spanish MEC/MICINN under the MIPRCV "Consolider Ingenio 2010" program (CSD2007-00018) and the iTrans2 project (TIN2009-14511), by the Spanish MITyC under the erudito.com project (TSI-020110-2009-439), and by the Instituto Tecnológico de León, DGEST-PROMEP, and CONACYT, Mexico.