Nowadays, large amounts of data are available to train statistical machine translation systems. However, it is not clear whether all of this training data actually helps: a system trained on a subset of such huge bilingual corpora might outperform a system trained on all of the bilingual data. This paper studies this issue by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on the occurrence of infrequent n-grams.
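As a rough illustration of the two selection criteria, the following Python sketch scores a candidate sentence (a) by its probability under a simple unigram model of the in-domain corpus and (b) by how many of its n-grams are still infrequent in the data selected so far. The unigram model, the add-one smoothing, the threshold, and all function names are illustrative assumptions, not the exact formulation used in the paper.

```python
import math
from collections import Counter

def train_unigram_lm(sentences):
    # Illustrative stand-in for the in-domain model: an add-one-smoothed
    # unigram language model estimated from the in-domain corpus.
    counts = Counter(tok for sent in sentences for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def in_domain_score(sentence, p_in):
    # Average token log-probability under the in-domain model: candidates
    # that resemble the in-domain corpus score higher.
    toks = sentence.split()
    return sum(math.log(p_in(t)) for t in toks) / max(len(toks), 1)

def infrequent_ngram_score(sentence, selected_counts, n=3, threshold=10):
    # Number of n-grams in the sentence that are still "infrequent", i.e.
    # seen fewer than `threshold` times in the data selected so far
    # (`selected_counts` is a Counter of n-gram tuples). Both n and the
    # threshold are assumed values for illustration.
    toks = sentence.split()
    ngrams = (tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return sum(1 for g in ngrams if selected_counts[g] < threshold)
```

In a greedy selection loop one would repeatedly pick the highest-scoring candidate, add it to the selected set, and update `selected_counts`, so that n-grams already covered stop attracting further sentences.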
Experimental results report not only significant improvements over random sentence selection, but also an improvement over a system trained with the whole of the available data. Surprisingly, these improvements are obtained with just a small fraction of the data, accounting for less than 0.5% of the sentences. Finally, we show that much larger room for improvement exists, although under unrealistic conditions.

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under
grant agreement nr. 287755. This work was also supported by the Spanish MEC/MICINN under the MIPRCV "Consolider Ingenio 2010" program (CSD2007-00018) and the iTrans2 project (TIN2009-14511), by the Spanish MITyC under the erudito.com project (TSI-020110-2009-439), and by the Instituto Tecnológico de León, DGEST-PROMEP, and CONACYT, Mexico.