Does more data always yield better translations?
Nowadays, there are large amounts of data
available to train statistical machine translation
systems. However, it is not clear
whether all the training data actually help
or not. A system trained on a subset of such
huge bilingual corpora might outperform
the use of all the bilingual data. This paper
studies such issues by analysing two training
data selection techniques: one based
on approximating the probability of an in-domain
corpus; and another based on infrequent
n-gram occurrence. Experimental
results not only report significant improvements
over random sentence selection but
also an improvement over a system trained
with the whole available data. Surprisingly,
the improvements are obtained with just a
small fraction of the data that accounts for
less than 0.5% of the sentences. Afterwards,
we show that a much larger room for
improvement exists, although this is done
under non-realistic conditions.
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 287755. This work was also supported by the Spanish MEC/MICINN under the MIPRCV “Consolider Ingenio 2010” program (CSD2007-00018), and the iTrans2 (TIN2009-14511) project. Also supported by the Spanish MITyC under the erudito.com (TSI-020110-2009-439) project and Instituto Tecnológico de León, DGEST-PROMEP y CONACYT, México.
Gascó Mora, G.; Rocha Sánchez, MA.; Sanchis Trilles, G.; Andrés Ferrer, J.; Casacuberta Nolla, F. (2012). Does more data always yield better translations? Association for Computational Linguistics. 152-161. http://hdl.handle.net/10251/35214
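The second selection technique named in the abstract above, infrequent n-gram occurrence, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the greedy scoring, and the parameters `threshold`, `max_n` and `budget` are all assumptions made for illustration. The idea shown is to pick sentences from a large pool only while they still contain n-grams of the in-domain text that the selected data has covered fewer than `threshold` times.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Return the set of all n-grams (as tuples) of length 1..max_n."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def select_infrequent(in_domain, pool, threshold=1, max_n=2, budget=2):
    """Greedily pick up to `budget` pool sentences that cover n-grams of the
    in-domain text seen fewer than `threshold` times in the selection so far.
    Hypothetical helper: a simplified illustration, not the paper's method."""
    # N-grams we would like the selected training data to cover.
    needed = set()
    for sent in in_domain:
        needed |= ngrams(sent.split(), max_n)

    seen = Counter()          # how often each needed n-gram is already covered
    remaining = list(pool)
    selected = []

    def score(sent):
        # Number of needed n-grams in `sent` that are still infrequent.
        return sum(1 for g in ngrams(sent.split(), max_n)
                   if g in needed and seen[g] < threshold)

    while remaining and len(selected) < budget:
        best = max(remaining, key=score)
        if score(best) == 0:
            break             # nothing left in the pool adds new coverage
        selected.append(best)
        remaining.remove(best)
        for g in ngrams(best.split(), max_n):
            if g in needed:
                seen[g] += 1
    return selected


in_domain = ["the patient needs surgery"]
pool = ["the weather is nice", "the patient needs rest", "stock prices fell"]
chosen = select_infrequent(in_domain, pool)
```

With these toy inputs only one pool sentence shares still-uncovered n-grams with the in-domain text, so selection stops well before the pool is exhausted. That mirrors, on a very small scale, the abstract's point that a small subset of the data can be enough.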
TransLectures
transLectures (Transcription and Translation of Video Lectures)
is an EU STREP project in which advanced automatic speech
recognition and machine translation techniques are being tested on large
video lecture repositories. The project began in November 2011 and will
run for three years. This paper will outline the project's main motivation
and objectives, and give a brief description of the two main repositories
being considered: VideoLectures.NET and poliMedia. The first results
obtained by the UPV group for the poliMedia repository will also be
provided.
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755. Funding was also provided by the Spanish Government (iTrans2 project, TIN2009-14511; FPI scholarship BES-2010-033005; FPU scholarship AP2010-4349).
Silvestre Cerdà, JA.; Del Agua Teba, MA.; Garcés Díaz-Munío, GV.; Gascó Mora, G.; Giménez Pastor, A.; Martínez-Villaronga, AA.; Pérez González De Martos, AM.... (2012). TransLectures. IberSPEECH 2012. 345-351. http://hdl.handle.net/10251/37290
Phrase-Based ITG decoder
In this master's thesis we implement an MT decoder that uses the main strengths of the syntax-based and phrase-based approaches. The decoder has two parts: a parser that obtains the syntactic tree, and a tree-to-string method to find the most likely target-language sentence. Finally, we present experiments over two corpora: Spanish-English and German-English.
Gascó Mora, G. (2008). Phrase-Based ITG decoder. http://hdl.handle.net/10251/13296