
3 research outputs found

Does more data always yield better translations?

Author: Andrés Ferrer Jesús
Casacuberta Nolla Francisco
Gascó Mora Guillem
Rocha Sánchez Martha Alicia
Sanchis Trilles Germán
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 23/04/2012

Nowadays, there are large amounts of data available to train statistical machine translation systems. However, it is not clear whether all the training data actually help or not. A system trained on a subset of such huge bilingual corpora might outperform the use of all the bilingual data. This paper studies such issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus; and another based on infrequent n-gram occurrence. Experimental results not only report significant improvements over random sentence selection but also an improvement over a system trained with the whole available data. Surprisingly, the improvements are obtained with just a small fraction of the data that accounts for less than 0.5% of the sentences. Afterwards, we show that a much larger room for improvement exists, although this is done under non-realistic conditions.

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 287755. This work was also supported by the Spanish MEC/MICINN under the MIPRCV “Consolider Ingenio 2010” program (CSD2007-00018), and iTrans2 (TIN2009-14511) project. Also supported by the Spanish MITyC under the erudito.com (TSI-020110-2009-439) project and Instituto Tecnológico de León, DGEST-PROMEP y CONACYT, México.

Gascó Mora, G.; Rocha Sánchez, MA.; Sanchis Trilles, G.; Andrés Ferrer, J.; Casacuberta Nolla, F. (2012). Does more data always yield better translations? Association for Computational Linguistics. 152-161. http://hdl.handle.net/10251/35214

CiteSeerX

RiuNet

TransLectures

Author: Andrés Ferrer Jesús
Civera Saiz Jorge
Del Agua Teba Miguel Angel
Garcés Díaz-Munío Gonzalo Vicente
Gascó Mora Guillem
Giménez Pastor Adrián
Juan Císcar Alfonso
Martínez-Villaronga Adrià Agustí
Pérez González de Martos Alejandro Manuel
Sanchis Navarro José Alberto
Serrano Martínez-Santos Nicolás
Silvestre Cerdà Joan Albert
Spencer Rachel Nadine
Sánchez-Cortina Isaías
Valor Miró Juan Daniel
Publication venue: IberSPEECH 2012
Publication date: 21/11/2012

transLectures (Transcription and Translation of Video Lectures) is an EU STREP project in which advanced automatic speech recognition and machine translation techniques are being tested on large video lecture repositories. The project began in November 2011 and will run for three years. This paper will outline the project's main motivation and objectives, and give a brief description of the two main repositories being considered: VideoLectures.NET and poliMedia. The first results obtained by the UPV group for the poliMedia repository will also be provided.

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287755. Funding was also provided by the Spanish Government (iTrans2 project, TIN2009-14511; FPI scholarship BES-2010-033005; FPU scholarship AP2010-4349).

Silvestre Cerdà, JA.; Del Agua Teba, MA.; Garcés Díaz-Munío, GV.; Gascó Mora, G.; Giménez Pastor, A.; Martínez-Villaronga, AA.; Pérez González De Martos, AM.... (2012). TransLectures. IberSPEECH 2012. 345-351. http://hdl.handle.net/10251/37290

RiuNet

Phrase-Based ITG decoder

Author: Gascó Mora Guillem
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 21/11/2011

In this master's thesis we implement an MT decoder that uses the main strengths of the syntax-based and phrase-based approaches. The decoder has two parts: a parser that obtains the syntactic tree, and a tree-to-string method to find the most likely target-language sentence. Finally, we present experiments over two corpora: Spanish-English and German-English.

Gascó Mora, G. (2008). Phrase-Based ITG decoder. http://hdl.handle.net/10251/13296

RiuNet