Vector sentences representation for data selection in statistical machine translation

Authors: Mara Chinea-Rios; Germán Sanchis-Trilles; Francisco Casacuberta
Publication date: 1 July 2019
Publisher: Elsevier BV
DOI: 10.1016/j.csl.2018.12.005

Abstract

One of the most popular approaches to machine translation formulates the problem as a pattern recognition task. Under this perspective, bilingual corpora are precious resources, as they allow for a proper estimation of the underlying models. In this framework, selecting the best possible corpus is critical: data selection aims to find the best subset of the bilingual sentences in an available pool, such that the final translation quality is improved. In this paper, we present a new data selection technique that leverages a continuous vector-space representation of sentences. Experimental results show improvements not only over a system trained with in-domain data alone, but also over a system trained on all the available data. Finally, we compare our proposal with other state-of-the-art data selection techniques (cross-entropy selection and infrequent n-grams recovery) in two different scenarios, with very promising results: our data selection strategy yields results at least as good as those of the best-performing strategy in each scenario. The empirical results are consistent across different language pairs.

Work supported by the Generalitat Valenciana under grant ALMAMATER (PrometeoII/2014/030) and the FPI (2014) grant by Universitat Politècnica de València.

Chinea-Rios, M.; Sanchis Trilles, G.; Casacuberta Nolla, F. (2019). Vector sentences representation for data selection in statistical machine translation. Computer Speech & Language. 56:1-16. https://doi.org/10.1016/j.csl.2018.12.005

Available Versions

RiuNet

oai:riunet.upv.es:10251/155404

Last time updated on 08/04/2021

Crossref

Last time updated on 30/10/2020