Improving Statistical Machine Translation Performance by Training Data Selection and Optimization

Jin Huang; Qun Liu; Yajuan Lü

Abstract

A parallel corpus is an indispensable resource for translation model training in statistical machine translation (SMT). Instead of collecting more and more parallel training corpora, this paper aims to improve SMT performance by exploiting the full potential of the existing parallel corpora. Two kinds of methods are proposed: offline data optimization and online model optimization. The offline method adapts the training data by redistributing the weight of each training sentence pair. The online method adapts the translation model by redistributing the weight of each predefined submodel. An information retrieval model is used for the weighting scheme in both methods. Experimental results show that, without using any additional resources, both methods can improve SMT performance significantly.
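
The offline idea of reweighting training sentence pairs with an information retrieval model can be illustrated with a small sketch. The abstract does not say which retrieval model is used or what the training sentences are scored against, so the sketch below rests on assumptions: it uses TF-IDF cosine similarity, and it scores each pair's source side against a set of query sentences (for example, text from the domain to be translated). The function names `cosine` and `weight_sentence_pairs` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weight_sentence_pairs(pairs, queries):
    """Return one relevance weight per (source, target) training pair.

    Assumption (not from the abstract): the weight of a pair is the highest
    TF-IDF cosine similarity between its source side and any query sentence.
    """
    src_docs = [src.split() for src, _ in pairs]
    n_docs = len(src_docs)

    # Document frequency and smoothed IDF, estimated on the training source side.
    df = Counter(term for doc in src_docs for term in set(doc))
    idf = {term: math.log((n_docs + 1) / (count + 1)) + 1.0
           for term, count in df.items()}

    def vectorize(tokens):
        # Terms unseen in the training data have no IDF and are ignored.
        return {term: tf * idf[term]
                for term, tf in Counter(tokens).items() if term in idf}

    query_vecs = [vectorize(q.split()) for q in queries]

    weights = []
    for doc in src_docs:
        doc_vec = vectorize(doc)
        best = max((cosine(doc_vec, q) for q in query_vecs), default=0.0)
        weights.append(best)
    return weights


if __name__ == "__main__":
    training_pairs = [
        ("the cat sits", "le chat est assis"),
        ("stock markets fell", "les marchés ont chuté"),
    ]
    for pair, w in zip(training_pairs,
                       weight_sentence_pairs(training_pairs, ["the cat sleeps"])):
        print(f"{w:.3f}\t{pair[0]}")
```

Under these assumptions, the weights could be used to filter the corpus or to repeat high-scoring pairs before training the translation model. How the paper itself turns weights into a redistributed training set is not described in the abstract.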
