MTNT: A Testbed for Machine Translation of Noisy Text
Noisy or non-standard input text can cause disastrous mistranslations in most
modern Machine Translation (MT) systems, and there has been growing research
interest in creating noise-robust MT systems. However, there are as of yet no
publicly available parallel corpora with naturally occurring noisy inputs
and translations, and thus previous work has resorted to evaluating on
synthetically created datasets. In this paper, we propose a benchmark dataset
for Machine Translation of Noisy Text (MTNT), consisting of noisy comments on
Reddit (www.reddit.com) and professionally sourced translations. We
commissioned translations of English comments into French and Japanese, as well
as French and Japanese comments into English, on the order of 7k-37k sentences
per language pair. We qualitatively and quantitatively examine the types of
noise included in this dataset, then demonstrate that existing MT models fail
badly on a number of noise-related phenomena, even after performing adaptation
on a small training set of in-domain data. This indicates that this dataset can
provide an attractive testbed for methods tailored to handling noisy text in
MT. The data is publicly available at www.cs.cmu.edu/~pmichel1/mtnt/.
Comment: EMNLP 2018 Long Paper
Adaptive Development Data Selection for Log-linear Model in Statistical Machine Translation
This paper addresses the problem of dynamic model parameter selection for log-linear model based statistical machine translation (SMT) systems. We propose a principled method for this task by transforming it into a test-data-dependent development set selection problem. We present two algorithms for automatic development set construction and evaluate our method on several NIST data sets for the Chinese-English translation task. Experimental results show that our method can effectively adapt log-linear model parameters to different test data, and consistently achieves good translation performance compared with conventional methods that use a fixed parameter setting across different data sets.