Improving English-Spanish statistical machine translation: Experiments in domain adaptation, sentence paraphrasing, tokenization, and recasing

By Preslav Nakov


We describe the experiments of the UC Berkeley team on improving English-Spanish machine translation of news text, as part of the WMT’08 Shared Translation Task. We experiment with domain adaptation, combining a small in-domain news bi-text and a large out-of-domain one from the Europarl corpus, building two separate phrase translation models and two separate language models. We further add a third phrase translation model trained on a version of the news bi-text augmented with monolingual sentence-level syntactic paraphrases on the source-language side, and we combine all models in a log-linear model using minimum error rate training. Finally, we experiment with different tokenization and recasing rules, achieving a 35.09% Bleu score on the WMT’07 news test data when translating from English to Spanish, which is a sizable improvement over the highest Bleu score achieved on that dataset at WMT’07: 33.10% (in fact, by our system). On the WMT’08 English to Spanish news translation, we achieve 21.92%, which makes our team the second best on Bleu score.
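
The log-linear combination mentioned in the abstract can be illustrated with a minimal sketch. It scores each candidate translation as a weighted sum of log-domain feature values, one per phrase translation model and language model, and picks the highest-scoring candidate. The feature names, weights, and numbers below are invented for illustration and are not taken from the paper; in a real system the weights would be tuned by minimum error rate training rather than set by hand.

```python
def loglinear_score(features, weights):
    """Return sum_i weight_i * feature_i over log-domain feature values."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(hypotheses, weights):
    """Pick the candidate translation with the highest log-linear score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))


# Hypothetical weights (assumption): stand-ins for values that
# minimum error rate training would produce on a development set.
WEIGHTS = {
    "tm_news": 1.0,        # phrase table from the in-domain news bi-text
    "tm_europarl": 0.5,    # phrase table from the out-of-domain Europarl bi-text
    "tm_paraphrase": 0.5,  # phrase table from the paraphrase-augmented news bi-text
    "lm_news": 1.0,        # in-domain language model
    "lm_europarl": 0.5,    # out-of-domain language model
}

# Hypothetical candidates with made-up log-probability features.
HYPOTHESES = [
    {
        "text": "hypothesis A",
        "features": {
            "tm_news": -1.0, "tm_europarl": -2.0, "tm_paraphrase": -1.0,
            "lm_news": -3.0, "lm_europarl": -4.0,
        },
    },
    {
        "text": "hypothesis B",
        "features": {
            "tm_news": -2.0, "tm_europarl": -1.0, "tm_paraphrase": -2.0,
            "lm_news": -4.0, "lm_europarl": -2.0,
        },
    },
]

if __name__ == "__main__":
    for hyp in HYPOTHESES:
        print(hyp["text"], loglinear_score(hyp["features"], WEIGHTS))
    print("best:", best_hypothesis(HYPOTHESES, WEIGHTS)["text"])
```

With these made-up values, the candidate favored by the in-domain models wins because those models carry larger weights; changing the weights shifts the balance between in-domain and out-of-domain evidence.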

Year: 2013
OAI identifier: oai:CiteSeerX.psu:
Provided by: CiteSeerX