Microblogs as Parallel Corpora

By Wang Ling, Guang Xiang, Chris Dyer, Alan Black and Isabel Trancoso

Abstract

In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences while others “retweet” translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at www.cs.cmu.edu/%7Elingwang/utopia.

Year: 2013
OAI identifier: oai:CiteSeerX.psu:10.1.1.352.3088
Provided by: CiteSeerX