Automatic acquisition of Chinese-English parallel corpus from the web. In: ECIR2006

Jianfeng Gao; Ke Wu; Phil Vines; Ying Zhang

Authors: Jianfeng Gao
Ke Wu
Phil Vines
Ying Zhang
Publisher: Springer

Abstract

Parallel corpora are a valuable resource for tasks such as cross-language information retrieval and data-driven natural language processing systems. Previously, only small-scale corpora have been available, which has restricted their practical use. This paper describes a system that overcomes this limitation by automatically collecting high-quality parallel bilingual corpora from the web. Previous systems used a single principal feature for parallel web page verification, whereas we use multiple features to identify parallel texts via a k-nearest-neighbor classifier. Our system was evaluated using a data set containing 6500 manually annotated Chinese–English candidate parallel pairs. Experiments show that the k-nearest-neighbor classifier with multiple features achieves substantial improvements over systems that use any one of these features alone. The system achieved a precision of 95% and a recall of 97%, a significant improvement over earlier work.
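The idea of combining several features in a k-nearest-neighbor classifier, and of scoring the result by precision and recall, can be illustrated with a minimal Python sketch. This record does not list the features the paper actually uses, so the three feature names below (length ratio, structure similarity, translation overlap), the toy training data, and the function names `knn_predict` and `precision_recall` are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def precision_recall(predicted, actual):
    """Compute precision and recall for boolean 'is parallel' decisions."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical feature vectors for annotated candidate page pairs:
# (length ratio, structure similarity, translation overlap) -> is_parallel
train = [
    ((0.95, 0.90, 0.80), True),
    ((0.90, 0.85, 0.75), True),
    ((0.88, 0.80, 0.70), True),
    ((0.40, 0.30, 0.10), False),
    ((0.55, 0.20, 0.15), False),
    ((0.30, 0.25, 0.05), False),
]

# Invented held-out candidates with their manual annotations.
candidates = [(0.92, 0.88, 0.78), (0.35, 0.28, 0.08), (0.85, 0.75, 0.65)]
gold = [True, False, True]

predicted = [knn_predict(train, c) for c in candidates]
precision, recall = precision_recall(predicted, gold)
print(predicted, precision, recall)
```

Because every feature contributes to the distance, a pair that looks borderline on one feature can still be classified correctly when the others agree, which is the intuition behind using multiple features rather than a single one.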

Available Versions

CiteSeerX

oai:CiteSeerX.psu:10.1.1.72.29...

Last time updated on 22/10/2014