Feature-based method for document alignment in comparable news corpora

Abstract

In this paper, we present a feature-based method for aligning documents with similar content across two sets of bilingual comparable corpora drawn from daily news texts. We evaluate the contribution of each individual feature and investigate how these diverse statistical and heuristic features can be combined for the task of bilingual document alignment. Experimental results on English-Chinese and English-Malay comparable news corpora show that our proposed Discrete Fourier Transform-based term frequency distribution feature is very effective: it yields performance improvements of 4.1% and 8% over Pearson’s correlation method on the two corpora. In addition, when more heuristic and statistical features as well as a bilingual dictionary are utilized, our method achieves absolute performance improvements of 23.2% and 15.3% on the two sets of bilingual corpora compared with a prior information retrieval-based method.

Last updated on 01/04/2019