
Improving document clustering using automated machine translation

Authors: Buyue Qian, Ian Davidson, Xiang Wang
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2012

With the development of statistical machine translation, we have ready-to-use tools that can translate documents in one language into many different languages. These translations provide different yet correlated views of the same set of documents. This gives rise to a natural question: can we use the extra information to achieve a better clustering of the documents? Some recent work on multiview clustering provided positive answers to this question. In this work, we propose an alternative approach to address this problem using the constrained clustering framework. Unlike traditional Must-Link and Cannot-Link constraints, the constraints generated by machine translation are dense yet noisy. We show how to incorporate this type of constraint by presenting two algorithms, one parametric and one non-parametric. Our algorithms are easy to implement, efficient, and can consistently improve the clustering of real-world data, namely the Reuters RCV1/RCV2 Multilingual Dataset. In contrast to the existing multiview clustering techniques, our technique does not rely on the compatibility and conditional independence assumptions, nor does it involve subtle parameter tuning.
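
The abstract does not say how the constraints are built or used, so the following is only an illustrative sketch and not the authors' algorithm. It assumes each translated view has already been clustered on its own, and that pairwise agreement between those cluster labels serves as a soft Must-Link (+1) or Cannot-Link (-1) signal. Under that assumption every pair of documents receives a constraint (dense), and any single view may be wrong (noisy). The function names `constraints_from_labels`, `average_constraints`, and `satisfaction` are hypothetical.

```python
from itertools import combinations


def constraints_from_labels(view_labels):
    """Build a dense pairwise constraint matrix from one view's cluster labels.

    Q[i][j] is +1 (soft Must-Link) if documents i and j share a label in
    that view, -1 (soft Cannot-Link) otherwise, and 0 on the diagonal.
    Assumption: the view was clustered independently beforehand.
    """
    n = len(view_labels)
    return [
        [0 if i == j else (1 if view_labels[i] == view_labels[j] else -1)
         for j in range(n)]
        for i in range(n)
    ]


def average_constraints(matrices):
    """Average several constraint matrices entry by entry.

    Views that disagree about a pair cancel toward 0, which is one simple
    (assumed) way to dampen the noise of any single translation.
    """
    k = len(matrices)
    n = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / k for j in range(n)]
        for i in range(n)
    ]


def satisfaction(labels, Q):
    """Score a candidate clustering against the soft constraints.

    Adds Q[i][j] for pairs placed in the same cluster and subtracts it
    for pairs placed in different clusters; higher means better agreement.
    """
    score = 0.0
    for i, j in combinations(range(len(labels)), 2):
        score += Q[i][j] if labels[i] == labels[j] else -Q[i][j]
    return score


# Toy example: four documents, two hypothetical translated views.
french_view = [0, 0, 1, 1]
german_view = [0, 0, 0, 1]  # disagrees with the other view on document 2
Q = average_constraints([
    constraints_from_labels(french_view),
    constraints_from_labels(german_view),
])
```

A constrained clustering method would then prefer partitions of the original-language documents that score well under `satisfaction` while still fitting the original features; how the paper balances these two goals is not stated in the abstract.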
