Distributed Document Clustering Using Word-clusters

Publication date: 14/08/2008

Abstract: Document clustering has become an increasingly important task in analyzing huge numbers of documents distributed among various sites. The challenge is to analyze this enormous number of extremely high-dimensional distributed documents and to organize them in a way that yields better search and knowledge extraction without introducing much extra cost and complexity. This paper presents a distributed document clustering approach called Distributed Information Bottleneck (DIB). DIB adopts a two-stage agglomerative Information Bottleneck (aIB) algorithm to generate local clusters. In the first stage, the dimensionality of the document vectors is significantly reduced by finding word-clusters. In the second stage, these word-clusters are used to obtain document-clusters. DIB then extracts compact but informative local models from these document-clusters and transfers them to a central site. At the central (global) site, local models that are likely to describe the same document set are first combined. The resulting local models are then clustered using the aIB algorithm to produce a hierarchical organization of all distributed documents. Our experimental results demonstrate the robustness, efficiency, and effectiveness of the DIB approach for clustering distributed documents.
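The two-stage local clustering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the commonly used aIB merge criterion (the merged prior times a prior-weighted Jensen-Shannon divergence), stops at a fixed cluster count rather than building the full hierarchy, and leaves out the local-model extraction and the global combination step entirely. The names `kl`, `merge_cost`, `aib`, and `two_stage` are hypothetical, and every row of the count table is assumed to have a nonzero sum.

```python
import math

def kl(p, q):
    # KL divergence in bits; terms with p[i] == 0 contribute nothing.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(w1, p1, w2, p2):
    # Assumed aIB criterion: (w1 + w2) times the weighted
    # Jensen-Shannon divergence between the two conditional distributions.
    w = w1 + w2
    a, b = w1 / w, w2 / w
    mix = [a * x + b * y for x, y in zip(p1, p2)]
    return w * (a * kl(p1, mix) + b * kl(p2, mix))

def aib(table, n_clusters):
    # Greedy agglomerative clustering of the rows of a co-occurrence table.
    # Assumes every row has a nonzero sum (hypothetical simplification).
    total = sum(sum(row) for row in table)
    # Each cluster: (member row indices, prior weight, distribution over columns).
    clusters = [([i], sum(row) / total, [x / sum(row) for x in row])
                for i, row in enumerate(table)]
    while len(clusters) > n_clusters:
        # Pick the pair whose merge has the lowest cost under the criterion above.
        _, i, j = min(
            (merge_cost(clusters[i][1], clusters[i][2],
                        clusters[j][1], clusters[j][2]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters)))
        (mi, wi, pi), (mj, wj, pj) = clusters[i], clusters[j]
        w = wi + wj
        merged = (mi + mj, w, [(wi * x + wj * y) / w for x, y in zip(pi, pj)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(members) for members, _, _ in clusters]

def two_stage(counts, n_word_clusters, n_doc_clusters):
    # counts[d][w] is the count of word w in document d.
    n_docs, n_words = len(counts), len(counts[0])
    # Stage 1: cluster words, each described by its distribution over documents.
    word_rows = [[counts[d][w] for d in range(n_docs)] for w in range(n_words)]
    word_clusters = aib(word_rows, n_word_clusters)
    # Stage 2: re-express documents over word-clusters, then cluster documents.
    reduced = [[sum(counts[d][w] for w in wc) for wc in word_clusters]
               for d in range(n_docs)]
    return word_clusters, aib(reduced, n_doc_clusters)

# Toy corpus: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
toy_counts = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
word_clusters, doc_clusters = two_stage(toy_counts, 2, 2)
print(sorted(word_clusters), sorted(doc_clusters))
```

In this sketch the second stage operates on vectors whose length equals the number of word-clusters rather than the vocabulary size, which is the dimensionality reduction the abstract refers to. The pairwise search is quadratic per merge and is meant only to make the idea concrete.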

CiteSeerX