Given a document collection, it is often desirable to find the core subset of documents that focus on a specific topic. Document clustering partitions a document-term dataset into groups by optimizing an objective function. However, clustering methods are not well suited to finding hotspots, which are described by a small set of documents sharing a few tightly coupled terms. In this paper we propose a novel hotspot-finding algorithm for document collections, DCC (Dense Concept Clustering). DCC extracts distinct small topics together with their most representative documents and words simultaneously. The hotspots are dense bicliques in the binary document-word matrix, and they are discovered sequentially, one at a time, using the generalized Motzkin-Straus formalism. The representative documents and words of each hotspot are tightly correlated and together serve as a concept description. Experiments on real document datasets show the effectiveness of the proposed algorithm.
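To make the idea of a dense biclique in a binary document-word matrix concrete, the following is a minimal sketch and not the DCC implementation. The abstract does not state the update rule, so the code assumes one replicator-style multiplicative update for a Motzkin-Straus-type relaxation: maximize x^T A y subject to sum(x_i^alpha) = 1 and sum(y_j^beta) = 1. The function name `dense_biclique`, the exponents `alpha` and `beta`, the iteration count, and the `keep` threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def dense_biclique(A, alpha=1.5, beta=1.5, iters=200, keep=0.5):
    """Return (doc indices, word indices) of one dense block of binary matrix A.

    Assumed relaxation (illustrative, not taken from the paper):
    maximize x^T A y with sum(x**alpha) == 1 and sum(y**beta) == 1.
    """
    A = np.asarray(A, dtype=float)
    n_docs, n_words = A.shape
    # Uniform starting points that satisfy both norm constraints.
    x = np.full(n_docs, n_docs ** (-1.0 / alpha))
    y = np.full(n_words, n_words ** (-1.0 / beta))
    for _ in range(iters):
        Ay = A @ y
        s = x @ Ay                      # current objective x^T A y
        if s <= 0:
            break
        # Multiplicative update; keeps sum(x**alpha) == 1.
        x = (x * Ay / s) ** (1.0 / alpha)
        Atx = A.T @ x
        s = Atx @ y
        if s <= 0:
            break
        # Same update for the word weights; keeps sum(y**beta) == 1.
        y = (y * Atx / s) ** (1.0 / beta)
    # Keep the entries whose weight is close to the largest weight.
    docs = np.flatnonzero(x >= keep * x.max()).tolist()
    words = np.flatnonzero(y >= keep * y.max()).tolist()
    return docs, words

# Toy matrix: documents 0-2 share words 0-1; the rest are stray entries.
A = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
])
docs, words = dense_biclique(A)
print(docs, words)
```

In this toy matrix the weights concentrate on the 3-by-2 block of ones, so the returned indices describe a small group of documents and the words they share. A sequential scheme in the spirit of the abstract could zero out the entries of a block once it is found and run the same routine again to look for the next one, although how DCC handles that step is not specified here.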