
Finding Hotspots in Document Collection

By Wei Peng, Tao Li, Chris Ding and Tong Sun

Abstract

Given a document collection, it is often desirable to find the core subset of documents focusing on a specific topic. Document clustering partitions a document-term dataset into groups by optimizing an objective function, but such methods are not suited to finding hotspots, which are described by a small set of documents sharing a few tightly coupled terms. In this paper we propose a novel hotspot-finding algorithm for document collections, DCC (Dense Concept Clustering). DCC extracts distinct small topics together with their most representative documents and words simultaneously. The hotspots are dense bicliques in the binary document-word matrix, and they are discovered sequentially, one at a time, using the generalized Motzkin-Straus formalism. The representative documents and words are tightly correlated, which makes them suitable as concept descriptions. Experiments on real document datasets show the effectiveness of the proposed algorithm.
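To make the idea of extracting dense blocks from a binary document-word matrix one at a time more concrete, here is a minimal sketch. It is not the DCC algorithm: the abstract does not give the update rules, so the code substitutes a simple relaxation loosely in the spirit of Motzkin-Straus, namely alternating maximization of x^T A y under L_p norm constraints on the nonnegative document weights x and word weights y. The function names, the value p = 1.5, the 0.5 threshold, the zeroing-out step between rounds, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

def lp_maximizer(c, p):
    """Nonnegative x with ||x||_p = 1 maximizing x . c for c >= 0.
    By Hoelder's inequality the maximizer satisfies x_i ~ c_i ** (1 / (p - 1))."""
    top = c.max()
    if top <= 0:
        return np.zeros_like(c)
    v = (c / top) ** (1.0 / (p - 1.0))   # rescale first to avoid overflow
    return v / np.linalg.norm(v, ord=p)

def dense_block(A, p=1.5, iters=200, keep=0.5):
    """Alternately update document weights x and word weights y, then keep
    the rows/columns whose weight is at least `keep` times the largest one.
    (Hypothetical helper; p and keep are illustrative choices.)"""
    A = np.asarray(A, dtype=float)
    y = np.ones(A.shape[1])
    for _ in range(iters):
        x = lp_maximizer(A @ y, p)
        y = lp_maximizer(A.T @ x, p)
    docs = np.flatnonzero(x >= keep * x.max())
    words = np.flatnonzero(y >= keep * y.max())
    return docs, words

def find_hotspots(A, k, **kw):
    """Extract up to k blocks sequentially. Zeroing the found block before
    the next round is an assumption, not a step taken from the paper."""
    R = np.array(A, dtype=float)
    found = []
    for _ in range(k):
        if not R.any():
            break
        docs, words = dense_block(R, **kw)
        found.append((docs.tolist(), words.tolist()))
        R[np.ix_(docs, words)] = 0.0
    return found

# Toy binary document-word matrix (rows: documents, columns: words).
A_demo = np.array([
    [1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1],
])
hotspots = find_hotspots(A_demo, k=2)
```

Each entry of `hotspots` pairs a list of document indices with a list of word indices. Choosing p between 1 and 2 trades off sparsity against smoothness: p = 2 reduces the loop to ordinary power iteration on the matrix, while values closer to 1 concentrate the weight on fewer documents and words.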

Year: 2013
OAI identifier: oai:CiteSeerX.psu:10.1.1.352.6484
Provided by: CiteSeerX