
2 research outputs found

Fast and effective text mining using linear-time document clustering

Authors: Bjornar Larsen, Chinatsu Aone
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/1999

Clustering is a powerful technique for large-scale topic discovery from text. It involves two phases: first, feature extraction maps each document or record to a point in high-dimensional space, then clustering algorithms automatically group the points into a hierarchy of clusters. We describe an unsupervised, near-linear time text clustering system that offers a number of algorithm choices for each phase. We introduce a methodology for measuring the quality of a cluster hierarchy in terms of F-Measure, and present the results of experiments comparing different algorithms. The evaluation considers some feature selection parameters (tf-idf and feature vector length) but focuses on the clustering algorithms, namely techniques from Scatter/Gather (buckshot, fractionation, and split/join) and k-means. Our experiments suggest that continuous center adjustment contributes more to cluster quality than seed selection does. It follows that using a simpler seed selection algorithm gives a better time/quality tradeoff. We describe a refinement to center adjustment, “vector average damping,” that further improves cluster quality. We also compare the near-linear time algorithms to a group average greedy agglomerative clustering algorithm to demonstrate the time/quality tradeoff quantitatively.
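
The abstract mentions scoring a cluster hierarchy by F-Measure without giving the formula. The sketch below illustrates one common formulation, assumed here rather than taken from the abstract: for each labeled class, take the best F-score over every node in the hierarchy, then average those best scores weighted by class size. The function name `hierarchy_f_measure`, its arguments, and the toy data are hypothetical.

```python
from collections import Counter


def hierarchy_f_measure(labels, clusters):
    """Score a cluster hierarchy against known class labels.

    labels:   list with one class label per document (index = document id).
    clusters: iterable of sets of document ids, one set per node in the
              hierarchy (assumed representation, not from the abstract).
    """
    n_docs = len(labels)
    class_sizes = Counter(labels)
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for members in clusters:
            # Number of documents of this class that fall in this cluster.
            overlap = sum(1 for doc in members if labels[doc] == cls)
            if overlap == 0:
                continue
            precision = overlap / len(members)
            recall = overlap / n_cls
            f_score = 2 * precision * recall / (precision + recall)
            best = max(best, f_score)
        # Weight each class's best F-score by its share of the collection.
        total += (n_cls / n_docs) * best
    return total


# Toy example: five documents, two classes, a four-node hierarchy.
labels = ["a", "a", "a", "b", "b"]
clusters = [{0, 1, 2, 3, 4}, {0, 1}, {2, 3, 4}, {3, 4}]
score = hierarchy_f_measure(labels, clusters)
```

Because the maximum is taken over all nodes, a class counts as well recovered if any level of the hierarchy isolates it, which is why this kind of measure suits hierarchical output rather than a single flat partition.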

CiteSeerX

Crossref

Why Do Banks Fail? - The Explanation from Text Analytics Technique

Crossref