Alternatives to the k-means algorithm that find better clusterings
We investigate here the behavior of the standard k-means clustering
algorithm and several alternatives to it: the k-harmonic means algorithm due to
Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two
new variants of k-harmonic means. Our aim is to find which aspects of these
algorithms contribute to finding good clusterings, as opposed to converging to
a low-quality local optimum. We describe each algorithm in a unified framework
that introduces separate cluster membership and data weight functions. We then
show that the algorithms do behave very differently from each other on simple
low-dimensional synthetic datasets, and that the k-harmonic means method is
superior. Having a soft membership function is essential for finding
high-quality clusterings, but having a non-constant data weight function is
also useful.
Pre-2018 CSE ID: CS2002-070
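The unified framework described above expresses every algorithm as a center update driven by a cluster membership function m_j(x) and a data weight function w(x). The following is a minimal illustrative sketch (not the paper's code): hard membership plus a constant weight recovers standard k-means on 1-D data.

```python
# Unified center update:
#   c_j = sum_i m_j(x_i) * w(x_i) * x_i / sum_i m_j(x_i) * w(x_i)
# Hard membership + constant weight == standard k-means (1-D sketch).

def hard_membership(x, centers):
    """k-means membership: all mass on the nearest center."""
    nearest = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
    return [1.0 if j == nearest else 0.0 for j in range(len(centers))]

def constant_weight(x):
    """k-means weight: every point counts equally."""
    return 1.0

def update_centers(data, centers, membership, weight):
    """One pass of the unified weighted-average center update."""
    num = [0.0] * len(centers)
    den = [0.0] * len(centers)
    for x in data:
        m, w = membership(x, centers), weight(x)
        for j in range(len(centers)):
            num[j] += m[j] * w * x
            den[j] += m[j] * w
    return [num[j] / den[j] if den[j] else centers[j]
            for j in range(len(centers))]

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers = [0.0, 6.0]
for _ in range(5):
    centers = update_centers(data, centers, hard_membership, constant_weight)
print(centers)  # -> [1.0, 5.0], the two cluster means
```

Swapping in a soft membership function and a non-constant weight function, as the paper does for k-harmonic means and its variants, changes only the two plug-in functions; the update skeleton stays the same.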
Learning the k in k-means
When clustering a dataset, the right number of clusters to use
is often not obvious, and choosing k automatically is a hard algorithmic
problem. In this paper we present a new algorithm for choosing k that is based
on a new statistical test for the hypothesis that a subset of data follows a
Gaussian distribution. The algorithm runs k-means with increasing k until the
test fails to reject the hypothesis that the data assigned to each k-means
center are Gaussian. We present results from experiments on synthetic and
real-world data showing that the algorithm works well, and better than a recent
method based on the BIC penalty for model complexity.
Pre-2018 CSE ID: CS2002-071
Learning the k in k-means
When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize the model's complexity strongly enough.
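The G-means loop can be sketched roughly as follows. This is an illustrative stand-in, not the paper's implementation: G-means projects each cluster onto the axis between two candidate sub-centers and applies an Anderson-Darling normality test, whereas here a crude excess-kurtosis check plays the role of the test, just to show the split-and-refine control flow.

```python
# Sketch of the G-means control flow (1-D data; kurtosis check stands in
# for the paper's Anderson-Darling test on a 1-D projection).
import random
import statistics

def looks_gaussian(points, tol=1.5):
    """Stand-in normality test: excess kurtosis of a Gaussian is ~0."""
    if len(points) < 8:
        return True
    mu, sd = statistics.fmean(points), statistics.pstdev(points)
    if sd == 0:
        return True
    m4 = statistics.fmean([((x - mu) / sd) ** 4 for x in points])
    return abs(m4 - 3.0) < tol  # standardized 4th moment of a Gaussian is 3

def kmeans_1d(data, centers, iters=25):
    """Plain Lloyd iterations; returns refined centers and their clusters."""
    clusters = [list(data)]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
            clusters[j].append(x)
        centers = [statistics.fmean(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

def g_means(data, tol=1.5):
    """Start with k=1; split any center whose cluster fails the test."""
    centers = [statistics.fmean(data)]
    while True:
        centers, clusters = kmeans_1d(data, centers)
        new_centers, split = [], False
        for c, pts in zip(centers, clusters):
            if looks_gaussian(pts, tol):
                new_centers.append(c)
            else:
                sd = statistics.pstdev(pts) or 1.0
                new_centers += [c - sd, c + sd]  # two candidate sub-centers
                split = True
        if not split:
            return centers
        centers = new_centers

random.seed(0)
data = ([random.gauss(0, 1) for _ in range(200)]
        + [random.gauss(10, 1) for _ in range(200)])
k = len(g_means(data))
print(k)  # -> 2
```

The hierarchical character of the real algorithm is visible here: k only grows, and each accepted center is kept while rejected ones are split and re-refined by k-means.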
Bayesian approaches to failure prediction for disk drives
Hard disk drive failures are rare but often costly. The ability to predict failures is important to consumers, drive manufacturers, and computer system manufacturers alike. In this paper we investigate the abilities of two Bayesian methods to predict disk drive failures based on measurements of drive internal conditions. We first view the problem from an anomaly detection stance. We introduce a mixture model of naive Bayes submodels (i.e., clusters) that is trained using expectation-maximization. The second method is a naive Bayes classifier, a supervised learning approach. Both methods are tested on real-world data concerning 1936 drives. The predictive accuracy of both algorithms is far higher than the accuracy of thresholding methods used in the disk drive industry today.
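The supervised method described above can be illustrated with a minimal Gaussian naive Bayes classifier. The feature names and numbers below are invented stand-ins for drive-health measurements, not the paper's data.

```python
# Minimal Gaussian naive Bayes sketch: per-class, per-feature mean/variance
# plus class priors; prediction maximizes the log-posterior.
import math
import statistics

def fit(X, y):
    """Estimate class priors and per-feature Gaussian parameters."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cols = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(X),
            "mean": [statistics.fmean(c) for c in cols],
            "var": [statistics.pvariance(c) + 1e-6 for c in cols],  # smoothed
        }
    return model

def predict(model, x):
    """Pick the class with the highest naive-Bayes log-posterior."""
    def log_post(p):
        lp = math.log(p["prior"])
        for v, mu, var in zip(x, p["mean"], p["var"]):
            lp += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        return lp
    return max(model, key=lambda label: log_post(model[label]))

# Toy rows: (reallocated sectors, read error rate) -> drive status.
X = [(0, 1), (1, 2), (0, 2), (50, 9), (60, 8), (55, 10)]
y = ["healthy", "healthy", "healthy", "failing", "failing", "failing"]
model = fit(X, y)
print(predict(model, (2, 1)))   # -> healthy
print(predict(model, (58, 9)))  # -> failing
```

The paper's unsupervised variant replaces the known labels with latent cluster assignments and fits the same kind of per-cluster model via expectation-maximization.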
PG-means: learning the number of clusters in data
We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods.
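The core idea, projecting data and model to one dimension and testing the fit, can be sketched under simplifying assumptions. Real PG-means tests a full k-component mixture against several random projections; the toy version below fits a single Gaussian to 1-D samples and scores it with a one-sided Kolmogorov-Smirnov distance, enough to show why a bad model is detectable on a projection.

```python
# Toy projection-test sketch: fit one Gaussian to 1-D data and measure how
# far the empirical CDF strays from the model CDF (KS-style distance).
import math
import random

def normal_cdf(x, mu, sd):
    return 0.5 * (1 + math.erf((x - mu) / (sd * math.sqrt(2))))

def ks_statistic(points, mu, sd):
    """Max gap between the empirical CDF and the fitted Gaussian CDF."""
    pts = sorted(points)
    n = len(pts)
    return max(abs((i + 1) / n - normal_cdf(x, mu, sd))
               for i, x in enumerate(pts))

def fit_and_test(points):
    """Fit a single Gaussian to the (projected) points, score the fit."""
    mu = sum(points) / len(points)
    sd = (sum((p - mu) ** 2 for p in points) / len(points)) ** 0.5
    return ks_statistic(points, mu, sd)

random.seed(1)
unimodal = [random.gauss(0, 1) for _ in range(300)]
bimodal = ([random.gauss(-4, 1) for _ in range(150)]
           + [random.gauss(4, 1) for _ in range(150)])

d_uni = fit_and_test(unimodal)  # small gap: one Gaussian fits
d_bi = fit_and_test(bimodal)    # large gap: one-component model is rejected
print(d_uni < d_bi)  # -> True
```

In PG-means proper the model CDF is the projected mixture (a weighted sum of such Gaussian CDFs), so one test covers the whole model at once rather than one cluster at a time.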
Comparing Multinomial and K-Means Clustering for SimPoint
SimPoint is a technique used to pick which parts of a program's execution to simulate in order to have a complete picture of execution. SimPoint uses data clustering algorithms from machine learning to automatically find repetitive (similar) patterns in a program's execution, and it chooses one sample to represent each unique repetitive behavior. Together these samples represent an accurate picture of the complete execution of the program. SimPoint is based on the k-means clustering algorithm; recent work proposed using a different clustering method based on multinomial models, but only provided a preliminary comparison and analysis. In this work we provide a detailed comparison of using k-means and multinomial clustering for SimPoint. We show that k-means performs better than the recently proposed multinomial clustering approach. We then propose two improvements to the prior multinomial clustering approach in the areas of feature reduction and the picking of simulation points which allow multinomial clustering to perform as well as k-means. We then conclude by examining how to potentially combine multinomial clustering with k-means.
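SimPoint's selection step can be sketched as follows: cluster per-interval basic block vectors with k-means, then pick, for each cluster, the interval nearest its centroid as the simulation point, weighted by the cluster's share of intervals. The tiny 3-D frequency vectors below are invented for illustration; real basic block vectors have one dimension per basic block.

```python
# Sketch of SimPoint-style selection: k-means over basic block vectors,
# then one representative interval (nearest to centroid) per cluster.

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, centers, iters=20):
    """Plain Lloyd's algorithm; returns final centers and assignments."""
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [min(range(len(centers)), key=lambda j: dist2(v, centers[j]))
                  for v in vectors]
        for j in range(len(centers)):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assign

# One vector per execution interval: phase A (compute-heavy) vs. phase B.
bbvs = [(9, 1, 0), (8, 2, 0), (9, 0, 1),   # intervals 0-2: phase A
        (1, 8, 1), (0, 9, 1), (2, 7, 1)]   # intervals 3-5: phase B
centers, assign = kmeans(bbvs, [list(bbvs[0]), list(bbvs[3])])

simpoints = []
for j, c in enumerate(centers):
    members = [i for i, a in enumerate(assign) if a == j]
    rep = min(members, key=lambda i: dist2(bbvs[i], c))  # nearest to centroid
    simpoints.append((rep, len(members) / len(bbvs)))    # (interval, weight)
print(simpoints)  # -> [(0, 0.5), (3, 0.5)]
```

Each (interval, weight) pair says: simulate that interval, then scale its measurements by the weight to estimate whole-program behavior. The multinomial alternative discussed above changes the clustering model, not this selection-and-weighting step.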