Alternatives to the k-means algorithm that find better clusterings
We investigate here the behavior of the standard k-means clustering
algorithm and several alternatives to it: the k-harmonic means algorithm due to
Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two
new variants of k-harmonic means. Our aim is to find which aspects of these
algorithms contribute to finding good clusterings, as opposed to converging to
a low-quality local optimum. We describe each algorithm in a unified framework
that introduces separate cluster membership and data weight functions. We then
show that the algorithms do behave very differently from each other on simple
low-dimensional synthetic datasets, and that the k-harmonic means method is
superior. Having a soft membership function is essential for finding
high-quality clusterings, but having a non-constant data weight function is
also useful.
Pre-2018 CSE ID: CS2002-070
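The unified framework described above expresses every algorithm as a center update driven by a cluster membership function m_j(x) and a data weight function w(x). The following is a minimal illustrative sketch (not the paper's code): hard membership plus a constant weight recovers standard k-means on 1-D data.

```python
# Unified center update:
#   c_j = sum_i m_j(x_i) * w(x_i) * x_i / sum_i m_j(x_i) * w(x_i)
# Hard membership + constant weight == standard k-means (1-D sketch).

def hard_membership(x, centers):
    """k-means membership: all mass on the nearest center."""
    nearest = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
    return [1.0 if j == nearest else 0.0 for j in range(len(centers))]

def constant_weight(x):
    """k-means weight: every point counts equally."""
    return 1.0

def update_centers(data, centers, membership, weight):
    """One pass of the unified weighted-average center update."""
    num = [0.0] * len(centers)
    den = [0.0] * len(centers)
    for x in data:
        m, w = membership(x, centers), weight(x)
        for j in range(len(centers)):
            num[j] += m[j] * w * x
            den[j] += m[j] * w
    return [num[j] / den[j] if den[j] else centers[j]
            for j in range(len(centers))]

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers = [0.0, 6.0]
for _ in range(5):
    centers = update_centers(data, centers, hard_membership, constant_weight)
print(centers)  # -> [1.0, 5.0], the two cluster means
```

Swapping in a soft membership function and a non-constant weight function, as the paper does for k-harmonic means and its variants, changes only the two plug-in functions; the update skeleton stays the same.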
Learning the k in k-means
When clustering a dataset, the right number of clusters to use
is often not obvious, and choosing k automatically is a hard algorithmic
problem. In this paper we present a new algorithm for choosing k that is based
on a new statistical test for the hypothesis that a subset of data follows a
Gaussian distribution. The algorithm runs k-means with increasing k until the
test fails to reject the hypothesis that the data assigned to each k-means
center are Gaussian. We present results from experiments on synthetic and
real-world data showing that the algorithm works well, and better than a recent
method based on the BIC penalty for model complexity.
Pre-2018 CSE ID: CS2002-071
Learning the k in k-means
When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize the model's complexity strongly enough.
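The G-means loop can be sketched roughly as follows. This is an illustrative stand-in, not the paper's implementation: G-means projects each cluster onto the axis between two candidate sub-centers and applies an Anderson-Darling normality test, whereas here a crude excess-kurtosis check plays the role of the test, just to show the split-and-refine control flow.

```python
# Sketch of the G-means control flow (1-D data; kurtosis check stands in
# for the paper's Anderson-Darling test on a 1-D projection).
import random
import statistics

def looks_gaussian(points, tol=1.5):
    """Stand-in normality test: excess kurtosis of a Gaussian is ~0."""
    if len(points) < 8:
        return True
    mu, sd = statistics.fmean(points), statistics.pstdev(points)
    if sd == 0:
        return True
    m4 = statistics.fmean([((x - mu) / sd) ** 4 for x in points])
    return abs(m4 - 3.0) < tol  # standardized 4th moment of a Gaussian is 3

def kmeans_1d(data, centers, iters=25):
    """Plain Lloyd iterations; returns refined centers and their clusters."""
    clusters = [list(data)]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
            clusters[j].append(x)
        centers = [statistics.fmean(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

def g_means(data, tol=1.5):
    """Start with k=1; split any center whose cluster fails the test."""
    centers = [statistics.fmean(data)]
    while True:
        centers, clusters = kmeans_1d(data, centers)
        new_centers, split = [], False
        for c, pts in zip(centers, clusters):
            if looks_gaussian(pts, tol):
                new_centers.append(c)
            else:
                sd = statistics.pstdev(pts) or 1.0
                new_centers += [c - sd, c + sd]  # two candidate sub-centers
                split = True
        if not split:
            return centers
        centers = new_centers

random.seed(0)
data = ([random.gauss(0, 1) for _ in range(200)]
        + [random.gauss(10, 1) for _ in range(200)])
k = len(g_means(data))
print(k)  # -> 2
```

The hierarchical character of the real algorithm is visible here: k only grows, and each accepted center is kept while rejected ones are split and re-refined by k-means.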
Bayesian approaches to failure prediction for disk drives
Hard disk drive failures are rare but often costly. The ability to predict failures is important to consumers, drive manufacturers, and computer system manufacturers alike. In this paper we investigate the abilities of two Bayesian methods to predict disk drive failures based on measurements of drive internal conditions. We first view the problem from an anomaly detection stance. We introduce a mixture model of naive Bayes submodels (i.e., clusters) that is trained using expectation-maximization. The second method is a naive Bayes classifier, a supervised learning approach. Both methods are tested on real-world data concerning 1936 drives. The predictive accuracy of both algorithms is far higher than the accuracy of thresholding methods used in the disk drive industry today.
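The supervised method described above can be illustrated with a minimal Gaussian naive Bayes classifier. The feature names and numbers below are invented stand-ins for drive-health measurements, not the paper's data.

```python
# Minimal Gaussian naive Bayes sketch: per-class, per-feature mean/variance
# plus class priors; prediction maximizes the log-posterior.
import math
import statistics

def fit(X, y):
    """Estimate class priors and per-feature Gaussian parameters."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cols = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(X),
            "mean": [statistics.fmean(c) for c in cols],
            "var": [statistics.pvariance(c) + 1e-6 for c in cols],  # smoothed
        }
    return model

def predict(model, x):
    """Pick the class with the highest naive-Bayes log-posterior."""
    def log_post(p):
        lp = math.log(p["prior"])
        for v, mu, var in zip(x, p["mean"], p["var"]):
            lp += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        return lp
    return max(model, key=lambda label: log_post(model[label]))

# Toy rows: (reallocated sectors, read error rate) -> drive status.
X = [(0, 1), (1, 2), (0, 2), (50, 9), (60, 8), (55, 10)]
y = ["healthy", "healthy", "healthy", "failing", "failing", "failing"]
model = fit(X, y)
print(predict(model, (2, 1)))   # -> healthy
print(predict(model, (58, 9)))  # -> failing
```

The paper's unsupervised variant replaces the known labels with latent cluster assignments and fits the same kind of per-cluster model via expectation-maximization.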
PG-means: learning the number of clusters in data
We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods.
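The core idea, projecting data and model to one dimension and testing the fit, can be sketched under simplifying assumptions. Real PG-means tests a full k-component mixture against several random projections; the toy version below fits a single Gaussian to 1-D samples and scores it with a one-sided Kolmogorov-Smirnov distance, enough to show why a bad model is detectable on a projection.

```python
# Toy projection-test sketch: fit one Gaussian to 1-D data and measure how
# far the empirical CDF strays from the model CDF (KS-style distance).
import math
import random

def normal_cdf(x, mu, sd):
    return 0.5 * (1 + math.erf((x - mu) / (sd * math.sqrt(2))))

def ks_statistic(points, mu, sd):
    """Max gap between the empirical CDF and the fitted Gaussian CDF."""
    pts = sorted(points)
    n = len(pts)
    return max(abs((i + 1) / n - normal_cdf(x, mu, sd))
               for i, x in enumerate(pts))

def fit_and_test(points):
    """Fit a single Gaussian to the (projected) points, score the fit."""
    mu = sum(points) / len(points)
    sd = (sum((p - mu) ** 2 for p in points) / len(points)) ** 0.5
    return ks_statistic(points, mu, sd)

random.seed(1)
unimodal = [random.gauss(0, 1) for _ in range(300)]
bimodal = ([random.gauss(-4, 1) for _ in range(150)]
           + [random.gauss(4, 1) for _ in range(150)])

d_uni = fit_and_test(unimodal)  # small gap: one Gaussian fits
d_bi = fit_and_test(bimodal)    # large gap: one-component model is rejected
print(d_uni < d_bi)  # -> True
```

In PG-means proper the model CDF is the projected mixture (a weighted sum of such Gaussian CDFs), so one test covers the whole model at once rather than one cluster at a time.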
Comparing Multinomial and K-Means Clustering for SimPoint
SimPoint is a technique used to pick which parts of a program's execution to simulate in order to have a complete picture of execution. SimPoint uses data clustering algorithms from machine learning to automatically find repetitive (similar) patterns in a program's execution, and it chooses one sample to represent each unique repetitive behavior. Together these samples represent an accurate picture of the complete execution of the program. SimPoint is based on the k-means clustering algorithm; recent work proposed using a different clustering method based on multinomial models, but only provided a preliminary comparison and analysis. In this work we provide a detailed comparison of using k-means and multinomial clustering for SimPoint. We show that k-means performs better than the recently proposed multinomial clustering approach. We then propose two improvements to the prior multinomial clustering approach in the areas of feature reduction and the picking of simulation points which allow multinomial clustering to perform as well as k-means. We then conclude by examining how to potentially combine multinomial clustering with k-means.
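SimPoint's selection step can be sketched as follows: cluster per-interval basic block vectors with k-means, then pick, for each cluster, the interval nearest its centroid as the simulation point, weighted by the cluster's share of intervals. The tiny 3-D frequency vectors below are invented for illustration; real basic block vectors have one dimension per basic block.

```python
# Sketch of SimPoint-style selection: k-means over basic block vectors,
# then one representative interval (nearest to centroid) per cluster.

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, centers, iters=20):
    """Plain Lloyd's algorithm; returns final centers and assignments."""
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [min(range(len(centers)), key=lambda j: dist2(v, centers[j]))
                  for v in vectors]
        for j in range(len(centers)):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assign

# One vector per execution interval: phase A (compute-heavy) vs. phase B.
bbvs = [(9, 1, 0), (8, 2, 0), (9, 0, 1),   # intervals 0-2: phase A
        (1, 8, 1), (0, 9, 1), (2, 7, 1)]   # intervals 3-5: phase B
centers, assign = kmeans(bbvs, [list(bbvs[0]), list(bbvs[3])])

simpoints = []
for j, c in enumerate(centers):
    members = [i for i, a in enumerate(assign) if a == j]
    rep = min(members, key=lambda i: dist2(bbvs[i], c))  # nearest to centroid
    simpoints.append((rep, len(members) / len(bbvs)))    # (interval, weight)
print(simpoints)  # -> [(0, 0.5), (3, 0.5)]
```

Each (interval, weight) pair says: simulate that interval, then scale its measurements by the weight to estimate whole-program behavior. The multinomial alternative discussed above changes the clustering model, not this selection-and-weighting step.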