Fast k-means algorithm clustering
k-means has recently been recognized as one of the best algorithms for
clustering unsupervised data. Since k-means depends mainly on distance
calculation between all data points and the centers, the time cost will be high
when the size of the dataset is large (for example, more than 500 million
points). We propose a two-stage algorithm to reduce the time cost of distance
calculation for huge datasets. The first stage is a fast distance calculation
using only a small portion of the data to produce the best possible location of
the centers. The second stage is a slow distance calculation in which the
initial centers used are taken from the first stage. The fast and slow stages
represent the speed of the movement of the centers. In the slow stage, the
whole dataset can be used to get the exact location of the centers. The time
cost of the distance calculation for the fast stage is very low due to the
small size of the training data chosen. The time cost of the distance
calculation for the slow stage is also minimized due to small number of
iterations. Different initial locations of the clusters have been used during
the test of the proposed algorithms. For large datasets, experiments show that
the two-stage clustering method achieves a speed-up of 1-9 times.
Comment: 16 pages, Wimo2011; International Journal of Computer Networks & Communications (IJCNC), Vol. 3, No. 4, July 2011
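The two-stage procedure is described above only at a high level; the sketch below shows one plausible reading of it, with a fast pass over a random subsample followed by a full pass seeded with the subsample's centers. The 10% sample fraction, the function name, and the use of scikit-learn's KMeans are illustrative assumptions, not details from the paper.

import numpy as np
from sklearn.cluster import KMeans

def two_stage_kmeans(X, k, sample_frac=0.1, seed=0):
    # Fast stage: run k-means on a small random subset so the centers
    # move quickly toward roughly correct locations (cheap distance step).
    rng = np.random.default_rng(seed)
    m = max(k, int(sample_frac * len(X)))
    idx = rng.choice(len(X), size=m, replace=False)
    fast = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
    # Slow stage: run k-means on the full dataset, seeded with the
    # fast-stage centers, so only a few expensive iterations remain.
    slow = KMeans(n_clusters=k, init=fast.cluster_centers_, n_init=1,
                  random_state=seed).fit(X)
    return slow.cluster_centers_, slow.labels_

Seeding the second run with init=fast.cluster_centers_ and n_init=1 is what keeps the slow stage to a small number of iterations over the full data.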
Reducing the Time Requirement of k-Means Algorithm
Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray
data, which have many data points and a large dimension d. In k-means clustering, we are given a set of n data points in d-dimensional
space R^d and an integer k. The problem is to determine a set of k points in R^d, called centers, so as to minimize
the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm,
which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is
based on the recently established relationship between principal component analysis and the k-means clustering. We
provide a correctness proof for this algorithm. Results obtained from testing the algorithm on three biological and
six non-biological datasets (three of these datasets are real, while the other three are simulated) also indicate that our algorithm is
empirically faster than other known k-means algorithms. We assessed the quality of our algorithm's clusters against
clusters of known structure using the Hubert-Arabie Adjusted Rand index (ARIHA). We found that when k is close to d, the
quality is good (ARIHA > 0.8), and when k is not close to d, the quality of our new k-means algorithm is excellent (ARIHA > 0.9).
In this paper, the emphasis is on reducing the time requirement of the k-means algorithm and on its application to
microarray data, driven by the desire to create a tool for clustering and malaria research. However, the new clustering algorithm
can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the
members is used. This has been demonstrated in this work on six non-biological datasets.
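The abstract attributes the speed-up to the known relationship between PCA and k-means but does not spell out the construction; the sketch below shows one common way of exploiting that relationship (clustering in a truncated PCA space and mapping the centers back), which is an illustrative reading rather than the authors' exact algorithm.

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def pca_kmeans(X, k, n_components=None, seed=0):
    # Project the data onto its leading principal components; the continuous
    # relaxation of k-means is known to lie in the span of the top k-1 PCs.
    pca = PCA(n_components=n_components or max(1, k - 1), random_state=seed)
    Z = pca.fit_transform(X)
    # Cluster in the low-dimensional space, which is far cheaper when d is large.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Z)
    # Map the low-dimensional centers back to the original feature space.
    centers = pca.inverse_transform(km.cluster_centers_)
    return centers, km.labels_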
Study of document clustering using the k-means algorithm
One of the most commonly used data mining techniques is document clustering, or unsupervised document classification, which groups documents according to some document similarity function. This thesis investigates research issues associated with categorizing documents using the k-means clustering algorithm, which partitions objects into K groups based on document representations and similarities. The hypothesis of this thesis is that unsupervised clustering of a set of documents produces results similar to those of their supervised categorization.
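As a concrete illustration of the pipeline studied in the thesis, the sketch below clusters raw text documents with TF-IDF features and k-means; the vectorizer settings and choice of representation are assumptions for the example, not details taken from the thesis.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_documents(docs, k, seed=0):
    # Represent each document as a TF-IDF vector, a standard document
    # representation for similarity-based clustering.
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(docs)
    # Partition the documents into k groups with k-means.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return km.labels_

labels = cluster_documents(["grey wolves hunt in packs",
                            "the stock market fell sharply",
                            "wolf packs roam the tundra"], k=2)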
Clustering with Spectral Norm and the k-means Algorithm
There has been much progress on efficient algorithms for clustering data
points generated by a mixture of probability distributions under the
assumption that the means of the distributions are well-separated, i.e., the
distance between the means of any two distributions is at least Ω(k)
standard deviations. These results generally make heavy use of the generative
model and particular properties of the distributions. In this paper, we show
that a simple clustering algorithm works without assuming any generative
(probabilistic) model. Our only assumption is what we call a "proximity
condition": the projection of any data point onto the line joining its cluster
center to any other cluster center is Ω(k) standard deviations closer to
its own center than the other center. Here the notion of standard deviations is
based on the spectral norm of the matrix whose rows represent the difference
between a point and the mean of the cluster to which it belongs. We show that
in the generative models studied, our proximity condition is satisfied and so
we are able to derive most known results for generative models as corollaries
of our main result. We also prove some new results for generative models -
e.g., we can cluster all but a small fraction of points only assuming a bound
on the variance. Our algorithm relies on the well-known k-means algorithm,
and along the way, we prove a result of independent interest -- that the
k-means algorithm converges to the "true centers" even in the presence of
spurious points provided the initial (estimated) centers are close enough to
the corresponding actual centers and all but a small fraction of the points
satisfy the proximity condition. Finally, we present a new technique for
boosting the ratio of inter-center separation to standard deviation.
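The proximity condition lends itself to a direct numerical check; the sketch below counts violations for a labeled point set, using the spectral norm of the point-minus-center matrix as the scale. The constant c and the exact multiple of the standard deviation are illustrative assumptions; the paper fixes its own threshold.

import numpy as np

def proximity_violations(X, labels, c=1.0):
    # Count points that fail a Kumar-Kannan-style proximity condition.
    # The threshold delta below is an assumed form, with c a free constant.
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ks = np.unique(labels)
    k = len(ks)
    centers = np.array([X[labels == r].mean(axis=0) for r in ks])
    sizes = np.array([(labels == r).sum() for r in ks])
    # Spectral norm of the matrix whose rows are (point - its cluster center).
    spec = np.linalg.norm(X - centers[np.searchsorted(ks, labels)], 2)
    bad = 0
    for i, x in enumerate(X):
        r = int(np.searchsorted(ks, labels[i]))
        for s in range(k):
            if s == r:
                continue
            u = centers[s] - centers[r]
            d = np.linalg.norm(u)
            t = (x - centers[r]) @ (u / d)   # signed offset along the center line
            delta = c * k * (1 / np.sqrt(sizes[r]) + 1 / np.sqrt(sizes[s])) * spec
            # Require the projection to be at least delta closer to its own center.
            if (d - t) - t < delta:
                bad += 1
                break
    return bad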
A Faster k-means++ Algorithm
K-means++ is an important algorithm to choose initial cluster centers for the
k-means clustering algorithm. In this work, we present a new algorithm that can
solve the k-means++ problem with near-optimal running time. Given n data
points in R^d, the current state-of-the-art algorithm runs in Õ(k)
iterations, and each iteration takes Õ(nd)
time. The overall running time is thus Õ(ndk). We propose a
new algorithm FastKmeans++ that takes only Õ(nd + nk^2) time in total
- …
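For context, the sketch below is the standard k-means++ (D^2) seeding that such papers accelerate; it is the textbook O(ndk) baseline, not the faster algorithm proposed in the abstract.

import numpy as np

def kmeans_pp_seed(X, k, seed=0):
    # Classic k-means++ seeding: each new center is drawn with probability
    # proportional to the squared distance to the nearest center so far.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    centers = [X[rng.integers(n)]]
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
        d2 = np.minimum(d2, np.sum((X - centers[-1]) ** 2, axis=1))
    return np.array(centers)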