On Variants of k-means Clustering
Clustering problems often arise in fields such as data mining and
machine learning, where a collection of objects must be grouped into similar
groups with respect to a similarity (or dissimilarity) measure. Among
clustering problems, k-means clustering in particular has received much
attention from researchers. Although k-means is a very well studied
problem, its status in the plane is still open. In particular, it is
unknown whether it admits a PTAS in the plane. The best known approximation
bound in polynomial time is 9+ε.
In this paper, we consider the following variant of k-means. Given a set
of points in and a real , find a finite set of
points in that minimizes the quantity . For any fixed dimension , we design a local
search PTAS for this problem. We also give a "bi-criterion" local search
algorithm for k-means which uses (1+ε)k centers and yields a solution
whose cost is at most (1+ε) times the cost of an optimal k-means
solution. The algorithm runs in polynomial time for any fixed dimension.
The contribution of this paper is twofold. On the one hand, we are
able to handle the squares of distances in an elegant manner, which yields a
near-optimal approximation bound. This leads us towards a better understanding of
the k-means problem. On the other hand, our analysis of local search might
also be useful for other geometric problems. This is important considering that
very little is known about the local search method for geometric approximation.
Comment: 15 pages
Faster K-Means Cluster Estimation
There has been considerable work on improving the popular clustering algorithm
K-means in terms of both mean squared error (MSE) and speed. However, most
k-means variants compute the distance from each data point to each
cluster centroid in every iteration. We propose a fast heuristic to overcome
this bottleneck with only a marginal increase in MSE. We observe that across all
iterations of K-means, a data point changes its membership only among a small
subset of clusters. Our heuristic predicts these clusters for each data point by
looking at nearby clusters after the first iteration of k-means. We augment
well-known variants of k-means with our heuristic to demonstrate its
effectiveness. On various synthetic and real-world datasets, our heuristic
achieves a speed-up of up to 3 times compared to efficient variants of
k-means.
Comment: 6 pages, Accepted at ECIR 201
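The idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name candidate_kmeans, the parameter n_candidates, and the choice to pick candidates as the nearest centroids found in the first full scan are all assumptions made for illustration. After one full scan, each point is only ever compared against its own short candidate list.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def candidate_kmeans(points, centroids, n_candidates=2, iters=10):
    """Lloyd-style k-means with a fixed candidate list per point.

    Hypothetical sketch: candidates are the n_candidates centroids
    nearest to each point during the first (full) assignment scan.
    """
    centroids = [tuple(c) for c in centroids]

    # First iteration: full scan over all centroids for every point.
    candidates = []
    for p in points:
        order = sorted(range(len(centroids)),
                       key=lambda j: sq_dist(p, centroids[j]))
        candidates.append(order[:n_candidates])
    labels = [cand[0] for cand in candidates]

    for _ in range(iters):
        # Update step: recompute each non-empty cluster's centroid.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = mean(members)

        # Restricted assignment: only the candidate clusters are checked.
        new_labels = [min(cand, key=lambda j: sq_dist(p, centroids[j]))
                      for p, cand in zip(points, candidates)]
        if new_labels == labels:
            break
        labels = new_labels

    return labels, centroids
```

With n_candidates much smaller than k, each later iteration costs roughly n_candidates distance computations per point instead of k, at the price of possibly missing the true nearest centroid.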
Reducing the Time Requirement of k-Means Algorithm
Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray
data, which contain many data points of large dimension d. In k-means clustering, we are given a set of n data points in d-dimensional
space R^d and an integer k. The problem is to determine a set of k points in R^d, called centers, so as to minimize
the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm,
which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is
based on the recently established relationship between principal component analysis and k-means clustering. We
provide a correctness proof for this algorithm. Results obtained from testing the algorithm on three biological datasets and
six non-biological datasets (three of these are real, while the other three are simulated) also indicate that our algorithm is
empirically faster than other known k-means algorithms. We assessed the quality of our algorithm's clusters against the
clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARI_HA). We found that when k is close to d, the
quality is good (ARI_HA > 0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARI_HA > 0.9).
In this paper, the emphasis is on reducing the time requirement of the k-means algorithm and on its application to
microarray data, motivated by the desire to create a tool for clustering and malaria research. However, the new clustering algorithm
can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the
members is used. This has been demonstrated in this work on six non-biological datasets.
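The objective stated in the abstract above (the mean squared distance from each data point to its nearest center) can be written down directly. The following is only a sketch of that objective, not of the PCA-based algorithm the abstract describes; the function name mean_squared_distance is an assumption.

```python
def mean_squared_distance(points, centers):
    """k-means objective: average, over all points, of the squared
    Euclidean distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((x - c) ** 2 for x, c in zip(p, center))
            for center in centers
        )
    return total / len(points)
```

For example, two points at (0, 0) and (2, 0) with a single center at (1, 0) each lie at squared distance 1 from it, so the objective is 1.0; placing one center on each point drives it to 0.0.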
Fast k-means based on KNN Graph
In the era of big data, k-means clustering has been widely adopted as a basic
processing tool in various contexts. However, its computational cost can be
prohibitively high when the data size and the number of clusters are large. It is
well known that the processing bottleneck of k-means lies in the operation of
seeking the closest centroid in each iteration. In this paper, a novel solution
to the scalability issue of k-means is presented. In the proposal, k-means
is supported by an approximate k-nearest neighbors graph. In each k-means
iteration, each data sample is compared only to the clusters in which its nearest
neighbors reside. Since the number of nearest neighbors we consider is much
less than k, the processing cost of this step becomes minor and independent of
k. The processing bottleneck is therefore overcome. Most interestingly,
the k-nearest neighbor graph is constructed by iteratively calling the fast
k-means itself. Compared with existing fast k-means variants, the proposed
algorithm achieves a speed-up of hundreds to thousands of times while maintaining high
clustering quality. When tested on 10 million 512-dimensional data points, it
takes only 5.2 hours to produce 1 million clusters. In contrast, the
same scale of clustering would take 3 years with traditional k-means.
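The restricted assignment step described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the graph here is built by exact brute-force search (the abstract uses an approximate graph built by calling the fast k-means itself), and the names knn_graph and assign_via_graph are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_graph(points, n_neighbors):
    """Exact brute-force k-nearest-neighbor graph (assumption: the
    abstract's graph is approximate and built differently)."""
    graph = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: sq_dist(p, points[j]))
        graph.append(others[:n_neighbors])
    return graph


def assign_via_graph(points, labels, centroids, graph):
    """One assignment step: each point is compared only to its current
    cluster and the clusters of its graph neighbors."""
    new_labels = []
    for i, p in enumerate(points):
        candidates = {labels[i]} | {labels[j] for j in graph[i]}
        best = min(sorted(candidates),
                   key=lambda c: sq_dist(p, centroids[c]))
        new_labels.append(best)
    return new_labels
```

The number of distance computations per point is bounded by the neighbor count plus one, regardless of how many centroids exist, which is the property the abstract relies on.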
Clustering for Different Scales of Measurement - the Gap-Ratio Weighted K-means Algorithm
This paper describes a method for clustering data that are spread out over
large regions and whose dimensions are on different scales of measurement. The
algorithm was developed to implement a robotics application that consists of
sorting and storing objects in an unsupervised way. The toy dataset used to
validate this application consists of Lego bricks of different shapes and
colors. The uncontrolled lighting conditions and the use of RGB color
features lead, respectively, to data with a large spread and to different levels of
measurement between data dimensions. To overcome the combination of these two
characteristics in the data, we have developed a new weighted K-means
algorithm, called gap-ratio K-means, which weights each dimension
of the feature space before running the K-means algorithm. The weight
associated with a feature is proportional to the ratio of the biggest gap
between two consecutive data points to the average of all the other gaps.
This method is compared with two other variants of K-means on the Lego bricks
clustering problem as well as on two other common classification datasets.
Comment: 13 pages, 6 figures, 2 tables. This paper is under the review process
for AIAP 201
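The weighting rule stated in the abstract above can be sketched per feature as follows. Several details are assumptions not given in the abstract: the proportionality constant is taken as 1, a fallback weight of 1.0 is used when there are fewer than two gaps or the other gaps average to zero, and the function names are hypothetical. How the weights enter the distance computation is also not specified in the abstract, so only the weights themselves are computed here.

```python
def gap_ratio_weight(values):
    """Ratio of the biggest gap between consecutive sorted values to
    the average of all the other gaps (constant of proportionality
    assumed to be 1)."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return 1.0  # assumption: not enough gaps to form a ratio
    biggest = max(gaps)
    others = list(gaps)
    others.remove(biggest)  # drop one occurrence of the biggest gap
    average = sum(others) / len(others)
    if average == 0:
        return 1.0  # assumption: fallback for degenerate features
    return biggest / average


def gap_ratio_weights(points):
    """One weight per dimension of a list of equal-length tuples."""
    return [gap_ratio_weight(column) for column in zip(*points)]
```

A feature whose values split into two well-separated groups has one large gap and many small ones, so it receives a large weight; a feature with evenly spaced values receives a weight near 1.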
Document clustering based on firefly algorithm
Document clustering is widely used in Information Retrieval; however, existing clustering techniques suffer from the local optima problem when determining the number of clusters k. Various efforts have been made to address this drawback, including the use of swarm-based algorithms such as Particle Swarm Optimization (PSO) and Ant Colony Optimization. This study explores the adaptation of another swarm algorithm, the Firefly Algorithm (FA), to text clustering. We present two variants of FA: Weight-based Firefly Algorithm (WFA) and Weight-based Firefly Algorithm II (WFAII). The difference between the two algorithms is that WFAII includes a more restrictive condition for determining the members of a cluster. The proposed FA methods are evaluated using the 20Newsgroups dataset. Experimental results on the quality of clustering of the two FA variants are presented and compared against those produced by PSO, K-means and the hybrid of FA and K-means (FA-Kmeans). The results demonstrate that WFAII outperforms WFA, PSO, K-means and FA-Kmeans. This indicates that better clustering can be obtained once the exploitation of a search solution is improved.
The k-means algorithm: A comprehensive survey and performance evaluation
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. The k-means clustering algorithm is considered one of the most powerful and popular data mining algorithms in the research community. However, despite its popularity, the algorithm has certain limitations, including problems associated with random initialization of the centroids, which leads to unexpected convergence. Additionally, the algorithm requires the number of clusters to be defined beforehand, which is responsible for different cluster shapes and outlier effects. A fundamental problem of the k-means algorithm is its inability to handle various data types. This paper provides a structured and synoptic overview of research conducted on the k-means algorithm to overcome such shortcomings. Variants of the k-means algorithm, including recent developments, are discussed, and their effectiveness is investigated through experimental analysis on a variety of datasets. The detailed experimental analysis, along with a thorough comparison among different k-means clustering algorithms, distinguishes our work from other existing survey papers. Furthermore, it provides a clear and thorough understanding of the k-means algorithm along with its different research directions.