Fast k-means algorithm clustering
k-means has recently been recognized as one of the best algorithms for
clustering unsupervised data. Since k-means depends mainly on distance
calculation between all data points and the centers, the time cost will be high
when the size of the dataset is large (for example, more than 500 million
points). We propose a two-stage algorithm to reduce the time cost of distance
calculation for huge datasets. The first stage is a fast distance calculation
using only a small portion of the data to produce the best possible location of
the centers. The second stage is a slow distance calculation in which the
initial centers used are taken from the first stage. The fast and slow stages
represent the speed of the movement of the centers. In the slow stage, the
whole dataset can be used to get the exact location of the centers. The time
cost of the distance calculation for the fast stage is very low due to the
small size of the training data chosen. The time cost of the distance
calculation for the slow stage is also minimized due to the small number of
iterations. Different initial locations of the clusters have been used during
the test of the proposed algorithms. For large datasets, experiments show that
the two-stage clustering method achieves a better speed-up (1-9 times).
Comment: 16 pages, Wimo2011; International Journal of Computer Networks &
Communications (IJCNC) Vol.3, No.4, July 201
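The two-stage idea described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the 10% `sample_fraction` default, the random choice of initial centers, and the exact-equality stopping rule are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, max_iter=100):
    """Plain Lloyd iterations; returns (centers, iterations_used)."""
    centers = [tuple(c) for c in centers]
    for it in range(1, max_iter + 1):
        # Assignment step: group points by their nearest center.
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        # Update step: move each center to the mean of its group.
        # An empty group leaves its center where it was.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            return centers, it
        centers = new_centers
    return centers, max_iter


def two_stage_kmeans(points, k, sample_fraction=0.1, seed=0):
    """Hypothetical two-stage driver: cluster a sample, then refine on all data."""
    rng = random.Random(seed)
    # Fast stage: run k-means on a small random sample only.
    n_sample = max(k, int(len(points) * sample_fraction))
    sample = rng.sample(points, n_sample)
    rough_centers, _ = lloyd(sample, rng.sample(sample, k))
    # Slow stage: refine on the full dataset, starting from the rough centers.
    return lloyd(points, rough_centers)
```

Because the slow stage starts from centers that are already close to their final positions, it typically needs few passes over the full dataset, which is where the abstract locates the saving.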
Clustering Performance Comparison in K-Mean Clustering Variations: A Fraud Detection Study
K-means clustering is a common clustering approach that is based on data partitioning. However, k-means clustering has significant drawbacks, such as its sensitivity to the choice of initial conditions. Several ways to improve the algorithm have been proposed. To assess the efficiency and correctness of these variants, their performance should be compared. In this paper, several k-means algorithms, including random k-means, global k-means, and fast global k-means, were evaluated for their efficiency when applied to a fraud detection data set. The accuracy and the Davies-Bouldin index of each algorithm were examined to compare clustering performance. The findings demonstrated that when a small number of groups was used, random k-means, global k-means, and fast global k-means gave similar clusterings, but fast global k-means yielded lower errors when a large number of groups was used. Furthermore, global k-means took longer to execute than the others.
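For readers unfamiliar with the Davies-Bouldin index mentioned in this abstract, the sketch below computes it in its standard form: the average, over clusters, of the worst-case ratio of summed within-cluster scatter to centroid separation. Lower values indicate more compact, better separated clusters. The function names and the input layout (a list of clusters, each a list of point tuples) are assumptions for the example, and the paper may use a different variant.

```python
import math


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def davies_bouldin(clusters):
    """Davies-Bouldin index for at least two non-empty clusters with distinct centroids."""
    cents = [centroid(c) for c in clusters]
    # scatter[i]: mean Euclidean distance from the points of cluster i to its centroid.
    scatter = [
        sum(math.dist(p, cents[i]) for p in c) / len(c)
        for i, c in enumerate(clusters)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst (largest) similarity ratio between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k
```

For two clusters `[(0, 0), (0, 2)]` and `[(10, 0), (10, 2)]`, each scatter is 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2.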
Randomized Dimensionality Reduction for k-means Clustering
We study the topic of dimensionality reduction for k-means clustering.
Dimensionality reduction encompasses the union of two approaches: feature
selection and feature extraction. A feature selection based algorithm
for k-means clustering selects a small subset of the input features and then
applies k-means clustering on the selected features. A feature extraction
based algorithm for k-means clustering constructs a small set of new
artificial features and then applies k-means clustering on the constructed
features. Despite the significance of k-means clustering as well as the
wealth of heuristic methods addressing it, provably accurate feature selection
methods for k-means clustering are not known. On the other hand, two provably
accurate feature extraction methods for k-means clustering are known in the
literature; one is based on random projections and the other is based on the
singular value decomposition (SVD).
This paper makes further progress towards a better understanding of
dimensionality reduction for k-means clustering. Namely, we present the first
provably accurate feature selection method for k-means clustering and, in
addition, we present two feature extraction methods. The first feature
extraction method is based on random projections and it improves upon the
existing results in terms of time complexity and number of features needed to
be extracted. The second feature extraction method is based on fast approximate
SVD factorizations and it also improves upon the existing results in terms of
time complexity. The proposed algorithms are randomized and provide
constant-factor approximation guarantees with respect to the optimal k-means
objective value.
Comment: IEEE Transactions on Information Theory, to appea
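As a rough illustration of feature extraction by random projection, the sketch below maps d-dimensional points to r dimensions with a random sign matrix scaled by 1/sqrt(r), after which any k-means routine can be run on the shorter vectors. This is a generic construction shown under stated assumptions; the function name, the seed handling, and the choice of a sign matrix are illustrative and may differ from the paper's exact method and guarantees.

```python
import math
import random


def random_sign_projection(points, r, seed=0):
    """Map d-dimensional points to r dimensions with a random +/-1 matrix."""
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(r)
    # R is d x r with independent entries +1/sqrt(r) or -1/sqrt(r).
    R = [[rng.choice((-scale, scale)) for _ in range(r)] for _ in range(d)]
    # Each projected point is the row vector p multiplied by R.
    return [
        tuple(sum(p[i] * R[i][j] for i in range(d)) for j in range(r))
        for p in points
    ]
```

The 1/sqrt(r) scaling keeps squared lengths unchanged in expectation, so distances between projected points stay close to the original distances when r is large enough.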
A fast version of the k-means classification algorithm for astronomical applications
Context. K-means is a clustering algorithm that has been used to classify
large datasets in astronomical databases. It is an unsupervised method, able to
cope with very different types of problems. Aims. We check whether a variant of the
algorithm called single-pass k-means can be used as a fast alternative to the
traditional k-means. Methods. The execution times of the two algorithms are
compared when classifying subsets drawn from the SDSS-DR7 catalog of galaxy
spectra. Results. Single-pass k-means turns out to be between 20% and 40%
faster than k-means and provides statistically equivalent classifications. This
conclusion can be scaled up to other larger databases because the execution
time of both algorithms increases linearly with the number of objects.
Conclusions. Single-pass k-means can be safely used as a fast alternative to
k-means.
Fast Color Quantization Using Weighted Sort-Means Clustering
Color quantization is an important operation with numerous applications in
graphics and image processing. Most quantization methods are essentially based
on data clustering algorithms. However, despite its popularity as a general
purpose clustering algorithm, k-means has not received much respect in the
color quantization literature because of its high computational requirements
and sensitivity to initialization. In this paper, a fast color quantization
method based on k-means is presented. The method involves several modifications
to the conventional (batch) k-means algorithm including data reduction, sample
weighting, and the use of triangle inequality to speed up the nearest neighbor
search. Experiments on a diverse set of images demonstrate that, with the
proposed modifications, k-means becomes very competitive with state-of-the-art
color quantization methods in terms of both effectiveness and efficiency.
Comment: 30 pages, 2 figures, 4 table
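Two of the modifications named in this abstract, data reduction and sample weighting, can be sketched as follows. The example collapses duplicate pixel colors into weighted unique colors and then performs one batch k-means iteration on the weighted set. It is an assumed illustration only: the function names are hypothetical, the reduction shown is simple de-duplication, and the triangle-inequality speed-up is not included.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reduce_colors(pixels):
    """Data reduction: collapse duplicate colors into (color, count) pairs."""
    return list(Counter(pixels).items())


def weighted_kmeans_step(weighted_colors, centers):
    """One batch k-means iteration in which each unique color carries a weight."""
    k = len(centers)
    sums = [[0.0, 0.0, 0.0] for _ in range(k)]
    totals = [0] * k
    for color, weight in weighted_colors:
        # Assign the unique color to its nearest center.
        j = min(range(k), key=lambda i: sq_dist(color, centers[i]))
        totals[j] += weight
        for ch in range(3):
            sums[j][ch] += weight * color[ch]
    new_centers = []
    for j in range(k):
        if totals[j]:
            # Weighted mean of the colors assigned to cluster j.
            new_centers.append(tuple(s / totals[j] for s in sums[j]))
        else:
            # Keep an empty cluster's center where it was.
            new_centers.append(tuple(centers[j]))
    return new_centers
```

The weighted update gives the same centers as running the step on every pixel, but the assignment loop visits each distinct color only once.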
Fast k-means based on KNN Graph
In the era of big data, k-means clustering has been widely adopted as a basic
processing tool in various contexts. However, its computational cost can be
prohibitively high when the data size and the number of clusters are large. It is
well known that the processing bottleneck of k-means lies in the operation of
seeking the closest centroid in each iteration. In this paper, a novel solution
to the scalability issue of k-means is presented. In the proposed method, k-means
is supported by an approximate k-nearest neighbors graph. In each k-means
iteration, each data sample is compared only to the clusters in which its nearest
neighbors reside. Since the number of nearest neighbors considered is much
smaller than k, the processing cost of this step becomes minor and independent of
k. The processing bottleneck is therefore overcome. Most interestingly, the
k-nearest neighbor graph is constructed by iteratively calling the fast
k-means itself. Compared with existing fast k-means variants, the proposed
algorithm achieves a speed-up of hundreds to thousands of times while maintaining
high clustering quality. When tested on 10 million 512-dimensional data points, it
takes only 5.2 hours to produce 1 million clusters. In contrast, traditional
k-means would take 3 years to complete clustering at the same scale.
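The restricted assignment step described in this abstract can be sketched as follows, assuming the neighbor graph is already available. The function name, the `neighbors` layout (a list of neighbor index lists), and the decision to always include a point's current cluster among the candidates are assumptions for illustration; the paper's graph construction is not shown.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def restricted_assignment(points, labels, centroids, neighbors):
    """Reassign each point, considering only the clusters of its graph neighbors."""
    new_labels = list(labels)
    for i, p in enumerate(points):
        # Candidate clusters: the point's own cluster plus its neighbors' clusters.
        candidates = {labels[i]} | {labels[n] for n in neighbors[i]}
        # Compare against the candidate centroids only, not all of them.
        new_labels[i] = min(candidates, key=lambda c: sq_dist(p, centroids[c]))
    return new_labels
```

Each point is compared with at most one centroid per neighbor plus its own, so the cost of the step depends on the neighbor count rather than on the total number of clusters.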
- …