Fast k-means algorithm clustering
K-means has recently been recognized as one of the best algorithms for
clustering unlabeled data. Because k-means depends mainly on distance
calculations between all data points and the centers, its time cost is high
when the dataset is large (for example, more than 500 million points).
We propose a two-stage algorithm to reduce the time cost of distance
calculation for huge datasets. The first stage is a fast distance calculation
using only a small portion of the data to produce the best possible location of
the centers. The second stage is a slow distance calculation in which the
initial centers used are taken from the first stage. The fast and slow stages
represent the speed of the movement of the centers. In the slow stage, the
whole dataset can be used to get the exact location of the centers. The time
cost of the distance calculation for the fast stage is very low due to the
small size of the training data chosen. The time cost of the distance
calculation for the slow stage is also minimized due to the small number of
iterations. Different initial locations of the clusters were used when
testing the proposed algorithms. For large datasets, experiments show that
the two-stage clustering method achieves a speed-up of 1 to 9 times.
Comment: 16 pages, Wimo2011; International Journal of Computer Networks &
Communications (IJCNC) Vol.3, No.4, July 201
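The two-stage idea described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`lloyd`, `two_stage_kmeans`), the `sample_fraction` parameter, the uniform random sampling, and the convergence tolerance are all assumptions made for the example. The fast stage runs ordinary Lloyd iterations on a small sample, and the slow stage refines the resulting centers on the whole dataset.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, centers, max_iter=100, tol=1e-9):
    """Plain Lloyd iterations; returns (centers, iterations_run)."""
    centers = [tuple(c) for c in centers]
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        # Update step: move each center to the mean of its group.
        # An empty group keeps its previous center.
        new_centers = []
        for c, g in zip(centers, groups):
            if g:
                new_centers.append(tuple(sum(x) / len(g) for x in zip(*g)))
            else:
                new_centers.append(c)
        shift = max(sq_dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, iterations

def two_stage_kmeans(points, k, sample_fraction=0.1, seed=0):
    """Hypothetical two-stage driver: fast stage on a sample, slow stage on all data."""
    rng = random.Random(seed)
    sample_size = max(k, int(len(points) * sample_fraction))
    sample = rng.sample(points, sample_size)
    init = rng.sample(sample, k)  # assumed initialization: k random sample points
    # Fast stage: cheap iterations because the sample is small.
    fast_centers, fast_iters = lloyd(sample, init)
    # Slow stage: full dataset, started from the fast-stage centers.
    slow_centers, slow_iters = lloyd(points, fast_centers)
    return slow_centers, fast_iters, slow_iters
```

Because the slow stage starts near good centers, it would be expected to need only a few passes over the full data, which is where the claimed saving comes from; the actual speed-up depends on the data and on the sample size chosen.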
Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce
Kernel k-means is an effective method for data clustering which extends
the commonly used k-means algorithm to work on a similarity matrix over
complex data structures. The kernel k-means algorithm is, however,
computationally very complex, as it requires the complete data matrix to be
calculated and stored. Further, the kernelized nature of the kernel k-means
algorithm hinders the parallelization of its computations on modern
infrastructures for distributed computing. In this paper, we define a
family of kernel-based low-dimensional embeddings that allows for scaling
kernel k-means on MapReduce via an efficient and unified parallelization
strategy. Afterwards, we propose two methods for low-dimensional embedding that
adhere to our definition of the embedding family. Exploiting the proposed
parallelization strategy, we present two scalable MapReduce algorithms for
kernel k-means. We demonstrate the effectiveness and efficiency of the
proposed algorithms through an empirical evaluation on benchmark data sets.
Comment: Appears in Proceedings of the SIAM International Conference on Data
Mining (SDM), 201
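To illustrate why a low-dimensional kernel embedding makes the problem easy to parallelize, here is a small sketch of a landmark-based RBF feature map. It is an illustrative stand-in only and is not one of the two embeddings proposed in the paper; the RBF kernel, the `gamma` value, the random choice of landmarks, and all function names are assumptions. The point it shows is that each data point is embedded independently of the others, so the embedding step behaves like a map operation, and the embedded vectors can then be clustered with ordinary k-means.

```python
import math
import random

def rbf_kernel(x, y, gamma=0.5):
    """RBF (Gaussian) kernel value between two points given as tuples."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def embed_point(x, landmarks, gamma=0.5):
    """Embed one point as its kernel values against a fixed set of landmarks."""
    return [rbf_kernel(x, l, gamma) for l in landmarks]

def embed_dataset(points, num_landmarks, gamma=0.5, seed=0):
    """Hypothetical driver: pick landmarks at random, then embed every point.

    The call to map() stands in for a distributed map phase: each point
    only needs the small landmark set, never the full kernel matrix.
    """
    rng = random.Random(seed)
    landmarks = rng.sample(points, num_landmarks)
    embedded = list(map(lambda p: embed_point(p, landmarks, gamma), points))
    return embedded, landmarks
```

With n points and m landmarks, this computes n × m kernel values instead of the n × n kernel matrix, which is the kind of saving a low-dimensional embedding is meant to provide.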
