46 research outputs found

A tight lower bound instance for k-means++ in constant dimension

Authors: A. Aggarwal, B. Bahmani, D. Arthur, M. Agarwal, M.R. Ackermann, R. Jaiswal
Publication date: 01/01/2014

The k-means++ seeding algorithm is one of the most popular algorithms for finding the initial k centers when using the k-means heuristic. The algorithm is a simple sampling procedure and can be described as follows: Pick the first center randomly from the given points. For i > 1, pick a point to be the i-th center with probability proportional to the square of its Euclidean distance to the closest of the (i-1) previously chosen centers. The k-means++ seeding algorithm is not only simple and fast but also gives an O(log k) approximation in expectation, as shown by Arthur and Vassilvitskii. There are datasets on which this seeding algorithm gives an approximation factor of Ω(log k) in expectation. However, it is not clear from these results whether the algorithm achieves a good approximation factor with reasonably high probability (say 1/poly(k)). Brunsch and Röglin gave a dataset where the k-means++ seeding algorithm achieves an O(log k) approximation ratio with probability that is exponentially small in k. However, this and all other known lower-bound examples are high dimensional, so an open problem was to understand the behavior of the algorithm on low-dimensional datasets. In this work, we give a simple two-dimensional dataset on which the seeding algorithm achieves an O(log k) approximation ratio with probability exponentially small in k. This solves open problems posed by Mahajan et al. and by Brunsch and Röglin.

Comment: To appear in TAMC 2014. arXiv admin note: text overlap with arXiv:1306.420
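The sampling procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not code from the paper: the names `kmeanspp_seed` and `sq_dist` are hypothetical, the first center is assumed to be drawn uniformly at random, and the fallback used when every point already coincides with a chosen center is an assumption added only to keep the sketch runnable.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers by squared-distance (D^2) sampling."""
    rng = rng or random.Random()
    # First center: drawn at random from the given points (assumed uniform).
    centers = [rng.choice(points)]
    # d2[j] holds the squared distance from points[j] to its closest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption: all points coincide with a chosen center,
            # so fall back to a uniform draw.
            nxt = rng.choice(points)
        else:
            # i-th center: probability proportional to the squared distance
            # to the closest previously chosen center.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        # Update the closest-center distances with the new center.
        d2 = [min(d, sq_dist(p, nxt)) for d, p in zip(d2, points)]
    return centers
```

For example, `kmeanspp_seed([(0.0, 0.0), (0.0, 0.0), (10.0, 10.0)], 2)` always returns one center at each of the two distinct locations, because a point that coincides with an already chosen center has sampling weight zero.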

arXiv.org e-Print Archive

CiteSeerX

Crossref

Fast k-means based on KNN Graph

Authors: Deng Cheng-Hao, Zhao Wan-Lei
Publication date: 04/05/2017

In the era of big data, k-means clustering has been widely adopted as a basic processing tool in various contexts. However, its computational cost can be prohibitively high when the data size and the number of clusters are large. It is well known that the processing bottleneck of k-means lies in seeking the closest centroid in each iteration. In this paper, a novel solution to the scalability issue of k-means is presented. In the proposal, k-means is supported by an approximate k-nearest neighbors graph. In the k-means iteration, each data sample is only compared to the clusters in which its nearest neighbors reside. Since the number of nearest neighbors considered is much smaller than k, the processing cost of this step becomes minor and independent of k. The processing bottleneck is therefore overcome. Most interestingly, the k-nearest neighbor graph is constructed by iteratively calling the fast k-means itself. Compared with existing fast k-means variants, the proposed algorithm achieves a speed-up of hundreds to thousands of times while maintaining high clustering quality. Tested on 10 million 512-dimensional data points, it takes only 5.2 hours to produce 1 million clusters. In contrast, traditional k-means would take 3 years to perform clustering at the same scale.
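The restricted assignment step described in the abstract (each sample is compared only to the clusters containing its nearest neighbors) might look roughly like the sketch below. The abstract does not give the paper's actual procedure, so everything here is an assumption for illustration: the function name `assign_via_knn_graph`, the representation of the graph as a list of neighbor indices per sample, and the choice to keep the sample's current cluster among the candidates.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_via_knn_graph(points, centroids, labels, knn):
    """One assignment pass restricted by a (possibly approximate) KNN graph.

    points    -- list of samples (tuples)
    centroids -- list of current cluster centroids (tuples)
    labels    -- labels[i] is the current cluster index of points[i]
    knn       -- knn[i] is a list of indices of the neighbors of points[i]
    """
    new_labels = []
    for i, p in enumerate(points):
        # Candidate clusters: those of the sample's neighbors, plus
        # (as an assumption of this sketch) its own current cluster.
        candidates = {labels[i]} | {labels[j] for j in knn[i]}
        # Compare only against candidate centroids instead of all k,
        # so the per-sample cost depends on the neighbor count, not on k.
        best = min(candidates, key=lambda c: sq_dist(p, centroids[c]))
        new_labels.append(best)
    return new_labels
```

With `points = [(0, 0), (1, 0), (10, 0), (11, 0)]`, `centroids = [(0.5, 0), (10.5, 0)]`, `labels = [0, 1, 1, 1]` and `knn = [[1], [0], [3], [2]]`, the second sample is moved to cluster 0 because its neighbor already belongs there, while the last two samples never look at centroid 0 at all.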

arXiv.org e-Print Archive

Crossref