A tight lower bound instance for k-means++ in constant dimension
The k-means++ seeding algorithm is one of the most popular algorithms used
for finding the initial centers for the k-means heuristic. The
algorithm is a simple sampling procedure and can be described as follows: Pick
the first center randomly from the given points. For $i > 1$, pick a point to
be the $i^{th}$ center with probability proportional to the square of the
Euclidean distance of this point to the closest among the previously chosen
$(i-1)$ centers.
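The sampling procedure above (often called $D^2$ sampling) is compact enough to sketch in code. Below is a minimal NumPy sketch based only on the description above; the function name and its parameters are illustrative assumptions, not from the paper.

import numpy as np

def kmeans_pp_seed(points, k, seed=None):
    # Illustrative sketch of D^2 seeding; `points` is an (n, d) array.
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    # First center: picked randomly from the given points.
    centers = [points[rng.integers(n)]]
    # Squared Euclidean distance of each point to its closest chosen center.
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        # Pick the i-th center with probability proportional to d2.
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(points[idx])
        # Keep the minimum distance over all centers chosen so far.
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.stack(centers)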
The k-means++ seeding algorithm is not only simple and fast but also gives an
$O(\log k)$ approximation in expectation, as shown by Arthur and Vassilvitskii.
There are datasets on which this seeding algorithm gives an approximation
factor of $\Omega(\log k)$ in expectation. However, it is not clear from these
results if the algorithm achieves a good approximation factor with reasonably
high probability (say $1/\mathrm{poly}(k)$). Brunsch and Röglin gave a dataset
on which the k-means++ seeding algorithm achieves an $O(\log k)$ approximation
ratio with probability that is exponentially small in $k$. However, this and all
other known lower-bound examples are high dimensional. So, an open problem was
to understand the behavior of the algorithm on low dimensional datasets. In
this work, we give a simple two dimensional dataset on which the seeding
algorithm achieves an $O(\log k)$ approximation ratio with probability
exponentially small in $k$. This solves open problems posed by Mahajan et al.
and by Brunsch and Röglin.
Comment: To appear in TAMC 2014. arXiv admin note: text overlap with
arXiv:1306.420
Fast k-means based on KNN Graph
In the era of big data, k-means clustering has been widely adopted as a basic
processing tool in various contexts. However, its computational cost can be
prohibitively high when both the data size and the cluster number are large. It
is well known that the processing bottleneck of k-means lies in seeking the
closest centroid for each sample in each iteration. In this paper, a novel
solution to the scalability issue of k-means is presented. In the proposed
method, k-means is supported by an approximate k-nearest-neighbor graph. In
each k-means iteration, a data sample is compared only to the clusters in which
its nearest neighbors reside. Since the number of nearest neighbors considered
is much smaller than k, the cost of this step becomes minor and independent of
k. The processing bottleneck is therefore overcome. Most interestingly, the
k-nearest-neighbor graph is itself constructed by iteratively calling the fast
k-means procedure. Compared with existing fast k-means variants, the proposed
algorithm achieves a speed-up of hundreds to thousands of times while
maintaining high clustering quality. When tested on 10 million 512-dimensional
vectors, it takes only 5.2 hours to produce 1 million clusters. In contrast,
traditional k-means would take about 3 years to complete clustering at the
same scale.
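As a concrete illustration of the assignment step described in this abstract, here is a minimal NumPy sketch in which each sample is compared only to the centroids of the clusters its approximate nearest neighbors currently belong to. The function and parameter names are assumptions for illustration; the paper's actual implementation may differ.

import numpy as np

def assign_via_knn_graph(points, centroids, knn, labels):
    # points:    (n, d) data matrix
    # centroids: (k, d) current cluster centers
    # knn:       (n, m) indices of each sample's m approximate nearest neighbors
    # labels:    (n,) current cluster assignment of each sample
    new_labels = labels.copy()
    for i, x in enumerate(points):
        # Candidate clusters: the sample's own cluster plus the clusters
        # its nearest neighbors reside in -- typically far fewer than k.
        cand = np.unique(np.append(labels[knn[i]], labels[i]))
        # Compare against the candidate centroids only, not all k of them.
        d2 = np.sum((centroids[cand] - x) ** 2, axis=1)
        new_labels[i] = cand[np.argmin(d2)]
    return new_labels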