The k-means++ seeding algorithm is one of the most popular algorithms for
choosing the initial k centers for the k-means heuristic. The algorithm is a
simple sampling procedure: pick the first center uniformly at random from the
given points. For i > 1, pick a point to be the i-th center with probability
proportional to the squared Euclidean distance from this point to the closest
of the (i−1) previously chosen centers.
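
The sampling procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the names kmeanspp_seed and sq_dist, the representation of points as tuples of numbers, and the uniform fallback used when every point already coincides with a chosen center are all assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers from `points` by squared-distance sampling.

    Hypothetical helper for illustration; `rng` is an optional
    random.Random instance so that runs can be made reproducible.
    """
    if rng is None:
        rng = random.Random()

    # First center: chosen uniformly at random from the input points.
    centers = [rng.choice(points)]

    # d2[j] holds the squared distance from points[j] to its closest
    # center among those chosen so far.
    d2 = [sq_dist(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption: if every point sits on a chosen center, the
            # distribution is undefined, so fall back to uniform sampling.
            idx = rng.randrange(len(points))
        else:
            # Sample an index with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])

        # Update each point's distance to its closest chosen center.
        d2 = [min(old, sq_dist(p, points[idx])) for old, p in zip(d2, points)]

    return centers
```

For example, kmeanspp_seed([(0, 0), (0, 1), (10, 0), (10, 1)], 2) returns two of the four input points. A point that has already been chosen has weight zero, so it is not selected again while some other point still has positive weight.
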
The k-means++ seeding algorithm is not only simple and fast but also gives an
O(log k) approximation in expectation, as shown by Arthur and Vassilvitskii.
There are datasets on which this seeding algorithm gives an approximation
factor of Ω(log k) in expectation. However, it is not clear from these
results whether the algorithm achieves a good approximation factor with
reasonably high probability (say 1/poly(k)). Brunsch and Röglin gave a dataset where
the k-means++ seeding algorithm achieves an O(log k) approximation ratio
with probability that is exponentially small in k. However, this and all
other known lower-bound examples are high-dimensional. So, an open problem was
to understand the behavior of the algorithm on low-dimensional datasets. In
this work, we give a simple two-dimensional dataset on which the seeding
algorithm achieves an O(log k) approximation ratio with probability
exponentially small in k. This solves open problems posed by Mahajan et al.
and by Brunsch and Röglin.

Comment: To appear in TAMC 2014. arXiv admin note: text overlap with
arXiv:1306.420