
    A tight lower bound instance for k-means++ in constant dimension

    The k-means++ seeding algorithm is one of the most popular algorithms used for finding the initial k centers when running the k-means heuristic. The algorithm is a simple sampling procedure and can be described as follows: pick the first center uniformly at random from the given points; for i > 1, pick a point to be the i-th center with probability proportional to the squared Euclidean distance from this point to the closest of the (i-1) previously chosen centers. The k-means++ seeding algorithm is not only simple and fast but also gives an O(log k) approximation in expectation, as shown by Arthur and Vassilvitskii. There are datasets on which this seeding algorithm gives an approximation factor of Ω(log k) in expectation. However, it is not clear from these results whether the algorithm achieves a good approximation factor with reasonably high probability (say, 1/poly(k)). Brunsch and Röglin gave a dataset on which the k-means++ seeding algorithm achieves an O(log k) approximation ratio with probability that is exponentially small in k. However, this and all other known lower-bound examples are high dimensional, so an open problem was to understand the behavior of the algorithm on low dimensional datasets. In this work, we give a simple two dimensional dataset on which the seeding algorithm achieves an O(log k) approximation ratio with probability exponentially small in k. This solves open problems posed by Mahajan et al. and by Brunsch and Röglin.
    Comment: To appear in TAMC 2014. arXiv admin note: text overlap with arXiv:1306.420
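    The D²-sampling step described in the abstract is short enough to sketch directly. Below is a minimal Python illustration of that seeding procedure, under the assumption that points are plain tuples in R^d; the function name kmeanspp_seeding and the pure-Python data layout are illustrative choices, not the paper's code.

    ```python
    import random

    def kmeanspp_seeding(points, k, rng=None):
        """Minimal sketch of k-means++ (D^2) seeding as described in the
        abstract; points are tuples in R^d. Illustrative, not the
        authors' implementation."""
        rng = rng or random.Random(0)
        # First center: chosen uniformly at random from the input points.
        centers = [rng.choice(points)]
        for _ in range(1, k):
            # Squared Euclidean distance of each point to its closest
            # already-chosen center.
            d2 = [min(sum((pj - cj) ** 2 for pj, cj in zip(p, c))
                      for c in centers)
                  for p in points]
            # Sample the next center with probability proportional to d2.
            r = rng.uniform(0.0, sum(d2))
            acc = 0.0
            for p, w in zip(points, d2):
                acc += w
                if acc >= r:
                    centers.append(p)
                    break
        return centers

    # Example: pick 2 initial centers from a tiny 2D dataset.
    # centers = kmeanspp_seeding([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)], 2)
    ```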

    The Complexity of the k-means Method

    The k-means method is a widely used technique for clustering points in Euclidean space. While it is extremely fast in practice, its worst-case running time is exponential in the number of data points. We prove that the k-means method can implicitly solve PSPACE-complete problems, providing a complexity-theoretic explanation for its worst-case running time. Our result parallels recent work on the complexity of the simplex method for linear programming.
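    For reference, the k-means (Lloyd's) method whose running time is at issue alternates an assignment step and a mean-update step until a fixed point is reached. The sketch below illustrates that iteration under the assumption that points and centers are equal-dimension tuples; the helper name lloyd_iterations and the empty-cluster handling are hypothetical choices, not the paper's formulation.

    ```python
    def lloyd_iterations(points, centers, max_iters=100):
        """Minimal sketch of the k-means (Lloyd's) iteration; points and
        centers are equal-dimension tuples. Illustrative only."""
        for _ in range(max_iters):
            # Assignment step: attach every point to its nearest center.
            clusters = [[] for _ in centers]
            for p in points:
                nearest = min(range(len(centers)),
                              key=lambda j: sum((a - b) ** 2
                                                for a, b in zip(p, centers[j])))
                clusters[nearest].append(p)
            # Update step: move each center to the mean of its cluster
            # (an empty cluster keeps its old center in this sketch).
            new_centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c
                           else centers[j]
                           for j, c in enumerate(clusters)]
            if new_centers == centers:
                break  # fixed point reached; no assignment changes
            centers = new_centers
        return centers
    ```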

    Information geometry

    This Special Issue of the journal Entropy, titled ā€œInformation Geometry Iā€, contains a collection of 17 papers concerning the foundations and applications of information geometry. Based on a geometrical interpretation of probability, information geometry has become a rich mathematical field employing the methods of differential geometry. It has numerous applications to data science, physics, and neuroscience. Presenting original research, yet written in an accessible, tutorial style, this collection of papers will be useful for scientists who are new to the field, while providing an excellent reference for the more experienced researcher. Several papers are written by authorities in the field, and topics cover the foundations of information geometry, as well as applications to statistics, Bayesian inference, machine learning, complex systems, physics, and neuroscience.