
    A tight lower bound instance for k-means++ in constant dimension

    The k-means++ seeding algorithm is one of the most popular algorithms used for finding the initial k centers when running the k-means heuristic. The algorithm is a simple sampling procedure and can be described as follows: pick the first center uniformly at random from the given points; for i > 1, pick a point to be the i-th center with probability proportional to the squared Euclidean distance from this point to the closest of the (i-1) previously chosen centers. The k-means++ seeding algorithm is not only simple and fast but also gives an O(log k) approximation in expectation, as shown by Arthur and Vassilvitskii. There are datasets on which this seeding algorithm gives an approximation factor of Ω(log k) in expectation. However, it is not clear from these results whether the algorithm achieves a good approximation factor with reasonably high probability (say, 1/poly(k)). Brunsch and Röglin gave a dataset on which the k-means++ seeding algorithm achieves an O(log k) approximation ratio with probability that is exponentially small in k. However, this and all other known lower-bound examples are high dimensional, so an open problem was to understand the behavior of the algorithm on low dimensional datasets. In this work, we give a simple two dimensional dataset on which the seeding algorithm achieves an O(log k) approximation ratio with probability exponentially small in k. This solves open problems posed by Mahajan et al. and by Brunsch and Röglin.
    Comment: To appear in TAMC 2014. arXiv admin note: text overlap with arXiv:1306.420
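    The D²-sampling step described in the abstract is short enough to sketch directly. Below is a minimal Python illustration of that seeding procedure, under the assumption that points are plain tuples in R^d; the function name kmeanspp_seeding and the pure-Python data layout are illustrative choices, not the paper's code.

    ```python
    import random

    def kmeanspp_seeding(points, k, rng=None):
        """Minimal sketch of k-means++ (D^2) seeding as described in the
        abstract; points are tuples in R^d. Illustrative, not the
        authors' implementation."""
        rng = rng or random.Random(0)
        # First center: chosen uniformly at random from the input points.
        centers = [rng.choice(points)]
        for _ in range(1, k):
            # Squared Euclidean distance of each point to its closest
            # already-chosen center.
            d2 = [min(sum((pj - cj) ** 2 for pj, cj in zip(p, c))
                      for c in centers)
                  for p in points]
            # Sample the next center with probability proportional to d2.
            r = rng.uniform(0.0, sum(d2))
            acc = 0.0
            for p, w in zip(points, d2):
                acc += w
                if acc >= r:
                    centers.append(p)
                    break
        return centers

    # Example: pick 2 initial centers from a tiny 2D dataset.
    # centers = kmeanspp_seeding([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)], 2)
    ```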

    The Complexity of the k-means Method

    The k-means method is a widely used technique for clustering points in Euclidean space. While it is extremely fast in practice, its worst-case running time is exponential in the number of data points. We prove that the k-means method can implicitly solve PSPACE-complete problems, providing a complexity-theoretic explanation for its worst-case running time. Our result parallels recent work on the complexity of the simplex method for linear programming.
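    For reference, the k-means (Lloyd's) method whose running time is at issue alternates an assignment step and a mean-update step until a fixed point is reached. The sketch below illustrates that iteration under the assumption that points and centers are equal-dimension tuples; the helper name lloyd_iterations and the empty-cluster handling are hypothetical choices, not the paper's formulation.

    ```python
    def lloyd_iterations(points, centers, max_iters=100):
        """Minimal sketch of the k-means (Lloyd's) iteration; points and
        centers are equal-dimension tuples. Illustrative only."""
        for _ in range(max_iters):
            # Assignment step: attach every point to its nearest center.
            clusters = [[] for _ in centers]
            for p in points:
                nearest = min(range(len(centers)),
                              key=lambda j: sum((a - b) ** 2
                                                for a, b in zip(p, centers[j])))
                clusters[nearest].append(p)
            # Update step: move each center to the mean of its cluster
            # (an empty cluster keeps its old center in this sketch).
            new_centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c
                           else centers[j]
                           for j, c in enumerate(clusters)]
            if new_centers == centers:
                break  # fixed point reached; no assignment changes
            centers = new_centers
        return centers
    ```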

    Information geometry

    This Special Issue of the journal Entropy, titled ā€œInformation Geometry Iā€, contains a collection of 17 papers concerning the foundations and applications of information geometry. Based on a geometrical interpretation of probability, information geometry has become a rich mathematical field employing the methods of differential geometry. It has numerous applications to data science, physics, and neuroscience. Presenting original research, yet written in an accessible, tutorial style, this collection of papers will be useful for scientists who are new to the field, while providing an excellent reference for the more experienced researcher. Several papers are written by authorities in the field, and topics cover the foundations of information geometry, as well as applications to statistics, Bayesian inference, machine learning, complex systems, physics, and neuroscience.