    A bi-criteria approximation algorithm for k-Means

    We consider the classical $k$-means clustering problem in the setting of bi-criteria approximation, in which an algorithm is allowed to output $\beta k > k$ clusters and must produce a clustering with cost at most $\alpha$ times the cost of the optimal set of $k$ clusters. We argue that this approach is natural in many settings in which the exact number of clusters is a priori unknown, or unimportant up to a constant factor. We give new bi-criteria approximation algorithms, based on linear programming and local search, respectively, which attain a guarantee $\alpha(\beta)$ depending on the number $\beta k$ of clusters that may be opened. Our guarantee $\alpha(\beta)$ is always at most $9 + \epsilon$ and improves rapidly with $\beta$ (for example: $\alpha(2) < 2.59$ and $\alpha(3) < 1.4$). Moreover, our algorithms have only polynomial dependence on the dimension of the input data, and so are applicable in high-dimensional settings.
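    To make the bi-criteria notion concrete, here is a minimal Python sketch that compares the cost of a clustering with $\beta k$ centers against one with $k$ centers; it uses scikit-learn's Lloyd-style KMeans as a stand-in solver rather than the paper's LP or local-search algorithms, and the data, $k$, and $\beta$ are illustrative assumptions, not taken from the paper.

        # Minimal sketch: compare k-means cost with k centers vs. beta*k centers.
        # Not the paper's algorithm; Lloyd's heuristic is used only for illustration.
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 10))   # synthetic 10-dimensional data (assumption)
        k, beta = 5, 2                    # illustrative choices

        cost_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        cost_bk = KMeans(n_clusters=beta * k, n_init=10, random_state=0).fit(X).inertia_

        # Opening beta*k centers can only lower the heuristic cost; the paper's
        # alpha(beta) bounds its beta*k-center solution against the *optimal*
        # k-center cost, which this toy comparison does not compute.
        print(cost_k, cost_bk, cost_bk / cost_k)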

    On Variants of k-means Clustering

    Clustering problems often arise in fields like data mining and machine learning, where a collection of objects must be grouped into similar groups with respect to a similarity (or dissimilarity) measure. Among clustering problems, $k$-means clustering in particular has received much attention from researchers. Despite the fact that $k$-means is a very well studied problem, its status in the plane is still open. In particular, it is unknown whether it admits a PTAS in the plane. The best known approximation bound achievable in polynomial time is $9+\epsilon$. In this paper, we consider the following variant of $k$-means. Given a set $C$ of points in $\mathcal{R}^d$ and a real $f > 0$, find a finite set $F$ of points in $\mathcal{R}^d$ that minimizes the quantity $f\cdot|F| + \sum_{p\in C} \min_{q \in F} \|p-q\|^2$. For any fixed dimension $d$, we design a local search PTAS for this problem. We also give a "bi-criterion" local search algorithm for $k$-means which uses $(1+\epsilon)k$ centers and yields a solution whose cost is at most $(1+\epsilon)$ times the cost of an optimal $k$-means solution. The algorithm runs in polynomial time for any fixed dimension. The contribution of this paper is twofold. On the one hand, we are able to handle the square of distances in an elegant manner, which yields a near-optimal approximation bound. This leads us towards a better understanding of the $k$-means problem. On the other hand, our analysis of local search might also be useful for other geometric problems. This is important considering that very little is known about the local search method for geometric approximation.
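    As a small illustration of the variant's objective $f\cdot|F| + \sum_{p\in C}\min_{q\in F}\|p-q\|^2$ and of a single-swap local-search step, the following Python sketch may help; it is not the paper's PTAS, and for brevity it assumes the candidate centers are simply the input points.

        # Sketch of the objective f*|F| + sum_p min_{q in F} ||p - q||^2 and of
        # one pass of single-swap local search over a candidate set (assumption:
        # candidates = input points). Illustration only, not the paper's PTAS.
        import numpy as np

        def cost(C, F, f):
            """Facility cost f*|F| plus squared distance of each point to its nearest center."""
            d2 = ((C[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)   # |C| x |F| squared distances
            return f * len(F) + d2.min(axis=1).sum()

        def local_search_step(C, F_idx, f, candidates):
            """Try swapping each open center with each candidate; return the best solution found."""
            best_idx, best_cost = F_idx, cost(C, C[list(F_idx)], f)
            for i in F_idx:
                for j in candidates - F_idx:
                    trial = (F_idx - {i}) | {j}
                    c = cost(C, C[list(trial)], f)
                    if c < best_cost:
                        best_idx, best_cost = trial, c
            return best_idx, best_cost

        rng = np.random.default_rng(1)
        C = rng.normal(size=(60, 2))
        candidates = set(range(len(C)))
        F_idx = set(rng.choice(len(C), size=4, replace=False).tolist())
        F_idx, c = local_search_step(C, F_idx, f=0.5, candidates=candidates)
        print("cost after one swap pass:", round(c, 3))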

    Training Gaussian Mixture Models at Scale via Coresets

    How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in the dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings and do not impose restrictions on the data generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world datasets suggests that our coreset-based approach enables a significant reduction in training time with negligible approximation error.
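    A rough Python sketch of the workflow the abstract describes, fit on a small set and evaluate on the full data, is shown below; it uses a uniform, unweighted subsample as a stand-in for the paper's importance-weighted coreset construction, so it carries no guarantee and is illustrative only.

        # Sketch: fit a Gaussian mixture on a small subsample and compare its fit
        # on the full data with a model trained on all points. The paper builds
        # weighted coresets with guarantees; this uniform sample is an assumption.
        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(loc=m, size=(20000, 5)) for m in (-3.0, 0.0, 3.0)])

        sample = X[rng.choice(len(X), size=500, replace=False)]   # small stand-in "coreset"

        gmm_full = GaussianMixture(n_components=3, random_state=0).fit(X)
        gmm_core = GaussianMixture(n_components=3, random_state=0).fit(sample)

        # score() is the mean log-likelihood per sample; close values indicate the
        # model trained on the small set also fits the full data well.
        print("full-data model, log-lik on X:", gmm_full.score(X))
        print("subsample model, log-lik on X:", gmm_core.score(X))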

    Fully Scalable MPC Algorithms for Clustering in High Dimension

    We design new parallel algorithms for clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model and are fully scalable, meaning that the local memory in each machine may be $n^{\sigma}$ for arbitrarily small fixed $\sigma > 0$. Importantly, the local memory may be substantially smaller than the number of clusters $k$, yet all our algorithms are fast, i.e., run in $O(1)$ rounds. We first devise a fast MPC algorithm for $O(1)$-approximation of uniform facility location. This is the first fully scalable MPC algorithm that achieves $O(1)$-approximation for any clustering problem in a general geometric setting; previous algorithms only provide $\mathrm{poly}(\log n)$-approximation or apply to restricted inputs, like low dimension or a small number of clusters $k$; e.g. [Bhaskara and Wijewardena, ICML'18; Cohen-Addad et al., NeurIPS'21; Cohen-Addad et al., ICML'22]. We then build on this facility location result and devise a fast MPC algorithm that achieves an $O(1)$-bicriteria approximation for $k$-Median and for $k$-Means, namely, it computes $(1+\varepsilon)k$ clusters of cost within an $O(1/\varepsilon^2)$-factor of the optimum for $k$ clusters. A primary technical tool that we introduce, and which may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing for every data point a statistic of its approximate neighborhood, for statistics like range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
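    The geometric-aggregation primitive can be pictured, very loosely, as a single-machine toy: hash points to cells of a randomly shifted grid and let each point read an aggregate of its cell. The Python sketch below is only an assumption-laden illustration of that idea; the paper's primitive is based on consistent hashing / sparse partitions and runs in $O(1)$ MPC rounds in high dimension.

        # Toy single-machine analogue of geometric aggregation: bucket points into
        # a randomly shifted grid and give each point the count of its cell as a
        # crude proxy for a range count. Cell size and data are assumptions.
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(10000, 3))

        cell = 1.0                                        # grid side length ~ query radius
        shift = rng.uniform(0.0, cell, size=X.shape[1])   # random shift reduces boundary bias

        keys = np.floor((X + shift) / cell).astype(np.int64)   # cell id per point
        uniq, inv, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)

        # Each point's statistic: number of points hashed to the same cell, an
        # (under-)estimate of a range count at radius ~ cell.
        neighborhood_count = counts[inv.ravel()]
        print("mean per-cell neighbors:", neighborhood_count.mean())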

    Determinantal Point Processes for Coresets

    When one is faced with a dataset too large to be used all at once, an obvious solution is to retain only part of it. In practice this takes a wide variety of different forms, but among them "coresets" are especially appealing. A coreset is a (small) weighted sample of the original data that comes with a guarantee: that a cost function can be evaluated on the smaller set instead of the larger one, with low relative error. For some classes of problems, and via a careful choice of sampling distribution, i.i.d. random sampling has turned out to be one of the most successful methods to build coresets efficiently. However, independent samples are sometimes overly redundant, and one could hope that enforcing diversity would lead to better performance. The difficulty lies in proving coreset properties for non-i.i.d. samples. We show that the coreset property holds for samples formed with determinantal point processes (DPPs). DPPs are interesting because they are a rare example of repulsive point processes with tractable theoretical properties, enabling us to construct general coreset theorems. We apply our results to the k-means problem, and give empirical evidence of the superior performance of DPP samples over state-of-the-art methods.
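    The coreset property mentioned here, that a weighted sample reproduces the k-means cost of the full data for a set of candidate centers, can be checked numerically as in the Python sketch below; the sample is uniform with equal weights purely for illustration, since drawing from a DPP is beyond this snippet and is not what the paper's guarantee rests on.

        # Sketch of the coreset property for k-means: the weighted cost on a small
        # sample should approximate the cost on the full data. Uniform sampling
        # with weights n/m is used here as an assumption, not the paper's DPP.
        import numpy as np
        from sklearn.cluster import KMeans

        def kmeans_cost(X, centers, w=None):
            """(Weighted) sum of squared distances of points to their nearest center."""
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
            return d2.sum() if w is None else (w * d2).sum()

        rng = np.random.default_rng(0)
        X = rng.normal(size=(20000, 4))
        idx = rng.choice(len(X), size=400, replace=False)
        S, w = X[idx], np.full(400, len(X) / 400.0)     # uniform sample, equal weights

        centers = KMeans(n_clusters=8, n_init=5, random_state=0).fit(X).cluster_centers_
        print("full cost:    ", kmeans_cost(X, centers))
        print("weighted cost:", kmeans_cost(S, centers, w))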