10 research outputs found
A bi-criteria approximation algorithm for k-Means
We consider the classical k-means clustering problem in the setting of
bi-criteria approximation, in which an algorithm is allowed to output βk clusters, and must produce a clustering with cost at most α times the
cost of the optimal set of k clusters. We argue that this approach is
natural in many settings in which the exact number of clusters is a priori
unknown, or unimportant up to a constant factor. We give new bi-criteria
approximation algorithms, based on linear programming and local search,
respectively, which attain a guarantee α(β) depending on the number βk
of clusters that may be opened. Our guarantee α(β) is
always at most 9+ε and improves rapidly with β. Moreover, our algorithms have only
polynomial dependence on the dimension of the input data, and so are applicable
in high-dimensional settings.
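The bi-criteria tradeoff can be illustrated numerically: opening more than k centers can only lower the achievable k-means cost. Below is a minimal sketch using plain Lloyd iterations on synthetic data (not the paper's LP or local-search algorithms; all names and parameters are illustrative). Seeding the βk-center run with the finished k-center solution guarantees its cost is no worse.

```python
import numpy as np

def kmeans_cost(X, centers):
    # sum of squared distances from each point to its nearest center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

def lloyd(X, centers, iters=20):
    # plain Lloyd iterations; the cost never increases from step to step
    centers = centers.copy()
    for _ in range(iters):
        lab = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for j in range(len(centers)):
            if (lab == j).any():          # keep old center for empty clusters
                centers[j] = X[lab == j].mean(axis=0)
    return centers

rng = np.random.default_rng(0)
means = [(0, 0), (4, 0), (0, 4), (4, 4)]
X = np.concatenate([rng.normal(m, 0.3, (100, 2)) for m in means])

k = 2
c_k = lloyd(X, X[rng.choice(len(X), k, replace=False)])
# beta = 2: seed the bi-criteria run with the k-center solution plus k more
# points, so its final cost is at most that of the k-center solution
c_2k = lloyd(X, np.vstack([c_k, X[rng.choice(len(X), k, replace=False)]]))
```

Since each Lloyd step is non-increasing in cost and the 2k-center run starts from a superset of the k-center solution, `kmeans_cost(X, c_2k) <= kmeans_cost(X, c_k)` holds by construction.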
On Variants of k-means Clustering
Clustering problems often arise in fields like data mining and
machine learning, where the goal is to group a collection of objects into similar groups with
respect to a similarity (or dissimilarity) measure. Among clustering
problems, k-means clustering in particular has received much attention
from researchers. Despite the fact that k-means is a very well studied
problem, its status in the plane is still open. In particular, it is
unknown whether it admits a PTAS in the plane. The best known approximation
bound in polynomial time is 9+ε.
In this paper, we consider the following variant of k-means. Given a set
P of points in ℝ^d and a real f > 0, find a finite set C of
points in ℝ^d that minimizes the quantity f·|C| + Σ_{p∈P} min_{c∈C} ‖p−c‖². For any fixed dimension d, we design a local
search PTAS for this problem. We also give a "bi-criterion" local search
algorithm for k-means which uses (1+ε)k centers and yields a solution
whose cost is at most (1+ε) times the cost of an optimal k-means
solution. The algorithm runs in polynomial time for any fixed dimension.
The contribution of this paper is twofold. On the one hand, we are
able to handle the squares of distances in an elegant manner, which yields a near-optimal approximation bound. This leads us towards a better understanding of
the k-means problem. On the other hand, our analysis of local search might
also be useful for other geometric problems. This is important considering that
very little is known about the local search method for geometric approximation.
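The variant's objective can be made concrete with a small sketch. Here the objective is assumed to charge an opening cost f per center plus the sum of squared distances from each point to its nearest center (the exact form in the paper may differ); the data, function names, and parameters are illustrative.

```python
import numpy as np

def variant_cost(X, C, f):
    # assumed objective: f per opened center, plus the sum over points of the
    # squared distance to the nearest center in C
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return f * len(C) + d2.min(axis=1).sum()

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-3, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])

one = X.mean(axis=0, keepdims=True)                 # a single center
two = np.array([X[:50].mean(0), X[50:].mean(0)])    # one center per cluster
cheap  = variant_cost(X, two, f=1.0)    # second center pays for itself
pricey = variant_cost(X, two, f=1e6)    # huge f favors opening fewer centers
```

Because |C| is part of the objective rather than fixed, the opening cost f controls the number of centers: for this two-cluster data, a second center wins when f is small and loses when f is large.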
Training Gaussian Mixture Models at Scale via Coresets
How can we train a statistical mixture model on a massive data set? In this
work we show how to construct coresets for mixtures of Gaussians. A coreset is
a weighted subset of the data, which guarantees that models fitting the coreset
also provide a good fit for the original data set. We show that, perhaps
surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension
and the number of mixture components, while being independent of the data set
size. Hence, one can harness computationally intensive algorithms to compute a
good approximation on a significantly smaller data set. More importantly, such
coresets can be efficiently constructed both in distributed and streaming
settings and do not impose restrictions on the data generating process. Our
results rely on a novel reduction of statistical estimation to problems in
computational geometry and new combinatorial complexity results for mixtures of
Gaussians. Empirical evaluation on several real-world datasets suggests that
our coreset-based approach enables a significant reduction in training time with
negligible approximation error.
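The core coreset guarantee — that a cost evaluated on a small weighted subset approximates the cost on the full data — can be illustrated with the simplest possible construction: uniform sampling with reweighting. This is a simplified stand-in, not the sensitivity-based construction the paper uses, and the k-means cost stands in for the Gaussian-mixture fit; all names below are illustrative.

```python
import numpy as np

def weighted_kmeans_cost(X, w, centers):
    # weighted sum of squared distances to the nearest center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return (w * d2.min(axis=1)).sum()

rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(m, 1.0, (2000, 3)) for m in (-5.0, 0.0, 5.0)])
n = len(X)

# uniform "coreset": m points, each weighted n/m, so the weighted cost is an
# unbiased estimate of the full cost for any fixed set of centers
m = 300
idx = rng.choice(n, m, replace=False)
S, w = X[idx], np.full(m, n / m)

centers = np.array([[-5.0] * 3, [0.0] * 3, [5.0] * 3])
full = weighted_kmeans_cost(X, np.ones(n), centers)
est = weighted_kmeans_cost(S, w, centers)
rel_err = abs(est - full) / full
```

The point of the abstract is that for mixtures of Gaussians a subset of size polynomial in the dimension and number of components suffices for *all* candidate models simultaneously, which is much stronger than this per-model unbiasedness.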
Fully Scalable MPC Algorithms for Clustering in High Dimension
We design new parallel algorithms for clustering in high-dimensional
Euclidean spaces. These algorithms run in the Massively Parallel Computation
(MPC) model, and are fully scalable, meaning that the local memory in each
machine may be n^σ for an arbitrarily small fixed σ > 0.
Importantly, the local memory may be substantially smaller than the number of
clusters k, yet all our algorithms are fast, i.e., run in O(1) rounds.
We first devise a fast MPC algorithm for O(1)-approximation of uniform
facility location. This is the first fully-scalable MPC algorithm that achieves
O(1)-approximation for any clustering problem in a general geometric setting;
previous algorithms only provide poly(log n)-approximation or apply
to restricted inputs, like low dimension or a small number of clusters k; e.g.
[Bhaskara and Wijewardena, ICML'18; Cohen-Addad et al., NeurIPS'21; Cohen-Addad
et al., ICML'22]. We then build on this facility location result and devise a
fast MPC algorithm that achieves O(1)-bicriteria approximation for k-Median
and for k-Means, namely, it computes (1+ε)k clusters of cost
within O(1/ε²)-factor of the optimum for k clusters.
A primary technical tool that we introduce, and may be of independent
interest, is a new MPC primitive for geometric aggregation, namely, computing
for every data point a statistic of its approximate neighborhood, for
statistics like range counting and nearest-neighbor search. Our implementation
of this primitive works in high dimension, and is based on consistent hashing
(aka sparse partition), a technique that was recently used for streaming
algorithms [Czumaj et al., FOCS'22].
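The geometric-aggregation primitive — computing, for every point, a statistic of its approximate neighborhood — can be caricatured on a single machine with a plain uniform grid in place of consistent hashing (the real primitive is a distributed, multi-scale construction; everything below is an illustrative toy).

```python
import numpy as np
from collections import Counter
from itertools import product

def grid_keys(X, cell):
    # hash every point to an axis-aligned grid cell; a toy stand-in for
    # consistent hashing / sparse partition
    return [tuple(int(v) for v in np.floor(p / cell)) for p in X]

def approx_range_count(counts, key, dim):
    # range counting over the approximate neighborhood: the point's cell and
    # its 3^dim - 1 neighbors, i.e. all points within distance ~cell plus some
    # slightly farther ones -- the "approximate" part
    return sum(counts.get(tuple(k + o for k, o in zip(key, off)), 0)
               for off in product((-1, 0, 1), repeat=dim))

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, (500, 2))
keys = grid_keys(X, cell=1.0)
counts = Counter(keys)                              # one aggregation pass
nbr = approx_range_count(counts, keys[0], dim=2)    # includes the point itself
```

In the MPC setting the per-cell counts would be computed by hashing points to machines, which is why a partition with few, well-shaped cells per ball (consistent hashing) matters; a fixed grid as above fails in high dimension, where 3^dim neighbor cells is already prohibitive.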
Determinantal Point Processes for Coresets
When one is faced with a dataset too large to be used all at once, an obvious solution is to retain only part of it. In practice this takes a wide variety of different forms, but among them "coresets" are especially appealing. A coreset is a (small) weighted sample of the original data that comes with a guarantee: that a cost function can be evaluated on the smaller set instead of the larger one, with low relative error. For some classes of problems, and via a careful choice of sampling distribution, i.i.d. random sampling has turned out to be one of the most successful methods to build coresets efficiently. However, independent samples are sometimes overly redundant, and one could hope that enforcing diversity would lead to better performance. The difficulty lies in proving coreset properties for non-i.i.d. samples. We show that the coreset property holds for samples formed with determinantal point processes (DPPs). DPPs are interesting because they are a rare example of repulsive point processes with tractable theoretical properties, enabling us to construct general coreset theorems. We apply our results to the k-means problem, and give empirical evidence of the superior performance of DPP samples over state-of-the-art methods.
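The repulsiveness that makes DPP samples attractive can be seen directly on a toy ground set, where the subset weights P(S) ∝ det(L_S) can be enumerated exhaustively. The kernel choice and sizes below are illustrative, not the paper's construction.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
pts = rng.uniform(0, 1, (6, 2))

# RBF likelihood kernel L: nearby points are strongly correlated, so subsets
# of near-duplicates have nearly singular minors and hence tiny determinants
d2 = ((pts[:, None] - pts[None, :]) ** 2).sum(-1)
L = np.exp(-d2 / 0.1)

def dpp_weight(S):
    # unnormalized DPP probability of subset S: det of the principal minor L_S
    idx = list(S)
    return np.linalg.det(L[np.ix_(idx, idx)])

weights = {S: dpp_weight(S) for S in combinations(range(6), 3)}
most_likely = max(weights, key=weights.get)   # DPPs favor spread-out subsets
```

Because L is positive semidefinite, every principal minor is nonnegative, so these weights define a valid distribution over subsets once normalized; exact sampling for larger ground sets uses the eigendecomposition of L rather than enumeration.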