Coresets for Clustering: Foundations and Challenges
Clustering is a fundamental task in machine learning and data analysis. The main challenge for clustering in big data sets is that classical clustering algorithms often do not scale well. Coresets are data reduction techniques that turn big data into a tiny proxy. Prior research has shown that coresets can provide a scalable solution to clustering problems and imply streaming and distributed algorithms. In this work, we aim to solve a fundamental question and two modern challenges in coresets for clustering.
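To make the coreset idea concrete, here is a minimal sketch (toy 1-D data, uniform sampling with naive reweighting; real coresets use careful sampling and weighting to guarantee error bounds for every candidate solution) showing how a small weighted subset can approximate the clustering cost of a large dataset:

```python
import random

def cost(points, centers, weights=None):
    """Weighted k-means-style cost: sum of squared distances to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for p, w in zip(points, weights):
        total += w * min((p - c) ** 2 for c in centers)
    return total

# Toy 1-D dataset: two Gaussian clusters of 10,000 points each.
random.seed(0)
data = [random.gauss(0, 1) for _ in range(10_000)] + \
       [random.gauss(8, 1) for _ in range(10_000)]

# Tiny uniform-sample "coreset": each sampled point stands in for n/m originals.
sample = random.sample(data, 200)
weights = [len(data) / len(sample)] * len(sample)

centers = [0.0, 8.0]
full = cost(data, centers)
approx = cost(sample, centers, weights)
print(abs(approx - full) / full)  # small relative error for this solution
```

The weighted cost on 200 points tracks the cost on 20,000 points, which is the data-reduction effect the abstract refers to.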
\textsf{Beyond Euclidean Space}: Coresets for clustering in Euclidean space have been well studied, and coresets of constant size are known to exist. In contrast, very few results are known beyond Euclidean space, and it is a fundamental question which metric spaces admit constant-sized coresets for clustering. We focus on graph metrics, a common ambient space for clustering. We provide positive results asserting that constant-sized coresets exist for various families of graph metrics, including graphs of bounded treewidth, planar graphs, and the more general excluded-minor graphs.
\textsf{Missing Value}: Missing values are a common phenomenon in real data sets, and clustering in their presence is a challenging task. In this work, we construct the first coresets for clustering with multiple missing values per point. Previously, such coresets were only known to exist when each data point has at most one missing value \cite{DBLP:conf/nips/MaromF19}. We further design a near-linear time algorithm to construct our coresets. This algorithm yields the first near-linear time approximation scheme for \kMeans clustering with missing values, improving a recent result of \cite{DBLP:conf/soda/EibenFGLPS21}.
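As a small illustration of the missing-value setting, the sketch below (my own toy formulation, assuming the standard convention that a missing coordinate may be completed arbitrarily, so its best completion equals the center's coordinate and it contributes nothing to the distance) computes a k-means cost over incomplete points:

```python
def dist_sq_missing(point, center):
    """Squared distance from a point with missing entries (None) to a center.

    Assumption: a missing coordinate can be completed arbitrarily, so the
    optimal completion copies the center's value and contributes zero.
    """
    return sum((x - c) ** 2 for x, c in zip(point, center) if x is not None)

def kmeans_cost_missing(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(dist_sq_missing(p, c) for c in centers) for p in points)

pts = [(1.0, None), (None, 2.0), (1.0, 2.0)]
ctrs = [(1.0, 2.0)]
print(kmeans_cost_missing(pts, ctrs))  # -> 0.0: every observed entry matches
```

Under this objective every completion-consistent point is "free," which is what makes clustering with many missing values per point delicate.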
\textsf{Simultaneous Coresets}: Most classical coresets are tailored to a specific clustering objective. When there are multiple potential objectives, a stronger notion of “simultaneous coresets” is needed. Simultaneous coresets provide approximations for a whole family of objectives and can serve as a more flexible data reduction tool. In this work, we design the first simultaneous coresets for a large clustering family which includes both \kMedian and \kCenter.
A New Coreset Framework for Clustering
Given a metric space, the $(k,z)$-clustering problem consists of finding $k$
centers such that the sum, over all points, of the distance of each point to its
closest center raised to the power $z$ is minimized. This encapsulates the famous
$k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing
small-space sketches of the data that approximately preserve the cost of the
solutions, also known as \emph{coresets}, has been an important research
direction over the last 15 years.
In this paper, we present a new, simple coreset framework that simultaneously
improves upon the best known bounds in a large variety of settings, ranging
from Euclidean spaces and doubling metrics to minor-free metrics and general
metric spaces.
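The $(k,z)$ objective defined above can be written directly as a short function; this is a straightforward transcription of the definition, not code from the paper:

```python
def clustering_cost(points, centers, z):
    """(k, z)-clustering cost: sum over points of (distance to nearest center)^z."""
    def dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c)) ** 0.5
    return sum(min(dist(p, c) for c in centers) ** z for p in points)

pts = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
ctrs = [(0.0, 0.0), (10.0, 0.0)]
print(clustering_cost(pts, ctrs, 1))  # k-median (z=1): 0 + 5 + 0  -> 5.0
print(clustering_cost(pts, ctrs, 2))  # k-means  (z=2): 0 + 25 + 0 -> 25.0
```

Setting $z=1$ recovers $k$-median and $z=2$ recovers $k$-means, as in the abstract.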
The Power of Uniform Sampling for Coresets
Motivated by practical generalizations of the classic $k$-median and
$k$-means objectives, such as clustering with size constraints, fair
clustering, and Wasserstein barycenter, we introduce a meta-theorem for
designing coresets for constrained-clustering problems. The meta-theorem
reduces the task of coreset construction to one on a bounded number of ring
instances with a much-relaxed additive error. This reduction enables us to
construct coresets using uniform sampling, in contrast to the widely used
importance sampling, and consequently we can easily handle constrained
objectives. Notably, and perhaps surprisingly, this simpler sampling scheme can
yield coresets whose size is independent of $n$, the number of input points.
Our technique yields smaller coresets, and sometimes the first coresets, for
a large number of constrained clustering problems, including capacitated
clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in
minor-excluded graphs, and polygon clustering under Fr\'{e}chet and Hausdorff
distance. Finally, our technique also yields smaller coresets for $k$-median in
low-dimensional Euclidean spaces.
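The ring-instance idea can be sketched roughly as follows (a hypothetical simplification of mine, not the paper's construction: 1-D points are grouped into geometric distance rings around a rough center, and each ring is uniform-sampled with equal weights; the real meta-theorem controls an additive error per ring):

```python
import math
import random

def ring_uniform_coreset(points, center, m, seed=0):
    """Illustrative sketch: bucket 1-D points into geometric rings by distance
    to a rough center, then uniform-sample up to m points per ring, each
    carrying weight |ring| / m so that total weight is preserved."""
    rng = random.Random(seed)
    rings = {}
    for p in points:
        d = abs(p - center)
        j = -1 if d < 1e-9 else max(0, int(math.log2(d)))  # ring index ~ log of distance
        rings.setdefault(j, []).append(p)
    coreset = []
    for ring in rings.values():
        if len(ring) <= m:
            coreset += [(p, 1.0) for p in ring]          # small ring: keep everything
        else:
            coreset += [(p, len(ring) / m) for p in rng.sample(ring, m)]
    return coreset

random.seed(1)
data = [random.gauss(0, 1) for _ in range(5000)]
cs = ring_uniform_coreset(data, 0.0, 50)
total_weight = sum(w for _, w in cs)
print(len(cs), round(total_weight))  # far fewer points; weights sum to ~n
```

Within one ring all points are at comparable distance from the center, which is exactly why plain uniform sampling suffices there with only additive error.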
Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation
In all state-of-the-art sketching and coreset techniques for clustering, as
well as in the best known fixed-parameter tractable approximation algorithms,
randomness plays a key role. For the classic $k$-median and $k$-means problems,
there is no known deterministic dimensionality reduction procedure or coreset
construction that avoids an exponential dependency on the input dimension $d$
or on the precision parameter $\varepsilon$. Furthermore, there is no
coreset construction that succeeds with probability $1$ and whose size does
not depend on the number of input points, $n$. This has led researchers in the
area to ask what the power of randomness is for clustering sketches [Feldman,
WIREs Data Mining Knowl. Discov.'20]. Similarly, the best approximation ratios
achievable deterministically without a complexity exponential in the dimension
are substantially larger for both $k$-median and $k$-means, even when allowing a
complexity FPT in the number of clusters $k$. This stands in sharp contrast
with the $(1+\varepsilon)$-approximation achievable in that case when allowing
randomization.
In this paper, we provide deterministic sketch constructions for
clustering whose size bounds are close to the best known randomized ones. We
also construct a deterministic algorithm for computing a
$(1+\varepsilon)$-approximation to $k$-median and $k$-means in high-dimensional
Euclidean spaces, with running time close to the best randomized complexity.
Furthermore, our new insights on sketches also yield a randomized coreset
construction based on uniform sampling that immediately improves over the
recent results of [Braverman et al., FOCS'22]. (FOCS 2023.)
An Empirical Evaluation of k-Means Coresets
Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high-performance coresets for clustering problems such as k-means, in both theory and practice. Curiously, there exists no work comparing the quality of available k-means coresets.
In this paper we perform such an evaluation. There is currently no algorithm known to measure the distortion of a candidate coreset. We provide some evidence as to why this might be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging, and which also allows an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.
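One reason measuring distortion is hard is that it is a supremum over all candidate solutions; in practice one can only probe it. The sketch below (my own heuristic, not the paper's benchmark) lower-bounds a coreset's distortion by taking the worst cost ratio over random candidate center sets:

```python
import random

def cost(points, centers, weights=None):
    """Weighted k-means cost on 1-D points."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min((p - c) ** 2 for c in centers)
               for p, w in zip(points, weights))

def empirical_distortion(data, coreset, weights, k, trials=200, seed=0):
    """Heuristic LOWER bound on distortion: worst cost ratio over random
    candidate solutions. The true distortion is a sup over ALL solutions,
    which this probing cannot certify."""
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    worst = 1.0
    for _ in range(trials):
        centers = [rng.uniform(lo, hi) for _ in range(k)]
        r = cost(coreset, centers, weights) / cost(data, centers)
        worst = max(worst, r, 1.0 / r)
    return worst

random.seed(2)
data = [random.gauss(i % 3 * 5, 1) for i in range(3000)]  # three toy clusters
sample = random.sample(data, 300)
weights = [len(data) / len(sample)] * len(sample)
print(empirical_distortion(data, sample, weights, k=3))
```

Because only finitely many solutions are probed, a coreset can look good here and still fail on an adversarial solution, which is consistent with the abstract's point that certifying distortion appears computationally difficult.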
On Coresets for Fair Clustering in Metric and Euclidean Spaces and Their Applications
Fair clustering is a constrained variant of clustering where the goal is to
partition a set of colored points, such that the fraction of points of any
color in every cluster is more or less equal to the fraction of points of this
color in the dataset. This variant was recently introduced by Chierichetti et
al. [NeurIPS, 2017] in a seminal work and became widely popular in the
clustering literature. In this paper, we propose a new construction of coresets
for fair clustering based on random sampling. The new construction allows us to
obtain the first coreset for fair clustering in general metric spaces. For
Euclidean spaces, we obtain the first coreset whose size does not depend
exponentially on the dimension. Our coreset results solve open questions
proposed by Schmidt et al. [WAOA, 2019] and Huang et al. [NeurIPS, 2019]. The
new coreset construction helps to design several new approximation and
streaming algorithms. In particular, we obtain the first true
constant-factor approximation algorithm for metric fair clustering whose
running time is fixed-parameter tractable (FPT). In the Euclidean case, we
derive the first $(1+\varepsilon)$-approximation algorithm for fair clustering
whose time complexity is near-linear and does not depend exponentially on the
dimension of the space. Besides, our coreset construction scheme is fairly
general and gives rise to coresets for a wide range of constrained clustering
problems. This leads to improved constant-factor approximations for these
problems in general metrics and to near-linear time
$(1+\varepsilon)$-approximations in the Euclidean metric.
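The fairness constraint described above is easy to state in code; the helper below is my own illustration (with a hypothetical slack parameter `alpha` making "more or less equal" precise; exact proportionality is the `alpha = 0` special case):

```python
from collections import Counter

def is_fair(clusters, colors, alpha=0.2):
    """Check the fair-clustering constraint: in every cluster, each color's
    fraction stays within +/- alpha of that color's fraction in the dataset.

    clusters: list of clusters, each a list of point indices.
    colors:   colors[i] is the color of point i.
    """
    n = len(colors)
    global_frac = {c: cnt / n for c, cnt in Counter(colors).items()}
    for cluster in clusters:
        cnt = Counter(colors[i] for i in cluster)
        for c, gf in global_frac.items():
            f = cnt.get(c, 0) / len(cluster)
            if abs(f - gf) > alpha:
                return False
    return True

colors = ['r', 'b'] * 4                       # dataset is 50% 'r', 50% 'b'
balanced   = [[0, 1, 2, 3], [4, 5, 6, 7]]     # each cluster is 50/50
unbalanced = [[0, 2, 4, 6], [1, 3, 5, 7]]     # all-'r' vs all-'b' clusters
print(is_fair(balanced, colors), is_fair(unbalanced, colors))  # True False
```

A fair-clustering coreset must preserve costs only over partitions passing such a check, which is what makes the constrained setting harder than vanilla $k$-median.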
Towards Optimal Coreset Construction for $(k,z)$-Clustering: Breaking the Quadratic Dependency on $k$
Constructing small-sized coresets for various clustering problems has
attracted significant attention recently. We provide efficient coreset
construction algorithms for $(k,z)$-Clustering with improved coreset sizes in
several metric spaces. In particular, we provide a coreset for
$(k,z)$-Clustering for all $z \ge 1$ in Euclidean space whose size improves upon
the best known upper bound [Cohen-Addad, Larsen, Saulpic, Schwiegelshohn,
STOC'22], breaking the quadratic dependency on $k$ for the first time. For
example, our coreset sizes for Euclidean $k$-Median and $k$-Means improve upon
the best known results by polynomial factors in the relevant parameter regimes.
We also obtain optimal or improved coreset sizes for general metric spaces,
metric spaces with bounded doubling dimension, and the shortest-path metric when
the underlying graph has bounded treewidth, for all $z \ge 1$. Our algorithm
largely follows the framework developed by Cohen-Addad et al., with some minor
but useful changes. Our technical contribution mainly lies in the analysis. An
important improvement in our analysis is a new notion of covering for distance
vectors with a novel error metric, which allows us to provide a tighter
variance bound. Another useful technical ingredient is terminal embedding with
additive errors, for bounding the covering number in the Euclidean case.
Coresets for Clustering with General Assignment Constraints
Designing small-sized \emph{coresets}, which approximately preserve the costs
of the solutions for large datasets, has been an important research direction
for the past decade. We consider coreset construction for a variety of general
constrained clustering problems. We significantly extend and generalize the
results of a very recent paper (Braverman et al., FOCS'22) by demonstrating
that the idea of hierarchical uniform sampling (Chen, SICOMP'09; Braverman et
al., FOCS'22) can be applied to efficiently construct coresets for a very
general class of constrained clustering problems with general assignment
constraints, including capacity constraints on cluster centers and assignment
structure constraints for data points (modeled by a convex body $\mathcal{B}$).
Our main theorem shows that a small-sized $\epsilon$-coreset exists as long
as a complexity measure of the structure constraint and the \emph{covering
exponent} of the metric space are bounded. The complexity measure for a convex
body $\mathcal{B}$ is the Lipschitz constant of a certain transportation
problem constrained in $\mathcal{B}$, called the \emph{optimal assignment
transportation problem}. We prove nontrivial upper bounds of this complexity
measure for various polytopes, including general matroid basis polytopes and
laminar matroid polytopes (with a better bound). As an application of our
general theorem, we construct the first coresets for the fault-tolerant
clustering problem (with or without capacity upper/lower bounds) for the above
metric spaces, in which the fault-tolerance requirement is captured by a
uniform matroid basis polytope.