Fully-Dynamic Coresets
With input sizes becoming massive, coresets -- small yet representative
summaries of the input -- are relevant more than ever. A weighted set $S$ that
is a subset of the input is an $\varepsilon$-coreset if the cost of any
feasible solution $C$ with respect to $S$ is within a factor of $1 \pm \varepsilon$
of the cost of $C$ with respect to the original input. We give a very general
technique to compute coresets in the fully-dynamic setting, where input points
can be added or deleted. Given a static $\varepsilon$-coreset algorithm that
runs in time $t(n, \varepsilon, \lambda)$ and computes a coreset of size $s(n, \varepsilon, \lambda)$, where $n$ is the number of input points and $1 - \lambda$ is the success probability, we give a fully-dynamic algorithm that
computes an $\varepsilon$-coreset with a worst-case update time governed by $t$ and $s$ (this
bound is stated informally), where the success probability is $1 - \lambda$.
Our technique is a fully-dynamic analog of the merge-and-reduce technique, which
applies to the insertion-only setting. Although our space usage is $\tilde{O}(n)$, we work
in the presence of an adaptive adversary, and we show that $\Omega(n)$ space is
required when the adversary is adaptive. As a consequence, we get fully-dynamic
$\varepsilon$-coreset algorithms for $k$-median and $k$-means with worst-case
update time and coreset size $\tilde{O}(k^2)$,
ignoring $\varepsilon^{-1}$ and $\log n$
factors and assuming that $1/\lambda$ is polynomial in $n$. These are the first fully-dynamic algorithms for
$k$-median and $k$-means with worst-case update times $\mathrm{poly}(k, \varepsilon^{-1}, \log n)$. We also give a conditional lower bound on the update/query time
for any fully-dynamic $O(1)$-approximation algorithm for $k$-means.
Comment: Added missed important reference. Abstract is shortened.
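As background, the insertion-only merge-and-reduce scheme that this paper dynamizes can be sketched in a few lines. The names below (`MergeAndReduce`, `static_coreset`) are ours, and the placeholder `static_coreset` is plain uniform sampling, which does not by itself provide the $(1 \pm \varepsilon)$ guarantee; a real construction would use, e.g., sensitivity sampling with reweighting:

```python
import random

def static_coreset(points, size):
    """Hypothetical stand-in for a static coreset construction:
    plain uniform sampling (no coreset guarantee by itself)."""
    if len(points) <= size:
        return list(points)
    return random.sample(list(points), size)

class MergeAndReduce:
    """Insertion-only merge-and-reduce: like a binary counter, level i
    holds one coreset summarizing 2**i blocks of input; two coresets at
    the same level are merged and reduced into one at the next level."""
    def __init__(self, block_size, coreset_size):
        self.block_size = block_size
        self.coreset_size = coreset_size
        self.buffer = []      # raw points not yet summarized
        self.levels = []      # levels[i] is None or a level-i coreset

    def insert(self, p):
        self.buffer.append(p)
        if len(self.buffer) == self.block_size:
            carry = static_coreset(self.buffer, self.coreset_size)
            self.buffer = []
            i = 0
            # binary-counter style carry propagation up the levels
            while i < len(self.levels) and self.levels[i] is not None:
                carry = static_coreset(self.levels[i] + carry,
                                       self.coreset_size)
                self.levels[i] = None
                i += 1
            if i == len(self.levels):
                self.levels.append(carry)
            else:
                self.levels[i] = carry

    def coreset(self):
        """Union of the buffer and all occupied levels."""
        out = list(self.buffer)
        for level in self.levels:
            if level is not None:
                out += level
        return out
```

Since each inserted point participates in only $O(\log n)$ merge steps, the amortized work per insertion stays small; the paper's contribution is obtaining such guarantees worst-case and under deletions.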
Experimental Evaluation of Fully Dynamic k-Means via Coresets
For a set of points in $\mathbb{R}^d$, the Euclidean $k$-means problem
consists of finding $k$ centers such that the sum of squared distances from
each data point to its closest center is minimized. Coresets are one of the main
tools developed recently to solve this problem in a big-data context. They
allow one to compress the initial dataset while preserving its structure: running
any algorithm on the coreset provides a guarantee almost equivalent to running
it on the full data.
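The objective above, and the coreset guarantee it comes with (a coreset's weighted cost is within a $1 \pm \varepsilon$ factor of the full cost for every candidate set of centers), can be written down directly. A minimal sketch in plain Python; the function names are ours:

```python
def kmeans_cost(points, centers):
    """k-means objective: sum over points of the squared Euclidean
    distance to the closest center."""
    return sum(
        min(sum((x - c) ** 2 for x, c in zip(p, ctr)) for ctr in centers)
        for p in points
    )

def weighted_cost(weighted_points, centers):
    """The same objective evaluated on a weighted summary (e.g. a
    coreset), given as (point, weight) pairs."""
    return sum(
        w * min(sum((x - c) ** 2 for x, c in zip(p, ctr)) for ctr in centers)
        for p, w in weighted_points
    )
```

Running a k-means solver against `weighted_cost` on a small coreset, instead of `kmeans_cost` on the full input, is what makes the compression useful.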
In this work, we study coresets in a fully-dynamic setting: points are added
and deleted with the goal of efficiently maintaining a coreset from which a
k-means solution can be computed. Based on an algorithm of Henzinger and Kale
[ESA'20], we present an efficient and practical implementation of a fully
dynamic coreset algorithm that improves the running time by up to a factor of
20 compared to our non-optimized implementation of the algorithm of Henzinger
and Kale, without sacrificing more than 7% of the quality of the k-means
solution.
Comment: Accepted at ALENEX 2
Fully Dynamic $k$-Clustering in $\tilde{O}(k)$ Update Time
We present an $O(1)$-approximate fully dynamic algorithm for the $k$-median
and $k$-means problems on metric spaces with amortized update time $\tilde{O}(k)$ and worst-case query time $\tilde{O}(k^2)$. We complement our theoretical
analysis with the first in-depth experimental study for the dynamic $k$-median
problem on general metrics, focusing on comparing our dynamic algorithm to the
current state-of-the-art by Henzinger and Kale [ESA'20]. Finally, we also
provide a lower bound for dynamic $k$-median which shows that any
$O(1)$-approximate algorithm with $\tilde{O}(k)$ query time must
have $\tilde{\Omega}(k)$ amortized update time, even in the incremental setting.
Comment: Accepted at NeurIPS 202
Dynamic algorithms for k-center on graphs
In this paper we give the first efficient algorithms for the $k$-center
problem on dynamic graphs undergoing edge updates. In this problem, the goal is
to partition the input into $k$ sets by choosing $k$ centers such that the
maximum distance from any data point to its closest center is minimized. It is
known to be NP-hard to achieve a better-than-$2$ approximation for this
problem.
While in many applications the input may naturally be modeled as a graph, all
prior works on the $k$-center problem in dynamic settings are on metrics. In this
paper, we give a deterministic decremental $(2+\epsilon)$-approximation
algorithm and a randomized incremental $(4+\epsilon)$-approximation algorithm,
both with amortized update time $kn^{o(1)}$ for weighted graphs. Moreover, we
show a reduction that leads to a fully dynamic $(2+\epsilon)$-approximation
algorithm for the $k$-center problem, with worst-case update time that is
within a factor of $k$ of the state-of-the-art upper bound for maintaining
$(1+\epsilon)$-approximate single-source distances in graphs. Matching this
bound is a natural goalpost because the approximate distances of each vertex to
its center can be used to maintain a $(2+\epsilon)$-approximation of the graph
diameter, and the fastest known algorithms for such a diameter approximation
also rely on maintaining approximate single-source distances.
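For orientation, the static benchmark that such dynamic algorithms compete with is the classical farthest-first traversal (Gonzalez), which achieves a $2$-approximation for metric $k$-center. A minimal sketch, not taken from the paper:

```python
def k_center_gonzalez(dist, points, k):
    """Farthest-first traversal: repeatedly promote the point farthest
    from the current centers. For any metric `dist`, the resulting
    covering radius is at most twice the optimal k-center radius."""
    centers = [points[0]]                      # arbitrary first center
    d = {p: dist(p, centers[0]) for p in points}
    while len(centers) < k:
        far = max(points, key=lambda p: d[p])  # current farthest point
        centers.append(far)
        for p in points:                       # refresh nearest-center distances
            d[p] = min(d[p], dist(p, far))
    return centers
```

On a graph, `dist` would itself be a shortest-path distance, which is exactly why the dynamic graph setting ties the problem to maintaining approximate single-source distances.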
Differential Privacy for Clustering Under Continual Observation
We consider the problem of privately clustering a dataset in $\mathbb{R}^d$
that undergoes both insertion and deletion of points. Specifically, we give an
$\varepsilon$-differentially private clustering mechanism for the $k$-means
objective under continual observation. This is the first approximation
algorithm for that problem with an additive error that depends only
logarithmically on the number of updates. The multiplicative error is
almost the same as in the non-private setting. To do so, we show how to perform dimension
reduction under continual observation and combine it with a differentially
private greedy approximation algorithm for $k$-means. We also partially extend
our results to the $k$-median problem.
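The continual-observation model referenced above originates in private counting, where the classical binary-tree mechanism is the standard way to get additive error that is only polylogarithmic in the number of updates. A rough sketch of that background mechanism (not the paper's clustering mechanism; class and helper names are ours):

```python
import math
import random

def laplace(scale):
    """Sample Laplace(0, scale) noise via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

class BinaryMechanism:
    """Binary-tree mechanism for epsilon-DP counting under continual
    observation: every prefix count is assembled from O(log T) noisy
    dyadic partial sums, so the additive error grows only
    polylogarithmically with the number of updates T."""
    def __init__(self, epsilon, T):
        self.levels = max(1, math.ceil(math.log2(T))) + 1
        self.scale = self.levels / epsilon  # noise per dyadic sum
        self.partial = {}                   # (level, index) -> noisy sum
        self.items = []                     # kept only for this sketch

    def update(self, x):
        self.items.append(x)
        t = len(self.items)
        # (re)compute every dyadic interval that ends at time t
        for level in range(self.levels):
            size = 1 << level
            if t % size == 0:
                noisy = sum(self.items[t - size:t]) + laplace(self.scale)
                self.partial[(level, t // size)] = noisy

    def noisy_count(self):
        """Assemble the prefix sum from the binary expansion of t."""
        t = len(self.items)
        total, pos = 0.0, 0
        for level in reversed(range(self.levels)):
            if (t >> level) & 1:
                size = 1 << level
                total += self.partial[(level, (pos + size) >> level)]
                pos += size
        return total
```

The paper's contribution is achieving an analogous "error grows only logarithmically with the updates" guarantee for the much harder $k$-means objective, via dimension reduction under continual observation.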
Coresets for Clustering with General Assignment Constraints
Designing small-sized \emph{coresets}, which approximately preserve the costs
of the solutions for large datasets, has been an important research direction
for the past decade. We consider coreset construction for a variety of general
constrained clustering problems. We significantly extend and generalize the
results of a very recent paper (Braverman et al., FOCS'22) by demonstrating
that the idea of hierarchical uniform sampling (Chen, SICOMP'09; Braverman et
al., FOCS'22) can be applied to efficiently construct coresets for a very
general class of constrained clustering problems with general assignment
constraints, including capacity constraints on cluster centers and assignment
structure constraints for data points (modeled by a convex body $\mathcal{B}$).
Our main theorem shows that a small-sized $\varepsilon$-coreset exists as long
as a complexity measure $\mathsf{Lip}(\mathcal{B})$ of the structure
constraint and the \emph{covering exponent}
$\Lambda_\varepsilon(\mathcal{X})$ for the metric space $\mathcal{X}$ are bounded. The complexity measure
$\mathsf{Lip}(\mathcal{B})$ of a convex body $\mathcal{B}$ is the Lipschitz
constant of a certain transportation problem constrained in $\mathcal{B}$,
called the \emph{optimal assignment transportation problem}. We prove nontrivial
upper bounds of $\mathsf{Lip}(\mathcal{B})$ for various polytopes, including
general matroid basis polytopes and laminar matroid polytopes (with a better
bound). As an application of our general theorem, we construct the first
coreset for the fault-tolerant clustering problem (with or without capacity
upper/lower bounds) for the above metric spaces, in which the fault-tolerance
requirement is captured by a uniform matroid basis polytope.
Approximating Edit Distance in the Fully Dynamic Model
The edit distance is a fundamental measure of sequence similarity, defined as
the minimum number of character insertions, deletions, and substitutions needed
to transform one string into the other. Given two strings of length at most
$n$, simple dynamic programming computes their edit distance exactly in
$O(n^2)$ time, which is also the best possible (up to subpolynomial factors)
assuming the Strong Exponential Time Hypothesis (SETH). The last few decades
have seen tremendous progress in edit distance approximation, where the runtime
has been brought down to subquadratic, near-linear, and even sublinear at the
cost of approximation.
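The quadratic-time dynamic program mentioned above is short enough to state in full; a standard textbook implementation using a rolling row:

```python
def edit_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic program. prev[j] holds the
    edit distance between a[:i-1] and b[:j]; cur[j] builds the row for
    a[:i]."""
    prev = list(range(len(b) + 1))       # distance from "" to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete a[i-1]
                cur[j - 1] + 1,          # insert b[j-1]
                prev[j - 1] + (ca != cb) # substitute (or match for free)
            ))
        prev = cur
    return prev[len(b)]
```

For example, `edit_distance("kitten", "sitting")` is 3 (substitute k→s, substitute e→i, insert g); the dynamic results below maintain such values as either string changes.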
In this paper, we study the dynamic edit distance problem, where the strings
change dynamically as characters are substituted, inserted, or deleted over
time. Each change may happen at any location of either of the two strings. The
goal is to maintain the (exact or approximate) edit distance of such dynamic
strings while minimizing the update time. The exact edit distance can be
maintained in $\tilde{O}(n)$ time per update (Charalampopoulos, Kociumaka,
Mozes; 2020), which is again tight assuming SETH. Unfortunately, even with the
unprecedented progress in edit distance approximation in the static setting,
strikingly little is known regarding dynamic edit distance approximation.
Utilizing off-the-shelf tools, it is possible to trade the approximation ratio
against the update time, achieving a polynomial approximation factor in
sublinear update time for any constant trade-off exponent. Improving upon this trade-off remains open.
The contribution of this work is a dynamic $n^{o(1)}$-approximation algorithm
with an amortized expected update time of $n^{o(1)}$. In other words, we bring the
product of the approximation ratio and the update time down to $n^{o(1)}$. Our solution
utilizes the elegant precision sampling tree framework for edit distance
approximation (Andoni, Krauthgamer, Onak; 2010).
Comment: Accepted to FOCS 202
Coresets for Clustering: Foundations and Challenges
Clustering is a fundamental task in machine learning and data analysis. The main challenge for clustering in big data sets is that classical clustering algorithms often do not scale well. Coresets are data reduction techniques that turn big data into a tiny proxy. Prior research has shown that coresets can provide a scalable solution to clustering problems and imply streaming and distributed algorithms. In this work, we aim to solve a fundamental question and two modern challenges in coresets for clustering.
\textsf{Beyond Euclidean Space}: Coresets for clustering in Euclidean space have been well studied, and coresets of constant size are known to exist, while very few results are known beyond Euclidean space. Which metric spaces admit constant-sized coresets for clustering thus becomes a fundamental question. We focus on graph metrics, a common ambient space for clustering. We provide positive results asserting that constant-sized coresets exist in various families of graph metrics, including graphs of bounded treewidth, planar graphs, and the more general excluded-minor graphs.
\textsf{Missing Value}: Missing values are a common phenomenon in real data sets, and clustering in their presence is a very challenging task. In this work, we construct the first coresets for clustering with multiple missing values. Previously, such coresets were only known to exist when each data point has at most one missing value \cite{DBLP:conf/nips/MaromF19}. We further design a near-linear time algorithm to construct our coresets. This algorithm implies the first near-linear time approximation scheme for \kMeans clustering with missing values and improves a recent result of
\cite{DBLP:conf/soda/EibenFGLPS21}.
\textsf{Simultaneous Coresets}: Most classical coresets are limited to a specific clustering objective. When there are multiple potential objectives, a stronger notion of ``simultaneous coresets'' is needed. Simultaneous coresets provide approximations for a family of objectives and can serve as a more flexible data-reduction tool. In this work, we design the first simultaneous coresets for a large clustering family that includes both \kMedian and \kCenter.