Search CORE

10 research outputs found

Fully-Dynamic Coresets

Author: Henzinger Monika
Kale Sagar
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 28th Annual European Symposium on Algorithms (ESA 2020)
Publication date: 01/01/2020
Field of study

With input sizes becoming massive, coresets -- small yet representative summary of the input -- are relevant more than ever. A weighted set

C_w

that is a subset of the input is an

\varepsilon

-coreset if the cost of any feasible solution

S

with respect to

C_w

is within

[1 {\pm} \varepsilon]

of the cost of

S

with respect to the original input. We give a very general technique to compute coresets in the fully-dynamic setting where input points can be added or deleted. Given a static

\varepsilon

-coreset algorithm that runs in time

t(n, \varepsilon, \lambda)

and computes a coreset of size

s(n, \varepsilon, \lambda)

, where

n

is the number of input points and

1 {-}\lambda

is the success probability, we give a fully-dynamic algorithm that computes an

\varepsilon

-coreset with worst-case update time

O((\log n) \cdot t(s(n, \varepsilon/\log n, \lambda/n), \varepsilon/\log n, \lambda/n) )

(this bound is stated informally), where the success probability is

1{-}\lambda

. Our technique is a fully-dynamic analog of the merge-and-reduce technique that applies to insertion-only setting. Although our space usage is

O(n)

, we work in the presence of an adaptive adversary, and we show that

\Omega(n)

space is required when adversary is adaptive. As a consequence, we get fully-dynamic

\varepsilon

-coreset algorithms for

k

-median and

k

-means with worst-case update time

O(\varepsilon^{-2}k^2\log^5 n \log^3 k)

and coreset size

O(\varepsilon^{-2}k\log n \log^2 k)

ignoring

\log \log n

and

\log(1/\varepsilon)

factors and assuming that

\varepsilon, \lambda = \Omega(1/

poly

(n))

. These are the first fully-dynamic algorithms for

k

-median and

k

-means with worst-case update times

O(

poly

(k, \log n, \varepsilon^{-1}))

. We also give conditional lower bound on update/query time for any fully-dynamic

(4 - \delta)

-approximation algorithm for

k

-means.Comment: Added missed important reference. Abstract is shortene

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Experimental Evaluation of Fully Dynamic k-Means via Coresets

Author: Henzinger Monika
Saulpic David
Sidl Leonhard
Publication venue
Publication date: 27/10/2023
Field of study

For a set of points in

\mathbb{R}^d

, the Euclidean

k

-means problems consists of finding

k

centers such that the sum of distances squared from each data point to its closest center is minimized. Coresets are one the main tools developed recently to solve this problem in a big data context. They allow to compress the initial dataset while preserving its structure: running any algorithm on the coreset provides a guarantee almost equivalent to running it on the full data. In this work, we study coresets in a fully-dynamic setting: points are added and deleted with the goal to efficiently maintain a coreset with which a k-means solution can be computed. Based on an algorithm from Henzinger and Kale [ESA'20], we present an efficient and practical implementation of a fully dynamic coreset algorithm, that improves the running time by up to a factor of 20 compared to our non-optimized implementation of the algorithm by Henzinger and Kale, without sacrificing more than 7% on the quality of the k-means solution.Comment: Accepted at ALENEX 2

arXiv.org e-Print Archive

Fully Dynamic $k$ -Clustering in $\tilde O(k)$ Update Time

Author: Bhattacharya Sayan
Costa Martín
Lattanzi Silvio
Parotsidis Nikos
Publication venue
Publication date: 26/10/2023
Field of study

We present a

O(1)

-approximate fully dynamic algorithm for the

k

-median and

k

-means problems on metric spaces with amortized update time

\tilde O(k)

and worst-case query time

\tilde O(k^2)

. We complement our theoretical analysis with the first in-depth experimental study for the dynamic

k

-median problem on general metrics, focusing on comparing our dynamic algorithm to the current state-of-the-art by Henzinger and Kale [ESA'20]. Finally, we also provide a lower bound for dynamic

k

-median which shows that any

O(1)

-approximate algorithm with

\tilde O(\text{poly}(k))

query time must have

\tilde \Omega(k)

amortized update time, even in the incremental setting.Comment: Accepted at NeurIPS 202

arXiv.org e-Print Archive

Dynamic algorithms for k-center on graphs

Author: Cruciani Emilio
Forster Sebastian
Goranci Gramoz
Nazari Yasamin
Skarlatos Antonis
Publication venue
Publication date: 28/07/2023
Field of study

In this paper we give the first efficient algorithms for the

k

-center problem on dynamic graphs undergoing edge updates. In this problem, the goal is to partition the input into

k

sets by choosing

k

centers such that the maximum distance from any data point to the closest center is minimized. It is known that it is NP-hard to get a better than

2

approximation for this problem. While in many applications the input may naturally be modeled as a graph, all prior works on

k

-center problem in dynamic settings are on metrics. In this paper, we give a deterministic decremental

(2+\epsilon)

-approximation algorithm and a randomized incremental

(4+\epsilon)

-approximation algorithm, both with amortized update time

kn^{o(1)}

for weighted graphs. Moreover, we show a reduction that leads to a fully dynamic

(2+\epsilon)

-approximation algorithm for the

k

-center problem, with worst-case update time that is within a factor

k

of the state-of-the-art upper bound for maintaining

(1+\epsilon)

-approximate single-source distances in graphs. Matching this bound is a natural goalpost because the approximate distances of each vertex to its center can be used to maintain a

(2+\epsilon)

-approximation of the graph diameter and the fastest known algorithms for such a diameter approximation also rely on maintaining approximate single-source distances

arXiv.org e-Print Archive

Differential Privacy for Clustering Under Continual Observation

Author: Henzinger Monika
la Tour Max Dupré
Saulpic David
Publication venue
Publication date: 07/07/2023
Field of study

We consider the problem of clustering privately a dataset in

\mathbb{R}^d

that undergoes both insertion and deletion of points. Specifically, we give an

\varepsilon

-differentially private clustering mechanism for the

k

-means objective under continual observation. This is the first approximation algorithm for that problem with an additive error that depends only logarithmically in the number

T

of updates. The multiplicative error is almost the same as non privately. To do so we show how to perform dimension reduction under continual observation and combine it with a differentially private greedy approximation algorithm for

k

-means. We also partially extend our results to the

k

-median problem

arXiv.org e-Print Archive

Approximating edit distance in the fully dynamic model

Author: Kociumaka Tomasz
Mukherjee Anish
Saha Barna
Publication venue: IEEE
Publication date: 22/12/2023
Field of study

Warwick Research Archives Portal Repository

Coresets for Clustering with General Assignment Constraints

Author: Huang Lingxiao
Jiang Shaofeng H. -C.
Li Jian
Wu Xuan
Publication venue
Publication date: 23/01/2023
Field of study

Designing small-sized \emph{coresets}, which approximately preserve the costs of the solutions for large datasets, has been an important research direction for the past decade. We consider coreset construction for a variety of general constrained clustering problems. We significantly extend and generalize the results of a very recent paper (Braverman et al., FOCS'22), by demonstrating that the idea of hierarchical uniform sampling (Chen, SICOMP'09; Braverman et al., FOCS'22) can be applied to efficiently construct coresets for a very general class of constrained clustering problems with general assignment constraints, including capacity constraints on cluster centers, and assignment structure constraints for data points (modeled by a convex body

\mathcal{B})

. Our main theorem shows that a small-sized

\epsilon

-coreset exists as long as a complexity measure

\mathsf{Lip}(\mathcal{B})

of the structure constraint, and the \emph{covering exponent}

\Lambda_\epsilon(\mathcal{X})

for metric space

(\mathcal{X},d)

are bounded. The complexity measure

\mathsf{Lip}(\mathcal{B})

for convex body

\mathcal{B}

is the Lipschitz constant of a certain transportation problem constrained in

\mathcal{B}

, called \emph{optimal assignment transportation problem}. We prove nontrivial upper bounds of

\mathsf{Lip}(\mathcal{B})

for various polytopes, including the general matroid basis polytopes, and laminar matroid polytopes (with better bound). As an application of our general theorem, we construct the first coreset for the fault-tolerant clustering problem (with or without capacity upper/lower bound) for the above metric spaces, in which the fault-tolerance requirement is captured by a uniform matroid basis polytope

arXiv.org e-Print Archive

Approximating Edit Distance in the Fully Dynamic Model

Author: Kociumaka Tomasz
Mukherjee Anish
Saha Barna
Publication venue
Publication date: 14/07/2023
Field of study

The edit distance is a fundamental measure of sequence similarity, defined as the minimum number of character insertions, deletions, and substitutions needed to transform one string into the other. Given two strings of length at most

n

, simple dynamic programming computes their edit distance exactly in

O(n^2)

time, which is also the best possible (up to subpolynomial factors) assuming the Strong Exponential Time Hypothesis (SETH). The last few decades have seen tremendous progress in edit distance approximation, where the runtime has been brought down to subquadratic, near-linear, and even sublinear at the cost of approximation. In this paper, we study the dynamic edit distance problem, where the strings change dynamically as the characters are substituted, inserted, or deleted over time. Each change may happen at any location of either of the two strings. The goal is to maintain the (exact or approximate) edit distance of such dynamic strings while minimizing the update time. The exact edit distance can be maintained in

\tilde{O}(n)

time per update (Charalampopoulos, Kociumaka, Mozes; 2020), which is again tight assuming SETH. Unfortunately, even with the unprecedented progress in edit distance approximation in the static setting, strikingly little is known regarding dynamic edit distance approximation. Utilizing the off-the-shelf tools, it is possible to achieve an

O(n^{c})

-approximation in

n^{0.5-c+o(1)}

update time for any constant

c\in [0,\frac16]

. Improving upon this trade-off remains open. The contribution of this work is a dynamic

n^{o(1)}

-approximation algorithm with amortized expected update time of

n^{o(1)}

. In other words, we bring the approximation-ratio and update-time product down to

n^{o(1)}

. Our solution utilizes an elegant framework of precision sampling tree for edit distance approximation (Andoni, Krauthgamer, Onak; 2010).Comment: Accepted to FOCS 202

arXiv.org e-Print Archive

Coresets for Clustering: Foundations and Challenges

Author: Wu Xuan
Publication venue: 'The Busan Gyeongnam Mathematical Society'
Publication date: 25/07/2022
Field of study

Clustering is a fundamental task in machine learning and data analysis. The main challenge for clustering in big data sets is that classical clustering algorithms often do not scale well. Coresets are data reduction techniques that turn big data into a tiny proxy. Prior research has shown that coresets can provide a scalable solution to clustering problems and imply streaming and distributed algorithms. In this work, we aim to solve a fundamental question and two modern challenges in coresets for clustering. \textsf{Beyond Euclidean Space}: Coresets for Clustering in Euclidean space have been well studied and coresets of constant size are known to exist. While very few results are known beyond Euclidean space. It becomes a fundamental problem that what kind of metric space admits constant-sized coresets for clustering. We focus on graph metrics which is a common ambient space for clustering. We provide positive results that assert constant-sized coresets exist in various families of graph metrics including graphs of bounded treewidth, planar graphs and the more general excluded-minor graphs. \textsf{Missing Value}: Missing value is a common phenomenon in real data sets. Clustering under the existence of missing values is a very challenging task. In this work, we construct the first coresets for clustering with multiple missing values. Previously, such coresets were only known to exist when each data point has at most one missing value \cite{DBLP:conf/nips/MaromF19}. We further design a near-linear time algorithm to construct our coresets. This algorithm implies the first near-linear time approximation scheme for \kMeans clustering with missing values and improves a recent result by \cite{DBLP:conf/soda/EibenFGLPS21}. \textsf{Simultaneous Coresets}: Most classical coresets are limited to a specific clustering objective. When there are multiple potential objectives, a stronger notion of “simultaneous coresets” is needed. Simultaneous coresets provide the approximations for a family of objectives and can serve as a more flexible data reduction tool. In this work, we design the first simultaneous coresets for a large clustering family which includes both \kMedian and \kCenter

JScholarship