
    Fully-Dynamic Coresets

    With input sizes becoming massive, coresets -- small yet representative summaries of the input -- are more relevant than ever. A weighted set $C_w$ that is a subset of the input is an $\varepsilon$-coreset if the cost of any feasible solution $S$ with respect to $C_w$ is within $[1 \pm \varepsilon]$ of the cost of $S$ with respect to the original input. We give a very general technique to compute coresets in the fully-dynamic setting, where input points can be added or deleted. Given a static $\varepsilon$-coreset algorithm that runs in time $t(n, \varepsilon, \lambda)$ and computes a coreset of size $s(n, \varepsilon, \lambda)$, where $n$ is the number of input points and $1-\lambda$ is the success probability, we give a fully-dynamic algorithm that computes an $\varepsilon$-coreset with worst-case update time $O((\log n) \cdot t(s(n, \varepsilon/\log n, \lambda/n), \varepsilon/\log n, \lambda/n))$ (this bound is stated informally), where the success probability is $1-\lambda$. Our technique is a fully-dynamic analog of the merge-and-reduce technique, which applies only to the insertion-only setting. Although our space usage is $O(n)$, we work in the presence of an adaptive adversary, and we show that $\Omega(n)$ space is required when the adversary is adaptive. As a consequence, we get fully-dynamic $\varepsilon$-coreset algorithms for $k$-median and $k$-means with worst-case update time $O(\varepsilon^{-2} k^2 \log^5 n \log^3 k)$ and coreset size $O(\varepsilon^{-2} k \log n \log^2 k)$, ignoring $\log\log n$ and $\log(1/\varepsilon)$ factors and assuming that $\varepsilon, \lambda = \Omega(1/\mathrm{poly}(n))$. These are the first fully-dynamic algorithms for $k$-median and $k$-means with worst-case update times $O(\mathrm{poly}(k, \log n, \varepsilon^{-1}))$. We also give a conditional lower bound on the update/query time of any fully-dynamic $(4-\delta)$-approximation algorithm for $k$-means.
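
    For context, here is a minimal Python sketch of the insertion-only merge-and-reduce technique that the paper's method generalizes. The routine static_coreset is a hypothetical stand-in for any static $\varepsilon$-coreset algorithm (it merely subsamples for illustration and carries no accuracy guarantee):

        import random

        def static_coreset(points, eps, target_size=50):
            # Hypothetical stand-in: a real static coreset routine would return
            # weighted points with a provable (1 +/- eps) cost guarantee.
            if len(points) <= target_size:
                return list(points)
            return random.sample(points, target_size)

        class MergeAndReduce:
            """Insertion-only coreset maintenance via merge-and-reduce."""

            def __init__(self, eps, bucket_size=50):
                self.eps = eps
                self.bucket_size = bucket_size
                self.buckets = []  # buckets[i] summarizes ~2^i * bucket_size points
                self.buffer = []

            def insert(self, point):
                self.buffer.append(point)
                if len(self.buffer) < self.bucket_size:
                    return
                carry = static_coreset(self.buffer, self.eps)
                self.buffer = []
                i = 0
                # Like binary-counter addition: merge equal-rank coresets,
                # then reduce the merged set with the static algorithm.
                while i < len(self.buckets) and self.buckets[i] is not None:
                    carry = static_coreset(self.buckets[i] + carry, self.eps)
                    self.buckets[i] = None
                    i += 1
                if i == len(self.buckets):
                    self.buckets.append(carry)
                else:
                    self.buckets[i] = carry

            def coreset(self):
                out = list(self.buffer)
                for b in self.buckets:
                    if b is not None:
                        out.extend(b)
                return out

    Since errors compound over the $O(\log n)$ levels of this hierarchy, each level must be run with accuracy $\varepsilon/\log n$, which is where the $\varepsilon/\log n$ terms in the update-time bound above come from.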

    Experimental Evaluation of Fully Dynamic k-Means via Coresets

    For a set of points in $\mathbb{R}^d$, the Euclidean $k$-means problem consists of finding $k$ centers such that the sum of squared distances from each data point to its closest center is minimized. Coresets are one of the main tools developed recently to solve this problem in a big-data context. They make it possible to compress the initial dataset while preserving its structure: running any algorithm on the coreset provides a guarantee almost equivalent to running it on the full data. In this work, we study coresets in a fully-dynamic setting: points are added and deleted, with the goal of efficiently maintaining a coreset from which a $k$-means solution can be computed. Based on an algorithm from Henzinger and Kale [ESA'20], we present an efficient and practical implementation of a fully dynamic coreset algorithm that improves the running time by up to a factor of 20 compared to our non-optimized implementation of the algorithm by Henzinger and Kale, without sacrificing more than 7% of the quality of the $k$-means solution.
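
    To make the guarantee concrete, here is a minimal sketch (not the paper's implementation) of evaluating the $k$-means objective on the full data versus a weighted coreset; a valid $\varepsilon$-coreset keeps the two costs within a $1 \pm \varepsilon$ factor for every candidate center set:

        import numpy as np

        def kmeans_cost(points, centers, weights=None):
            # Sum of squared distances from each point to its nearest center,
            # optionally weighted (coreset points carry weights).
            points = np.asarray(points, dtype=float)
            centers = np.asarray(centers, dtype=float)
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            nearest = d2.min(axis=1)  # squared distance to closest center
            if weights is None:
                return float(nearest.sum())
            return float(np.asarray(weights, dtype=float) @ nearest)

        # (C, w) is an eps-coreset of P if, for every center set S,
        # kmeans_cost(C, S, w) is within (1 +/- eps) of kmeans_cost(P, S).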

    Fully Dynamic $k$-Clustering in $\tilde{O}(k)$ Update Time

    We present an $O(1)$-approximate fully dynamic algorithm for the $k$-median and $k$-means problems on metric spaces with amortized update time $\tilde{O}(k)$ and worst-case query time $\tilde{O}(k^2)$. We complement our theoretical analysis with the first in-depth experimental study for the dynamic $k$-median problem on general metrics, focusing on comparing our dynamic algorithm to the current state of the art by Henzinger and Kale [ESA'20]. Finally, we also provide a lower bound for dynamic $k$-median which shows that any $O(1)$-approximate algorithm with $\tilde{O}(\mathrm{poly}(k))$ query time must have $\tilde{\Omega}(k)$ amortized update time, even in the incremental setting.

    Dynamic algorithms for k-center on graphs

    In this paper we give the first efficient algorithms for the $k$-center problem on dynamic graphs undergoing edge updates. In this problem, the goal is to partition the input into $k$ sets by choosing $k$ centers such that the maximum distance from any data point to its closest center is minimized. It is known to be NP-hard to achieve a better-than-$2$ approximation for this problem. While in many applications the input may naturally be modeled as a graph, all prior work on the $k$-center problem in dynamic settings is on metrics. In this paper, we give a deterministic decremental $(2+\epsilon)$-approximation algorithm and a randomized incremental $(4+\epsilon)$-approximation algorithm, both with amortized update time $kn^{o(1)}$ for weighted graphs. Moreover, we show a reduction that leads to a fully dynamic $(2+\epsilon)$-approximation algorithm for the $k$-center problem, with worst-case update time that is within a factor $k$ of the state-of-the-art upper bound for maintaining $(1+\epsilon)$-approximate single-source distances in graphs. Matching this bound is a natural goalpost because the approximate distances of each vertex to its center can be used to maintain a $(2+\epsilon)$-approximation of the graph diameter, and the fastest known algorithms for such a diameter approximation also rely on maintaining approximate single-source distances.
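
    For intuition on the thresholds above, the classic static $2$-approximation for $k$-center (Gonzalez' farthest-point greedy) repeatedly picks the point farthest from the centers chosen so far. Below is a minimal sketch over a precomputed distance matrix; on graphs, these distances come from shortest-path computations, which is exactly what the dynamic algorithms must maintain:

        def gonzalez_k_center(dist, k):
            # dist: n x n matrix of (graph) distances. Returns k center indices
            # whose maximum point-to-center distance is within 2x of optimal.
            n = len(dist)
            centers = [0]              # arbitrary first center
            d_near = list(dist[0])     # distance of each point to nearest center
            for _ in range(k - 1):
                far = max(range(n), key=lambda v: d_near[v])
                centers.append(far)
                d_near = [min(d_near[v], dist[far][v]) for v in range(n)]
            return centers, max(d_near)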

    Differential Privacy for Clustering Under Continual Observation

    We consider the problem of privately clustering a dataset in $\mathbb{R}^d$ that undergoes both insertions and deletions of points. Specifically, we give an $\varepsilon$-differentially private clustering mechanism for the $k$-means objective under continual observation. This is the first approximation algorithm for that problem with an additive error that depends only logarithmically on the number $T$ of updates. The multiplicative error is almost the same as in the non-private setting. To achieve this, we show how to perform dimension reduction under continual observation and combine it with a differentially private greedy approximation algorithm for $k$-means. We also partially extend our results to the $k$-median problem.
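
    Here is a minimal sketch of the dimension-reduction ingredient, a standard Johnson-Lindenstrauss random projection; the paper's actual contribution is carrying out this step (and the private clustering step) under continual observation, which this static sketch does not capture:

        import numpy as np

        def jl_project(X, target_dim, seed=0):
            # Random Gaussian projection of n points in R^d down to R^target_dim.
            # Pairwise distances are preserved up to a (1 +/- eps) factor w.h.p.
            # when target_dim = O(log(n) / eps^2).
            rng = np.random.default_rng(seed)
            d = X.shape[1]
            G = rng.normal(size=(d, target_dim)) / np.sqrt(target_dim)
            return X @ G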

    Coresets for Clustering with General Assignment Constraints

    Designing small-sized \emph{coresets}, which approximately preserve the costs of the solutions for large datasets, has been an important research direction for the past decade. We consider coreset construction for a variety of general constrained clustering problems. We significantly extend and generalize the results of a very recent paper (Braverman et al., FOCS'22) by demonstrating that the idea of hierarchical uniform sampling (Chen, SICOMP'09; Braverman et al., FOCS'22) can be applied to efficiently construct coresets for a very general class of constrained clustering problems with general assignment constraints, including capacity constraints on cluster centers and assignment structure constraints for data points (modeled by a convex body $\mathcal{B}$). Our main theorem shows that a small-sized $\epsilon$-coreset exists as long as a complexity measure $\mathsf{Lip}(\mathcal{B})$ of the structure constraint and the \emph{covering exponent} $\Lambda_\epsilon(\mathcal{X})$ of the metric space $(\mathcal{X}, d)$ are bounded. The complexity measure $\mathsf{Lip}(\mathcal{B})$ for a convex body $\mathcal{B}$ is the Lipschitz constant of a certain transportation problem constrained in $\mathcal{B}$, called the \emph{optimal assignment transportation problem}. We prove nontrivial upper bounds on $\mathsf{Lip}(\mathcal{B})$ for various polytopes, including general matroid basis polytopes and laminar matroid polytopes (with a better bound). As an application of our general theorem, we construct the first coreset for the fault-tolerant clustering problem (with or without capacity upper/lower bounds) for the above metric spaces, in which the fault-tolerance requirement is captured by a uniform matroid basis polytope.
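
    As background, here is a minimal sketch of plain hierarchical uniform sampling in the spirit of Chen (SICOMP'09): assign points to approximate centers, split each cluster into rings of geometrically growing radius, and sample uniformly within each ring with weights that preserve ring mass. This sketch ignores the assignment-constraint machinery ($\mathsf{Lip}(\mathcal{B})$, covering exponents) that is the paper's actual contribution, and its parameters are illustrative rather than those needed for a provable guarantee:

        import math
        import random
        from collections import defaultdict

        def hierarchical_uniform_sample(points, centers, dist, m=10):
            # dist: callable (point, center) -> distance.
            # Returns a list of (point, weight) pairs.
            avg = sum(min(dist(p, c) for c in centers) for p in points) / len(points)
            rings = defaultdict(list)
            for p in points:
                i = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
                r = dist(p, centers[i])
                level = 0 if r <= avg else math.ceil(math.log2(r / avg))
                rings[(i, level)].append(p)
            coreset = []
            for ring in rings.values():
                sample = ring if len(ring) <= m else random.sample(ring, m)
                w = len(ring) / len(sample)   # each sample stands in for w points
                coreset.extend((p, w) for p in sample)
            return coreset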

    Approximating Edit Distance in the Fully Dynamic Model

    The edit distance is a fundamental measure of sequence similarity, defined as the minimum number of character insertions, deletions, and substitutions needed to transform one string into the other. Given two strings of length at most $n$, simple dynamic programming computes their edit distance exactly in $O(n^2)$ time, which is also the best possible (up to subpolynomial factors) assuming the Strong Exponential Time Hypothesis (SETH). The last few decades have seen tremendous progress in edit distance approximation, where the runtime has been brought down to subquadratic, near-linear, and even sublinear at the cost of approximation. In this paper, we study the dynamic edit distance problem, where the strings change dynamically as characters are substituted, inserted, or deleted over time. Each change may happen at any location of either of the two strings. The goal is to maintain the (exact or approximate) edit distance of such dynamic strings while minimizing the update time. The exact edit distance can be maintained in $\tilde{O}(n)$ time per update (Charalampopoulos, Kociumaka, Mozes; 2020), which is again tight assuming SETH. Unfortunately, even with the unprecedented progress in edit distance approximation in the static setting, strikingly little is known about dynamic edit distance approximation. Using off-the-shelf tools, it is possible to achieve an $O(n^c)$-approximation with $n^{0.5-c+o(1)}$ update time for any constant $c \in [0, \frac{1}{6}]$. Improving upon this trade-off remains open. The contribution of this work is a dynamic $n^{o(1)}$-approximation algorithm with amortized expected update time $n^{o(1)}$. In other words, we bring the product of approximation ratio and update time down to $n^{o(1)}$. Our solution utilizes an elegant framework of precision sampling trees for edit distance approximation (Andoni, Krauthgamer, Onak; 2010).
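
    For reference, the quadratic-time dynamic program mentioned above, written row by row so it uses only $O(n)$ extra space:

        def edit_distance(a, b):
            # prev[j] = edit distance between the current prefix of a and b[:j].
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1,                 # delete ca
                                   cur[j - 1] + 1,              # insert cb
                                   prev[j - 1] + (ca != cb)))   # substitute/match
                prev = cur
            return prev[-1]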

    Coresets for Clustering: Foundations and Challenges

    Clustering is a fundamental task in machine learning and data analysis. The main challenge for clustering in big data sets is that classical clustering algorithms often do not scale well. Coresets are data reduction techniques that turn big data into a tiny proxy. Prior research has shown that coresets can provide a scalable solution to clustering problems and imply streaming and distributed algorithms. In this work, we aim to solve a fundamental question and two modern challenges in coresets for clustering.

    \textsf{Beyond Euclidean Space}: Coresets for clustering in Euclidean space have been well studied, and coresets of constant size are known to exist, but very few results are known beyond Euclidean space. Which metric spaces admit constant-sized coresets for clustering is therefore a fundamental question. We focus on graph metrics, a common ambient space for clustering. We provide positive results asserting that constant-sized coresets exist for various families of graph metrics, including graphs of bounded treewidth, planar graphs, and the more general excluded-minor graphs.

    \textsf{Missing Values}: Missing values are a common phenomenon in real data sets, and clustering in their presence is a very challenging task. In this work, we construct the first coresets for clustering with multiple missing values. Previously, such coresets were only known to exist when each data point has at most one missing value (Marom and Feldman, NeurIPS'19). We further design a near-linear-time algorithm to construct our coresets. This algorithm implies the first near-linear-time approximation scheme for $k$-means clustering with missing values and improves a recent result of Eiben et al. (SODA'21).

    \textsf{Simultaneous Coresets}: Most classical coresets are limited to a specific clustering objective. When there are multiple potential objectives, a stronger notion of “simultaneous coresets” is needed: simultaneous coresets provide approximations for a family of objectives and can serve as a more flexible data reduction tool. In this work, we design the first simultaneous coresets for a large clustering family that includes both $k$-median and $k$-center.
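
    To illustrate what a simultaneous coreset must preserve, here is a minimal sketch (assuming Euclidean points and a weighted summary) that evaluates several clustering objectives on the same weighted point set:

        import numpy as np

        def clustering_cost(points, weights, centers, objective):
            # Evaluate one of several objectives on a weighted summary -- a
            # simultaneous coreset must approximate all of them at once.
            d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            nearest = d.min(axis=1)  # distance of each point to closest center
            if objective == "k-median":
                return float(weights @ nearest)
            if objective == "k-means":
                return float(weights @ nearest ** 2)
            if objective == "k-center":
                # The max ignores weights; a k-center guarantee constrains which
                # points a summary may drop rather than how they are reweighted.
                return float(nearest.max())
            raise ValueError(f"unknown objective: {objective}")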