4 research outputs found

    Coresets for Clustering with General Assignment Constraints

    Designing small-sized \emph{coresets}, which approximately preserve the costs of the solutions for large datasets, has been an important research direction for the past decade. We consider coreset construction for a variety of general constrained clustering problems. We significantly extend and generalize the results of a very recent paper (Braverman et al., FOCS'22), by demonstrating that the idea of hierarchical uniform sampling (Chen, SICOMP'09; Braverman et al., FOCS'22) can be applied to efficiently construct coresets for a very general class of constrained clustering problems with general assignment constraints, including capacity constraints on cluster centers and assignment structure constraints for data points (modeled by a convex body $\mathcal{B}$). Our main theorem shows that a small-sized $\epsilon$-coreset exists as long as a complexity measure $\mathsf{Lip}(\mathcal{B})$ of the structure constraint and the \emph{covering exponent} $\Lambda_\epsilon(\mathcal{X})$ of the metric space $(\mathcal{X}, d)$ are bounded. The complexity measure $\mathsf{Lip}(\mathcal{B})$ for a convex body $\mathcal{B}$ is the Lipschitz constant of a certain transportation problem constrained in $\mathcal{B}$, called the \emph{optimal assignment transportation problem}. We prove nontrivial upper bounds on $\mathsf{Lip}(\mathcal{B})$ for various polytopes, including general matroid basis polytopes and laminar matroid polytopes (with a better bound). As an application of our general theorem, we construct the first coreset for the fault-tolerant clustering problem (with or without capacity upper/lower bounds) for the above metric spaces, in which the fault-tolerance requirement is captured by a uniform matroid basis polytope.
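To make the fault-tolerance requirement above concrete, the following sketch (illustrative only; the function name and toy data are ours, not the paper's construction) computes the fault-tolerant clustering cost in which every point must be served by $m$ distinct centers, the assignment structure given by a rank-$m$ uniform matroid basis polytope. For a fixed center set, the optimal such assignment sends each point to its $m$ nearest centers.

```python
import math

def fault_tolerant_cost(points, centers, m, z=2):
    """Fault-tolerant (z-power) clustering cost: each point is assigned to
    m distinct centers, and the optimal assignment uses its m nearest ones."""
    total = 0.0
    for p in points:
        dists = sorted(math.dist(p, c) for c in centers)  # distances to all centers
        total += sum(d ** z for d in dists[:m])           # m nearest, z-th powers
    return total

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
centers = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
# With m=2, each point pays for its two nearest centers.
print(fault_tolerant_cost(points, centers, m=2))
```

Setting m=1 recovers the ordinary (k, z)-clustering cost, which is one way to see fault tolerance as an assignment constraint layered on top of the unconstrained problem.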

    Near-linear time approximation schemes for clustering in doubling metrics

    We consider the classic Facility Location, k-Median, and k-Means problems in metric spaces of doubling dimension $d$. We give nearly linear-time approximation schemes for each problem. The complexity of our algorithms is $\tilde{O}(2^{(1/\varepsilon)^{O(d^2)}} n)$, a significant improvement over the state-of-the-art algorithms, which run in time $n^{(d/\varepsilon)^{O(d)}}$. Moreover, we show how to extend our techniques to obtain the first efficient approximation schemes for prize-collecting k-Median and k-Means, and efficient bicriteria approximation schemes for k-Median with outliers, k-Means with outliers, and k-Center.

    $\varepsilon$-Coresets for Clustering (with Outliers) in Doubling Metrics

    We study the problem of constructing $\varepsilon$-coresets for the $(k, z)$-clustering problem in a doubling metric $M(X, d)$. An $\varepsilon$-coreset is a weighted subset $S \subseteq X$ with weight function $w : S \rightarrow \mathbb{R}_{\geq 0}$, such that for any $k$-subset $C \in [X]^k$, it holds that $\sum_{x \in S} w(x) \cdot d^z(x, C) \in (1 \pm \varepsilon) \cdot \sum_{x \in X} d^z(x, C)$. We present an efficient algorithm that constructs an $\varepsilon$-coreset for the $(k, z)$-clustering problem in $M(X, d)$, where the size of the coreset depends only on the parameters $k, z, \varepsilon$ and the doubling dimension $\mathsf{ddim}(M)$. To the best of our knowledge, this is the first efficient $\varepsilon$-coreset construction of size independent of $|X|$ for general clustering problems in doubling metrics. To this end, we establish the first relation between the doubling dimension of $M(X, d)$ and the shattering dimension (or VC-dimension) of the range space induced by the distance $d$. Such a relation was not known before, since one can easily construct instances in which neither one can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small $(1 \pm \epsilon)$-distortion of the distance function $d$ and consider the notion of $\tau$-error probabilistic shattering dimension, we can prove an upper bound of $O(\mathsf{ddim}(M) \cdot \log(1/\varepsilon) + \log\log\frac{1}{\tau})$ on the probabilistic shattering dimension, even for weighted doubling metrics. We believe this new relation is of independent interest and may find other applications. We also study robust coresets and centroid sets in doubling metrics.
Our robust coreset construction leads to new results in clustering and property testing, and the centroid sets can be used to accelerate local search algorithms for clustering problems. (Comment: Appeared in FOCS 2018; this is the full version.)
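The $\varepsilon$-coreset guarantee above is easy to check numerically. The sketch below (illustrative only: it uses naive uniform sampling with reweighting as a baseline, not the paper's importance-sampling-based construction, and the toy data and names are ours) compares the weighted coreset cost against the full cost for one fixed center set.

```python
import math
import random

def cost(points, weights, centers, z):
    """Weighted (k, z)-clustering cost: sum over x of w(x) * d(x, C)^z,
    where d(x, C) is the distance from x to its nearest center."""
    return sum(w * min(math.dist(p, c) for c in centers) ** z
               for p, w in zip(points, weights))

random.seed(0)
X = [(random.random(), random.random()) for _ in range(1000)]

# Naive baseline: uniform sample S, each point reweighted by |X| / |S| so
# that the weighted cost is an unbiased estimate of the full cost.
S = random.sample(X, 200)
wS = [len(X) / len(S)] * len(S)

C = [(0.25, 0.25), (0.75, 0.75)]  # an arbitrary k=2 center set
full = cost(X, [1.0] * len(X), C, z=2)
core = cost(S, wS, C, z=2)
print(abs(core - full) / full)  # relative error for this single center set
```

Note the coreset guarantee is much stronger than this check: it must hold simultaneously for every $k$-subset $C$, which is exactly why the shattering-dimension machinery in the abstract is needed.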