
    Coresets for Clustering: Foundations and Challenges

    Clustering is a fundamental task in machine learning and data analysis. The main challenge for clustering on big data sets is that classical clustering algorithms often do not scale well. Coresets are data reduction techniques that compress a large data set into a tiny proxy. Prior research has shown that coresets can provide a scalable solution to clustering problems and imply streaming and distributed algorithms. In this work, we address one fundamental question and two modern challenges in coresets for clustering. Beyond Euclidean space: Coresets for clustering in Euclidean space have been well studied, and coresets of constant size are known to exist, but very few results are known beyond the Euclidean setting. Which metric spaces admit constant-sized coresets for clustering is therefore a fundamental question. We focus on graph metrics, a common ambient space for clustering, and give positive results asserting that constant-sized coresets exist in various families of graph metrics, including graphs of bounded treewidth, planar graphs, and the more general excluded-minor graphs. Missing values: Missing values are a common phenomenon in real data sets, and clustering in their presence is a very challenging task. In this work, we construct the first coresets for clustering with multiple missing values; previously, such coresets were only known to exist when each data point has at most one missing value [Marom and Feldman, NeurIPS'19]. We further design a near-linear-time algorithm to construct our coresets. This algorithm yields the first near-linear-time approximation scheme for k-means clustering with missing values, improving a recent result of [Eiben et al., SODA'21]. Simultaneous coresets: Most classical coresets are limited to a specific clustering objective. When there are multiple potential objectives, a stronger notion of "simultaneous coresets" is needed: simultaneous coresets provide approximations for a whole family of objectives and can serve as a more flexible data reduction tool. In this work, we design the first simultaneous coresets for a large clustering family that includes both k-median and k-center.
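    For reference, the standard notion of coreset used throughout these abstracts can be stated formally as follows; this is the textbook definition from the coreset literature, written out here for convenience rather than quoted from the paper. A weighted subset S of the input P (with weights w) is an \varepsilon-coreset for (k,z)-clustering if, for every set C of k centers,
        (1-\varepsilon) \sum_{p \in P} d(p,C)^z \;\le\; \sum_{s \in S} w(s)\, d(s,C)^z \;\le\; (1+\varepsilon) \sum_{p \in P} d(p,C)^z,
    where d(p,C) = \min_{c \in C} d(p,c). "Constant-sized" above means |S| depends only on k, z, and \varepsilon, not on |P|.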

    Coresets for Clustering in Geometric Intersection Graphs


    A New Coreset Framework for Clustering

    Given a metric space, the (k,z)-clustering problem consists of finding k centers such that the sum of the distances, raised to the power z, of every point to its closest center is minimized. This encapsulates the famous k-median (z=1) and k-means (z=2) clustering problems. Designing small-space sketches of the data that approximately preserve the cost of the solutions, also known as \emph{coresets}, has been an important research direction over the last 15 years. In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, ranging from Euclidean spaces and doubling metrics to minor-free metrics and general metrics.
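    To make the objective just defined concrete, here is a minimal Python sketch that evaluates the (k,z)-clustering cost of a candidate center set in Euclidean space; the function name and the use of NumPy are illustrative choices, not part of the paper.

        import numpy as np

        def kz_clustering_cost(points: np.ndarray, centers: np.ndarray, z: float) -> float:
            """Sum over all points of (distance to nearest center)^z.

            z = 1 gives the k-median objective, z = 2 gives k-means.
            """
            # Pairwise Euclidean distances, shape (n_points, k).
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            # Each point pays the z-th power of its distance to the closest center.
            return float(np.sum(dists.min(axis=1) ** z))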

    The Power of Uniform Sampling for Coresets

    Motivated by practical generalizations of the classic k-median and k-means objectives, such as clustering with size constraints, fair clustering, and Wasserstein barycenter, we introduce a meta-theorem for designing coresets for constrained-clustering problems. The meta-theorem reduces the task of coreset construction to one on a bounded number of ring instances with a much-relaxed additive error. This reduction enables us to construct coresets using uniform sampling, in contrast to the widely used importance sampling, and consequently we can easily handle constrained objectives. Notably, and perhaps surprisingly, this simpler sampling scheme can yield coresets whose size is independent of n, the number of input points. Our technique yields smaller coresets, and sometimes the first coresets, for a large number of constrained clustering problems, including capacitated clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in minor-excluded graphs, and polygon clustering under the Fr\'{e}chet and Hausdorff distances. Finally, our technique also yields smaller coresets for 1-median in low-dimensional Euclidean spaces, specifically of size \tilde{O}(\varepsilon^{-1.5}) in \mathbb{R}^2 and \tilde{O}(\varepsilon^{-1.6}) in \mathbb{R}^3.
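    To make the contrast with importance sampling concrete, the following Python sketch shows the uniform-sampling primitive on which such constructions build: draw m points uniformly and give each weight n/m, so that the weighted cost of the sample is an unbiased estimate of the true cost for any fixed solution. This is only the basic primitive, not the paper's ring-decomposition meta-theorem; the names and parameters are illustrative.

        import numpy as np

        def uniform_coreset(points: np.ndarray, m: int, seed: int = 0):
            """Sample m points uniformly at random and reweight by n/m.

            For any fixed center set C, the weighted cost of the sample is an
            unbiased estimator of the full clustering cost; upgrading this to
            a guarantee over *all* center sets is what the meta-theorem does.
            """
            rng = np.random.default_rng(seed)
            n = len(points)
            idx = rng.choice(n, size=m, replace=False)
            weights = np.full(m, n / m)
            return points[idx], weights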

    Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation

    In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic k-median and k-means problems, there is no known deterministic dimensionality reduction procedure or coreset construction that avoids an exponential dependency on the input dimension d, the precision parameter \varepsilon^{-1}, or k. Furthermore, there is no coreset construction that succeeds with probability 1-1/n and whose size does not depend on the number of input points, n. This has led researchers in the area to ask what is the power of randomness for clustering sketches [Feldman, WIREs Data Mining Knowl. Discov.'20]. Similarly, the best approximation ratio achievable deterministically without a complexity exponential in the dimension is \Omega(1) for both k-median and k-means, even when allowing a complexity FPT in the number of clusters k. This stands in sharp contrast with the (1+\varepsilon)-approximation achievable in that case when allowing randomization. In this paper, we provide deterministic sketch constructions for clustering whose size bounds are close to the best known randomized ones. We also construct a deterministic algorithm for computing a (1+\varepsilon)-approximation to k-median and k-means in high-dimensional Euclidean spaces in time 2^{k^2/\varepsilon^{O(1)}} poly(nd), close to the best randomized complexity. Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling and immediately improves over the recent results of [Braverman et al., FOCS'22] by a factor of k.

    An Empirical Evaluation of k-Means Coresets

    Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high-performance coresets for clustering problems such as k-means, in both theory and practice. Curiously, there exists no work comparing the quality of available k-means coresets. In this paper we perform such an evaluation. There is currently no known algorithm for measuring the distortion of a candidate coreset, and we provide some evidence as to why this might be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging and which also allows an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.
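    The distortion being discussed can at least be lower-bounded heuristically by maximizing the cost ratio over a finite family of candidate solutions; the self-contained Python sketch below does exactly that, and is a generic illustration rather than the benchmark proposed in the paper.

        import numpy as np

        def cost(points, weights, centers, z=2.0):
            """Weighted (k,z)-clustering cost of `centers` on a weighted point set."""
            d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            return float(np.sum(weights * d.min(axis=1) ** z))

        def empirical_distortion(P, S, w, candidate_center_sets, z=2.0):
            """Maximum relative cost error of the coreset (S, w) over a finite
            family of candidate solutions -- only a lower bound on the true
            distortion, which maximizes over all possible center sets."""
            ones = np.ones(len(P))
            return max(abs(cost(S, w, C, z) / cost(P, ones, C, z) - 1.0)
                       for C in candidate_center_sets)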

    On Coresets for Fair Clustering in Metric and Euclidean Spaces and Their Applications

    Fair clustering is a constrained variant of clustering where the goal is to partition a set of colored points such that the fraction of points of any color in every cluster is more or less equal to the fraction of points of this color in the dataset. This variant was recently introduced in a seminal work by Chierichetti et al. [NeurIPS, 2017] and became widely popular in the clustering literature. In this paper, we propose a new construction of coresets for fair clustering based on random sampling. The new construction allows us to obtain the first coreset for fair clustering in general metric spaces. For Euclidean spaces, we obtain the first coreset whose size does not depend exponentially on the dimension. Our coreset results solve open questions posed by Schmidt et al. [WAOA, 2019] and Huang et al. [NeurIPS, 2019]. The new coreset construction helps to design several new approximation and streaming algorithms. In particular, we obtain the first true constant-factor approximation algorithm for metric fair clustering whose running time is fixed-parameter tractable (FPT). In the Euclidean case, we derive the first (1+\epsilon)-approximation algorithm for fair clustering whose time complexity is near-linear and does not depend exponentially on the dimension of the space. Besides, our coreset construction scheme is fairly general and gives rise to coresets for a wide range of constrained clustering problems. This leads to improved constant-factor approximations for these problems in general metrics and near-linear-time (1+\epsilon)-approximations in the Euclidean metric.
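    As an illustration of the constraint being preserved, the following Python sketch checks whether a given clustering is approximately fair in the proportional sense described above; the tolerance parameter and function name are illustrative, not taken from the paper.

        from collections import Counter

        def is_proportionally_fair(colors, labels, tol=0.1):
            """Check that in every cluster, each color's fraction is within
            `tol` of that color's fraction in the whole dataset.

            colors[i] is the color of point i; labels[i] is its cluster id.
            """
            n = len(colors)
            global_frac = {c: cnt / n for c, cnt in Counter(colors).items()}
            clusters = {}
            for color, label in zip(colors, labels):
                clusters.setdefault(label, []).append(color)
            for members in clusters.values():
                local = Counter(members)
                for color, target in global_frac.items():
                    if abs(local.get(color, 0) / len(members) - target) > tol:
                        return False
            return True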

    Towards Optimal Coreset Construction for (k,z)-Clustering: Breaking the Quadratic Dependency on k

    Constructing small-sized coresets for various clustering problems has attracted significant attention recently. We provide efficient coreset construction algorithms for (k,z)-Clustering with improved coreset sizes in several metric spaces. In particular, we provide an \tilde{O}_z(k^{(2z+2)/(z+2)}\varepsilon^{-2})-sized coreset for (k,z)-Clustering for all z\geq 1 in Euclidean space, improving upon the best known \tilde{O}_z(k^2\varepsilon^{-2}) size upper bound [Cohen-Addad, Larsen, Saulpic, Schwiegelshohn. STOC'22], breaking the quadratic dependency on k for the first time (when k\leq \varepsilon^{-1}). For example, our coreset size for Euclidean k-Median is \tilde{O}(k^{4/3}\varepsilon^{-2}), improving the best known result \tilde{O}(\min\{k^2\varepsilon^{-2}, k\varepsilon^{-3}\}) by a factor of k^{2/3} when k\leq \varepsilon^{-1}; for Euclidean k-Means, our coreset size is \tilde{O}(k^{3/2}\varepsilon^{-2}), improving the best known result \tilde{O}(\min\{k^2\varepsilon^{-2}, k\varepsilon^{-4}\}) by a factor of k^{1/2} when k\leq \varepsilon^{-2}. We also obtain optimal or improved coreset sizes for general metric spaces, metric spaces with bounded doubling dimension, and the shortest-path metric when the underlying graph has bounded treewidth, for all z\geq 1. Our algorithm largely follows the framework developed by Cohen-Addad et al. with some minor but useful changes. Our technical contribution mainly lies in the analysis. An important improvement in our analysis is a new notion of \alpha-covering of distance vectors with a novel error metric, which allows us to provide a tighter variance bound. Another useful technical ingredient is terminal embedding with additive errors, for bounding the covering number in the Euclidean case.

    Coresets for Clustering with General Assignment Constraints

    Designing small-sized \emph{coresets}, which approximately preserve the costs of the solutions for large datasets, has been an important research direction for the past decade. We consider coreset construction for a variety of general constrained clustering problems. We significantly extend and generalize the results of a very recent paper (Braverman et al., FOCS'22) by demonstrating that the idea of hierarchical uniform sampling (Chen, SICOMP'09; Braverman et al., FOCS'22) can be applied to efficiently construct coresets for a very general class of constrained clustering problems with general assignment constraints, including capacity constraints on cluster centers and assignment structure constraints for data points (modeled by a convex body \mathcal{B}). Our main theorem shows that a small-sized \epsilon-coreset exists as long as a complexity measure \mathsf{Lip}(\mathcal{B}) of the structure constraint and the \emph{covering exponent} \Lambda_\epsilon(\mathcal{X}) of the metric space (\mathcal{X},d) are bounded. The complexity measure \mathsf{Lip}(\mathcal{B}) of a convex body \mathcal{B} is the Lipschitz constant of a certain transportation problem constrained to \mathcal{B}, called the \emph{optimal assignment transportation problem}. We prove nontrivial upper bounds on \mathsf{Lip}(\mathcal{B}) for various polytopes, including general matroid basis polytopes and laminar matroid polytopes (with a better bound). As an application of our general theorem, we construct the first coresets for the fault-tolerant clustering problem (with or without capacity upper/lower bounds) for the above metric spaces, in which the fault-tolerance requirement is captured by a uniform matroid basis polytope.