139 research outputs found

    On Coresets for Fair Clustering in Metric and Euclidean Spaces and Their Applications

    Fair clustering is a constrained variant of clustering where the goal is to partition a set of colored points, such that the fraction of points of any color in every cluster is more or less equal to the fraction of points of this color in the dataset. This variant was recently introduced by Chierichetti et al. [NeurIPS, 2017] in a seminal work and became widely popular in the clustering literature. In this paper, we propose a new construction of coresets for fair clustering based on random sampling. The new construction allows us to obtain the first coreset for fair clustering in general metric spaces. For Euclidean spaces, we obtain the first coreset whose size does not depend exponentially on the dimension. Our coreset results solve open questions proposed by Schmidt et al. [WAOA, 2019] and Huang et al. [NeurIPS, 2019]. The new coreset construction helps to design several new approximation and streaming algorithms. In particular, we obtain the first true constant-approximation algorithm for metric fair clustering, whose running time is fixed-parameter tractable (FPT). In the Euclidean case, we derive the first $(1+\epsilon)$-approximation algorithm for fair clustering whose time complexity is near-linear and does not depend exponentially on the dimension of the space. Besides, our coreset construction scheme is fairly general and gives rise to coresets for a wide range of constrained clustering problems. This leads to improved constant-approximations for these problems in general metrics and near-linear time $(1+\epsilon)$-approximations in the Euclidean metric.
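The fairness condition in this abstract (each color's share inside every cluster should be close to its share in the whole dataset) can be checked for a given clustering with a few lines of Python. This is only an illustrative sketch: the function names `color_fractions` and `max_fairness_deviation` are hypothetical, and measuring the gap as a maximum absolute difference is an assumption, not the paper's formal definition.

```python
from collections import Counter


def color_fractions(colors):
    """Return the share of each color in a list of color labels."""
    counts = Counter(colors)
    total = len(colors)
    return {color: count / total for color, count in counts.items()}


def max_fairness_deviation(colors, labels):
    """Largest absolute gap between a color's share in any cluster
    and its share in the full dataset (0.0 means perfectly balanced).

    colors[i] is the color of point i, labels[i] its cluster id.
    The max-absolute-gap measure is an illustrative assumption.
    """
    overall = color_fractions(colors)

    # Group the colors of the points by cluster id.
    clusters = {}
    for color, label in zip(colors, labels):
        clusters.setdefault(label, []).append(color)

    worst = 0.0
    for members in clusters.values():
        local = color_fractions(members)
        for color, share in overall.items():
            worst = max(worst, abs(local.get(color, 0.0) - share))
    return worst
```

For example, with colors `["r", "b", "r", "b"]`, the clustering `[0, 0, 1, 1]` mirrors the dataset proportions in both clusters, while `[0, 1, 0, 1]` puts each color in its own cluster and is as unbalanced as possible.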

    On Coresets for Fair Clustering in Metric and Euclidean Spaces and Their Applications

    Fair clustering is a constrained clustering problem where we need to partition a set of colored points. The fraction of points of each color in every cluster should be more or less equal to the fraction of points of this color in the dataset. The problem was recently introduced by Chierichetti et al. (2017) [1]. We propose a new construction of coresets for fair clustering for Euclidean and general metrics based on random sampling. For the Euclidean space $\mathbb{R}^d$, we provide the first coreset whose size does not depend exponentially on the dimension $d$. The question of whether such constructions exist was asked by Schmidt et al. (2019) [2] and Huang et al. (2019) [5]. For general metrics, our construction provides the first coreset for fair clustering. New coresets appear to be a handy tool for designing better approximation and streaming algorithms for fair and other constrained clustering variants.

    Improved Approximation and Scalability for Fair Max-Min Diversification

    Given an $n$-point metric space $(\mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups and a set of integers $k_1, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $i\in [m]$, such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample achieves a balance over diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results: 1. We first consider general metric spaces. We present a randomized polynomial time algorithm that returns a factor $2$-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-\epsilon$ for any constant $\epsilon$. We also present a linear time algorithm returning an $m+1$ approximation with exact fairness. The best previous result was a $3m-1$ approximation. 2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimensions, categories and any constant $\epsilon>0$, we present a $1+\epsilon$ approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time where $k=k_1+\ldots+k_m$. We can improve the running time to $O(nk)+\mathrm{poly}(k)$ at the expense of only picking $(1-\epsilon) k_i$ points from category $i\in [m]$. Finally, we present algorithms suitable to processing massive data sets including single-pass data stream algorithms and composable coresets for the distributed processing. Comment: To appear in ICDT 202
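To make the objective in this abstract concrete, the sketch below computes the diversity of a selection (its minimum pairwise distance), checks per-category quotas, and runs the classical farthest-first greedy heuristic for the unconstrained version of the problem. It is not any of the algorithms announced in the abstract. The function names, the use of Euclidean distance via `math.dist`, and the choice of the first input point as the starting point are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def diversity(points):
    """Minimum pairwise Euclidean distance among the selected points."""
    return min(math.dist(p, q) for p, q in combinations(points, 2))


def meets_quotas(categories, quotas):
    """True if exactly quotas[g] selected points belong to each category g."""
    counts = Counter(categories)
    return all(counts.get(g, 0) == k for g, k in quotas.items())


def farthest_first(points, k):
    """Greedy heuristic for the UNCONSTRAINED problem (no categories):
    repeatedly add the point farthest from the current selection.
    Assumes the points are distinct and k <= len(points).
    """
    chosen = [points[0]]  # arbitrary start: an assumption of this sketch
    while len(chosen) < k:
        candidates = [p for p in points if p not in chosen]
        nxt = max(candidates,
                  key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen
```

On the points `(0,0), (1,0), (10,0), (5,0)` with `k = 3`, the greedy pass picks `(0,0)`, then `(10,0)`, then `(5,0)`, giving diversity `5.0`. The fair variant additionally requires the selection to pass a check like `meets_quotas`, which this heuristic does not enforce.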

    Coresets for Clustering in Graphs of Bounded Treewidth

    We initiate the study of coresets for clustering in graph metrics, i.e., the shortest-path metric of edge-weighted graphs. Such clustering problems are essential to data analysis and are used, for example, in road networks and data visualization. A coreset is a compact summary of the data that approximately preserves the clustering objective for every possible center set, and it offers significant efficiency improvements in terms of running time, storage, and communication, including in streaming and distributed settings. Our main result is a near-linear time construction of a coreset for k-Median in a general graph $G$, with size $O_{\epsilon, k}(\mathrm{tw}(G))$ where $\mathrm{tw}(G)$ is the treewidth of $G$, and we complement the construction with a nearly-tight size lower bound. The construction is based on the framework of Feldman and Langberg [STOC 2011], and our main technical contribution, as required by this framework, is a uniform bound of $O(\mathrm{tw}(G))$ on the shattering dimension under any point weights. We validate our coreset on real-world road networks, and our scalable algorithm constructs tiny coresets with high accuracy, which translates to a massive speedup of existing approximation algorithms such as local search for graph k-Median.
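The informal description of a coreset in this abstract is usually formalized as follows for k-Median. The notation here (data set $X$, weighted subset $S$ with weights $w$, center set $C$) is the standard textbook form and is an assumption; it is not quoted from the paper.

```latex
% An \epsilon-coreset for k-Median on a data set X with metric d:
% a weighted subset (S, w), S \subseteq X, such that for EVERY
% center set C with |C| = k,
\[
  (1-\epsilon)\sum_{x \in X} d(x, C)
  \;\le\; \sum_{s \in S} w(s)\, d(s, C)
  \;\le\; (1+\epsilon)\sum_{x \in X} d(x, C),
  \qquad \text{where } d(x, C) = \min_{c \in C} d(x, c).
\]
```

In the graph setting of the abstract, $d$ would be the shortest-path metric of the edge-weighted graph $G$.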

    Coresets for Regressions with Panel Data

    This paper introduces the problem of coresets for regression problems to panel data settings. We first define coresets for several variants of regression problems with panel data and then present efficient algorithms to construct coresets whose size depends polynomially on $1/\varepsilon$ (where $\varepsilon$ is the error parameter) and the number of regression parameters, independent of the number of individuals in the panel data or the time units each individual is observed for. Our approach is based on the Feldman-Langberg framework, in which a key step is to upper bound the "total sensitivity", which is roughly the sum of maximum influences of all individual-time pairs taken over all possible choices of regression parameters. Empirically, we assess our approach with synthetic and real-world datasets; the coreset sizes constructed using our approach are much smaller than the full dataset, and coresets indeed accelerate the running time of computing the regression objective. Comment: This is a full version of a paper to appear in NeurIPS 2020. The code can be found at https://github.com/huanglx12/Coresets-for-regressions-with-panel-dat
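The "total sensitivity" mentioned in this abstract can be written out in the usual Feldman-Langberg form. The symbols below are assumptions introduced for illustration: $f_{i,t}(\theta) \ge 0$ denotes the loss contributed by individual $i$ at time $t$ under regression parameters $\theta$, and the paper's exact definition may differ in detail.

```latex
% Sensitivity of an individual-time pair (i, t): its maximum relative
% contribution to the objective over all parameter choices \theta.
\[
  s(i,t) \;=\; \sup_{\theta}\,
  \frac{f_{i,t}(\theta)}{\sum_{i'} \sum_{t'} f_{i',t'}(\theta)},
  \qquad
  \mathfrak{S} \;=\; \sum_{i} \sum_{t} s(i,t).
\]
% An upper bound on the total sensitivity \mathfrak{S} is the quantity
% that controls the coreset size in this framework.
```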

    The Power of Uniform Sampling for Coresets

    Motivated by practical generalizations of the classic $k$-median and $k$-means objectives, such as clustering with size constraints, fair clustering, and Wasserstein barycenter, we introduce a meta-theorem for designing coresets for constrained-clustering problems. The meta-theorem reduces the task of coreset construction to one on a bounded number of ring instances with a much-relaxed additive error. This reduction enables us to construct coresets using uniform sampling, in contrast to the widely-used importance sampling, and consequently we can easily handle constrained objectives. Notably and perhaps surprisingly, this simpler sampling scheme can yield coresets whose size is independent of $n$, the number of input points. Our technique yields smaller coresets, and sometimes the first coresets, for a large number of constrained clustering problems, including capacitated clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in minor-excluded graphs, and polygon clustering under Fréchet and Hausdorff distance. Finally, our technique also yields smaller coresets for $1$-median in low-dimensional Euclidean spaces, specifically of size $\tilde{O}(\varepsilon^{-1.5})$ in $\mathbb{R}^2$ and $\tilde{O}(\varepsilon^{-1.6})$ in $\mathbb{R}^3$.
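The basic building block named in this abstract, uniform sampling with reweighting, can be sketched as follows. This shows only the sampling step: it does not reproduce the paper's reduction to ring instances and carries no coreset guarantee on its own. The function names, the fixed seed, and the use of Euclidean distance are assumptions.

```python
import math
import random


def uniform_weighted_sample(points, m, seed=0):
    """Draw m points uniformly without replacement and give each the
    weight n/m, so that the total weight equals the input size n."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    sample = rng.sample(points, m)
    weight = len(points) / m
    return [(p, weight) for p in sample]


def weighted_cost(weighted_points, center):
    """Weighted 1-median cost: sum of weight * distance to the center."""
    return sum(w * math.dist(p, center) for p, w in weighted_points)
```

A usage example: for `points = [(float(i), 0.0) for i in range(10)]`, calling `uniform_weighted_sample(points, 5)` returns five points of weight `2.0` each, and `weighted_cost` on that sample estimates the full cost for any candidate center. Taking `m = len(points)` recovers the exact cost.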