
    On the Cost of Essentially Fair Clusterings

    Clustering is a fundamental tool in data mining. It partitions points into groups (clusters) and may be used to make decisions for each point based on its group. However, this process may harm protected (minority) classes if the clustering algorithm does not adequately represent them in desirable clusters -- especially if the data is already biased. At NIPS 2017, Chierichetti et al. proposed a model for fair clustering requiring the representation in each cluster to (approximately) preserve the global fraction of each protected class. Restricting to two protected classes, they developed both a 4-approximation for the fair $k$-center problem and an $O(t)$-approximation for the fair $k$-median problem, where $t$ is a parameter for the fairness model. For multiple protected classes, the best known result is a 14-approximation for fair $k$-center. We extend and improve the known results. Firstly, we give a 5-approximation for the fair $k$-center problem with multiple protected classes. Secondly, we propose a relaxed fairness notion under which we can give bicriteria constant-factor approximations for all of the classical clustering objectives $k$-center, $k$-supplier, $k$-median, $k$-means and facility location. The latter approximations are achieved by a framework that takes an arbitrary existing unfair (integral) solution and a fair (fractional) LP solution and combines them into an essentially fair clustering with a weakly supervised rounding scheme. In this way, a fair clustering can be established belatedly, in a situation where the centers are already fixed.
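    To make the fairness notion concrete, here is a minimal Python sketch (with hypothetical names; the paper's model uses parameterized bounds rather than this additive slack) of checking whether a clustering approximately preserves each protected class's global fraction in every cluster:

        from collections import Counter

        def is_essentially_fair(labels, classes, slack=0.1):
            """Check whether each cluster's per-class fraction stays within
            an additive `slack` of that class's global fraction.

            labels[i]  -- cluster id of point i
            classes[i] -- protected class of point i
            """
            n = len(labels)
            global_frac = {c: cnt / n for c, cnt in Counter(classes).items()}
            clusters = {}
            for lab, cls in zip(labels, classes):
                clusters.setdefault(lab, []).append(cls)
            for members in clusters.values():
                counts = Counter(members)
                for c, g in global_frac.items():
                    if abs(counts.get(c, 0) / len(members) - g) > slack:
                        return False
            return True

    For example, a 50/50 global split with slack 0.1 requires every cluster to contain between 40% and 60% of each class.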

    Training Gaussian Mixture Models at Scale via Coresets

    How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in the dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings and do not impose restrictions on the data generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world datasets suggests that our coreset-based approach enables a significant reduction in training time with negligible approximation error.
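    As a rough illustration of the coreset idea (not the paper's construction; the sensitivity proxy and sample size below are placeholder assumptions), one can importance-sample a weighted subset so that hard-to-summarize points are kept with higher probability and reweighted for unbiasedness:

        import numpy as np

        def toy_coreset(X, m, k, seed=0):
            """Importance-sample m weighted points from X (shape n x dim).

            Sensitivity proxy: a uniform term plus squared distance to a
            random k-point summary, so outlying points are sampled more
            often; weights 1/(m*p) keep weighted sums unbiased.
            """
            rng = np.random.default_rng(seed)
            n = len(X)
            centers = X[rng.choice(n, size=k, replace=False)]
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
            s = 1.0 / n + d2 / (d2.sum() + 1e-12)
            p = s / s.sum()
            idx = rng.choice(n, size=m, replace=True, p=p)
            return X[idx], 1.0 / (m * p[idx])

    Note that the downstream fitting procedure must honor the weights (e.g., a weighted EM implementation) for the coreset guarantee to be meaningful.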

    Fast $(1+\varepsilon)$-Approximation Algorithms for Binary Matrix Factorization

    We introduce efficient $(1+\varepsilon)$-approximation algorithms for the binary matrix factorization (BMF) problem, where the inputs are a matrix $\mathbf{A}\in\{0,1\}^{n\times d}$, a rank parameter $k>0$, as well as an accuracy parameter $\varepsilon>0$, and the goal is to approximate $\mathbf{A}$ as a product of low-rank factors $\mathbf{U}\in\{0,1\}^{n\times k}$ and $\mathbf{V}\in\{0,1\}^{k\times d}$. Equivalently, we want to find $\mathbf{U}$ and $\mathbf{V}$ that minimize the Frobenius loss $\|\mathbf{U}\mathbf{V} - \mathbf{A}\|_F^2$. Before this work, the state of the art for this problem was the approximation algorithm of Kumar et al. [ICML 2019], which achieves a $C$-approximation for some constant $C\ge 576$. We give the first $(1+\varepsilon)$-approximation algorithm with running time singly exponential in $k$, where $k$ is typically a small integer. Our techniques generalize to other common variants of the BMF problem, admitting bicriteria $(1+\varepsilon)$-approximation algorithms for $L_p$ loss functions and the setting where matrix operations are performed in $\mathbb{F}_2$. Our approach can be implemented in standard big data models, such as the streaming or distributed models. Comment: ICML 202
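    To see why the running time can depend exponentially on $k$ alone, note that once one factor is fixed, each row (or column) of the other factor can be optimized exactly by enumerating all $2^k$ binary patterns. The sketch below uses this observation inside a plain alternating-minimization baseline -- a local-search heuristic for intuition only, not the paper's $(1+\varepsilon)$-approximation:

        import itertools
        import numpy as np

        def bmf_alternating(A, k, iters=20, seed=0):
            """Heuristic BMF baseline: alternate exact row/column updates.

            P holds all 2^k binary patterns; each update picks, per row of U
            (or column of V), the pattern minimizing the squared error.
            """
            rng = np.random.default_rng(seed)
            P = np.array(list(itertools.product([0, 1], repeat=k)))  # (2^k, k)
            V = rng.integers(0, 2, size=(k, A.shape[1]))
            for _ in range(iters):
                # Best U row per data row: argmin_p ||pV - A_i||^2.
                cost_u = (((P @ V)[None, :, :] - A[:, None, :]) ** 2).sum(-1)
                U = P[cost_u.argmin(1)]
                # Best V column per data column: argmin_p ||Up - A_{:,j}||^2.
                cost_v = (((U @ P.T)[:, :, None] - A[:, None, :]) ** 2).sum(0)
                V = P[cost_v.argmin(0)].T
            return U, V

    The Frobenius loss of the result is then ((U @ V - A) ** 2).sum().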

    Overlapping and Robust Edge-Colored Clustering in Hypergraphs

    A recent trend in data mining has explored (hyper)graph clustering algorithms for data with categorical relationship types. Such algorithms have applications in the analysis of social, co-authorship, and protein interaction networks, to name a few. Many such applications naturally have some overlap between clusters, a nuance which is missing from current combinatorial models. Additionally, existing models lack a mechanism for handling noise in datasets. We address these concerns by generalizing Edge-Colored Clustering, a recent framework for categorical clustering of hypergraphs. Our generalizations allow for a budgeted number of either (a) overlapping cluster assignments or (b) node deletions. For each new model we present a greedy algorithm which approximately minimizes an edge mistake objective, as well as bicriteria approximations where the second approximation factor is on the budget. Additionally, we address the parameterized complexity of each problem, providing FPT algorithms and hardness results.
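    For intuition about the edge mistake objective in the underlying Edge-Colored Clustering problem, here is a simple majority-vote baseline (a heuristic sketch, not the paper's greedy algorithm): each node adopts the color occurring most often among its incident hyperedges, and an edge counts as a mistake unless every one of its nodes received its color:

        from collections import Counter

        def ecc_majority_vote(hyperedges):
            """hyperedges: list of (nodes, color) pairs, e.g. ([1, 2, 3], "red").

            Returns (node_color, mistakes): each node takes its most frequent
            incident color; an edge is a mistake unless all its nodes match it.
            """
            votes = {}
            for nodes, color in hyperedges:
                for v in nodes:
                    votes.setdefault(v, Counter())[color] += 1
            node_color = {v: c.most_common(1)[0][0] for v, c in votes.items()}
            mistakes = sum(
                any(node_color[v] != color for v in nodes)
                for nodes, color in hyperedges
            )
            return node_color, mistakes

    The generalizations described above would additionally allow a budgeted number of nodes to take multiple colors or to be deleted before mistakes are counted.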