
    On the Cost of Essentially Fair Clusterings

    Clustering is a fundamental tool in data mining. It partitions points into groups (clusters) and may be used to make decisions for each point based on its group. However, this process may harm protected (minority) classes if the clustering algorithm does not adequately represent them in desirable clusters -- especially if the data is already biased. At NIPS 2017, Chierichetti et al. proposed a model for fair clustering requiring the representation in each cluster to (approximately) preserve the global fraction of each protected class. Restricting to two protected classes, they developed both a 4-approximation for the fair $k$-center problem and an $O(t)$-approximation for the fair $k$-median problem, where $t$ is a parameter for the fairness model. For multiple protected classes, the best known result is a 14-approximation for fair $k$-center. We extend and improve the known results. Firstly, we give a 5-approximation for the fair $k$-center problem with multiple protected classes. Secondly, we propose a relaxed fairness notion under which we can give bicriteria constant-factor approximations for all of the classical clustering objectives $k$-center, $k$-supplier, $k$-median, $k$-means and facility location. The latter approximations are achieved by a framework that takes an arbitrary existing unfair (integral) solution and a fair (fractional) LP solution and combines them into an essentially fair clustering with a weakly supervised rounding scheme. In this way, a fair clustering can be established belatedly, in a situation where the centers are already fixed.
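    To make the fairness notion concrete, here is a minimal Python sketch (with hypothetical names; the paper's model uses parameterized bounds rather than this additive slack) of checking whether a clustering approximately preserves each protected class's global fraction in every cluster:

        from collections import Counter

        def is_essentially_fair(labels, classes, slack=0.1):
            """Check whether each cluster's per-class fraction stays within
            an additive `slack` of that class's global fraction.

            labels[i]  -- cluster id of point i
            classes[i] -- protected class of point i
            """
            n = len(labels)
            global_frac = {c: cnt / n for c, cnt in Counter(classes).items()}
            clusters = {}
            for lab, cls in zip(labels, classes):
                clusters.setdefault(lab, []).append(cls)
            for members in clusters.values():
                counts = Counter(members)
                for c, g in global_frac.items():
                    if abs(counts.get(c, 0) / len(members) - g) > slack:
                        return False
            return True

    For example, a 50/50 global split with slack 0.1 requires every cluster to contain between 40% and 60% of each class.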

    Training Gaussian Mixture Models at Scale via Coresets

    How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in the dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings and do not impose restrictions on the data generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world datasets suggests that our coreset-based approach enables a significant reduction in training time with negligible approximation error.
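    As a rough illustration of the coreset idea (not the paper's construction; the sensitivity proxy and sample size below are placeholder assumptions), one can importance-sample a weighted subset so that hard-to-summarize points are kept with higher probability and reweighted for unbiasedness:

        import numpy as np

        def toy_coreset(X, m, k, seed=0):
            """Importance-sample m weighted points from X (shape n x dim).

            Sensitivity proxy: a uniform term plus squared distance to a
            random k-point summary, so outlying points are sampled more
            often; weights 1/(m*p) keep weighted sums unbiased.
            """
            rng = np.random.default_rng(seed)
            n = len(X)
            centers = X[rng.choice(n, size=k, replace=False)]
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
            s = 1.0 / n + d2 / (d2.sum() + 1e-12)
            p = s / s.sum()
            idx = rng.choice(n, size=m, replace=True, p=p)
            return X[idx], 1.0 / (m * p[idx])

    Note that the downstream fitting procedure must honor the weights (e.g., a weighted EM implementation) for the coreset guarantee to be meaningful.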

    Fast $(1+\varepsilon)$-Approximation Algorithms for Binary Matrix Factorization

    We introduce efficient $(1+\varepsilon)$-approximation algorithms for the binary matrix factorization (BMF) problem, where the inputs are a matrix $\mathbf{A}\in\{0,1\}^{n\times d}$, a rank parameter $k>0$, as well as an accuracy parameter $\varepsilon>0$, and the goal is to approximate $\mathbf{A}$ as a product of low-rank factors $\mathbf{U}\in\{0,1\}^{n\times k}$ and $\mathbf{V}\in\{0,1\}^{k\times d}$. Equivalently, we want to find $\mathbf{U}$ and $\mathbf{V}$ that minimize the Frobenius loss $\|\mathbf{U}\mathbf{V} - \mathbf{A}\|_F^2$. Before this work, the state of the art for this problem was the approximation algorithm of Kumar et al. [ICML 2019], which achieves a $C$-approximation for some constant $C\ge 576$. We give the first $(1+\varepsilon)$-approximation algorithm with running time singly exponential in $k$, where $k$ is typically a small integer. Our techniques generalize to other common variants of the BMF problem, admitting bicriteria $(1+\varepsilon)$-approximation algorithms for $L_p$ loss functions and the setting where matrix operations are performed in $\mathbb{F}_2$. Our approach can be implemented in standard big data models, such as the streaming or distributed models. Comment: ICML 202
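    To see why the running time can depend exponentially on $k$ alone, note that once one factor is fixed, each row (or column) of the other factor can be optimized exactly by enumerating all $2^k$ binary patterns. The sketch below uses this observation inside a plain alternating-minimization baseline -- a local-search heuristic for intuition only, not the paper's $(1+\varepsilon)$-approximation:

        import itertools
        import numpy as np

        def bmf_alternating(A, k, iters=20, seed=0):
            """Heuristic BMF baseline: alternate exact row/column updates.

            P holds all 2^k binary patterns; each update picks, per row of U
            (or column of V), the pattern minimizing the squared error.
            """
            rng = np.random.default_rng(seed)
            P = np.array(list(itertools.product([0, 1], repeat=k)))  # (2^k, k)
            V = rng.integers(0, 2, size=(k, A.shape[1]))
            for _ in range(iters):
                # Best U row per data row: argmin_p ||pV - A_i||^2.
                cost_u = (((P @ V)[None, :, :] - A[:, None, :]) ** 2).sum(-1)
                U = P[cost_u.argmin(1)]
                # Best V column per data column: argmin_p ||Up - A_{:,j}||^2.
                cost_v = (((U @ P.T)[:, :, None] - A[:, None, :]) ** 2).sum(0)
                V = P[cost_v.argmin(0)].T
            return U, V

    The Frobenius loss of the result is then ((U @ V - A) ** 2).sum().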

    Overlapping and Robust Edge-Colored Clustering in Hypergraphs

    A recent trend in data mining has explored (hyper)graph clustering algorithms for data with categorical relationship types. Such algorithms have applications in the analysis of social, co-authorship, and protein interaction networks, to name a few. Many such applications naturally have some overlap between clusters, a nuance which is missing from current combinatorial models. Additionally, existing models lack a mechanism for handling noise in datasets. We address these concerns by generalizing Edge-Colored Clustering, a recent framework for categorical clustering of hypergraphs. Our generalizations allow for a budgeted number of either (a) overlapping cluster assignments or (b) node deletions. For each new model we present a greedy algorithm which approximately minimizes an edge mistake objective, as well as bicriteria approximations where the second approximation factor is on the budget. Additionally, we address the parameterized complexity of each problem, providing FPT algorithms and hardness results.
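    For intuition about the edge mistake objective in the underlying Edge-Colored Clustering problem, here is a simple majority-vote baseline (a heuristic sketch, not the paper's greedy algorithm): each node adopts the color occurring most often among its incident hyperedges, and an edge counts as a mistake unless every one of its nodes received its color:

        from collections import Counter

        def ecc_majority_vote(hyperedges):
            """hyperedges: list of (nodes, color) pairs, e.g. ([1, 2, 3], "red").

            Returns (node_color, mistakes): each node takes its most frequent
            incident color; an edge is a mistake unless all its nodes match it.
            """
            votes = {}
            for nodes, color in hyperedges:
                for v in nodes:
                    votes.setdefault(v, Counter())[color] += 1
            node_color = {v: c.most_common(1)[0][0] for v, c in votes.items()}
            mistakes = sum(
                any(node_color[v] != color for v in nodes)
                for nodes, color in hyperedges
            )
            return node_color, mistakes

    The generalizations described above would additionally allow a budgeted number of nodes to take multiple colors or to be deleted before mistakes are counted.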