257 research outputs found
On the Cost of Essentially Fair Clusterings
Clustering is a fundamental tool in data mining. It partitions points into
groups (clusters) and may be used to make decisions for each point based on its
group. However, this process may harm protected (minority) classes if the
clustering algorithm does not adequately represent them in desirable clusters
-- especially if the data is already biased.
At NIPS 2017, Chierichetti et al. proposed a model for fair clustering
requiring the representation in each cluster to (approximately) preserve the
global fraction of each protected class. Restricting to two protected classes,
they developed both a 4-approximation for the fair k-center problem and an
O(t)-approximation for the fair k-median problem, where t is a parameter
for the fairness model. For multiple protected classes, the best known result
is a 14-approximation for fair k-center.
We extend and improve the known results. Firstly, we give a 5-approximation
for the fair k-center problem with multiple protected classes. Secondly, we
propose a relaxed fairness notion under which we can give bicriteria
constant-factor approximations for all of the classical clustering objectives
k-center, k-supplier, k-median, k-means and facility location. The
latter approximations are achieved by a framework that takes an arbitrary
existing unfair (integral) solution and a fair (fractional) LP solution and
combines them into an essentially fair clustering with a weakly supervised
rounding scheme. In this way, a fair clustering can be established belatedly,
in a situation where the centers are already fixed.
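The fairness notion above is easy to state in code: each cluster should (approximately) preserve the global fraction of every protected class. A minimal sketch, with a toy instance and helper names (`max_fairness_violation`) chosen purely for illustration:

```python
from collections import Counter

def class_fractions(labels):
    """Fraction of each protected class among the given labels."""
    counts = Counter(labels)
    return {c: n / len(labels) for c, n in counts.items()}

def max_fairness_violation(classes, assignment):
    """Largest deviation, over all clusters and classes, between a
    cluster's class fraction and the global class fraction."""
    global_frac = class_fractions(classes)
    clusters = {}
    for cls, cluster in zip(classes, assignment):
        clusters.setdefault(cluster, []).append(cls)
    return max(
        abs(class_fractions(members).get(c, 0.0) - g)
        for members in clusters.values()
        for c, g in global_frac.items()
    )

# Toy instance: two protected classes, globally split 50/50.
classes = ['a', 'b', 'a', 'b']
print(max_fairness_violation(classes, [0, 0, 1, 1]))  # 0.0: both clusters are 50/50
print(max_fairness_violation(classes, [0, 1, 0, 1]))  # 0.5: one all-'a' and one all-'b' cluster
```

An exactly fair clustering achieves violation 0; the relaxed notion in the paper tolerates a bounded deviation from it.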
Training Gaussian Mixture Models at Scale via Coresets
How can we train a statistical mixture model on a massive data set? In this
work we show how to construct coresets for mixtures of Gaussians. A coreset is
a weighted subset of the data, which guarantees that models fitting the coreset
also provide a good fit for the original data set. We show that, perhaps
surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension
and the number of mixture components, while being independent of the data set
size. Hence, one can harness computationally intensive algorithms to compute a
good approximation on a significantly smaller data set. More importantly, such
coresets can be efficiently constructed both in distributed and streaming
settings and do not impose restrictions on the data generating process. Our
results rely on a novel reduction of statistical estimation to problems in
computational geometry and new combinatorial complexity results for mixtures of
Gaussians. Empirical evaluation on several real-world datasets suggests that
our coreset-based approach enables significant reduction in training-time with
negligible approximation error.
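The core idea, a small weighted subset standing in for the full data set, can be sketched with a simplified sensitivity-sampling construction. This is only an illustration under assumed names (`coreset`, `rough_centers`), not the paper's actual construction: points far from a rough clustering are sampled with higher probability and down-weighted so that weighted estimates stay unbiased.

```python
import numpy as np

def coreset(points, centers, m, rng):
    """Sample a weighted subset of m points via simplified
    sensitivity sampling: points poorly served by the rough
    centers are picked with higher probability and down-weighted
    so that weighted sums remain unbiased estimates."""
    # squared distance from each point to its nearest rough center
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    # mix with the uniform distribution so no probability is zero
    prob = 0.5 * d2 / d2.sum() + 0.5 / len(points)
    idx = rng.choice(len(points), size=m, p=prob)
    return points[idx], 1.0 / (m * prob[idx])

rng = np.random.default_rng(0)
# hypothetical data from a 2-component mixture in 2-D
data = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(6, 1, (500, 2))])
rough_centers = np.array([[0.0, 0.0], [6.0, 6.0]])
pts, weights = coreset(data, rough_centers, m=50, rng=rng)
# the weights let 50 points stand in for all 1000:
# their sum concentrates around the original data set size
```

A downstream (weighted) mixture fit would then run on `pts` and `weights` instead of the full data.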
Fast (1+ε)-Approximation Algorithms for Binary Matrix Factorization
We introduce efficient (1+ε)-approximation algorithms for the
binary matrix factorization (BMF) problem, where the inputs are a matrix
A ∈ {0,1}^{n×d}, a rank parameter k > 0, as well as an
accuracy parameter ε > 0, and the goal is to approximate A
as a product of low-rank factors U ∈ {0,1}^{n×k} and
V ∈ {0,1}^{k×d}. Equivalently, we want to find U
and V that minimize the Frobenius loss ‖UV − A‖_F². Before this work, the state-of-the-art for this problem was
the approximation algorithm of Kumar et al. [ICML 2019], which achieves a
C-approximation for some constant C. We give the first
(1+ε)-approximation algorithm using running time singly exponential
in k, where k is typically a small integer. Our techniques generalize to
other common variants of the BMF problem, admitting bicriteria
(1+ε)-approximation algorithms for L_p loss functions and the
setting where matrix operations are performed over the finite field F_2. Our approach
can be implemented in standard big data models, such as the streaming or
distributed models.
Comment: ICML 2023
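For intuition, the BMF objective itself is easy to state in code. A brute-force solver over all binary factor pairs, feasible only for tiny n, d, k and unrelated to the paper's algorithm, might look like:

```python
import itertools
import numpy as np

def frobenius_loss(A, U, V):
    """Squared Frobenius loss ||UV - A||_F^2 over the integers."""
    return int(((U @ V - A) ** 2).sum())

def brute_force_bmf(A, k):
    """Try every binary U (n x k) and V (k x d); exponential in
    n*k + k*d, so only feasible for tiny instances."""
    n, d = A.shape
    best, best_loss = None, None
    for u_bits in itertools.product([0, 1], repeat=n * k):
        U = np.array(u_bits).reshape(n, k)
        for v_bits in itertools.product([0, 1], repeat=k * d):
            V = np.array(v_bits).reshape(k, d)
            loss = frobenius_loss(A, U, V)
            if best_loss is None or loss < best_loss:
                best, best_loss = (U, V), loss
    return best, best_loss

# A is exactly rank-1 over {0,1}, so k = 1 recovers it with zero loss
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 0]])
(U, V), loss = brute_force_bmf(A, k=1)
print(loss)  # 0
```

The exhaustive search visits 2^(nk + kd) factor pairs, which is exactly the exponential dependence the paper's singly-exponential-in-k algorithms avoid.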
Overlapping and Robust Edge-Colored Clustering in Hypergraphs
A recent trend in data mining has explored (hyper)graph clustering algorithms
for data with categorical relationship types. Such algorithms have applications
in the analysis of social, co-authorship, and protein interaction networks, to
name a few. Many such applications naturally have some overlap between
clusters, a nuance which is missing from current combinatorial models.
Additionally, existing models lack a mechanism for handling noise in datasets.
We address these concerns by generalizing Edge-Colored Clustering, a recent
framework for categorical clustering of hypergraphs. Our generalizations allow
for a budgeted number of either (a) overlapping cluster assignments or (b) node
deletions. For each new model we present a greedy algorithm which approximately
minimizes an edge mistake objective, as well as bicriteria approximations where
the second approximation factor is on the budget. Additionally, we address the
parameterized complexity of each problem, providing FPT algorithms and hardness
results.
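In Edge-Colored Clustering, each hyperedge carries a category (color), each node is assigned one color, and an edge counts as a mistake if any of its nodes received a different color. A small sketch of this mistake objective, paired with a naive majority-vote baseline (not the paper's algorithm; all names are illustrative):

```python
from collections import Counter

def mistakes(edges, node_color):
    """Count hyperedges containing at least one node whose assigned
    color differs from the edge's color (the edge mistake objective)."""
    return sum(
        any(node_color[v] != color for v in members)
        for members, color in edges
    )

def majority_vote(edges, nodes):
    """Give each node the most common color among its incident
    hyperedges -- a simple baseline with no approximation guarantee."""
    incident = {v: Counter() for v in nodes}
    for members, color in edges:
        for v in members:
            incident[v][color] += 1
    return {v: c.most_common(1)[0][0] for v, c in incident.items()}

# toy hypergraph: each edge is (node set, category color)
edges = [({1, 2}, 'red'), ({1, 3}, 'red'), ({2, 3}, 'red'), ({3, 4}, 'blue')]
coloring = majority_vote(edges, nodes={1, 2, 3, 4})
print(mistakes(edges, coloring))  # 1: only the 'blue' edge is violated
```

The overlapping and robust variants studied above relax exactly this rigid setup: a budget of extra cluster assignments per node, or of node deletions, can avoid such unavoidable mistakes.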