Fair Correlation Clustering in Forests
The study of algorithmic fairness has received growing attention recently. This stems from the awareness that bias in the input data of machine learning systems may result in discriminatory outputs. For clustering tasks, one of the most central notions of fairness is the formalization by Chierichetti, Kumar, Lattanzi, and Vassilvitskii [NeurIPS 2017]. A clustering is said to be fair if each cluster has the same distribution of manifestations of a sensitive attribute as the whole input set. This is motivated by various applications where the objects to be clustered have sensitive attributes that should not be over- or underrepresented. Most research on this version of fair clustering has focused on centroid-based objectives.
In contrast, we discuss the applicability of this fairness notion to Correlation Clustering. The existing literature on the resulting Fair Correlation Clustering problem either presents approximation algorithms with poor approximation guarantees or severely limits the possible distributions of the sensitive attribute (often only two manifestations with a 1:1 ratio are considered). Our goal is to understand whether better results are possible between these two extremes. To this end, we consider restricted graph classes which allow us to characterize the distributions of sensitive attributes for which this form of fairness is tractable from a complexity point of view.
While existing work on Fair Correlation Clustering gives approximation algorithms, we focus on exact solutions and investigate whether there are efficiently solvable instances. The unfair version of Correlation Clustering is trivial on forests, but adding fairness creates a surprisingly rich picture of complexities. We give an overview of the distributions and types of forests where Fair Correlation Clustering turns from tractable to intractable.
Most surprisingly, the cause of the hardness of Fair Correlation Clustering is not the strictness of the fairness condition: we lift most of our results to also hold for a relaxed version of the fairness condition. Instead, the source of hardness seems to be the distribution of the sensitive attribute. On the positive side, we identify some reasonable distributions that are indeed tractable. While this tractability is only shown for forests, it may open an avenue to designing reasonable approximations for larger graph classes.
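As a concrete illustration of the fairness condition above (every cluster must mirror the global distribution of the sensitive attribute), the following sketch checks exact fairness of a candidate clustering; the function name and data representation are illustrative, not from the paper:

```python
from collections import Counter
from fractions import Fraction

def is_exactly_fair(color_of, clusters):
    """Return True iff every cluster has exactly the same color
    distribution as the whole vertex set (Chierichetti et al. fairness)."""
    vertices = [v for cluster in clusters for v in cluster]
    total = Counter(color_of[v] for v in vertices)
    n = len(vertices)
    for cluster in clusters:
        counts = Counter(color_of[v] for v in cluster)
        for color, overall in total.items():
            # exact rational comparison avoids floating-point surprises
            if Fraction(counts[color], len(cluster)) != Fraction(overall, n):
                return False
    return True

# A 1:1 red/blue instance: pairing one red with one blue per cluster is
# fair; grouping by color is not.
color_of = {0: "red", 1: "blue", 2: "red", 3: "blue"}
```

On this instance, `is_exactly_fair(color_of, [[0, 1], [2, 3]])` holds, while `is_exactly_fair(color_of, [[0, 2], [1, 3]])` does not.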
Fair Correlation Clustering in General Graphs
We consider the family of Correlation Clustering optimization problems under fairness constraints. In Correlation Clustering we are given a graph whose every edge is labeled either with a + or a -, and the goal is to find a clustering that agrees the most with the labels: + edges within clusters and - edges across clusters. The notion of fairness implies that there is no over- or under-representation of vertices in the clustering: every vertex has a color, and the distribution of colors within each cluster is required to be the same as the distribution of colors in the input graph. Previously, approximation algorithms were known only for fair disagreement minimization in complete unweighted graphs. We prove the following: (1) there is no finite approximation for fair disagreement minimization in general graphs unless P = NP (this hardness holds also for bicriteria algorithms); and (2) fair agreement maximization in general graphs admits a bicriteria approximation of ≈ 0.591 (an improved ≈ 0.609 true approximation is given for the special case of two uniformly distributed colors). Our algorithm is based on proving that the sticky Brownian motion rounding of [Abbasi Zadeh-Bansal-Guruganesh-Nikolov-Schwartz-Singh SODA'20] copes well with uncut edges.
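The disagreement objective described above can be computed directly from the signed edge list; a minimal sketch (names are illustrative, not from the paper):

```python
def disagreements(signed_edges, cluster_of):
    """Cost of a clustering in Correlation Clustering: each '+' edge cut
    between clusters and each '-' edge kept inside a cluster costs 1."""
    cost = 0
    for u, v, sign in signed_edges:
        same_cluster = cluster_of[u] == cluster_of[v]
        if (sign == "+") != same_cluster:  # label and placement disagree
            cost += 1
    return cost

# A triangle with one '+' edge: putting a,b together and c alone
# satisfies every label, so the cost is 0.
edges = [("a", "b", "+"), ("b", "c", "-"), ("a", "c", "-")]
```

Agreement maximization counts the complementary quantity, the number of edges whose label and placement agree.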
Single-Pass Pivot Algorithm for Correlation Clustering. Keep it simple!
We show that a simple single-pass semi-streaming variant of the Pivot algorithm for Correlation Clustering gives a (3 + ε)-approximation using O(n/ε) words of memory. This is a slight improvement over the recent results of Cambus, Kuhn, Lindy, Pai, and Uitto, who gave a (3 + ε)-approximation using O(n log n) words of memory, and Behnezhad, Charikar, Ma, and Tan, who gave a 5-approximation using O(n) words of memory. One of the main contributions of this paper is that both the algorithm and its analysis are very simple; moreover, the algorithm is easy to implement.
Spectral Normalized-Cut Graph Partitioning with Fairness Constraints
Normalized-cut graph partitioning aims to divide the set of nodes in a graph
into disjoint clusters to minimize the fraction of the total edges between
any cluster and all other clusters. In this paper, we consider a fair variant
of the partitioning problem wherein nodes are characterized by a categorical
sensitive attribute (e.g., gender or race) indicating membership to different
demographic groups. Our goal is to ensure that each group is approximately
proportionally represented in each cluster while minimizing the normalized cut
value. To resolve this problem, we propose a two-phase spectral algorithm
called FNM. In the first phase, we add an augmented Lagrangian term based on
our fairness criteria to the objective function for obtaining a fairer spectral
node embedding. Then, in the second phase, we design a rounding scheme to
produce clusters from the fair embedding that effectively trades off
fairness and partition quality. Through comprehensive experiments on nine
benchmark datasets, we demonstrate the superior performance of FNM compared
with three baseline methods. Comment: 17 pages, 7 figures, accepted to the 26th European Conference on Artificial Intelligence (ECAI 2023).
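The first phase of FNM augments a spectral relaxation with a fairness term; for context, a minimal unconstrained 2-way normalized-cut sketch (without the paper's augmented Lagrangian fairness term, and with illustrative names) looks like this:

```python
import numpy as np

def spectral_bipartition(adj):
    """Two-way normalized-cut relaxation: split nodes by the Fiedler
    vector (second eigenvector) of the symmetric normalized Laplacian.
    FNM's first phase would add a fairness penalty to this objective."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt
    _, vecs = np.linalg.eigh(lap)  # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return (fiedler >= np.median(fiedler)).astype(int)
```

On a graph made of two dense groups joined by a weak edge, the split falls across the weak edge; FNM's second phase replaces the simple sign/median rounding here with a fairness-aware rounding scheme.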
One Partition Approximating All ℓ_p-norm Objectives in Correlation Clustering
This paper considers correlation clustering on unweighted complete graphs. We give a combinatorial algorithm that returns a single clustering solution that is simultaneously O(1)-approximate for all ℓ_p-norms of the disagreement vector. This proves that minimal sacrifice is needed in order to optimize different norms of the disagreement vector. Our algorithm is the first combinatorial approximation algorithm for the ℓ_2-norm objective, and more generally the first combinatorial algorithm for the ℓ_p-norm objective for p strictly between 1 and infinity. It is also faster than all previous algorithms that minimize the ℓ_p-norm of the disagreement vector, with run-time O(n^ω), where O(n^ω) is the time for matrix multiplication on n × n matrices. When the maximum positive degree in the graph is at most Δ, this can be improved. Comment: 27 pages, 2 figures
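The disagreement vector underlying these ℓ_p objectives records, for each vertex, how many of its incident pairs are mis-labeled by the clustering; a small sketch on a complete unweighted instance (names illustrative):

```python
import math

def disagreement_vector(n, plus_pairs, cluster_of):
    """Per-vertex disagreement counts on a complete unweighted instance;
    any pair not listed in plus_pairs is treated as a '-' edge."""
    plus = {frozenset(p) for p in plus_pairs}
    vec = [0] * n
    for u in range(n):
        for v in range(u + 1, n):
            same = cluster_of[u] == cluster_of[v]
            if (frozenset((u, v)) in plus) != same:  # disagreement
                vec[u] += 1
                vec[v] += 1
    return vec

def lp_norm(vec, p):
    """The p-norm of the disagreement vector; p = math.inf gives the max."""
    return float(max(vec)) if p == math.inf else sum(x ** p for x in vec) ** (1 / p)
```

The 1-norm is twice the total number of disagreements, and the infinity-norm is the worst-off vertex; the paper's point is that one clustering can be near-optimal for all p at once.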
Efficient Correlation Clustering Methods for Large Consensus Clustering Instances
Consensus clustering (or clustering aggregation) takes as input a collection of partitions of a given ground set and seeks to create a single partition that minimizes disagreement with all input partitions. State-of-the-art algorithms for consensus clustering are based on correlation clustering methods like the popular Pivot algorithm. Unfortunately, these methods have not proved to be practical for consensus clustering instances where either the number of input partitions or the ground set gets large.
In this paper we provide practical run-time improvements for correlation clustering solvers when the ground set is large. We reduce both the time and space complexity of Pivot, a significant savings since in practice the number of input partitions is much smaller than the ground set. We also analyze a sampling method for these algorithms when the number of input partitions is large, bridging the gap between running Pivot on the full set of input partitions (an expected 1.57-approximation) and choosing a single input partition at random (an expected 2-approximation). We show experimentally that algorithms like Pivot do obtain quality clustering results in practice even on small samples of input partitions.
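One standard way to turn a consensus instance into a correlation clustering instance, which Pivot-style solvers then run on, is to label a pair '+' when the input partitions mostly co-cluster it; a sketch of this reduction (the exact weighting used by any particular solver may differ):

```python
from itertools import combinations

def majority_signed_graph(partitions, elements):
    """Reduce consensus clustering to correlation clustering: a pair gets
    a '+' edge when at least half of the input partitions co-cluster it,
    and a '-' edge otherwise (only the '+' pairs are returned)."""
    plus = set()
    for u, v in combinations(elements, 2):
        together = sum(1 for part in partitions if part[u] == part[v])
        if 2 * together >= len(partitions):
            plus.add(frozenset((u, v)))
    return plus
```

Sampling input partitions, as analyzed above, simply computes these majority votes over a random subset of the partitions instead of all of them.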
Densest Diverse Subgraphs: How to Plan a Successful Cocktail Party with Diversity
Dense subgraph discovery methods are routinely used in a variety of
applications including the identification of a team of skilled individuals for
collaboration from a social network. However, when the network's node set is
associated with a sensitive attribute such as race, gender, religion, or
political opinion, the lack of diversity can lead to lawsuits.
In this work, we focus on the problem of finding a densest diverse subgraph
in a graph whose nodes have different attribute values/types that we refer to
as colors. We propose two novel formulations motivated by different realistic scenarios. Our first formulation, called the densest diverse subgraph problem (DDSP), guarantees that no color represents more than some given fraction of the nodes in the output subgraph, which generalizes the state-of-the-art due to Anagnostopoulos et al. (CIKM 2020). By varying the fraction we can tune the diversity constraint and interpolate from a diverse dense subgraph, where all colors have to be equally represented, to an unconstrained dense subgraph. We design a scalable approximation algorithm whose guarantee depends on the number of nodes. Our second formulation is motivated by the setting where
any specified color should not be overlooked. We propose the densest at-least-k⃗-subgraph problem (Dalk⃗S), a novel generalization of the classic densest at-least-k-subgraph problem (DalkS), where instead of a single value k we have a vector of cardinality demands with one coordinate per color class. We design an approximation algorithm for Dalk⃗S using linear programming together with an acceleration technique. Computational experiments using synthetic and real-world datasets demonstrate that our proposed algorithms are effective in extracting dense diverse clusters. Comment: Accepted to KDD 2023.
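The two ingredients of DDSP, subgraph density and the per-color cap, are easy to state in code; a sketch with illustrative names (the paper's algorithms optimize over all subgraphs, whereas this only evaluates one candidate):

```python
from collections import Counter

def density(nodes, edges):
    """Average-degree density |E(S)| / |S| of the induced subgraph."""
    s = set(nodes)
    internal = sum(1 for u, v in edges if u in s and v in s)
    return internal / len(s)

def is_diverse(nodes, color_of, alpha):
    """DDSP-style cap: no single color may exceed an alpha fraction of
    the chosen node set."""
    counts = Counter(color_of[v] for v in nodes)
    return max(counts.values()) <= alpha * len(nodes)
```

Setting alpha to 1 divided by the number of colors forces equal representation, while alpha = 1 recovers the unconstrained densest subgraph problem, matching the interpolation described above.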