    Fair Correlation Clustering in Forests

    The study of algorithmic fairness has received growing attention recently. This stems from the awareness that bias in the input data for machine learning systems may result in discriminatory outputs. For clustering tasks, one of the most central notions of fairness is the formalization by Chierichetti, Kumar, Lattanzi, and Vassilvitskii [NeurIPS 2017]. A clustering is said to be fair if each cluster has the same distribution of manifestations of a sensitive attribute as the whole input set. This is motivated by various applications where the objects to be clustered have sensitive attributes that should not be over- or underrepresented. Most research on this version of fair clustering has focused on centroid-based objectives. In contrast, we discuss the applicability of this fairness notion to Correlation Clustering. The existing literature on the resulting Fair Correlation Clustering problem either presents approximation algorithms with poor approximation guarantees or severely limits the possible distributions of the sensitive attribute (often only two manifestations with a 1:1 ratio are considered). Our goal is to understand if there is hope for better results in between these two extremes. To this end, we consider restricted graph classes which allow us to characterize the distributions of sensitive attributes for which this form of fairness is tractable from a complexity point of view. While existing work on Fair Correlation Clustering gives approximation algorithms, we focus on exact solutions and investigate whether there are efficiently solvable instances. The unfair version of Correlation Clustering is trivial on forests, but adding fairness creates a surprisingly rich picture of complexities. We give an overview of the distributions and types of forests where Fair Correlation Clustering turns from tractable to intractable.
We consider the most surprising insight to be that the hardness of Fair Correlation Clustering is not caused by the strictness of the fairness condition: we lift most of our results to also hold for the relaxed version of the fairness condition. Instead, the source of hardness seems to be the distribution of the sensitive attribute. On the positive side, we identify some reasonable distributions that are indeed tractable. While this tractability is only shown for forests, it may open an avenue to design reasonable approximations for larger graph classes.
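
The fairness notion described above (every cluster has the same distribution of the sensitive attribute as the whole input) can be checked mechanically. The following is a minimal sketch, not taken from the paper; the names `color_distribution`, `is_fair`, and the toy instance are hypothetical and only illustrate the exact (non-relaxed) condition.

```python
from collections import Counter
from fractions import Fraction

def color_distribution(vertices, color):
    """Return the exact share of each color among the given vertices."""
    counts = Counter(color[v] for v in vertices)
    total = len(vertices)
    return {c: Fraction(n, total) for c, n in counts.items()}

def is_fair(clusters, color):
    """True if every cluster has the same color distribution as the whole set."""
    all_vertices = [v for cluster in clusters for v in cluster]
    target = color_distribution(all_vertices, color)
    return all(color_distribution(cluster, color) == target for cluster in clusters)

# Hypothetical toy instance: 2 red and 4 blue vertices (ratio 1:2).
color = {0: "red", 1: "blue", 2: "blue", 3: "red", 4: "blue", 5: "blue"}
fair = [[0, 1, 2], [3, 4, 5]]      # each cluster is 1 red : 2 blue
unfair = [[0, 3], [1, 2, 4, 5]]    # colors are separated
print(is_fair(fair, color), is_fair(unfair, color))  # True False
```

Exact fractions are used instead of floats so that the equality test on distributions is not affected by rounding.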

    Fair Correlation Clustering in General Graphs

    We consider the family of Correlation Clustering optimization problems under fairness constraints. In Correlation Clustering we are given a graph in which every edge is labeled either + or -, and the goal is to find a clustering that agrees the most with the labels: + edges within clusters and - edges across clusters. The notion of fairness implies that there is no over- or under-representation of vertices in the clustering: every vertex has a color, and the distribution of colors within each cluster is required to be the same as the distribution of colors in the input graph. Previously, approximation algorithms were known only for fair disagreement minimization in complete unweighted graphs. We prove the following: (1) there is no finite approximation for fair disagreement minimization in general graphs unless P = NP (this hardness also holds for bicriteria algorithms); and (2) fair agreement maximization in general graphs admits a bicriteria approximation of ≈ 0.591 (an improved ≈ 0.609 true approximation is given for the special case of two uniformly distributed colors). Our algorithm is based on proving that the sticky Brownian motion rounding of [Abbasi Zadeh-Bansal-Guruganesh-Nikolov-Schwartz-Singh SODA'20] copes well with uncut edges.
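
To make the two objectives concrete, the sketch below counts agreements and disagreements of a clustering on a signed graph. It is only an illustration of the objective as described in the abstract; the function name `disagreements` and the example graph are assumptions, and no fairness constraint or rounding scheme is implemented here.

```python
def disagreements(signed_edges, cluster_of):
    """Count edges whose label disagrees with the clustering.

    A '+' edge disagrees when its endpoints are in different clusters,
    a '-' edge disagrees when its endpoints share a cluster.
    """
    cost = 0
    for u, v, sign in signed_edges:
        same = cluster_of[u] == cluster_of[v]
        if (sign == "+" and not same) or (sign == "-" and same):
            cost += 1
    return cost

# Hypothetical signed graph on four vertices (not necessarily complete).
signed_edges = [(0, 1, "+"), (1, 2, "+"), (0, 2, "-"), (2, 3, "+")]
cluster_of = {0: "A", 1: "A", 2: "B", 3: "B"}

bad = disagreements(signed_edges, cluster_of)
good = len(signed_edges) - bad  # agreements are the remaining edges
print(bad, good)
```

Minimizing `bad` and maximizing `good` have the same optimal clusterings, but as the abstract indicates, their approximability differs.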

    Single-Pass Pivot Algorithm for Correlation Clustering. Keep it simple!

    We show that a simple single-pass semi-streaming variant of the Pivot algorithm for Correlation Clustering gives a $(3 + \epsilon)$-approximation using $O(n/\epsilon)$ words of memory. This is a slight improvement over the recent results of Cambus, Kuhn, Lindy, Pai, and Uitto, who gave a $(3 + \epsilon)$-approximation using $O(n \log n)$ words of memory, and Behnezhad, Charikar, Ma, and Tan, who gave a 5-approximation using $O(n)$ words of memory. One of the main contributions of this paper is that both the algorithm and its analysis are very simple, and the algorithm is easy to implement.
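
For orientation, here is a minimal sketch of the classic offline Pivot algorithm that the streaming variant builds on: process vertices in random order, and let each still-unclustered vertex become a pivot that absorbs its unclustered positive neighbors. This is not the single-pass semi-streaming variant from the abstract, and the representation (`positive_neighbors` as a symmetric adjacency dictionary) is an assumption made for the example.

```python
import random

def pivot(vertices, positive_neighbors, seed=None):
    """Classic Pivot: returns a dict mapping each vertex to its pivot."""
    rng = random.Random(seed)
    order = list(vertices)
    rng.shuffle(order)  # uniformly random processing order
    cluster_of = {}
    for p in order:
        if p in cluster_of:
            continue  # already absorbed by an earlier pivot
        cluster_of[p] = p
        for u in positive_neighbors.get(p, ()):
            if u not in cluster_of:
                cluster_of[u] = p
    return cluster_of

# Hypothetical instance: a positive triangle and a positive edge.
positive_neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
clusters = pivot(range(5), positive_neighbors, seed=0)
print(clusters)
```

On this instance every random order yields the clusters {0, 1, 2} and {3, 4}, since each positive component is a clique.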

    Spectral Normalized-Cut Graph Partitioning with Fairness Constraints

    Normalized-cut graph partitioning aims to divide the set of nodes in a graph into $k$ disjoint clusters to minimize the fraction of the total edges between any cluster and all other clusters. In this paper, we consider a fair variant of the partitioning problem wherein nodes are characterized by a categorical sensitive attribute (e.g., gender or race) indicating membership to different demographic groups. Our goal is to ensure that each group is approximately proportionally represented in each cluster while minimizing the normalized cut value. To solve this problem, we propose a two-phase spectral algorithm called FNM. In the first phase, we add an augmented Lagrangian term based on our fairness criteria to the objective function for obtaining a fairer spectral node embedding. Then, in the second phase, we design a rounding scheme to produce $k$ clusters from the fair embedding that effectively trades off fairness and partition quality. Through comprehensive experiments on nine benchmark datasets, we demonstrate the superior performance of FNM compared with three baseline methods. Comment: 17 pages, 7 figures, accepted to the 26th European Conference on Artificial Intelligence (ECAI 2023)
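
As a point of reference, the normalized cut value of a partition can be computed as below. This sketch assumes the standard definition (for each cluster, cut edges divided by the cluster's volume, summed over clusters) on an unweighted graph; whether FNM uses exactly this normalization is an assumption, and the function name `normalized_cut` is hypothetical. It does not implement the spectral embedding or the rounding phase.

```python
def normalized_cut(edges, cluster_of):
    """Sum over clusters of (edges leaving the cluster) / (cluster volume).

    Volume is the sum of degrees of the cluster's nodes. Clusters whose
    nodes have no incident edges are skipped to avoid division by zero.
    """
    cut = {}
    vol = {}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        vol[cu] = vol.get(cu, 0) + 1
        vol[cv] = vol.get(cv, 0) + 1
        if cu != cv:
            cut[cu] = cut.get(cu, 0) + 1
            cut[cv] = cut.get(cv, 0) + 1
    return sum(cut.get(c, 0) / vol[c] for c in vol)

# Hypothetical graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
cluster_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
ncut = normalized_cut(edges, cluster_of)  # 1/7 + 1/7
print(ncut)
```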

    One Partition Approximating All $\ell_p$-norm Objectives in Correlation Clustering

    This paper considers correlation clustering on unweighted complete graphs. We give a combinatorial algorithm that returns a single clustering solution that is simultaneously $O(1)$-approximate for all $\ell_p$-norms of the disagreement vector. This proves that minimal sacrifice is needed in order to optimize different norms of the disagreement vector. Our algorithm is the first combinatorial approximation algorithm for the $\ell_2$-norm objective, and more generally the first combinatorial algorithm for the $\ell_p$-norm objective when $2 \leq p < \infty$. It is also faster than all previous algorithms that minimize the $\ell_p$-norm of the disagreement vector, with run-time $O(n^\omega)$, where $O(n^\omega)$ is the time for matrix multiplication on $n \times n$ matrices. When the maximum positive degree in the graph is at most $\Delta$, this can be improved to a run-time of $O(n\Delta^2 \log n)$. Comment: 27 pages, 2 figures
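
The disagreement vector has one entry per vertex, counting the disagreeing edges incident to that vertex. A minimal sketch of this vector and its $\ell_p$-norms on a complete graph follows; it assumes every pair not listed in `positive_edges` is a negative edge, and the function names are hypothetical. It only evaluates a given clustering and is not the paper's algorithm.

```python
from itertools import combinations

def disagreement_vector(vertices, positive_edges, cluster_of):
    """y[v] = number of disagreeing edges incident to v (complete graph)."""
    positive = {frozenset(e) for e in positive_edges}
    y = {v: 0 for v in vertices}
    for u, v in combinations(vertices, 2):
        is_positive = frozenset((u, v)) in positive
        same = cluster_of[u] == cluster_of[v]
        if is_positive != same:  # '+' across clusters or '-' inside one
            y[u] += 1
            y[v] += 1
    return y

def lp_norm(y, p):
    """l_p norm of the disagreement vector; p may be float('inf')."""
    values = list(y.values())
    if p == float("inf"):
        return max(values)
    return sum(x ** p for x in values) ** (1 / p)

# Hypothetical instance: a positive path 0-1-2-3, all other pairs negative.
vertices = [0, 1, 2, 3]
positive_edges = [(0, 1), (1, 2), (2, 3)]
cluster_of = {0: "A", 1: "A", 2: "B", 3: "B"}
y = disagreement_vector(vertices, positive_edges, cluster_of)
print(y, lp_norm(y, 1), lp_norm(y, 2), lp_norm(y, float("inf")))
```

The $\ell_1$-norm is twice the number of disagreeing edges, while the $\ell_\infty$-norm measures the worst-off vertex.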

    Efficient Correlation Clustering Methods for Large Consensus Clustering Instances

    Consensus clustering (or clustering aggregation) inputs $k$ partitions of a given ground set $V$, and seeks to create a single partition that minimizes disagreement with all input partitions. State-of-the-art algorithms for consensus clustering are based on correlation clustering methods like the popular Pivot algorithm. Unfortunately these methods have not proved to be practical for consensus clustering instances where either $k$ or $V$ gets large. In this paper we provide practical run time improvements for correlation clustering solvers when $V$ is large. We reduce the time complexity of Pivot from $O(|V|^2 k)$ to $O(|V| k)$, and its space complexity from $O(|V|^2)$ to $O(|V| k)$ -- a significant savings since in practice $k$ is much less than $|V|$. We also analyze a sampling method for these algorithms when $k$ is large, bridging the gap between running Pivot on the full set of input partitions (an expected 1.57-approximation) and choosing a single input partition at random (an expected 2-approximation). We show experimentally that algorithms like Pivot do obtain quality clustering results in practice even on small samples of input partitions.
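
The consensus objective can be illustrated by a brute-force evaluation: the cost of a candidate partition is the number of vertex pairs on which it disagrees with each input partition, summed over all inputs. The sketch below assumes partitions are given as dictionaries from vertex to label; the function names are hypothetical. It takes time quadratic in $|V|$ per input partition and is not the accelerated Pivot from the abstract.

```python
from itertools import combinations

def pair_disagreements(p, q, vertices):
    """Number of pairs co-clustered in exactly one of the partitions p, q."""
    return sum(
        (p[u] == p[v]) != (q[u] == q[v])
        for u, v in combinations(vertices, 2)
    )

def consensus_cost(candidate, partitions, vertices):
    """Total disagreement of a candidate with all input partitions."""
    return sum(pair_disagreements(candidate, q, vertices) for q in partitions)

# Hypothetical input: k = 3 partitions of V = {0, 1, 2}.
vertices = [0, 1, 2]
p1 = {0: "a", 1: "a", 2: "b"}
p2 = {0: "a", 1: "a", 2: "a"}
p3 = {0: "a", 1: "b", 2: "b"}
partitions = [p1, p2, p3]
print(consensus_cost(p1, partitions, vertices))
```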

    Densest Diverse Subgraphs: How to Plan a Successful Cocktail Party with Diversity

    Dense subgraph discovery methods are routinely used in a variety of applications including the identification of a team of skilled individuals for collaboration from a social network. However, when the network's node set is associated with a sensitive attribute such as race, gender, religion, or political opinion, the lack of diversity can lead to lawsuits. In this work, we focus on the problem of finding a densest diverse subgraph in a graph whose nodes have different attribute values/types that we refer to as colors. We propose two novel formulations motivated by different realistic scenarios. Our first formulation, called the densest diverse subgraph problem (DDSP), guarantees that no color represents more than some fraction of the nodes in the output subgraph, which generalizes the state-of-the-art due to Anagnostopoulos et al. (CIKM 2020). By varying the fraction we can range the diversity constraint and interpolate from a diverse dense subgraph where all colors have to be equally represented to an unconstrained dense subgraph. We design a scalable $\Omega(1/\sqrt{n})$-approximation algorithm, where $n$ is the number of nodes. Our second formulation is motivated by the setting where any specified color should not be overlooked. We propose the densest at-least-$\vec{k}$-subgraph problem (Dal$\vec{k}$S), a novel generalization of the classic Dal$k$S, where instead of a single value $k$, we have a vector ${\mathbf k}$ of cardinality demands with one coordinate per color class. We design a $1/3$-approximation algorithm using linear programming together with an acceleration technique. Computational experiments using synthetic and real-world datasets demonstrate that our proposed algorithms are effective in extracting dense diverse clusters. Comment: Accepted to KDD 202
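
The DDSP constraint described above can be sketched as a feasibility check paired with a density computation. The example assumes the usual degree density (number of induced edges divided by number of nodes) and a fraction parameter called `alpha` here; both the density definition and the names are assumptions for illustration, and no approximation algorithm is implemented.

```python
from collections import Counter

def density(edges, subset):
    """Induced edges divided by the number of nodes in a non-empty subset."""
    s = set(subset)
    inside = sum(1 for u, v in edges if u in s and v in s)
    return inside / len(s)

def is_diverse(subset, color, alpha):
    """True if no color makes up more than an alpha fraction of the subset."""
    counts = Counter(color[v] for v in subset)
    return max(counts.values()) <= alpha * len(subset)

# Hypothetical instance: a 4-clique with three red nodes and one blue node.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
color = {0: "red", 1: "red", 2: "red", 3: "blue"}

whole = [0, 1, 2, 3]
pair = [0, 3]
print(density(edges, whole), is_diverse(whole, color, 0.5))
print(density(edges, pair), is_diverse(pair, color, 0.5))
```

With `alpha = 0.5` the densest subgraph (the whole clique) is infeasible because red exceeds half of the nodes, which shows how the diversity constraint can force a sparser answer.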