8 research outputs found

    Dual Node and Edge Fairness-Aware Graph Partition

    Fair graph partition of social networks is a crucial step toward ensuring fair and non-discriminatory treatment in unsupervised user analysis. Current fair partition methods typically consider node balance, a notion pursuing a proportionally balanced number of nodes from all demographic groups, but ignore the bias induced by imbalanced edges in each cluster. To address this gap, we propose a notion of edge balance to measure the proportion of edges connecting different demographic groups in clusters. We analyze the relations between node balance and edge balance; then, with line graph transformations, we propose a co-embedding framework to learn dual node and edge fairness-aware representations for graph partition. We validate our framework on several social network datasets and observe balanced partitions in terms of both nodes and edges along with good utility. Moreover, we demonstrate that our fair partition can be used as pseudo labels to help graph neural networks behave fairly in node classification and link prediction tasks.

    Densest Diverse Subgraphs: How to Plan a Successful Cocktail Party with Diversity

    Dense subgraph discovery methods are routinely used in a variety of applications including the identification of a team of skilled individuals for collaboration from a social network. However, when the network's node set is associated with a sensitive attribute such as race, gender, religion, or political opinion, the lack of diversity can lead to lawsuits. In this work, we focus on the problem of finding a densest diverse subgraph in a graph whose nodes have different attribute values/types that we refer to as colors. We propose two novel formulations motivated by different realistic scenarios. Our first formulation, called the densest diverse subgraph problem (DDSP), guarantees that no color represents more than some fraction of the nodes in the output subgraph, which generalizes the state-of-the-art due to Anagnostopoulos et al. (CIKM 2020). By varying the fraction we can range the diversity constraint and interpolate from a diverse dense subgraph where all colors have to be equally represented to an unconstrained dense subgraph. We design a scalable $\Omega(1/\sqrt{n})$-approximation algorithm, where $n$ is the number of nodes. Our second formulation is motivated by the setting where any specified color should not be overlooked. We propose the densest at-least-$\vec{k}$-subgraph problem (Dal$\vec{k}$S), a novel generalization of the classic Dal$k$S, where instead of a single value $k$, we have a vector $\mathbf{k}$ of cardinality demands with one coordinate per color class. We design a $1/3$-approximation algorithm using linear programming together with an acceleration technique. Computational experiments using synthetic and real-world datasets demonstrate that our proposed algorithms are effective in extracting dense diverse clusters. Comment: Accepted to KDD 202
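
    The DDSP constraint described in this abstract can be illustrated with a minimal sketch. It assumes the usual densest-subgraph density (edges divided by nodes), which the abstract does not spell out, and the names `density`, `satisfies_diversity`, and `alpha` are hypothetical. It only checks a candidate subgraph; it is not the paper's approximation algorithm.

```python
from collections import Counter

def density(edges, subset):
    """Edges inside `subset` divided by its node count (assumed density notion)."""
    nodes = set(subset)
    inside = sum(1 for u, v in edges if u in nodes and v in nodes)
    return inside / len(nodes)

def satisfies_diversity(colors, subset, alpha):
    """True if no color accounts for more than an `alpha` fraction of `subset`."""
    counts = Counter(colors[v] for v in subset)
    return max(counts.values()) <= alpha * len(subset)

# Toy graph: a triangle on nodes 0, 1, 2 plus a pendant node 3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
colors = {0: "red", 1: "red", 2: "blue", 3: "blue"}
```

    With `alpha = 0.5`, the triangle `[0, 1, 2]` is rejected because red covers two of its three nodes, while the full node set passes with the same density.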

    Fair Clustering via Hierarchical Fair-Dirichlet Process

    The advent of ML-driven decision-making and policy formation has led to an increasing focus on algorithmic fairness. As clustering is one of the most commonly used unsupervised machine learning approaches, there has naturally been a proliferation of literature on fair clustering. A popular notion of fairness in clustering mandates the clusters to be balanced, i.e., each level of a protected attribute must be approximately equally represented in each cluster. Building upon the original framework, this literature has rapidly expanded in various aspects. In this article, we offer a novel model-based formulation of fair clustering, complementing the existing literature which is almost exclusively based on optimizing appropriate objective functions.
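
    The balance notion mentioned in this abstract can be made concrete with a small sketch. It assumes one common formalization, the ratio of the least to the most represented protected level within each cluster; the abstract itself does not fix a formula, and the function name `cluster_balance` is hypothetical.

```python
from collections import Counter

def cluster_balance(labels, groups):
    """Map each cluster to min_count / max_count over all protected levels.

    1.0 means every level is equally represented in that cluster;
    0.0 means at least one level is absent from it.
    """
    levels = set(groups)
    counts = {}
    for cluster, level in zip(labels, groups):
        counts.setdefault(cluster, Counter())[level] += 1
    return {
        cluster: min(c[lv] for lv in levels) / max(c[lv] for lv in levels)
        for cluster, c in counts.items()
    }

labels = [0, 0, 0, 0, 1, 1, 1]               # cluster assignment per point
groups = ["a", "b", "a", "b", "a", "a", "b"]  # protected level per point
balance = cluster_balance(labels, groups)
```

    Here cluster 0 holds two points of each level and cluster 1 holds two of level "a" and one of level "b", so the second cluster is the less balanced one.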

    Hierarchical clustering with dot products recovers hidden tree structure

    In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
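
    The merge rule described in this abstract (join the two clusters with the maximum average dot product) can be sketched as a naive pure-Python loop. This is an illustrative sketch only, not the authors' implementation; the names `avg_dot` and `dot_product_agglomerate` are hypothetical, and no attempt is made at efficiency.

```python
def avg_dot(X, a, b):
    """Average dot product over all pairs (i in a, j in b)."""
    total = sum(sum(p * q for p, q in zip(X[i], X[j])) for i in a for j in b)
    return total / (len(a) * len(b))

def dot_product_agglomerate(X):
    """Repeatedly merge the pair of clusters with the largest average dot product.

    Returns the merge history as (members_a, members_b, similarity) tuples.
    """
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                sim = avg_dot(X, clusters[p], clusters[q])
                if best is None or sim > best[0]:
                    best = (sim, p, q)
        sim, p, q = best
        merges.append((clusters[p], clusters[q], sim))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return merges

# Two loose directions in the plane; points 0, 1 lie near one axis, 2, 3 near the other.
X = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9]]
merges = dot_product_agglomerate(X)
```

    On this toy input the history joins points 0 and 1 first, then 2 and 3, and finally the two pairs, which mirrors the two-branch structure of the data.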

    Fairness-aware Machine Learning in Educational Data Mining

    Fairness is an essential requirement of every educational system, which is reflected in a variety of educational activities. With the extensive use of Artificial Intelligence (AI) and Machine Learning (ML) techniques in education, researchers and educators can analyze educational (big) data and propose new (technical) methods in order to support teachers, students, or administrators of (online) learning systems in the organization of teaching and learning. Educational data mining (EDM) is the result of the application and development of data mining (DM) and ML techniques to deal with educational problems, such as student performance prediction and student grouping. However, ML-based decisions in education can be based on protected attributes, such as race or gender, leading to discrimination against individual students or subgroups of students. Therefore, ensuring fairness in ML models also contributes to equity in educational systems. On the other hand, bias can also appear in the data obtained from learning environments. Hence, bias-aware exploratory educational data analysis is important to support unbiased decision-making in EDM. In this thesis, we address the aforementioned issues and propose methods that mitigate discriminatory outcomes of ML algorithms in EDM tasks. Specifically, we make the following contributions: We perform bias-aware exploratory analysis of educational datasets using Bayesian networks to identify the relationships among attributes in order to understand bias in the datasets. We focus the exploratory data analysis on features having a direct or indirect relationship with the protected attributes w.r.t. prediction outcomes. We perform a comprehensive evaluation of the sufficiency of various group fairness measures in predictive models for student performance prediction problems. A variety of experiments on various educational datasets with different fairness measures are performed to provide users with a broad view of unfairness from diverse aspects. We deal with the student grouping problem in collaborative learning. We introduce the fair-capacitated clustering problem that takes into account cluster fairness and cluster cardinalities. We propose two approaches, namely hierarchical clustering and partitioning-based clustering, to obtain fair-capacitated clustering. We introduce the multi-fair capacitated (MFC) students-topics grouping problem that satisfies students' preferences while ensuring balanced group cardinalities and maximizing the diversity of members regarding the protected attribute. We propose three approaches: a greedy heuristic approach, a knapsack-based approach using vanilla maximal 0-1 knapsack formulation, and an MFC knapsack approach based on group fairness knapsack formulation. In short, the findings described in this thesis demonstrate the importance of fairness-aware ML in educational settings. We show that bias-aware data analysis, fairness measures, and fairness-aware ML models are essential aspects to ensure fairness in EDM and the educational environment. Ministry of Science and Culture of Lower Saxony/LernMINT/51410078/E

    New methods for algorithm evaluation and cluster initialisation with applications to healthcare

    This thesis explores three themes related to modern operational research: evaluating the objective performance of an algorithm, combining clustering with concepts of mathematical fairness, and developing insightful healthcare models despite a lack of fine-grained data. The established evaluation procedure for algorithms, and particularly machine learning algorithms, lacks robustness, potentially inflating the success of the methods being assessed. To tackle this, the evolutionary dataset optimisation method is introduced as a supplementary evaluation tool. By traversing the space in which datasets exist, this method provides the means of attaining a richer understanding of the algorithm under study. This method is used to investigate a novel initialisation method for a centroid-based clustering algorithm, k-modes. The initialisation makes use of the game theoretic concept of a matching game to allocate the starting centroids in a mathematically fair way. The subsequent investigation reveals the conditions under which the new initialisation improves upon two other initialisation methods. An extension to the k-modes algorithm is utilised to segment an administrative dataset provided by the co-sponsors of this project, Cwm Taf Morgannwg University Health Board. The dataset corresponds to the patient population presenting a specific chronic disease, and comprises a high-level summary of their stays in hospital over a number of years. Despite the relative coarseness of this dataset, the segmentation provides a useful profiling of its instances. These profiles are used to inform a multi-class queuing model representing a hypothetical ward for the affected patients. Following a novel validation process for the queuing model, actionable insights into the needs of the population are found. In addition to these research pursuits, several open-source software packages have been developed to accompany this thesis. These pieces of software were developed using best practices to ensure the reliability, reproducibility, and sustainability of the research in this thesis.