Dual Node and Edge Fairness-Aware Graph Partition
Fair graph partition of social networks is a crucial step toward ensuring
fair and non-discriminatory treatments in unsupervised user analysis. Current
fair partition methods typically consider node balance, a notion pursuing a
proportionally balanced number of nodes from all demographic groups, but ignore
the bias induced by imbalanced edges in each cluster. To address this gap, we
propose the notion of edge balance to measure the proportion of edges connecting
different demographic groups in clusters. We analyze the relations between node
balance and edge balance, then with line graph transformations, we propose a
co-embedding framework to learn dual node and edge fairness-aware
representations for graph partition. We validate our framework through several
social network datasets and observe balanced partition in terms of both nodes
and edges along with good utility. Moreover, we demonstrate our fair partition
can be used as pseudo labels to facilitate graph neural networks to behave
fairly in node classification and link prediction tasks.
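To make the two notions above concrete, here is a minimal sketch of how node balance and edge balance might be measured per cluster. The exact definitions are assumptions made for illustration (node balance as the min/max ratio of group counts, edge balance as the fraction of intra-cluster edges joining different groups); the paper's own formulas may differ, and the function names are hypothetical.

```python
from collections import Counter


def node_balance(groups):
    """Min/max ratio of demographic group counts in one cluster.

    Assumed definition: 1.0 means equal counts; a cluster that
    contains a single group scores 0.0.
    """
    counts = Counter(groups)
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / max(counts.values())


def edge_balance(edges, group_of):
    """Fraction of edges whose endpoints belong to different groups (assumed definition)."""
    if not edges:
        return 0.0
    cross = sum(1 for u, v in edges if group_of[u] != group_of[v])
    return cross / len(edges)


def cluster_report(edges, group_of, cluster_of):
    """Map each cluster id to (node balance, edge balance)."""
    report = {}
    for c in set(cluster_of.values()):
        members = [v for v in cluster_of if cluster_of[v] == c]
        # Only edges with both endpoints inside the cluster are counted.
        inside = [(u, v) for u, v in edges
                  if cluster_of[u] == c and cluster_of[v] == c]
        report[c] = (
            node_balance([group_of[v] for v in members]),
            edge_balance(inside, group_of),
        )
    return report


if __name__ == "__main__":
    group_of = {0: "a", 1: "a", 2: "b", 3: "b"}
    cluster_of = {0: 0, 1: 0, 2: 0, 3: 0}
    edges = [(0, 1), (1, 2), (2, 3)]
    print(cluster_report(edges, group_of, cluster_of))
```

In the toy example the single cluster is perfectly node-balanced (two nodes per group), yet only one of its three edges crosses groups, which is the kind of gap between the two measures the abstract points to.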
Densest Diverse Subgraphs: How to Plan a Successful Cocktail Party with Diversity
Dense subgraph discovery methods are routinely used in a variety of
applications including the identification of a team of skilled individuals for
collaboration from a social network. However, when the network's node set is
associated with a sensitive attribute such as race, gender, religion, or
political opinion, the lack of diversity can lead to lawsuits.
In this work, we focus on the problem of finding a densest diverse subgraph
in a graph whose nodes have different attribute values/types that we refer to
as colors. We propose two novel formulations motivated by different realistic
scenarios. Our first formulation, called the densest diverse subgraph problem
(DDSP), guarantees that no color represents more than some fraction of the
nodes in the output subgraph, which generalizes the state-of-the-art due to
Anagnostopoulos et al. (CIKM 2020). By varying the fraction we can range the
diversity constraint and interpolate from a diverse dense subgraph where all
colors have to be equally represented to an unconstrained dense subgraph. We
design a scalable $\Omega(1/\sqrt{n})$-approximation algorithm, where $n$ is
the number of nodes. Our second formulation is motivated by the setting where
any specified color should not be overlooked. We propose the densest
at-least-$\vec{k}$-subgraph problem (Dal$\vec{k}$S), a novel generalization of
the classic Dal$k$S, where instead of a single value $k$, we have a vector
$\vec{k}$ of cardinality demands with one coordinate per color class. We
design a $1/3$-approximation algorithm using linear programming together with
an acceleration technique. Computational experiments using synthetic and
real-world datasets demonstrate that our proposed algorithms are effective in
extracting dense diverse clusters.
Comment: Accepted to KDD 2023
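As an illustration of the first formulation, the sketch below checks the diversity constraint (no color exceeds a fraction `alpha` of the chosen nodes) and finds a densest diverse subgraph by exhaustive search. This is not the paper's approximation algorithm: brute force is exponential and is swapped in here only to show the objective on tiny graphs. Density is assumed to be edges divided by nodes, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def density(nodes, edges):
    """Induced edges divided by number of nodes (assumed density measure)."""
    s = set(nodes)
    if not s:
        return 0.0
    inside = sum(1 for u, v in edges if u in s and v in s)
    return inside / len(s)


def is_diverse(nodes, color_of, alpha):
    """True if no color accounts for more than alpha of the selected nodes."""
    counts = Counter(color_of[v] for v in nodes)
    return all(c <= alpha * len(nodes) for c in counts.values())


def densest_diverse_bruteforce(all_nodes, edges, color_of, alpha):
    """Exhaustive search over all non-empty subsets; tiny graphs only."""
    best, best_density = None, -1.0
    for size in range(1, len(all_nodes) + 1):
        for subset in combinations(all_nodes, size):
            if is_diverse(subset, color_of, alpha):
                d = density(subset, edges)
                if d > best_density:
                    best, best_density = set(subset), d
    return best, best_density


if __name__ == "__main__":
    color_of = {0: "red", 1: "red", 2: "red", 3: "blue"}
    edges = [(0, 1), (1, 2), (0, 2), (0, 3), (1, 3)]
    print(densest_diverse_bruteforce([0, 1, 2, 3], edges, color_of, 0.5))
    print(densest_diverse_bruteforce([0, 1, 2, 3], edges, color_of, 1.0))
```

With `alpha = 1.0` the constraint is vacuous and the search returns the unconstrained densest subgraph; lowering `alpha` to 0.5 forces equal representation of the two colors, mirroring the interpolation described in the abstract.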
Fair Clustering via Hierarchical Fair-Dirichlet Process
The advent of ML-driven decision-making and policy formation has led to an
increasing focus on algorithmic fairness. As clustering is one of the most
commonly used unsupervised machine learning approaches, there has naturally
been a proliferation of literature on fair clustering. A popular notion
of fairness in clustering mandates the clusters to be balanced, i.e.,
each level of a protected attribute must be approximately equally represented
in each cluster. Building upon the original framework, this literature has
rapidly expanded in various aspects. In this article, we offer a novel
model-based formulation of fair clustering, complementing the existing
literature which is almost exclusively based on optimizing appropriate
objective functions.
Hierarchical clustering with dot products recovers hidden tree structure
In this paper we offer a new perspective on the well established
agglomerative clustering algorithm, focusing on recovery of hierarchical
structure. We recommend a simple variant of the standard algorithm, in which
clusters are merged by maximum average dot product and not, for example, by
minimum distance or within-cluster variance. We demonstrate that the tree
output by this algorithm provides a bona fide estimate of generative
hierarchical structure in data, under a generic probabilistic graphical model.
The key technical innovations are to understand how hierarchical information in
this model translates into tree geometry which can be recovered from data, and
to characterise the benefits of simultaneously growing sample size and data
dimension. We demonstrate superior tree recovery performance with real data
over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
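The merge rule described above can be sketched as follows. This is a naive illustration, not the authors' implementation: it assumes that the "average dot product" between two clusters is the mean of all pairwise dot products between their members, and it recomputes every score at each step rather than using an efficient update. The function name is hypothetical.

```python
import numpy as np


def dot_product_agglomerative(X):
    """Agglomerative clustering that merges the pair of clusters with
    the largest average pairwise dot product.

    Returns a list of merges (id_a, id_b, score). Leaves have ids
    0..n-1; each merge creates a new cluster with the next free id.
    """
    X = np.asarray(X, dtype=float)
    gram = X @ X.T  # all pairwise dot products between points
    clusters = {i: [i] for i in range(len(X))}
    merges = []
    next_id = len(X)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Mean dot product between members of cluster a and cluster b.
                score = gram[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or score > best[0]:
                    best = (score, a, b)
        score, a, b = best
        merges.append((a, b, float(score)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


if __name__ == "__main__":
    X = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9]]
    for a, b, score in dot_product_agglomerative(X):
        print(a, b, round(score, 3))
```

In the toy data, the two points near the first axis merge first, then the two near the second axis, and the final merge joins the two groups with a much lower average dot product.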
Fairness-aware Machine Learning in Educational Data Mining
Fairness is an essential requirement of every educational system, which is reflected in a variety of educational activities. With the extensive use of Artificial Intelligence (AI) and Machine Learning (ML) techniques in education, researchers and educators can analyze educational (big) data and propose new (technical) methods in order to support teachers, students, or administrators of (online) learning systems in the organization of teaching and learning. Educational data mining (EDM) is the result of the application and development of data mining (DM) and ML techniques to deal with educational problems, such as student performance prediction and student grouping. However, ML-based decisions in education can be based on protected attributes, such as race or gender, leading to discrimination against individual students or subgroups of students. Therefore, ensuring fairness in ML models also contributes to equity in educational systems. On the other hand, bias can also appear in the data obtained from learning environments. Hence, bias-aware exploratory educational data analysis is important to support unbiased decision-making in EDM.
In this thesis, we address the aforementioned issues and propose methods that mitigate discriminatory outcomes of ML algorithms in EDM tasks. Specifically, we make the following contributions:
We perform bias-aware exploratory analysis of educational datasets using Bayesian networks to identify the relationships among attributes in order to understand bias in the datasets. We focus the exploratory data analysis on features having a direct or indirect relationship with the protected attributes w.r.t. prediction outcomes.
We perform a comprehensive evaluation of the sufficiency of various group fairness measures in predictive models for student performance prediction problems. A variety of experiments on various educational datasets with different fairness measures are performed to provide users with a broad view of unfairness from diverse aspects.
We deal with the student grouping problem in collaborative learning. We introduce the fair-capacitated clustering problem that takes into account cluster fairness and cluster cardinalities. We propose two approaches, namely hierarchical clustering and partitioning-based clustering, to obtain fair-capacitated clustering.
We introduce the multi-fair capacitated (MFC) students-topics grouping problem that satisfies students' preferences while ensuring balanced group cardinalities and maximizing the diversity of members regarding the protected attribute. We propose three approaches: a greedy heuristic approach, a knapsack-based approach using vanilla maximal 0-1 knapsack formulation, and an MFC knapsack approach based on group fairness knapsack formulation.
In short, the findings described in this thesis demonstrate the importance of fairness-aware ML in educational settings. We show that bias-aware data analysis, fairness measures, and fairness-aware ML models are essential aspects to ensure fairness in EDM and the educational environment.
Ministry of Science and Culture of Lower Saxony/LernMINT/51410078/E
New methods for algorithm evaluation and cluster initialisation with applications to healthcare
This thesis explores three themes related to modern operational research: evaluating
the objective performance of an algorithm, combining clustering with concepts of
mathematical fairness, and developing insightful healthcare models despite a lack of
fine-grained data.
The established evaluation procedure for algorithms — and particularly machine
learning algorithms — lacks robustness, potentially inflating the success of the methods
being assessed. To tackle this, the evolutionary dataset optimisation method is
introduced as a supplementary evaluation tool. By traversing the space in which
datasets exist, this method provides the means of attaining a richer understanding of
the algorithm under study.
This method is used to investigate a novel initialisation method for a centroid-based
clustering algorithm, k-modes. The initialisation makes use of the game theoretic
concept of a matching game to allocate the starting centroids in a mathematically
fair way. The subsequent investigation reveals the conditions under which the new
initialisation improves upon two other initialisation methods.
An extension to the k-modes algorithm is utilised to segment an administrative dataset
provided by the co-sponsors of this project, Cwm Taf Morgannwg University Health
Board. The dataset corresponds to the patient population presenting a specific chronic
disease, and comprises a high-level summary of their stays in hospital over a number
of years. Despite the relative coarseness of this dataset, the segmentation provides a
useful profiling of its instances. These profiles are used to inform a multi-class queuing
model representing a hypothetical ward for the affected patients. Following a
novel validation process for the queuing model, actionable insights into the needs of
the population are found.
In addition to these research pursuits, several open-source software packages have
been developed to accompany this thesis. These pieces of software were developed
using best practices to ensure the reliability, reproducibility, and sustainability of the
research in this thesis.