9 research outputs found
Max-Min Diversification with Fairness Constraints: Exact and Approximation Algorithms
Diversity maximization aims to select a diverse and representative subset of
items from a large dataset. It is a fundamental optimization task that finds
applications in data summarization, feature selection, web search, recommender
systems, and elsewhere. However, in a setting where data items are associated
with different groups according to sensitive attributes like sex or race, it is
possible that algorithmic solutions for this task, if left unchecked, will
under- or over-represent some of the groups. Therefore, we are motivated to
address the problem of \emph{max-min diversification with fairness
constraints}, aiming to select items to maximize the minimum distance
between any pair of selected items while ensuring that the number of items
selected from each group falls within predefined lower and upper bounds. In
this work, we propose an exact algorithm based on integer linear programming
that is suitable for small datasets, as well as an approximation algorithm
with a tunable accuracy guarantee that scales to large datasets. Extensive experiments on real-world datasets
demonstrate the superior performance of our proposed algorithms over existing
ones.
Comment: 13 pages, 8 figures, to appear in SDM '2
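For intuition about the objective itself, the problem can be stated as a tiny brute-force search; the sketch below uses hypothetical data and is not the paper's ILP formulation or its approximation algorithm:

```python
# Brute-force fair max-min diversification, feasible only for tiny inputs.
# Data, group names, and bounds below are hypothetical.
from itertools import combinations

def fair_max_min(points, groups, k, lower, upper):
    """Pick k indices maximizing the minimum pairwise distance, subject to
    the count from each group g lying within [lower[g], upper[g]]."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    best, best_div = None, -1.0
    for subset in combinations(range(len(points)), k):
        counts = {}
        for i in subset:
            counts[groups[i]] = counts.get(groups[i], 0) + 1
        # Fairness check: every group's count must fall within its bounds.
        if any(not (lower[g] <= counts.get(g, 0) <= upper[g]) for g in lower):
            continue
        div = min(dist(points[i], points[j]) for i, j in combinations(subset, 2))
        if div > best_div:
            best, best_div = subset, div
    return best, best_div

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
groups = ["a", "a", "a", "b", "b"]
sel, div = fair_max_min(points, groups, k=3,
                        lower={"a": 1, "b": 1}, upper={"a": 2, "b": 2})
```

This exhaustive search is exponential in k; the exact ILP and the scalable approximation algorithm described above exist precisely to avoid it.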
Streaming Algorithms for Diversity Maximization with Fairness Constraints
Diversity maximization is a fundamental problem with wide applications in
data summarization, web search, and recommender systems. Given a set of
elements, it asks to select a subset of a given size with maximum
\emph{diversity}, as quantified by the dissimilarities among the selected
elements. In this paper, we focus on the diversity maximization problem with
fairness constraints in the streaming setting. Specifically, we consider the
max-min diversity objective, which selects a subset that maximizes the
minimum distance (dissimilarity) between any pair of distinct elements within
it. Assuming that the set is partitioned into disjoint groups by some
sensitive attribute, e.g., sex or race, ensuring \emph{fairness} requires that
the selected subset contains a specified number of elements from each group.
A streaming algorithm should process the set sequentially in one pass and return a
subset with maximum \emph{diversity} while guaranteeing the fairness
constraint. Although diversity maximization has been extensively studied, the
only known algorithms that can work with the max-min diversity objective and
fairness constraints are very inefficient for data streams. Since diversity
maximization is NP-hard in general, we propose two approximation algorithms for
fair diversity maximization in data streams: the first provides an
approximation guarantee but is specialized to the case of two groups, while
the second achieves an approximation guarantee for an arbitrary number of
groups. Experimental
results on real-world and synthetic datasets show that both algorithms provide
solutions of comparable quality to the state-of-the-art algorithms while
running several orders of magnitude faster in the streaming setting.
Comment: 13 pages, 11 figures; published in ICDE 202
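The one-pass regime these algorithms operate in can be illustrated with the standard distance-threshold rule for the unconstrained max-min objective; this simplified sketch uses hypothetical data and omits the per-group fairness handling the paper adds:

```python
# Simplified one-pass distance-threshold rule often used as a building
# block for streaming max-min diversification; fairness handling omitted.
def threshold_stream(stream, k, tau, dist):
    """Keep an arriving element iff it is >= tau away from all kept ones."""
    kept = []
    for x in stream:
        if len(kept) < k and all(dist(x, y) >= tau for y in kept):
            kept.append(x)
    return kept  # any two kept elements are at least tau apart

dist = lambda a, b: abs(a - b)
sel = threshold_stream([0.0, 0.2, 3.0, 3.1, 7.0, 7.05, 10.0],
                       k=3, tau=2.0, dist=dist)
```

Running such rules for a geometric grid of thresholds in parallel is a common way to cope with the unknown optimal distance; fair variants additionally track candidate elements per group.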
Improved Approximation and Scalability for Fair Max-Min Diversification
Given an n-point metric space where each point belongs to
one of m different categories or groups, and a set of integers k_1, ..., k_m, the fair Max-Min diversification problem is to select k_i
points belonging to category i, such that the minimum pairwise
distance between selected points is maximized. The problem was introduced by
Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample
large data sets in various applications so that the derived sample achieves a
balance over diversity, i.e., the minimum distance between a pair of selected
points, and fairness, i.e., ensuring enough points of each category are
included. We prove the following results:
1. We first consider general metric spaces. We present a randomized
polynomial-time algorithm that returns a constant-factor approximation to the
diversity but only satisfies the fairness constraints in expectation. Building
upon this result, we present a constant-factor approximation that is guaranteed
to satisfy the fairness constraints up to a factor 1-epsilon for any constant
epsilon > 0. We also present a linear-time algorithm returning an
approximation with exact fairness, improving upon the best previously known
approximation factor.
2. We then focus on Euclidean metrics. We first show that the problem can be
solved exactly in one dimension. For a constant number of dimensions and
categories and any constant error parameter, we present an approximation
algorithm with an explicit running-time bound, which can be improved further
at the expense of slightly relaxing the number of points picked from each category.
Finally, we present algorithms suitable for processing massive data sets,
including single-pass data stream algorithms and composable coresets for
distributed processing.
Comment: To appear in ICDT 202
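For background, the farthest-point greedy of Gonzalez, widely cited as a 2-approximation for the unconstrained max-min objective, is the kind of subroutine such algorithms extend with per-category bookkeeping; the sketch below ignores categories entirely:

```python
# Classic Gonzalez farthest-point greedy for the unconstrained max-min
# objective, shown only as background; it has no fairness constraints.
def gonzalez(points, k, dist):
    sel = [0]  # start from an arbitrary point
    while len(sel) < k:
        # Add the point whose distance to the current selection is largest.
        far = max(range(len(points)),
                  key=lambda i: min(dist(points[i], points[j]) for j in sel))
        sel.append(far)
    return sel

pts = [(0, 0), (10, 0), (5, 5), (0, 10), (10, 10)]
d = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
picked = gonzalez(pts, 3, d)
```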
Diverse Data Selection under Fairness Constraints
Diversity is an important principle in data selection and summarization, facility location, and recommendation systems. Our work focuses on maximizing diversity in data selection while offering fairness guarantees. In particular, we offer the first study that augments the Max-Min diversification objective with fairness constraints. More specifically, given a universe of n elements that can be partitioned into m disjoint groups, we aim to retrieve a k-sized subset that maximizes the pairwise minimum distance within the set (diversity) and contains a pre-specified number k_i of elements from each group i (fairness). We show that this problem is NP-complete even in metric spaces, and we propose three novel algorithms, linear in n, that provide strong theoretical approximation guarantees for different values of m and k. Finally, we extend our algorithms and analysis to the case where groups can be overlapping.
Diversification and fairness in top-k ranking algorithms
Given a user query, typical user interfaces, such as search engines and recommender systems, allow only a small number of results to be returned to the user. Hence, determining the top-k results is an important task in information retrieval, as it helps ensure that the most relevant results are presented to the user. An extensive body of research studies how to score records and return the top-k to the user. Researchers have also identified an extensive set of criteria for presenting top-k results, and result diversification is one of them. Diversifying the top-k results ensures that the returned set is both relevant and representative of the entire set of answers to the user query, which is highly relevant in the context of search, recommendation, and data exploration. The goal of this dissertation is two-fold. The first is to adapt existing popular diversification algorithms and study how to expedite them without losing the accuracy of the answers. This work addresses the scalability challenge of speeding up existing diversification algorithms by designing a generic framework that produces the same results as the original algorithms yet is significantly faster. The proposed approach also handles data that change over time by adapting the framework to accommodate such changes. The second aspect of the work studies how existing top-k algorithms can lead to inequitable exposure of records that are qualitatively equivalent. This scenario is especially important for long-tail data, where many records have similar utility but existing top-k algorithms return only some of them, and the rest are never shown to the user. Both problems are formalized, studied analytically, and their computational hardness is established.
The contributions of this dissertation lie in (a) formalizing principal problems and studying them analytically, (b) designing scalable algorithms with theoretical guarantees, and (c) evaluating the efficacy and scalability of the designed solutions by comparing them with state-of-the-art solutions over large-scale datasets.
Models and algorithms for promoting diverse and fair query results
Ensuring fairness and diversity in search results are two key concerns in compelling search and recommendation applications. This work explicitly studies these two aspects given multiple users' preferences as inputs, in an effort to create a single ranking or top-k result set that satisfies different fairness and diversity criteria. From a group fairness standpoint, it adapts demographic-parity-like group fairness criteria and proposes new models suitable for ranking or producing a top-k set of results. This dissertation also studies equitable exposure of individual search results in long-tail data, a concept related to individual fairness. First, the dissertation focuses on aggregating ranks while achieving proportionate fairness (ensuring proportionate representation of every group) for multiple protected groups. Then, it explores how to minimally modify the original users' preferences under plurality voting, aiming to produce a top-k result set that satisfies complex fairness constraints. A concept referred to as manipulation by modifications is introduced, which involves making minimal changes to the original user preferences to ensure query satisfaction; this problem is formalized as the margin finding problem. A follow-up work studies this problem under a popular ranked-choice voting mechanism, namely Instant Run-off Voting (IRV), as the preference aggregation method. From the standpoint of individual fairness, this dissertation studies an exposure concern that top-k set-based algorithms exhibit when the underlying data has long-tail properties, and designs techniques to make those results equitable. For result diversification, the work studies efficiency opportunities in existing diversification algorithms and designs a generic access primitive called DivGetBatch() to enable them. The contributions of this dissertation lie in (a) formalizing principal problems and studying them analytically, (b) designing scalable algorithms with theoretical guarantees, and (c) conducting an extensive experimental study to evaluate the efficacy and scalability of the designed solutions by comparing them with state-of-the-art solutions using large-scale datasets.
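One simple reading of proportionate fairness in rank aggregation can be sketched as follows; the aggregation method (Borda counting) and the proportional caps here are illustrative assumptions, not the dissertation's exact models:

```python
# Sketch: aggregate several users' rankings by Borda count, then fill a
# top-k while capping each group at roughly its proportional share.
# Item names, groups, and the cap rule are hypothetical.
from collections import Counter
import math

def fair_top_k(rankings, group_of, k):
    n = len(rankings[0])
    borda = Counter()
    for r in rankings:                 # each r lists items best-first
        for pos, item in enumerate(r):
            borda[item] += n - pos     # higher rank -> more points
    share = Counter(group_of.values())
    cap = {g: math.ceil(k * c / n) for g, c in share.items()}
    picked, used = [], Counter()
    for item, _ in borda.most_common():
        g = group_of[item]
        if used[g] < cap[g]:           # respect the group's quota
            picked.append(item)
            used[g] += 1
        if len(picked) == k:
            break
    return picked

rankings = [["a1", "a2", "b1", "b2"], ["a2", "a1", "b1", "b2"]]
group_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
picked = fair_top_k(rankings, group_of, k=2)
```

Even though both users rank the A items first, the proportional cap forces one slot to go to a B item, which is the kind of trade-off between aggregate preference and group representation the dissertation formalizes.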
Computational Approaches to Generating Diverse Enzyme Panels
Ph.D. Thesis
Motivation
Enzymes are complex macromolecules crucial to life on earth. From bacteria to human
beings, all organisms use enzymes to catalyse the many thousands of chemical reactions
occurring in their cells. Enzyme functions are so diverse that the use of enzymes in
industries like pharmaceuticals and agriculture has gained popularity over recent years
as "biocatalysts".
Unfortunately, the confident laboratory-based characterisation of enzyme function has
lagged behind a massive increase in sequencing data, slowing down initiatives that
look to use biocatalysts as part of their chemical processes. Computational methods
for identifying biocatalysts do exist, but often falter due to the complexity of enzymes
and sequence bias, leaving much of the catalytic space of enzymes and their families
undiscovered.
This thesis has two major themes: the development of in silico approaches for curating
diverse panels of novel enzyme sequences for experimental characterisation, and of
tooling that integrates in silico panel creation and in vitro enzyme characterisation
into a unified and iterative framework.
Contributions of this thesis
The contributions of this thesis can be divided into the two larger themes, starting
with the diverse panel selection of sequences from an enzyme family:
• A novel type of protein network based on patterns of coevolving residues that
can be used to identify functionally-interesting groupings in enzyme families.
• The automatic sampling of functionally diverse subsets of enzyme sequences by
solving the maximum diversity problem.
• A study into the viability of artificially increasing enzyme family diversity through
neural networks-based generation of synthetic sequences.
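The maximum-diversity sampling step above can be sketched with a simple greedy farthest-first loop over sequences; this is a toy illustration with made-up sequences and normalized Hamming distance, not the thesis's actual solver:

```python
# Toy "panel selection by maximum diversity": greedily grow a panel,
# always adding the sequence farthest (in normalized Hamming distance)
# from everything already chosen. Sequences here are hypothetical.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b)) / len(a)

def diverse_panel(seqs, k):
    panel = [seqs[0]]                  # seed with an arbitrary sequence
    pool = seqs[1:]
    while len(panel) < k and pool:
        # Pick the pool sequence whose nearest panel member is farthest.
        far = max(pool, key=lambda s: min(hamming(s, p) for p in panel))
        panel.append(far)
        pool.remove(far)
    return panel

seqs = ["AAAA", "AAAT", "TTTT", "TTAA", "GGGG"]
panel = diverse_panel(seqs, 3)
```

Real panel selection would operate on alignment- or embedding-based distances over an enzyme family rather than raw toy strings, but the greedy structure is the same.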
The second theme, which deals with built tools for bridging the gap between the in
silico and in vitro side of enzyme family exploration:
• A platform that integrates the panel selection process and resulting characterisation data to promote an iterative approach to exploring enzyme families.
• A repository for storing the metadata generated by the major steps of characterisation assays in the lab.
EPSRC and Prozomix Limite
Database Usability Enhancement in Data Exploration
Database usability has become an important research topic over the last decade. In the early days, database management systems were maintained by sophisticated users such as database administrators. Today, thanks to the availability of data and computing resources, more non-expert users are involved in database computation, and from their point of view database systems lack ease of use. Researchers therefore consider usability as important as the performance and functionality of databases, and have developed many techniques, such as natural language interfaces, to make databases easier to use. In this thesis, we identify deeper technical issues in database usability and examine several core database technologies to further improve the ease of use of databases along two dimensions: helping users process data and helping them exploit computing capacity.
We start by helping users find the data. In the real world, public data is everywhere on the Web, but it is scattered around. We extract a prototype relational knowledge base to solve this problem. We start from the most basic binary mapping relationships (sometimes also named bridge tables) between entities from the web. This mapping relationship facilitates many data transformation applications such as auto-correct, auto-fill, and auto-join.
After finding the data, we help users explore the data. When users issue queries to explore the data, their query results may contain too many items. So the system designer has to present a small subset of representative and diverse items rather than all items. This is known as the query result diversification problem. We propose the RC-Index, which helps to solve the diversification problem by significantly reducing the number of items that must be retrieved by the database to form a diverse set of a desired size. It is nearly an order of magnitude faster than the state-of-the-art and has a good performance guarantee, which improves the ease of use of databases in terms of querying.
Finally, we shift our focus from data to computing capacity. We propose a framework to help users choose configurations in the cloud. Cloud computing has revolutionized data analysis, but choosing the right configuration is challenging because the common pricing mechanisms of public clouds are complicated: users must reason about low-level resources to find the best plan for their computational tasks. To address this issue, we propose a new market-based framework for pricing computational tasks in the cloud. We introduce agents that help users configure their personalized databases, which improves the ease of use of databases in the cloud.