9 research outputs found
Max-Min Diversification with Fairness Constraints: Exact and Approximation Algorithms
Diversity maximization aims to select a diverse and representative subset of
items from a large dataset. It is a fundamental optimization task that finds
applications in data summarization, feature selection, web search, recommender
systems, and elsewhere. However, in a setting where data items are associated
with different groups according to sensitive attributes like sex or race, it is
possible that algorithmic solutions for this task, if left unchecked, will
under- or over-represent some of the groups. Therefore, we are motivated to
address the problem of \emph{max-min diversification with fairness
constraints}, aiming to select items to maximize the minimum distance
between any pair of selected items while ensuring that the number of items
selected from each group falls within predefined lower and upper bounds. In
this work, we propose an exact algorithm based on integer linear programming
that is suitable for small datasets, as well as an approximation algorithm
with a tunable accuracy guarantee that scales to large datasets. Extensive experiments on real-world datasets
demonstrate the superior performance of our proposed algorithms over existing
ones.
Comment: 13 pages, 8 figures, to appear in SDM '2
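For intuition about the objective itself, the problem can be stated as a tiny brute-force search; the sketch below uses hypothetical data and is not the paper's ILP formulation or its approximation algorithm:

```python
# Brute-force fair max-min diversification, feasible only for tiny inputs.
# Data, group names, and bounds below are hypothetical.
from itertools import combinations

def fair_max_min(points, groups, k, lower, upper):
    """Pick k indices maximizing the minimum pairwise distance, subject to
    the count from each group g lying within [lower[g], upper[g]]."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    best, best_div = None, -1.0
    for subset in combinations(range(len(points)), k):
        counts = {}
        for i in subset:
            counts[groups[i]] = counts.get(groups[i], 0) + 1
        # Fairness check: every group's count must fall within its bounds.
        if any(not (lower[g] <= counts.get(g, 0) <= upper[g]) for g in lower):
            continue
        div = min(dist(points[i], points[j]) for i, j in combinations(subset, 2))
        if div > best_div:
            best, best_div = subset, div
    return best, best_div

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
groups = ["a", "a", "a", "b", "b"]
sel, div = fair_max_min(points, groups, k=3,
                        lower={"a": 1, "b": 1}, upper={"a": 2, "b": 2})
```

This exhaustive search is exponential in k; the exact ILP and the scalable approximation algorithm described above exist precisely to avoid it.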
Streaming Algorithms for Diversity Maximization with Fairness Constraints
Diversity maximization is a fundamental problem with wide applications in
data summarization, web search, and recommender systems. Given a set of
elements, it asks to select a subset of a given size with maximum
\emph{diversity}, as quantified by the dissimilarities among the selected
elements. In this paper, we focus on the diversity maximization problem with
fairness constraints in the streaming setting. Specifically, we consider the
max-min diversity objective, which selects a subset that maximizes the
minimum distance (dissimilarity) between any pair of distinct elements within
it. Assuming that the set is partitioned into disjoint groups by some
sensitive attribute, e.g., sex or race, ensuring \emph{fairness} requires that
the selected subset contains a specified number of elements from each group.
A streaming algorithm should process the set sequentially in one pass and return a
subset with maximum \emph{diversity} while guaranteeing the fairness
constraint. Although diversity maximization has been extensively studied, the
only known algorithms that can work with the max-min diversity objective and
fairness constraints are very inefficient for data streams. Since diversity
maximization is NP-hard in general, we propose two approximation algorithms for
fair diversity maximization in data streams: the first provides an
approximation guarantee but is specialized to the case of two groups, while
the second achieves an approximation guarantee for an arbitrary number of
groups. Experimental
results on real-world and synthetic datasets show that both algorithms provide
solutions of comparable quality to the state-of-the-art algorithms while
running several orders of magnitude faster in the streaming setting.
Comment: 13 pages, 11 figures; published in ICDE 202
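The one-pass regime these algorithms operate in can be illustrated with the standard distance-threshold rule for the unconstrained max-min objective; this simplified sketch uses hypothetical data and omits the per-group fairness handling the paper adds:

```python
# Simplified one-pass distance-threshold rule often used as a building
# block for streaming max-min diversification; fairness handling omitted.
def threshold_stream(stream, k, tau, dist):
    """Keep an arriving element iff it is >= tau away from all kept ones."""
    kept = []
    for x in stream:
        if len(kept) < k and all(dist(x, y) >= tau for y in kept):
            kept.append(x)
    return kept  # any two kept elements are at least tau apart

dist = lambda a, b: abs(a - b)
sel = threshold_stream([0.0, 0.2, 3.0, 3.1, 7.0, 7.05, 10.0],
                       k=3, tau=2.0, dist=dist)
```

Running such rules for a geometric grid of thresholds in parallel is a common way to cope with the unknown optimal distance; fair variants additionally track candidate elements per group.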
Improved Approximation and Scalability for Fair Max-Min Diversification
Given an n-point metric space where each point belongs to
one of m different categories or groups, and a set of integers k_1, ..., k_m, the fair Max-Min diversification problem is to select k_i
points belonging to category i, such that the minimum pairwise
distance between selected points is maximized. The problem was introduced by
Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample
large data sets in various applications so that the derived sample achieves a
balance over diversity, i.e., the minimum distance between a pair of selected
points, and fairness, i.e., ensuring enough points of each category are
included. We prove the following results:
1. We first consider general metric spaces. We present a randomized
polynomial-time algorithm that returns a constant-factor approximation to the
diversity but only satisfies the fairness constraints in expectation. Building
upon this result, we present a constant-factor approximation that is guaranteed
to satisfy the fairness constraints up to a factor 1-epsilon for any constant
epsilon > 0. We also present a linear-time algorithm returning an
approximation with exact fairness, improving upon the best previously known
approximation factor.
2. We then focus on Euclidean metrics. We first show that the problem can be
solved exactly in one dimension. For a constant number of dimensions and
categories and any constant error parameter, we present an approximation
algorithm with an explicit running-time bound, which can be improved further
at the expense of slightly relaxing the number of points picked from each category.
Finally, we present algorithms suitable for processing massive data sets,
including single-pass data stream algorithms and composable coresets for
distributed processing.
Comment: To appear in ICDT 202
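For background, the farthest-point greedy of Gonzalez, widely cited as a 2-approximation for the unconstrained max-min objective, is the kind of subroutine such algorithms extend with per-category bookkeeping; the sketch below ignores categories entirely:

```python
# Classic Gonzalez farthest-point greedy for the unconstrained max-min
# objective, shown only as background; it has no fairness constraints.
def gonzalez(points, k, dist):
    sel = [0]  # start from an arbitrary point
    while len(sel) < k:
        # Add the point whose distance to the current selection is largest.
        far = max(range(len(points)),
                  key=lambda i: min(dist(points[i], points[j]) for j in sel))
        sel.append(far)
    return sel

pts = [(0, 0), (10, 0), (5, 5), (0, 10), (10, 10)]
d = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
picked = gonzalez(pts, 3, d)
```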
Diverse Data Selection under Fairness Constraints
Diversity is an important principle in data selection and summarization, facility location, and recommendation systems. Our work focuses on maximizing diversity in data selection while offering fairness guarantees. In particular, we offer the first study that augments the Max-Min diversification objective with fairness constraints. More specifically, given a universe of n elements that can be partitioned into m disjoint groups, we aim to retrieve a k-sized subset that maximizes the pairwise minimum distance within the set (diversity) and contains a pre-specified number k_i of elements from each group i (fairness). We show that this problem is NP-complete even in metric spaces, and we propose three novel algorithms, linear in n, that provide strong theoretical approximation guarantees for different values of m and k. Finally, we extend our algorithms and analysis to the case where groups can be overlapping.
Diversification and fairness in top-k ranking algorithms
Given a user query, typical user interfaces, such as search engines and recommender systems, allow only a small number of results to be returned to the user. Hence, determining the top-k results is an important task in information retrieval, as it helps ensure that the most relevant results are presented to the user. An extensive body of research studies how to score records and return the top-k to the user. Researchers have also identified an extensive set of criteria for presenting top-k results, and result diversification is one of them. Diversifying the top-k results ensures that the returned set is both relevant and representative of the entire set of answers to the user query, which is highly relevant in the context of search, recommendation, and data exploration. The goal of this dissertation is two-fold. The first is to adapt existing popular diversification algorithms and study how to expedite them without losing the accuracy of the answers. This work addresses the scalability challenge of speeding up existing diversification algorithms by designing a generic framework that produces the same results as the original algorithms yet is significantly faster. The proposed approach also handles data that change over time by adapting the framework to accommodate such changes. The second aspect of the work studies how existing top-k algorithms can lead to inequitable exposure of records that are qualitatively equivalent. This scenario is especially important for long-tail data, where many records have similar utility but existing top-k algorithms return only some of them, and the rest are never shown to the user. Both problems are formalized, studied analytically, and their computational hardness is established.
The contributions of this dissertation lie in (a) formalizing principal problems and studying them analytically, (b) designing scalable algorithms with theoretical guarantees, and (c) evaluating the efficacy and scalability of the designed solutions by comparing them with state-of-the-art solutions over large-scale datasets.
Models and algorithms for promoting diverse and fair query results
Ensuring fairness and diversity in search results are two key concerns in compelling search and recommendation applications. This work explicitly studies these two aspects given multiple users' preferences as inputs, in an effort to create a single ranking or top-k result set that satisfies different fairness and diversity criteria. From a group fairness standpoint, it adapts demographic-parity-like group fairness criteria and proposes new models suitable for ranking or producing a top-k set of results. This dissertation also studies equitable exposure of individual search results in long-tail data, a concept related to individual fairness. First, the dissertation focuses on aggregating ranks while achieving proportionate fairness (ensuring proportionate representation of every group) for multiple protected groups. Then, it explores how to minimally modify the original users' preferences under plurality voting, aiming to produce a top-k result set that satisfies complex fairness constraints. A concept referred to as manipulation by modifications is introduced, which involves making minimal changes to the original user preferences to ensure query satisfaction; this problem is formalized as the margin finding problem. A follow-up work studies this problem under a popular ranked-choice voting mechanism, namely Instant Run-off Voting (IRV), as the preference aggregation method. From the standpoint of individual fairness, this dissertation studies an exposure concern that top-k set-based algorithms exhibit when the underlying data has long-tail properties, and designs techniques to make those results equitable. For result diversification, the work studies efficiency opportunities in existing diversification algorithms and designs a generic access primitive called DivGetBatch() to enable them. The contributions of this dissertation lie in (a) formalizing principal problems and studying them analytically, (b) designing scalable algorithms with theoretical guarantees, and (c) conducting an extensive experimental study to evaluate the efficacy and scalability of the designed solutions by comparing them with state-of-the-art solutions using large-scale datasets.
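One simple reading of proportionate fairness in rank aggregation can be sketched as follows; the aggregation method (Borda counting) and the proportional caps here are illustrative assumptions, not the dissertation's exact models:

```python
# Sketch: aggregate several users' rankings by Borda count, then fill a
# top-k while capping each group at roughly its proportional share.
# Item names, groups, and the cap rule are hypothetical.
from collections import Counter
import math

def fair_top_k(rankings, group_of, k):
    n = len(rankings[0])
    borda = Counter()
    for r in rankings:                 # each r lists items best-first
        for pos, item in enumerate(r):
            borda[item] += n - pos     # higher rank -> more points
    share = Counter(group_of.values())
    cap = {g: math.ceil(k * c / n) for g, c in share.items()}
    picked, used = [], Counter()
    for item, _ in borda.most_common():
        g = group_of[item]
        if used[g] < cap[g]:           # respect the group's quota
            picked.append(item)
            used[g] += 1
        if len(picked) == k:
            break
    return picked

rankings = [["a1", "a2", "b1", "b2"], ["a2", "a1", "b1", "b2"]]
group_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
picked = fair_top_k(rankings, group_of, k=2)
```

Even though both users rank the A items first, the proportional cap forces one slot to go to a B item, which is the kind of trade-off between aggregate preference and group representation the dissertation formalizes.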
Computational Approaches to Generating Diverse Enzyme Panels
Ph.D. Thesis
Motivation
Enzymes are complex macromolecules crucial to life on earth. From bacteria to human
beings, all organisms use enzymes to catalyse the many thousands of chemical reactions
occurring in their cells. Enzyme functions are so diverse that the use of enzymes in
industries like pharmaceuticals and agriculture has gained popularity over recent years
as "biocatalysts".
Unfortunately, the confident laboratory-based characterisation of enzyme function has
lagged behind a massive increase in sequencing data, slowing down initiatives that
look to use biocatalysts as part of their chemical processes. Computational methods
for identifying biocatalysts do exist, but often falter due to the complexity of enzymes
and sequence bias, leaving much of the catalytic space of enzymes and their families
undiscovered.
This thesis has two major themes: the development of in silico approaches for curating
diverse panels of novel enzyme sequences for experimental characterisation, and of
tooling that integrates in silico panel creation and in vitro enzyme characterisation
into a unified and iterative framework.
Contributions of this thesis
The contributions of this thesis can be divided into the two larger themes, starting
with the diverse panel selection of sequences from an enzyme family:
• A novel type of protein network based on patterns of coevolving residues that
can be used to identify functionally-interesting groupings in enzyme families.
• The automatic sampling of functionally diverse subsets of enzyme sequences by
solving the maximum diversity problem.
• A study into the viability of artificially increasing enzyme family diversity through
neural networks-based generation of synthetic sequences.
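The maximum-diversity sampling step above can be sketched with a simple greedy farthest-first loop over sequences; this is a toy illustration with made-up sequences and normalized Hamming distance, not the thesis's actual solver:

```python
# Toy "panel selection by maximum diversity": greedily grow a panel,
# always adding the sequence farthest (in normalized Hamming distance)
# from everything already chosen. Sequences here are hypothetical.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b)) / len(a)

def diverse_panel(seqs, k):
    panel = [seqs[0]]                  # seed with an arbitrary sequence
    pool = seqs[1:]
    while len(panel) < k and pool:
        # Pick the pool sequence whose nearest panel member is farthest.
        far = max(pool, key=lambda s: min(hamming(s, p) for p in panel))
        panel.append(far)
        pool.remove(far)
    return panel

seqs = ["AAAA", "AAAT", "TTTT", "TTAA", "GGGG"]
panel = diverse_panel(seqs, 3)
```

Real panel selection would operate on alignment- or embedding-based distances over an enzyme family rather than raw toy strings, but the greedy structure is the same.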
The second theme, which deals with built tools for bridging the gap between the in
silico and in vitro side of enzyme family exploration:
• A platform that integrates the panel selection process and resulting characterisation data to promote an iterative approach to exploring enzyme families.
• A repository for storing the metadata generated by the major steps of characterisation assays in the lab.
EPSRC and Prozomix Limite
Database Usability Enhancement in Data Exploration
Database usability has become an important research topic over the last decade. In the early days, database management systems were maintained by sophisticated users such as database administrators. Today, thanks to the availability of data and computing resources, more non-expert users are involved in database computation, and from their point of view database systems lack ease of use. Researchers therefore consider usability as important as the performance and functionality of databases, and have developed many techniques, such as natural language interfaces, to make databases easier to use. In this thesis, we identify deeper technical issues in database usability and examine several core database technologies to further improve the ease of use of databases along two dimensions: helping users process data and helping them exploit computing capacity.
We start by helping users find the data. In the real world, public data is everywhere on the Web, but it is scattered around. We extract a prototype relational knowledge base to solve this problem. We start from the most basic binary mapping relationships (sometimes also named bridge tables) between entities from the web. This mapping relationship facilitates many data transformation applications such as auto-correct, auto-fill, and auto-join.
After finding the data, we help users explore the data. When users issue queries to explore the data, their query results may contain too many items. So the system designer has to present a small subset of representative and diverse items rather than all items. This is known as the query result diversification problem. We propose the RC-Index, which helps to solve the diversification problem by significantly reducing the number of items that must be retrieved by the database to form a diverse set of a desired size. It is nearly an order of magnitude faster than the state-of-the-art and has a good performance guarantee, which improves the ease of use of databases in terms of querying.
Finally, we shift our focus from data to computing capacity. We propose a framework to help users choose configurations in the cloud. Cloud computing has revolutionized data analysis, but choosing the right configuration is challenging because the common pricing mechanisms of public clouds are complicated: users must reason about low-level resources to find the best plan for their computational tasks. To address this issue, we propose a new market-based framework for pricing computational tasks in the cloud. We introduce agents that help users configure their personalized databases, which improves the ease of use of databases in the cloud.