9 research outputs found

    Max-Min Diversification with Fairness Constraints: Exact and Approximation Algorithms

    Diversity maximization aims to select a diverse and representative subset of items from a large dataset. It is a fundamental optimization task that finds applications in data summarization, feature selection, web search, recommender systems, and elsewhere. However, in a setting where data items are associated with different groups according to sensitive attributes like sex or race, it is possible that algorithmic solutions for this task, if left unchecked, will under- or over-represent some of the groups. Therefore, we are motivated to address the problem of \emph{max-min diversification with fairness constraints}, aiming to select $k$ items to maximize the minimum distance between any pair of selected items while ensuring that the number of items selected from each group falls within predefined lower and upper bounds. In this work, we propose an exact algorithm based on integer linear programming that is suitable for small datasets as well as a $\frac{1-\varepsilon}{5}$-approximation algorithm for any $\varepsilon \in (0, 1)$ that scales to large datasets. Extensive experiments on real-world datasets demonstrate the superior performance of our proposed algorithms over existing ones. Comment: 13 pages, 8 figures, to appear in SDM '2
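
As a reading aid, the problem described in this abstract can be written as a single optimization statement. The notation is an assumption introduced here, not taken from the abstract: $X$ is the dataset, $X_1, \ldots, X_m$ are its groups, $d$ is the distance function, and $\ell_i$, $u_i$ are the lower and upper bounds for group $i$.

```latex
\[
\max_{S \subseteq X,\; |S| = k} \;
\min_{\substack{x, y \in S \\ x \neq y}} d(x, y)
\quad \text{subject to} \quad
\ell_i \le |S \cap X_i| \le u_i
\quad \text{for } i = 1, \ldots, m .
\]
```

Under this reading, an $\alpha$-approximation with $\alpha < 1$ would return a feasible $S$ whose minimum pairwise distance is at least $\alpha$ times the optimum.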

    Streaming Algorithms for Diversity Maximization with Fairness Constraints

    Diversity maximization is a fundamental problem with wide applications in data summarization, web search, and recommender systems. Given a set $X$ of $n$ elements, it asks to select a subset $S$ of $k \ll n$ elements with maximum \emph{diversity}, as quantified by the dissimilarities among the elements in $S$. In this paper, we focus on the diversity maximization problem with fairness constraints in the streaming setting. Specifically, we consider the max-min diversity objective, which selects a subset $S$ that maximizes the minimum distance (dissimilarity) between any pair of distinct elements within it. Assuming that the set $X$ is partitioned into $m$ disjoint groups by some sensitive attribute, e.g., sex or race, ensuring \emph{fairness} requires that the selected subset $S$ contains $k_i$ elements from each group $i \in [1,m]$. A streaming algorithm should process $X$ sequentially in one pass and return a subset with maximum \emph{diversity} while guaranteeing the fairness constraint. Although diversity maximization has been extensively studied, the only known algorithms that can work with the max-min diversity objective and fairness constraints are very inefficient for data streams. Since diversity maximization is NP-hard in general, we propose two approximation algorithms for fair diversity maximization in data streams, the first of which is $\frac{1-\varepsilon}{4}$-approximate and specific for $m=2$, where $\varepsilon \in (0,1)$, and the second of which achieves a $\frac{1-\varepsilon}{3m+2}$-approximation for an arbitrary $m$. Experimental results on real-world and synthetic datasets show that both algorithms provide solutions of comparable quality to the state-of-the-art algorithms while running several orders of magnitude faster in the streaming setting. Comment: 13 pages, 11 figures; published in ICDE 202
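
To make the one-pass setting concrete, here is a minimal Python sketch of a threshold filter. It is not the algorithm from the paper above and carries no approximation guarantee: it is a simple heuristic that, for an assumed guess `mu` of the target diversity, keeps an element only if its group still has quota left and it lies at least `mu` away from everything kept so far. The function name `stream_filter`, the `(point, group)` stream format, and the Euclidean distance are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def stream_filter(stream, quotas, mu):
    """Single pass over (point, group) pairs.

    Hypothetical heuristic: keep an element if its group has fewer than
    quotas[group] kept elements and it is at least `mu` away from every
    element kept so far. The result may contain fewer than k_i elements
    for some group, so fairness is not guaranteed.
    """
    kept = []
    counts = {group: 0 for group in quotas}
    for point, group in stream:
        # Skip groups that are unknown or already full.
        if counts.get(group, 0) >= quotas.get(group, 0):
            continue
        # Keep the point only if it preserves pairwise distance >= mu.
        if all(dist(point, other) >= mu for other, _ in kept):
            kept.append((point, group))
            counts[group] += 1
    return kept


if __name__ == "__main__":
    demo = [((0, 0), "a"), ((0.5, 0), "a"), ((3, 0), "b"),
            ((6, 0), "a"), ((9, 0), "b"), ((12, 0), "b")]
    print(stream_filter(demo, {"a": 2, "b": 2}, mu=2.0))
```

A practical method would typically run such a filter for several guesses of `mu` in parallel and post-process the candidates to enforce the quotas exactly; that part is deliberately left out of this sketch.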

    Improved Approximation and Scalability for Fair Max-Min Diversification

    Given an $n$-point metric space $(\mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups and a set of integers $k_1, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $i\in [m]$, such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample achieves a balance over diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results: 1. We first consider general metric spaces. We present a randomized polynomial time algorithm that returns a factor $2$-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-\epsilon$ for any constant $\epsilon$. We also present a linear time algorithm returning an $m+1$ approximation with exact fairness. The best previous result was a $3m-1$ approximation. 2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimensions, categories and any constant $\epsilon>0$, we present a $1+\epsilon$ approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time where $k=k_1+\ldots+k_m$. We can improve the running time to $O(nk)+ poly(k)$ at the expense of only picking $(1-\epsilon) k_i$ points from category $i\in [m]$. Finally, we present algorithms suitable to processing massive data sets including single-pass data stream algorithms and composable coresets for the distributed processing. Comment: To appear in ICDT 202

    Diverse Data Selection under Fairness Constraints

    Diversity is an important principle in data selection and summarization, facility location, and recommendation systems. Our work focuses on maximizing diversity in data selection, while offering fairness guarantees. In particular, we offer the first study that augments the Max-Min diversification objective with fairness constraints. More specifically, given a universe U of n elements that can be partitioned into m disjoint groups, we aim to retrieve a k-sized subset that maximizes the pairwise minimum distance within the set (diversity) and contains a pre-specified k_i number of elements from each group i (fairness). We show that this problem is NP-complete even in metric spaces, and we propose three novel algorithms, linear in n, that provide strong theoretical approximation guarantees for different values of m and k. Finally, we extend our algorithms and analysis to the case where groups can be overlapping.
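
The problem statement in this abstract (pick exactly k_i elements from each group so that the minimum pairwise distance is as large as possible) can be illustrated with a brute-force reference solver. This is only a sketch for tiny inputs, not one of the paper's linear-time algorithms: it enumerates every fair selection, so its running time is exponential. The names `diversity` and `fair_max_min_bruteforce`, the dictionary-based input format, and the use of Euclidean distance are assumptions made for the example.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance, Python 3.8+


def diversity(points):
    """Minimum pairwise distance; assumes at least two points."""
    return min(dist(p, q) for p, q in combinations(points, 2))


def fair_max_min_bruteforce(groups, quotas):
    """Exhaustively search all selections with quotas[g] points per group.

    groups: dict mapping group label -> list of points (tuples)
    quotas: dict mapping group label -> k_i
    Returns (best_selection, best_diversity). Exponential time, so this
    is a correctness reference for small instances only.
    """
    per_group = [combinations(groups[g], quotas[g]) for g in groups]
    best, best_div = None, float("-inf")
    for choice in product(*per_group):
        # Flatten one combination per group into a single selection.
        selected = [p for part in choice for p in part]
        d = diversity(selected)
        if d > best_div:
            best, best_div = selected, d
    return best, best_div


if __name__ == "__main__":
    groups = {"a": [(0, 0), (1, 0), (10, 0)], "b": [(0, 1), (5, 0)]}
    print(fair_max_min_bruteforce(groups, {"a": 2, "b": 1}))
```

A solver like this is mainly useful for checking the output quality of an approximation algorithm on small test cases.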

    Diversification and fairness in top-k ranking algorithms

    Given a user query, the typical user interfaces, such as search engines and recommender systems, only allow a small number of results to be returned to the user. Hence, figuring out what would be the top-k results is an important task in information retrieval, as it helps to ensure that the most relevant results are presented to the user. There exists an extensive body of research that studies how to score the records and return top-k to the user. Moreover, there exists an extensive set of criteria that researchers identify to present the user with top-k results, and result diversification is one of them. Diversifying the top-k result ensures that the returned result set is relevant as well as representative of the entire set of answers to the user query, and it is highly relevant in the context of search, recommendation, and data exploration. The goal of this dissertation is two-fold: the first goal is to focus on adapting existing popular diversification algorithms and studying how to expedite them without losing the accuracy of the answers. This work studies the scalability challenges of expediting the running time of existing diversification algorithms by designing a generic framework that produces the same results as the original algorithms, yet it is significantly faster in running time. This proposed approach handles scenarios where data change over a period of time and studies how to adapt the framework to accommodate data changes. The second aspect of the work studies how the existing top-k algorithms could lead to inequitable exposure of records that are equivalent qualitatively. This scenario is highly important for long-tail data where there exists a long tail of records that have similar utility, but the existing top-k algorithm only shows one of the top-ks, and the rest are never returned to the user. Both of these problems are studied analytically, and their hardness is studied. 
The contributions of this dissertation lie in (a) formalizing principal problems and studying them analytically, (b) designing scalable algorithms with theoretical guarantees, and (c) evaluating the efficacy and scalability of the designed solutions by comparing them with state-of-the-art solutions over large-scale datasets.

    Models and algorithms for promoting diverse and fair query results

    Ensuring fairness and diversity in search results are two key concerns in compelling search and recommendation applications. This work explicitly studies these two aspects given multiple users' preferences as inputs, in an effort to create a single ranking or top-k result set that satisfies different fairness and diversity criteria. From a group fairness standpoint, it adapts demographic-parity-like group fairness criteria and proposes new models that are suitable for ranking or producing a top-k set of results. This dissertation also studies equitable exposure of individual search results in long-tail data, a concept related to individual fairness. First, the dissertation focuses on aggregating ranks while achieving proportionate fairness (ensuring proportionate representation of every group) for multiple protected groups. Then, the dissertation explores how to minimally modify original users' preferences under plurality voting, aiming to produce a top-k result set that satisfies complex fairness constraints. A concept referred to as manipulation by modifications is introduced, which involves making minimal changes to the original user preferences to ensure query satisfaction. This problem is formalized as the margin finding problem. A follow-up work studies this problem considering a popular ranked-choice voting mechanism, namely Instant Run-off Voting (IRV), as the preference aggregation method. From the standpoint of individual fairness, this dissertation studies an exposure concern that top-k set based algorithms exhibit when the underlying data has long-tail properties, and designs techniques to make those results equitable. For result diversification, the work studies efficiency opportunities in existing diversification algorithms, and designs a generic access primitive called DivGetBatch() to enable that. The contributions of this dissertation lie in (a) formalizing principal problems and studying them analytically, (b) designing scalable algorithms with theoretical guarantees, and (c) conducting an extensive experimental study to evaluate the efficacy and scalability of the designed solutions by comparing them with state-of-the-art solutions using large-scale datasets.

    Computational Approaches to Generating Diverse Enzyme Panels

    Ph.D. Thesis. Motivation: Enzymes are complex macromolecules crucial to life on earth. From bacteria to human beings, all organisms use enzymes to catalyse the many thousands of chemical reactions occurring in their cells. Enzyme functions are so diverse that the use of enzymes as "biocatalysts" in industries like pharmaceuticals and agriculture has gained popularity over recent years. Unfortunately, the confident laboratory-based characterisation of enzyme function has lagged behind a massive increase in sequencing data, slowing down initiatives that look to use biocatalysts as part of their chemical processes. Computational methods for identifying biocatalysts do exist, but often falter due to the complexity of enzymes and sequence bias, leaving much of the catalytic space of enzymes and their families undiscovered. This thesis has two major themes: the development of in silico approaches for curating diverse panels of novel enzyme sequences for experimental characterisation, and of tooling that integrates in silico panel creation and in vitro enzyme characterisation into a unified and iterative framework. Contributions of this thesis: The contributions of this thesis can be divided into the two larger themes, starting with the diverse panel selection of sequences from an enzyme family: • A novel type of protein network based on patterns of coevolving residues that can be used to identify functionally-interesting groupings in enzyme families. • The automatic sampling of functionally diverse subsets of enzyme sequences by solving the maximum diversity problem. • A study into the viability of artificially increasing enzyme family diversity through neural networks-based generation of synthetic sequences.
The second theme deals with tools built for bridging the gap between the in silico and in vitro sides of enzyme family exploration: • A platform that integrates the panel selection process and resulting characterisation data to promote an iterative approach to exploring enzyme families. • A repository for storing the metadata generated by the major steps of characterisation assays in the lab. EPSRC and Prozomix Limite