Max-Min Diversification with Fairness Constraints: Exact and Approximation Algorithms
Diversity maximization aims to select a diverse and representative subset of
items from a large dataset. It is a fundamental optimization task that finds
applications in data summarization, feature selection, web search, recommender
systems, and elsewhere. However, in a setting where data items are associated
with different groups according to sensitive attributes like sex or race, it is
possible that algorithmic solutions for this task, if left unchecked, will
under- or over-represent some of the groups. Therefore, we are motivated to
address the problem of \emph{max-min diversification with fairness
constraints}, aiming to select items to maximize the minimum distance
between any pair of selected items while ensuring that the number of items
selected from each group falls within predefined lower and upper bounds. In
this work, we propose an exact algorithm based on integer linear programming that is suitable for small datasets, as well as a $\frac{1-\varepsilon}{5}$-approximation algorithm for any $\varepsilon \in (0, 1)$ that scales to large datasets. Extensive experiments on real-world datasets demonstrate the superior performance of our proposed algorithms over existing ones.
Comment: 13 pages, 8 figures; to appear in SDM '23
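Since the objective is easiest to grasp in code, here is a minimal, illustrative sketch of an exact solver for tiny inputs: it enumerates all $k$-subsets, keeps those satisfying the per-group lower/upper bounds, and maximizes the minimum pairwise distance. This is not the paper's ILP formulation, and the identifiers (fair_max_min_brute_force, lower, upper, dist) are our own assumptions.

```python
from itertools import combinations

def fair_max_min_brute_force(points, groups, lower, upper, k, dist):
    """Enumerate all k-subsets (k >= 2); feasible only for very small datasets."""
    best_subset, best_div = None, float("-inf")
    for subset in combinations(range(len(points)), k):
        # Fairness: the count of each group must lie in [lower[g], upper[g]].
        counts = {}
        for i in subset:
            counts[groups[i]] = counts.get(groups[i], 0) + 1
        if any(not lower[g] <= counts.get(g, 0) <= upper[g] for g in lower):
            continue
        # Diversity: minimum distance over all pairs in the subset.
        div = min(dist(points[i], points[j]) for i, j in combinations(subset, 2))
        if div > best_div:
            best_subset, best_div = subset, div
    return best_subset, best_div
```

For points given as coordinate tuples, `dist` can simply be `math.dist`; any metric works, since nothing above assumes Euclidean geometry.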
Streaming Algorithms for Diversity Maximization with Fairness Constraints
Diversity maximization is a fundamental problem with wide applications in
data summarization, web search, and recommender systems. Given a set $V$ of $n$ elements, it asks to select a subset $S$ of $k$ elements with maximum \emph{diversity}, as quantified by the dissimilarities among the elements in $S$. In this paper, we focus on the diversity maximization problem with fairness constraints in the streaming setting. Specifically, we consider the max-min diversity objective, which selects a subset that maximizes the minimum distance (dissimilarity) between any pair of distinct elements within it. Assuming that the set $V$ is partitioned into $m$ disjoint groups by some sensitive attribute, e.g., sex or race, ensuring \emph{fairness} requires that the selected subset $S$ contains $k_i$ elements from each group $i \in [m]$. A streaming algorithm should process $V$ sequentially in one pass and return a
subset with maximum \emph{diversity} while guaranteeing the fairness
constraint. Although diversity maximization has been extensively studied, the
only known algorithms that can work with the max-min diversity objective and
fairness constraints are very inefficient for data streams. Since diversity
maximization is NP-hard in general, we propose two approximation algorithms for
fair diversity maximization in data streams, the first of which is
$\frac{1-\varepsilon}{4}$-approximate and specific for $m = 2$, where $\varepsilon \in (0, 1)$, and the second of which achieves a $\frac{1-\varepsilon}{3m+2}$-approximation for an arbitrary $m$. Experimental
results on real-world and synthetic datasets show that both algorithms provide
solutions of comparable quality to the state-of-the-art algorithms while
running several orders of magnitude faster in the streaming setting.
Comment: 13 pages, 11 figures; published in ICDE 2022
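To make the streaming setting concrete, below is a small sketch (not the paper's algorithms) of the standard threshold-filtering primitive that such streaming methods build on: for a guessed diversity value mu, a single pass keeps each arriving element that remains mu-separated within its group. The names stream_fair_filter and k_per_group are our own; real algorithms run many such instances in parallel over a geometric range of mu and post-process the buffers to meet the fairness constraint exactly.

```python
def stream_fair_filter(stream, k_per_group, mu, dist):
    """Single pass over (element, group) pairs: keep an element if its
    group quota is unmet and it lies at distance >= mu from everything
    already kept for that group."""
    kept = {g: [] for g in k_per_group}
    for x, g in stream:
        buf = kept[g]
        if len(buf) < k_per_group[g] and all(dist(x, y) >= mu for y in buf):
            buf.append(x)
    return kept
```

Because each element is compared only against the small kept buffers, the pass uses memory proportional to the quotas rather than to the stream length, which is the source of the speedups reported above.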
Diverse Data Selection under Fairness Constraints
Diversity is an important principle in data selection and summarization, facility location, and recommendation systems. Our work focuses on maximizing diversity in data selection, while offering fairness guarantees. In particular, we offer the first study that augments the Max-Min diversification objective with fairness constraints. More specifically, given a universe U of n elements that can be partitioned into m disjoint groups, we aim to retrieve a k-sized subset that maximizes the pairwise minimum distance within the set (diversity) and contains a pre-specified number k_i of elements from each group i (fairness). We show that this problem is NP-complete even in metric spaces, and we propose three novel algorithms, linear in n, that provide strong theoretical approximation guarantees for different values of m and k. Finally, we extend our algorithms and analysis to the case where groups can be overlapping.
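As a concrete baseline in the same spirit, here is a Gonzalez-style farthest-point greedy constrained by the group quotas k_i. It conveys the flavor of fair diversification but carries none of the paper's approximation guarantees, and all identifiers are illustrative.

```python
def fair_greedy(points, groups, quotas, dist):
    """Farthest-point greedy that only considers points whose group
    still has remaining quota."""
    remaining = dict(quotas)
    # Seed with the first point belonging to a group with unmet quota.
    first = next(i for i in range(len(points)) if remaining[groups[i]] > 0)
    chosen = [first]
    remaining[groups[first]] -= 1
    while any(c > 0 for c in remaining.values()):
        eligible = [i for i in range(len(points))
                    if i not in chosen and remaining[groups[i]] > 0]
        if not eligible:
            break  # some quota cannot be met from the remaining points
        # Pick the eligible point farthest from the current selection.
        best = max(eligible,
                   key=lambda i: min(dist(points[i], points[j]) for j in chosen))
        chosen.append(best)
        remaining[groups[best]] -= 1
    return chosen
```

Without quotas this reduces to the classical Gonzalez heuristic for max-min diversity; the quota check is exactly what can break its usual guarantee, which is why dedicated algorithms such as those in this paper are needed.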
Improved Approximation and Scalability for Fair Max-Min Diversification
Given an $n$-point metric space where each point belongs to one of $m$ different categories or groups and a set of integers $k_1, k_2, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $i$, such that the minimum pairwise
distance between selected points is maximized. The problem was introduced by
Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample
large data sets in various applications so that the derived sample achieves a
balance over diversity, i.e., the minimum distance between a pair of selected
points, and fairness, i.e., ensuring enough points of each category are
included. We prove the following results:
1. We first consider general metric spaces. We present a randomized polynomial time algorithm that returns a factor $2$-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-\varepsilon$ for any constant $\varepsilon > 0$. We also present a linear time algorithm returning an $m+1$ approximation with exact fairness. The best previous result was a $3m-1$ approximation.
2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimensions, categories and any constant $\varepsilon > 0$, we present a $1+\varepsilon$ approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time, where $k = k_1 + \cdots + k_m$. We can improve the running time to $O(nk) + \mathrm{poly}(k)$ at the expense of only picking $(1-\varepsilon) k_i$ points from category $i$.
Finally, we present algorithms suitable for processing massive data sets, including single-pass data stream algorithms and composable coresets for distributed processing.
Comment: To appear in ICDT 2022
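To illustrate the composable-coreset idea in a simplified, unofficial form: each partition of the data keeps, per group, a few greedily well-separated representatives, and the union of these small local coresets is then fed to any offline fair max-min algorithm. The oversampling parameter extra and the function names below are our assumptions, not the paper's construction.

```python
def local_coreset(points, groups, quotas, extra, dist):
    """Per-partition step: for each group, keep (quota + extra) greedily
    spread-out representatives via farthest-point traversal."""
    coreset = []
    for g, k_g in quotas.items():
        members = [p for p, gg in zip(points, groups) if gg == g]
        if not members:
            continue
        reps = [members[0]]
        while len(reps) < min(len(members), k_g + extra):
            # Add the member farthest from the representatives chosen so far.
            far = max(members, key=lambda p: min(dist(p, r) for r in reps))
            reps.append(far)
        coreset.extend((p, g) for p in reps)
    return coreset
```

A coordinator then unions the per-partition coresets and runs an exact or approximate fair diversification algorithm on that much smaller set; composability means this two-level scheme still preserves an approximation guarantee.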
VI Workshop on Computational Data Analysis and Numerical Methods: Book of Abstracts
The VI Workshop on Computational Data Analysis and Numerical Methods (WCDANM) will be held on June 27-29, 2019, in the Department of Mathematics of the University of Beira Interior (UBI), Covilhã, Portugal. It is a unique opportunity to disseminate scientific research related to the areas of Mathematics in general, with particular relevance to Computational Data Analysis and Numerical Methods in theoretical and/or practical fields, using new techniques, with special emphasis on applications in Medicine, Biology, Biotechnology, Engineering, Industry, Environmental Sciences, Finance, Insurance, Management and Administration. The meeting will provide a forum for discussion and debate of ideas of interest to the scientific community in general. New scientific collaborations among colleagues, namely in Masters and PhD projects, are expected to emerge from the meeting. The event is open to the entire scientific community (with or without communication/poster).
Explainable temporal data mining techniques to support the prediction task in Medicine
In the last decades, the increasing amount of data available in all fields has raised the necessity to discover new knowledge and explain the hidden information found. On the one hand, the rapid increase of interest in, and use of, artificial intelligence (AI) in computer applications has raised a parallel concern about its ability (or lack thereof) to provide understandable, or explainable, results to users. In the biomedical informatics and computer science communities, there is considerable discussion about the ``un-explainable'' nature of artificial intelligence, where algorithms and systems often leave users, and even developers, in the dark with respect to how results were obtained. Especially in the biomedical context, the necessity to explain the results of an artificial intelligence system is legitimate, given the importance of patient safety. On the other hand, current database systems enable us to store huge quantities of data. Their analysis through data mining techniques provides the possibility to extract relevant knowledge and useful hidden information. Relationships and patterns within these data could provide new medical knowledge. The analysis of such healthcare/medical data collections could greatly help to observe the health conditions of the population and extract useful information that can be exploited in the assessment of healthcare/medical processes. In particular, the prediction of medical events is essential for preventing disease, understanding disease mechanisms, and increasing patient quality of care. In this context, an important aspect is to verify whether the database content supports the capability of predicting future events.
In this thesis, we start by addressing the problem of explainability, discussing some of the most significant challenges that need to be addressed with scientific and engineering rigor in a variety of biomedical domains. We analyze the ``temporal component'' of explainability, detailing different perspectives such as: the use of temporal data, the temporal task, the temporal reasoning, and the dynamics of explainability with respect to the user perspective and to knowledge. Starting from this panorama, we focus our attention on two different temporal data mining techniques. The first is based on trend abstractions: starting from the concept of Trend-Event Pattern and moving through the concept of prediction, we propose a new kind of predictive temporal pattern, namely Predictive Trend-Event Patterns (PTE-Ps). The framework aims to combine complex temporal features to extract a compact and non-redundant predictive set of patterns composed of such temporal features. The second is based on functional dependencies: we propose a methodology for deriving a new kind of approximate temporal functional dependency, called Approximate Predictive Functional Dependencies (APFDs), based on a three-window framework. We then discuss the concept of approximation and the data complexity of deriving an APFD, introduce two new error measures, and finally assess the quality of APFDs in terms of coverage and reliability. Exploiting these methodologies, we analyze intensive care unit data from the MIMIC dataset.
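For readers unfamiliar with approximate functional dependencies, the following toy sketch shows the classic g3 error measure, i.e. the minimum fraction of rows that must be deleted for an FD to hold exactly. The thesis's APFDs extend this idea with a three-window temporal framework and new error measures, which this sketch does not attempt to reproduce.

```python
from collections import Counter, defaultdict

def g3_error(rows, lhs, rhs):
    """g3 error of the FD lhs -> rhs over rows (list of dicts):
    1 - (largest conflict-free subset) / (total rows)."""
    partitions = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)   # left-hand-side values
        val = tuple(row[a] for a in rhs)   # right-hand-side values
        partitions[key][val] += 1
    # Within each LHS class, keep only the most frequent RHS value.
    kept = sum(c.most_common(1)[0][1] for c in partitions.values())
    return 1 - kept / len(rows)

def holds_approximately(rows, lhs, rhs, eps):
    """An approximate FD holds if its g3 error is within threshold eps."""
    return g3_error(rows, lhs, rhs) <= eps
```

In the predictive temporal setting, the rows would additionally be restricted to tuples whose timestamps fall into the appropriate observation and prediction windows before the error is computed.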
Robust Adaptive Decision Making: Bayesian Optimization and Beyond
The central task in many interactive machine learning systems can be formalized as the sequential optimization of a black-box function. Bayesian optimization (BO) is a powerful model-based framework for \emph{adaptive} experimentation, where the primary goal is the optimization of the black-box function via sequentially chosen decisions. In many real-world tasks, it is essential for the decisions to be \emph{robust} against, e.g., adversarial failures and perturbations, dynamic and time-varying phenomena, a mismatch between simulations and reality, etc. Under such requirements, the standard methods and BO algorithms become inadequate. In this dissertation, we consider four research directions with the goal of enhancing robust and adaptive decision making in BO and associated problems.
First, we study the related problem of level-set estimation (LSE) with Gaussian Processes (GPs). While in BO the goal is to find a maximizer of the unknown function, in LSE one seeks to find all "sufficiently good" solutions. We propose an efficient confidence-bound based algorithm that treats BO and LSE in a unified fashion. It is effective in settings that are non-trivial to incorporate into existing algorithms, including cases with pointwise costs, heteroscedastic noise, and multi-fidelity settings. Our main result is a general regret guarantee that covers these aspects.
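A minimal confidence-bound LSE step in this spirit might look as follows, assuming an RBF kernel, a fixed exploration parameter beta, and a finite candidate grid (a 2-D numpy array); this is an illustrative sketch rather than the thesis's unified algorithm or its cost-aware and multi-fidelity extensions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def lse_step(X_obs, y_obs, candidates, h, beta=3.0):
    """One LSE iteration: classify candidates whose confidence interval
    clears the level h, and pick the most ambiguous point to query next."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-3)
    gp.fit(X_obs, y_obs)
    mean, std = gp.predict(candidates, return_std=True)
    lcb = mean - np.sqrt(beta) * std
    ucb = mean + np.sqrt(beta) * std
    above = candidates[lcb > h]   # confidently in the super-level set
    below = candidates[ucb < h]   # confidently below the level
    # Ambiguity: how deeply the confidence interval straddles the level.
    ambiguity = np.minimum(ucb - h, h - lcb)
    next_x = candidates[int(np.argmax(ambiguity))]
    return above, below, next_x
```

Repeating this step until every candidate is classified yields the super- and sub-level sets; replacing the classification rule with an upper-confidence maximization recovers standard GP-UCB-style BO, which is what makes a unified treatment natural.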
Next, we consider GP optimization with a robustness requirement: an adversary may perturb the returned design, and so we seek to find a robust maximizer in case this occurs. This requirement is motivated by, e.g., settings where the functions during the optimization and implementation stages are different. We propose a novel robust confidence-bound based algorithm. Rigorous regret guarantees for this algorithm are established and complemented with an algorithm-independent lower bound. We experimentally demonstrate that our robust approach consistently succeeds in finding a robust maximizer while standard BO methods fail.
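The robust reporting rule can be stated compactly: instead of reporting argmax_x mu(x), report argmax_x min_delta mu(x + delta) over a set of admissible perturbations. A toy grid-based sketch follows, with gp, grid, and deltas as assumed inputs (e.g., the fitted model from the previous snippet); the thesis's algorithm instead optimizes robust confidence bounds and comes with regret guarantees this sketch lacks.

```python
import numpy as np

def robust_report(gp, grid, deltas):
    """Report the point maximizing the worst-case perturbed posterior mean."""
    best_x, best_val = None, -np.inf
    for x in grid:
        # Evaluate the posterior mean at every admissible perturbation of x.
        perturbed = np.array([np.asarray(x) + np.asarray(d) for d in deltas])
        worst = gp.predict(perturbed).min()  # adversary picks the perturbation
        if worst > best_val:
            best_x, best_val = x, worst
    return best_x, best_val
```

The max-min structure is what separates this from standard BO: a sharp, isolated peak scores poorly here, while a broad plateau of good values wins.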
We then investigate the problem of GP optimization in which the reward function varies with time. The setting is motivated by many practical applications in which the function to be optimized is not static. We model the unknown reward function via a GP whose evolution obeys a simple Markov model. Two confidence-bound based algorithms with the ability to "forget" about old data are proposed. We obtain regret bounds for these algorithms that jointly depend on the time horizon and the rate at which the function varies.
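The simple Markov model mentioned here, f_{t+1} = sqrt(1 - eps) * f_t + sqrt(eps) * g_t with g_t an independent GP sample, is known in the time-varying GP bandit literature to induce a joint spatio-temporal kernel in which spatial similarity decays geometrically with the time gap. A small sketch, with the RBF spatial part and length scale as our assumptions:

```python
import numpy as np

def tv_kernel(x1, t1, x2, t2, eps, length_scale=0.5):
    """k((x1,t1),(x2,t2)) = (1 - eps)^(|t1 - t2| / 2) * k_RBF(x1, x2):
    older observations are smoothly down-weighted, i.e. 'forgotten'."""
    sq = np.sum((np.asarray(x1, float) - np.asarray(x2, float)) ** 2)
    spatial = np.exp(-sq / (2 * length_scale ** 2))
    return (1 - eps) ** (abs(t1 - t2) / 2) * spatial
```

Plugging this kernel into standard GP regression automatically implements the "forgetting" behavior: observations far in the past contribute little to the posterior at the current time step.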
Finally, we consider the maximization of a set function subject to a cardinality constraint $k$ in the case where a number of items from the returned set may be removed. One notable application is in batch BO, where we need to select experiments to run, but some of them can fail. Our focus is on the worst-case adversarial setting, and we consider both \emph{submodular} (i.e., satisfying a natural notion of diminishing returns) and \emph{non-submodular} objectives. We propose robust algorithms that achieve constant-factor approximation guarantees. In the submodular case, the result on the maximum number of allowed removals is improved to $o(k)$ in comparison to the previously known $o(\sqrt{k})$. In the non-submodular case, we obtain new guarantees in the support selection and batch BO tasks. We empirically demonstrate the robust performance of our algorithms in these tasks, as well as in data summarization and influence maximization.
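To make the worst-case removal objective concrete, here is a toy max-coverage instance: plain greedy builds a set of size k, and an adversary then deletes tau of the chosen items so as to minimize the remaining value. The dissertation's robust algorithms (not reproduced here) are designed so that this worst-case value stays within a constant factor of the best possible; all names below are illustrative.

```python
from itertools import combinations

def greedy_cover(universe_sets, k):
    """Standard greedy for max coverage (a monotone submodular objective).
    universe_sets maps an item name to the set of elements it covers."""
    chosen, covered = [], set()
    for _ in range(k):
        s = max((s for s in universe_sets if s not in chosen),
                key=lambda s: len(universe_sets[s] - covered))
        chosen.append(s)
        covered |= universe_sets[s]
    return chosen

def worst_case_value(universe_sets, chosen, tau):
    """Coverage left after an adversary removes the tau most damaging items."""
    worst = None
    for removed in combinations(chosen, tau):
        kept = [s for s in chosen if s not in removed]
        covered = set().union(*(universe_sets[s] for s in kept)) if kept else set()
        worst = len(covered) if worst is None else min(worst, len(covered))
    return worst
```

Plain greedy fares poorly under this evaluation because its early picks concentrate most of the value in a few items; robust variants deliberately spread the value so that no tau removals can hurt too much.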