18 research outputs found

    Max-Min Diversification with Fairness Constraints: Exact and Approximation Algorithms

    Diversity maximization aims to select a diverse and representative subset of items from a large dataset. It is a fundamental optimization task that finds applications in data summarization, feature selection, web search, recommender systems, and elsewhere. However, in a setting where data items are associated with different groups according to sensitive attributes like sex or race, it is possible that algorithmic solutions for this task, if left unchecked, will under- or over-represent some of the groups. Therefore, we are motivated to address the problem of \emph{max-min diversification with fairness constraints}, aiming to select $k$ items to maximize the minimum distance between any pair of selected items while ensuring that the number of items selected from each group falls within predefined lower and upper bounds. In this work, we propose an exact algorithm based on integer linear programming that is suitable for small datasets as well as a $\frac{1-\varepsilon}{5}$-approximation algorithm for any $\varepsilon \in (0, 1)$ that scales to large datasets. Extensive experiments on real-world datasets demonstrate the superior performance of our proposed algorithms over existing ones. Comment: 13 pages, 8 figures, to appear in SDM '2
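
    The exact method above is based on integer linear programming. A minimal sketch of one standard way to cast fair max-min diversification as a sequence of ILP feasibility problems is shown below, using the PuLP modeling front-end with its bundled CBC solver; the function and variable names, the loop over candidate thresholds in decreasing order (a binary search would be faster), and the small-instance scope are illustrative assumptions, not the paper's exact formulation.

```python
from itertools import combinations
import math

import pulp  # MILP modeling library with the bundled CBC solver


def fair_max_min_ilp(points, groups, k, lower, upper):
    """Small-scale exact sketch: find the largest threshold t for which a
    fair selection exists whose pairwise distances are all >= t."""
    n = len(points)
    dist = {(i, j): math.dist(points[i], points[j])
            for i, j in combinations(range(n), 2)}
    for t in sorted(set(dist.values()), reverse=True):    # candidate objective values
        prob = pulp.LpProblem("fair_max_min", pulp.LpMaximize)
        x = pulp.LpVariable.dicts("x", range(n), cat="Binary")
        prob += pulp.lpSum(x.values())                     # placeholder objective
        prob += pulp.lpSum(x.values()) == k                # select exactly k items
        for g in set(groups):                              # per-group bounds
            members = [x[i] for i in range(n) if groups[i] == g]
            prob += pulp.lpSum(members) >= lower[g]
            prob += pulp.lpSum(members) <= upper[g]
        for (i, j), d in dist.items():                     # forbid pairs closer than t
            if d < t:
                prob += x[i] + x[j] <= 1
        if prob.solve(pulp.PULP_CBC_CMD(msg=False)) == 1:  # status 1 == "Optimal"
            return t, [i for i in range(n) if x[i].value() > 0.5]
    return None, []
```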

    Streaming Algorithms for Diversity Maximization with Fairness Constraints

    Diversity maximization is a fundamental problem with wide applications in data summarization, web search, and recommender systems. Given a set $X$ of $n$ elements, it asks to select a subset $S$ of $k \ll n$ elements with maximum \emph{diversity}, as quantified by the dissimilarities among the elements in $S$. In this paper, we focus on the diversity maximization problem with fairness constraints in the streaming setting. Specifically, we consider the max-min diversity objective, which selects a subset $S$ that maximizes the minimum distance (dissimilarity) between any pair of distinct elements within it. Assuming that the set $X$ is partitioned into $m$ disjoint groups by some sensitive attribute, e.g., sex or race, ensuring \emph{fairness} requires that the selected subset $S$ contains $k_i$ elements from each group $i \in [1,m]$. A streaming algorithm should process $X$ sequentially in one pass and return a subset with maximum \emph{diversity} while guaranteeing the fairness constraint. Although diversity maximization has been extensively studied, the only known algorithms that can work with the max-min diversity objective and fairness constraints are very inefficient for data streams. Since diversity maximization is NP-hard in general, we propose two approximation algorithms for fair diversity maximization in data streams, the first of which is $\frac{1-\varepsilon}{4}$-approximate and specific for $m=2$, where $\varepsilon \in (0,1)$, and the second of which achieves a $\frac{1-\varepsilon}{3m+2}$-approximation for an arbitrary $m$. Experimental results on real-world and synthetic datasets show that both algorithms provide solutions of comparable quality to the state-of-the-art algorithms while running several orders of magnitude faster in the streaming setting. Comment: 13 pages, 11 figures; published in ICDE 202
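
    The streaming algorithms above are threshold-based. A minimal single-pass sketch of that idea, assuming one fixed distance guess mu (a hypothetical parameter), is given below; the actual algorithms run a geometric grid of guesses in parallel, keep extra per-group candidates, and post-process so that each group contributes exactly $k_i$ elements.

```python
import math


def stream_threshold_select(stream, k_per_group, mu):
    """One pass over the stream: keep an element if its group still needs
    members and it lies at distance >= mu from everything kept so far."""
    kept = {g: [] for g in k_per_group}
    for point, group in stream:                  # stream yields (point, group) pairs
        if len(kept[group]) < k_per_group[group] and all(
            math.dist(point, q) >= mu for pool in kept.values() for q in pool
        ):
            kept[group].append(point)
    return kept
```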

    Diverse Data Selection under Fairness Constraints

    Diversity is an important principle in data selection and summarization, facility location, and recommendation systems. Our work focuses on maximizing diversity in data selection, while offering fairness guarantees. In particular, we offer the first study that augments the Max-Min diversification objective with fairness constraints. More specifically, given a universe $\mathcal{U}$ of $n$ elements that can be partitioned into $m$ disjoint groups, we aim to retrieve a $k$-sized subset that maximizes the pairwise minimum distance within the set (diversity) and contains a pre-specified number $k_i$ of elements from each group $i$ (fairness). We show that this problem is NP-complete even in metric spaces, and we propose three novel algorithms, linear in $n$, that provide strong theoretical approximation guarantees for different values of $m$ and $k$. Finally, we extend our algorithms and analysis to the case where groups can be overlapping.
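
    For intuition, the unconstrained building block behind such algorithms is the classic farthest-point (GMM / Gonzalez) greedy, sketched below; the paper's fair variants build on it, e.g., by running it per group or repairing its output to meet the quotas $k_i$. Names here are illustrative, not the paper's.

```python
import math


def gmm(points, k):
    """Farthest-point greedy: a classic 2-approximation for the
    unconstrained max-min diversification objective."""
    selected = [points[0]]                       # arbitrary first point
    while len(selected) < min(k, len(points)):
        farthest = max(points,
                       key=lambda p: min(math.dist(p, s) for s in selected))
        selected.append(farthest)
    return selected
```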

    Improved Approximation and Scalability for Fair Max-Min Diversification

    Given an $n$-point metric space $(\mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups and a set of integers $k_1, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $i\in [m]$, such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample achieves a balance over diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results: 1. We first consider general metric spaces. We present a randomized polynomial time algorithm that returns a factor $2$-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-\epsilon$ for any constant $\epsilon$. We also present a linear time algorithm returning an $(m+1)$-approximation with exact fairness. The best previous result was a $(3m-1)$-approximation. 2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimensions, categories, and any constant $\epsilon>0$, we present a $(1+\epsilon)$-approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time, where $k=k_1+\ldots+k_m$. We can improve the running time to $O(nk)+\mathrm{poly}(k)$ at the expense of only picking $(1-\epsilon) k_i$ points from category $i\in [m]$. Finally, we present algorithms suitable for processing massive data sets, including single-pass data stream algorithms and composable coresets for distributed processing. Comment: To appear in ICDT 202
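
    The abstract also mentions composable coresets for distributed processing. A hedged sketch of that general idea follows: each machine shrinks its partition to a small per-group summary via farthest-point greedy, and the union of the summaries is then handed to any fair max-min solver (exact or approximate). The summary size and helper names are assumptions for illustration, not the paper's construction.

```python
import math


def farthest_point(points, size):
    """Greedy summary of one group's points on one machine."""
    core = [points[0]]
    while len(core) < min(size, len(points)):
        core.append(max(points,
                        key=lambda p: min(math.dist(p, c) for c in core)))
    return core


def composable_coreset(partitions, group_of, k_per_group):
    """Union of per-partition, per-group summaries; a fair solver runs on it."""
    union = []
    for part in partitions:                      # one partition per machine
        for g, quota in k_per_group.items():
            members = [p for p in part if group_of[p] == g]
            if members:
                union.extend(farthest_point(members, quota))
    return union
```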

    VI Workshop on Computational Data Analysis and Numerical Methods: Book of Abstracts

    The VI Workshop on Computational Data Analysis and Numerical Methods (WCDANM) will be held on June 27-29, 2019, in the Department of Mathematics of the University of Beira Interior (UBI), Covilhã, Portugal. It is a unique opportunity to disseminate scientific research related to the areas of Mathematics in general, with particular relevance to Computational Data Analysis and Numerical Methods in the theoretical and/or practical field, using new techniques and giving special emphasis to applications in Medicine, Biology, Biotechnology, Engineering, Industry, Environmental Sciences, Finance, Insurance, Management and Administration. The meeting will provide a forum for discussion and debate of ideas of interest to the scientific community in general. New scientific collaborations among colleagues, namely in Masters and PhD projects, are also expected to arise from the meeting. The event is open to the entire scientific community (with or without a communication/poster).

    Explainable temporal data mining techniques to support the prediction task in Medicine

    In the last decades, the increasing amount of data available in all fields has raised the need to discover new knowledge and explain the hidden information found. On one hand, the rapid increase of interest in, and use of, artificial intelligence (AI) in computer applications has raised a parallel concern about its ability (or lack thereof) to provide understandable, or explainable, results to users. In the biomedical informatics and computer science communities, there is considerable discussion about the "un-explainable" nature of artificial intelligence, where algorithms and systems often leave users, and even developers, in the dark with respect to how results were obtained. Especially in the biomedical context, the need to explain the results of an artificial intelligence system is legitimated by the importance of patient safety. On the other hand, current database systems enable us to store huge quantities of data, and their analysis through data mining techniques makes it possible to extract relevant knowledge and useful hidden information. Relationships and patterns within these data could provide new medical knowledge. The analysis of such healthcare/medical data collections could greatly help to observe the health conditions of the population and extract useful information that can be exploited in the assessment of healthcare/medical processes. In particular, the prediction of medical events is essential for preventing disease, understanding disease mechanisms, and increasing patient quality of care. In this context, an important aspect is to verify whether the database content supports the capability of predicting future events. In this thesis, we start by addressing the problem of explainability, discussing some of the most significant challenges that need to be addressed with scientific and engineering rigor in a variety of biomedical domains. We analyze the "temporal component" of explainability, detailing different perspectives such as the use of temporal data, the temporal task, the temporal reasoning, and the dynamics of explainability with respect to the user perspective and to knowledge. Starting from this panorama, we focus our attention on two different temporal data mining techniques. With the first one, based on trend abstractions, we start from the concept of Trend-Event Pattern and, moving through the concept of prediction, propose a new kind of predictive temporal pattern, namely Predictive Trend-Event Patterns (PTE-Ps); the framework aims to combine complex temporal features to extract a compact and non-redundant predictive set of patterns composed of such temporal features. With the second one, based on functional dependencies, we propose a methodology for deriving a new kind of approximate temporal functional dependency, called Approximate Predictive Functional Dependencies (APFDs), based on a three-window framework. We then discuss the concept of approximation, the data complexity of deriving an APFD, the introduction of two new error measures, and finally the quality of APFDs in terms of coverage and reliability. Exploiting these methodologies, we analyze intensive care unit data from the MIMIC dataset.
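
    As a non-temporal point of reference for the approximate-dependency part (not the thesis's new error measures), the classic g3 error of an approximate functional dependency X -> Y is the minimum fraction of rows that must be removed for the dependency to hold exactly. A small illustrative computation, with an assumed row/attribute representation, is sketched below.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Classic g3 measure: fraction of rows to drop so that lhs -> rhs holds.
    rows: list of dicts; lhs, rhs: lists of attribute names."""
    by_lhs = defaultdict(Counter)
    for row in rows:
        by_lhs[tuple(row[a] for a in lhs)][tuple(row[a] for a in rhs)] += 1
    keep = sum(c.most_common(1)[0][1] for c in by_lhs.values())
    return 1.0 - keep / len(rows)
```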

    29th International Symposium on Algorithms and Computation: ISAAC 2018, December 16-19, 2018, Jiaoxi, Yilan, Taiwan


    Robust Adaptive Decision Making: Bayesian Optimization and Beyond

    The central task in many interactive machine learning systems can be formalized as the sequential optimization of a black-box function. Bayesian optimization (BO) is a powerful model-based framework for \emph{adaptive} experimentation, where the primary goal is the optimization of the black-box function via sequentially chosen decisions. In many real-world tasks, it is essential for the decisions to be \emph{robust} against, e.g., adversarial failures and perturbations, dynamic and time-varying phenomena, a mismatch between simulations and reality, etc. Under such requirements, the standard methods and BO algorithms become inadequate. In this dissertation, we consider four research directions with the goal of enhancing robust and adaptive decision making in BO and associated problems. First, we study the related problem of level-set estimation (LSE) with Gaussian processes (GPs). While in BO the goal is to find a maximizer of the unknown function, in LSE one seeks to find all "sufficiently good" solutions. We propose an efficient confidence-bound based algorithm that treats BO and LSE in a unified fashion. It is effective in settings that are non-trivial to incorporate into existing algorithms, including cases with pointwise costs, heteroscedastic noise, and multi-fidelity settings. Our main result is a general regret guarantee that covers these aspects. Next, we consider GP optimization with a robustness requirement: an adversary may perturb the returned design, and so we seek to find a maximizer that is robust should this occur. This requirement is motivated by, e.g., settings where the functions during the optimization and implementation stages are different. We propose a novel robust confidence-bound based algorithm. Rigorous regret guarantees for this algorithm are established and complemented with an algorithm-independent lower bound. We experimentally demonstrate that our robust approach consistently succeeds in finding a robust maximizer where standard BO methods fail. We then investigate the problem of GP optimization in which the reward function varies with time. The setting is motivated by many practical applications in which the function to be optimized is not static. We model the unknown reward function via a GP whose evolution obeys a simple Markov model. Two confidence-bound based algorithms with the ability to "forget" old data are proposed. We obtain regret bounds for these algorithms that jointly depend on the time horizon and the rate at which the function varies. Finally, we consider the maximization of a set function subject to a cardinality constraint $k$ in the case where up to $\tau$ items from the returned set may be removed. One notable application is in batch BO, where we need to select experiments to run but some of them may fail. Our focus is on the worst-case adversarial setting, and we consider both \emph{submodular} (i.e., satisfying a natural notion of diminishing returns) and \emph{non-submodular} objectives. We propose robust algorithms that achieve constant-factor approximation guarantees. In the submodular case, the result on the maximum number of allowed removals is improved to $\tau = o(k)$, compared to the previously known $\tau = o(\sqrt{k})$. In the non-submodular case, we obtain new guarantees for the support selection and batch BO tasks. We empirically demonstrate the robust performance of our algorithms in these tasks, as well as in data summarization and influence maximization.
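
    Most of the algorithms summarized above are confidence-bound based. A minimal sketch of the shared ingredient, over a finite candidate set with a given GP posterior mean mu and standard deviation sigma (assumed inputs), is shown below: a UCB-style rule picks the next point to query for BO, and the same bounds classify points against a level h for LSE. This mirrors the unified treatment described in the abstract, not the dissertation's exact algorithms.

```python
import numpy as np


def ucb_select(mu, sigma, beta):
    """Next query point: maximize the upper confidence bound mu + sqrt(beta) * sigma."""
    return int(np.argmax(mu + np.sqrt(beta) * sigma))


def lse_classify(mu, sigma, beta, h):
    """Indices of points confidently above/below the level h under the same bounds."""
    upper = mu + np.sqrt(beta) * sigma
    lower = mu - np.sqrt(beta) * sigma
    return np.where(lower > h)[0], np.where(upper < h)[0]
```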