    Recovering from Biased Data: Can Fairness Constraints Improve Accuracy?

    Multiple fairness constraints have been proposed in the literature, motivated by a range of concerns about how demographic groups might be treated unfairly by machine learning classifiers. In this work we consider a different motivation: learning from biased training data. We posit several ways in which training data may be biased, including a noisier or negatively biased labeling process for members of a disadvantaged group, a decreased prevalence of positive or negative examples from the disadvantaged group, or both. Given such biased training data, Empirical Risk Minimization (ERM) may produce a classifier that not only is biased but also has suboptimal accuracy on the true data distribution. We examine the ability of fairness-constrained ERM to correct this problem. In particular, we find that the Equal Opportunity fairness constraint [Hardt et al., 2016] combined with ERM will provably recover the Bayes optimal classifier under a range of bias models. We also consider other recovery methods, including re-weighting the training data, Equalized Odds, Demographic Parity, and Calibration. These theoretical results provide additional motivation for considering fairness interventions even if an actor cares primarily about accuracy.
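
As a rough illustration of the Equal Opportunity criterion named in the abstract above (equal true positive rates across groups), the sketch below measures how far a set of predictions is from satisfying it. This is not the paper's fairness-constrained ERM procedure; the function names `true_positive_rate` and `equal_opportunity_gap` and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Sequence


def true_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of truly positive examples that are predicted positive."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_on_positives:
        raise ValueError("no positive examples in this group")
    return sum(preds_on_positives) / len(preds_on_positives)


def equal_opportunity_gap(
    y_true: Sequence[int], y_pred: Sequence[int], group: Sequence[Hashable]
) -> float:
    """Largest difference in true positive rate between any two groups.

    A gap of 0 means the predictions satisfy Equal Opportunity exactly
    on this sample (hypothetical helper, not from the paper).
    """
    rates = []
    for g in set(group):
        idx = [i for i, gi in enumerate(group) if gi == g]
        rates.append(
            true_positive_rate([y_true[i] for i in idx], [y_pred[i] for i in idx])
        )
    return max(rates) - min(rates)


# Toy data: group "a" has TPR 1/2, group "b" has TPR 2/2.
y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1]
group = ["a", "a", "a", "b", "b", "b"]
gap = equal_opportunity_gap(y_true, y_pred, group)
```

A constrained learner would restrict attention to classifiers whose gap is zero (or below a tolerance) on the training sample before minimizing empirical error.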

    Fast Private Data Release Algorithms for Sparse Queries

    We revisit the problem of accurately answering large classes of statistical queries while preserving differential privacy. Previous approaches to this problem have either been very general but without running time polynomial in the size of the database, have applied only to very limited classes of queries, or have relaxed the notion of worst-case error guarantees. In this paper we consider the large class of sparse queries, which take non-zero values on only polynomially many universe elements. We give efficient query release algorithms for this class, in both the interactive and the non-interactive setting. Our algorithms also achieve better accuracy bounds than previous general techniques do when applied to sparse queries: our bounds are independent of the universe size. In fact, even the runtime of our interactive mechanism is independent of the universe size, and so it can be implemented in the "infinite universe" model in which no finite universe need be specified by the data curator.

    Advancing Subgroup Fairness via Sleeping Experts

    We study methods for improving fairness to subgroups in settings with overlapping populations and sequential predictions. Classical notions of fairness focus on the balance of some property across different populations. However, in many applications the goal of the different groups is not to be predicted equally but rather to be predicted well. We demonstrate that the task of satisfying this guarantee for multiple overlapping groups is not straightforward, and show that for the simple objective of the unweighted average of the false negative and false positive rates, satisfying it for overlapping populations can be statistically impossible even when we are provided predictors that perform well separately on each subgroup. On the positive side, we show that when individuals are equally important to the different groups they belong to, this goal is achievable; to do so, we draw a connection to the sleeping experts literature in online learning. Motivated by the one-sided feedback in natural settings of interest, we extend our results to such a feedback model. We also provide a game-theoretic interpretation of our results, examining the incentives of participants to join the system and to provide the system full information about predictors they may possess. We end with several interesting open problems concerning the strength of guarantees that can be achieved in a computationally efficient manner.

    On Statistical Query Sampling and NMR Quantum Computing

    We introduce a "Statistical Query Sampling" model, in which the goal of an algorithm is to produce an element in a hidden set $S \subseteq \{0,1\}^n$ with reasonable probability. The algorithm gains information about $S$ through oracle calls (statistical queries), where the algorithm submits a query function $g(\cdot)$ and receives an approximation to $\Pr_{x \in S}[g(x)=1]$. We show how this model is related to NMR quantum computing, in which only statistical properties of an ensemble of quantum systems can be measured, and in particular to the question of whether one can translate standard quantum algorithms to the NMR setting without putting all of their classical post-processing into the quantum system. Using Fourier analysis techniques developed in the related context of statistical query learning, we prove a number of lower bounds (both information-theoretic and cryptographic) on the ability of algorithms to produce an $x \in S$, even when the set $S$ is fairly simple. These lower bounds point out a difficulty in efficiently applying NMR quantum computing to algorithms such as Shor's and Simon's algorithms that involve significant classical post-processing. We also explicitly relate the notion of statistical query sampling to that of statistical query learning. An extended abstract appeared in the 18th Annual IEEE Conference on Computational Complexity (CCC 2003), 2003. Keywords: statistical query, NMR quantum computing, lower bound. Comment: 17 pages, no figures.
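
To make the oracle interaction in the abstract above concrete, here is a minimal sketch of a statistical query against a hidden set. It assumes, for simplicity, that the oracle answers exactly, whereas the abstract specifies only an approximation to the probability; the name `sq_oracle` and the example set are hypothetical.

```python
from typing import Callable, Collection, Tuple

BitString = Tuple[int, ...]


def sq_oracle(hidden_set: Collection[BitString], g: Callable[[BitString], int]) -> float:
    """Return Pr_{x in S}[g(x) = 1] for x drawn uniformly from the hidden set S.

    Simplifying assumption: the answer is exact. In the model described
    above the algorithm would only receive an approximation.
    """
    if not hidden_set:
        raise ValueError("hidden set must be non-empty")
    return sum(1 for x in hidden_set if g(x) == 1) / len(hidden_set)


# Hypothetical hidden set S, a subset of {0,1}^3.
S = {(0, 1, 1), (1, 1, 0), (1, 1, 1)}

# Query "is the first bit 1?" and "is the second bit 1?".
p_first = sq_oracle(S, lambda x: x[0])
p_second = sq_oracle(S, lambda x: x[1])
```

The algorithm never sees an element of `S` directly; it sees only aggregate answers such as `p_first` and `p_second`, which is what makes producing a member of `S` nontrivial.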

    Center-based Clustering under Perturbation Stability

    Clustering under most popular objective functions is NP-hard, even to approximate well, and so unlikely to be efficiently solvable in the worst case. Recently, Bilu and Linial \cite{Bilu09} suggested an approach aimed at bypassing this computational barrier by using properties of instances one might hope to hold in practice. In particular, they argue that instances in practice should be stable to small perturbations in the metric space and give an efficient algorithm for clustering instances of the Max-Cut problem that are stable to perturbations of size $O(n^{1/2})$. In addition, they conjecture that instances stable to as little as $O(1)$ perturbations should be solvable in polynomial time. In this paper we prove that this conjecture is true for any center-based clustering objective (such as $k$-median, $k$-means, and $k$-center). Specifically, we show we can efficiently find the optimal clustering assuming only stability to factor-3 perturbations of the underlying metric in spaces without Steiner points, and stability to factor $2+\sqrt{3}$ perturbations for general metrics. In particular, we show for such instances that the popular Single-Linkage algorithm combined with dynamic programming will find the optimal clustering. We also present NP-hardness results under a weaker but related condition.

    Noise-Tolerant Learning, the Parity Problem, and the Statistical Query Model

    We describe a slightly sub-exponential time algorithm for learning parity functions in the presence of random classification noise. This results in a polynomial-time algorithm for the case of parity functions that depend on only the first O(log n log log n) bits of input. This is the first known instance of an efficient noise-tolerant algorithm for a concept class that is provably not learnable in the Statistical Query model of Kearns. Thus, we demonstrate that the set of problems learnable in the statistical query model is a strict subset of those problems learnable in the presence of noise in the PAC model. In coding-theory terms, what we give is a poly(n)-time algorithm for decoding linear k by n codes in the presence of random noise for the case of k = c log n log log n for some c > 0. (The case of k = O(log n) is trivial, since one can just individually check each of the 2^k possible messages and choose the one that yields the closest codeword.) A natural extension of the statistical query model is to allow queries about statistical properties that involve t-tuples of examples (as opposed to single examples). The second result of this paper is to show that any class of functions learnable (strongly or weakly) with t-wise queries for t = O(log n) is also weakly learnable with standard unary queries. Hence this natural extension to the statistical query model does not increase the set of weakly learnable functions.
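
The parenthetical remark above about the trivial case k = O(log n) can be sketched as follows: enumerate all 2^k messages, encode each with the generator matrix over GF(2), and keep the message whose codeword is closest in Hamming distance to the received word. This illustrates only that brute-force baseline, not the paper's sub-exponential algorithm; the generator matrix and function names are hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple

Bits = Tuple[int, ...]


def encode(message: Sequence[int], generator: Sequence[Sequence[int]]) -> Bits:
    """Multiply a length-k message by a k-by-n generator matrix over GF(2)."""
    n = len(generator[0])
    return tuple(
        sum(m & row[j] for m, row in zip(message, generator)) % 2 for j in range(n)
    )


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def brute_force_decode(received: Sequence[int], generator: Sequence[Sequence[int]]) -> Bits:
    """Try all 2^k messages and return one whose codeword is nearest to `received`."""
    k = len(generator)
    return min(
        product((0, 1), repeat=k),
        key=lambda m: hamming_distance(encode(m, generator), received),
    )


# Hypothetical 2-by-6 generator matrix; its nonzero codewords all have weight 4,
# so a single flipped bit can be corrected.
G = [
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 1],
]
sent = encode((1, 1), G)            # codeword for message (1, 1)
received = list(sent)
received[2] ^= 1                    # flip one bit to simulate noise
decoded = brute_force_decode(received, G)
```

The loop visits 2^k messages and spends O(kn) time on each, which is polynomial in n exactly when k = O(log n).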