14,711 research outputs found

    Differentially Private Multivariate Statistics with an Application to Contingency Table Analysis

    Full text link
    Differential privacy (DP) has become a rigorous central concept in privacy protection for the past decade. Among various notions of DP, ff-DP is an easily interpretable and informative concept that tightly captures privacy level by comparing trade-off functions obtained from the hypothetical test of how well the mechanism recognizes individual information in the dataset. We adopt the Gaussian differential privacy (GDP), a canonical parametric family of ff-DP. The Gaussian mechanism is a natural and fundamental mechanism that tightly achieves GDP. However, the ordinary multivariate Gaussian mechanism is not optimal with respect to statistical utility. To improve the utility, we develop the rank-deficient and James-Stein Gaussian mechanisms for releasing private multivariate statistics based on the geometry of multivariate Gaussian distribution. We show that our proposals satisfy GDP and dominate the ordinary Gaussian mechanism with respect to L2L_2-cost. We also show that the Laplace mechanism, a prime mechanism in ε\varepsilon-DP framework, is sub-optimal than Gaussian-type mechanisms under the framework of GDP. For a fair comparison, we calibrate the Laplace mechanism to the global sensitivity of the statistic with the exact approach to the trade-off function. We also develop the optimal parameter for the Laplace mechanism when applied to contingency tables. Indeed, we show that the Gaussian-type mechanisms dominate the Laplace mechanism in contingency table analysis. In addition, we apply our findings to propose differentially private χ2\chi^2-tests on contingency tables. Numerical results demonstrate that differentially private parametric bootstrap tests control the type I error rates and show higher power than other natural competitors

    DPpack: An R Package for Differentially Private Statistical Analysis and Machine Learning

    Full text link
    Differential privacy (DP) is the state-of-the-art framework for guaranteeing privacy for individuals when releasing aggregated statistics or building statistical/machine learning models from data. We develop the open-source R package DPpack that provides a large toolkit of differentially private analysis. The current version of DPpack implements three popular mechanisms for ensuring DP: Laplace, Gaussian, and exponential. Beyond that, DPpack provides a large toolkit of easily accessible privacy-preserving descriptive statistics functions. These include mean, variance, covariance, and quantiles, as well as histograms and contingency tables. Finally, DPpack provides user-friendly implementation of privacy-preserving versions of logistic regression, SVM, and linear regression, as well as differentially private hyperparameter tuning for each of these models. This extensive collection of implemented differentially private statistics and models permits hassle-free utilization of differential privacy principles in commonly performed statistical analysis. We plan to continue developing DPpack and make it more comprehensive by including more differentially private machine learning techniques, statistical modeling and inference in the future

    Differentially Private Publication of Sparse Data

    Full text link
    The problem of privately releasing data is to provide a version of a dataset without revealing sensitive information about the individuals who contribute to the data. The model of differential privacy allows such private release while providing strong guarantees on the output. A basic mechanism achieves differential privacy by adding noise to the frequency counts in the contingency tables (or, a subset of the count data cube) derived from the dataset. However, when the dataset is sparse in its underlying space, as is the case for most multi-attribute relations, then the effect of adding noise is to vastly increase the size of the published data: it implicitly creates a huge number of dummy data points to mask the true data, making it almost impossible to work with. We present techniques to overcome this roadblock and allow efficient private release of sparse data, while maintaining the guarantees of differential privacy. Our approach is to release a compact summary of the noisy data. Generating the noisy data and then summarizing it would still be very costly, so we show how to shortcut this step, and instead directly generate the summary from the input data, without materializing the vast intermediate noisy data. We instantiate this outline for a variety of sampling and filtering methods, and show how to use the resulting summary for approximate, private, query answering. Our experimental study shows that this is an effective, practical solution, with comparable and occasionally improved utility over the costly materialization approach

    Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies

    Full text link
    The protection of privacy of individual-level information in genome-wide association study (GWAS) databases has been a major concern of researchers following the publication of "an attack" on GWAS data by Homer et al. (2008) Traditional statistical methods for confidentiality and privacy protection of statistical databases do not scale well to deal with GWAS data, especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, is an approach that provides a rigorous definition of privacy with meaningful privacy guarantees in the presence of arbitrary external information, although the guarantees may come at a serious price in terms of data utility. Building on such notions, Uhler et al. (2013) proposed new methods to release aggregate GWAS data without compromising an individual's privacy. We extend the methods developed in Uhler et al. (2013) for releasing differentially-private χ2\chi^2-statistics by allowing for arbitrary number of cases and controls, and for releasing differentially-private allelic test statistics. We also provide a new interpretation by assuming the controls' data are known, which is a realistic assumption because some GWAS use publicly available data as controls. We assess the performance of the proposed methods through a risk-utility analysis on a real data set consisting of DNA samples collected by the Wellcome Trust Case Control Consortium and compare the methods with the differentially-private release mechanism proposed by Johnson and Shmatikov (2013).Comment: 28 pages, 2 figures, source code available upon reques

    Accurate and Efficient Private Release of Datacubes and Contingency Tables

    Full text link
    A central problem in releasing aggregate information about sensitive data is to do so accurately while providing a privacy guarantee on the output. Recent work focuses on the class of linear queries, which include basic counting queries, data cubes, and contingency tables. The goal is to maximize the utility of their output, while giving a rigorous privacy guarantee. Most results follow a common template: pick a "strategy" set of linear queries to apply to the data, then use the noisy answers to these queries to reconstruct the queries of interest. This entails either picking a strategy set that is hoped to be good for the queries, or performing a costly search over the space of all possible strategies. In this paper, we propose a new approach that balances accuracy and efficiency: we show how to improve the accuracy of a given query set by answering some strategy queries more accurately than others. This leads to an efficient optimal noise allocation for many popular strategies, including wavelets, hierarchies, Fourier coefficients and more. For the important case of marginal queries we show that this strictly improves on previous methods, both analytically and empirically. Our results also extend to ensuring that the returned query answers are consistent with an (unknown) data set at minimal extra cost in terms of time and noise

    On the Differential Privacy of Bayesian Inference

    Get PDF
    We study how to communicate findings of Bayesian inference to third parties, while preserving the strong guarantee of differential privacy. Our main contributions are four different algorithms for private Bayesian inference on proba-bilistic graphical models. These include two mechanisms for adding noise to the Bayesian updates, either directly to the posterior parameters, or to their Fourier transform so as to preserve update consistency. We also utilise a recently introduced posterior sampling mechanism, for which we prove bounds for the specific but general case of discrete Bayesian networks; and we introduce a maximum-a-posteriori private mechanism. Our analysis includes utility and privacy bounds, with a novel focus on the influence of graph structure on privacy. Worked examples and experiments with Bayesian na{\"i}ve Bayes and Bayesian linear regression illustrate the application of our mechanisms.Comment: AAAI 2016, Feb 2016, Phoenix, Arizona, United State

    Differentially Private Exponential Random Graphs

    Full text link
    We propose methods to release and analyze synthetic graphs in order to protect privacy of individual relationships captured by the social network. Proposed techniques aim at fitting and estimating a wide class of exponential random graph models (ERGMs) in a differentially private manner, and thus offer rigorous privacy guarantees. More specifically, we use the randomized response mechanism to release networks under ϵ\epsilon-edge differential privacy. To maintain utility for statistical inference, treating the original graph as missing, we propose a way to use likelihood based inference and Markov chain Monte Carlo (MCMC) techniques to fit ERGMs to the produced synthetic networks. We demonstrate the usefulness of the proposed techniques on a real data example.Comment: minor edit
    corecore