
    Private Graph Data Release: A Survey

    The application of graph analytics to various domains has yielded tremendous societal and economic benefits in recent years. However, the increasingly widespread adoption of graph analytics comes with a commensurate increase in the need to protect private information in graph databases, especially in light of the many privacy breaches in real-world releases of graph data that were supposed to protect sensitive information. This paper provides a comprehensive survey of private graph data release algorithms that seek to achieve a fine balance between privacy and utility, with a specific focus on provably private mechanisms. Many of these mechanisms are natural extensions of the Differential Privacy framework to graph data, but we also investigate more general privacy formulations, such as Pufferfish Privacy, that can address the limitations of Differential Privacy. A wide-ranging survey of the applications of private graph data release mechanisms to social networks, finance, supply chains, health, and energy is also provided. This survey and the taxonomy it provides should benefit practitioners and researchers alike in the increasingly important area of private graph data release and analysis.
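    To make the notion of Differential Privacy on graphs concrete, here is a minimal sketch (my own illustration, not material from the survey) of two edge-DP releases using NumPy and NetworkX. The sensitivity arguments in the comments are the standard ones for edge-level neighbouring graphs; the graph and parameter values are arbitrary.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

def edge_dp_edge_count(G, epsilon):
    """Release the edge count under edge-DP.

    Neighbouring graphs differ in one edge, so the count has
    L1 sensitivity 1; Laplace noise with scale 1/epsilon suffices.
    """
    return G.number_of_edges() + rng.laplace(scale=1.0 / epsilon)

def edge_dp_degree_histogram(G, epsilon, max_degree):
    """Release a (capped) degree histogram under edge-DP.

    Adding or removing one edge changes the degrees of exactly two
    nodes by 1, moving each between adjacent bins, so the histogram
    has L1 sensitivity at most 4.
    """
    degrees = [min(d, max_degree) for _, d in G.degree()]
    hist = np.bincount(degrees, minlength=max_degree + 1).astype(float)
    return hist + rng.laplace(scale=4.0 / epsilon, size=hist.shape)

G = nx.gnp_random_graph(200, 0.05, seed=1)
print(edge_dp_edge_count(G, epsilon=0.5))
print(edge_dp_degree_histogram(G, epsilon=0.5, max_degree=30))
```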

    Benchmarking Differential Privacy and Federated Learning for BERT Models

    Natural Language Processing (NLP) techniques can be applied to help with the diagnosis of medical conditions such as depression, using a collection of a person's utterances. Depression is a serious medical illness that can adversely affect how one feels, thinks, and acts, which can lead to emotional and physical problems. Due to the sensitive nature of such data, privacy measures need to be taken when handling it and training models on it. In this work, we study the effects that the application of Differential Privacy (DP) has, in both a centralized and a Federated Learning (FL) setup, on training contextualized language models (BERT, ALBERT, RoBERTa, and DistilBERT). We offer insights on how to privately train NLP models and on which architectures and setups provide more desirable privacy-utility trade-offs. We envisage this work being used in future healthcare and mental health studies to keep medical histories private; we therefore provide an open-source implementation of this work.
    Comment: 4 pages, 3 tables, 1 figure
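    The standard mechanism behind such private training is DP-SGD: clip each example's gradient to a fixed norm, sum, add Gaussian noise, then step. Below is a from-scratch sketch on full-batch logistic regression, a stand-in for the BERT-family training studied in the paper (where one would use a library such as Opacus and a privacy accountant); all hyperparameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(w, X, y, lr, clip_norm, noise_multiplier):
    """One DP-SGD step for logistic regression (Abadi et al., 2016).

    Per-example gradients are clipped to `clip_norm`, summed, and
    Gaussian noise with std noise_multiplier * clip_norm is added
    before averaging and updating.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    grads = (p - y)[:, None] * X                      # per-example gradients
    norms = np.linalg.norm(grads, axis=1)
    grads *= np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))[:, None]
    noisy_sum = grads.sum(axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=w.shape)
    return w - lr * noisy_sum / len(X)

# Toy data; a real run would also track the (epsilon, delta) spent
X = rng.normal(size=(256, 5))
y = (X @ rng.normal(size=5) + 0.1 * rng.normal(size=256) > 0).astype(float)
w = np.zeros(5)
for _ in range(200):
    w = dp_sgd_step(w, X, y, lr=0.5, clip_norm=1.0, noise_multiplier=1.1)
```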

    Impacts of Census Differential Privacy for Small-Area Disease Mapping to Monitor Health Inequities

    The US Census Bureau will implement a new privacy-preserving disclosure avoidance system (DAS), which includes the application of differential privacy, on publicly released 2020 census data. There are concerns that the DAS may bias small-area and demographically stratified population counts, which play a critical role in public health research by serving as denominators in the estimation of disease/mortality rates. Employing three DAS demonstration products, we quantify errors attributable to reliance on DAS-protected denominators in standard small-area disease mapping models for characterizing health inequities. We conduct simulation studies and real-data analyses of inequities in premature mortality at the census tract level in Massachusetts and Georgia. Results show that overall patterns of inequity by racialized group and economic deprivation level are not compromised by the DAS. While early versions of the DAS induce errors in mortality rate estimation that are larger for Black than for non-Hispanic white populations in Massachusetts, this issue is ameliorated in newer DAS versions.
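    As a toy illustration of the paper's concern (not its actual analysis), the following simulation shows how noise on census denominators propagates into small-area rate estimates and why smaller tracts are hit harder. The populations, death counts, and the noise model are entirely hypothetical; the real DAS injects discrete Gaussian noise inside its TopDown algorithm, which the rounded Gaussian below only mimics qualitatively.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tract populations and premature-death counts
pop = rng.integers(500, 5000, size=1000).astype(float)
deaths = rng.poisson(3e-3 * pop)

# Stand-in for DAS noise on the denominators
sigma = 30.0
noisy_pop = np.maximum(1.0, np.round(pop + rng.normal(scale=sigma, size=pop.size)))

raw_rate = deaths / pop
das_rate = deaths / noisy_pop
rel_err = np.abs(das_rate - raw_rate) / (raw_rate + 1e-12)

small = pop < 1500
print(f"median relative rate error, small tracts: {np.median(rel_err[small]):.2%}")
print(f"median relative rate error, large tracts: {np.median(rel_err[~small]):.2%}")
```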

    A Differentially Private Weighted Empirical Risk Minimization Procedure and its Application to Outcome Weighted Learning

    It is commonplace to use data containing personal information to build predictive models in the framework of empirical risk minimization (ERM). While these models can be highly accurate in prediction, results obtained from them with the use of sensitive data may be susceptible to privacy attacks. Differential privacy (DP) is an appealing framework for addressing such data privacy issues, providing mathematically provable bounds on the privacy loss incurred when releasing information from sensitive data. Previous work has primarily concentrated on applying DP to unweighted ERM. We consider an important generalization to weighted ERM (wERM), in which each individual's contribution to the objective function can be assigned a varying weight. In this context, we propose the first differentially private wERM algorithm, backed by a rigorous theoretical proof of its DP guarantees under mild regularity conditions. Extending the existing DP-ERM procedures to wERM paves a path to deriving privacy-preserving learning methods for individualized treatment rules, including the popular outcome weighted learning (OWL). We evaluate the performance of the DP-wERM application to OWL in a simulation study and in a real clinical trial of melatonin for sleep health. All empirical results demonstrate the viability of training OWL models via wERM with DP guarantees while maintaining sufficiently useful model performance. We therefore recommend that practitioners consider implementing the proposed privacy-preserving OWL procedure in real-world scenarios involving sensitive data.
    Comment: 24 pages and 2 figures for the main manuscript, 5 pages and 2 figures for the supplementary material
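    To fix ideas, here is a hedged sketch of one standard route to DP for weighted ERM: output perturbation. The noise calibration below adapts the classic unweighted sensitivity bound 2/(n*lam) of Chaudhuri et al. (2011) by scaling with the largest weight; that scaling is my illustrative stand-in, not the calibration proved in the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def dp_werm_logistic(X, y, weights, lam, epsilon):
    """Weighted L2-regularized logistic regression (y in {-1,+1})
    with output perturbation. Assumes ||x_i|| <= 1; the sensitivity
    is an illustrative weighted adaptation of the unweighted bound.
    """
    n, d = X.shape

    def obj(theta):
        return (np.mean(weights * np.logaddexp(0.0, -y * (X @ theta)))
                + 0.5 * lam * theta @ theta)

    theta_hat = minimize(obj, np.zeros(d)).x
    sensitivity = 2.0 * weights.max() / (n * lam)
    # L2-Laplace noise: uniform direction, Gamma-distributed norm
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=d, scale=sensitivity / epsilon)
    return theta_hat + magnitude * direction

# Toy usage with inverse-propensity-style weights
X = rng.normal(size=(500, 4))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1))[:, None]   # enforce ||x_i|| <= 1
y = np.sign(X @ np.ones(4) + 0.3 * rng.normal(size=500))
w = rng.uniform(0.5, 2.0, size=500)
theta_priv = dp_werm_logistic(X, y, w, lam=0.1, epsilon=1.0)
```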

    Privacy and Utility of Private Synthetic Data for Medical Data Analyses

    The increasing availability and use of sensitive personal data raise a set of issues regarding the privacy of the individuals behind the data. These concerns become even more important when health data are processed, as such data are considered sensitive under most global regulations. Privacy Enhancing Technologies (PETs) attempt to protect the privacy of individuals whilst preserving the utility of the data. One of the most popular such technologies in recent years is Differential Privacy (DP), which was used for the 2020 U.S. Census. Another trend is to combine synthetic data generators with DP to create so-called private synthetic data generators. The objective is to preserve statistical properties as accurately as possible, while the generated data should differ as much as possible from the original data with respect to private features. While these technologies seem promising, there is a gap between academic research on DP and synthetic data and the practical application and evaluation of these techniques in real-world use cases. In this paper, we evaluate three different private synthetic data generators (MWEM, DP-CTGAN, and PATE-CTGAN) on their use-case-specific privacy and utility. For the use case, continuous heart rate measurements from different individuals are analyzed. This work shows that private synthetic data generators have tremendous advantages over traditional techniques but also require in-depth analysis depending on the use case. Furthermore, each technology has different strengths, so there is no clear winner. However, DP-CTGAN often performs slightly better than the other technologies, so it can be recommended for a continuous medical data use case.
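    Of the three generators, MWEM (multiplicative weights with the exponential mechanism; Hardt, Ligett, and McSherry) is simple enough to sketch from scratch. The toy version below, over a small discrete histogram with prefix-range queries, is my own illustration of the algorithm, not the implementation evaluated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mwem(true_hist, queries, epsilon, T):
    """Minimal MWEM over a discrete histogram.

    `queries` is a (k, d) 0/1 matrix of counting queries. Each of the
    T rounds spends epsilon/(2T) selecting a badly answered query via
    the exponential mechanism and epsilon/(2T) measuring it with
    Laplace noise, then applies a multiplicative-weights update.
    """
    n = true_hist.sum()
    A = np.full_like(true_hist, n / len(true_hist))   # uniform start
    eps_round = epsilon / (2 * T)
    for _ in range(T):
        errors = np.abs(queries @ A - queries @ true_hist)
        probs = np.exp(eps_round * (errors - errors.max()) / 2)  # sensitivity 1
        probs /= probs.sum()
        q = queries[rng.choice(len(queries), p=probs)]
        measurement = q @ true_hist + rng.laplace(scale=1.0 / eps_round)
        A *= np.exp(q * (measurement - q @ A) / (2 * n))
        A *= n / A.sum()                              # renormalize mass
    return A

# Toy example: 16-bin histogram, all prefix-range queries
hist = rng.integers(0, 100, size=16).astype(float)
Q = np.tril(np.ones((16, 16)))
synthetic = mwem(hist, Q, epsilon=1.0, T=10)
```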

    Use of record-linkage to handle non-response and improve alcohol consumption estimates in health survey data: a study protocol

    <p>Introduction: Reliable estimates of health-related behaviours, such as levels of alcohol consumption in the population, are required to formulate and evaluate policies. National surveys provide such data; validity depends on generalisability, but this is threatened by declining response levels. Attempts to address bias arising from non-response are typically limited to survey weights based on sociodemographic characteristics, which do not capture differential health and related behaviours within categories. This project aims to explore and address non-response bias in health surveys with a focus on alcohol consumption.</p> <p>Methods and analysis: The Scottish Health Surveys (SHeS) aim to provide estimates representative of the Scottish population living in private households. Survey data of consenting participants (92% of the achieved sample) have been record-linked to routine hospital admission (Scottish Morbidity Records (SMR)) and mortality (from National Records of Scotland (NRS)) data for surveys conducted in 1995, 1998, 2003, 2008, 2009 and 2010 (total adult sample size around 40 000), with maximum follow-up of 16 years. Also available are census information and SMR/NRS data for the general population. Comparisons of alcohol-related mortality and hospital admission rates in the linked SHeS-SMR/NRS with those in the general population will be made. Survey data will be augmented by quantification of differences to refine alcohol consumption estimates through the application of multiple imputation or inverse probability weighting. The resulting corrected estimates of population alcohol consumption will enable superior policy evaluation. An advanced weighting procedure will be developed for wider use.</p> <p>Ethics and dissemination: Ethics approval for SHeS has been given by the National Health Service (NHS) Multi-Centre Research Ethics Committee and use of linked data has been approved by the Privacy Advisory Committee to the Board of NHS National Services Scotland and Registrar General. Funding has been granted by the MRC. The outputs will include four or five public health and statistical methodological international journal and conference papers.</p&gt

    Differentially Private Robust Linear Regression

    Differential privacy is a mathematically defined concept of data privacy based on the idea that a person should not face any additional harm by opting to give their data to a data collector. Data release mechanisms that satisfy the definition are said to be differentially private, and they guarantee the privacy of the data at a specified privacy level by utilising carefully designed randomness that sufficiently masks the participation of each individual in the data set. The introduced randomness decreases the accuracy of the data analysis, but this effect can be diminished by clever algorithmic design.

    The robust private linear regression algorithm is a differentially private mechanism originally introduced by A. Honkela, M. Das, O. Dikmen, and S. Kaski in 2016. The algorithm projects the studied data inside known bounds and applies the differentially private Laplace mechanism to perturb the sufficient statistics of a Bayesian linear regression model, which is then fitted to the data using the privatised statistics. In this thesis, the idea, definitions, and the most important theorems and properties of differential privacy are presented and discussed. The robust private linear regression algorithm is then presented in detail, including improvements related to determining and handling the parameters of the mechanism, developed during my work as a research assistant in the Probabilistic Inference and Computational Biology research group (Department of Computer Science at the University of Helsinki and Helsinki Institute for Information Technology) in 2016-2017.

    The performance of the algorithm is evaluated experimentally on both synthetic and real-life data. The latter data are from the Genomics of Drug Sensitivity in Cancer (GDSC) project and consist of the gene expression data of 985 cancer cell lines and their responses to 265 different anti-cancer drugs. The studied algorithm is applied to the GDSC data with the goal of predicting which cancer cell lines are sensitive to each drug and which are not. The application of a differentially private mechanism to the gene expression data is justified because genomic data are identifying and carry highly sensitive information about, e.g., an individual's phenotype, health, and risk of various diseases. The results presented in the thesis show that the studied algorithm works as planned and is able to benefit from having more data: in terms of prediction accuracy, it approaches the non-private version of the same algorithm as the size of the available data set increases. It also reaches considerably better accuracy than the three compared algorithms, which are based on different differentially private mechanisms: private linear regression with no projection, output-perturbed linear regression, and functional mechanism linear regression.
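    The sufficient-statistics perturbation at the core of the algorithm can be sketched briefly. The version below is my own illustration with deliberately loose sensitivity bounds (stated in the comments, under add/remove-one-row neighbouring), not the calibrated mechanism developed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_suff_stats_regression(X, y, bx, by, epsilon, lam=1.0):
    """Clip data into known bounds, privatise the sufficient
    statistics X^T X and X^T y with the Laplace mechanism, then fit
    a ridge-style (posterior-mean-like) solution from them.
    """
    n, d = X.shape
    X = np.clip(X, -bx, bx)
    y = np.clip(y, -by, by)

    # Removing one row x changes X^T X by x x^T (entries <= bx^2)
    # and X^T y by x*y (entries <= bx*by); budget split evenly
    s_xx = d * (d + 1) / 2 * bx**2      # L1 sensitivity, upper triangle
    s_xy = d * bx * by
    noise = rng.laplace(scale=2 * s_xx / epsilon, size=(d, d))
    XtX = X.T @ X + np.triu(noise) + np.triu(noise, 1).T  # symmetric
    Xty = X.T @ y + rng.laplace(scale=2 * s_xy / epsilon, size=d)

    # lam regularises and guards against a non-PSD noisy XtX
    return np.linalg.solve(XtX + lam * np.eye(d), Xty)

# Toy usage
X = np.clip(rng.normal(size=(1000, 3)), -2, 2)
y = X @ np.array([1.0, -0.5, 0.2]) + 0.1 * rng.normal(size=1000)
print(private_suff_stats_regression(X, y, bx=2.0, by=3.0, epsilon=1.0))
```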