
    Quantifying Privacy: A Novel Entropy-Based Measure of Disclosure Risk

    It is well recognised that data mining and statistical analysis pose a serious threat to privacy. This is true for financial, medical, criminal and marketing research. Numerous techniques have been proposed to protect privacy, including restriction and data modification. Recently proposed privacy models such as differential privacy and k-anonymity have received a lot of attention, and for the latter there are now several improvements of the original scheme, each removing some security shortcomings of the previous one. However, the challenge lies in evaluating and comparing the privacy provided by various techniques. In this paper we propose a novel entropy-based security measure that can be applied to any generalisation, restriction or data modification technique. We use our measure to empirically evaluate and compare a few popular methods, namely query restriction, sampling and noise addition.
    Comment: 20 pages, 4 figures
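The abstract does not give the measure's definition, but the underlying intuition (quantifying an attacker's remaining uncertainty about a sensitive value via Shannon entropy) can be sketched. The attribute values below are invented for illustration; this is not the paper's exact measure:

```python
from collections import Counter
from math import log2

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Invented sensitive-attribute values, before and after data modification.
original = ["flu", "flu", "cancer", "flu", "hiv"]
modified = ["flu", "cancer", "cancer", "hiv", "flu"]

# Higher entropy = more attacker uncertainty = lower disclosure risk.
print(shannon_entropy(original))  # ≈ 1.37 bits
print(shannon_entropy(modified))  # ≈ 1.52 bits
```

Because it only looks at a distribution of values, a measure of this shape can be computed after any restriction or modification technique, which is what makes it usable for cross-technique comparison.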

    Economic Analysis and Statistical Disclosure Limitation

    This paper explores the consequences for economic research of methods used by data publishers to protect the privacy of their respondents. We review the concept of statistical disclosure limitation for an audience of economists who may be unfamiliar with these methods. We characterize what it means for statistical disclosure limitation to be ignorable. When it is not ignorable, we consider the effects of statistical disclosure limitation for a variety of research designs common in applied economic research. Because statistical agencies do not always report the methods they use to protect confidentiality, we also characterize settings in which statistical disclosure limitation methods are discoverable; that is, they can be learned from the released data. We conclude with advice for researchers, journal editors, and statistical agencies.

    Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method

    Introduction: The amount of data generated by original research is growing exponentially. Publicly releasing these data is recommended to comply with the Open Science principles. However, data collected from human participants cannot be released as-is without raising privacy concerns. Fully synthetic data represent a promising answer to this challenge. This approach is explored by the French Centre de Recherche en Épidémiologie et Santé des Populations in the form of a synthetic data generation framework based on Classification and Regression Trees and an original distance-based filtering. The goal of this work was to develop a refined version of this framework and to assess its risk-utility profile with empirical and formal tools, including novel ones developed for the purpose of this evaluation.
    Materials and Methods: Our synthesis framework consists of four successive steps, each of which is designed to prevent specific risks of disclosure. We assessed its performance by applying two or more of these steps to a rich epidemiological dataset. Privacy and utility metrics were computed for each of the resulting synthetic datasets, which were further assessed using machine learning approaches.
    Results: Computed metrics showed a satisfactory level of protection against attribute disclosure attacks for each synthetic dataset, especially when the full framework was used. Membership disclosure attacks were formally prevented without significantly altering the data. Machine learning approaches showed a low risk of success for simulated singling out and linkability attacks. Distributional and inferential similarity with the original data were high with all datasets.
    Discussion: This work showed the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. Formal and empirical tools specifically developed for this demonstration are a valuable contribution to this field. Further research should focus on the extension and validation of these tools, in an effort to specify the intrinsic qualities of alternative data synthesis methods.
    Conclusion: By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative, which seems ripe for full-scale implementation.
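The four steps of the Open-CESP framework are not detailed in the abstract. Purely as an illustration of the general family it names (sequential CART-based synthesis), one tree-per-column sketch using scikit-learn, with invented toy attributes, might look like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def synthesize(columns, min_leaf=5):
    """Sequential CART synthesis: the first column is drawn from its marginal;
    each later column is drawn from the leaf distributions of a tree fit on
    the preceding (real) columns. All columns are integer-coded here."""
    n = len(columns[0])
    synth = [rng.choice(columns[0], size=n)]
    for j in range(1, len(columns)):
        X_real = np.column_stack(columns[:j])
        tree = DecisionTreeClassifier(min_samples_leaf=min_leaf).fit(X_real, columns[j])
        real_leaf = tree.apply(X_real)                 # leaf id of each real row
        syn_leaf = tree.apply(np.column_stack(synth))  # leaf id of each synthetic row
        col = np.empty(n, dtype=columns[j].dtype)
        for leaf in np.unique(syn_leaf):
            pool = columns[j][real_leaf == leaf]       # real values in this leaf
            mask = syn_leaf == leaf
            col[mask] = rng.choice(pool, size=mask.sum())
        synth.append(col)
    return synth

# Toy integer-coded attributes; 'income' is loosely correlated with 'age'.
age = rng.integers(0, 3, size=200)
income = (age + rng.integers(0, 2, size=200)) % 3
syn_age, syn_income = synthesize([age, income])
```

The distance-based filtering step described in the abstract (removing synthetic records too close to real ones) would run after this sampling stage; it is omitted here.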

    On the use of economic price theory to determine the optimum levels of privacy and information utility in microdata anonymisation

    Statistical data, such as microdata, is used by different organisations as a basis for creating knowledge to assist in their planning and decision-making activities. However, before microdata can be made available for analysis, it needs to be anonymised in order to protect the privacy of the individuals whose data is released. The protection of privacy requires us to hide or obscure the released data. On the other hand, making data useful for its users implies that we should provide data that is accurate, complete and precise. Ideally, we should maximise both the level of privacy and the level of information utility of a released microdata set. However, as we increase the level of privacy, the level of information utility decreases. Without guidelines to guide the selection of the optimum levels of privacy and information utility, it is difficult to determine the optimum balance between the two goals. The objective and constraints of this optimisation problem can be captured naturally with concepts from Economic Price Theory. In this thesis, we present an approach based on Economic Price Theory for guiding the process of microdata anonymisation such that optimum levels of privacy and information utility are achieved.
    Thesis (PhD), University of Pretoria, 2010. Computer Science.
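The shape of the trade-off the thesis describes can be illustrated with hypothetical curves. These curves and the equal weighting are invented for illustration only and are not the thesis's actual Economic Price Theory formulation:

```python
import math

# Hypothetical curves (not the thesis's model): information utility falls
# and the privacy level rises as the generalisation level g increases.
levels = range(1, 11)
utility = {g: 100 - 2 * g ** 2 for g in levels}
privacy = {g: 20 * math.sqrt(g) for g in levels}

# One simple notion of the optimum: the generalisation level that
# maximises an equally weighted sum of the two goals.
w = 0.5
best = max(levels, key=lambda g: w * privacy[g] + (1 - w) * utility[g])
print(best)
```

With these curves the optimum is interior (g = 2) rather than a corner solution, mirroring the thesis's point that neither privacy nor utility should be maximised in isolation.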

    Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Improve the U.S. Statistical System?

    The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodological questions of interest and significance to the broader research community and to the Federal Statistical System (FSS), particularly to the Census Bureau. The activities to date have covered both fundamental and applied statistical research and have focused at least in part on the training of current and future generations of researchers in skills of relevance to surveys and alternative measurement of economic units, households, and persons. This article focuses on some of the key research findings of the eight nodes, organized into six topics: (1) improving census and survey data-quality and data collection methods; (2) using alternative sources of data; (3) protecting privacy and confidentiality by improving disclosure avoidance; (4) using spatial and spatio-temporal statistical modeling to improve estimates; (5) assessing data cost and data-quality tradeoffs; and (6) combining information from multiple sources. The article concludes with an evaluation of the ability of the FSS to apply the NCRN’s research outcomes, suggests some next steps, and discusses the implications of this research-network model for future federal government research initiatives.

    Generating tabular datasets under differential privacy

    Machine Learning (ML) is accelerating progress across fields and industries, but relies on accessible and high-quality training data. Some of the most important datasets are found in biomedical and financial domains in the form of spreadsheets and relational databases. But this tabular data is often sensitive in nature. Synthetic data generation offers the potential to unlock sensitive data, but generative models tend to memorise and regurgitate training data, which undermines the privacy goal. To remedy this, researchers have incorporated the mathematical framework of Differential Privacy (DP) into the training process of deep neural networks. But this creates a trade-off between the quality and privacy of the resulting data. Generative Adversarial Networks (GANs) are the dominant paradigm for synthesising tabular data under DP, but suffer from unstable adversarial training and mode collapse, which are exacerbated by the privacy constraints and challenging tabular data modality. This work optimises the quality-privacy trade-off of generative models, producing higher quality tabular datasets with the same privacy guarantees. We implement novel end-to-end models that leverage attention mechanisms to learn reversible tabular representations. We also introduce TableDiffusion, the first differentially-private diffusion model for tabular data synthesis. Our experiments show that TableDiffusion produces higher-fidelity synthetic datasets, avoids the mode collapse problem, and achieves state-of-the-art performance on privatised tabular data synthesis. By implementing TableDiffusion to predict the added noise, we enabled it to bypass the challenges of reconstructing mixed-type tabular data. Overall, the diffusion paradigm proves vastly more data- and privacy-efficient than the adversarial paradigm, due to augmented re-use of each data batch and a smoother iterative training process.
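The abstract mentions incorporating DP into neural-network training. The standard ingredient of DP-SGD, clipping per-example gradients and adding calibrated Gaussian noise, can be sketched as follows; this is the generic mechanism, not necessarily TableDiffusion's exact training procedure, and the gradients below are invented toy values:

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD aggregation step: clip each per-example gradient to
    `clip_norm`, sum, add Gaussian noise calibrated to the clip norm,
    then average over the batch."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]
print(dp_gradient(grads))  # noisy average of the clipped gradients
```

Clipping bounds each example's influence on the update, which is what lets the added noise translate into a formal (ε, δ) guarantee via DP accounting.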

    A systems approach to evaluate One Health initiatives

    Challenges calling for integrated approaches to health, such as the One Health (OH) approach, typically arise from the intertwined spheres of humans, animals, and ecosystems constituting their environment. Initiatives addressing such wicked problems commonly consist of complex structures and dynamics. As a result of the EU COST Action (TD 1404) “Network for Evaluation of One Health” (NEOH), we propose an evaluation framework anchored in systems theory to address the intrinsic complexity of OH initiatives and regard them as subsystems of the context within which they operate. Typically, they intend to influence a system with a view to improve human, animal, and environmental health. The NEOH evaluation framework consists of four overarching elements, namely: (1) the definition of the initiative and its context, (2) the description of the theory of change with an assessment of expected and unexpected outcomes, (3) the process evaluation of operational and supporting infrastructures (the “OH-ness”), and (4) an assessment of the association(s) between the process evaluation and the outcomes produced. It relies on a mixed methods approach by combining a descriptive and qualitative assessment with a semi-quantitative scoring for the evaluation of the degree and structural balance of “OH-ness” (summarised in an OH-index and OH-ratio, respectively) and conventional metrics for different outcomes in a multi-criteria-decision-analysis. Here, we focus on the methodology for Elements (1) and (3) including ready-to-use Microsoft Excel spreadsheets for the assessment of the “OH-ness”. We also provide an overview of Element (2), and refer to the NEOH handbook for further details, also regarding Element (4) (http://neoh.onehealthglobal.net). 
The presented approach helps researchers, practitioners, and evaluators to conceptualise and conduct evaluations of integrated approaches to health, and facilitates comparison and learning across different OH activities, thereby supporting decisions on resource allocation. The application of the framework has been described in eight case studies in the same Frontiers research topic and provides first data on the OH-index and OH-ratio, an important step towards their validation, towards the creation of a dataset for future benchmarking, and towards demonstrating under which circumstances OH initiatives provide added value compared to disciplinary or conventional health initiatives.

    De-identifying a public use microdata file from the Canadian national discharge abstract database

    Background: The Canadian Institute for Health Information (CIHI) collects hospital discharge abstract data (DAD) from Canadian provinces and territories. There are many demands for the disclosure of this data for research and analysis to inform policy making. To expedite the disclosure of data for some of these purposes, the construction of a DAD public use microdata file (PUMF) was considered. Such purposes include: confirming some published results, providing broader feedback to CIHI to improve data quality, training students and fellows, providing an easily accessible data set for researchers to prepare for analyses on the full DAD data set, and serving as a large health data set for computer scientists and statisticians to evaluate analysis and data mining techniques. The objective of this study was to measure the probability of re-identification for records in a PUMF, and to de-identify a national DAD PUMF consisting of 10% of records.
    Methods: Plausible attacks on a PUMF were evaluated. Based on these attacks, the 2008-2009 national DAD was de-identified. A new algorithm was developed to minimize the amount of suppression while maximizing the precision of the data. The acceptable threshold for the probability of correct re-identification of a record was set at between 0.04 and 0.05. Information loss was measured in terms of the extent of suppression and entropy.
    Results: Two different PUMF files were produced, one with geographic information, and one with no geographic information but more clinical information. At a threshold of 0.05, the maximum proportion of records with the diagnosis code suppressed was 20%, but these suppressions represented only 8-9% of all values in the DAD. Our suppression algorithm has less information loss than a more traditional approach to suppression. Smaller regions, patients with longer stays, and age groups that are infrequently admitted to hospitals tend to be the ones with the highest rates of suppression.
    Conclusions: The strategies we used to maximize data utility and minimize information loss can result in a PUMF that would be useful for the specific purposes noted earlier. However, to create a more detailed file with less information loss suitable for more complex health services research, the risk would need to be mitigated by requiring the data recipient to commit to a data sharing agreement.
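The 0.04-0.05 threshold corresponds to a simple risk model in which a record's re-identification probability is 1 divided by the size of its equivalence class on the quasi-identifiers. A minimal sketch of flagging risky records under that model, with invented field names and toy data (not the study's actual algorithm, which minimises suppression more cleverly), is:

```python
from collections import Counter

def flag_risky(records, quasi_ids, threshold=0.05):
    """Flag records whose re-identification probability, modelled as
    1 / (size of the record's equivalence class on the quasi-identifiers),
    exceeds `threshold`; flagged records are candidates for suppression."""
    key = lambda r: tuple(r[q] for q in quasi_ids)
    sizes = Counter(key(r) for r in records)
    return [sizes[key(r)] < 1.0 / threshold for r in records]

# Invented toy records: 25 patients share one quasi-identifier combination,
# 3 share another. At threshold 0.05, classes smaller than 20 are flagged.
records = [{"age": "60-69", "region": "A"}] * 25 + [{"age": "20-29", "region": "B"}] * 3
flags = flag_risky(records, ["age", "region"])
```

This also makes the reported pattern intuitive: smaller regions and rarely admitted age groups produce small equivalence classes, so they attract the highest suppression rates.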