158 research outputs found

    Multiple imputation for sharing precise geographies in public use data

    Full text link
    When releasing data to the public, data stewards are ethically and often legally obligated to protect the confidentiality of data subjects' identities and sensitive attributes. They also strive to release data that are informative for a wide range of secondary analyses. Achieving both objectives is particularly challenging when data stewards seek to release highly resolved geographical information. We present an approach for protecting the confidentiality of data with geographic identifiers based on multiple imputation. The basic idea is to convert geography to latitude and longitude, estimate a bivariate response model conditional on attributes, and simulate new latitude and longitude values from these models. We illustrate the proposed methods using data describing causes of death in Durham, North Carolina. In the context of the application, we present a straightforward tool for generating simulated geographies and attributes based on regression trees, and we present methods for assessing disclosure risks with such simulated data.Comment: Published in at http://dx.doi.org/10.1214/11-AOAS506 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    30 Years of Synthetic Data

    Full text link
    The idea to generate synthetic data as a tool for broadening access to sensitive microdata has been proposed for the first time three decades ago. While first applications of the idea emerged around the turn of the century, the approach really gained momentum over the last ten years, stimulated at least in parts by some recent developments in computer science. We consider the upcoming 30th jubilee of Rubin's seminal paper on synthetic data (Rubin, 1993) as an opportunity to look back at the historical developments, but also to offer a review of the diverse approaches and methodological underpinnings proposed over the years. We will also discuss the various strategies that have been suggested to measure the utility and remaining risk of disclosure of the generated data.Comment: 42 page

    Avoiding disclosure of individually identifiable health information: a literature review

    Get PDF
    Achieving data and information dissemination without arming anyone is a central task of any entity in charge of collecting data. In this article, the authors examine the literature on data and statistical confidentiality. Rather than comparing the theoretical properties of specific methods, they emphasize the main themes that emerge from the ongoing discussion among scientists regarding how best to achieve the appropriate balance between data protection, data utility, and data dissemination. They cover the literature on de-identification and reidentification methods with emphasis on health care data. The authors also discuss the benefits and limitations for the most common access methods. Although there is abundant theoretical and empirical research, their review reveals lack of consensus on fundamental questions for empirical practice: How to assess disclosure risk, how to choose among disclosure methods, how to assess reidentification risk, and how to measure utility loss.public use files, disclosure avoidance, reidentification, de-identification, data utility

    Erzeugung Mehrfach Imputierter Synthetischer Datensätze: Theorie und Implementierung

    Get PDF
    The book describes different approaches to generating multiply imputed synthetic datasets to guarantee confidentiality. Each chapter is dedicated to one approach, first describing the general concept followed by a detailed application to a real dataset providing useful guidelines on how to implement the theory in practice.Die Arbeit beschreibt verschiedene Ansätze zur Erstellung mehrfach imputierter synthetischer Datensätze. Diese Datensätze können der interessierten Fachöffentlichkeit zur Verfügung gestellt werden, ohne den Datenschutz zu verletzen. Jedes Kapitel befasst sich mit einem eigenen Ansatz, wobei zunächst das allgemeine Konzept beschrieben wird. Anschließend bietet eine detailierte Anwendung auf einen realen Datensatz hilfreiche Richtlinien, wie sich die beschriebene Theorie in der Praxis anwenden läßt

    Towards Providing Automated Feedback on the Quality of Inferences from Synthetic Datasets

    Get PDF
    • …
    corecore