Releasing multiply-imputed synthetic data generated in two stages to protect confidentiality

Author: Drechsler Jörg
Reiter Jerome P.

"To protect the confidentiality of survey respondents' identities and sensitive attributes, statistical agencies can release data in which confidential values are replaced with multiple imputations. These are called synthetic data. We propose a two-stage approach to generating synthetic data that enables agencies to release different numbers of imputations for different variables. Generation in two stages can reduce computational burdens, decrease disclosure risk, and increase inferential accuracy relative to generation in one stage. We present methods for obtaining inferences from such data. We describe the application of two-stage synthesis to creating a public use file for a German business database." (Author's abstract, IAB-Doku)

Keywords: IAB Establishment Panel, data preparation, data anonymization, data protection, applied statistics, statistical method, labor market research, imputation methods
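The nesting behind "different numbers of imputations for different variables" can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fit_line`, `two_stage_synthesis`), the two-variable layout, and the normal and linear models are all assumptions chosen for brevity, and a real synthesizer would draw model parameters from their posterior distribution rather than plug in point estimates. Stage one draws `m` synthetic versions of a variable `x`; for each of these, stage two draws `r` synthetic versions of `y`, giving `m` nests of `r` datasets each.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x, plus residual std. dev."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(e * e for e in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd


def two_stage_synthesis(records, m, r, seed=0):
    """Return m nests, each holding r synthetic datasets of (x, y) pairs.

    All datasets in a nest share the same stage-one x values;
    only the stage-two y values differ within a nest.
    """
    rng = random.Random(seed)
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    # Toy models (assumed): x ~ Normal, y | x ~ Normal around a fitted line.
    mu_x, sd_x = statistics.mean(xs), statistics.stdev(xs)
    intercept, slope, sd_y = fit_line(xs, ys)

    nests = []
    for _ in range(m):
        # Stage one: a single synthetic draw of x for this nest.
        syn_x = [rng.gauss(mu_x, sd_x) for _ in xs]
        nest = []
        for _ in range(r):
            # Stage two: r draws of y conditional on the stage-one x.
            syn_y = [rng.gauss(intercept + slope * x, sd_y) for x in syn_x]
            nest.append(list(zip(syn_x, syn_y)))
        nests.append(nest)
    return nests
```

With `m = 3` and `r = 2` this yields six datasets in total, but only three distinct sets of `x` values, which is how a variable that is expensive or risky to synthesize can be imputed fewer times than the others.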

Research Papers in Economics

Advancing Microdata Privacy Protection: A Review of Synthetic Data

Author: Bowen Claire McKay
Hu Jingchen
Publication date: 01/08/2023

Synthetic data generation is a powerful tool for privacy protection when considering public release of record-level data files. Initially proposed about three decades ago, it has generated significant research and application interest. To meet the pressing demand of data privacy protection in a variety of contexts, the field needs more researchers and practitioners. This review provides a comprehensive introduction to synthetic data, including technical details of their generation and evaluation. Our review also addresses the challenges and limitations of synthetic data, discusses practical applications, and provides thoughts for future work.

arXiv.org e-Print Archive

30 Years of Synthetic Data

Author: Drechsler Joerg
Haensch Anna-Carolina
Publication date: 04/04/2023

The idea of generating synthetic data as a tool for broadening access to sensitive microdata was first proposed three decades ago. While first applications of the idea emerged around the turn of the century, the approach really gained momentum over the last ten years, stimulated at least in part by some recent developments in computer science. We consider the upcoming 30th anniversary of Rubin's seminal paper on synthetic data (Rubin, 1993) as an opportunity to look back at the historical developments, but also to offer a review of the diverse approaches and methodological underpinnings proposed over the years. We will also discuss the various strategies that have been suggested to measure the utility and remaining risk of disclosure of the generated data. Comment: 42 pages

arXiv.org e-Print Archive

Priv Stat Databases

In this paper we propose a method for statistical disclosure limitation of categorical variables that we call Conditional Group Swapping. This approach is suitable for design and strata-defining variables, the cross-classification of which leads to the formation of important groups or subpopulations. These groups are considered important because, from the point of view of data analysis, it is desirable to preserve analytical characteristics within them. In general, data swapping can be quite distorting ([12, 18, 15]), especially for the relationships between the variables, not only within the subpopulations but also in the overall data. To reduce the damage incurred by swapping, we propose to choose the records for swapping using conditional probabilities that depend on the characteristics of the exchanged records. In particular, our approach exploits the results of propensity score methodology for the computation of swapping probabilities. The experimental results presented in the paper show good utility properties of the method.

Grant: CC999999/ImCDC/Intramural CDC HHS/United States. Date: 2020-03-23. PMID: 32206763. PMCID: PMC7087407.

CDC Stacks

Releasing multiply-imputed synthetic data generated in two stages to protect confidentiality

Author: Drechsler Jörg
Reiter J. P.
Publication venue: Nürnberg
Publication date: 29/05/2012

One method for ensuring the confidentiality of data collected by statistical agencies is to replace confidential values with synthetic data generated by multiple imputation. A two-stage procedure for generating the synthetic data is presented that allows different numbers of imputations for different variables. The advantages of a two-stage procedure are reduced computation time, a lower risk of de-anonymization, and higher inferential accuracy. The paper describes how the two-stage procedure is applied in generating a public use file of the IAB Establishment Panel. (IAB)

"To protect the confidentiality of survey respondents' identities and sensitive attributes, statistical agencies can release data in which confidential values are replaced with multiple imputations. These are called synthetic data. We propose a two-stage approach to generating synthetic data that enables agencies to release different numbers of imputations for different variables. Generation in two stages can reduce computational burdens, decrease disclosure risk, and increase inferential accuracy relative to generation in one stage. We present methods for obtaining inferences from such data. We describe the application of two-stage synthesis to creating a public use file for a German business database." (author's abstract)

SSOAR - Social Science Open Access Repository