Quantifying Privacy: A Novel Entropy-Based Measure of Disclosure Risk

Author: A Oganian
C Dwork
CCM Fung
CJ Skinner
D Lambert
DE Denning
F Al-Saggaf
GT Duncan
JR Griggs
L Brankovic
L Sankar
L Willenborg
M Trottini
N López
NR Adam
P Horak
P Tendick
R Ahlswede
S Fletcher
S Morris
T King
V Estivill-Castro
WA Fuller
WE Winkler
WE Yancey
Y Al-Saggaf
Publication date: 07/09/2014

It is well recognised that data mining and statistical analysis pose a serious threat to privacy. This is true for financial, medical, criminal and marketing research. Numerous techniques have been proposed to protect privacy, including restriction and data modification. Recently proposed privacy models such as differential privacy and k-anonymity have received a lot of attention, and for the latter there are now several improvements of the original scheme, each removing some security shortcomings of the previous one. However, the challenge lies in evaluating and comparing the privacy provided by various techniques. In this paper we propose a novel entropy-based security measure that can be applied to any generalisation, restriction or data modification technique. We use our measure to empirically evaluate and compare a few popular methods, namely query restriction, sampling and noise addition.

Comment: 20 pages, 4 figures
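The abstract does not define the proposed measure, so the following is only a generic illustration of the idea behind entropy-based disclosure measures, not the paper's method. It assumes an intruder whose belief about one confidential value is a discrete distribution, and it uses plain Shannon entropy as the uncertainty score. The function names and the example values are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


def intruder_entropy(candidates):
    """Entropy of an intruder's belief about a confidential value.

    Assumption: each entry in `candidates` is a value the intruder
    considers possible, and repeated entries carry proportionally
    more weight.
    """
    counts = Counter(candidates)
    total = sum(counts.values())
    return shannon_entropy(c / total for c in counts.values())


# Hypothetical case 1: the exact value is disclosed, so one candidate remains.
exact = intruder_entropy([50])

# Hypothetical case 2: masking leaves four equally plausible values.
masked = intruder_entropy([48, 49, 50, 51])

print(f"exact release:  {exact} bits")
print(f"masked release: {masked} bits")
```

Under these assumptions, higher entropy means the intruder is less certain about the confidential value, which is the sense in which such a score can be compared across protection techniques.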

arXiv.org e-Print Archive

University of Newcastle's Digital Repository

Crossref

Distribution-Preserving Statistical Disclosure Limitation

Author: Gary Benedetto
Simon Woodcock

One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences because the distribution of synthetic data is completely determined by the model used to generate them. We present two practical methods of generating synthetic values when the imputer has only limited information about the true data generating process. One is applicable when the true likelihood is known up to a monotone transformation. The second requires only limited knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and sampling error in the estimated transformation. We validate the approach with a simulation and an application to a large linked employer-employee database.

Keywords: statistical disclosure limitation; confidentiality; privacy; multiple imputation; partially synthetic data

Research Papers in Economics