Using saturated models for data synthesis
The use of synthetic data sets is becoming ever more prevalent, as regulations such as the General Data Protection Regulation (GDPR), which place greater demands on the protection of individuals' personal data, are coupled with the conflicting demand to make more data available to researchers. This paper discusses the approach of synthesizing categorical data at the aggregated (contingency table) level using a saturated count model, which adds noise, and hence protection, to cell counts. The paper also discusses how distributional properties of synthesis models are intrinsic to generating synthetic data with suitable risk and utility profiles.
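As a rough illustration of the mechanism described above, the Python sketch below replaces each cell count with a draw from an overdispersed count distribution centred on that count; the negative binomial parametrisation, the dispersion value sigma, and the treatment of zero cells are illustrative assumptions rather than the paper's exact specification.

```python
# Minimal sketch of saturated-count-model synthesis: because the model is
# saturated, each cell's fitted mean equals its observed count, so synthesis
# reduces to drawing one noisy count per cell.
import numpy as np

rng = np.random.default_rng(42)

def synthesize_table(counts, sigma=0.1):
    """For each observed count mu > 0, draw a negative binomial variate
    with mean mu and variance mu + sigma * mu**2 (zero cells kept at zero,
    an illustrative assumption)."""
    counts = np.asarray(counts, dtype=float)
    out = np.zeros_like(counts, dtype=int)
    nonzero = counts > 0
    n = 1.0 / sigma                       # NB 'successes' parameter
    p = n / (n + counts[nonzero])         # numpy's NB(n, p) parametrisation
    out[nonzero] = rng.negative_binomial(n, p)
    return out

observed = np.array([[120, 5, 43],
                     [8, 77, 1]])
print(synthesize_table(observed))
```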
Obtaining (ε, δ)-differential privacy guarantees when using the Poisson distribution to synthesize tabular data
We show that differential privacy type guarantees can be obtained when using a Poisson synthesis mechanism to protect counts in contingency tables. Specifically, we show how to obtain (ε, δ)-probabilistic differential privacy guarantees via the Poisson distribution's cumulative distribution function. We demonstrate this Poisson synthesis mechanism empirically with the synthesis of the ESCrep data set, an administrative-type database that resembles the English School Census.
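The sketch below illustrates, under simplifying assumptions, how such a guarantee could be evaluated numerically: for neighbouring counts n and n + 1 the privacy loss of a Poisson release has a closed form, and delta becomes a Poisson tail probability supplied by the CDF. This is a hedged reconstruction of the general idea, not the paper's exact bound.

```python
# For a mechanism releasing y ~ Poisson(n) for a true count n, the privacy
# loss against the neighbouring count n + 1 at output y is
#   log Pois(y; n) - log Pois(y; n + 1) = 1 - y * log((n + 1) / n),
# and delta is the probability, under y ~ Poisson(n), that its absolute
# value exceeds epsilon.
import math
from scipy.stats import poisson

def delta_for_epsilon(n, eps):
    c = math.log((n + 1) / n)        # loss(y) = 1 - y * c, decreasing in y
    y_low = (1 - eps) / c            # loss(y) >  eps  for y <  y_low
    y_high = (1 + eps) / c           # loss(y) < -eps  for y >  y_high
    p_low = poisson.cdf(math.ceil(y_low) - 1, n)   # P(y < y_low)
    p_high = poisson.sf(math.floor(y_high), n)     # P(y > y_high)
    return p_low + p_high

print(delta_for_epsilon(n=10, eps=1.0))
```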
On integrating the number of synthetic data sets m into the a priori synthesis approach
The synthesis mechanism given in Jackson et al. (2022) uses saturated models, along with overdispersed count distributions, to generate synthetic categorical data. The mechanism is controlled by tuning parameters, which can be tuned according to a specific risk or utility metric. Thus expected properties of synthetic data sets can be determined analytically a priori, that is, before they are generated. While Jackson et al. (2022) considered the case of generating m = 1 data set, this paper considers generating m > 1 data sets. In effect, m becomes a tuning parameter, and the role of m in relation to the risk-utility trade-off can be shown analytically. The paper introduces a pair of risk metrics, τ3(k,d) and τ4(k,d), that are suited to m > 1 data sets; it also considers the more general issue of how best to analyse m > 1 synthetic data sets: average the data sets pre-analysis or average results post-analysis. Finally, the methods are demonstrated empirically with the synthesis of a constructed data set which is used to represent the English School Census.
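As a toy illustration of the pre- versus post-analysis question, the sketch below computes a non-linear estimand both ways, using simple Poisson noise as a stand-in for the paper's saturated-model mechanism; for linear estimands the two routes coincide, but in general they do not.

```python
# Generate m synthetic versions of a small one-way table, then compare
# "average the data sets, then analyse" against "analyse each data set,
# then average the results".
import numpy as np

rng = np.random.default_rng(1)
observed = np.array([40, 10, 25, 25])
m = 5
synthetic = rng.poisson(observed, size=(m, observed.size))

def log_odds_first_cell(table):
    p = table[0] / table.sum()
    return np.log(p / (1 - p))          # a non-linear estimand

pre = log_odds_first_cell(synthetic.mean(axis=0))              # average, then analyse
post = np.mean([log_odds_first_cell(t) for t in synthetic])    # analyse, then average
print(pre, post)   # typically close, but not identical
```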
An overview on synthetic administrative data for research
Use of administrative data for research and for planning services has increased over recent decades due to the value of the large, rich information available. However, concerns about the release of sensitive or personal data and the associated disclosure risk can lead to lengthy approval processes and restricted data access. This can delay or prevent the production of timely evidence. A promising solution to facilitate more efficient data access is to create synthetic versions of the original datasets which do not hold any confidential information and can minimise disclosure risk. Such data may be used as an interim solution, allowing researchers to develop their analysis plans on non-disclosive data, whilst waiting for access to the real data. We aim to provide an overview of the background and uses of synthetic data, describe common methods used to generate synthetic data in the context of UK administrative research, propose a simplified terminology for categories of synthetic data, and illustrate challenges and future directions for research.
Confidentiality challenges in releasing longitudinally linked data
Longitudinally linked household data allow researchers to analyse trends over time as well as on a cross-sectional level. Such analysis requires households to be linked across waves, but this increases the possibility of disclosure risks. We focus on an inter-wave disclosure risk specific to such data sets, where intruders can make use of intimate knowledge gained about the household in one wave to learn new sensitive information about the household in future waves. We consider a specific way this risk could occur when a household splits in one wave, so that an individual has left the household, and illustrate this risk using the Wealth and Assets Survey. We also show that simply removing the links between waves may be insufficient to adequately protect confidentiality. To mitigate this risk we investigate two statistical disclosure control (SDC) methods, perturbation and synthesis, that alter sensitive information on these households in the current wave. In this way no new sensitive information will be disclosed to these individuals, while utility should be largely preserved provided the SDC measures are applied appropriately.
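As a toy sketch of the perturbation option, assuming a continuous sensitive variable and multiplicative noise (both illustrative choices, not the method actually applied to the Wealth and Assets Survey):

```python
# Perturb a sensitive variable only for households flagged as having split,
# leaving other records untouched. Variable names, the noise model and its
# magnitude are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
wealth = np.array([250_000.0, 48_000.0, 1_200_000.0])
split_household = np.array([True, False, True])   # households that split this wave

noise = rng.lognormal(mean=0.0, sigma=0.1, size=wealth.size)
perturbed = np.where(split_household, wealth * noise, wealth)
print(perturbed)
```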
Using saturated count models for user-friendly synthesis of large confidential administrative databases
Over the past three decades, synthetic data methods for statistical disclosure control have continually evolved, but mainly within the domain of survey data sets. There are certain characteristics of administrative databases, such as their size, which present challenges from a synthesis perspective and require special attention. This paper, through the fitting of saturated count models, presents a synthesis method that is suitable for administrative databases. The method is tuned by two parameters; it allows large categorical data sets to be synthesized quickly and allows risk and utility metrics to be satisfied a priori, that is, prior to synthetic data generation. The paper explores how the flexibility afforded by two-parameter count models (the negative binomial and Poisson-inverse Gaussian) can be utilised to protect the privacy of respondents, especially uniques, in synthetic data. Finally, an empirical example is carried out through the synthesis of a database which can be viewed as a good substitute for the English School Census.
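One way such a two-parameter count could be generated, sketched under the assumption of a mean/dispersion parametrisation with variance mu + sigma * mu**2 (the paper's own parametrisation may differ), is by composing an inverse Gaussian mixing draw with a Poisson draw:

```python
# Poisson-inverse Gaussian by composition: lambda ~ InverseGaussian with
# mean mu and variance sigma * mu**2, then y ~ Poisson(lambda), giving
# E[y] = mu and Var[y] = mu + sigma * mu**2.
import numpy as np
from scipy.stats import invgauss

rng = np.random.default_rng(7)

def pig_sample(mu, sigma, size=1):
    # scipy's invgauss(m, scale=s) has mean m * s and variance m**3 * s**2;
    # m = sigma, s = mu / sigma yields mean mu and variance sigma * mu**2.
    lam = invgauss.rvs(sigma, scale=mu / sigma, size=size, random_state=rng)
    return rng.poisson(lam)

# Relative noise is greatest for small counts (uniques, mu = 1), where the
# protection is targeted, and tends to sqrt(sigma) for large counts.
for mu in (1, 10, 1000):
    draws = pig_sample(mu, sigma=0.5, size=10_000)
    print(mu, draws.mean().round(2), draws.std().round(2))
```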