
    Quantifying Privacy: A Novel Entropy-Based Measure of Disclosure Risk

    It is well recognised that data mining and statistical analysis pose a serious threat to privacy. This is true for financial, medical, criminal and marketing research. Numerous techniques have been proposed to protect privacy, including restriction and data modification. Recently proposed privacy models such as differential privacy and k-anonymity have received a lot of attention, and for the latter there are now several improvements of the original scheme, each removing some security shortcomings of the previous one. However, the challenge lies in evaluating and comparing the privacy provided by various techniques. In this paper we propose a novel entropy-based security measure that can be applied to any generalisation, restriction or data modification technique. We use our measure to empirically evaluate and compare a few popular methods, namely query restriction, sampling and noise addition.
    Comment: 20 pages, 4 figures
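The abstract does not give the measure's definition, but the underlying intuition (quantifying an attacker's remaining uncertainty about a sensitive value via Shannon entropy) can be sketched. The attribute values below are invented for illustration; this is not the paper's exact measure:

```python
from collections import Counter
from math import log2

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Invented sensitive-attribute values, before and after data modification.
original = ["flu", "flu", "cancer", "flu", "hiv"]
modified = ["flu", "cancer", "cancer", "hiv", "flu"]

# Higher entropy = more attacker uncertainty = lower disclosure risk.
print(shannon_entropy(original))  # ≈ 1.37 bits
print(shannon_entropy(modified))  # ≈ 1.52 bits
```

Because it only looks at a distribution of values, a measure of this shape can be computed after any restriction or modification technique, which is what makes it usable for cross-technique comparison.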

    Economic Analysis and Statistical Disclosure Limitation

    This paper explores the consequences for economic research of methods used by data publishers to protect the privacy of their respondents. We review the concept of statistical disclosure limitation for an audience of economists who may be unfamiliar with these methods. We characterize what it means for statistical disclosure limitation to be ignorable. When it is not ignorable, we consider the effects of statistical disclosure limitation for a variety of research designs common in applied economic research. Because statistical agencies do not always report the methods they use to protect confidentiality, we also characterize settings in which statistical disclosure limitation methods are discoverable; that is, they can be learned from the released data. We conclude with advice for researchers, journal editors, and statistical agencies.

    Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method

    Introduction: The amount of data generated by original research is growing exponentially. Publicly releasing these data is recommended to comply with the Open Science principles. However, data collected from human participants cannot be released as-is without raising privacy concerns. Fully synthetic data represent a promising answer to this challenge. This approach is explored by the French Centre de Recherche en Épidémiologie et Santé des Populations in the form of a synthetic data generation framework based on Classification and Regression Trees and an original distance-based filtering. The goal of this work was to develop a refined version of this framework and to assess its risk-utility profile with empirical and formal tools, including novel ones developed for the purpose of this evaluation.
    Materials and Methods: Our synthesis framework consists of four successive steps, each of which is designed to prevent specific risks of disclosure. We assessed its performance by applying two or more of these steps to a rich epidemiological dataset. Privacy and utility metrics were computed for each of the resulting synthetic datasets, which were further assessed using machine learning approaches.
    Results: Computed metrics showed a satisfactory level of protection against attribute disclosure attacks for each synthetic dataset, especially when the full framework was used. Membership disclosure attacks were formally prevented without significantly altering the data. Machine learning approaches showed a low risk of success for simulated singling out and linkability attacks. Distributional and inferential similarity with the original data were high with all datasets.
    Discussion: This work showed the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. Formal and empirical tools specifically developed for this demonstration are a valuable contribution to this field. Further research should focus on the extension and validation of these tools, in an effort to specify the intrinsic qualities of alternative data synthesis methods.
    Conclusion: By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative, which seems ripe for full-scale implementation.
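The four steps of the Open-CESP framework are not detailed in the abstract. Purely as an illustration of the general family it names (sequential CART-based synthesis), one tree-per-column sketch using scikit-learn, with invented toy attributes, might look like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def synthesize(columns, min_leaf=5):
    """Sequential CART synthesis: the first column is drawn from its marginal;
    each later column is drawn from the leaf distributions of a tree fit on
    the preceding (real) columns. All columns are integer-coded here."""
    n = len(columns[0])
    synth = [rng.choice(columns[0], size=n)]
    for j in range(1, len(columns)):
        X_real = np.column_stack(columns[:j])
        tree = DecisionTreeClassifier(min_samples_leaf=min_leaf).fit(X_real, columns[j])
        real_leaf = tree.apply(X_real)                 # leaf id of each real row
        syn_leaf = tree.apply(np.column_stack(synth))  # leaf id of each synthetic row
        col = np.empty(n, dtype=columns[j].dtype)
        for leaf in np.unique(syn_leaf):
            pool = columns[j][real_leaf == leaf]       # real values in this leaf
            mask = syn_leaf == leaf
            col[mask] = rng.choice(pool, size=mask.sum())
        synth.append(col)
    return synth

# Toy integer-coded attributes; 'income' is loosely correlated with 'age'.
age = rng.integers(0, 3, size=200)
income = (age + rng.integers(0, 2, size=200)) % 3
syn_age, syn_income = synthesize([age, income])
```

The distance-based filtering step described in the abstract (removing synthetic records too close to real ones) would run after this sampling stage; it is omitted here.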

    On the use of economic price theory to determine the optimum levels of privacy and information utility in microdata anonymisation

    Statistical data, such as microdata, is used by different organisations as a basis for creating knowledge to assist in their planning and decision-making activities. However, before microdata can be made available for analysis, it needs to be anonymised in order to protect the privacy of the individuals whose data is released. The protection of privacy requires us to hide or obscure the released data. On the other hand, making data useful for its users implies that we should provide data that is accurate, complete and precise. Ideally, we should maximise both the level of privacy and the level of information utility of a released microdata set. However, as we increase the level of privacy, the level of information utility decreases. Without guidelines to guide the selection of the optimum levels of privacy and information utility, it is difficult to determine the optimum balance between the two goals. The objective and constraints of this optimisation problem can be captured naturally with concepts from Economic Price Theory. In this thesis, we present an approach based on Economic Price Theory for guiding the process of microdata anonymisation such that optimum levels of privacy and information utility are achieved.
    Thesis (PhD), University of Pretoria, 2010. Computer Science.
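The shape of the trade-off the thesis describes can be illustrated with hypothetical curves. These curves and the equal weighting are invented for illustration only and are not the thesis's actual Economic Price Theory formulation:

```python
import math

# Hypothetical curves (not the thesis's model): information utility falls
# and the privacy level rises as the generalisation level g increases.
levels = range(1, 11)
utility = {g: 100 - 2 * g ** 2 for g in levels}
privacy = {g: 20 * math.sqrt(g) for g in levels}

# One simple notion of the optimum: the generalisation level that
# maximises an equally weighted sum of the two goals.
w = 0.5
best = max(levels, key=lambda g: w * privacy[g] + (1 - w) * utility[g])
print(best)
```

With these curves the optimum is interior (g = 2) rather than a corner solution, mirroring the thesis's point that neither privacy nor utility should be maximised in isolation.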

    Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Improve the U.S. Statistical System?

    The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodological questions of interest and significance to the broader research community and to the Federal Statistical System (FSS), particularly to the Census Bureau. The activities to date have covered both fundamental and applied statistical research and have focused at least in part on the training of current and future generations of researchers in skills of relevance to surveys and alternative measurement of economic units, households, and persons. This article focuses on some of the key research findings of the eight nodes, organized into six topics: (1) improving census and survey data-quality and data collection methods; (2) using alternative sources of data; (3) protecting privacy and confidentiality by improving disclosure avoidance; (4) using spatial and spatio-temporal statistical modeling to improve estimates; (5) assessing data cost and data-quality tradeoffs; and (6) combining information from multiple sources. The article concludes with an evaluation of the ability of the FSS to apply the NCRN’s research outcomes, suggests some next steps, and discusses the implications of this research-network model for future federal government research initiatives.

    Generating tabular datasets under differential privacy

    Machine Learning (ML) is accelerating progress across fields and industries, but relies on accessible and high-quality training data. Some of the most important datasets are found in biomedical and financial domains in the form of spreadsheets and relational databases. But this tabular data is often sensitive in nature. Synthetic data generation offers the potential to unlock sensitive data, but generative models tend to memorise and regurgitate training data, which undermines the privacy goal. To remedy this, researchers have incorporated the mathematical framework of Differential Privacy (DP) into the training process of deep neural networks. But this creates a trade-off between the quality and privacy of the resulting data. Generative Adversarial Networks (GANs) are the dominant paradigm for synthesising tabular data under DP, but suffer from unstable adversarial training and mode collapse, which are exacerbated by the privacy constraints and challenging tabular data modality. This work optimises the quality-privacy trade-off of generative models, producing higher quality tabular datasets with the same privacy guarantees. We implement novel end-to-end models that leverage attention mechanisms to learn reversible tabular representations. We also introduce TableDiffusion, the first differentially-private diffusion model for tabular data synthesis. Our experiments show that TableDiffusion produces higher-fidelity synthetic datasets, avoids the mode collapse problem, and achieves state-of-the-art performance on privatised tabular data synthesis. By implementing TableDiffusion to predict the added noise, we enabled it to bypass the challenges of reconstructing mixed-type tabular data. Overall, the diffusion paradigm proves vastly more data- and privacy-efficient than the adversarial paradigm, due to augmented re-use of each data batch and a smoother iterative training process.
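The abstract mentions incorporating DP into neural-network training. The standard ingredient of DP-SGD, clipping per-example gradients and adding calibrated Gaussian noise, can be sketched as follows; this is the generic mechanism, not necessarily TableDiffusion's exact training procedure, and the gradients below are invented toy values:

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD aggregation step: clip each per-example gradient to
    `clip_norm`, sum, add Gaussian noise calibrated to the clip norm,
    then average over the batch."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]
print(dp_gradient(grads))  # noisy average of the clipped gradients
```

Clipping bounds each example's influence on the update, which is what lets the added noise translate into a formal (ε, δ) guarantee via DP accounting.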

    A systems approach to evaluate One Health initiatives

    Challenges calling for integrated approaches to health, such as the One Health (OH) approach, typically arise from the intertwined spheres of humans, animals, and ecosystems constituting their environment. Initiatives addressing such wicked problems commonly consist of complex structures and dynamics. As a result of the EU COST Action (TD 1404) “Network for Evaluation of One Health” (NEOH), we propose an evaluation framework anchored in systems theory to address the intrinsic complexity of OH initiatives and regard them as subsystems of the context within which they operate. Typically, they intend to influence a system with a view to improve human, animal, and environmental health. The NEOH evaluation framework consists of four overarching elements, namely: (1) the definition of the initiative and its context, (2) the description of the theory of change with an assessment of expected and unexpected outcomes, (3) the process evaluation of operational and supporting infrastructures (the “OH-ness”), and (4) an assessment of the association(s) between the process evaluation and the outcomes produced. It relies on a mixed methods approach by combining a descriptive and qualitative assessment with a semi-quantitative scoring for the evaluation of the degree and structural balance of “OH-ness” (summarised in an OH-index and OH-ratio, respectively) and conventional metrics for different outcomes in a multi-criteria-decision-analysis. Here, we focus on the methodology for Elements (1) and (3) including ready-to-use Microsoft Excel spreadsheets for the assessment of the “OH-ness”. We also provide an overview of Element (2), and refer to the NEOH handbook for further details, also regarding Element (4) (http://neoh.onehealthglobal.net). 
The presented approach helps researchers, practitioners, and evaluators to conceptualise and conduct evaluations of integrated approaches to health, and facilitates comparison and learning across different OH activities, thereby supporting decisions on resource allocation. The application of the framework has been described in eight case studies in the same Frontiers research topic and provides first data on the OH-index and OH-ratio, an important step towards their validation, towards the creation of a dataset for future benchmarking, and towards demonstrating under which circumstances OH initiatives provide added value compared to disciplinary or conventional health initiatives.

    De-identifying a public use microdata file from the Canadian national discharge abstract database

    Background: The Canadian Institute for Health Information (CIHI) collects hospital discharge abstract data (DAD) from Canadian provinces and territories. There are many demands for the disclosure of this data for research and analysis to inform policy making. To expedite the disclosure of data for some of these purposes, the construction of a DAD public use microdata file (PUMF) was considered. Such purposes include: confirming some published results, providing broader feedback to CIHI to improve data quality, training students and fellows, providing an easily accessible data set for researchers to prepare for analyses on the full DAD data set, and serving as a large health data set for computer scientists and statisticians to evaluate analysis and data mining techniques. The objective of this study was to measure the probability of re-identification for records in a PUMF, and to de-identify a national DAD PUMF consisting of 10% of records.
    Methods: Plausible attacks on a PUMF were evaluated. Based on these attacks, the 2008-2009 national DAD was de-identified. A new algorithm was developed to minimize the amount of suppression while maximizing the precision of the data. The acceptable threshold for the probability of correct re-identification of a record was set at between 0.04 and 0.05. Information loss was measured in terms of the extent of suppression and entropy.
    Results: Two different PUMF files were produced, one with geographic information, and one with no geographic information but more clinical information. At a threshold of 0.05, the maximum proportion of records with the diagnosis code suppressed was 20%, but these suppressions represented only 8-9% of all values in the DAD. Our suppression algorithm has less information loss than a more traditional approach to suppression. Smaller regions, patients with longer stays, and age groups that are infrequently admitted to hospitals tend to be the ones with the highest rates of suppression.
    Conclusions: The strategies we used to maximize data utility and minimize information loss can result in a PUMF that would be useful for the specific purposes noted earlier. However, to create a more detailed file with less information loss suitable for more complex health services research, the risk would need to be mitigated by requiring the data recipient to commit to a data sharing agreement.
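The 0.04-0.05 threshold corresponds to a simple risk model in which a record's re-identification probability is 1 divided by the size of its equivalence class on the quasi-identifiers. A minimal sketch of flagging risky records under that model, with invented field names and toy data (not the study's actual algorithm, which minimises suppression more cleverly), is:

```python
from collections import Counter

def flag_risky(records, quasi_ids, threshold=0.05):
    """Flag records whose re-identification probability, modelled as
    1 / (size of the record's equivalence class on the quasi-identifiers),
    exceeds `threshold`; flagged records are candidates for suppression."""
    key = lambda r: tuple(r[q] for q in quasi_ids)
    sizes = Counter(key(r) for r in records)
    return [sizes[key(r)] < 1.0 / threshold for r in records]

# Invented toy records: 25 patients share one quasi-identifier combination,
# 3 share another. At threshold 0.05, classes smaller than 20 are flagged.
records = [{"age": "60-69", "region": "A"}] * 25 + [{"age": "20-29", "region": "B"}] * 3
flags = flag_risky(records, ["age", "region"])
```

This also makes the reported pattern intuitive: smaller regions and rarely admitted age groups produce small equivalence classes, so they attract the highest suppression rates.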