Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods
This paper has been replaced with http://digitalcommons.ilr.cornell.edu/ldi/37.
We consider the problem of the public release of statistical information about a population, explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees using a Vickrey-Clarke-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner's problem using the technology set implied by (ε, δ)-differential privacy with (α, β)-accuracy for the Private Multiplicative Weights query release mechanism to study the properties of optimal provision of data accuracy and privacy loss when both are public goods. Using the production possibilities frontier implied by this technology, explicitly parameterized interdependent preferences, and the social welfare function, we display properties of the solution to the social planner's problem. Our results directly quantify the optimal choice of data accuracy and privacy loss as functions of the technology and preference parameters. Some of these properties can be quantified using population statistics on marginal preferences and correlations between income, data accuracy preferences, and privacy loss preferences that are available from survey data. Our results show that government data custodians should publish more accurate statistics with weaker privacy guarantees than would occur with purely private data publishing. Our statistical results using the General Social Survey and the Cornell National Social Survey indicate that the welfare losses from under-providing data accuracy while over-providing privacy protection can be substantial.
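For reference, the (ε, δ)-differential privacy guarantee invoked in this abstract is the standard one: a randomized mechanism M satisfies it if, for all databases D and D′ differing in one record and all sets S of outputs,

```latex
% Standard (epsilon, delta)-differential privacy guarantee.
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
```

Smaller ε and δ mean stronger privacy protection and, for a fixed mechanism, lower accuracy of the released statistics.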
An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices
Statistical agencies face a dual mandate to publish accurate statistics while protecting respondent privacy. Increasing privacy protection requires decreased accuracy. Recognizing this as a resource allocation problem, we propose an economic solution: operate where the marginal cost of increasing privacy equals the marginal benefit. Our model of production, from computer science, assumes data are published using an efficient differentially private algorithm. Optimal choice weighs the demand for accurate statistics against the demand for privacy. Examples from U.S. statistical programs show how our framework can guide decision-making. Further progress requires a better understanding of willingness-to-pay for privacy and statistical accuracy.
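A schematic version of the allocation rule described here, in our notation rather than the authors' exact formulation: with accuracy A(ε) increasing in the privacy-loss parameter ε, social benefit B from accuracy, and social cost C of privacy loss, the optimal ε* satisfies

```latex
% Schematic first-order condition: the marginal benefit of the accuracy
% bought by additional privacy loss equals its marginal privacy cost.
B'\!\big(A(\varepsilon^{*})\big)\,A'(\varepsilon^{*}) \;=\; C'(\varepsilon^{*})
```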
A Multifaceted Benchmarking of Synthetic Electronic Health Record Generation Models
Synthetic health data have the potential to mitigate privacy concerns when sharing data to support biomedical research and the development of innovative healthcare applications. Modern approaches for data generation based on machine learning, generative adversarial network (GAN) methods in particular, continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework for benchmarking methods as they emerge and determining which are most appropriate for which use cases. In this work, we introduce a generalizable benchmarking framework to appraise key characteristics of synthetic health data with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health record (EHR) data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff in sharing synthetic EHR data. The results further indicate that no method is unequivocally the best on all criteria in every use case, which makes it evident why synthetic data generation methods need to be assessed in context.
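A minimal sketch of the kind of two-axis evaluation such a framework formalizes. The metric choices here are ours, not the paper's: utility via train-on-synthetic/test-on-real (TSTR) classification performance, privacy via a naive nearest-neighbor distance proxy for memorization.

```python
# Sketch of a two-axis synthetic-data benchmark: utility (TSTR) vs. privacy
# (nearest-neighbor distance). Illustrative metrics only; the paper's
# framework covers many more criteria and real EHR features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

def tstr_utility(X_syn, y_syn, X_real, y_real):
    """Train on synthetic, test on real: higher AUC = more useful data."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_syn, y_syn)
    return roc_auc_score(y_real, clf.predict_proba(X_real)[:, 1])

def nn_privacy_proxy(X_syn, X_real_train):
    """Median distance from each synthetic record to its nearest real
    training record; small distances suggest potential memorization."""
    nn = NearestNeighbors(n_neighbors=1).fit(X_real_train)
    dists, _ = nn.kneighbors(X_syn)
    return float(np.median(dists))

# Toy usage with random data standing in for real and synthetic records.
rng = np.random.default_rng(0)
X_real, y_real = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)
X_syn, y_syn = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)
print("TSTR AUC:", tstr_utility(X_syn, y_syn, X_real, y_real))
print("NN distance (privacy proxy):", nn_privacy_proxy(X_syn, X_real))
```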
You are your Metadata: Identification and Obfuscation of Social Media Users using Metadata Information
Metadata are associated with most of the information we produce in our daily interactions and communication in the digital world. Yet, surprisingly, metadata are often still categorized as non-sensitive. Indeed, in the past, researchers and practitioners have mainly focused on the problem of identifying a user from the content of a message.
In this paper, we use Twitter as a case study to quantify the uniqueness of the association between metadata and user identity and to understand the effectiveness of potential obfuscation strategies. More specifically, we analyze atomic fields in the metadata and systematically combine them in an effort to classify new tweets as belonging to an account, using different machine learning algorithms of increasing complexity. We demonstrate that, through the application of a supervised learning algorithm, we are able to identify any user in a group of 10,000 with approximately 96.7% accuracy. Moreover, if we broaden the scope of our search and consider the 10 most likely candidates, we increase the accuracy of the model to 99.22%. We also find that data obfuscation is hard and ineffective for this type of data: even after perturbing 60% of the training data, it is still possible to classify users with an accuracy higher than 95%. These results have strong implications for the design of metadata obfuscation strategies, for example for data set release, not only for Twitter but, more generally, for most social media platforms.
Comment: 11 pages, 13 figures. Published in the Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM 2018). June 2018. Stanford, CA, USA.
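A minimal illustration of the identification task described above, with top-1 and top-10 accuracy in the paper's evaluation style. The metadata features and data here are synthetic placeholders, not the paper's Twitter fields or its 10,000-account setup.

```python
# Sketch of metadata-based user identification: classify which account a
# tweet came from using only metadata-style features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_users, tweets_per_user = 50, 40
# Hypothetical numeric metadata (e.g., follower count, account age,
# posting-hour statistics), drawn around per-user centroids.
centroids = rng.normal(size=(n_users, 8))
X = np.repeat(centroids, tweets_per_user, axis=0) + 0.3 * rng.normal(
    size=(n_users * tweets_per_user, 8))
y = np.repeat(np.arange(n_users), tweets_per_user)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Top-1 accuracy, and top-10 accuracy over the 10 most likely candidates.
proba = clf.predict_proba(X_te)
top1 = (proba.argmax(axis=1) == y_te).mean()
top10 = np.mean([yt in np.argsort(p)[-10:] for p, yt in zip(proba, y_te)])
print(f"top-1: {top1:.3f}  top-10: {top10:.3f}")
```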
Privacy-oriented manipulation of speaker representations
Speaker embeddings are ubiquitous, with applications ranging from speaker recognition and diarization to speech synthesis and voice anonymisation. The amount of information held by these embeddings lends them versatility, but also raises privacy concerns. Speaker embeddings have been shown to contain information on age, sex, health and more, which speakers may want to keep private, especially when this information is not required for the target task. In this work, we propose a method for removing and manipulating private attributes from speaker embeddings that leverages a Vector-Quantized Variational Autoencoder architecture, combined with an adversarial classifier and a novel mutual information loss. We validate our model on two attributes, sex and age, and perform experiments with ignorant and fully-informed attackers, and with in-domain and out-of-domain data.
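A generic sketch of adversarial attribute removal from embeddings via gradient reversal. This is a simplified stand-in for the paper's VQ-VAE with adversarial classifier and mutual-information loss; all module sizes and the training data are illustrative.

```python
# Generic adversarial attribute removal: an encoder re-embeds speaker
# vectors while a gradient-reversed adversary tries to predict the private
# attribute (e.g., sex). Not the paper's architecture, just the core idea.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Flip the gradient flowing into the encoder so it learns to
        # *hide* the attribute the adversary is trying to predict.
        return -ctx.lam * grad_out, None

emb_dim, hid = 192, 128  # illustrative sizes
encoder = nn.Sequential(nn.Linear(emb_dim, hid), nn.ReLU(), nn.Linear(hid, emb_dim))
decoder = nn.Linear(emb_dim, emb_dim)  # reconstruction head preserves utility
adversary = nn.Sequential(nn.Linear(emb_dim, hid), nn.ReLU(), nn.Linear(hid, 2))

opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters())
    + list(adversary.parameters()), lr=1e-3)

x = torch.randn(64, emb_dim)        # batch of speaker embeddings (toy data)
attr = torch.randint(0, 2, (64,))   # private binary attribute labels (toy data)

for _ in range(100):
    z = encoder(x)
    recon_loss = nn.functional.mse_loss(decoder(z), x)   # keep the embedding useful
    adv_logits = adversary(GradReverse.apply(z, 1.0))    # reversed gradients to encoder
    adv_loss = nn.functional.cross_entropy(adv_logits, attr)
    loss = recon_loss + adv_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```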