Quantifying Surveillance in the Networked Age: Node-based Intrusions and Group Privacy
From the "right to be left alone" to the "right to selective disclosure",
privacy has long been thought of as the control individuals have over the
information they share and reveal about themselves. However, in a world that is
more connected than ever, the choices of the people we interact with
increasingly affect our privacy. This forces us to rethink our definition of
privacy. We here formalize and study, as local and global node- and
edge-observability, Bloustein's concept of group privacy. We prove
edge-observability to be independent of the graph structure, while
node-observability depends only on the degree distribution of the graph. We
show on synthetic datasets that, for attacks spanning several hops such as
those implemented by social networks and current US laws, the presence of hubs
increases node-observability while a high clustering coefficient decreases it,
at fixed density. We then study the edge-observability of a large real-world
mobile phone dataset over a month and show that, even under the restricted
two-hops rule, compromising as little as 1% of the nodes leads to observing up
to 46% of all communications in the network. More worrisome, we also show that
on average 36% of each person's communications would be locally
edge-observable under the same rule. Finally, we use real sensing data to show
how people living in cities are vulnerable to distributed node-observability
attacks. Using a smartphone app to compromise 1% of the population, an
attacker could monitor the location of more than half of London's population.
Taken together, our results show that the current individual-centric approach
to privacy and data protection does not encompass the realities of modern life.
This makes us---as a society---vulnerable to large-scale surveillance attacks
against which we need to develop protections.
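The edge-observability measurements described above can be illustrated with a toy simulation. The graph model, parameters, and the reading of the two-hop rule below are simplifying assumptions for illustration, not the paper's actual model or data:

```python
import random

def make_random_graph(n, p, rng):
    """Erdos-Renyi-style random graph as an adjacency set per node."""
    adj = {i: set() for i in range(n)}
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
                edges.append((i, j))
    return adj, edges

def observed_fraction(adj, edges, compromised):
    """Fraction of edges (communications) with an endpoint that is either
    compromised or a direct contact of a compromised node -- a toy reading
    of a two-hop observation rule."""
    watched = set(compromised)
    for c in compromised:
        watched |= adj[c]
    seen = sum(1 for (i, j) in edges if i in watched or j in watched)
    return seen / len(edges)

rng = random.Random(0)
adj, edges = make_random_graph(500, 0.02, rng)
compromised = rng.sample(range(500), 5)  # compromise 1% of the nodes
frac = observed_fraction(adj, edges, compromised)
```

Even in this crude model, a handful of compromised nodes watches a disproportionate share of all edges, which is the qualitative effect the abstract reports on real mobile phone data.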
MM: A general method to perform various data analysis tasks from a differentially private sketch
Differential privacy is the standard privacy definition for performing
analyses over sensitive data. Yet, its privacy budget bounds the number of
tasks an analyst can perform with reasonable accuracy, which makes it
challenging to deploy in practice. This can be alleviated by private sketching,
where the dataset is compressed into a single noisy sketch vector which can be
shared with the analysts and used to perform arbitrarily many analyses.
However, the algorithms to perform specific tasks from sketches must be
developed on a case-by-case basis, which is a major impediment to their use. In
this paper, we introduce the generic moment-to-moment (MM) method to
perform a wide range of data exploration tasks from a single private sketch.
Among other things, this method can be used to estimate empirical moments of
attributes, the covariance matrix, counting queries (including histograms), and
regression models. Our method treats the sketching mechanism as a black-box
operation, and can thus be applied to a wide variety of sketches from the
literature, widening their ranges of applications without further engineering
or privacy loss, and removing some of the technical barriers to the wider
adoption of sketches for data exploration under differential privacy. We
validate our method with data exploration tasks on artificial and real-world
data, and show that it can be used to reliably estimate statistics and train
classification models from private sketches.
Comment: Published at the 18th International Workshop on Security and Trust Management (STM 2022).
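As a minimal illustration of the private-sketching idea (not the MM method itself, which treats the sketching mechanism as a black box; the noise scale below is a hypothetical placeholder, not calibrated to a privacy budget): the data holder releases noisy first and second moments once, and an analyst estimates the mean and covariance from that sketch alone.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=(10_000, 3))  # sensitive dataset

# Toy "private sketch": noisy first and second empirical moments.
sigma = 0.01  # illustrative noise scale only
sketch_mean = X.mean(axis=0) + rng.normal(0, sigma, size=3)
sketch_second = (X.T @ X) / len(X) + rng.normal(0, sigma, size=(3, 3))

# Analyst side: estimation from the sketch only, with no access to X.
est_mean = sketch_mean
est_cov = sketch_second - np.outer(sketch_mean, sketch_mean)
```

The same released vector supports many downstream analyses, which is the appeal of sketching over per-query noise addition.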
Synthetic Data -- what, why and how?
This explainer document aims to provide an overview of the current state of
the rapidly expanding work on synthetic data technologies, with a particular
focus on privacy. The article is intended for a non-technical audience, though
some formal definitions have been given to provide clarity to specialists. This
article is intended to enable the reader to quickly become familiar with the
notion of synthetic data, as well as understand some of the subtle intricacies
that come with it. We do believe that synthetic data is a very useful tool, and
our hope is that this report highlights that, while drawing attention to
nuances that can easily be overlooked in its deployment.
Comment: Commissioned by the Royal Society. 57 pages, 2 figures.
A Framework for Auditable Synthetic Data Generation
Synthetic data has gained significant momentum thanks to sophisticated
machine learning tools that enable the synthesis of high-dimensional datasets.
However, many generation techniques do not give the data controller control
over what statistical patterns are captured, leading to concerns over privacy
protection. While synthetic records are not linked to a particular real-world
individual, they can reveal information about users indirectly which may be
unacceptable for data owners. There is thus a need to empirically verify the
privacy of synthetic data -- a particularly challenging task in
high-dimensional data. In this paper we present a general framework for
synthetic data generation that gives data controllers full control over which
statistical properties the synthetic data ought to preserve, what exact
information loss is acceptable, and how to quantify it. The benefits of the
approach are that (1) one can generate synthetic data with high utility for a
given task, while (2) empirically validating that only statistics considered
safe by the data curator are used to generate the data. We thus show the
potential for synthetic data to be an effective means of releasing
confidential data safely, while retaining useful information for analysts.
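One way to make the "controller chooses the statistics" idea concrete is the following minimal sketch (an illustration under simplifying assumptions, not the paper's framework): records are generated from an approved list of one-way marginals only, so by construction no other pattern of the real data reaches the synthetic data.

```python
import random
from collections import Counter

# Toy real dataset: (gender, diagnosis) records.
real = [("F", "yes"), ("F", "no"), ("M", "yes"), ("M", "yes"), ("F", "yes")]
rng = random.Random(42)

# The data controller approves only these statistics: one marginal per column.
marginals = [Counter(column) for column in zip(*real)]

def sample_record(marginals, rng):
    """Sample each attribute independently from its approved marginal."""
    rec = []
    for m in marginals:
        values, weights = zip(*m.items())
        rec.append(rng.choices(values, weights=weights)[0])
    return tuple(rec)

synthetic = [sample_record(marginals, rng) for _ in range(1000)]
```

Correlations between columns are deliberately lost here; richer approved statistics (e.g. two-way marginals) trade more utility for more exposure, which is exactly the knob such a framework makes explicit.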
Compressive Learning with Privacy Guarantees
This work addresses the problem of learning from large collections of data with privacy guarantees. The compressive learning framework proposes to deal with the large scale of datasets by compressing them into a single vector of generalized random moments, from which the learning task is then performed. We show that a simple perturbation of this mechanism with additive noise is sufficient to satisfy differential privacy, a well-established formalism for defining and quantifying the privacy of a random mechanism. We combine this with a feature subsampling mechanism, which reduces the computational cost without damaging privacy. The framework is applied to the tasks of Gaussian modeling, k-means clustering and principal component analysis (PCA), for which sharp privacy bounds are derived. Empirically, the quality (for subsequent learning) of the compressed representation produced by our mechanism is strongly related to the induced noise level, for which we give analytical expressions.
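A minimal sketch of the mechanism described above, under assumptions: random Fourier features stand in for the generalized random moments, and the noise scale is an uncalibrated placeholder (a real deployment would derive it from the sensitivity of the sketch and the target privacy bounds).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 2))  # dataset to compress

# Sketch: average of random Fourier features over all records.
m = 64                            # sketch size
Omega = rng.normal(size=(2, m))   # random frequencies
phi = np.exp(1j * (X @ Omega))    # one unit-modulus feature vector per record
sketch = phi.mean(axis=0)

# Additive-noise perturbation of the sketch (illustrative scale only).
sigma = 0.05
noisy_sketch = sketch + rng.normal(0, sigma, m) + 1j * rng.normal(0, sigma, m)
```

All subsequent learning (Gaussian modeling, k-means, PCA) would operate on `noisy_sketch` alone, so the privacy cost is paid once regardless of how many analyses follow.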
Theoretical models for web search privacy through query obfuscation
With the emergence of the Big Data era, privacy has become an increasingly important issue. The constant and ubiquitous logging of personal and professional data raises concerns, as this data is used for commercial, political or judicial purposes, with little to no regard for the users' privacy. In particular, Web search, the activity through which users of search engines access information on the Internet from their search queries, has recently come to light as an area where privacy is both paramount and, as of now, unachievable. Indeed, Web search data is seen as extremely intimate, as it may contain commercial, financial and medical information, yet very few solutions exist to protect its privacy. A promising solution that has been proposed is query obfuscation, where a program on the user's computer sends many artificial queries in the hope of drowning the user's queries in noise. This approach is valuable, as it makes the user solely responsible for her own privacy, and additionally ensures protection against an eavesdropper. However, no obfuscator developed so far has been proven to address the privacy issues of Web search data in a meaningful way, and existing implementations have been shown to be either unusable or useless in practice. Assessing whether efficient and effective obfuscators can be designed is a crucial question for the future of Web search privacy.
In this master thesis, we propose a novel framework for the analysis and design of query obfuscators. Our contributions are fourfold. Firstly, we analyze the literature and discuss the user's needs to define design principles for obfuscators. Secondly, we define three novel privacy notions that answer these needs. Thirdly, we introduce a new model for practical obfuscators that implements the principles discussed. Fourthly, we build on this model and these notions to discuss the feasibility of query obfuscation for Web search.
Our conclusion is that query obfuscation is not a suitable solution for Web search privacy, but it is nonetheless a surprisingly valuable tool: while powerful, the technique is inadequate for Web search due to the sheer volume of data involved. We argue that the rigorous analysis proposed in this master thesis serves as a strong, and arguably the first, basis for the study of obfuscators as a solution to privacy issues in other domains, such as the privacy of patent search.
Master [120]: civil engineering in applied mathematics (ingénieur civil en mathématiques appliquées), Université catholique de Louvain, 201
Web Privacy: A Formal Adversarial Model for Query Obfuscation
The queries we perform, the searches we make, and the websites we visit — this sensitive data is collected at scale by companies as part of the services they provide. Query obfuscation, intertwining the user's queries with artificial queries, has been proposed as a solution to protect the privacy of individuals on the web. We here present a formal model and formulate, through attack models, three privacy requirements for obfuscators: 1) indistinguishability, that the user query should be hard to identify; 2) coverage, that its topic should be hard to identify; and 3) imprecision, that the query should still be hard to identify for an attacker with additional auxiliary information. The latter is needed to make the former two guarantees “future-proof”. Using our framework, we derive two important results for obfuscators. First, we show that indistinguishability imposes strong bounds on the coverage and imprecision achievable by an obfuscator. Second, we prove an important tradeoff between coverage and imprecision, which inherently limits the strength and robustness of the privacy guarantees that an obfuscator can provide. We then introduce a family of obfuscators with provable indistinguishability guarantees, which we call ball obfuscators, and show, for a range of parameter values, the achievable coverage and imprecision. We show empirically that our theoretical tradeoff holds, and that its bound is not tight in practice: even in a simple idealized setting, there is a significant gap between the practical coverage and imprecision guarantees and the optimal bounds. While obfuscators have proven popular with the general public, all obfuscators currently available provide ad-hoc guarantees and have been shown to be vulnerable to attacks, putting the data of users at risk. We hope this work is a first step towards a robust evaluation of the properties of query obfuscators and the development of principled obfuscators.
When the Signal is in the Noise: Exploiting Diffix's Sticky Noise
Anonymized data is highly valuable to both businesses and researchers. A large body of research has however shown the strong limits of the de-identification release-and-forget model, where data is anonymized and shared. This has led to the development of privacy-preserving query-based systems. Based on the idea of "sticky noise", Diffix has recently been proposed as a novel query-based mechanism that, on its own, satisfies the EU Article 29 Working Party's definition of anonymization. According to its authors, Diffix adds less noise to answers than solutions based on differential privacy while allowing for an unlimited number of queries. This paper presents a new class of noise-exploitation attacks, exploiting the noise added by the system to infer private information about individuals in the dataset. Our first differential attack uses samples extracted from Diffix in a likelihood ratio test to discriminate between two probability distributions. We show that using this attack against a synthetic best-case dataset allows us to infer private information with 89.4% accuracy using only 5 attributes. Our second cloning attack uses dummy conditions that strongly affect the output of the query depending on the value of the private attribute. Using this attack on four real-world datasets, we show that we can infer private attributes of at least 93% of the users in the dataset with accuracy between 93.3% and 97.1%, issuing a median of 304 queries per user. We show how to optimize this attack, targeting 55.4% of the users and achieving 91.7% accuracy, using a maximum of only 32 queries per user. Our attacks demonstrate that adding data-dependent noise, as done by Diffix, is not sufficient to prevent inference of private attributes. We furthermore argue that Diffix alone fails to satisfy Art. 29 WP's definition of anonymization.
We conclude by discussing how non-provable privacy-preserving systems can be combined with fundamental security principles such as defense-in-depth and auditability to build practically useful anonymization systems without relying on differential privacy.
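The statistical core of the differential attack, a likelihood ratio test between two hypothesized distributions of noisy answers, can be sketched generically. The Gaussian noise model and all parameters below are illustrative assumptions, not Diffix's actual noise:

```python
import math
import random

def log_likelihood(samples, mu, sigma):
    """Gaussian log-likelihood of the samples under mean mu, std sigma."""
    return sum(-0.5 * ((x - mu) / sigma) ** 2
               - math.log(sigma * math.sqrt(2 * math.pi))
               for x in samples)

def lr_test(samples, mu0, mu1, sigma):
    """Return 1 if hypothesis H1 (mean mu1) is more likely than H0 (mean mu0)."""
    return int(log_likelihood(samples, mu1, sigma)
               > log_likelihood(samples, mu0, sigma))

rng = random.Random(0)
sigma = 1.0
# Noisy answers drawn under the "target has the attribute" hypothesis (mean 1):
samples = [rng.gauss(1.0, sigma) for _ in range(50)]
decision = lr_test(samples, mu0=0.0, mu1=1.0, sigma=sigma)
```

With enough samples, even noise that is individually uninformative lets the test separate the two hypotheses reliably, which is why repeated querying defeats data-dependent noise.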
TAPAS: a toolbox for adversarial privacy auditing of synthetic data
Personal data collected at scale promises to improve decision-making and accelerate innovation. However, sharing and using such data raises serious privacy concerns. A promising solution is to produce synthetic data: artificial records to share instead of real data. Since synthetic records are not linked to real persons, this intuitively prevents classical re-identification attacks. However, this is insufficient to protect privacy. We here present TAPAS, a toolbox of attacks to evaluate synthetic data privacy under a wide range of scenarios. These attacks include generalizations of prior works and novel attacks. We also introduce a general framework for reasoning about privacy threats to synthetic data and showcase TAPAS on several examples.
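As a toy example of the kind of test such a toolbox automates (a generic nearest-neighbour closeness check with a hypothetical threshold, not one of the toolbox's actual attacks): flag synthetic records that sit suspiciously close to some real record, since near-copies can leak information about the individuals behind them.

```python
def nn_distance(record, dataset):
    """Euclidean distance from a record to its nearest neighbour in a dataset."""
    return min(sum((a - b) ** 2 for a, b in zip(record, other)) ** 0.5
               for other in dataset)

# Toy numeric records (hypothetical data for illustration).
real = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
synthetic = [(0.01, 0.0), (3.0, 4.0)]  # the first nearly copies a real record

threshold = 0.1  # hypothetical closeness threshold
leaky = [s for s in synthetic if nn_distance(s, real) < threshold]
```

A single fixed check like this is easy to game, which is the argument for a toolbox of many attacks evaluated under explicit threat models.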