Measuring Membership Privacy on Aggregate Location Time-Series
While location data is extremely valuable for various applications,
disclosing it prompts serious threats to individuals' privacy. To limit such
concerns, organizations often provide analysts with aggregate time-series that
indicate, e.g., how many people are in a location at a time interval, rather
than raw individual traces. In this paper, we perform a measurement study to
understand Membership Inference Attacks (MIAs) on aggregate location
time-series, where an adversary tries to infer whether a specific user
contributed to the aggregates.
We find that the volume of contributed data, as well as the regularity and
particularity of users' mobility patterns, play a crucial role in the attack's
success. We experiment with a wide range of defenses based on generalization,
hiding, and perturbation, and evaluate their ability to thwart the attack
vis-a-vis the utility loss they introduce for various mobility analytics tasks.
Our results show that some defenses fail across the board, while others work
for specific tasks on aggregate location time-series. For instance, suppressing
small counts can be used for ranking hotspots, data generalization for
forecasting traffic, hotspot discovery, and map inference, while sampling is
effective for location labeling and anomaly detection when the dataset is
sparse. Differentially private techniques provide reasonable accuracy only in
very specific settings, e.g., discovering hotspots and forecasting their
traffic, and more so when using weaker privacy notions like crowd-blending
privacy. Overall, our measurements show that there does not exist a unique
generic defense that can preserve the utility of the analytics for arbitrary
applications, and provide useful insights regarding the disclosure of sanitized
aggregate location time-series.
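As a minimal sketch of the setting in the abstract above, the following Python builds an aggregate location time-series (how many users are in a location at a time interval) and applies one hiding defense, suppressing small counts, before ranking hotspots. The function names, the record layout `(user, location, time_slot)`, the toy traces, and the threshold are all illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def aggregate_counts(traces):
    """Count distinct users per (location, time_slot) cell."""
    users_per_cell = defaultdict(set)
    for user, location, slot in traces:
        users_per_cell[(location, slot)].add(user)
    return {cell: len(users) for cell, users in users_per_cell.items()}

def suppress_small_counts(counts, threshold):
    """Hiding defense (assumed form): drop cells whose count is below the threshold."""
    return {cell: c for cell, c in counts.items() if c >= threshold}

def rank_hotspots(counts):
    """Rank locations by total count over all time slots, busiest first."""
    totals = defaultdict(int)
    for (location, _slot), c in counts.items():
        totals[location] += c
    return sorted(totals, key=lambda loc: (-totals[loc], loc))

# Hypothetical toy traces: (user, location, time_slot)
traces = [
    ("u1", "station", 0), ("u2", "station", 0), ("u3", "station", 0),
    ("u1", "park", 1), ("u2", "park", 1),
    ("u4", "alley", 1),
]
counts = aggregate_counts(traces)
released = suppress_small_counts(counts, threshold=2)
ranking = rank_hotspots(released)
```

In this toy case the single-user cell is withheld while the order of the two busiest locations is unchanged, which is the kind of task-specific utility the abstract refers to.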
Stealing Links from Graph Neural Networks
Graph data, such as chemical networks and social networks, may be deemed
confidential/private because the data owner often spends lots of resources
collecting the data or the data contains sensitive information, e.g., social
relationships. Recently, neural networks were extended to graph data, which are
known as graph neural networks (GNNs). Due to their superior performance, GNNs
have many applications, such as healthcare analytics, recommender systems, and
fraud detection. In this work, we propose the first attacks to steal a graph
from the outputs of a GNN model that is trained on the graph. Specifically,
given a black-box access to a GNN model, our attacks can infer whether there
exists a link between any pair of nodes in the graph used to train the model.
We call our attacks link stealing attacks. We propose a threat model to
systematically characterize an adversary's background knowledge along three
dimensions which in total leads to a comprehensive taxonomy of 8 different link
stealing attacks. We propose multiple novel methods to realize these 8 attacks.
Extensive experiments on 8 real-world datasets show that our attacks are
effective at stealing links, e.g., AUC (area under the ROC curve) is above 0.95
in multiple cases. Our results indicate that the outputs of a GNN model reveal
rich information about the structure of the graph used to train the model.
Comment: To appear in the 30th USENIX Security Symposium, August 2021,
Vancouver, B.C., Canada
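To make the AUC figure in the abstract above concrete, here is a small Python sketch that scores node pairs and computes the area under the ROC curve. The scoring rule (cosine similarity between the posterior vectors a black-box model returns for two nodes) is an assumption chosen for illustration, and the posteriors and labels are made-up toy values, not data from the paper.

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two posterior vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def auc(scores, labels):
    """AUC as the probability that a random positive pair outscores a
    random negative pair; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical posteriors returned by a black-box model for four nodes
posteriors = {
    "a": [0.90, 0.05, 0.05],
    "b": [0.80, 0.10, 0.10],
    "c": [0.10, 0.80, 0.10],
    "d": [0.05, 0.90, 0.05],
}
# (node, node, 1 if a link exists else 0); toy ground truth
pairs = [("a", "b", 1), ("c", "d", 1), ("a", "c", 0), ("b", "d", 0)]
scores = [cosine_similarity(posteriors[u], posteriors[v]) for u, v, _ in pairs]
labels = [y for _, _, y in pairs]
attack_auc = auc(scores, labels)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of linked pairs above unlinked ones.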
When the signal is in the noise: Exploiting Diffix's Sticky Noise
Anonymized data is highly valuable to both businesses and researchers. A
large body of research has however shown the strong limits of the
de-identification release-and-forget model, where data is anonymized and
shared. This has led to the development of privacy-preserving query-based
systems. Based on the idea of "sticky noise", Diffix was recently proposed
as a novel query-based mechanism that alone satisfies the EU Article 29 Working
Party's definition of anonymization. According to its authors, Diffix adds less
noise to answers than solutions based on differential privacy while allowing
for an unlimited number of queries.
This paper presents a new class of noise-exploitation attacks, exploiting the
noise added by the system to infer private information about individuals in the
dataset. Our first differential attack uses samples extracted from Diffix in a
likelihood ratio test to discriminate between two probability distributions. We
show that using this attack against a synthetic best-case dataset allows us to
infer private information with 89.4% accuracy using only 5 attributes. Our
second cloning attack uses dummy conditions that strongly affect
the output of the query depending on the value of the private attribute. Using
this attack on four real-world datasets, we show that we can infer private
attributes of at least 93% of the users in the dataset with accuracy between
93.3% and 97.1%, issuing a median of 304 queries per user. We show how to
optimize this attack, targeting 55.4% of the users and achieving 91.7%
accuracy, using a maximum of only 32 queries per user.
Our attacks demonstrate that adding data-dependent noise, as done by Diffix,
is not sufficient to prevent inference of private attributes. We furthermore
argue that Diffix alone fails to satisfy Art. 29 WP's definition of
anonymization. [...
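The likelihood ratio test mentioned in the abstract above can be illustrated with a generic Python sketch. It assumes, purely for illustration, that noisy samples are Gaussian with a known standard deviation and that the two hypotheses differ only in their mean; the sample values and parameter names are hypothetical and do not describe Diffix's actual noise model.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log-density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def log_likelihood_ratio(samples, mean0, mean1, std):
    """Sum of log p(x | H1) - log p(x | H0) over independent samples."""
    return sum(
        gaussian_log_pdf(x, mean1, std) - gaussian_log_pdf(x, mean0, std)
        for x in samples
    )

def decide(samples, mean0, mean1, std):
    """Return 1 if H1 is more likely than H0 under equal priors, else 0."""
    return 1 if log_likelihood_ratio(samples, mean0, mean1, std) > 0 else 0

# Hypothetical noisy observations; H0: mean 0, H1: mean 1
samples = [0.8, 1.3, 0.4, 1.1, 0.9]
llr = log_likelihood_ratio(samples, mean0=0.0, mean1=1.0, std=1.0)
guess = decide(samples, mean0=0.0, mean1=1.0, std=1.0)
```

With unit variance each sample contributes `x - 0.5` to the log ratio, so more samples sharpen the decision even when each one is individually noisy.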
Survey: Leakage and Privacy at Inference Time
Leakage of data from publicly available Machine Learning (ML) models is an
area of growing significance as commercial and government applications of ML
can draw on multiple sources of data, potentially including users' and clients'
sensitive data. We provide a comprehensive survey of contemporary advances on
several fronts, covering involuntary data leakage which is natural to ML
models, potential malevolent leakage which is caused by privacy attacks, and
currently available defence mechanisms. We focus on inference-time leakage, as
the most likely scenario for publicly available models. We first discuss what
leakage is in the context of different data, tasks, and model architectures. We
then propose a taxonomy covering involuntary and malevolent leakage and available
defences, followed by the currently available assessment metrics and
applications. We conclude with outstanding challenges and open questions,
outlining some promising directions for future research.
Updates-Leak: Data Set Inference and Reconstruction Attacks in Online Learning
Machine learning (ML) has progressed rapidly during the past decade, and the major factor driving this development is unprecedented large-scale data. As data generation is a continuous process, ML service providers update their models frequently with newly collected data in an online learning scenario. In consequence, if an ML model is queried with the same set of data samples at two different points in time, it will provide different results. In this paper, we investigate whether the change in the output of a black-box ML model before and after being updated can leak information about the dataset used to perform the update. This constitutes a new attack surface against black-box ML models, and such information leakage severely damages the intellectual property and data privacy of the ML model owner/provider. In contrast to membership inference attacks, we use an encoder-decoder formulation that allows inferring diverse information ranging from detailed characteristics to full reconstruction of the dataset. Our new attacks are facilitated by state-of-the-art deep learning techniques. In particular, we propose a hybrid generative model (BM-GAN) that is based on generative adversarial networks (GANs) but includes a reconstructive loss that allows generating accurate samples. Our experiments show effective prediction of dataset characteristics and even full reconstruction in challenging conditions.
Pool inference attacks on local differential privacy: quantifying the privacy guarantees of Apple's Count Mean Sketch in practice
Behavioral data generated by users’ devices, ranging from emoji use to pages visited, are collected at scale to improve apps and services. These data, however, contain fine-grained records and can reveal sensitive information about individual users. Local differential privacy has been used by companies as a solution to collect data from users while preserving privacy. We here first introduce pool inference attacks, where an adversary has access to a user’s obfuscated data, defines pools of objects, and exploits the user’s polarized behavior in multiple data collections to infer the user’s preferred pool. Second, we instantiate this attack against Count Mean Sketch, a local differential privacy mechanism proposed by Apple and deployed in iOS and Mac OS devices, using a Bayesian model. Using Apple’s parameters for the privacy loss ε, we then consider two specific attacks: one in the emojis setting, where an adversary aims at inferring a user’s preferred skin tone for emojis, and one against visited websites, where an adversary wants to learn the political orientation of a user from the news websites they visit. In both cases, we show the attack to be much more effective than a random guess when the adversary collects enough data. We find that users with high polarization and relevant interest are significantly more vulnerable, and we show that our attack is well-calibrated, allowing the adversary to target such vulnerable users. We finally validate our results for the emojis setting using user data from Twitter. Taken together, our results show that pool inference attacks are a concern for data protected by local differential privacy mechanisms with a large ε, emphasizing the need for additional technical safeguards and the need for more research on how to apply local differential privacy for multiple collections.
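The idea of a pool inference attack described above can be sketched in Python under strong simplifications. This sketch swaps Count Mean Sketch for plain k-ary randomized response as a stand-in local differential privacy mechanism, and it assumes a simple user model in which a user picks an item from their preferred pool with probability `polarization`. The pools, reports, ε value, and all function names are hypothetical; none of this reproduces the paper's Bayesian model or Apple's parameters.

```python
import math

def report_prob(report, true_item, epsilon, k):
    """k-ary randomized response: keep the true item with probability
    e^eps / (e^eps + k - 1), otherwise report any other given item."""
    e = math.exp(epsilon)
    return e / (e + k - 1) if report == true_item else 1 / (e + k - 1)

def item_prob(item, pool, pools, polarization):
    """Assumed user model: choose inside the preferred pool with probability
    `polarization` (uniformly), otherwise uniformly among all other items.
    Pools are assumed disjoint, with at least two pools."""
    inside = pools[pool]
    outside = [x for name, items in pools.items() if name != pool for x in items]
    if item in inside:
        return polarization / len(inside)
    return (1 - polarization) / len(outside)

def pool_posterior(reports, pools, epsilon, polarization):
    """Posterior over the preferred pool given obfuscated reports, uniform prior."""
    items = [x for its in pools.values() for x in its]
    k = len(items)
    log_like = {}
    for pool in pools:
        log_like[pool] = sum(
            math.log(sum(item_prob(x, pool, pools, polarization)
                         * report_prob(r, x, epsilon, k) for x in items))
            for r in reports
        )
    top = max(log_like.values())  # subtract the max for numerical stability
    weights = {p: math.exp(v - top) for p, v in log_like.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

# Hypothetical pools of objects and obfuscated reports from several collections
pools = {"left": ["l1", "l2"], "right": ["r1", "r2"]}
reports = ["l1", "l2", "l1", "r1", "l2", "l1"]
posterior = pool_posterior(reports, pools, epsilon=4.0, polarization=0.9)
```

Because each collection multiplies in another likelihood term, the posterior concentrates as more reports are observed, which matches the abstract's point that the risk grows with multiple collections and a large ε.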