The performance of modularity maximization in practical contexts
Although widely used in practice, the behavior and accuracy of the popular
module identification technique called modularity maximization are not well
understood in practical contexts. Here, we present a broad characterization of
its performance in such situations. First, we revisit and clarify the
resolution limit phenomenon for modularity maximization. Second, we show that
the modularity function Q exhibits extreme degeneracies: it typically admits an
exponential number of distinct high-scoring solutions and typically lacks a
clear global maximum. Third, we derive the limiting behavior of the maximum
modularity Q_max for one model of infinitely modular networks, showing that it
depends strongly both on the size of the network and on the number of modules
it contains. Finally, using three real-world metabolic networks as examples, we
show that the degenerate solutions can fundamentally disagree on many, but not
all, partition properties such as the composition of the largest modules and
the distribution of module sizes. These results imply that the output of any
modularity maximization procedure should be interpreted cautiously in
scientific contexts. They also explain why many heuristics are often successful
at finding high-scoring partitions in practice and why different heuristics can
disagree on the modular structure of the same network. We conclude by
discussing avenues for mitigating some of these behaviors, such as combining
information from many degenerate solutions or using generative models.
Comment: 20 pages, 14 figures, 6 appendices; code available at
http://www.santafe.edu/~aaronc/modularity
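For readers unfamiliar with the modularity function Q discussed above, the following is a minimal sketch of how Q can be evaluated for a given partition. It assumes the standard Newman-Girvan definition for an undirected, unweighted graph, which the abstract does not spell out; the function name and the toy graph are illustrative and are not taken from the authors' code.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Q = sum over modules c of [ e_c / m - (d_c / (2m))**2 ].

    edges: list of undirected (u, v) pairs; membership: node -> module label.
    Assumes the standard Newman-Girvan definition (an assumption of this sketch).
    """
    m = len(edges)
    internal = defaultdict(int)  # e_c: edges with both endpoints in module c
    degree = defaultdict(int)    # d_c: summed degree of the nodes in module c
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy graph (made up): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {n: "all" for n in range(6)}

q_two = modularity(edges, two_modules)  # one module per triangle
q_one = modularity(edges, one_module)   # everything in a single module
print(q_two, q_one)
```

On this toy graph the two-triangle partition scores higher than the trivial one-module partition. A maximization heuristic searches over such partitions, which is where the degeneracy described in the abstract arises.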
When the signal is in the noise: Exploiting Diffix's Sticky Noise
Anonymized data is highly valuable to both businesses and researchers. A
large body of research has however shown the strong limits of the
de-identification release-and-forget model, where data is anonymized and
shared. This has led to the development of privacy-preserving query-based
systems. Based on the idea of "sticky noise", Diffix has been recently proposed
as a novel query-based mechanism that alone satisfies the EU Article 29 Working
Party's definition of anonymization. According to its authors, Diffix adds less
noise to answers than solutions based on differential privacy while allowing
for an unlimited number of queries.
This paper presents a new class of noise-exploitation attacks, exploiting the
noise added by the system to infer private information about individuals in the
dataset. Our first differential attack uses samples extracted from Diffix in a
likelihood ratio test to discriminate between two probability distributions. We
show that using this attack against a synthetic best-case dataset allows us to
infer private information with 89.4% accuracy using only 5 attributes. Our
second cloning attack uses dummy conditions that conditionally strongly affect
the output of the query depending on the value of the private attribute. Using
this attack on four real-world datasets, we show that we can infer private
attributes of at least 93% of the users in the dataset with accuracy between
93.3% and 97.1%, issuing a median of 304 queries per user. We show how to
optimize this attack, targeting 55.4% of the users and achieving 91.7%
accuracy, using a maximum of only 32 queries per user.
Our attacks demonstrate that adding data-dependent noise, as done by Diffix,
is not sufficient to prevent inference of private attributes. We furthermore
argue that Diffix alone fails to satisfy Art. 29 WP's definition of
anonymization. [...
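The likelihood ratio test mentioned for the differential attack can be illustrated with a generic sketch. It assumes i.i.d. Gaussian noise with a known standard deviation and two candidate means, one per hypothesis about the private attribute. These modelling choices, the function names, and the sample values are assumptions for illustration and do not reproduce Diffix's noise structure or the authors' attack.

```python
def log_likelihood_ratio(samples, mu0, mu1, sigma):
    """log [ L(samples | mean mu1) / L(samples | mean mu0) ] for Gaussian noise.

    With equal variances the normalising constants cancel, leaving only the
    difference of squared deviations.
    """
    return sum((x - mu0) ** 2 - (x - mu1) ** 2 for x in samples) / (2 * sigma ** 2)

def decide(samples, mu0, mu1, sigma):
    """Return 1 if the samples are more likely under mu1, else 0."""
    return 1 if log_likelihood_ratio(samples, mu0, mu1, sigma) > 0 else 0

# Hypothetical noisy query answers (made-up numbers).
samples_a = [10.8, 11.3, 9.9, 11.1, 10.6]
samples_b = [9.7, 10.2, 10.1]

llr_a = log_likelihood_ratio(samples_a, mu0=10.0, mu1=11.0, sigma=1.0)
choice_a = decide(samples_a, mu0=10.0, mu1=11.0, sigma=1.0)
choice_b = decide(samples_b, mu0=10.0, mu1=11.0, sigma=1.0)
print(llr_a, choice_a, choice_b)
```

The more noisy samples an attacker can collect for the same underlying value, the further the log-likelihood ratio tends to move away from zero, which is why noise that can be sampled repeatedly may leak information.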
Did the Neurons Read your Book? Document-level Membership Inference for Large Language Models
With large language models (LLMs) poised to become embedded in our daily
lives, questions are starting to be raised about the dataset(s) they learned
from. These questions range from potential bias or misinformation LLMs could
retain from their training data to questions of copyright and fair use of
human-generated text. However, while these questions emerge, developers of the
recent state-of-the-art LLMs have become increasingly reluctant to disclose details
on their training corpus. We here introduce the task of document-level
membership inference for real-world LLMs, i.e. inferring whether the LLM has
seen a given document during training or not. First, we propose a procedure for
the development and evaluation of document-level membership inference for LLMs
by leveraging commonly used data sources for training and the model release
date. We then propose a practical, black-box method to predict document-level
membership and instantiate it on OpenLLaMA-7B with both books and academic
papers. Our methodology reaches an AUC of 0.856 for books and 0.678 for
papers. We then show our approach to
outperform the sentence-level membership inference attacks used in the privacy
literature for the document-level membership task. We finally evaluate whether
smaller models might be less sensitive to document-level inference and show
OpenLLaMA-3B to be approximately as sensitive as OpenLLaMA-7B to our approach.
Taken together, our results show that accurate document-level membership can be
inferred for LLMs, increasing the transparency of technology poised to change
our lives.
Re-aligning Shadow Models can Improve White-box Membership Inference Attacks
Machine learning models have been shown to leak sensitive information about
their training datasets. As models are being increasingly used, on devices, to
automate tasks and power new applications, there have been concerns that such
white-box access to their parameters, as opposed to the black-box setting which
only provides query access to the model, increases the attack surface. Directly
extending the shadow modelling technique from the black-box to the white-box
setting has been shown, in general, not to perform better than black-box only
attacks. A key reason is misalignment, a known characteristic of deep neural
networks. We here present the first systematic analysis of the causes of
misalignment in shadow models and show the use of a different weight
initialisation to be the main cause of shadow model misalignment. Second, we
extend several re-alignment techniques, previously developed in the model
fusion literature, to the shadow modelling context, where the goal is to
re-align the layers of a shadow model to those of the target model. We show
re-alignment techniques to significantly reduce the measured misalignment
between the target and shadow models. Finally, we perform a comprehensive
evaluation of white-box membership inference attacks (MIA). Our analysis
reveals that (1) MIAs suffer from misalignment between shadow models, but that
(2) re-aligning the shadow models improves, sometimes significantly, MIA
performance. On the CIFAR10 dataset with a false positive rate of 1%,
white-box MIA using re-aligned shadow models improves the true positive rate by
4.5%. Taken together, our results highlight that on-device deployment increases
the attack surface and that the newly available information can be used by an
attacker.
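The metric reported above, true positive rate at a fixed false positive rate, can be computed from attack scores roughly as follows. This is a generic sketch rather than the authors' evaluation code: the function name, the decision rule (predict "member" when the score is at or above a threshold), and the scores are all assumptions made for illustration.

```python
def tpr_at_fpr(member_scores, nonmember_scores, max_fpr):
    """Best true positive rate over thresholds whose FPR stays <= max_fpr.

    A record is predicted to be a member when its score >= threshold.
    """
    candidates = sorted(set(member_scores) | set(nonmember_scores))
    candidates.append(float("inf"))  # threshold that predicts no members at all
    best = 0.0
    for t in candidates:
        fpr = sum(s >= t for s in nonmember_scores) / len(nonmember_scores)
        if fpr <= max_fpr:
            tpr = sum(s >= t for s in member_scores) / len(member_scores)
            best = max(best, tpr)
    return best

# Made-up attack scores: 100 non-members and 5 members.
nonmember_scores = list(range(0, 1000, 10))   # 0, 10, ..., 990
member_scores = [500, 900, 985, 995, 1200]

tpr = tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.01)
print(tpr)
```

Reporting the true positive rate at a low false positive rate focuses the evaluation on the records an attacker can identify confidently, rather than on average-case accuracy.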
Achilles' Heels: Vulnerable Record Identification in Synthetic Data Publishing
Synthetic data is seen as the most promising solution to share
individual-level data while preserving privacy. Shadow modeling-based
membership inference attacks (MIAs) have become the standard approach to
evaluate the privacy risk of synthetic data. While very effective, they require
a large number of datasets to be created and models trained to evaluate the
risk posed by a single record. The privacy risk of a dataset is thus currently
evaluated by running MIAs on a handful of records selected using ad-hoc
methods. We here propose what is, to the best of our knowledge, the first
principled vulnerable record identification technique for synthetic data
publishing, leveraging the distance to a record's closest neighbors. We show
our method to strongly outperform previous ad-hoc methods across datasets and
generators. We also show evidence that our method is robust to the choice of
MIA and to the specific choice of parameters. Finally, we show it to accurately
identify vulnerable records when synthetic data generators are made
differentially private. The choice of vulnerable records is as important as
more accurate MIAs when evaluating the privacy of synthetic data releases,
including from a legal perspective. We here propose a simple yet highly
effective method to do so. We hope our method will enable practitioners to
better estimate the risk posed by synthetic data publishing and researchers to
fairly compare ever-improving MIAs on synthetic data.
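The idea of ranking records by the distance to their closest neighbors can be sketched as below. The abstract does not specify the distance measure, the number of neighbors, or how the distances are aggregated. Euclidean distance, k = 2, and the mean are therefore assumptions of this sketch, and the function name and records are hypothetical.

```python
import math

def mean_knn_distance(records, k):
    """Mean Euclidean distance from each record to its k closest neighbors.

    Larger values mean the record sits in a sparser region of the data.
    """
    scores = []
    for i, r in enumerate(records):
        dists = sorted(
            math.dist(r, other) for j, other in enumerate(records) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores

# Made-up numeric records: four clustered points and one outlier.
records = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = mean_knn_distance(records, k=2)

# Rank records from most to least isolated; the top ones would be the
# candidates on which to run the (expensive) membership inference attack.
ranking = sorted(range(len(records)), key=lambda i: scores[i], reverse=True)
print(ranking[0], scores)
```

Here the outlier at index 4 is ranked first. Such a ranking is cheap compared with running a shadow-modelling attack on every record.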
Synthetic is all you need: removing the auxiliary data assumption for membership inference attacks against synthetic data
Synthetic data is emerging as the most promising solution to share
individual-level data while safeguarding privacy. Membership inference attacks
(MIAs), based on shadow modeling, have become the standard to evaluate the
privacy of synthetic data. These attacks, however, currently assume the
attacker to have access to an auxiliary dataset sampled from a similar
distribution as the training dataset. This often is a very strong assumption
that would make an attack unlikely to happen in practice. We here show how this
assumption can be removed and how MIAs can be performed using only the
synthetic data. More specifically, in three different attack scenarios using
only synthetic data, our results demonstrate that MIAs are still successful,
across two real-world datasets and two synthetic data generators. These results
show how the strong hypothesis made when auditing synthetic data releases -
access to an auxiliary dataset - can be relaxed to perform an actual attack.
Quantifying Surveillance in the Networked Age: Node-based Intrusions and Group Privacy
From the "right to be left alone" to the "right to selective disclosure",
privacy has long been thought of as the control individuals have over the
information they share and reveal about themselves. However, in a world that is
more connected than ever, the choices of the people we interact with
increasingly affect our privacy. This forces us to rethink our definition of
privacy. We here formalize and study, as local and global node- and
edge-observability, Bloustein's concept of group privacy. We prove
edge-observability to be independent of the graph structure, while
node-observability depends only on the degree distribution of the graph. We
show on synthetic datasets that, for attacks spanning several hops such as
those implemented by social networks and current US laws, the presence of hubs
increases node-observability while a high clustering coefficient decreases it,
at fixed density. We then study the edge-observability of a large real-world
mobile phone dataset over a month and show that, even under the restricted
two-hops rule, compromising as little as 1% of the nodes leads to observing up
to 46% of all communications in the network. More worrisome, we also show that
on average 36% of each person's communications would be locally
edge-observable under the same rule. Finally, we use real sensing data to show
how people living in cities are vulnerable to distributed node-observability
attacks. Using a smartphone app to compromise 1% of the population, an
attacker could monitor the location of more than half of London's population.
Taken together, our results show that the current individual-centric approach
to privacy and data protection does not encompass the realities of modern life.
This makes us, as a society, vulnerable to large-scale surveillance attacks
which we need to develop protections against.
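One way to make the notion of edge-observability concrete is the sketch below. It uses one plausible reading of a k-hop rule, in which an edge counts as observed when at least one endpoint lies within k - 1 hops of a compromised node. This reading, the function name, and the toy graph are assumptions of the sketch and not the paper's formal definition.

```python
from collections import defaultdict, deque

def observed_edge_fraction(edges, compromised, hops):
    """Fraction of edges with an endpoint within (hops - 1) of a compromised node."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Breadth-first search from all compromised nodes, stopping at depth hops - 1.
    dist = {n: 0 for n in compromised}
    queue = deque(compromised)
    while queue:
        n = queue.popleft()
        if dist[n] == hops - 1:
            continue
        for nb in adj[n]:
            if nb not in dist:
                dist[nb] = dist[n] + 1
                queue.append(nb)

    observed = sum(1 for u, v in edges if u in dist or v in dist)
    return observed / len(edges)

# Toy path graph 0-1-2-3-4-5 with node 0 compromised (made-up example).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
one_hop = observed_edge_fraction(edges, compromised=[0], hops=1)
two_hop = observed_edge_fraction(edges, compromised=[0], hops=2)
print(one_hop, two_hop)
```

Even on this tiny graph, extending the rule from one hop to two doubles the share of observed edges, which gives some intuition for why hubs amplify such attacks.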
Unique in the shopping mall: On the reidentifiability of credit card metadata
Large-scale data sets of human behavior have the potential to fundamentally transform the way we fight diseases, design cities, or perform research. Metadata, however, contain sensitive information. Understanding the privacy of these data sets is key to their broad use and, ultimately, their impact. We study 3 months of credit card records for 1.1 million people and show that four spatiotemporal points are enough to uniquely reidentify 90% of individuals. We show that knowing the price of a transaction increases the risk of reidentification by 22%, on average. Finally, we show that even data sets that provide coarse information at any or all of the dimensions provide little anonymity and that women are more reidentifiable than men in credit card metadata.
European Commission. Framework Programme 7 (Marie Curie Action. Grant 264994)
U.S. Army Research Laboratory (Cooperative Agreement W911NF-09-2-0053)
Belgian American Educational Foundation, inc.
Wallonie-Bruxelles Internationa