Unintended Memorization and Timing Attacks in Named Entity Recognition Models
Named entity recognition (NER) models are widely used for identifying named
entities (e.g., individuals, locations, and other information) in text
documents. Machine learning-based NER models are increasingly being applied in
privacy-sensitive applications that need automatic and scalable identification
of sensitive information to redact text for data sharing. In this paper, we
study the setting when NER models are available as a black-box service for
identifying sensitive information in user documents and show that these models
are vulnerable to membership inference on their training datasets. With updated
pre-trained NER models from spaCy, we demonstrate two distinct membership
attacks on these models. Our first attack capitalizes on unintended
memorization in the NER's underlying neural network, a phenomenon to which
neural networks are known to be susceptible. Our second attack leverages a timing side-channel to
target NER models that maintain vocabularies constructed from the training
data. We show that words from the training dataset follow different functional
paths through the model than words not previously seen, and that these paths
exhibit measurable differences in execution time. Revealing the membership
status of training samples has clear privacy implications: in text redaction,
for example, the sensitive words or phrases to be found and removed are at risk
of being detected as part of the training dataset. Our
experimental evaluation includes the redaction of both password and health
data, presenting both security risks and privacy/regulatory issues. This is
exacerbated by results that show memorization with only a single phrase. We
achieved 70% AUC in our first attack on a text redaction use-case. We also show
overwhelming success in the timing attack with 99.23% AUC. Finally, we discuss
potential mitigation approaches to realize the safe use of NER models in light
of the privacy and security implications of membership inference attacks.

Comment: This is the full version of the paper with the same title accepted
for publication in the Proceedings of the 23rd Privacy Enhancing Technologies
Symposium, PETS 2023.
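To make the timing side-channel concrete, the sketch below times repeated calls to a local spaCy pipeline, used here as a stand-in for the black-box NER service, and compares the median latency of a candidate word against a calibrated threshold. This is a minimal illustration under stated assumptions, not the paper's implementation: the pipeline name, the threshold `TAU`, and the direction of the timing gap are all hypothetical and would need to be calibrated against the actual target.

```python
# Minimal sketch of a timing-based membership test (see assumptions above;
# this is not the attack implementation from the paper).
import statistics
import time

import spacy

# Stand-in for the black-box NER service; the real target would be queried remotely.
nlp = spacy.load("en_core_web_sm")

def median_latency(text: str, repeats: int = 200) -> float:
    """Median wall-clock time to run the NER pipeline on a candidate string."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        nlp(text)  # vocabulary hits may take a different code path than unseen words
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Hypothetical threshold (seconds), calibrated offline on words whose membership
# status is known; the sign of the comparison is implementation-specific.
TAU = 1.0e-4

def looks_like_training_member(word: str) -> bool:
    """Guess membership from which side of the calibrated threshold the latency falls."""
    return median_latency(word) < TAU
```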
Membership Inference Attacks against Language Models via Neighbourhood Comparison
Membership inference attacks (MIAs) aim to predict whether or not a data sample
was present in the training data of a machine learning model, and are widely
used for assessing the privacy risks of language models. Most existing attacks
rely on the observation that models tend to assign higher probabilities to
their training samples than non-training points. However, simple thresholding
of the model score in isolation tends to lead to high false-positive rates as
it does not account for the intrinsic complexity of a sample. Recent work has
demonstrated that reference-based attacks which compare model scores to those
obtained from a reference model trained on similar data can substantially
improve the performance of MIAs. However, in order to train reference models,
attacks of this kind make the strong and arguably unrealistic assumption that
an adversary has access to samples closely resembling the original training
data. Therefore, we investigate their performance in more realistic scenarios
and find that they are highly fragile in relation to the data distribution used
to train reference models. To investigate whether this fragility provides a
layer of safety, we propose and evaluate neighbourhood attacks, which compare
model scores for a given sample to scores of synthetically generated neighbour
texts and therefore eliminate the need for access to the training data
distribution. We show that, in addition to being competitive with
reference-based attacks that have perfect knowledge about the training data
distribution, our attack clearly outperforms existing reference-free attacks as
well as reference-based attacks with imperfect knowledge, which demonstrates
the need for a reevaluation of the threat model of adversarial attacks.
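As a rough illustration of the neighbourhood comparison described above, the sketch below scores a sample under a target language model and compares its loss to the mean loss of neighbour texts generated by single-word substitutions from a masked language model. The choices here are assumptions rather than the authors' setup: GPT-2 as the target, distilroberta-base for neighbour generation, one substitution per neighbour, and a simple loss gap as the membership score; the paper's neighbour-generation and calibration details may differ.

```python
# Minimal sketch of a neighbourhood-comparison membership test
# (assumptions noted above; not the authors' implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

target_tok = AutoTokenizer.from_pretrained("gpt2")             # stand-in target model
target_lm = AutoModelForCausalLM.from_pretrained("gpt2")
fill_mask = pipeline("fill-mask", model="distilroberta-base")  # neighbour generator

def avg_nll(text: str) -> float:
    """Average negative log-likelihood of `text` under the target model."""
    ids = target_tok(text, return_tensors="pt")
    with torch.no_grad():
        out = target_lm(**ids, labels=ids["input_ids"])
    return out.loss.item()

def neighbours(text: str, max_neighbours: int = 10) -> list[str]:
    """Generate neighbour texts by masking one word at a time and re-filling it."""
    words = text.split()
    generated = []
    for i in range(min(max_neighbours, len(words))):
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        candidate = fill_mask(masked, top_k=1)[0]["sequence"]
        if candidate.strip() != text.strip():  # crude check against trivial copies
            generated.append(candidate)
    return generated

def membership_score(text: str) -> float:
    """Loss gap between the sample and its neighbours; a strongly negative gap
    (the sample is much 'easier' than comparable texts) suggests training membership."""
    neigh = neighbours(text)
    if not neigh:
        return 0.0
    return avg_nll(text) - sum(avg_nll(n) for n in neigh) / len(neigh)
```

In this sketch the decision threshold on `membership_score` is left to the attacker, e.g., flagging samples whose loss is clearly below that of their synthetic neighbours.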