Guiding Text-to-Text Privatization by Syntax
Metric Differential Privacy is a generalization of differential privacy
tailored to address the unique challenges of text-to-text privatization. By
adding noise to the representation of words in the geometric space of
embeddings, words are replaced with words located in the proximity of the noisy
representation. Since embeddings are trained based on word co-occurrences, this
mechanism ensures that substitutions stem from a common semantic context.
Without considering the grammatical category of words, however, this mechanism
cannot guarantee that substitutions play similar syntactic roles. We analyze
the capability of text-to-text privatization to preserve the grammatical
category of words after substitution and find that surrogate texts consist
almost exclusively of nouns. Since the mechanism cannot produce surrogate
texts that mirror the structure of the sensitive texts, we extend our
analysis by transforming the privatization step into a candidate selection
problem in which substitutions are directed to words with matching grammatical
properties. We demonstrate a substantial improvement in the performance of
downstream tasks while retaining comparable privacy guarantees.
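To make the candidate-selection idea concrete, here is a minimal sketch of a
syntax-aware variant of the mechanism, assuming a precomputed vocabulary
`vocab`, its embedding matrix `vecs`, and per-word grammatical tags
`pos_tags` (all hypothetical names). The noise follows the multivariate
distribution with density proportional to $\exp(-\epsilon \lVert z \rVert)$
commonly used for metric-DP word substitution.

```python
import numpy as np

def privatize(word, vocab, vecs, pos_tags, epsilon, rng):
    """Metric-DP word substitution restricted to candidates that share
    the original word's grammatical category (a sketch, not the paper's
    exact mechanism)."""
    i = vocab.index(word)
    d = vecs.shape[1]

    # Noise with density ~ exp(-epsilon * ||z||): a uniformly random
    # direction scaled by a Gamma(d, 1/epsilon)-distributed magnitude.
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    noisy = vecs[i] + rng.gamma(shape=d, scale=1.0 / epsilon) * direction

    # Candidate selection: keep only words with a matching POS tag.
    candidates = [j for j, t in enumerate(pos_tags) if t == pos_tags[i]]

    # Nearest neighbour among the syntactically compatible candidates.
    dists = np.linalg.norm(vecs[candidates] - noisy, axis=1)
    return vocab[candidates[int(np.argmin(dists))]]
```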
Driving Context into Text-to-Text Privatization
\textit{Metric Differential Privacy} enables text-to-text privatization by
adding calibrated noise to the vector of a word derived from an embedding space
and projecting this noisy vector back to a discrete vocabulary using a nearest
neighbor search. Since words are substituted without context, this mechanism is
expected to fall short at finding substitutes for words with ambiguous
meanings, such as \textit{'bank'}. To account for these ambiguous words, we
leverage a sense embedding and incorporate a sense disambiguation step prior to
noise injection. We accompany our modification of the privatization mechanism
with an estimation of privacy and utility. For word sense disambiguation on the
\textit{Words in Context} dataset, we demonstrate a substantial increase in
classification accuracy.
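A minimal sketch of what such a disambiguation step might look like, assuming
a hypothetical precomputed table `sense_vecs` mapping each word to a matrix
of per-sense embeddings and `word_vecs` mapping words to single vectors; the
selected sense vector would then be handed to the noise-injection step of the
base mechanism.

```python
import numpy as np

def select_sense(word, context_words, sense_vecs, word_vecs):
    """Pick the sense embedding of `word` that best matches its context,
    before any noise is injected (illustrative only)."""
    # Crude context representation: the mean of the context embeddings.
    ctx = np.mean([word_vecs[w] for w in context_words if w in word_vecs],
                  axis=0)
    senses = sense_vecs[word]                      # shape (num_senses, d)
    # Cosine similarity between each sense vector and the context.
    sims = senses @ ctx / (np.linalg.norm(senses, axis=1)
                           * np.linalg.norm(ctx))
    return senses[int(np.argmax(sims))]            # vector to be privatized
```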
Directional Privacy for Deep Learning
Differentially Private Stochastic Gradient Descent (DP-SGD) is a key method
for applying privacy in the training of deep learning models. It applies
isotropic Gaussian noise to gradients during training, which can perturb these
gradients in any direction and damage utility. Metric DP, however, can provide
alternative mechanisms based on arbitrary metrics that might be more suitable.
In this paper we apply \textit{directional privacy}, via a mechanism based on
the von Mises-Fisher (VMF) distribution, to perturb gradients in terms of
\textit{angular distance} so that gradient direction is broadly preserved. We
show that this provides $\epsilon d$-privacy for deep learning training, rather
than the $(\epsilon, \delta)$-privacy of the Gaussian mechanism; and that
experimentally, on key datasets, the VMF mechanism can outperform the Gaussian
in the utility-privacy trade-off.
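As a rough illustration of direction-preserving noise, the sketch below
perturbs a gradient with SciPy's von Mises-Fisher sampler while keeping its
magnitude; the calibration of the concentration parameter kappa to a formal
privacy budget, which the paper derives, is omitted here.

```python
import numpy as np
from scipy.stats import vonmises_fisher  # available in SciPy >= 1.11

def vmf_perturb(grad, kappa, rng):
    """Replace the gradient's direction with a von Mises-Fisher sample
    centred on it; larger kappa concentrates the sample around the true
    direction (better utility, weaker privacy)."""
    norm = np.linalg.norm(grad)
    mu = grad / norm                                # unit mean direction
    sample = vonmises_fisher(mu, kappa).rvs(random_state=rng)
    return norm * np.asarray(sample).reshape(-1)    # keep the magnitude

# Toy usage on a 3-dimensional "gradient".
rng = np.random.default_rng(0)
print(vmf_perturb(np.array([0.3, -1.2, 0.8]), kappa=50.0, rng=rng))
```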
An Easy-to-use and Robust Approach for the Differentially Private De-Identification of Clinical Textual Documents
Unstructured textual data is at the heart of healthcare systems. For obvious
privacy reasons, these documents are not accessible to researchers as long as
they contain personally identifiable information. One way to share this data
while respecting the legislative framework (notably GDPR or HIPAA) is, within
medical institutions, to de-identify it, i.e., to detect the personal
information of a person through a Named Entity Recognition (NER) system and
then replace it so that it becomes very difficult to associate the document
with the person. The challenge is to have reliable NER and substitution tools
that compromise neither confidentiality nor the consistency of the document.
Most existing research focuses on English medical documents and applies coarse
substitutions that do not benefit from advances in privacy. This paper shows
how an efficient and differentially private de-identification approach can be
achieved by strengthening the less robust de-identification method and by
adapting state-of-the-art differentially private mechanisms for substitution
purposes. The result is an approach for de-identifying clinical documents in
French that is also generalizable to other languages and whose robustness is
mathematically proven.
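A minimal sketch of the detect-then-substitute pipeline using spaCy's French
NER model; the `substitute` callback stands in for whichever differentially
private substitution mechanism is plugged in, and the chosen entity labels
are illustrative.

```python
import spacy

nlp = spacy.load("fr_core_news_md")  # assumes the French model is installed

def deidentify(text, substitute):
    """Replace every detected personal identifier in `text` using a
    caller-supplied `substitute(entity_text, entity_label)` function."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in {"PER", "LOC", "ORG"}:  # identifier categories
            out.append(text[last:ent.start_char])
            out.append(substitute(ent.text, ent.label_))
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)
```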
Differentially Private Natural Language Models: Recent Advances and Future Directions
Recent developments in deep learning have led to great success in various
natural language processing (NLP) tasks. However, these applications may
involve data that contain sensitive information. Therefore, how to achieve good
performance while also protecting the privacy of sensitive data is a crucial
challenge in NLP. To preserve privacy, Differential Privacy (DP), which can
prevent reconstruction attacks and protect against potential side knowledge, is
becoming a de facto technique for private data analysis. In recent years,
differentially private NLP (DP-NLP) has been studied from different
perspectives and deserves a comprehensive review. In this paper, we provide the
first systematic review of recent advances in DP deep learning models for NLP.
In particular, we first discuss the differences and additional challenges of
DP-NLP compared with standard DP deep learning. We then survey existing work on
DP-NLP and present its recent developments from two aspects: gradient
perturbation based methods and embedding vector perturbation based methods. We
also discuss some challenges and future directions of this topic.
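To make the first of the two aspects concrete, here is a minimal sketch of the
DP-SGD aggregation step that the gradient perturbation line of work builds on:
per-example gradients are clipped to a norm bound and isotropic Gaussian noise
scaled to that bound is added to their average. Names and the noise
parameterization are illustrative.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD aggregation: clip each per-example gradient to
    `clip_norm`, average, and add Gaussian noise with standard deviation
    noise_multiplier * clip_norm / n (a common parameterization)."""
    n = len(per_example_grads)
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    sigma = noise_multiplier * clip_norm / n
    return mean_grad + rng.normal(scale=sigma, size=mean_grad.shape)
```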