Privacy Guarantees for De-identifying Text Transformations
Machine Learning approaches to Natural Language Processing tasks benefit from
a comprehensive collection of real-life user data. At the same time, there is a
clear need for protecting the privacy of the users whose data is collected and
processed. For text collections such as transcripts of voice
interactions or patient records, replacing sensitive parts with benign
alternatives can provide de-identification. However, how much privacy is
actually guaranteed by such text transformations, and are the resulting texts
still useful for machine learning? In this paper, we derive formal privacy
guarantees for general text transformation-based de-identification methods on
the basis of Differential Privacy. We also measure the effect that different
ways of masking private information in dialog transcripts have on a subsequent
machine learning task. To this end, we formulate different masking strategies
and compare their privacy-utility trade-offs. In particular, we compare a
simple redact approach with more sophisticated word-by-word replacement using
deep learning models on multiple natural language understanding tasks like
named entity recognition, intent detection, and dialog act classification. We
find that only word-by-word replacement is robust against performance drops in
various tasks.
Comment: Proceedings of INTERSPEECH 202
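The two masking strategies compared above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the entity spans and the surrogate vocabulary are hypothetical stand-ins for what an NER-based system would produce.

```python
# Two de-identification strategies for a tokenized utterance:
# (1) redaction with a fixed mask symbol, and
# (2) word-by-word replacement with benign surrogates of the same type.
# SURROGATES and the entity annotations are illustrative only.

SURROGATES = {"PERSON": "Alex", "CITY": "Springfield"}

def redact(tokens, entities):
    """Replace every sensitive token with a fixed mask symbol."""
    return ["<MASK>" if i in entities else t for i, t in enumerate(tokens)]

def replace(tokens, entities):
    """Swap each sensitive token for a benign surrogate of its entity type."""
    return [SURROGATES[entities[i]] if i in entities else t
            for i, t in enumerate(tokens)]

tokens = ["Maria", "flew", "to", "Paris", "yesterday"]
entities = {0: "PERSON", 3: "CITY"}  # token index -> entity type
print(redact(tokens, entities))   # ['<MASK>', 'flew', 'to', '<MASK>', 'yesterday']
print(replace(tokens, entities))  # ['Alex', 'flew', 'to', 'Springfield', 'yesterday']
```

Replacement keeps the sentence grammatical and distributionally closer to real text, which is one intuition for why it degrades downstream task performance less than redaction.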
Corrupt Bandits for Preserving Local Privacy
We study a variant of the stochastic multi-armed bandit (MAB) problem in
which the rewards are corrupted. In this framework, motivated by privacy
preservation in online recommender systems, the goal is to maximize the sum of
the (unobserved) rewards, based on observations of these rewards passed
through a stochastic corruption process with known parameters. We
provide a lower bound on the expected regret of any bandit algorithm in this
corrupted setting. We devise a frequentist algorithm, KLUCB-CF, and a Bayesian
algorithm, TS-CF, and give upper bounds on their regret. We also provide the
appropriate corruption parameters to guarantee a desired level of local privacy
and analyze how this impacts the regret. Finally, we present some experimental
results that confirm our analysis.
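For binary rewards, one concrete instance of a corruption process with known parameters is randomized response: flip each reward with probability p. The sketch below is not the KLUCB-CF or TS-CF algorithms themselves; it only illustrates how the flip probability maps to a local privacy level and how the known corruption can be inverted to de-bias observed means.

```python
import math
import random

def corrupt(reward, p):
    """Randomized-response corruption: flip a binary reward with probability p."""
    return reward if random.random() >= p else 1 - reward

def local_privacy_level(p):
    """epsilon of local differential privacy for flip probability p in (0, 0.5):
    the likelihood ratio of any output under the two inputs is (1-p)/p."""
    return math.log((1 - p) / p)

def debias(observed_mean, p):
    """Invert the known corruption: E[obs] = mu*(1-2p) + p, so solve for mu."""
    return (observed_mean - p) / (1 - 2 * p)

p = 0.25
samples = [corrupt(1, p) for _ in range(20000)]   # true reward mean mu = 1
print(local_privacy_level(p))                     # ln(3), about 1.0986
print(debias(sum(samples) / len(samples), p))     # close to 1.0
```

A smaller epsilon (p closer to 1/2) means stronger privacy but a larger variance blow-up in the de-biased estimate, which is the privacy-regret trade-off the abstract analyzes.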
Compressive Privacy for a Linear Dynamical System
We consider a linear dynamical system in which the state vector consists of
both public and private states. One or more sensors make measurements of the
state vector and send information to a fusion center, which performs the final
state estimation. To achieve an optimal tradeoff between the utility of
estimating the public states and protection of the private states, the
measurements at each time step are linearly compressed into a lower dimensional
space. Under the centralized setting where all measurements are collected by a
single sensor, we propose an optimization problem and an algorithm to find the
best compression matrix. Under the decentralized setting where measurements are
made separately at multiple sensors, each sensor optimizes its own local
compression matrix. We propose methods to separate the overall optimization
problem into multiple sub-problems that can be solved locally at each sensor.
We consider the cases where there is no message exchange between the sensors;
and where each sensor takes turns to transmit messages to the other sensors.
Simulations and empirical experiments demonstrate the efficiency of our
proposed approach in allowing the fusion center to estimate the public states
with good accuracy while preventing it from estimating the private states
accurately.
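The core idea of compressing measurements so that public states remain estimable while private states do not can be shown in a toy sketch. This is not the paper's optimization algorithm: the dimensions are illustrative, measurements are taken noiseless, and the compression matrix is simply chosen orthogonal to the private direction via one Gram-Schmidt step.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0])             # state: [public, private]
H = rng.standard_normal((4, 2))       # sensor measurement matrix (illustrative)
z = H @ x                             # 4-dim noiseless measurements

# Compress the 4-dim measurement to 1 dim with a row orthogonal to the
# private column of H, so the compressed signal retains public-state
# information while carrying (almost) none about the private state.
h_pub, h_priv = H[:, 0], H[:, 1]
c = h_pub - (h_pub @ h_priv) / (h_priv @ h_priv) * h_priv   # Gram-Schmidt step
C = c[None, :]                        # 1x4 compression matrix
y = C @ z                             # what the fusion center receives

# Fusion center recovers the public state; the private state is
# unidentifiable from y because C @ h_priv is (numerically) zero.
x_pub_hat = (y / (C @ h_pub))[0]
print(round(x_pub_hat, 6))            # recovers the public state, about 1.0
```

The paper instead optimizes the compression matrix for the best utility-privacy trade-off, including noisy and decentralized settings; the sketch only shows why a well-chosen linear compression can separate the two kinds of information.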
Adversarial training approach for local data debiasing
The widespread use of automated decision processes in many areas of our
society raises serious ethical issues concerning the fairness of the process
and the possible resulting discriminations. In this work, we propose a novel
approach called GANsan whose objective is to prevent the possibility of any
discrimination (i.e., direct and indirect) based on a sensitive attribute by
removing the attribute itself as well as the existing correlations with the
remaining attributes. Our sanitization algorithm GANsan is partially inspired
by the powerful framework of generative adversarial networks (in particular the
Cycle-GANs), which offers a flexible way to learn a distribution empirically or
to translate between two different distributions.
In contrast to prior work, one of the strengths of our approach is that the
sanitization is performed in the same space as the original data by only
modifying the other attributes as little as possible and thus preserving the
interpretability of the sanitized data. As a consequence, once the sanitizer is
trained, it can be applied to new data, for instance locally by an
individual to their own profile before releasing it. Finally, experiments on a real
dataset demonstrate the effectiveness of the proposed approach as well as the
achievable trade-off between fairness and utility.
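The goal of removing both the sensitive attribute and its correlations, while staying in the original data space, can be illustrated with a drastically simplified linear stand-in. GANsan itself uses adversarial networks (Cycle-GAN style); the sketch below only removes linear correlation with the sensitive attribute by residualizing each column on it, and the synthetic data is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.integers(0, 2, size=500).astype(float)            # sensitive attribute
X = np.column_stack([
    2.0 * s + rng.standard_normal(500),                   # correlated feature
    rng.standard_normal(500),                             # independent feature
])

def sanitize(X, s):
    """Remove each column's linear correlation with s, keeping the data
    in its original feature space (a linear proxy for GANsan's goal)."""
    s_c = s - s.mean()
    X_c = X - X.mean(axis=0)
    coef = (s_c @ X_c) / (s_c @ s_c)      # per-column regression slope on s
    return X - np.outer(s_c, coef)        # subtract the explained component

X_san = sanitize(X, s)

# After sanitization, s is (linearly) uncorrelated with every column,
# while the data keeps its original shape and interpretable columns.
s_c = s - s.mean()
residual_corr = np.abs(s_c @ (X_san - X_san.mean(axis=0)))
print(residual_corr)                      # both entries numerically ~0
```

The adversarial formulation is strictly stronger: a discriminator penalizes any recoverable (including nonlinear) dependence on s, subject to minimally modifying the remaining attributes.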