DP-SIPS: A simpler, more scalable mechanism for differentially private partition selection
Partition selection, or set union, is an important primitive in
differentially private mechanism design: in a database where each user
contributes a list of items, the goal is to publish as many of these items as
possible under differential privacy. In this work, we present a novel mechanism
for differentially private partition selection. This mechanism, which we call
DP-SIPS, is very simple: it consists of iterating the naive algorithm over the
data set multiple times, removing the released partitions from the data set
while increasing the privacy budget at each step. This approach preserves the
scalability benefits of the naive mechanism, yet its utility compares favorably
to more complex approaches developed in prior work.
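The iteration described in this abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the use of Laplace noise, and the per-round `(noise_scale, threshold)` pairs are assumptions, and calibrating those pairs to a formal (epsilon, delta) guarantee is deliberately left out.

```python
import random
from collections import Counter


def laplace(rng, scale):
    # The difference of two Exp(1) draws is a standard Laplace variate.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def naive_round(user_items, max_items, noise_scale, threshold, rng):
    """One pass of a naive mechanism: bound contributions, add noise, threshold."""
    counts = Counter()
    for items in user_items:
        pool = sorted(items)
        # Each user contributes at most max_items of their remaining items.
        for item in rng.sample(pool, min(max_items, len(pool))):
            counts[item] += 1
    return {
        item
        for item, count in counts.items()
        if count + laplace(rng, noise_scale) >= threshold
    }


def dp_sips_sketch(user_items, max_items, rounds, seed=0):
    """Hypothetical helper: rounds is a list of (noise_scale, threshold) pairs.

    Later rounds would normally get a larger share of the privacy budget,
    i.e. a smaller noise_scale. How to derive these pairs from a privacy
    budget is NOT shown here.
    """
    rng = random.Random(seed)
    remaining = [set(items) for items in user_items]
    released = set()
    for noise_scale, threshold in rounds:
        newly = naive_round(remaining, max_items, noise_scale, threshold, rng)
        released |= newly
        # Remove released partitions so later rounds spend each user's
        # contribution bound only on items that are still unreleased.
        remaining = [items - newly for items in remaining]
    return released
```

Removing released items between rounds is the point of the loop: a user whose frequent items were already published contributes only rarer items in later passes.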
Differentially Private Heavy Hitter Detection using Federated Analytics
In this work, we study practical heuristics to improve the performance of
prefix-tree based algorithms for differentially private heavy hitter detection.
Our model assumes each user has multiple data points and the goal is to learn
as many of the most frequent data points as possible across all users' data
with aggregate and local differential privacy. We propose an adaptive
hyperparameter tuning algorithm that improves the performance of the algorithm
while satisfying computational, communication and privacy constraints. We
explore the impact of different data-selection schemes as well as the impact of
introducing deny lists during multiple runs of the algorithm. We test these
improvements using extensive experimentation on the Reddit
dataset (Caldas et al., 2018) on the task of learning the most frequent words.
Sparsity-Preserving Differentially Private Training of Large Embedding Models
As the use of large embedding models in recommendation systems and language
applications increases, concerns over user data privacy have also risen.
DP-SGD, a training algorithm that combines differential privacy with stochastic
gradient descent, has been the workhorse in protecting user privacy without
compromising model accuracy by much. However, applying DP-SGD naively to
embedding models can destroy gradient sparsity, leading to reduced training
efficiency. To address this issue, we present two new algorithms, DP-FEST and
DP-AdaFEST, that preserve gradient sparsity during private training of large
embedding models. Our algorithms achieve substantial reductions
in gradient size, while maintaining comparable levels of accuracy, on benchmark
real-world datasets.
Comment: Neural Information Processing Systems (NeurIPS) 202
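To see why naive DP-SGD destroys gradient sparsity, consider the minimal sketch below. It shows plain DP-SGD on an embedding table, not DP-FEST or DP-AdaFEST; the function names and the dict-of-rows gradient representation are assumptions made for illustration.

```python
import math
import random


def dp_sgd_embedding_grad(per_example_grads, vocab_size, dim,
                          clip_norm, noise_multiplier, seed=0):
    """Clip each example's gradient, sum, then add Gaussian noise everywhere.

    per_example_grads: list of dicts mapping row id -> list of `dim` floats.
    Only the rows an example actually looked up appear in its dict.
    """
    rng = random.Random(seed)
    total = [[0.0] * dim for _ in range(vocab_size)]
    for grad in per_example_grads:
        norm = math.sqrt(sum(v * v for row in grad.values() for v in row))
        factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for row_id, row in grad.items():
            for j, v in enumerate(row):
                total[row_id][j] += factor * v
    # Noise is added to every coordinate of the table, including rows that
    # no example touched. This is the step that makes the gradient dense.
    sigma = noise_multiplier * clip_norm
    for row in total:
        for j in range(dim):
            row[j] += rng.gauss(0.0, sigma)
    return total


def nonzero_rows(table):
    return sum(1 for row in table if any(v != 0.0 for v in row))
```

With two examples that touch two rows of a 1000-row table, `nonzero_rows` reports 2 when `noise_multiplier` is 0 and all 1000 rows once noise is added, so a sparse update over a handful of rows becomes a dense update over the whole table.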
Lessons Learned: Surveying the Practicality of Differential Privacy in the Industry
Since its introduction in 2006, differential privacy has emerged as a
predominant statistical tool for quantifying data privacy in academic works.
Yet despite the plethora of research and open-source utilities that have
accompanied its rise, with limited exceptions, differential privacy has failed
to achieve widespread adoption in the enterprise domain. Our study aims to shed
light on the fundamental causes underlying this academic-industrial utilization
gap through detailed interviews of 24 privacy practitioners across 9 major
companies. We analyze the results of our survey to provide key findings and
suggestions for companies striving to improve privacy protection in their data
workflows and highlight the necessary and missing requirements of existing
differential privacy tools, with the goal of guiding researchers working
towards the broader adoption of differential privacy. Our findings indicate
that analysts suffer from lengthy bureaucratic processes for requesting access
to sensitive data, yet once granted, only scarcely-enforced privacy policies
stand between rogue practitioners and misuse of private information. We thus
argue that differential privacy can significantly improve the processes of
requesting and conducting data exploration across silos, and conclude that with
a few of the improvements suggested herein, the practical use of differential
privacy across the enterprise is within striking distance.
Data Protection in Big Data Analysis
"Big data" applications are collecting data from various aspects of our lives more and more every day. This fast transition has surpassed the development pace of data protection techniques and has resulted in innumerable data breaches and privacy violations. To prevent that, it is important to ensure the data is protected while at rest, in transit, in use, as well as during computation or dispersal. We investigate data protection issues in big data analysis in this thesis. We address a security or privacy concern in each phase of the data science pipeline. These phases are: i) data cleaning and preparation, ii) data management, iii) data modelling and analysis, and iv) data dissemination and visualization. In each of our contributions, we either address an existing problem and propose a resolving design (Chapters 2 and 4), or evaluate a current solution for a problem and analyze whether it meets the expected security/privacy goal (Chapters 3 and 5).
Starting with privacy in data preparation, we investigate providing privacy in query analysis leveraging differential privacy techniques. We consider contextual outlier analysis and identify challenging queries that require releasing direct information about members of the dataset. We define a new sampling mechanism that allows releasing this information in a differentially private manner. Our second contribution is in the data modelling and analysis phase. We investigate the effect of data properties and application requirements on the successful implementation of privacy techniques. In particular, we investigate the effects of data correlation on the data protection guarantees of differential privacy. Our third contribution in this thesis is in the data management phase. The problem is to efficiently protect data that is outsourced to a database management system (DBMS) provider while still allowing join operations. We provide an encryption method to minimize the leakage and to guarantee confidentiality for the data efficiently. Our last contribution is in the data dissemination phase. We inspect the ownership/contract protection for the prediction models trained on the data. We evaluate backdoor-based watermarking in deep neural networks, which is an important and recent line of work in model ownership/contract protection.
Fully Privacy-Preserving Federated Representation Learning via Secure Embedding Aggregation
We consider a federated representation learning framework, where with the assistance of a central server, a group of distributed clients train collaboratively over their private data, for the representations (or embeddings) of a set of entities (e.g., users in a social network). Under this framework, for the key step of aggregating local embeddings trained at the clients in a private manner, we develop a secure embedding aggregation protocol named SecEA, which provides information-theoretical privacy guarantees for the set of entities and the corresponding embeddings at each client, against a curious server and up to colluding clients. As the first step of SecEA, the federated learning system performs a private entity union, for each client to learn all the entities in the system without knowing which entities belong to which clients. In each aggregation round, the local embeddings are secretly shared among the clients using Lagrange interpolation, and then each client constructs coded queries to retrieve the aggregated embeddings for the intended entities. We perform comprehensive experiments on various representation learning tasks to evaluate the utility and efficiency of SecEA, and empirically demonstrate that compared with embedding aggregation protocols without (or with weaker) privacy guarantees, SecEA incurs negligible performance loss (within 5%); and the additional computation latency of SecEA diminishes for training deeper models on larger datasets.
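The general idea of secret-sharing values with Lagrange interpolation and aggregating the shares can be illustrated with textbook Shamir sharing. This is not the SecEA protocol itself (it omits the private entity union and the coded queries); the prime, the function names, and the integer-valued secrets are assumptions for the sketch.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime used as the field modulus (assumption)


def share(secret, n, t, rng):
    """Split `secret` into n shares; any t + 1 of them reconstruct it."""
    coeffs = [secret % PRIME] + [rng.randrange(PRIME) for _ in range(t)]
    return [
        (x, sum(c * pow(x, k, PRIME) for k, c in enumerate(coeffs)) % PRIME)
        for x in range(1, n + 1)
    ]


def reconstruct(shares):
    """Lagrange interpolation of the sharing polynomial at x = 0."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total


def aggregate(secrets, n, t, seed=0):
    """Each client shares its value; shares are summed pointwise, then opened."""
    # random.Random is used only for reproducibility of this sketch; a real
    # system would need a cryptographically secure source of randomness.
    rng = random.Random(seed)
    all_shares = [share(s, n, t, rng) for s in secrets]
    summed = [
        (x, sum(sh[i][1] for sh in all_shares) % PRIME)
        for i, x in enumerate(range(1, n + 1))
    ]
    return reconstruct(summed[: t + 1])
```

Because the sharing is linear, summing shares pointwise yields shares of the sum, so only the aggregate is ever reconstructed. Real-valued embeddings would first have to be quantized to field elements, which is not shown.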
Information-theoretic Approach to Design and Evaluate Privacy-preserving and Fair Frameworks for Continuous High-dimensional Data
Deep learning is becoming the latest trend in sensitive applications, such as healthcare, criminal justice, and finance. As these new applications emerge, adversaries are circumventing them.
Further, there have been concerns about the possibility of bias and discrimination in predictive applications.
In order to address these issues, we propose an information-theoretic approach to design a continuous high-dimensional data deep learning framework. We call this framework Gaussian privacy protector (GPP).
Our proposed framework has many advantages:
(1) it reduces the problem to the optimal compression of data about a measure of utility and privacy;
(2) it can prevent adversaries from mining private information from the released data while simultaneously maximizing the amount of the utility's information revealed;
(3) it adapts the idea of the information bottleneck (IB) based on the problem of revealing data, which is often sensitive;
(4) it considers a privacy funnel (PF) problem inspired by utility data as the central part of data to be revealed;
(5) using a similar framework, we show how to achieve fairness in classification; and
(6) this work illustrates the feasibility of creating a centralized platform to support this framework over distributed datasets.
We utilize variational lower bounds of mutual information approximation implemented as supervised learning using an adversarial training algorithm.
We use three datasets: hand-written digits (MNIST), celeb faces attributes (CelebA), and human activities and postural transitions' recognition using smartphone data (HAPT-Recognition) to evaluate our algorithms.
The experimental results on these datasets demonstrate that the proposed approach effectively removes private information from the datasets while allowing non-private information to be mined effectively.
Incorporating Item Frequency for Differentially Private Set Union
We study the problem of releasing the set union of users' items subject to differential privacy. Previous approaches consider only the set of items for each user as the input. We propose incorporating the item frequency, which is typically available in set union problems, to boost the utility of private mechanisms. However, using the global item frequency over all users would largely increase privacy loss. We propose to use the local item frequency of each user to approximate the global item frequency without incurring additional privacy loss.
Local item frequency allows us to design greedy set union mechanisms that are differentially private, which is impossible for previous greedy proposals. Moreover, while all previous works have to use uniform sampling to limit the number of items each user would contribute to, our construction eliminates the sampling step completely and allows our mechanisms to consider all of the users' items.
Finally, we propose to transfer the knowledge of the global item frequency from a public dataset into our mechanism, which further boosts utility even when the public and private datasets are from different domains. We evaluate the proposed methods on multiple real-life datasets.