11 research outputs found

DP-SIPS: A simpler, more scalable mechanism for differentially private partition selection

Authors: Desfontaines Damien; Haney Samuel; Swanberg Marika
Publication date: 22/06/2023

Partition selection, or set union, is an important primitive in differentially private mechanism design: in a database where each user contributes a list of items, the goal is to publish as many of these items as possible under differential privacy. In this work, we present a novel mechanism for differentially private partition selection. This mechanism, which we call DP-SIPS, is very simple: it consists of iterating the naive algorithm over the data set multiple times, removing the released partitions from the data set while increasing the privacy budget at each step. This approach preserves the scalability benefits of the naive mechanism, yet its utility compares favorably to more complex approaches developed in prior work.
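The iterate-and-remove idea in this abstract can be sketched as follows. This is only a rough illustration, not the paper's mechanism: the function names, the geometric budget split controlled by `ratio`, the "one vote per user per round" rule, and the Laplace threshold (a textbook calibration for a sensitivity-1 count) are all assumptions made here for the example, and no privacy guarantee is claimed for the code as written.

```python
import math
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def naive_round(user_items, epsilon, delta, rng):
    """One pass of a simple thresholding step (assumed 'naive algorithm').

    Each user votes for a single remaining item, so one user changes
    exactly one count by at most 1.
    """
    counts = {}
    for items in user_items:
        if items:
            choice = rng.choice(sorted(items))
            counts[choice] = counts.get(choice, 0) + 1
    # Assumed threshold for Laplace noise on a sensitivity-1 count.
    threshold = 1.0 + math.log(1.0 / (2.0 * delta)) / epsilon
    return {
        item
        for item, count in counts.items()
        if count + laplace(1.0 / epsilon, rng) > threshold
    }


def iterative_selection(user_items, epsilon, delta, rounds=3, ratio=2.0, seed=0):
    """Repeat the naive round, dropping released items between rounds.

    The budget share grows geometrically from round to round
    (hypothetical schedule); the shares sum to (epsilon, delta).
    """
    rng = random.Random(seed)
    weights = [ratio ** i for i in range(rounds)]
    total = sum(weights)
    remaining = [set(items) for items in user_items]
    released = set()
    for weight in weights:
        share = weight / total
        found = naive_round(remaining, epsilon * share, delta * share, rng)
        released |= found
        # Users stop spending votes on items that are already public.
        remaining = [items - found for items in remaining]
    return released
```

For instance, `iterative_selection([{"a", "b"}] * 5000 + [{"rare"}], epsilon=1.0, delta=1e-5)` would be expected to release the two common items in the first round, leaving later rounds (with a larger budget share) for whatever remains.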

arXiv.org e-Print Archive

Differentially Private Heavy Hitter Detection using Federated Analytics

Authors: Chadha Karan; Chen Junye; Duchi John; Feldman Vitaly; Hashemi Hanieh; Javidbakht Omid; McMillan Audra; Talwar Kunal
Publication date: 21/07/2023

In this work, we study practical heuristics to improve the performance of prefix-tree based algorithms for differentially private heavy hitter detection. Our model assumes each user has multiple data points and the goal is to learn as many of the most frequent data points as possible across all users' data with aggregate and local differential privacy. We propose an adaptive hyperparameter tuning algorithm that improves the performance of the algorithm while satisfying computational, communication and privacy constraints. We explore the impact of different data-selection schemes as well as the impact of introducing deny lists during multiple runs of the algorithm. We test these improvements using extensive experimentation on the Reddit dataset (Caldas et al., 2018) on the task of learning the most frequent words.

arXiv.org e-Print Archive

Sparsity-Preserving Differentially Private Training of Large Embedding Models

Authors: Ghazi Badih; Huang Yangsibo; Kamath Pritish; Kumar Ravi; Manurangsi Pasin; Sinha Amer; Zhang Chiyuan
Publication date: 14/11/2023

As the use of large embedding models in recommendation systems and language applications increases, concerns over user data privacy have also risen. DP-SGD, a training algorithm that combines differential privacy with stochastic gradient descent, has been the workhorse in protecting user privacy without compromising model accuracy by much. However, applying DP-SGD naively to embedding models can destroy gradient sparsity, leading to reduced training efficiency. To address this issue, we present two new algorithms, DP-FEST and DP-AdaFEST, that preserve gradient sparsity during private training of large embedding models. Our algorithms achieve substantial reductions ($10^6 \times$) in gradient size, while maintaining comparable levels of accuracy, on benchmark real-world datasets. Comment: Neural Information Processing Systems (NeurIPS) 202
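To make the sparsity problem mentioned in this abstract concrete, the sketch below shows how a gradient that touches only a couple of embedding rows becomes fully dense once Gaussian noise is added to every row. It illustrates only the naive behaviour, not DP-FEST or DP-AdaFEST; the function names, the dict-of-rows representation, and the single-example setting are assumptions chosen for brevity.

```python
import math
import random


def clip_rows(sparse_grad, clip_norm):
    """Scale a sparse gradient {row_index: vector} to L2 norm <= clip_norm."""
    squared = sum(v * v for row in sparse_grad.values() for v in row)
    scale = min(1.0, clip_norm / math.sqrt(squared)) if squared > 0 else 1.0
    return {r: [v * scale for v in row] for r, row in sparse_grad.items()}


def naive_noisy_gradient(sparse_grad, vocab_size, dim, clip_norm,
                         noise_multiplier, rng):
    """Clip, then add Gaussian noise to every embedding row.

    Noise lands on rows the example never touched, so the result is dense.
    """
    clipped = clip_rows(sparse_grad, clip_norm)
    sigma = noise_multiplier * clip_norm
    dense = []
    for r in range(vocab_size):
        row = clipped.get(r, [0.0] * dim)
        dense.append([v + rng.gauss(0.0, sigma) for v in row])
    return dense


def count_nonzero_rows(dense_grad):
    # A row counts as non-zero if any of its entries is non-zero.
    return sum(1 for row in dense_grad if any(v != 0.0 for v in row))
```

With a vocabulary of 1000 rows and a gradient that touches 2 of them, `count_nonzero_rows` on the noisy result would report essentially all 1000 rows, which is the efficiency loss the abstract refers to.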

arXiv.org e-Print Archive

Lessons Learned: Surveying the Practicality of Differential Privacy in the Industry

Authors: Garrido Gonzalo Munilla; Liu Xiaoyuan; Matthes Florian; Song Dawn
Publication date: 07/11/2022

Since its introduction in 2006, differential privacy has emerged as a predominant statistical tool for quantifying data privacy in academic works. Yet despite the plethora of research and open-source utilities that have accompanied its rise, with limited exceptions, differential privacy has failed to achieve widespread adoption in the enterprise domain. Our study aims to shed light on the fundamental causes underlying this academic-industrial utilization gap through detailed interviews of 24 privacy practitioners across 9 major companies. We analyze the results of our survey to provide key findings and suggestions for companies striving to improve privacy protection in their data workflows and highlight the necessary and missing requirements of existing differential privacy tools, with the goal of guiding researchers working towards the broader adoption of differential privacy. Our findings indicate that analysts suffer from lengthy bureaucratic processes for requesting access to sensitive data, yet once granted, only scarcely-enforced privacy policies stand between rogue practitioners and misuse of private information. We thus argue that differential privacy can significantly improve the processes of requesting and conducting data exploration across silos, and conclude that with a few of the improvements suggested herein, the practical use of differential privacy across the enterprise is within striking distance

arXiv.org e-Print Archive

Data Protection in Big Data Analysis

Author: Shafieinejad Masoumeh
Publication venue: University of Waterloo
Publication date: 13/08/2021

"Big data" applications are collecting data from various aspects of our lives more and more every day. This fast transition has surpassed the development pace of data protection techniques and has resulted in innumerable data breaches and privacy violations. To prevent that, it is important to ensure the data is protected while at rest, in transit, in use, as well as during computation or dispersal. We investigate data protection issues in big data analysis in this thesis. We address a security or privacy concern in each phase of the data science pipeline. These phases are: i) data cleaning and preparation, ii) data management, iii) data modelling and analysis, and iv) data dissemination and visualization. In each of our contributions, we either address an existing problem and propose a resolving design (Chapters 2 and 4), or evaluate a current solution for a problem and analyze whether it meets the expected security/privacy goal (Chapters 3 and 5). Starting with privacy in data preparation, we investigate providing privacy in query analysis leveraging differential privacy techniques. We consider contextual outlier analysis and identify challenging queries that require releasing direct information about members of the dataset. We define a new sampling mechanism that allows releasing this information in a differentially private manner. Our second contribution is in the data modelling and analysis phase. We investigate the effect of data properties and application requirements on the successful implementation of privacy techniques. We in particular investigate the effects of data correlation on data protection guarantees of differential privacy. Our third contribution in this thesis is in the data management phase. The problem is to efficiently protect the data that is outsourced to a database management system (DBMS) provider while still allowing join operations. We provide an encryption method to minimize the leakage and to guarantee confidentiality for the data efficiently. Our last contribution is in the data dissemination phase. We inspect the ownership/contract protection for the prediction models trained on the data. We evaluate the backdoor-based watermarking in deep neural networks, which is an important and recent line of work in model ownership/contract protection.

University of Waterloo's Institutional Repository

Fully Privacy-Preserving Federated Representation Learning via Secure Embedding Aggregation

Authors: Jiaxiang Tang; Jinbao Zhu; Kai Zhang; Lichao Sun; Songze Li
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 18/06/2022

We consider a federated representation learning framework, where with the assistance of a central server, a group of $N$ distributed clients train collaboratively over their private data, for the representations (or embeddings) of a set of entities (e.g., users in a social network). Under this framework, for the key step of aggregating local embeddings trained at the clients in a private manner, we develop a secure embedding aggregation protocol named SecEA, which provides information-theoretical privacy guarantees for the set of entities and the corresponding embeddings at each client simultaneously, against a curious server and up to $T < N/2$ colluding clients. As the first step of SecEA, the federated learning system performs a private entity union, for each client to learn all the entities in the system without knowing which entities belong to which clients. In each aggregation round, the local embeddings are secretly shared among the clients using Lagrange interpolation, and then each client constructs coded queries to retrieve the aggregated embeddings for the intended entities. We perform comprehensive experiments on various representation learning tasks to evaluate the utility and efficiency of SecEA, and empirically demonstrate that compared with embedding aggregation protocols without (or with weaker) privacy guarantees, SecEA incurs negligible performance loss (within 5%); and the additional computation latency of SecEA diminishes for training deeper models on larger datasets.
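The secret-sharing step mentioned in this abstract can be illustrated with plain Shamir sharing over a prime field, reconstructed by Lagrange interpolation. This is a generic textbook sketch, not the SecEA protocol: it omits the private entity union and the coded queries, and the field modulus, function names, and integer-valued "embeddings" are assumptions made for the example (real embeddings would need a fixed-point encoding).

```python
import random

PRIME = 2**61 - 1  # Mersenne prime used as the field modulus (assumed choice)


def share(secret, n, t, rng):
    """Split `secret` into n points on a random degree-t polynomial.

    Any t shares are jointly uniform; t + 1 shares determine the secret.
    """
    coeffs = [secret % PRIME] + [rng.randrange(PRIME) for _ in range(t)]
    return [
        (x, sum(c * pow(x, k, PRIME) for k, c in enumerate(coeffs)) % PRIME)
        for x in range(1, n + 1)
    ]


def aggregate_shares(all_shares):
    """Add the clients' shares pointwise; the result shares the summed secret."""
    n = len(all_shares[0])
    return [
        (all_shares[0][i][0], sum(s[i][1] for s in all_shares) % PRIME)
        for i in range(n)
    ]


def reconstruct(points):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With `n = 5` clients and `t = 2` (so that `t < n / 2`), sharing three values, summing the shares with `aggregate_shares`, and calling `reconstruct` on any three of the summed points recovers the sum of the three values without any single share revealing an individual input.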

Cryptology ePrint Archive

Information-theoretic Approach to Design and Evaluate Privacy-preserving and Fair Frameworks for Continuous High-dimensional Data

Author: Alsulaimawi Zahir Ahmed Hussein
Publication venue: Oregon State University

Deep learning is becoming the latest trend in sensitive applications, such as healthcare, criminal justice, and finance. As these new applications emerge, adversaries are circumventing them. Further, there have been concerns about the possibility of bias and discrimination in predictive applications. In order to address these issues, we propose an information-theoretic approach to design a continuous high-dimensional data deep learning framework. We call this framework Gaussian privacy protector (GPP). Our proposed framework has many advantages: (1) it reduces the problem to the optimal compression of data about a measure of utility and privacy; (2) it can prevent adversaries from mining private information from the released data while simultaneously maximizing the amount of the utility's information revealed; (3) it adapts the idea of the information bottleneck (IB) based on the problem of revealing data, which is often sensitive; (4) it considers a privacy funnel (PF) problem inspired by utility data as the central part of data to be revealed; (5) using a similar framework, we show how to achieve fairness in classification; and (6) this work illustrates the feasibility of creating a centralized platform to support this framework over distributed datasets. We utilize variational lower bounds of mutual information approximation implemented as supervised learning using an adversarial training algorithm. We use three datasets: hand-written digits (MNIST), celeb faces attributes (CelebA), and human activities and postural transitions' recognition using smartphone data (HAPT-Recognition) to evaluate our algorithms. The experimental results on these datasets demonstrate that the proposed approach effectively removes private information from the datasets while allowing non-private information to be mined effectively.

ScholarsArchive@OSU

Incorporating Item Frequency for Differentially Private Set Union

Authors: Carvalho Ricardo Silva; Gondara Lovedeep Singh; Wang Ke
Publication venue: Association for the Advancement of Artificial Intelligence
Publication date: 28/06/2022

We study the problem of releasing the set union of users' items subject to differential privacy. Previous approaches consider only the set of items for each user as the input. We propose incorporating the item frequency, which is typically available in set union problems, to boost the utility of private mechanisms. However, using the global item frequency over all users would largely increase privacy loss. We propose to use the local item frequency of each user to approximate the global item frequency without incurring additional privacy loss. Local item frequency allows us to design greedy set union mechanisms that are differentially private, which is impossible for previous greedy proposals. Moreover, while all previous works have to use uniform sampling to limit the number of items each user would contribute to, our construction eliminates the sampling step completely and allows our mechanisms to consider all of the users' items. Finally, we propose to transfer the knowledge of the global item frequency from a public dataset into our mechanism, which further boosts utility even when the public and private datasets are from different domains. We evaluate the proposed methods on multiple real-life datasets.

Association for the Advancement of Artificial Intelligence: AAAI Publications