Detecting Communities under Differential Privacy
Complex networks usually exhibit community structure, with groups of nodes sharing many links with one another and relatively few with the rest of the network. This feature captures valuable information about the organization and even the evolution of the network. Over the last decade, a great number of community detection algorithms have been proposed to deal with increasingly complex networks. However, the problem of doing this in a private manner is rarely considered. In this paper, we solve this problem under differential privacy, a prominent privacy concept for releasing private data. We analyze the major challenges behind the problem and propose several schemes to tackle them from two perspectives: input perturbation and algorithm perturbation. We choose the Louvain method as the back-end community detection for the input perturbation schemes and propose LouvainDP, which runs the Louvain algorithm on a noisy super-graph. For algorithm perturbation, we design ModDivisive, which uses the exponential mechanism with modularity as the score. We have thoroughly evaluated our techniques on real graphs of different sizes and verified that they outperform the state of the art.
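To make the input-perturbation idea concrete, here is a minimal sketch, not the authors' exact LouvainDP: nodes are hashed into supernodes, super-edge counts are perturbed with Laplace noise, and ordinary Louvain runs on the resulting noisy super-graph. The group size, noise scale, and thresholding rule below are illustrative assumptions.

```python
import numpy as np
import networkx as nx

def louvain_on_noisy_supergraph(G, k=8, epsilon=1.0, seed=0):
    rng = np.random.default_rng(seed)
    nodes = list(G.nodes())
    order = rng.permutation(len(nodes))
    # Randomly assign nodes to supernodes of size ~k.
    group = {nodes[j]: i // k for i, j in enumerate(order)}
    n_groups = max(group.values()) + 1

    # Super-edge counts: adding/removing one edge of G changes exactly
    # one count by 1, so each count has sensitivity 1 under edge-level DP.
    counts = np.zeros((n_groups, n_groups))
    for u, v in G.edges():
        a, b = sorted((group[u], group[v]))
        counts[a, b] += 1

    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)

    # Keep only super-edges whose noisy weight is clearly positive
    # (the 0.5 cutoff is an assumption made for this sketch).
    S = nx.Graph()
    S.add_nodes_from(range(n_groups))
    for a in range(n_groups):
        for b in range(a, n_groups):
            if noisy[a, b] > 0.5:
                S.add_edge(a, b, weight=noisy[a, b])

    # Louvain runs on already-perturbed data, so as post-processing it
    # needs no further noise.
    return nx.community.louvain_communities(S, weight="weight", seed=seed)
```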
Introduction and Comparison of Novel Decentral Learning Schemes with Multiple Data Pools for Privacy-Preserving ECG Classification
Artificial intelligence and machine learning have led to prominent and spectacular innovations in various scenarios. Applications in medicine, however, can be challenging due to privacy concerns and strict legal regulations. Methods that centralize knowledge instead of data could address this issue. In this work, 6 different decentralized machine learning algorithms are applied to 12-lead ECG classification and compared to conventional, centralized machine learning. The results show that state-of-the-art federated learning leads to reasonable losses of classification performance compared to a standard, central model (-0.054 AUROC) while providing a significantly higher level of privacy. A proposed weighted variant of federated learning (-0.049 AUROC) and an ensemble (-0.035 AUROC) outperformed the standard federated learning algorithm. Overall, considering multiple metrics, the novel batch-wise sequential learning scheme performed best (-0.036 AUROC relative to the baseline). Although the technical aspects of implementing them in a real-world application must be carefully considered, the described algorithms constitute a way forward towards privacy-preserving AI in medicine.
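As a rough illustration of the weighted variant's core step, here is a minimal federated-averaging sketch in which each client's parameters are combined with a data-dependent coefficient. The weighting signal (sample counts, validation AUROC, etc.) is an assumption, not necessarily the paper's exact scheme.

```python
import numpy as np

def weighted_fed_avg(client_params, client_scores):
    """client_params: one list of np.ndarray layers per client;
    client_scores: one non-negative weight per client (an assumed
    signal such as local sample count or validation AUROC)."""
    scores = np.asarray(client_scores, dtype=float)
    coeffs = scores / scores.sum()  # normalize to a convex combination
    n_layers = len(client_params[0])
    return [
        sum(c * p[layer] for c, p in zip(coeffs, client_params))
        for layer in range(n_layers)
    ]

# Example: three clients, two layers each; only parameters, never raw
# ECG data, leave the clients.
rng = np.random.default_rng(0)
clients = [[rng.normal(size=(4, 4)), rng.normal(size=4)] for _ in range(3)]
global_model = weighted_fed_avg(clients, client_scores=[120, 300, 80])
```

In a full training loop the server would broadcast the averaged parameters back to the clients and repeat for several rounds.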
Differentially Private Sparse Vectors with Low Error, Optimal Space, and Fast Access
Representing a sparse histogram, or more generally a sparse vector, is a
fundamental task in differential privacy. An ideal solution would use space
close to information-theoretical lower bounds, have an error distribution that
depends optimally on the desired privacy level, and allow fast random access to
entries in the vector. However, existing approaches have only achieved two of
these three goals.
In this paper we introduce the Approximate Laplace Projection (ALP) mechanism for approximating k-sparse vectors. This mechanism is shown to simultaneously have information-theoretically optimal space (up to constant factors), fast access to vector entries, and error of the same magnitude as the Laplace mechanism applied to dense vectors. A key new technique is a unary representation of small integers, which is shown to be robust against "randomized response" noise. This representation is combined with hashing, in the spirit of Bloom filters, to obtain a space-efficient, differentially private representation. Our theoretical performance bounds are complemented by simulations which show that the constant factors on the main performance parameters are quite small, suggesting the practicality of the technique.
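The robustness of a unary representation to randomized-response noise can be demonstrated in a few lines: encode a small integer as unary bits, flip each bit independently, and debias the count of ones. The parameters below are illustrative, and the hashing and the rest of the ALP machinery are omitted.

```python
import numpy as np

def unary_encode(x, u):
    # Integer 0 <= x <= u becomes u bits, the first x of them set.
    # Changing x by 1 changes a single bit, which is what makes
    # per-bit randomized response applicable.
    bits = np.zeros(u, dtype=int)
    bits[:x] = 1
    return bits

def randomized_response(bits, p, rng):
    # Flip every bit independently with probability p.
    flips = rng.random(bits.shape) < p
    return np.where(flips, 1 - bits, bits)

def debiased_estimate(noisy_bits, p):
    # E[#ones] = x(1 - p) + (u - x) * p; solve for x.
    u = noisy_bits.size
    return (noisy_bits.sum() - u * p) / (1 - 2 * p)

rng = np.random.default_rng(0)
x, u, p = 7, 16, 0.25  # illustrative values, not the paper's parameters
est = [debiased_estimate(randomized_response(unary_encode(x, u), p, rng), p)
       for _ in range(1000)]
print(np.mean(est))  # concentrates near the true value 7
```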
PrivLava: Synthesizing Relational Data with Foreign Keys under Differential Privacy
Answering database queries while preserving privacy is an important problem
that has attracted considerable research attention in recent years. A canonical
approach to this problem is to use synthetic data. That is, we replace the
input database R with a synthetic database R* that preserves the
characteristics of R, and use R* to answer queries. Existing solutions for
relational data synthesis, however, either fail to provide strong privacy
protection, or assume that R contains a single relation. In addition, it is
challenging to extend the existing single-relation solutions to the case of
multiple relations, because they are unable to model the complex correlations
induced by the foreign keys. Therefore, multi-relational data synthesis with
strong privacy guarantees is an open problem. In this paper, we address the
above open problem by proposing PrivLava, the first solution for synthesizing
relational data with foreign keys under differential privacy, a rigorous
privacy framework widely adopted in both academia and industry. The key idea of
PrivLava is to model the data distribution in R using graphical models, with
latent variables included to capture the inter-relational correlations caused
by foreign keys. We show that PrivLava supports arbitrary foreign key
references that form a directed acyclic graph, and is able to tackle the common
case when R contains a mixture of public and private relations. Extensive
experiments on census data sets and the TPC-H benchmark demonstrate that
PrivLava significantly outperforms its competitors in terms of the accuracy of
aggregate queries processed on the synthetic data.
Comment: This is an extended version of a SIGMOD 2023 paper.
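The following hypothetical miniature illustrates the latent-variable idea on an invented two-relation schema (Household and Person linked by a foreign key): a latent household type drives both the parent row's attributes and the child rows referencing it, which is how inter-relational correlation can be captured. In PrivLava the corresponding distributions would be learned under differential privacy; here they are hard-coded for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented schema: Household(hid, region) <- Person(pid, hid, income).
N_TYPES = 3
type_prior = np.array([0.5, 0.3, 0.2])        # P(z): latent household type
region_given_z = np.array([[0.7, 0.3],         # P(region | z)
                           [0.4, 0.6],
                           [0.2, 0.8]])
income_given_z = np.array([[0.6, 0.3, 0.1],    # P(income | z)
                           [0.3, 0.4, 0.3],
                           [0.1, 0.3, 0.6]])
size_given_z = np.array([1, 2, 4])             # persons per household type

households, persons = [], []
for hid in range(1000):
    z = rng.choice(N_TYPES, p=type_prior)      # latent, never released
    households.append((hid, rng.choice(2, p=region_given_z[z])))
    # Every person in the household is correlated with its region
    # through the shared latent type z.
    for _ in range(size_given_z[z]):
        persons.append((len(persons), hid, rng.choice(3, p=income_given_z[z])))
```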
SoK: Chasing Accuracy and Privacy, and Catching Both in Differentially Private Histogram Publication
Histograms and synthetic data are of key importance in data analysis.
However, researchers have shown that even aggregated data such as histograms,
containing no obvious sensitive attributes, can result in privacy leakage. To
enable data analysis, a strong notion of privacy is required to avoid risking
unintended privacy violations.
Such a strong notion of privacy is differential privacy, a statistical notion of privacy that makes privacy leakage quantifiable. The caveat regarding differential privacy is that while it has strong guarantees for privacy, privacy comes at a cost of accuracy. Despite this trade-off being a central and important issue in the adoption of differential privacy, there exists a gap in the literature in providing an understanding of the trade-off and how to address it appropriately.
Through a systematic literature review (SLR), we investigate the state of the art in accuracy-improving differentially private algorithms for histogram and synthetic data publishing. Our contribution is two-fold: 1) we identify trends and connections in the contributions to the field of differential privacy for histograms and synthetic data, and 2) we provide an understanding of the privacy/accuracy trade-off challenge by crystallizing different dimensions of accuracy improvement. Accordingly, we position and visualize the ideas in relation to each other and to external work, and deconstruct each algorithm to examine its building blocks separately, with the aim of pinpointing which dimension of accuracy improvement each technique/approach targets. Hence, this systematization of knowledge (SoK) provides an understanding of the dimensions in which, and the ways in which, accuracy improvement can be pursued without sacrificing privacy.
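The trade-off at the heart of this survey is easy to exhibit with the baseline Laplace mechanism for histograms: assuming each user contributes to one bucket, each count has sensitivity 1, the noise scale is 1/ε, and the expected absolute error per bucket is exactly 1/ε. A minimal illustration with arbitrary ε values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Expected |Laplace(0, 1/eps)| error per histogram bucket is 1/eps,
# so accuracy degrades as the privacy parameter eps shrinks.
for eps in (0.1, 0.5, 1.0, 2.0):
    errs = np.abs(rng.laplace(scale=1.0 / eps, size=100_000))
    print(f"epsilon={eps:4.1f}  mean |error| ~ {errs.mean():6.2f}"
          f"  (theory: {1.0 / eps:.2f})")
```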
Distributed, Private, Sparse Histograms in the Two-Server Model
We consider the computation of sparse, (ε, δ)-differentially private (DP) histograms in the two-server model of secure multi-party computation (MPC), which has recently gained traction in the context of privacy-preserving measurements of aggregate user data.
We introduce protocols that enable two semi-honest non-colluding servers to compute histograms over the data held by multiple users, while only learning a private view of the data. Our solution achieves the same asymptotic ℓ∞-error of O(log(1/δ)/ε) as in the central model of DP, but \emph{without} relying on a trusted curator. The server communication and computation costs of our protocol are independent of the number of histogram buckets, and are linear in the number of users, while the client cost is independent of the number of users, ε, and δ.
Its linear dependence on the number of users lets our protocol scale well, which we confirm using microbenchmarks: for a billion users, ε = 0.5, and δ = 10⁻¹¹, the per-user cost of our protocol is only 1.08 ms of server computation and 339 bytes of communication. In contrast, a baseline protocol using garbled circuits only allows up to 10⁶ users, where it requires 600 KB communication per user.
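For intuition about the two-server model, here is a toy sketch of additive secret sharing, a standard building block for this kind of aggregation: each client splits a one-hot contribution into two shares that are individually uniformly random, so neither server learns anything on its own. This shows only the sharing step; the paper's DP noise generation and its sparse representation (which removes the dependence on the number of buckets) are omitted, and the modulus is an assumption.

```python
import numpy as np

MOD = 2**31          # shares live in Z_MOD (choice is illustrative)
NUM_BUCKETS = 10     # a dense toy histogram, unlike the paper
rng = np.random.default_rng(0)

def share(bucket):
    # One-hot contribution split into two additive shares mod MOD.
    x = np.zeros(NUM_BUCKETS, dtype=np.int64)
    x[bucket] = 1
    s0 = rng.integers(0, MOD, NUM_BUCKETS, dtype=np.int64)
    s1 = (x - s0) % MOD
    return s0, s1    # s0 + s1 = x (mod MOD); each alone is uniform

server0 = np.zeros(NUM_BUCKETS, dtype=np.int64)
server1 = np.zeros(NUM_BUCKETS, dtype=np.int64)
for bucket in rng.integers(0, NUM_BUCKETS, size=1000):
    s0, s1 = share(int(bucket))
    server0 = (server0 + s0) % MOD   # each server sees only its shares
    server1 = (server1 + s1) % MOD

histogram = (server0 + server1) % MOD  # only the aggregate is revealed
```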