10,568 research outputs found
Supporting Regularized Logistic Regression Privately and Efficiently
As one of the most popular statistical and machine learning models, logistic
regression with regularization has found wide adoption in biomedicine, social
sciences, information technology, and so on. These domains often involve data
of human subjects that are contingent upon strict privacy regulations.
Increasing concerns over data privacy make it more and more difficult to
coordinate and conduct large-scale collaborative studies, which typically rely
on cross-institution data sharing and joint analysis. Our work here focuses on
safeguarding regularized logistic regression, a widely-used machine learning
model in various disciplines while at the same time has not been investigated
from a data security and privacy perspective. We consider a common use scenario
of multi-institution collaborative studies, such as in the form of research
consortia or networks as widely seen in genetics, epidemiology, social
sciences, etc. To make our privacy-enhancing solution practical, we demonstrate
a non-conventional and computationally efficient method leveraging distributing
computing and strong cryptography to provide comprehensive protection over
individual-level and summary data. Extensive empirical evaluation on several
studies validated the privacy guarantees, efficiency and scalability of our
proposal. We also discuss the practical implications of our solution for
large-scale studies and applications from various disciplines, including
genetic and biomedical studies, smart grid, network analysis, etc
Distributed Private Online Learning for Social Big Data Computing over Data Center Networks
With the rapid growth of Internet technologies, cloud computing and social
networks have become ubiquitous. An increasing number of people participate in
social networks and massive online social data are obtained. In order to
exploit knowledge from copious amounts of data obtained and predict social
behavior of users, we urge to realize data mining in social networks. Almost
all online websites use cloud services to effectively process the large scale
of social data, which are gathered from distributed data centers. These data
are so large-scale, high-dimension and widely distributed that we propose a
distributed sparse online algorithm to handle them. Additionally,
privacy-protection is an important point in social networks. We should not
compromise the privacy of individuals in networks, while these social data are
being learned for data mining. Thus we also consider the privacy problem in
this article. Our simulations shows that the appropriate sparsity of data would
enhance the performance of our algorithm and the privacy-preserving method does
not significantly hurt the performance of the proposed algorithm.Comment: ICC201
A New Method for Protecting Interrelated Time Series with Bayesian Prior Distributions and Synthetic Data
Organizations disseminate statistical summaries of administrative data via the Web for unrestricted public use. They balance the trade-off between confidentiality protection and inference quality. Recent developments in disclosure avoidance techniques include the incorporation of synthetic data, which capture the essential features of underlying data by releasing altered data generated from a posterior predictive distribution. The United States Census Bureau collects millions of interrelated time series micro-data that are hierarchical and contain many zeros and suppressions. Rule-based disclosure avoidance techniques often require the suppression of count data for small magnitudes and the modification of data based on a small number of entities. Motivated by this problem, we use zero-inflated extensions of Bayesian Generalized Linear Mixed Models (BGLMM) with privacy-preserving prior distributions to develop methods for protecting and releasing synthetic data from time series about thousands of small groups of entities without suppression based on the of magnitudes or number of entities. We find that as the prior distributions of the variance components in the BGLMM become more precise toward zero, confidentiality protection increases and inference quality deteriorates. We evaluate our methodology using a strict privacy measure, empirical differential privacy, and a newly defined risk measure, Probability of Range Identification (PoRI), which directly measures attribute disclosure risk. We illustrate our results with the U.S. Census Bureau’s Quarterly Workforce Indicators
Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning
Deep Learning has recently become hugely popular in machine learning,
providing significant improvements in classification accuracy in the presence
of highly-structured and large databases.
Researchers have also considered privacy implications of deep learning.
Models are typically trained in a centralized manner with all the data being
processed by the same training algorithm. If the data is a collection of users'
private data, including habits, personal pictures, geographical positions,
interests, and more, the centralized server will have access to sensitive
information that could potentially be mishandled. To tackle this problem,
collaborative deep learning models have recently been proposed where parties
locally train their deep learning structures and only share a subset of the
parameters in the attempt to keep their respective training sets private.
Parameters can also be obfuscated via differential privacy (DP) to make
information extraction even more challenging, as proposed by Shokri and
Shmatikov at CCS'15.
Unfortunately, we show that any privacy-preserving collaborative deep
learning is susceptible to a powerful attack that we devise in this paper. In
particular, we show that a distributed, federated, or decentralized deep
learning approach is fundamentally broken and does not protect the training
sets of honest participants. The attack we developed exploits the real-time
nature of the learning process that allows the adversary to train a Generative
Adversarial Network (GAN) that generates prototypical samples of the targeted
training set that was meant to be private (the samples generated by the GAN are
intended to come from the same distribution as the training data).
Interestingly, we show that record-level DP applied to the shared parameters of
the model, as suggested in previous work, is ineffective (i.e., record-level DP
is not designed to address our attack).Comment: ACM CCS'17, 16 pages, 18 figure
Exploring Machine Learning Models for Federated Learning: A Review of Approaches, Performance, and Limitations
In the growing world of artificial intelligence, federated learning is a
distributed learning framework enhanced to preserve the privacy of individuals'
data. Federated learning lays the groundwork for collaborative research in
areas where the data is sensitive. Federated learning has several implications
for real-world problems. In times of crisis, when real-time decision-making is
critical, federated learning allows multiple entities to work collectively
without sharing sensitive data. This distributed approach enables us to
leverage information from multiple sources and gain more diverse insights. This
paper is a systematic review of the literature on privacy-preserving machine
learning in the last few years based on the Preferred Reporting Items for
Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Specifically, we have
presented an extensive review of supervised/unsupervised machine learning
algorithms, ensemble methods, meta-heuristic approaches, blockchain technology,
and reinforcement learning used in the framework of federated learning, in
addition to an overview of federated learning applications. This paper reviews
the literature on the components of federated learning and its applications in
the last few years. The main purpose of this work is to provide researchers and
practitioners with a comprehensive overview of federated learning from the
machine learning point of view. A discussion of some open problems and future
research directions in federated learning is also provided
- …