Differentially Private Distributed Learning for Language Modeling Tasks
One of the big challenges in machine learning applications is that training
data can be different from the real-world data faced by the algorithm. In
language modeling, users' language (e.g. in private messaging) could change in
a year and be completely different from what we observe in publicly available
data. At the same time, public data can be used for obtaining general knowledge
(i.e. general model of English). We study approaches to distributed fine-tuning
of a general model on user private data with the additional requirements of
maintaining quality on the general data and minimizing communication
costs. We propose a novel technique that significantly improves prediction
quality on users' language compared to a general model and outperforms gradient
compression methods in terms of communication efficiency. The proposed
procedure is fast and leads to an almost 70% perplexity reduction and 8.7
percentage point improvement in keystroke saving rate on informal English
texts. We also show that our approach is applicable beyond language modeling.
Finally, we propose an experimental framework for evaluating the differential
privacy of distributed training of language models and show that our approach
provides good privacy guarantees.
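A minimal sketch of the kind of procedure described above, assuming the fine-tuned user model is simply interpolated with the general model to retain general-domain quality (the function and parameter names are illustrative, not the paper's):

    import numpy as np

    def fine_tune_and_mix(general_weights, user_grads, lr=0.1, lam=0.5):
        # Take one gradient step on the user's private data, then interpolate
        # with the general model so quality on general data is preserved.
        # `lam` (assumed knob) trades user adaptation against the general model.
        user_weights = {k: w - lr * user_grads[k] for k, w in general_weights.items()}
        return {k: lam * user_weights[k] + (1.0 - lam) * general_weights[k]
                for k in general_weights}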
dpUGC: Learn Differentially Private Representation for User Generated Contents
This paper first proposes a simple yet efficient generalized approach to
applying differential privacy to text representations (i.e., word embeddings).
Building on it, we propose a user-level approach to learning personalized
differentially private word embedding models on user generated content (UGC).
To the best of our knowledge, this is the first work to learn user-level
differentially private word embedding models from text for sharing. The
proposed approaches protect individuals from re-identification and, in
particular, provide a better trade-off between privacy and data utility on UGC
data intended for sharing. The experimental results show that the trained
embedding models are applicable to classic text analysis tasks (e.g.,
regression). Moreover, the proposed approaches to learning differentially
private embedding models are both framework- and data-independent, which
facilitates deployment and sharing. The source code is
available at https://github.com/sonvx/dpText.
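The abstract does not spell out the mechanism, but a common way to release a differentially private embedding matrix is the Gaussian mechanism; a hedged sketch (clip rows to bound sensitivity, then add calibrated noise) is:

    import numpy as np

    def dp_release_embeddings(emb, clip_norm=1.0, epsilon=1.0, delta=1e-5, seed=0):
        # Clip each embedding row so its L2 norm (assumed per-record sensitivity)
        # is at most clip_norm, then add Gaussian noise calibrated by the
        # standard (epsilon, delta) formula before sharing the matrix.
        rng = np.random.default_rng(seed)
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        clipped = emb * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
        sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
        return clipped + rng.normal(0.0, sigma, size=emb.shape)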
Deep Learning Towards Mobile Applications
Recent years have witnessed an explosive growth of mobile devices. Mobile
devices are permeating every aspect of our daily lives. With the increasing
usage of mobile devices and intelligent applications, there is a soaring demand
for mobile applications with machine learning services. Inspired by the
tremendous success achieved by deep learning in many machine learning tasks, it
becomes a natural trend to push deep learning towards mobile applications.
However, there exist many challenges to realize deep learning in mobile
applications, including the contradiction between the miniature nature of
mobile devices and the resource requirement of deep neural networks, the
privacy and security concerns about individuals' data, and so on. To resolve
these challenges, during the past few years, great leaps have been made in this
area. In this paper, we provide an overview of the current challenges and
representative achievements about pushing deep learning on mobile devices from
three aspects: training with mobile data, efficient inference on mobile
devices, and applications of mobile deep learning. The former two aspects cover
the primary tasks of deep learning. We then go through two of our recent
applications that use data collected by mobile devices to infer mood
disturbance and identify users. Finally, we conclude the paper with a
discussion of the future of this area.
Comment: Conference version accepted by ICDCS'1
Not Just Privacy: Improving Performance of Private Deep Learning in Mobile Cloud
The increasing demand for on-device deep learning services calls for a highly
efficient manner to deploy deep neural networks (DNNs) on mobile devices with
limited capacity. The cloud-based solution is a promising approach to enabling
deep learning applications on mobile devices where the large portions of a DNN
are offloaded to the cloud. However, revealing data to the cloud leads to
potential privacy risk. To benefit from the cloud data center without the
privacy risk, we design, evaluate, and implement a cloud-based framework ARDEN
which partitions the DNN across mobile devices and cloud data centers. A simple
data transformation is performed on the mobile device, while the
resource-hungry training and the complex inference rely on the cloud data
center. To protect the sensitive information, a lightweight privacy-preserving
mechanism consisting of arbitrary data nullification and random noise addition
is introduced, which provides a strong privacy guarantee, and a rigorous privacy
budget analysis is given. However, the private perturbation of the original
data inevitably has a negative impact on the performance of further inference
on the cloud side. To mitigate this influence, we propose a noisy training
method to enhance the cloud-side network robustness to perturbed data. Through
the sophisticated design, ARDEN can not only preserve privacy but also improve
the inference performance. To validate the proposed ARDEN, a series of
experiments based on three image datasets and a real mobile application are
conducted. The experimental results demonstrate the effectiveness of ARDEN.
Finally, we implement ARDEN on a demo system to verify its practicality.
Comment: Conference version accepted by KDD'1
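A sketch of the on-device step described above, under the assumption that "arbitrary data nullification and random noise addition" amounts to masking a random fraction of the intermediate representation and adding Gaussian noise before it is sent to the cloud:

    import numpy as np

    def arden_style_perturb(x, nullify_frac=0.3, sigma=0.1, seed=0):
        # Randomly nullify a fraction of the representation, then add noise
        # so only the perturbed data leaves the mobile device for the cloud.
        rng = np.random.default_rng(seed)
        mask = (rng.random(x.shape) >= nullify_frac).astype(x.dtype)
        return x * mask + rng.normal(0.0, sigma, size=x.shape)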
Learning Differentially Private Recurrent Language Models
We demonstrate that it is possible to train large recurrent language models
with user-level differential privacy guarantees with only a negligible cost in
predictive accuracy. Our work builds on recent advances in the training of deep
networks on user-partitioned data and privacy accounting for stochastic
gradient descent. In particular, we add user-level privacy protection to the
federated averaging algorithm, which makes "large step" updates from user-level
data. Our work demonstrates that given a dataset with a sufficiently large
number of users (a requirement easily met by even small internet-scale
datasets), achieving differential privacy comes at the cost of increased
computation, rather than decreased utility as in most prior work. We find
that our private LSTM language models are quantitatively and qualitatively
similar to un-noised models when trained on a large dataset.
Comment: Camera-ready ICLR 2018 version, minor edits from previous
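A minimal sketch of user-level DP federated averaging in the spirit of this abstract, assuming each user's update arrives as a flattened model delta; the clip norm and noise multiplier are illustrative, and a real system would also track the privacy budget with a moments accountant:

    import numpy as np

    def dp_federated_average(global_w, user_deltas, clip=1.0, noise_mult=1.0, seed=0):
        # Clip each user's model delta to L2 norm `clip`, average the clipped
        # deltas, and add Gaussian noise scaled to the per-user sensitivity.
        rng = np.random.default_rng(seed)
        n = len(user_deltas)
        clipped = [d * min(1.0, clip / max(np.linalg.norm(d), 1e-12))
                   for d in user_deltas]
        noisy_avg = np.mean(clipped, axis=0) + rng.normal(
            0.0, noise_mult * clip / n, size=global_w.shape)
        return global_w + noisy_avg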
Multi-Institutional Deep Learning Modeling Without Sharing Patient Data: A Feasibility Study on Brain Tumor Segmentation
Deep learning models for semantic segmentation of images require large
amounts of data. In the medical imaging domain, acquiring sufficient data is a
significant challenge. Labeling medical image data requires expert knowledge.
Collaboration between institutions could address this challenge, but moving
medical data to a centralized location faces various legal, privacy, technical,
and data-ownership challenges, especially among international institutions. In
this study, we introduce the first use of federated learning for
multi-institutional collaboration, enabling deep learning modeling without
sharing patient data. Our quantitative results demonstrate that the performance
of federated semantic segmentation models (Dice=0.852) on multimodal brain
scans is similar to that of models trained by sharing data (Dice=0.862). We
compare federated learning with two alternative collaborative learning methods
and find that they fail to match the performance of federated learning.
Comment: MICCAI, Brain Lesion (BrainLes) workshop, September 16, 2018, Granada, Spain
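For reference, the Dice scores quoted above can be computed from binary segmentation masks as follows (a standard definition, not code from the paper):

    import numpy as np

    def dice_score(pred, target, eps=1e-7):
        # Dice coefficient between two binary masks:
        # 2 * |intersection| / (|pred| + |target|).
        pred, target = pred.astype(bool), target.astype(bool)
        intersection = np.logical_and(pred, target).sum()
        return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)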
Differential Privacy Has Disparate Impact on Model Accuracy
Differential privacy (DP) is a popular mechanism for training machine
learning models with bounded leakage about the presence of specific points in
the training data. The cost of differential privacy is a reduction in the
model's accuracy. We demonstrate that in neural networks trained using
differentially private stochastic gradient descent (DP-SGD), this cost is not
borne equally: accuracy of DP models drops much more for the underrepresented
classes and subgroups.
For example, a gender classification model trained using DP-SGD exhibits much
lower accuracy for black faces than for white faces. Critically, this gap is
bigger in the DP model than in the non-DP model, i.e., if the original model is
unfair, the unfairness becomes worse once DP is applied. We demonstrate this
effect for a variety of tasks and models, including sentiment analysis of text
and image classification. We then explain why DP training mechanisms such as
gradient clipping and noise addition have a disproportionate effect on
underrepresented and more complex subgroups, resulting in a disparate reduction
in model accuracy.
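A minimal DP-SGD step, sketched to show where the disparate impact enters: per-example clipping caps exactly the large gradients that rare subgroups tend to produce, and the added noise drowns out their small remaining signal (hyperparameters are illustrative):

    import numpy as np

    def dp_sgd_step(w, per_example_grads, lr=0.1, clip=1.0, noise_mult=1.0, seed=0):
        # Clip each per-example gradient, average, and add Gaussian noise.
        rng = np.random.default_rng(seed)
        n = len(per_example_grads)
        clipped = [g * min(1.0, clip / max(np.linalg.norm(g), 1e-12))
                   for g in per_example_grads]
        noisy_mean = np.mean(clipped, axis=0) + rng.normal(
            0.0, noise_mult * clip / n, size=w.shape)
        return w - lr * noisy_mean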
Protection Against Reconstruction and Its Applications in Private Federated Learning
In large-scale statistical learning, data collection and model fitting are
moving increasingly toward peripheral devices---phones, watches, fitness
trackers---away from centralized data collection. Concomitant with this rise in
decentralized data are increasing challenges of maintaining privacy while
allowing enough information to fit accurate, useful statistical models. This
motivates local notions of privacy---most significantly, local differential
privacy, which provides strong protections against sensitive data
disclosures---where data is obfuscated before a statistician or learner can
even observe it, providing strong protections to individuals' data. Yet local
privacy as traditionally employed may prove too stringent for practical use,
especially in modern high-dimensional statistical and machine learning
problems. Consequently, we revisit the types of disclosures and adversaries
against which we provide protections, considering adversaries with limited
prior information and ensuring that, with high probability, they cannot
reconstruct an individual's data within useful tolerances. By reconceptualizing
these protections, we allow more useful data release---large privacy parameters
in local differential privacy---and we design new (minimax) optimal locally
differentially private mechanisms for statistical learning problems for
all privacy levels. We thus present practicable approaches to
large-scale locally private model training that were previously impossible,
showing theoretically and empirically that we can fit large-scale image
classification and language models with little degradation in utility.
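A minimal local-privacy sketch in the spirit of this abstract: each device obfuscates its own vector before any statistician or learner observes it, with a possibly large privacy parameter epsilon; the Laplace calibration below assumes the clipped vector has bounded L1 norm:

    import numpy as np

    def local_dp_report(x, clip=1.0, epsilon=5.0, seed=0):
        # Clip to L1 norm `clip` (so any two inputs differ by at most 2 * clip
        # in L1), then add Laplace noise on-device before the value is released.
        rng = np.random.default_rng(seed)
        x = x * min(1.0, clip / max(np.linalg.norm(x, ord=1), 1e-12))
        return x + rng.laplace(0.0, 2.0 * clip / epsilon, size=x.shape)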
An Overview of Privacy in Machine Learning
Over the past few years, providers such as Google, Microsoft, and Amazon have
started to provide customers with access to software interfaces allowing them
to easily embed machine learning tasks into their applications. Overall,
organizations can now use Machine Learning as a Service (MLaaS) engines to
outsource complex tasks, e.g., training classifiers, performing predictions,
clustering, etc. They can also let others query models trained on their data.
Naturally, this approach can also be used (and is often advocated) in other
contexts, including government collaborations, citizen science projects, and
business-to-business partnerships. However, if malicious users were able to
recover data used to train these models, the resulting information leakage
would create serious issues. Likewise, if the inner parameters of the model are
considered proprietary information, then access to the model should not allow
an adversary to learn such parameters. In this document, we set out to review
privacy challenges in this space, providing a systematic review of the relevant
research literature and exploring possible countermeasures. More
specifically, we provide ample background information on relevant concepts
around machine learning and privacy. Then, we discuss possible adversarial
models and settings, cover a wide range of attacks that relate to private
and/or sensitive information leakage, and review recent results attempting to
defend against such attacks. Finally, we conclude with a list of open problems
that require more work, including the need for better evaluations, more
targeted defenses, and the study of the relation to policy and data protection
efforts.
RON-Gauss: Enhancing Utility in Non-Interactive Private Data Release
A key challenge facing the design of differential privacy in the
non-interactive setting is to maintain the utility of the released data. To
overcome this challenge, we utilize the Diaconis-Freedman-Meckes (DFM) effect,
which states that most projections of high-dimensional data are nearly
Gaussian. Hence, we propose the RON-Gauss model that leverages the novel
combination of dimensionality reduction via random orthonormal (RON) projection
and the Gaussian generative model for synthesizing differentially-private data.
We analyze how RON-Gauss benefits from the DFM effect, and present multiple
algorithms for a range of machine learning applications, including both
unsupervised and supervised learning. Furthermore, we rigorously prove that (a)
our algorithms satisfy the strong ε-differential privacy guarantee,
and (b) RON projection can lower the level of perturbation required for
differential privacy. Finally, we illustrate the effectiveness of RON-Gauss
under three common machine learning applications -- clustering, classification,
and regression -- on three large real-world datasets. Our empirical results
show that (a) RON-Gauss outperforms previous approaches by up to an order of
magnitude, and (b) loss in utility compared to the non-private real data is
small. Thus, RON-Gauss can serve as a key enabler for real-world deployment of
privacy-preserving data release.
Comment: Appears in PoPETS 2019.
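An illustrative, much-simplified version of the RON-Gauss pipeline: project the data with a random orthonormal matrix, fit a noisy Gaussian in the projected space, and sample synthetic records; the noise scales here are placeholders rather than the paper's calibrated sensitivities:

    import numpy as np

    def ron_gauss_synthesize(X, d_proj=10, epsilon=1.0, n_synth=1000, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        Q, _ = np.linalg.qr(rng.normal(size=(d, d_proj)))  # random orthonormal projection
        Z = X @ Q                                           # nearly Gaussian by the DFM effect
        mu = Z.mean(axis=0) + rng.laplace(0.0, 1.0 / (n * epsilon), size=d_proj)
        cov = np.cov(Z, rowvar=False) + np.diag(
            np.abs(rng.laplace(0.0, 1.0 / (n * epsilon), size=d_proj)))
        synth = rng.multivariate_normal(mu, cov, size=n_synth)
        return synth @ Q.T  # map synthetic records back to the original feature space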