2,220 research outputs found
On Lightweight Privacy-Preserving Collaborative Learning for IoT Objects
The Internet of Things (IoT) will be a main data generation infrastructure
for achieving better system intelligence. This paper considers the design and
implementation of a practical privacy-preserving collaborative learning scheme,
in which a curious learning coordinator trains a better machine learning model
based on the data samples contributed by a number of IoT objects, while the
confidentiality of the raw forms of the training data is protected against the
coordinator. Existing distributed machine learning and data encryption
approaches incur significant computation and communication overhead, rendering
them ill-suited for resource-constrained IoT objects. We study an approach that
applies independent Gaussian random projection at each IoT object to obfuscate
data and trains a deep neural network at the coordinator based on the projected
data from the IoT objects. This approach introduces light computation overhead
to the IoT objects and moves most workload to the coordinator that can have
sufficient computing resources. Although the independent projections performed
by the IoT objects address the potential collusion between the curious
coordinator and some compromised IoT objects, they significantly increase the
complexity of the projected data. In this paper, we leverage the superior
learning capability of deep learning in capturing sophisticated patterns to
maintain good learning performance. Extensive comparative evaluation shows that
this approach outperforms other lightweight approaches that apply additive
noisification for differential privacy and/or support vector machines for
learning in the applications with light data pattern complexities.Comment: 12 pages,IOTDI 201
A review of Generative Adversarial Networks for Electronic Health Records: applications, evaluation measures and data sources
Electronic Health Records (EHRs) are a valuable asset to facilitate clinical
research and point of care applications; however, many challenges such as data
privacy concerns impede its optimal utilization. Deep generative models,
particularly, Generative Adversarial Networks (GANs) show great promise in
generating synthetic EHR data by learning underlying data distributions while
achieving excellent performance and addressing these challenges. This work aims
to review the major developments in various applications of GANs for EHRs and
provides an overview of the proposed methodologies. For this purpose, we
combine perspectives from healthcare applications and machine learning
techniques in terms of source datasets and the fidelity and privacy evaluation
of the generated synthetic datasets. We also compile a list of the metrics and
datasets used by the reviewed works, which can be utilized as benchmarks for
future research in the field. We conclude by discussing challenges in GANs for
EHRs development and proposing recommended practices. We hope that this work
motivates novel research development directions in the intersection of
healthcare and machine learning
A survey of generative adversarial networks for synthesizing structured electronic health records
Electronic Health Records (EHRs) are a valuable asset to facilitate clinical research and point of care applications; however, many challenges such as data privacy concerns impede its optimal utilization. Deep generative models, particularly, Generative Adversarial Networks (GANs) show great promise in generating synthetic EHR data by learning underlying data distributions while achieving excellent performance and addressing these challenges. This work aims to survey the major developments in various applications of GANs for EHRs and provides an overview of the proposed methodologies. For this purpose, we combine perspectives from healthcare applications and machine learning techniques in terms of source datasets and the fidelity and privacy evaluation of the generated synthetic datasets. We also compile a list of the metrics and datasets used by the reviewed works, which can be utilized as benchmarks for future research in the field. We conclude by discussing challenges in GANs for EHRs development and proposing recommended practices. We hope that this work motivates novel research development directions in the intersection of healthcare and machine learning
A Survey of Privacy Attacks in Machine Learning
As machine learning becomes more widely used, the need to study its
implications in security and privacy becomes more urgent. Although the body of
work in privacy has been steadily growing over the past few years, research on
the privacy aspects of machine learning has received less focus than the
security aspects. Our contribution in this research is an analysis of more than
40 papers related to privacy attacks against machine learning that have been
published during the past seven years. We propose an attack taxonomy, together
with a threat model that allows the categorization of different attacks based
on the adversarial knowledge, and the assets under attack. An initial
exploration of the causes of privacy leaks is presented, as well as a detailed
analysis of the different attacks. Finally, we present an overview of the most
commonly proposed defenses and a discussion of the open problems and future
directions identified during our analysis.Comment: Under revie
Generating tabular datasets under differential privacy
Machine Learning (ML) is accelerating progress across fields and industries,
but relies on accessible and high-quality training data. Some of the most
important datasets are found in biomedical and financial domains in the form of
spreadsheets and relational databases. But this tabular data is often sensitive
in nature. Synthetic data generation offers the potential to unlock sensitive
data, but generative models tend to memorise and regurgitate training data,
which undermines the privacy goal. To remedy this, researchers have
incorporated the mathematical framework of Differential Privacy (DP) into the
training process of deep neural networks. But this creates a trade-off between
the quality and privacy of the resulting data. Generative Adversarial Networks
(GANs) are the dominant paradigm for synthesising tabular data under DP, but
suffer from unstable adversarial training and mode collapse, which are
exacerbated by the privacy constraints and challenging tabular data modality.
This work optimises the quality-privacy trade-off of generative models,
producing higher quality tabular datasets with the same privacy guarantees. We
implement novel end-to-end models that leverage attention mechanisms to learn
reversible tabular representations. We also introduce TableDiffusion, the first
differentially-private diffusion model for tabular data synthesis. Our
experiments show that TableDiffusion produces higher-fidelity synthetic
datasets, avoids the mode collapse problem, and achieves state-of-the-art
performance on privatised tabular data synthesis. By implementing
TableDiffusion to predict the added noise, we enabled it to bypass the
challenges of reconstructing mixed-type tabular data. Overall, the diffusion
paradigm proves vastly more data and privacy efficient than the adversarial
paradigm, due to augmented re-use of each data batch and a smoother iterative
training process
Crowd-ML: A Privacy-Preserving Learning Framework for a Crowd of Smart Devices
Smart devices with built-in sensors, computational capabilities, and network
connectivity have become increasingly pervasive. The crowds of smart devices
offer opportunities to collectively sense and perform computing tasks in an
unprecedented scale. This paper presents Crowd-ML, a privacy-preserving machine
learning framework for a crowd of smart devices, which can solve a wide range
of learning problems for crowdsensing data with differential privacy
guarantees. Crowd-ML endows a crowdsensing system with an ability to learn
classifiers or predictors online from crowdsensing data privately with minimal
computational overheads on devices and servers, suitable for a practical and
large-scale employment of the framework. We analyze the performance and the
scalability of Crowd-ML, and implement the system with off-the-shelf
smartphones as a proof of concept. We demonstrate the advantages of Crowd-ML
with real and simulated experiments under various conditions
Machine Learning Methods To Identify Hidden Phenotypes In The Electronic Health Record
The widespread adoption of Electronic Health Records (EHRs) means an unprecedented amount of patient treatment and outcome data is available to researchers. Research is a tertiary priority in the EHR, where the priorities are patient care and billing. Because of this, the data is not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a large variety of reasons ranging from individual input styles to differences in clinical decision making, for example, which lab tests to issue. Few patients are annotated at a research quality, limiting sample size and presenting a moving gold standard. Patient progression over time is key to understanding many diseases but many machine learning algorithms require a snapshot, at a single time point, to create a usable vector form. In this dissertation, we develop new machine learning methods and computational workflows to extract hidden phenotypes from the Electronic Health Record (EHR). In Part 1, we use a semi-supervised deep learning approach to compensate for the low number of research quality labels present in the EHR. In Part 2, we examine and provide recommendations for characterizing and managing the large amount of missing data inherent to EHR data. In Part 3, we present an adversarial approach to generate synthetic data that closely resembles the original data while protecting subject privacy. We also introduce a workflow to enable reproducible research even when data cannot be shared. In Part 4, we introduce a novel strategy to first extract sequential data from the EHR and then demonstrate the ability to model these sequences with deep learning
- …