Interactive Range Queries under Differential Privacy
Differential privacy approaches employ a curator to control data sharing with analysts without compromising individual privacy. The curator's role is to guard the data and determine what is appropriate for release, using the parameter epsilon to adjust the accuracy of the released data. A low epsilon value provides more privacy, while a higher epsilon value yields higher accuracy. Counting queries, which "count" the number of items in a dataset that meet specific conditions, pose additional challenges for privacy protection. In particular, if the resulting counts are low, the released data is more specific and can lead to privacy loss. This work addresses privacy challenges in single-attribute counting-range queries by proposing a Workload Partitioning Mechanism (WPM) that generates estimated answers based on query sensitivity. The mechanism is then extended to handle multiple-attribute range queries by preventing interrelated attributes from revealing private information about individuals. Further, the mechanism is paired with access control to improve system privacy and security, illustrating its practicality. The work also extends the WPM to reduce the error to polylogarithmic in the sensitivity degree of the issued queries. This thesis describes the research questions addressed by the WPM to date and discusses future plans to expand the current research toward a more efficient mechanism for range queries.
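The epsilon/accuracy trade-off described above can be illustrated with the standard Laplace mechanism for a counting query. This is a generic textbook sketch, not the WPM itself; the names `dp_count` and `laplace_noise` are illustrative:

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of Laplace(0, scale) using one uniform draw.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon, rng=None):
    """Answer a counting query with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one
    individual changes the count by at most 1), so Laplace noise
    with scale 1/epsilon suffices. Smaller epsilon means larger
    noise: more privacy, less accuracy.
    """
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Toy dataset of ages; the true count of ages in [30, 50] is 3.
ages = [23, 31, 45, 52, 67, 29, 41]
noisy = dp_count(ages, lambda a: 30 <= a <= 50, epsilon=1.0,
                 rng=random.Random(0))
```

The noisy answer is unbiased: averaged over many runs it concentrates around the true count, with spread controlled by 1/epsilon.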
Machine Learning Methods To Identify Hidden Phenotypes In The Electronic Health Record
The widespread adoption of Electronic Health Records (EHRs) means an unprecedented amount of patient treatment and outcome data is available to researchers. Research is a tertiary priority in the EHR, where the primary priorities are patient care and billing. Because of this, the data is not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a wide variety of reasons, ranging from individual input styles to differences in clinical decision making, for example, which lab tests to order. Few patients are annotated at research quality, limiting sample size and presenting a moving gold standard. Patient progression over time is key to understanding many diseases, but many machine learning algorithms require a snapshot at a single time point to create a usable vector form. In this dissertation, we develop new machine learning methods and computational workflows to extract hidden phenotypes from the Electronic Health Record (EHR). In Part 1, we use a semi-supervised deep learning approach to compensate for the low number of research-quality labels present in the EHR. In Part 2, we examine and provide recommendations for characterizing and managing the large amount of missing data inherent to EHR data. In Part 3, we present an adversarial approach to generate synthetic data that closely resembles the original data while protecting subject privacy. We also introduce a workflow to enable reproducible research even when data cannot be shared. In Part 4, we introduce a novel strategy to first extract sequential data from the EHR and then demonstrate the ability to model these sequences with deep learning.
End-to-End Machine Learning Frameworks for Medicine: Data Imputation, Model Interpretation and Synthetic Data Generation
Tremendous successes in machine learning have been achieved in a variety of applications, such as image classification and language translation, via supervised learning frameworks. Recently, with the rapid increase of electronic health records (EHR), machine learning researchers have gained immense opportunities to adapt these successful supervised learning frameworks to diverse clinical applications. To properly employ machine learning frameworks in medicine, we must handle the special properties of the EHR and of clinical applications: (1) extensive missing data, (2) the need for model interpretation, (3) the privacy of the data. This dissertation addresses these properties to construct end-to-end machine learning frameworks for clinical decision support. We focus on the following three problems: (1) how to deal with incomplete data (data imputation), (2) how to explain the decisions of the trained model (model interpretation), (3) how to generate synthetic data so that private clinical data can be shared more safely (synthetic data generation). To address these problems, we propose novel machine learning algorithms for both static and longitudinal settings. For data imputation, we propose modified Generative Adversarial Networks and Recurrent Neural Networks to accurately impute missing values and return complete data for use with state-of-the-art supervised learning models. For model interpretation, we use an actor-critic framework to estimate the feature importance of the trained model's decisions at the instance level. We extend this algorithm to an active sensing framework that recommends which observations to measure and when.
For synthetic data generation, we extend well-known Generative Adversarial Network frameworks from the static setting to the longitudinal setting, and propose a novel differentially private synthetic data generation framework. To demonstrate the utility of the proposed models, we evaluate them on various real-world medical datasets, including cohorts from intensive care units, wards, and primary care hospitals. We show that the proposed algorithms consistently outperform the state of the art at handling missing data, interpreting the trained model, and generating private synthetic data, all of which are critical for building end-to-end machine learning frameworks for medicine.
Privacy, Space and Time: a Survey on Privacy-Preserving Continuous Data Publishing
Sensors, portable devices, and location-based services generate massive amounts of geo-tagged and/or location- and user-related data on a daily basis. The manipulation of such data is useful in numerous application domains, e.g., healthcare, intelligent buildings, and traffic monitoring, to name a few. A high percentage of these data carry information about users' activities and other personal details, and thus their manipulation and sharing raise concerns about the privacy of the individuals involved. To enable data sharing that is secure from the users' privacy perspective, researchers have already proposed various seminal techniques for the protection of users' privacy. However, the continuous fashion in which data are generated nowadays, and the high availability of external sources of information, pose more threats and add extra challenges to the problem. In this survey, we review the work done on data privacy for continuous data publishing and report on the proposed solutions, with a special focus on solutions concerning location or geo-referenced data.
Generating tabular datasets under differential privacy
Machine Learning (ML) is accelerating progress across fields and industries,
but relies on accessible and high-quality training data. Some of the most
important datasets are found in biomedical and financial domains in the form of
spreadsheets and relational databases. But this tabular data is often sensitive
in nature. Synthetic data generation offers the potential to unlock sensitive
data, but generative models tend to memorise and regurgitate training data,
which undermines the privacy goal. To remedy this, researchers have
incorporated the mathematical framework of Differential Privacy (DP) into the
training process of deep neural networks. But this creates a trade-off between
the quality and privacy of the resulting data. Generative Adversarial Networks
(GANs) are the dominant paradigm for synthesising tabular data under DP, but
suffer from unstable adversarial training and mode collapse, which are
exacerbated by the privacy constraints and challenging tabular data modality.
This work optimises the quality-privacy trade-off of generative models,
producing higher quality tabular datasets with the same privacy guarantees. We
implement novel end-to-end models that leverage attention mechanisms to learn
reversible tabular representations. We also introduce TableDiffusion, the first
differentially-private diffusion model for tabular data synthesis. Our
experiments show that TableDiffusion produces higher-fidelity synthetic
datasets, avoids the mode collapse problem, and achieves state-of-the-art
performance on privatised tabular data synthesis. By implementing
TableDiffusion to predict the added noise rather than the data itself, we
enable it to bypass the challenges of reconstructing mixed-type tabular data.
Overall, the diffusion paradigm proves far more data- and privacy-efficient
than the adversarial paradigm, owing to greater re-use of each data batch and
a smoother iterative training process.
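The noise-prediction objective that diffusion models like TableDiffusion rely on can be sketched with a toy DDPM-style forward process. This is a generic illustration with made-up schedule values and a stand-in for the network, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "tabular" batch: 8 rows, 4 numeric features (placeholder data).
x0 = rng.normal(size=(8, 4))

# Linear noise schedule; alpha_bar[t] is the cumulative signal fraction.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def forward_noise(x0, t, eps):
    # q(x_t | x_0): scale the data down and mix in Gaussian noise.
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

t = 50
eps = rng.normal(size=x0.shape)          # the noise actually added
x_t = forward_noise(x0, t, eps)

# A trained model would output eps_hat given (x_t, t); training minimises
# the mean-squared error between predicted and true noise.
eps_hat = np.zeros_like(eps)             # stand-in for an untrained network
loss = np.mean((eps_hat - eps) ** 2)
```

Predicting `eps` rather than reconstructing `x0` directly is what lets the model sidestep regenerating mixed-type columns at every step; only the added Gaussian noise has to be matched.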
Fake Malware Generation Using HMM and GAN
In the past decade, the number of malware attacks has grown considerably and, more importantly, the attacks have evolved. Many researchers have successfully applied state-of-the-art machine learning techniques to combat this ever-present and rising threat to information security. However, the lack of sufficient data to train these machine learning models appropriately remains a major challenge. Generative modelling has proven very effective at producing image-like synthesized data that matches the actual data distribution. In this paper, we aim to generate malware samples as opcode sequences and attempt to differentiate them from real ones, with the goal of building fake malware data that can be used to train machine learning models effectively. We use and compare different Generative Adversarial Network (GAN) algorithms and Hidden Markov Models (HMMs) to generate such fake samples, obtaining promising results.
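Once an HMM's parameters are in hand, generating a fake opcode sequence is a random walk over hidden states with an emission at each step. The sketch below uses a hypothetical 3-state model over a tiny opcode vocabulary; the transition and emission probabilities are invented for illustration, not learned from real malware:

```python
import random

# Hypothetical hidden states and opcode vocabulary.
states = ["A", "B", "C"]
opcodes = ["mov", "push", "pop", "call", "jmp"]

# Made-up transition and emission probabilities (rows sum to 1).
trans = {"A": [0.7, 0.2, 0.1],
         "B": [0.1, 0.6, 0.3],
         "C": [0.3, 0.3, 0.4]}
emit = {"A": [0.5, 0.2, 0.1, 0.1, 0.1],
        "B": [0.1, 0.4, 0.3, 0.1, 0.1],
        "C": [0.1, 0.1, 0.1, 0.4, 0.3]}

def sample_sequence(length, rng):
    """Sample a fake opcode sequence by walking the HMM."""
    state = rng.choice(states)
    seq = []
    for _ in range(length):
        # Emit an opcode from the current state, then transition.
        seq.append(rng.choices(opcodes, weights=emit[state])[0])
        state = rng.choices(states, weights=trans[state])[0]
    return seq

fake = sample_sequence(20, random.Random(42))
```

In practice the parameters would be fitted (e.g. via Baum-Welch) on real opcode traces before sampling; this sketch only shows the generation step.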