Non-Imaging Medical Data Synthesis for Trustworthy AI: A Comprehensive Survey
Data quality is the key factor for the development of trustworthy AI in
healthcare. A large volume of curated datasets with controlled confounding
factors can help improve the accuracy, robustness and privacy of downstream AI
algorithms. However, access to good quality datasets is limited by the
technical difficulty of data acquisition and large-scale sharing of healthcare
data is hindered by strict ethical restrictions. Data synthesis algorithms,
which generate data with a similar distribution as real clinical data, can
serve as a potential solution to address the scarcity of good quality data
during the development of trustworthy AI. However, state-of-the-art data
synthesis algorithms, especially deep learning algorithms, focus more on
imaging data while neglecting the synthesis of non-imaging healthcare data,
including clinical measurements, medical signals and waveforms, and electronic
healthcare records (EHRs). Thus, in this paper, we will review the synthesis
algorithms, particularly for non-imaging medical data, with the aim of
providing trustworthy AI in this domain. This tutorial-styled review paper will
provide comprehensive descriptions of non-imaging medical data synthesis on
aspects including algorithms, evaluations, limitations and future research
directions.
Comment: 35 pages, submitted to ACM Computing Surveys.
A method for machine learning generation of realistic synthetic datasets for validating healthcare applications
Digital health applications can improve quality and effectiveness of healthcare, by offering a number of new tools to users, which are often considered medical devices. Assuring their safe operation requires, amongst others, clinical validation, needing large datasets to test them in realistic clinical scenarios. Access to datasets is challenging, due to patient privacy concerns. Development of synthetic datasets is seen as a potential alternative. The objective of the paper is to develop a method for the generation of realistic synthetic datasets, statistically equivalent to real clinical datasets, and to demonstrate that the Generative Adversarial Network (GAN) based approach is fit for purpose. A generative adversarial network was implemented and trained, in a series of six experiments, using numerical and categorical variables, including ICD-9 and laboratory codes, from three clinically relevant datasets. A number of contextual steps provided the success criteria for the synthetic dataset. A synthetic dataset that exhibits very similar statistical characteristics to the real dataset was generated. Pairwise association of variables is very similar. A high degree of Jaccard similarity and a successful K-S test further support this. The proof of concept of generating realistic synthetic datasets was successful, with the approach showing promise for further work.
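The abstract above names two similarity checks, the K-S test and Jaccard similarity, without saying how they were computed. As a rough illustration only, the sketch below computes a two-sample K-S statistic for one numerical variable and a Jaccard similarity between the sets of categorical codes seen in each dataset. The function names and the toy values are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    # The largest gap is always attained at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def jaccard_similarity(real_codes, synthetic_codes):
    """Size of the intersection divided by size of the union of two code sets."""
    a, b = set(real_codes), set(synthetic_codes)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical toy data (not from the paper): one lab value and one code column.
real_lab = [4.1, 4.5, 5.0, 5.2, 6.3]
synth_lab = [4.0, 4.6, 5.1, 5.3, 6.1]
real_icd = ["250.00", "401.9", "428.0"]
synth_icd = ["250.00", "401.9", "414.01"]

ks = ks_statistic(real_lab, synth_lab)
jac = jaccard_similarity(real_icd, synth_icd)
print(f"K-S statistic: {ks:.3f}, Jaccard similarity: {jac:.3f}")
```

A small K-S statistic and a Jaccard value close to 1 both point toward similar distributions; turning the statistic into a pass or fail decision additionally needs a significance threshold, which this sketch leaves out.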
Artificial Intelligence for In Silico Clinical Trials: A Review
A clinical trial is an essential step in drug development, which is often
costly and time-consuming. In silico trials are clinical trials conducted
digitally through simulation and modeling as an alternative to traditional
clinical trials. AI-enabled in silico trials can increase the case group size
by creating virtual cohorts as controls. In addition, they also enable
automation and optimization of trial design and predict the trial success
rate. This article systematically reviews papers under three main topics:
clinical simulation, individualized predictive modeling, and computer-aided
trial design. We focus on how machine learning (ML) may be applied in these
applications. In particular, we present the machine learning problem
formulation and available data sources for each task. We end by discussing
the challenges and opportunities of AI for in silico trials in real-world
applications.
PPGAN: Privacy-preserving Generative Adversarial Network
Generative Adversarial Network (GAN) and its variants serve as a perfect
representation of the data generation model, providing researchers with a large
amount of high-quality generated data. They illustrate a promising direction
for research with limited data availability. When GAN learns the semantic-rich
data distribution from a dataset, the density of the generated distribution
tends to concentrate on the training data. Because the gradient parameters of
the deep neural network encode the data distribution of the training samples,
the network can easily memorize those samples. When a GAN is applied to private
or sensitive data, for instance patient medical records, private
information may be leaked. To address this issue, we propose a
Privacy-preserving Generative Adversarial Network (PPGAN) model, in which we
achieve differential privacy in GANs by adding well-designed noise to the
gradient during the model learning procedure. In addition, we introduce the
Moments Accountant strategy in the PPGAN training process to improve the
stability and compatibility of the model by controlling privacy loss. We also
give a mathematical proof of the differential privacy discriminator. Through
extensive case studies of the benchmark datasets, we demonstrate that PPGAN can
generate high-quality synthetic data while retaining the required data
availability under a reasonable privacy budget.
Comment: This paper was accepted by the IEEE ICPADS 2019 Workshop. This paper
contains 10 pages, 3 figures.
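The PPGAN abstract describes reaching differential privacy by adding noise to the gradient during training. The sketch below shows one common way this is done in general: clip each per-example gradient to a fixed norm, sum, add Gaussian noise scaled to that norm, then average. It is a generic illustration, not PPGAN's actual procedure. The names `clip_norm` and `noise_multiplier` are assumptions, and the Moments Accountant bookkeeping of the privacy budget is not implemented here.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale a gradient vector down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_mean_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is tied to the clipping norm (assumed convention).
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


# Hypothetical per-example gradients for a two-parameter discriminator.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]
private_grad = noisy_mean_gradient(
    grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng
)
print(private_grad)
```

Clipping bounds how much any single training record can move the update, which is what lets the added noise mask an individual record's contribution.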
CONAN: Complementary Pattern Augmentation for Rare Disease Detection
Rare diseases affect hundreds of millions of people worldwide but are hard to
detect since they have extremely low prevalence rates (varying from 1/1,000 to
1/200,000 patients) and are massively underdiagnosed. How do we reliably detect
rare diseases with such low prevalence rates? How can we further leverage patients
with possibly uncertain diagnoses to improve detection? In this paper, we
propose a Complementary pattern Augmentation (CONAN) framework for rare disease
detection. CONAN combines ideas from both adversarial training and max-margin
classification. It first learns self-attentive and hierarchical embedding for
patient pattern characterization. Then, we develop a complementary generative
adversarial network (GAN) model to generate candidate positive and negative
samples from the uncertain patients by encouraging a max-margin between
classes. In addition, CONAN has a disease detector that serves as the
discriminator during the adversarial training for identifying rare diseases. We
evaluated CONAN on two disease detection tasks. For low prevalence inflammatory
bowel disease (IBD) detection, CONAN achieved 0.96 precision-recall area under
the curve (PR-AUC), a 50.1% relative improvement over the best baseline. For rare
disease idiopathic pulmonary fibrosis (IPF) detection, CONAN achieved 0.22
PR-AUC, a 41.3% relative improvement over the best baseline.
Generative Adversarial Networks for Creating Synthetic Free-Text Medical Data: A Proposal for Collaborative Research and Re-use of Machine Learning Models
Restrictions in sharing Patient Health Identifiers (PHI) limit cross-organizational re-use of free-text medical data. We leverage Generative Adversarial Networks (GAN) to produce synthetic unstructured free-text medical data with low re-identification risk, and assess the suitability of these datasets to replicate machine learning models. We trained GAN models using unstructured free-text laboratory messages pertaining to salmonella, and identified the most accurate models for creating synthetic datasets that reflect the informational characteristics of the original dataset. Natural Language Generation metrics comparing the real and synthetic datasets demonstrated high similarity. Decision models generated using these datasets reported high performance metrics. There was no statistically significant difference in performance measures reported by models trained using real and synthetic datasets. Our results inform the use of GAN models to generate synthetic unstructured free-text data with limited re-identification risk, and use of this data to enable collaborative research and re-use of machine learning models.