2,781 research outputs found
A review of Generative Adversarial Networks for Electronic Health Records: applications, evaluation measures and data sources
Electronic Health Records (EHRs) are a valuable asset to facilitate clinical
research and point of care applications; however, many challenges such as data
privacy concerns impede its optimal utilization. Deep generative models,
particularly, Generative Adversarial Networks (GANs) show great promise in
generating synthetic EHR data by learning underlying data distributions while
achieving excellent performance and addressing these challenges. This work aims
to review the major developments in various applications of GANs for EHRs and
provides an overview of the proposed methodologies. For this purpose, we
combine perspectives from healthcare applications and machine learning
techniques in terms of source datasets and the fidelity and privacy evaluation
of the generated synthetic datasets. We also compile a list of the metrics and
datasets used by the reviewed works, which can be utilized as benchmarks for
future research in the field. We conclude by discussing challenges in GANs for
EHRs development and proposing recommended practices. We hope that this work
motivates novel research development directions in the intersection of
healthcare and machine learning
A survey of generative adversarial networks for synthesizing structured electronic health records
Electronic Health Records (EHRs) are a valuable asset to facilitate clinical research and point of care applications; however, many challenges such as data privacy concerns impede its optimal utilization. Deep generative models, particularly, Generative Adversarial Networks (GANs) show great promise in generating synthetic EHR data by learning underlying data distributions while achieving excellent performance and addressing these challenges. This work aims to survey the major developments in various applications of GANs for EHRs and provides an overview of the proposed methodologies. For this purpose, we combine perspectives from healthcare applications and machine learning techniques in terms of source datasets and the fidelity and privacy evaluation of the generated synthetic datasets. We also compile a list of the metrics and datasets used by the reviewed works, which can be utilized as benchmarks for future research in the field. We conclude by discussing challenges in GANs for EHRs development and proposing recommended practices. We hope that this work motivates novel research development directions in the intersection of healthcare and machine learning
Synthesize Extremely High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model
Synthetic electronic health records (EHRs) that are both realistic and
preserve privacy can serve as an alternative to real EHRs for machine learning
(ML) modeling and statistical analysis. However, generating high-fidelity and
granular electronic health record (EHR) data in its original,
highly-dimensional form poses challenges for existing methods due to the
complexities inherent in high-dimensional data. In this paper, we propose
Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal
high-dimensional EHR, which preserve the statistical properties of real EHR and
can be used to train accurate ML models without privacy concerns. Our HALO
method, designed as a hierarchical autoregressive model, generates a
probability density function of medical codes, clinical visits, and patient
records, allowing for the generation of realistic EHR data in its original,
unaggregated form without the need for variable selection or aggregation.
Additionally, our model also produces high-quality continuous variables in a
longitudinal and probabilistic manner. We conducted extensive experiments and
demonstrate that HALO can generate high-fidelity EHR data with high-dimensional
disease code probabilities (d > 10,000), disease co-occurrence probabilities
within visits (d > 1,000,000), and conditional probabilities across consecutive
visits (d > 5,000,000) and achieve above 0.9 R2 correlation in comparison to
real EHR data. This performance then enables downstream ML models trained on
its synthetic data to achieve comparable accuracy to models trained on real
data (0.938 AUROC with HALO data vs. 0.943 with real data). Finally, using a
combination of real and synthetic data enhances the accuracy of ML models
beyond that achieved by using only real EHR data
Generating synthetic mixed-type longitudinal electronic health records for artificial intelligent applications
The recent availability of electronic health records (EHRs) have provided enormous opportunities to develop artificial intelligence (AI) algorithms. However, patient privacy has become a major concern that limits data sharing across hospital settings and subsequently hinders the advances in AI. Synthetic data, which benefits from the development and proliferation of generative models, has served as a promising substitute for real patient EHR data. However, the current generative models are limited as they only generate single type of clinical data for a synthetic patient, i.e., either continuous-valued or discrete-valued. To mimic the nature of clinical decision-making which encompasses various data types/sources, in this study, we propose a generative adversarial network (GAN) entitled EHR-M-GAN that simultaneously synthesizes mixed-type timeseries EHR data. EHR-M-GAN is capable of capturing the multidimensional, heterogeneous, and correlated temporal dynamics in patient trajectories. We have validated EHR-M-GAN on three publicly-available intensive care unit databases with records from a total of 141,488 unique patients, and performed privacy risk evaluation of the proposed model. EHR-M-GAN has demonstrated its superiority over state-of-the-art benchmarks for synthesizing clinical timeseries with high fidelity, while addressing the limitations regarding data types and dimensionality in the current generative models. Notably, prediction models for outcomes of intensive care performed significantly better when training data was augmented with the addition of EHR-M-GAN-generated timeseries. EHR-M-GAN may have use in developing AI algorithms in resource-limited settings, lowering the barrier for data acquisition while preserving patient privacy
ScoEHR: generating synthetic electronic health records using continuous-time diffusion models
Global access to statistically and clinically representative patient health data holds potential for advancing disease research, enhancing patient care, and accelerating drug development. However, acquisition of health data such as electronic health records (EHRs) comes with challenges characterised by high costs, time constraints, and concerns related to patient privacy. An approach to tackling these challenges is by using synthetic data. In this paper we introduce ScoEHR, a novel deep learning method for generating synthetic EHRs, which combines an autoencoder with a continuous-time diffusion model. ScoEHR is shown to outperform three baseline synthetic EHR generation frameworks (medGAN, medWGAN, and medBGAN) on two publicly available datasets, MIMIC-III and the Yale New Haven Health System Emergency Department dataset, based on four widely accepted metrics of data utility. Additionally, a blind clinician evaluation was carried out to assess the qualitative realism of the synthetic data generated by ScoEHR. In this evaluation, a patient’s data was labeled as ‘unrealistic’ if at least one clinician found it to be unrealistic. This evaluation showed that existing real EHR data and ScoEHR generated synthetic data were scored as equally realistic. Our code is available at https://github.com/aanaseer/ ScoEHR
A Multifaceted Benchmarking of Synthetic Electronic Health Record Generation Models
Synthetic health data have the potential to mitigate privacy concerns when
sharing data to support biomedical research and the development of innovative
healthcare applications. Modern approaches for data generation based on machine
learning, generative adversarial networks (GAN) methods in particular, continue
to evolve and demonstrate remarkable potential. Yet there is a lack of a
systematic assessment framework to benchmark methods as they emerge and
determine which methods are most appropriate for which use cases. In this work,
we introduce a generalizable benchmarking framework to appraise key
characteristics of synthetic health data with respect to utility and privacy
metrics. We apply the framework to evaluate synthetic data generation methods
for electronic health records (EHRs) data from two large academic medical
centers with respect to several use cases. The results illustrate that there is
a utility-privacy tradeoff for sharing synthetic EHR data. The results further
indicate that no method is unequivocally the best on all criteria in each use
case, which makes it evident why synthetic data generation methods need to be
assessed in context
Adversarial behaviours knowledge area
The technological advancements witnessed by our society in recent decades have brought
improvements in our quality of life, but they have also created a number of opportunities for
attackers to cause harm. Before the Internet revolution, most crime and malicious activity
generally required a victim and a perpetrator to come into physical contact, and this limited
the reach that malicious parties had. Technology has removed the need for physical contact
to perform many types of crime, and now attackers can reach victims anywhere in the world, as long as they are connected to the Internet. This has revolutionised the characteristics of crime and warfare, allowing operations that would not have been possible before. In this document, we provide an overview of the malicious operations that are happening on the Internet today. We first provide a taxonomy of malicious activities based on the attacker’s motivations and capabilities, and then move on to the technological and human elements that adversaries require to run a successful operation. We then discuss a number of frameworks that have been proposed to model malicious operations. Since adversarial behaviours are not a purely technical topic, we draw from research in a number of fields (computer science, criminology, war studies). While doing this, we discuss how these frameworks can be used by researchers and practitioners to develop effective mitigations against malicious online operations.Published versio
- …