Non-Imaging Medical Data Synthesis for Trustworthy AI: A Comprehensive Survey
Data quality is the key factor for the development of trustworthy AI in
healthcare. A large volume of curated datasets with controlled confounding
factors can help improve the accuracy, robustness and privacy of downstream AI
algorithms. However, access to good quality datasets is limited by the
technical difficulty of data acquisition and large-scale sharing of healthcare
data is hindered by strict ethical restrictions. Data synthesis algorithms,
which generate data with a distribution similar to that of real clinical data, can
serve as a potential solution to address the scarcity of good quality data
during the development of trustworthy AI. However, state-of-the-art data
synthesis algorithms, especially deep learning algorithms, focus more on
imaging data while neglecting the synthesis of non-imaging healthcare data,
including clinical measurements, medical signals and waveforms, and electronic
healthcare records (EHRs). Thus, in this paper, we will review the synthesis
algorithms, particularly for non-imaging medical data, with the aim of
providing trustworthy AI in this domain. This tutorial-styled review paper will
provide comprehensive descriptions of non-imaging medical data synthesis on
aspects including algorithms, evaluations, limitations and future research
directions.
Comment: 35 pages, submitted to ACM Computing Surveys
Synthetic data generation method for hybrid image-tabular data using two generative adversarial networks
The generation of synthetic medical records using generative adversarial
networks (GANs) has become increasingly important for addressing privacy
concerns and promoting data sharing in the medical field. In this paper, we
propose a novel method for generating synthetic hybrid medical records
consisting of chest X-ray images (CXRs) and structured tabular data (including
anthropometric data and laboratory tests) using an auto-encoding GAN
(αGAN) and a conditional tabular GAN (CTGAN). Our approach involves
training an αGAN model on a large public database (pDB) to reduce the
dimensionality of CXRs. We then applied the trained encoder of the αGAN model to
the images in the original database (oDB) to obtain the latent vectors. These
latent vectors were combined with tabular data in oDB, and these joint data
were used to train the CTGAN model. We successfully generated diverse synthetic
records of hybrid CXR and tabular data, maintaining correspondence between
them. We evaluated this synthetic database (sDB) through visual assessment,
distribution of interrecord distances, and classification tasks. Our evaluation
results showed that the sDB captured the features of the oDB while maintaining
the correspondence between the images and tabular data. Although our approach
relies on the availability of a large-scale pDB containing a substantial number
of images with the same modality and imaging region as those in the oDB, this
method has the potential for the public release of synthetic datasets without
compromising the secondary use of data.
Comment: 14 pages
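The joining step and the interrecord-distance check described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the helper names `join_records` and `nearest_distances` are hypothetical, the toy numbers are made up, and the choice of Euclidean nearest-neighbour distance is an assumption, since the abstract does not say which distance was used.

```python
import math


def join_records(latents, table_rows):
    """Concatenate each image's latent vector with the matching tabular row."""
    if len(latents) != len(table_rows):
        raise ValueError("need exactly one latent vector per tabular row")
    return [list(z) + list(row) for z, row in zip(latents, table_rows)]


def nearest_distances(queries, references):
    """For each query record, the Euclidean distance to its closest reference record."""
    return [min(math.dist(q, r) for r in references) for q in queries]


# Toy data (hypothetical): 2-dimensional latents plus two tabular columns.
original = join_records([[0.0, 0.0], [1.0, 1.0]], [[60.0, 1.5], [72.0, 1.8]])
synthetic = join_records([[0.0, 0.0]], [[63.0, 5.5]])

# How far is each synthetic record from its closest original record?
distances = nearest_distances(synthetic, original)
```

In practice the columns would need to be put on a common scale before distances are compared; that step is left out here to keep the sketch short.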
Distributed Conditional GAN (discGAN) For Synthetic Healthcare Data Generation
In this paper, we propose a distributed Generative Adversarial Network
(discGAN) to generate synthetic tabular data specific to the healthcare
domain. While using GANs to generate images has been well studied, little to no
attention has been given to generation of tabular data. Modeling distributions
of discrete and continuous tabular data is a non-trivial task with high
utility. We applied discGAN to model non-Gaussian multi-modal healthcare data.
We generated 249,000 synthetic records from the original 2,027 eICU records. We
evaluated the performance of the model using machine learning efficacy, the
Kolmogorov-Smirnov (KS) test for continuous variables and chi-squared test for
discrete variables. Our results show that discGAN was able to generate data
with distributions similar to those of the real data.
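The two distributional checks named in this abstract can be illustrated with a small standard-library sketch. It computes only the test statistics (not p-values), the function names `ks_statistic` and `chi2_statistic` are hypothetical, and it is not the evaluation code used in the paper.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synth):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    n, m = len(real_sorted), len(synth_sorted)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def chi2_statistic(real, synth):
    """Pearson chi-squared statistic for homogeneity of two categorical samples."""
    counts_real, counts_synth = Counter(real), Counter(synth)
    n, m = len(real), len(synth)
    stat = 0.0
    for category in set(counts_real) | set(counts_synth):
        # Expected counts come from the pooled category frequency.
        pooled = (counts_real[category] + counts_synth[category]) / (n + m)
        for observed, size in ((counts_real[category], n), (counts_synth[category], m)):
            expected = pooled * size
            stat += (observed - expected) ** 2 / expected
    return stat
```

A value near zero for either statistic indicates that the synthetic column closely tracks the real one; turning the statistics into p-values would typically be delegated to a statistics library.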
Adversarial Random Forests for Density Estimation and Generative Modeling
We propose methods for density estimation and data synthesis using a novel form
of unsupervised random forests. Inspired by generative adversarial networks, we
implement a recursive procedure in which trees gradually learn structural
properties of the data through alternating rounds of generation and
discrimination. The method is provably consistent under minimal assumptions.
Unlike classic tree-based alternatives, our approach provides smooth
(un)conditional densities and allows for fully synthetic data generation. We
achieve comparable or superior performance to state-of-the-art probabilistic
circuits and deep learning models on various tabular data benchmarks while
executing about two orders of magnitude faster on average. An accompanying R
package, arf, is available on CRAN.
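To make the "generation" half of an alternating round concrete, here is a rough sketch under a stated assumption: that naive synthetic rows are produced by sampling every column independently from its observed values, which destroys dependencies between columns and gives a discriminator something to learn. The abstract does not spell this out, the function names are hypothetical, and the discrimination step (fitting a forest to the labelled rows) is omitted. This is not the arf package's API.

```python
import random


def sample_marginals(rows, rng):
    """Draw each column independently, with replacement, from its observed values."""
    columns = list(zip(*rows))
    return [tuple(rng.choice(col) for col in columns) for _ in range(len(rows))]


def label_for_discrimination(rows, rng):
    """Pair real rows (label 1) with naive synthetic rows (label 0)."""
    real = [(tuple(row), 1) for row in rows]
    fake = [(row, 0) for row in sample_marginals(rows, rng)]
    return real + fake


# Toy data (hypothetical): one numeric and one categorical column.
rows = [(1, "a"), (2, "b"), (3, "c")]
labelled = label_for_discrimination(rows, random.Random(0))
```

A classifier trained on `labelled` that cannot beat chance would suggest the columns are already close to independent; otherwise its structure captures the dependencies that the next round of generation should respect.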