CTAB-GAN: Effective Table Data Synthesizing

Author: Birke Robert
Publication venue: MLResearchPress
Publication date: 01/01/2021

Non-Imaging Medical Data Synthesis for Trustworthy AI: A Comprehensive Survey

Author: Del Ser Javier
Stenson Iain
Walsh Simon
Wang Lichao
Wu Huanjun
Xing Xiaodan
Yang Guang
Yong May
Publication date: 17/09/2022

Data quality is the key factor for the development of trustworthy AI in healthcare. A large volume of curated datasets with controlled confounding factors can help improve the accuracy, robustness and privacy of downstream AI algorithms. However, access to good-quality datasets is limited by the technical difficulty of data acquisition, and large-scale sharing of healthcare data is hindered by strict ethical restrictions. Data synthesis algorithms, which generate data with a distribution similar to that of real clinical data, can serve as a potential solution to address the scarcity of good-quality data during the development of trustworthy AI. However, state-of-the-art data synthesis algorithms, especially deep learning algorithms, focus more on imaging data while neglecting the synthesis of non-imaging healthcare data, including clinical measurements, medical signals and waveforms, and electronic healthcare records (EHRs). Thus, in this paper, we will review the synthesis algorithms, particularly for non-imaging medical data, with the aim of providing trustworthy AI in this domain. This tutorial-styled review paper will provide comprehensive descriptions of non-imaging medical data synthesis on aspects including algorithms, evaluations, limitations and future research directions. Comment: 35 pages, submitted to ACM Computing Surveys

arXiv.org e-Print Archive

Synthetic data generation method for hybrid image-tabular data using two generative adversarial networks

Author: Hanaoka Shouhei
Kikuchi Tomohiro
Mori Harushi
Nakao Takahiro
Nomura Yukihiro
Takenaga Tomomi
Yoshikawa Takeharu
Publication date: 15/08/2023

The generation of synthetic medical records using generative adversarial networks (GANs) has become increasingly important for addressing privacy concerns and promoting data sharing in the medical field. In this paper, we propose a novel method for generating synthetic hybrid medical records consisting of chest X-ray images (CXRs) and structured tabular data (including anthropometric data and laboratory tests) using an auto-encoding GAN (αGAN) and a conditional tabular GAN (CTGAN). Our approach involves training an αGAN model on a large public database (pDB) to reduce the dimensionality of CXRs. We then applied the trained encoder of the GAN model to the images in the original database (oDB) to obtain the latent vectors. These latent vectors were combined with the tabular data in the oDB, and these joint data were used to train the CTGAN model. We successfully generated diverse synthetic records of hybrid CXR and tabular data, maintaining correspondence between them. We evaluated this synthetic database (sDB) through visual assessment, distribution of interrecord distances, and classification tasks. Our evaluation results showed that the sDB captured the features of the oDB while maintaining the correspondence between the images and tabular data. Although our approach relies on the availability of a large-scale pDB containing a substantial number of images with the same modality and imaging region as those in the oDB, this method has the potential for the public release of synthetic datasets without compromising the secondary use of data. Comment: 14 pages

arXiv.org e-Print Archive

Distributed Conditional GAN (discGAN) For Synthetic Healthcare Data Generation

Author: Adewole Sodiq
Fuentes David
McSpadden Diana
Publication date: 09/04/2023

In this paper, we propose a distributed Generative Adversarial Network (discGAN) to generate synthetic tabular data specific to the healthcare domain. While using GANs to generate images has been well studied, little to no attention has been given to the generation of tabular data. Modeling distributions of discrete and continuous tabular data is a non-trivial task with high utility. We applied discGAN to model non-Gaussian multi-modal healthcare data. We generated 249,000 synthetic records from the original 2,027 records of the eICU dataset. We evaluated the performance of the model using machine learning efficacy, the Kolmogorov-Smirnov (KS) test for continuous variables and the chi-squared test for discrete variables. Our results show that discGAN was able to generate data with distributions similar to the real data.
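The two distribution checks named in this abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `ks_statistic` and `chi_squared_statistic` are hypothetical, the inputs are plain Python lists for one column of real data and one column of synthetic data, and only the test statistics are computed (no p-values). It assumes both columns are non-empty.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The supremum of |F_real - F_synth| is attained at an observed value.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        largest_gap = max(largest_gap, abs(cdf_real - cdf_synth))
    return largest_gap


def chi_squared_statistic(real, synthetic):
    """Chi-squared homogeneity statistic on a 2 x k table of category counts."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    counts_real, counts_synth = Counter(real), Counter(synthetic)
    n_real, n_synth = len(real), len(synthetic)
    total = n_real + n_synth
    statistic = 0.0
    for category in set(real) | set(synthetic):
        column_total = counts_real[category] + counts_synth[category]
        for observed, row_total in (
            (counts_real[category], n_real),
            (counts_synth[category], n_synth),
        ):
            # Expected count if both samples shared one category distribution.
            expected = row_total * column_total / total
            statistic += (observed - expected) ** 2 / expected
    return statistic
```

For both statistics, a value near zero means the synthetic column closely tracks the real one; larger values indicate a larger distributional mismatch. In practice a statistics library would typically be used to obtain p-values as well.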

arXiv.org e-Print Archive

Adversarial Random Forests for Density Estimation and Generative Modeling

Author: Blesch Kristin
Kapar Jan
Watson David S
Wright Marvin N
Publication venue: PMLR (The Proceedings of Machine Learning Research)
Publication date: 27/04/2023

We propose methods for density estimation and data synthesis using a novel form of unsupervised random forests. Inspired by generative adversarial networks, we implement a recursive procedure in which trees gradually learn structural properties of the data through alternating rounds of generation and discrimination. The method is provably consistent under minimal assumptions. Unlike classic tree-based alternatives, our approach provides smooth (un)conditional densities and allows for fully synthetic data generation. We achieve comparable or superior performance to state-of-the-art probabilistic circuits and deep learning models on various tabular data benchmarks while executing about two orders of magnitude faster on average. An accompanying R package, arf, is available on CRAN.