MCRAGE: Synthetic Healthcare Data for Fairness
In the field of healthcare, electronic health records (EHR) serve as crucial
training data for developing machine learning models for diagnosis, treatment,
and the management of healthcare resources. However, medical datasets are often
imbalanced in terms of sensitive attributes such as race/ethnicity, gender, and
age. Machine learning models trained on class-imbalanced EHR datasets perform
significantly worse in deployment for individuals of the minority classes
compared to samples from majority classes, which may lead to inequitable
healthcare outcomes for minority groups. To address this challenge, we propose
Minority Class Rebalancing through Augmentation by Generative modeling
(MCRAGE), a novel approach to augment imbalanced datasets using samples
generated by a deep generative model. The MCRAGE process involves training a
Conditional Denoising Diffusion Probabilistic Model (CDDPM) capable of
generating high-quality synthetic EHR samples from underrepresented classes. We
use this synthetic data to augment the existing imbalanced dataset, thereby
achieving a more balanced distribution across all classes, which can be used to
train an unbiased machine learning model. We measure the performance of MCRAGE
versus alternative approaches using Accuracy, F1 score and AUROC. We provide
theoretical justification for our method in terms of recent convergence results
for DDPMs with minimal assumptions.
Comment: Keywords: synthetic electronic health records, conditional denoising
diffusion probabilistic model, healthcare AI, tabular data, fairness,
synthetic data. This paper is the result of work completed at the 2023 Emory
University Department of Mathematics REU/RET program under the direction of
Project Advisor Dr. Xi Yuanzhe. This work is sponsored by NSF DMS 205101
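The rebalancing step described in the MCRAGE abstract can be illustrated with a minimal sketch. It only computes how many synthetic samples each class would need in order to match the largest class; the generative model itself (the CDDPM) is not shown. The function name `synthetic_samples_needed`, the toy labels, and the choice of the largest class as the target size are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def synthetic_samples_needed(labels):
    """Return, per class, how many synthetic samples would bring it
    up to the size of the largest class (hypothetical helper)."""
    counts = Counter(labels)
    target = max(counts.values())  # assumption: balance to the majority class size
    return {cls: target - n for cls, n in counts.items()}


# Toy, made-up class labels: 6 of class A, 3 of class B, 1 of class C.
labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
print(synthetic_samples_needed(labels))  # {'A': 0, 'B': 3, 'C': 5}
```

In a full pipeline, the counts returned here would be the number of conditional samples requested from the generative model for each underrepresented class.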
Generating Medical Prescriptions with Conditional Transformer
Access to real-world medication prescriptions is essential for medical
research and healthcare quality improvement. However, access to real medication
prescriptions is often limited due to the sensitive nature of the information
expressed. Additionally, manually labelling these instructions for training and
fine-tuning Natural Language Processing (NLP) models can be tedious and
expensive. We introduce a novel task-specific model architecture,
Label-To-Text-Transformer (LT3), tailored to generate synthetic
medication prescriptions based on provided labels, such as a vocabulary list of
medications and their attributes. LT3 is trained on a set of around 2K lines of
medication prescriptions extracted from the MIMIC-III database, allowing the
model to produce valuable synthetic medication prescriptions. We evaluate LT3's
performance by contrasting it with a state-of-the-art Pre-trained Language
Model (PLM), T5, analysing the quality and diversity of generated texts. We
deploy the generated synthetic data to train the SpacyNER model for the Named
Entity Recognition (NER) task over the n2c2-2018 dataset. The experiments show
that the model trained on synthetic data can achieve a 96-98% F1 score at
Label Recognition on Drug, Frequency, Route, Strength, and Form. LT3 codes and
data will be shared at
https://github.com/HECTA-UoM/Label-To-Text-Transformer
Comment: Accepted to: Workshop on Synthetic Data Generation with Generative AI
(SyntheticData4ML Workshop) at NeurIPS 202
Is artificial data useful for biomedical Natural Language Processing algorithms?
A major obstacle to the development of Natural Language Processing (NLP)
methods in the biomedical domain is data accessibility. This problem can be
addressed by generating medical data artificially. Most previous studies have
focused on the generation of short clinical text, and evaluation of the data
utility has been limited. We propose a generic methodology to guide the
generation of clinical text with key phrases. We use the artificial data as
additional training data in two key biomedical NLP tasks: text classification
and temporal relation extraction. We show that artificially generated training
data used in conjunction with real training data can lead to performance boosts
for data-greedy neural network algorithms. We also demonstrate the usefulness
of the generated data for NLP setups where it fully replaces real training
data.
Comment: BioNLP 201
A Biomedical Entity Extraction Pipeline for Oncology Health Records in Portuguese
Textual health records of cancer patients are usually protracted and highly
unstructured, making it very time-consuming for health professionals to get a
complete overview of the patient's therapeutic course. As such limitations can
lead to suboptimal and/or inefficient treatment procedures, healthcare
providers would greatly benefit from a system that effectively summarizes the
information of those records. With the advent of deep neural models, this
objective has been partially attained for English clinical texts; however, the
research community still lacks an effective solution for languages with limited
resources. In this paper, we present the approach we developed to extract
procedures, drugs, and diseases from oncology health records written in
European Portuguese. This project was conducted in collaboration with the
Portuguese Institute for Oncology which, besides holding over years of
duly protected medical records, also provided oncologist expertise throughout
the development of the project. Since there is no annotated corpus for
biomedical entity extraction in Portuguese, we also present the strategy we
followed in annotating the corpus for the development of the models. The final
models, which combined a neural architecture with entity linking, achieved
scores of , , and per cent in the mention extraction
of procedures, drugs, and diseases, respectively
Autocompletion of Chief Complaints in the Electronic Health Records using Large Language Models
The Chief Complaint (CC) is a crucial component of a patient's medical record
as it describes the main reason or concern for seeking medical care. It
provides critical information for healthcare providers to make informed
decisions about patient care. However, documenting CCs can be time-consuming
for healthcare providers, especially in busy emergency departments. To address
this issue, an autocompletion tool that suggests accurate and well-formatted
phrases or sentences for clinical notes can be a valuable resource for triage
nurses. In this study, we utilized text generation techniques to develop
machine learning models using CC data. In our proposed work, we train a Long
Short-Term Memory (LSTM) model and fine-tune three different variants of
Biomedical Generative Pretrained Transformers (BioGPT), namely
microsoft/biogpt, microsoft/BioGPT-Large, and microsoft/BioGPT-Large-PubMedQA.
Additionally, we tune a prompt by incorporating exemplar CC sentences,
utilizing the OpenAI API of GPT-4. We evaluate the models' performance based on
the perplexity score, modified BERTScore, and cosine similarity score. The
results show that BioGPT-Large exhibits superior performance compared to the
other models. It consistently achieves a remarkably low perplexity score of
1.65 when generating CC, whereas the baseline LSTM model achieves the best
perplexity score of 170. Further, we evaluate and assess the proposed models'
performance and the outcome of GPT-4.0. Our study demonstrates that utilizing
LLMs such as BioGPT leads to the development of an effective autocompletion
tool for generating CC documentation in healthcare settings.
Comment: IEEE BigData 2023 - Sorrento, Italy. 10 Pages, 4 Figures, 5 Tables
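For readers unfamiliar with the perplexity scores quoted in this abstract, the following is a minimal sketch of the standard definition: the exponential of the average negative log-likelihood per token. The function name `perplexity` and the toy log-probabilities are assumptions for illustration; the abstract does not state how the authors computed their scores.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities:
    exp(-(1/N) * sum(log p_i)). Lower values mean a better fit."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy example: a model that assigns probability 0.5 to each of four tokens
# has a perplexity close to 2, i.e. it is as uncertain as a fair coin flip.
print(perplexity([math.log(0.5)] * 4))
```

Under this definition, a score near 1 means the model assigns almost all probability mass to the observed tokens, which gives some intuition for the gap between the two reported scores.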
LOGEN: Few-shot Logical Knowledge-Conditioned Text Generation with Self-training
Natural language generation from structured data mainly focuses on
surface-level descriptions, suffering from uncontrollable content selection and
low fidelity. Previous works leverage logical forms to facilitate logical
knowledge-conditioned text generation. Though achieving remarkable progress,
they are data-hungry, which makes the adoption for real-world applications
challenging with limited data. To this end, this paper proposes a unified
framework for logical knowledge-conditioned text generation in the few-shot
setting. With only a few seed logical forms (e.g., 20/100 shot), our approach
leverages self-training and samples pseudo logical forms based on content and
structure consistency. Experimental results demonstrate that our approach can
obtain better few-shot performance than baselines.
Comment: Work in progress