6,556 research outputs found
UMASS_BioNLP at MEDIQA-Chat 2023: Can LLMs generate high-quality synthetic note-oriented doctor-patient conversations?
This paper presents the UMASS_BioNLP team's participation in the MEDIQA-Chat
2023 shared task for Task-A and Task-C. We focus especially on Task-C and
propose a novel LLM cooperation system, named a doctor-patient loop, to
generate high-quality conversation datasets. The experimental results
demonstrate that our approach yields reasonable performance as evaluated by
automatic metrics such as ROUGE, medical concept recall, BLEU, and Self-BLEU.
Furthermore, we conducted a comparative analysis between our proposed method
and ChatGPT and GPT-4. This analysis also investigates the potential of
utilizing cooperating LLMs to generate high-quality datasets.
Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation
In recent years, machine learning models have rapidly become better at
generating clinical consultation notes; yet, there is little work on how to
properly evaluate generated consultation notes to understand the impact
they may have on both the clinician using them and the patient's clinical
safety. To address this, we present an extensive human evaluation study of
consultation notes in which 5 clinicians (i) listen to 57 mock consultations,
(ii) write their own notes, (iii) post-edit a number of automatically
generated notes, and (iv) extract all the errors, both quantitative and
qualitative. We then carry out a correlation study between 18 automatic
quality metrics and the human judgements. We find that a simple,
character-based Levenshtein distance metric performs on par with, if not
better than, common model-based metrics like BERTScore. All our findings and
annotations are open-sourced.
Comment: To be published in proceedings of ACL 202
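A character-based Levenshtein similarity of the kind referred to above fits in a few lines; this is a generic sketch (names are illustrative, not the authors' code):

```python
def levenshtein(a, b):
    # Classic dynamic program over character edits (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # delete from a
                           cur[j - 1] + 1,       # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def levenshtein_similarity(generated, reference):
    # Normalize distance into [0, 1]; 1.0 means the notes are identical.
    if not generated and not reference:
        return 1.0
    return 1 - levenshtein(generated, reference) / max(len(generated), len(reference))
```

Scoring a generated note against a clinician-written reference then reduces to one function call per note pair.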
Improving Summarization with Human Edits
Recent work has shown the promise of learning-with-human-feedback paradigms
to produce human-determined high-quality text. Existing work uses human
feedback to train large language models (LLMs) for general-domain abstractive
summarization and has obtained summary quality exceeding traditional
likelihood training. In this paper, we focus on a less explored form of human
feedback -- Human Edits. We propose Sequence Alignment (un)Likelihood Training
(SALT), a novel technique that uses both human-edited and model-generated data
together in the training loop. In addition, we demonstrate simulating Human
Edits using ground-truth summaries from existing training data -- Imitation
Edits -- along with the model-generated summaries obtained after training, to
reduce the need for expensive human-edit data. In our experiments, we extend
human feedback exploration from general-domain summarization to medical-domain
summarization. Our results demonstrate the effectiveness of SALT in improving
summary quality with Human and Imitation Edits. Through additional
experiments, we show that SALT outperforms the conventional RLHF method
designed for human preferences -- DPO -- when applied to human-edit data. We
hope the evidence in our paper prompts researchers to explore, collect, and
make better use of different forms of human feedback at scale.
Comment: To appear in proceedings of the Main Conference on Empirical Methods
in Natural Language Processing (EMNLP) 202
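The (un)likelihood idea behind SALT can be illustrated at the token level: tokens the human editor kept receive a standard likelihood term, while tokens the editor removed receive an unlikelihood term that pushes their probability down. A toy sketch (the function, labels, and weighting are illustrative, not the paper's actual objective):

```python
import math

def salt_style_loss(token_probs, edit_labels, alpha=1.0):
    """Toy per-token (un)likelihood objective over an edit-aligned summary.

    token_probs: model probability assigned to each token of a model summary.
    edit_labels: 'keep' if the human edit retained the token (likelihood term),
                 'drop' if the edit removed or changed it (unlikelihood term).
    alpha:       weight on the unlikelihood term.
    """
    loss = 0.0
    for p, label in zip(token_probs, edit_labels):
        if label == "keep":
            loss += -math.log(p)                 # reward probability on kept tokens
        else:
            loss += -alpha * math.log(1 - p)     # penalize probability on dropped tokens
    return loss / len(token_probs)
```

High probability on a kept token lowers the loss, while the same probability on a dropped token raises it, which is the asymmetry the training signal exploits.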
Overview of the Problem List Summarization (ProbSum) 2023 Shared Task on Summarizing Patients' Active Diagnoses and Problems from Electronic Health Record Progress Notes
The BioNLP Workshop 2023 launched a shared task on Problem List Summarization
(ProbSum) in January 2023. The aim of this shared task is to attract future
research efforts in building NLP models for real-world diagnostic
decision-support applications, where a system generating relevant and
accurate diagnoses will augment healthcare providers' decision-making process
and improve the quality of care for patients. The goal for participants is to
develop models that generate a list of diagnoses and problems using input
from the daily care notes collected during the hospitalization of critically
ill patients. Eight teams submitted their final systems to the shared task
leaderboard. In this paper, we describe the tasks, datasets, evaluation
metrics, and baseline systems. Additionally, we summarize the techniques and
evaluation results of the different approaches tried by the participating
teams.
Comment: To appear in the Proceedings of the 5th BioNLP Workshop at AC
ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation
Recent immense breakthroughs in generative models such as GPT-4 have
precipitated re-imagined, ubiquitous usage of these models across
applications. One area that can benefit from improvements in artificial
intelligence (AI) is healthcare. Note generation from doctor-patient
encounters, and the associated electronic medical record documentation, is
one of the most arduous and time-consuming tasks for physicians. It is also a
natural prime beneficiary of advances in generative models. However, with
such advances, benchmarking is more critical than ever. Whether studying
model weaknesses or developing new evaluation metrics, shared open datasets
are an imperative part of understanding the current state of the art.
Unfortunately, as clinic encounter conversations are not routinely recorded
and are difficult to share ethically due to patient confidentiality, there
are no sufficiently large clinic dialogue-note datasets for benchmarking this
task. Here we present the Ambient Clinical Intelligence Benchmark (ACI-BENCH)
corpus, the largest dataset to date tackling the problem of AI-assisted note
generation from visit dialogue. We also present benchmark performances of
several common state-of-the-art approaches.
ASR Error Detection via Audio-Transcript Entailment
Despite the improved performance of the latest Automatic Speech Recognition
(ASR) systems, transcription errors are still unavoidable. These errors can
have a considerable impact in critical domains such as healthcare, where
transcripts are used to help with clinical documentation. Detecting ASR
errors is therefore a critical first step in preventing further error
propagation to downstream applications. To this end, we propose a novel
end-to-end approach for ASR error detection using audio-transcript
entailment. To the best of our knowledge, we are the first to frame this
problem as an end-to-end entailment task between an audio segment and its
corresponding transcript segment. Our intuition is that there should be
bidirectional entailment between audio and transcript when there is no
recognition error, and vice versa. The proposed model uses an acoustic
encoder and a linguistic encoder to model the speech and transcript,
respectively. The encoded representations of both modalities are fused to
predict the entailment. Since doctor-patient conversations are used in our
experiments, particular emphasis is placed on medical terms. Our proposed
model achieves classification error rates (CER) of 26.2% on all transcription
errors and 23% on medical errors specifically, improving upon a strong
baseline by 12% and 15.4%, respectively.
Comment: Accepted to Interspeech 202
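The fusion step described above, combining the two encoders' outputs to predict entailment, can be sketched as a toy late-fusion head; the embeddings and weights below are placeholders standing in for the paper's learned neural encoders:

```python
import math

def fuse_and_score(audio_emb, text_emb, weights, bias=0.0):
    """Toy late-fusion entailment head (a sketch, not the paper's model).

    Concatenate the acoustic and linguistic embeddings and apply a linear
    layer plus sigmoid to estimate the probability that the audio segment
    and its transcript entail each other (i.e., no recognition error).
    """
    fused = audio_emb + text_emb  # list concatenation: [a1..an, t1..tm]
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1 / (1 + math.exp(-z))  # entailment probability in (0, 1)
```

In the real system both the encoders and the fusion weights are trained jointly; here they are fixed inputs to keep the sketch self-contained.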
WangLab at MEDIQA-Chat 2023: Clinical Note Generation from Doctor-Patient Conversations using Large Language Models
This paper describes our submission to the MEDIQA-Chat 2023 shared task for
automatic clinical note generation from doctor-patient conversations. We
report results for two approaches: the first fine-tunes a pre-trained
language model (PLM) on the shared task data, and the second uses few-shot
in-context learning (ICL) with a large language model (LLM). Both achieve
high performance as measured by automatic metrics (e.g., ROUGE, BERTScore),
ranking second and first, respectively, among all submissions to the shared
task. Expert human scrutiny indicates that notes generated via the ICL-based
approach with GPT-4 are preferred about as often as human-written notes,
making this a promising path toward automated note generation from
doctor-patient conversations.
Comment: Camera-ready submission to ClinicalNLP @ ACL 202
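The few-shot ICL setup described above amounts to assembling example dialogue-note pairs into a single prompt for the LLM. A minimal sketch of that assembly (the format and instruction text are illustrative, not the team's actual prompt):

```python
def build_icl_prompt(examples, dialogue,
                     instruction="Write a clinical note for the conversation."):
    """Assemble a few-shot prompt from (dialogue, note) demonstration pairs.

    examples: list of (dialogue, note) tuples used as in-context demonstrations.
    dialogue: the new doctor-patient conversation to summarize.
    """
    parts = [instruction, ""]
    for demo_dialogue, demo_note in examples:
        parts += ["Dialogue:", demo_dialogue, "Note:", demo_note, ""]
    # End with the target dialogue and an open "Note:" slot for the model.
    parts += ["Dialogue:", dialogue, "Note:"]
    return "\n".join(parts)
```

The resulting string ends at the "Note:" slot, so the model's completion is the generated clinical note.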