Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation

Belz, Anya; Flann, Jack; Juric, Damir; Korfiatis, Alex Papadopoulos; Moramarco, Francesco; Perera, Mark; Reiter, Ehud; Savkov, Aleksandar

Publication date: 1 January 2022

Abstract

In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet there is little work on how to properly evaluate the generated notes to understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this, we present an extensive human evaluation study of consultation notes in which 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study between 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BertScore. All our findings and annotations are open-sourced.

Comment: To be published in proceedings of ACL 202
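
For readers unfamiliar with the character-based Levenshtein metric named in the abstract, the following is a minimal Python sketch of the general technique, not the authors' implementation. The function names, the normalization by the longer string's length, and the two sample notes are illustrative assumptions; the abstract does not state how the distance was normalized or applied.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_similarity(generated: str, reference: str) -> float:
    """Map the distance to [0, 1], where 1.0 means identical strings.
    Dividing by the longer length is an assumption made for this sketch."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(generated, reference) / longest


if __name__ == "__main__":
    # Hypothetical notes, invented purely for illustration.
    generated_note = "Patient reports headache for 3 days."
    reference_note = "Patient reports a headache for three days."
    print(levenshtein(generated_note, reference_note))
    print(round(normalized_similarity(generated_note, reference_note), 3))
```

In a correlation study of the kind the abstract describes, scores such as these would be computed for each generated note and then compared against the clinicians' judgements with a rank or linear correlation coefficient.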

Available Versions

Research Archive

oai:aura.abdn.ac.uk:2164/20151

Last time updated on 27/02/2023

DCU Online Research Access Service

oai:doras.dcu.ie:28644

Last time updated on 02/08/2023