Benchmarking ChatGPT-4 on ACR Radiation Oncology In-Training (TXIT) Exam
and Red Journal Gray Zone Cases: Potentials and Challenges for AI-Assisted
Medical Education and Decision Making in Radiation Oncology
The potential of large language models in medicine for education and
decision-making purposes has been demonstrated, as they achieve decent scores on medical
exams such as the United States Medical Licensing Exam (USMLE) and the MedQA
exam. In this work, we evaluate the performance of ChatGPT-4 in the specialized
field of radiation oncology using the 38th American College of Radiology (ACR)
radiation oncology in-training (TXIT) exam and the 2022 Red Journal gray zone
cases. For the TXIT exam, ChatGPT-3.5 and ChatGPT-4 have achieved the scores of
63.65% and 74.57%, respectively, highlighting the advantage of the latest
ChatGPT-4 model. Based on the TXIT exam, ChatGPT-4's strong and weak areas in
radiation oncology are identified to some extent. Specifically, ChatGPT-4
demonstrates good knowledge of statistics, CNS & eye, pediatrics, biology, and
physics, but has limitations in bone & soft tissue and gynecology, as
categorized by the ACR knowledge domains. Regarding clinical care paths,
ChatGPT-4 performs well in diagnosis, prognosis, and toxicity, but lacks
proficiency in topics related to brachytherapy and dosimetry, as well as in
answering in-depth questions drawn from clinical
trials. For the gray zone cases, ChatGPT-4 is able to suggest a personalized
treatment approach to each case with high correctness and comprehensiveness.
Most importantly, it provides novel treatment aspects for many cases, which are
not suggested by any human experts. Both evaluations demonstrate the potential
of ChatGPT-4 in medical education for the general public and cancer patients,
as well as its potential to aid clinical decision-making, while also revealing
its limitations in certain domains. Because of the risk of hallucination,
facts provided by ChatGPT must always be verified.