Benchmarking ChatGPT-4 on ACR Radiation Oncology In-Training (TXIT) Exam
and Red Journal Gray Zone Cases: Potentials and Challenges for AI-Assisted
Medical Education and Decision Making in Radiation Oncology
The potential of large language models in medicine for education and
decision-making purposes has been demonstrated, as they achieve decent scores on medical
exams such as the United States Medical Licensing Exam (USMLE) and the MedQA
exam. In this work, we evaluate the performance of ChatGPT-4 in the specialized
field of radiation oncology using the 38th American College of Radiology (ACR)
radiation oncology in-training (TXIT) exam and the 2022 Red Journal gray zone
cases. For the TXIT exam, ChatGPT-3.5 and ChatGPT-4 have achieved the scores of
63.65% and 74.57%, respectively, highlighting the advantage of the latest
ChatGPT-4 model. Based on the TXIT exam, ChatGPT-4's strong and weak areas in
radiation oncology are identified to some extent. Specifically, ChatGPT-4
demonstrates good knowledge of statistics, CNS & eye, pediatrics, biology, and
physics, but has limitations in bone & soft tissue and gynecology, as
categorized by the ACR knowledge domains. Regarding clinical care paths,
ChatGPT-4 performs well in diagnosis, prognosis, and toxicity, but lacks
proficiency in topics related to brachytherapy and dosimetry, as well as in
answering in-depth questions drawn from clinical
trials. For the gray zone cases, ChatGPT-4 is able to suggest a personalized
treatment approach to each case with high correctness and comprehensiveness.
Most importantly, it provides novel treatment aspects for many cases, which are
not suggested by any human experts. Both evaluations demonstrate the potential
of ChatGPT-4 in medical education for the general public and cancer patients,
as well as its potential to aid clinical decision-making, while also revealing
its limitations in certain domains. Because of the risk of hallucination,
facts provided by ChatGPT must always be verified.