Towards Accurate Differential Diagnosis with Large Language Models
An accurate differential diagnosis (DDx) is a cornerstone of medical care,
often reached through an iterative process of interpretation that combines
clinical history, physical examination, investigations and procedures.
Interactive interfaces powered by Large Language Models (LLMs) present new
opportunities to both assist and automate aspects of this process. In this
study, we introduce an LLM optimized for diagnostic reasoning, and evaluate its
ability to generate a DDx alone or as an aid to clinicians. Twenty clinicians
evaluated 302 challenging, real-world medical cases sourced from the New
England Journal of Medicine (NEJM) case reports. Each case report was read by
two clinicians, who were randomized to one of two assistive conditions: either
assistance from search engines and standard medical resources, or LLM
assistance in addition to these tools. All clinicians provided a baseline,
unassisted DDx prior to using the respective assistive tools. Our LLM for DDx
exhibited standalone performance that exceeded that of unassisted clinicians
(top-10 accuracy 59.1% vs. 33.6%, p = 0.04). Comparing the two assisted study
arms, the DDx quality score was higher for clinicians assisted by our LLM
(top-10 accuracy 51.7%) than for clinicians without its assistance (36.1%;
McNemar's test statistic 45.7, p < 0.01) and for clinicians assisted by search
alone (44.4%; McNemar's test statistic 4.75, p = 0.03). Further, clinicians
assisted by our LLM arrived at more comprehensive
differential lists than those without its assistance. Our study suggests that
our LLM for DDx has potential to improve clinicians' diagnostic reasoning and
accuracy in challenging cases, meriting further real-world evaluation for its
ability to empower physicians and widen patients' access to specialist-level
expertise.
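To make the headline metrics concrete, here is a minimal sketch of how top-k
accuracy over differential lists and a McNemar's test on paired per-case
correctness might be computed; the data layout and function names are
illustrative assumptions, not the study's actual evaluation code.

    # Illustrative sketch only; assumed data layout, not the study's code.
    from statsmodels.stats.contingency_tables import mcnemar

    def top_k_accuracy(ddx_lists, true_diagnoses, k=10):
        """Fraction of cases whose confirmed diagnosis appears in the top-k DDx list."""
        hits = sum(truth in ddx[:k] for ddx, truth in zip(ddx_lists, true_diagnoses))
        return hits / len(true_diagnoses)

    def mcnemar_on_paired_outcomes(correct_a, correct_b):
        """McNemar's test on per-case correctness under two paired conditions."""
        b = sum(a and not b_ for a, b_ in zip(correct_a, correct_b))  # A right, B wrong
        c = sum(b_ and not a for a, b_ in zip(correct_a, correct_b))  # B right, A wrong
        # Only the discordant cells drive the statistic; diagonal cells set to 0 here.
        return mcnemar([[0, b], [c, 0]], exact=False, correction=True)

McNemar's test is the natural choice here because both conditions score the
same set of cases, so per-case outcomes are paired rather than independent.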
Towards Generalist Biomedical AI
Medicine is inherently multimodal, with rich data modalities spanning text,
imaging, genomics, and more. Generalist biomedical artificial intelligence (AI)
systems that flexibly encode, integrate, and interpret this data at scale can
potentially enable impactful applications ranging from scientific discovery to
care delivery. To enable the development of these models, we first curate
MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses
14 diverse tasks such as medical question answering, mammography and
dermatology image interpretation, radiology report generation and
summarization, and genomic variant calling. We then introduce Med-PaLM
Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI
system. Med-PaLM M is a large multimodal generative model that flexibly encodes
and interprets biomedical data including clinical language, imaging, and
genomics with the same set of model weights. Med-PaLM M reaches performance
competitive with or exceeding the state of the art on all MultiMedBench tasks,
often surpassing specialist models by a wide margin. We also report examples of
zero-shot generalization to novel medical concepts and tasks, positive transfer
learning across tasks, and emergent zero-shot medical reasoning. To further
probe the capabilities and limitations of Med-PaLM M, we conduct a radiologist
evaluation of model-generated (and human) chest X-ray reports and observe
encouraging performance across model scales. In a side-by-side ranking on 246
retrospective chest X-rays, clinicians express a pairwise preference for
Med-PaLM M reports over those produced by radiologists in up to 40.50% of
cases, suggesting potential clinical utility. While considerable work is needed
to validate these models in real-world use cases, our results represent a
milestone towards the development of generalist biomedical AI systems.
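As a small illustration, a pairwise preference rate like the 40.50% figure
could be tallied from side-by-side rankings as below; the per-comparison
labels and the tie-handling convention are assumptions for illustration, not
taken from the paper.

    def pairwise_preference_rate(rankings):
        """Fraction of side-by-side comparisons in which the model report won.

        rankings: iterable of "model", "radiologist", or "tie" per comparison;
        under this (assumed) convention, ties count against the model.
        """
        rankings = list(rankings)
        return sum(r == "model" for r in rankings) / len(rankings)

    # e.g. pairwise_preference_rate(["model", "tie", "radiologist", "model"]) == 0.5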
A communication aid with context-aware vocabulary prediction
Thesis (M.Eng.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002. Includes bibliographical references (p. 131-136). By Ewa Dominowska.
Towards Expert-Level Medical Question Answering with Large Language Models
Recent artificial intelligence (AI) systems have reached milestones in "grand
challenges" ranging from Go to protein-folding. The capability to retrieve
medical knowledge, reason over it, and answer medical questions comparably to
physicians has long been viewed as one such grand challenge.
Large language models (LLMs) have catalyzed significant progress in medical
question answering; Med-PaLM was the first model to exceed a "passing" score on
US Medical Licensing Examination (USMLE)-style questions, scoring 67.2%
on the MedQA dataset. However, this and other prior work suggested significant
room for improvement, especially when models' answers were compared to
clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by
leveraging a combination of base LLM improvements (PaLM 2), medical domain
finetuning, and prompting strategies including a novel ensemble refinement
approach.
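As a rough illustration of what an ensemble refinement loop could look like in
practice, here is a sketch based only on the high-level description above; the
generate callable, prompt wording, and sample counts are our assumptions, not
Med-PaLM 2 internals.

    # Hedged sketch of ensemble refinement: sample several reasoning drafts,
    # then ask the model to refine an answer conditioned on its own drafts.
    from collections import Counter

    def ensemble_refinement(question, generate, n_drafts=8, n_refinements=8):
        # Stage 1: sample diverse drafts at nonzero temperature.
        drafts = [generate(question, temperature=0.7) for _ in range(n_drafts)]
        # Stage 2: condition the model on its own drafts, ask for a refined
        # answer, and take a plurality vote over repeated refinements.
        prompt = question + "\n\nCandidate answers:\n" + "\n---\n".join(drafts)
        refined = [generate(prompt + "\n\nRefined answer:", temperature=0.7)
                   for _ in range(n_refinements)]
        return Counter(refined).most_common(1)[0][0]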
Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM
by over 19% and setting a new state of the art. We also observed performance
approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU
clinical topics datasets.
We performed detailed human evaluations on long-form questions along multiple
axes relevant to clinical applications. In pairwise comparative ranking of 1066
consumer medical questions, physicians preferred Med-PaLM 2 answers to those
produced by physicians on eight of nine axes pertaining to clinical utility (p
< 0.001). On newly introduced datasets of 240 long-form "adversarial" questions
designed to probe LLM limitations, we also observed significant improvements
over Med-PaLM on every evaluation axis (p < 0.001).
While further studies are necessary to validate the efficacy of these models
in real-world settings, these results highlight rapid progress towards
physician-level performance in medical question answering.