Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
Generalist foundation models such as GPT-4 have displayed surprising
capabilities in a wide variety of domains and tasks. Yet, there is a prevalent
assumption that they cannot match specialist capabilities of fine-tuned models.
For example, most explorations to date on medical competency benchmarks have
leveraged domain-specific training, as exemplified by efforts on BioGPT and
Med-PaLM. We build on a prior study of GPT-4's capabilities on medical
challenge benchmarks in the absence of special training. Rather than using
simple prompting to highlight the model's out-of-the-box capabilities, we
perform a systematic exploration of prompt engineering. We find that prompting
innovation can unlock deeper specialist capabilities and show that GPT-4 easily
tops prior leading results for medical benchmarks. The prompting methods we
explore are general purpose, and make no specific use of domain expertise,
removing the need for expert-curated content. Our experimental design carefully
controls for overfitting during the prompt engineering process. We introduce
Medprompt, based on a composition of several prompting strategies. With
Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark
datasets in the MultiMedQA suite. The method outperforms leading specialist
models such as Med-PaLM 2 by a significant margin with an order of magnitude
fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27%
reduction in error rate on the MedQA dataset over the best methods to date
achieved with specialist models and surpasses a score of 90% for the first
time. Beyond medical problems, we show the power of Medprompt to generalize to
other domains and provide evidence for the broad applicability of the approach
via studies of the strategy on exams in electrical engineering, machine
learning, philosophy, accounting, law, nursing, and clinical psychology.
Comment: 21 pages, 7 figures
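The "reduction in error rate" quoted in the abstract above is a relative figure, computed on error rates rather than on accuracies. A minimal sketch of that arithmetic follows; the function name and the accuracy values are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Return the fraction by which the error rate shrinks when accuracy rises.

    Both arguments are accuracies in [0, 1); error rate is 1 - accuracy.
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline accuracy must be below 1.0")
    return (baseline_error - new_error) / baseline_error


# Hypothetical accuracies, for illustration only (not results from the paper):
# moving from 80% to 85% accuracy cuts the error rate from 20% to 15%.
reduction = relative_error_reduction(0.80, 0.85)
print(f"relative error reduction: {reduction:.0%}")
```

A five-point gain in accuracy corresponds here to a much larger relative drop in errors, which is why the two kinds of figure should not be compared directly.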
International evaluation of an AI system for breast cancer screening.
Screening mammography aims to identify breast cancer at earlier stages of the disease, when treatment can be more successful [1]. Despite the existence of screening programmes worldwide, the interpretation of mammograms is affected by high rates of false positives and false negatives [2]. Here we present an artificial intelligence (AI) system that is capable of surpassing human experts in breast cancer prediction. To assess its performance in the clinical setting, we curated a large representative dataset from the UK and a large enriched dataset from the USA. We show an absolute reduction of 5.7% and 1.2% (USA and UK) in false positives and 9.4% and 2.7% in false negatives. We provide evidence of the ability of the system to generalize from the UK to the USA. In an independent study of six radiologists, the AI system outperformed all of the human readers: the area under the receiver operating characteristic curve (AUC-ROC) for the AI system was greater than the AUC-ROC for the average radiologist by an absolute margin of 11.5%. We ran a simulation in which the AI system participated in the double-reading process that is used in the UK, and found that the AI system maintained non-inferior performance and reduced the workload of the second reader by 88%. This robust assessment of the AI system paves the way for clinical trials to improve the accuracy and efficiency of breast cancer screening.
Professor Fiona Gilbert receives funding from the National Institute for Health Research (Senior Investigator award).
Not all green space is created equal: biodiversity predicts psychological restorative benefits from urban green space
Contemporary epidemiological methods testing the associations between green space and psychological well-being treat all vegetation cover as equal. However, there is very good reason to expect that variations in ecological "quality" (number of species, integrity of ecological processes) may influence the link between access to green space and benefits to human health and well-being. We test the relationship between green space quality and restorative benefit in an inner-city urban population in Bradford, UK. We selected 12 urban parks for study where we carried out botanical and faunal surveys to quantify biodiversity and assessed the site facilities of the green space (cleanliness, provision of amenities). We also conducted 128 surveys with park users to quantify psychological restoration based on four self-reported measures of general restoration, attention-grabbing distractions, being away from everyday life, and site preference. We present three key results. First, there is a positive association between site facilities and biodiversity. Second, restorative benefit is predicted by biodiversity, which explained 43% of the variance in restorative benefit across the parks, with minimal input from other variables. Third, the benefits accrued through access to green space were unrelated to age, gender, and ethnic background. The results add to a small but growing body of evidence that emphasises the role of nature in contributing to the well-being of urban populations and, hence, the need to consider biodiversity in the design of landscapes that enhance multiple ecosystem services.
Robust and Efficient Medical Imaging with Self-Supervision
Recent progress in Medical Artificial Intelligence (AI) has delivered systems
that can reach clinical expert level performance. However, such systems tend to
demonstrate sub-optimal "out-of-distribution" performance when evaluated in
clinical settings different from the training environment. A common mitigation
strategy is to develop separate systems for each clinical setting using
site-specific data [1]. However, this quickly becomes impractical as medical
data is time-consuming to acquire and expensive to annotate [2]. Thus, the
problem of "data-efficient generalization" presents an ongoing difficulty for
Medical AI development. Although progress in representation learning shows
promise, its benefits have not been rigorously studied, specifically for
out-of-distribution settings. To meet these challenges, we present REMEDIS, a
unified representation learning strategy to improve robustness and
data-efficiency of medical imaging AI. REMEDIS uses a generic combination of
large-scale supervised transfer learning with self-supervised learning and
requires little task-specific customization. We study a diverse range of
medical imaging tasks and simulate three realistic application scenarios using
retrospective data. REMEDIS exhibits significantly improved in-distribution
performance with up to 11.5% relative improvement in diagnostic accuracy over a
strong supervised baseline. More importantly, our strategy leads to strong
data-efficient generalization of medical imaging AI, matching strong supervised
baselines using between 1% and 33% of retraining data across tasks. These
results suggest that REMEDIS can significantly accelerate the life-cycle of
medical imaging AI development, thereby presenting an important step forward for
medical imaging AI to deliver broad impact.