BiomedJourney: Counterfactual Biomedical Image Generation by Instruction-Learning from Multimodal Patient Journeys
Rapid progress has been made in instruction-learning for image editing with
natural-language instructions, as exemplified by InstructPix2Pix. In
biomedicine, such methods can be applied to counterfactual image generation,
which helps differentiate causal structure from spurious correlation and
facilitate robust image interpretation for disease progression modeling.
However, generic image-editing models are ill-suited for the biomedical domain,
and counterfactual biomedical image generation is largely underexplored. In
this paper, we present BiomedJourney, a novel method for counterfactual
biomedical image generation by instruction-learning from multimodal patient
journeys. Given a patient with two biomedical images taken at different time
points, we use GPT-4 to process the corresponding imaging reports and generate
a natural language description of disease progression. The resulting triples
(prior image, progression description, new image) are then used to train a
latent diffusion model for counterfactual biomedical image generation. Given
the relative scarcity of image time series data, we introduce a two-stage
curriculum that first pretrains the denoising network using the much more
abundant single image-report pairs (with dummy prior image), and then continues
training using the counterfactual triples. Experiments using the standard
MIMIC-CXR dataset demonstrate the promise of our method. In a comprehensive
battery of tests on counterfactual medical image generation, BiomedJourney
substantially outperforms prior state-of-the-art methods in instruction image
editing and medical image generation such as InstructPix2Pix and RoentGen. To
facilitate future study in counterfactual medical generation, we plan to
release our instruction-learning code and pretrained models.
Comment: Project page & demo: https://aka.ms/biomedjourne
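As a rough illustration of the data construction this abstract describes, the
sketch below pairs consecutive studies into (prior image, progression
description, new image) triples and builds the dummy-prior examples used in
stage-one pretraining. The `Study` fields and `gpt4_summarize_progression` are
hypothetical stand-ins, not the released code.

```python
# A minimal sketch of the two-stage data construction, assuming hypothetical
# helpers; `image` fields stand in for encoded chest X-rays.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Study:
    image: object        # one biomedical image (e.g., a chest X-ray tensor)
    report: str          # the associated imaging report
    timestamp: float

@dataclass
class Triple:
    prior_image: Optional[object]  # None marks the dummy prior used in stage 1
    progression: str               # natural-language progression description
    new_image: object

def gpt4_summarize_progression(report_a: str, report_b: str) -> str:
    """Hypothetical GPT-4 call: describe how findings changed between reports."""
    raise NotImplementedError

def build_counterfactual_triples(journey: List[Study]) -> List[Triple]:
    """Stage 2 data: pair consecutive studies of one patient into triples."""
    ordered = sorted(journey, key=lambda s: s.timestamp)
    return [
        Triple(prior.image,
               gpt4_summarize_progression(prior.report, new.report),
               new.image)
        for prior, new in zip(ordered, ordered[1:])
    ]

def build_pretraining_pairs(studies: List[Study]) -> List[Triple]:
    """Stage 1 data: abundant single image-report pairs with a dummy prior."""
    return [Triple(None, s.report, s.image) for s in studies]
```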
Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events
Large language models (LLMs), such as GPT-4, have demonstrated remarkable
capabilities across a wide range of tasks, including health applications. In
this paper, we study how LLMs can be used to scale biomedical knowledge
curation. We find that while LLMs already possess decent competency in
structuring biomedical text, distilling them into a task-specific student
model through self-supervised learning yields substantial gains over
out-of-the-box LLMs, with additional advantages in cost, efficiency, and
white-box model access.
We conduct a case study on adverse drug event (ADE) extraction, which is an
important area for improving care. On standard ADE extraction evaluation, a
GPT-3.5-distilled PubMedBERT model attained accuracy comparable to supervised
state-of-the-art models without using any labeled data. Despite being over
1,000 times smaller, the distilled model outperformed its teacher GPT-3.5 by
over 6 absolute points in F1 and GPT-4 by over 5 absolute points.
Ablation studies on distillation model choice (e.g., PubMedBERT vs BioGPT)
and ADE extraction architecture shed light on best practice for biomedical
knowledge extraction. Similar gains were attained by distillation for other
standard biomedical knowledge extraction tasks such as gene-disease
associations and protected health information, further illustrating the promise
of this approach.
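The distillation recipe can be sketched compactly: the teacher LLM annotates
raw clinical text, and the resulting silver annotations train a small student.
The sketch below assumes ADE extraction cast as BIO token tagging; `call_gpt35`
and the prompt wording are hypothetical, not the paper's code.

```python
# A minimal sketch of LLM distillation via self-supervision, under the
# assumption that ADE extraction is framed as BIO token tagging.
from typing import List, Tuple

def call_gpt35(prompt: str) -> str:
    """Hypothetical teacher-LLM call; returns the completion as text."""
    raise NotImplementedError

def teacher_tag(note: str) -> List[Tuple[str, str]]:
    """Ask the teacher to tag tokens as B-ADE/I-ADE/B-DRUG/I-DRUG/O."""
    completion = call_gpt35(
        "Tag adverse drug events and drugs in the sentence below using BIO "
        "labels, one 'token<TAB>label' pair per line.\n\n" + note
    )
    # Each output line is expected to be "token\tlabel".
    return [tuple(line.split("\t")) for line in completion.splitlines()]

def build_silver_corpus(unlabeled_notes: List[str]) -> List[List[Tuple[str, str]]]:
    """Self-supervision: the teacher annotates raw text; no human labels."""
    return [teacher_tag(note) for note in unlabeled_notes]

# The silver corpus then fine-tunes a compact student such as PubMedBERT with
# a standard token-classification head; only the small, white-box student is
# deployed at inference time.
```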
Scaling Clinical Trial Matching Using Large Language Models: A Case Study in Oncology
Clinical trial matching is a key process in health delivery and discovery. In
practice, it is plagued by overwhelming unstructured data and unscalable manual
processing. In this paper, we conduct a systematic study on scaling clinical
trial matching using large language models (LLMs), with oncology as the focus
area. Our study is grounded in a clinical trial matching system currently in
test deployment at a large U.S. health network. Initial findings are promising:
out of the box, cutting-edge LLMs, such as GPT-4, can already structure elaborate
eligibility criteria of clinical trials and extract complex matching logic
(e.g., nested AND/OR/NOT). While still far from perfect, LLMs substantially
outperform prior strong baselines and may serve as a preliminary solution to
help triage patient-trial candidates with humans in the loop. Our study also
reveals a few significant growth areas for applying LLMs to end-to-end clinical
trial matching, such as context limitation and accuracy, especially in
structuring patient information from longitudinal medical records.
Comment: 24 pages, 5 figures, accepted at Machine Learning for Healthcare (MLHC) 2023
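One way to picture the extracted matching logic is as a nested AND/OR/NOT tree
over atomic criteria, evaluated against a structured patient record. The sketch
below is illustrative only; the schema and operators are assumptions, not the
deployed system's representation.

```python
# A minimal sketch of nested AND/OR/NOT eligibility logic; the attribute
# names, operators, and example rule are all hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List, Union

@dataclass
class Criterion:
    attribute: str            # e.g., "diagnosis", "ECOG", "prior_therapy"
    op: str                   # "eq", "lte", or "contains"
    value: object

@dataclass
class Clause:
    logic: str                # "AND" | "OR" | "NOT"
    children: List[Union["Clause", Criterion]] = field(default_factory=list)

def eval_criterion(c: Criterion, patient: Dict) -> bool:
    v = patient.get(c.attribute)
    if c.op == "eq":
        return v == c.value
    if c.op == "lte":
        return v is not None and v <= c.value
    if c.op == "contains":
        return c.value in (v or [])
    raise ValueError(f"unknown op {c.op}")

def eval_clause(node: Union[Clause, Criterion], patient: Dict) -> bool:
    if isinstance(node, Criterion):
        return eval_criterion(node, patient)
    results = [eval_clause(ch, patient) for ch in node.children]
    if node.logic == "AND":
        return all(results)
    if node.logic == "OR":
        return any(results)
    if node.logic == "NOT":
        return not results[0]
    raise ValueError(f"unknown logic {node.logic}")

# Example: "metastatic NSCLC AND ECOG <= 1 AND NOT prior osimertinib"
rule = Clause("AND", [
    Criterion("diagnosis", "eq", "metastatic NSCLC"),
    Criterion("ECOG", "lte", 1),
    Clause("NOT", [Criterion("prior_therapy", "contains", "osimertinib")]),
])
patient = {"diagnosis": "metastatic NSCLC", "ECOG": 1,
           "prior_therapy": ["carboplatin"]}
assert eval_clause(rule, patient)
```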
Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
Generalist foundation models such as GPT-4 have displayed surprising
capabilities in a wide variety of domains and tasks. Yet, there is a prevalent
assumption that they cannot match specialist capabilities of fine-tuned models.
For example, most explorations to date on medical competency benchmarks have
leveraged domain-specific training, as exemplified by efforts on BioGPT and
Med-PaLM. We build on a prior study of GPT-4's capabilities on medical
challenge benchmarks in the absence of special training. Rather than using
simple prompting to highlight the model's out-of-the-box capabilities, we
perform a systematic exploration of prompt engineering. We find that prompting
innovation can unlock deeper specialist capabilities and show that GPT-4 easily
tops prior leading results for medical benchmarks. The prompting methods we
explore are general purpose, and make no specific use of domain expertise,
removing the need for expert-curated content. Our experimental design carefully
controls for overfitting during the prompt engineering process. We introduce
Medprompt, based on a composition of several prompting strategies. With
Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark
datasets in the MultiMedQA suite. The method outperforms leading specialist
models such as Med-PaLM 2 by a significant margin with an order of magnitude
fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27%
reduction in error rate on the MedQA dataset over the best methods to date
achieved with specialist models and surpasses a score of 90% for the first
time. Beyond medical problems, we show the power of Medprompt to generalize to
other domains and provide evidence for the broad applicability of the approach
via studies of the strategy on exams in electrical engineering, machine
learning, philosophy, accounting, law, nursing, and clinical psychology.
Comment: 21 pages, 7 figures
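The abstract does not spell out the ingredients of Medprompt; the paper
combines dynamically selected (kNN) few-shot examples, chain-of-thought
reasoning, and choice-shuffle ensembling. Below is a minimal sketch of that
composition; `embed` and `ask_llm` are hypothetical API wrappers, and mapping
each shuffled answer back to its original choice is elided.

```python
# A minimal Medprompt-style sketch: kNN few-shot selection + chain of thought
# + choice-shuffle ensembling with a majority vote. Helper calls are
# hypothetical placeholders, not a real API.
import random
from collections import Counter
from typing import Dict, List

def embed(text: str) -> List[float]:
    """Hypothetical text-embedding call."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Hypothetical GPT-4 call returning an answer after a rationale."""
    raise NotImplementedError

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5))

def knn_fewshot(question: str, pool: List[Dict], k: int = 5) -> List[Dict]:
    """Pick the k training questions most similar to the test question.
    Pool items carry precomputed 'embedding' and self-generated 'cot' fields."""
    q = embed(question)
    return sorted(pool, key=lambda ex: cosine(q, ex["embedding"]), reverse=True)[:k]

def medprompt_answer(question: str, choices: List[str], pool: List[Dict],
                     n_votes: int = 5) -> str:
    shots = knn_fewshot(question, pool)
    votes = []
    for _ in range(n_votes):
        shuffled = random.sample(choices, len(choices))  # choice shuffling
        prompt = "".join(
            f"Q: {ex['question']}\nReasoning: {ex['cot']}\nAnswer: {ex['answer']}\n\n"
            for ex in shots
        )
        prompt += f"Q: {question}\nChoices: {shuffled}\nThink step by step, then answer."
        votes.append(ask_llm(prompt))  # assume answer is mapped back to choice text
    return Counter(votes).most_common(1)[0][0]  # majority vote over shuffles
```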
Exploring the Boundaries of GPT-4 in Radiology
The recent success of general-domain large language models (LLMs) has
significantly changed the natural language processing paradigm towards a
unified foundation model across domains and applications. In this paper, we
focus on assessing the performance of GPT-4, the most capable LLM so far, on
the text-based applications for radiology reports, comparing against
state-of-the-art (SOTA) radiology-specific models. Exploring various prompting
strategies, we evaluated GPT-4 on a diverse range of common radiology tasks and
we found GPT-4 either outperforms or is on par with current SOTA radiology
models. With zero-shot prompting, GPT-4 already obtains substantial gains
(≈10% absolute improvement) over radiology models in temporal sentence
similarity classification (accuracy) and natural language inference (F1).
For tasks that require learning dataset-specific style or schema (e.g. findings
summarisation), GPT-4 improves with example-based prompting and matches
supervised SOTA. Our extensive error analysis with a board-certified
radiologist shows GPT-4 has a sufficient level of radiology knowledge with only
occasional errors in complex context that require nuanced domain knowledge. For
findings summarisation, GPT-4 outputs are found to be overall comparable with
existing manually-written impressions.
Comment: EMNLP 2023 main
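For style-sensitive tasks such as findings summarisation, example-based
prompting amounts to prepending in-context report/impression pairs. A minimal
sketch, assuming a hypothetical `ask_gpt4` wrapper:

```python
# A minimal sketch of example-based prompting for findings summarisation;
# the prompt wording and `ask_gpt4` wrapper are illustrative assumptions.
from typing import List, Tuple

def ask_gpt4(prompt: str) -> str:
    """Hypothetical GPT-4 completion call."""
    raise NotImplementedError

def summarise_findings(findings: str, examples: List[Tuple[str, str]]) -> str:
    """Few-shot prompt: (findings, impression) pairs, then the new case.
    The in-context pairs steer the model toward the dataset-specific style."""
    prompt = "Summarise the radiology findings as an impression.\n\n"
    for ex_findings, ex_impression in examples:
        prompt += f"Findings: {ex_findings}\nImpression: {ex_impression}\n\n"
    prompt += f"Findings: {findings}\nImpression:"
    return ask_gpt4(prompt)
```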
Toward structuring real-world data: Deep learning for extracting oncology information from clinical text with patient-level supervision.
Most detailed patient information in real-world data (RWD) is only
consistently available in free-text clinical documents. Manual curation is
expensive and time-consuming. Developing natural language processing (NLP)
methods for structuring RWD is thus essential for scaling real-world evidence
generation. We propose leveraging patient-level supervision from medical
registries, which are often readily available and capture key patient
information, for general RWD applications. We conduct an extensive study on
135,107 patients from the cancer registry of a large integrated delivery
network (IDN) comprising healthcare systems in five western US states. Our
deep-learning methods attain test area under the receiver operating
characteristic curve (AUROC) values of 94%-99% for key tumor attributes and
comparable performance on held-out data from separate health systems and
states. Ablation results demonstrate the superiority of these advanced
deep-learning methods. Error analysis shows that our NLP system sometimes even
corrects errors in registrar labels.
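Patient-level supervision can be read as a weak-labeling setup: each patient
contributes a bag of clinical documents and a single registry label per tumor
attribute, so no per-document annotation is needed. Below is a minimal PyTorch
sketch of attention pooling over document encodings, offered as an
illustration of the idea rather than the paper's architecture.

```python
# A minimal sketch of patient-level supervision: pool a patient's document
# embeddings, then apply the registry-derived label to the whole bag. The
# dimensions and pooling choice here are assumptions, not the paper's model.
import torch
import torch.nn as nn

class PatientLevelClassifier(nn.Module):
    def __init__(self, doc_dim: int = 768, n_classes: int = 5):
        super().__init__()
        self.attn = nn.Linear(doc_dim, 1)      # attention pooling over documents
        self.head = nn.Linear(doc_dim, n_classes)

    def forward(self, doc_embeddings: torch.Tensor) -> torch.Tensor:
        # doc_embeddings: (n_docs, doc_dim) for one patient
        weights = torch.softmax(self.attn(doc_embeddings), dim=0)  # (n_docs, 1)
        patient_vec = (weights * doc_embeddings).sum(dim=0)        # (doc_dim,)
        return self.head(patient_vec)                              # class logits

# One training step: the registry value for, e.g., tumor site supervises the
# entire bag of encoded notes for a patient.
model = PatientLevelClassifier()
loss_fn = nn.CrossEntropyLoss()
docs = torch.randn(12, 768)                  # 12 encoded notes for one patient
label = torch.tensor(3)                      # registry-derived class index
loss = loss_fn(model(docs).unsqueeze(0), label.unsqueeze(0))
loss.backward()
```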