Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification
Recent advances in large language models (LLMs) have shown impressive ability
in biomedical question-answering, but have not been adequately investigated for
more specific biomedical applications. This study investigates the performance
of LLMs such as the ChatGPT family of models (GPT-3.5s, GPT-4) in biomedical
tasks beyond question-answering. Because no patient data can be passed to the
OpenAI API public interface, we evaluated model performance with over 10,000
samples as proxies for two fundamental tasks in the clinical domain:
classification and reasoning. The first task is classifying whether statements
of clinical and policy recommendations in scientific literature constitute
health advice. The second task is causal relation detection from the biomedical
literature. We compared LLMs with simpler models, such as bag-of-words (BoW)
with logistic regression, and fine-tuned BioBERT models. Despite the excitement
around viral ChatGPT, we found that fine-tuning for two fundamental NLP tasks
remained the best strategy. The simple BoW model performed on par with the most
complex LLM prompting. Prompt engineering required significant investment.
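As a point of reference for the baseline mentioned above, the following is a minimal,
illustrative sketch of a bag-of-words classifier with logistic regression for
health-advice detection; the example texts, labels, and hyperparameters are invented
and are not those of the study.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentences standing in for statements from scientific abstracts.
texts = [
    "Patients should receive annual influenza vaccination.",   # health advice
    "The cohort included 1,204 adults with type 2 diabetes.",  # no advice
    "Clinicians ought to counsel smokers about cessation.",    # health advice
    "Mean follow-up was 4.2 years across both study arms.",    # no advice
]
labels = [1, 0, 1, 0]

# Bag-of-words features (unigrams and bigrams) feeding a logistic regression.
bow_clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
bow_clf.fit(texts, labels)
print(bow_clf.predict(["We recommend statin therapy for high-risk patients."]))

Such a baseline is cheap to train and, as the study reports, can match far more
elaborate prompting pipelines on well-defined classification tasks.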
Enrichment of the NLST and NSCLC-Radiomics computed tomography collections with AI-derived annotations
Public imaging datasets are critical for the development and evaluation of
automated tools in cancer imaging. Unfortunately, many do not include
annotations or image-derived features, complicating their downstream analysis.
Artificial intelligence-based annotation tools have been shown to achieve
acceptable performance and thus can be used to automatically annotate large
datasets. As part of the effort to enrich public data available within NCI
Imaging Data Commons (IDC), here we introduce AI-generated annotations for two
collections of computed tomography images of the chest, NSCLC-Radiomics and
the National Lung Screening Trial (NLST). Using publicly available AI algorithms, we
derived volumetric annotations of thoracic organs at risk, their corresponding
radiomics features, and slice-level annotations of anatomical landmarks and
regions. The resulting annotations are publicly available within IDC, where the
DICOM format is used to harmonize the data and achieve FAIR principles. The
annotations are accompanied by cloud-enabled notebooks demonstrating their use.
This study reinforces the need for large, publicly accessible curated datasets
and demonstrates how AI can be used to aid in cancer imaging.
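As an illustration of the kind of radiomics feature extraction described above, the
sketch below uses the open-source pyradiomics package on an image/segmentation pair;
the file names are hypothetical, and the study's actual pipeline and parameters may
differ (the IDC data themselves are distributed in DICOM format and would first need
conversion to a SimpleITK-readable format such as NRRD).

from radiomics import featureextractor

# Configure an extractor; binWidth controls intensity discretization.
extractor = featureextractor.RadiomicsFeatureExtractor(binWidth=25)
extractor.enableFeatureClassByName("firstorder")
extractor.enableFeatureClassByName("shape")

# CT volume and organ-at-risk segmentation; paths are placeholders.
features = extractor.execute("ct_volume.nrrd", "heart_segmentation.nrrd")
for name, value in features.items():
    if not name.startswith("diagnostics_"):
        print(name, value)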
LongHealth: A Question Answering Benchmark with Long Clinical Documents
Background: Recent advancements in large language models (LLMs) offer
potential benefits in healthcare, particularly in processing extensive patient
records. However, existing benchmarks do not fully assess LLMs' capability in
handling real-world, lengthy clinical data.
Methods: We present the LongHealth benchmark, comprising 20 detailed
fictional patient cases across various diseases, with each case containing
5,090 to 6,754 words. The benchmark poses 400 multiple-choice questions in
three categories: information extraction, negation, and sorting, requiring
LLMs to extract and interpret information from large clinical documents.
Results: We evaluated nine open-source LLMs supporting context lengths of at
least 16,000 tokens and also included OpenAI's proprietary, cost-efficient
GPT-3.5 Turbo for
comparison. The highest accuracy was observed for Mixtral-8x7B-Instruct-v0.1,
particularly in tasks focused on information retrieval from single and multiple
patient documents. However, all models struggled significantly in tasks
requiring the identification of missing information, highlighting a critical
area for improvement in clinical data interpretation.
Conclusion: While LLMs show considerable potential for processing long
clinical documents, their current accuracy levels are insufficient for reliable
clinical use, especially in scenarios requiring the identification of missing
information. The LongHealth benchmark provides a more realistic assessment of
LLMs in a healthcare setting and highlights the need for further model
refinement for safe and effective clinical application.
We make the benchmark and evaluation code publicly available.
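To make the evaluation protocol concrete, here is a hedged sketch of how accuracy on
a multiple-choice benchmark of this kind could be scored; `ask_model` is a hypothetical
wrapper around any LLM API, and the case structure is assumed rather than taken from
the released code.

from typing import Callable

def evaluate(cases: list[dict], ask_model: Callable[[str], str]) -> float:
    """Each case holds a long patient document, a question, lettered answer
    options, and the letter of the correct answer."""
    correct = 0
    for case in cases:
        prompt = (
            f"{case['document']}\n\nQuestion: {case['question']}\n"
            + "\n".join(f"{letter}: {text}" for letter, text in case["options"].items())
            + "\nAnswer with a single letter."
        )
        reply = ask_model(prompt)
        prediction = reply.strip()[:1].upper()  # take the first letter of the reply
        correct += prediction == case["answer"]
    return correct / len(cases)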
Repeatability of Multiparametric Prostate MRI Radiomics Features
In this study we assessed the repeatability of the values of radiomics
features for small prostate tumors using test-retest Multiparametric Magnetic
Resonance Imaging (mpMRI) images. The premise of radiomics is that quantitative
image features can serve as biomarkers characterizing disease. For such
biomarkers to be useful, repeatability is a basic requirement: a feature's
value must remain stable between two scans acquired under stable conditions. We
investigated repeatability of radiomics features under various preprocessing
and extraction configurations including various image normalization schemes,
different image pre-filtering, 2D vs 3D texture computation, and different bin
widths for image discretization. Image registration, as a means of
re-identifying regions of interest across time points, was evaluated against
human-expert segmentations at both time points. Even though we found many radiomics
features and preprocessing combinations with a high repeatability (Intraclass
Correlation Coefficient (ICC) > 0.85), our results indicate that overall the
repeatability is highly sensitive to the processing parameters (under certain
configurations, the ICC can fall below 0.0). Image normalization, using a variety of
approaches considered, did not result in consistent improvements in
repeatability. There was also no consistent improvement of repeatability
through the use of pre-filtering options, or by using image registration
between timepoints to improve consistency of the region of interest
localization. Based on these results we urge caution when interpreting
radiomics features and advise paying close attention to the processing
configuration details of reported results. Furthermore, we advocate reporting
all processing details in radiomics studies and strongly recommend making the
implementation available.
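For readers unfamiliar with the repeatability metric used above, the following is a
small, self-contained computation of the two-way random-effects, absolute-agreement
ICC(2,1) on synthetic test-retest data; it illustrates the metric only and is not the
study's code.

import numpy as np

def icc_2_1(y: np.ndarray) -> float:
    """y has shape (n_subjects, n_scans); test-retest corresponds to n_scans = 2."""
    n, k = y.shape
    grand = y.mean()
    ss_total = ((y - grand) ** 2).sum()
    ss_rows = k * ((y.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((y.mean(axis=0) - grand) ** 2).sum()   # between scans
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

rng = np.random.default_rng(0)
test = rng.normal(10.0, 2.0, size=50)
retest = test + rng.normal(0.0, 0.5, size=50)    # a fairly repeatable feature
print(icc_2_1(np.column_stack([test, retest])))  # close to 1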
Natural language processing to automatically extract the presence and severity of esophagitis in notes of patients undergoing radiotherapy
Radiotherapy (RT) toxicities can impair survival and quality of life, yet
remain under-studied. Real-world evidence holds potential to improve our
understanding of toxicities, but toxicity information is often only in clinical
notes. We developed natural language processing (NLP) models to identify the
presence and severity of esophagitis from notes of patients treated with
thoracic RT. We fine-tuned statistical and pre-trained BERT-based models for
three esophagitis classification tasks: Task 1) presence of esophagitis, Task
2) severe esophagitis or not, and Task 3) no esophagitis vs. grade 1 vs. grade
2-3. Transferability was tested on 345 notes from patients with esophageal
cancer undergoing RT.
Fine-tuning PubMedBERT yielded the best performance. The best macro-F1 was
0.92, 0.82, and 0.74 for Task 1, 2, and 3, respectively. Selecting the most
informative note sections during fine-tuning improved macro-F1 by over 2% for
all tasks. Silver-labeled data improved the macro-F1 by over 3% across all
tasks. For the esophageal cancer notes, the best macro-F1 was 0.73, 0.74, and
0.65 for Task 1, 2, and 3, respectively, without additional fine-tuning.
To our knowledge, this is the first effort to automatically extract
esophagitis toxicity severity according to CTCAE guidelines from clinic notes.
The promising performance provides proof-of-concept for NLP-based automated
detailed toxicity monitoring in expanded domains.
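To give a concrete sense of the fine-tuning setup described above, here is a minimal
sketch using the Hugging Face transformers library for the three-class task (no
esophagitis vs. grade 1 vs. grade 2-3); the checkpoint name, example notes, and
hyperparameters are assumptions for illustration, not the study's configuration.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Toy notes; labels: 0 = no esophagitis, 1 = grade 1, 2 = grade 2-3.
notes = Dataset.from_dict({
    "text": ["No esophagitis reported this visit.",
             "Mild odynophagia, tolerating solids, continue current diet.",
             "Severe esophagitis, requires IV fluids and opioid analgesia."],
    "label": [0, 1, 2],
})
notes = notes.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="esophagitis-grading", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=notes,
    tokenizer=tokenizer,
)
trainer.train()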
Large Language Models to Identify Social Determinants of Health in Electronic Health Records
Social determinants of health (SDoH) have an important impact on patient
outcomes but are incompletely collected from the electronic health records
(EHR). This study investigated the ability of large language models to extract
SDoH from free text in EHRs, where they are most commonly documented, and
explored the role of synthetic clinical text for improving the extraction of
these scarcely documented, yet extremely valuable, clinical data. A total of 800 patient
notes were annotated for SDoH categories, and several transformer-based models
were evaluated. The study also experimented with synthetic data generation and
assessed models for algorithmic bias. Our best-performing models were fine-tuned
Flan-T5 XL (macro-F1 0.71) for any SDoH, and Flan-T5 XXL (macro-F1 0.70). The
benefit of augmenting fine-tuning with synthetic data varied across model
architecture and size, with smaller Flan-T5 models (base and large) showing the
greatest improvements in performance (delta F1 +0.12 to +0.23). Model
performance was similar on the in-hospital system dataset but worse on the
MIMIC-III dataset. Our best-performing fine-tuned models outperformed zero- and
few-shot performance of ChatGPT-family models for both tasks. These fine-tuned
models were less likely than ChatGPT to change their prediction when
race/ethnicity and gender descriptors were added to the text, suggesting less
algorithmic bias (p<0.05). At the patient level, our models identified 93.8% of
patients with adverse SDoH, while ICD-10 codes captured 2.0%. Our method can
effectively extract SDoH information from clinical notes, performing better
than GPT models in zero- and few-shot settings. These models could enhance
real-world evidence on SDoH and aid in identifying patients needing social
support.
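As an illustration of the two quantities reported above (macro-F1 and sensitivity to
demographic descriptors), the sketch below shows how each could be computed;
`classify`, the labels, and the example texts are hypothetical stand-ins, not the
study's code or data.

from sklearn.metrics import f1_score

# Macro-F1 over SDoH categories on toy predictions.
y_true = ["housing", "none", "transportation", "none", "employment"]
y_pred = ["housing", "none", "none", "none", "employment"]
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))

def flip_rate(texts, classify):
    """Fraction of notes whose predicted label changes when a demographic
    descriptor is prepended, a simple probe of algorithmic bias."""
    flips = 0
    for text in texts:
        original = classify(text)
        perturbed = classify("The patient is a Black woman. " + text)
        flips += original != perturbed
    return flips / len(texts)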
MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain
This paper presents medBERT.de, a pre-trained German BERT model specifically
designed for the German medical domain. The model has been trained on a large
corpus of 4.7 million German medical documents and has been shown to achieve
new state-of-the-art performance on eight different medical benchmarks covering
a wide range of disciplines and medical document types. In addition to
evaluating the overall performance of the model, this paper also conducts a
more in-depth analysis of its capabilities. We investigate the impact of data
deduplication on the model's performance, as well as the potential benefits of
using more efficient tokenization methods. Our results indicate that
domain-specific models such as medBERT.de are particularly useful for longer
texts, and that deduplication of training data does not necessarily lead to
improved performance. Furthermore, we found that efficient tokenization plays
only a minor role in improving model performance, and attribute most of the
improved performance to the large amount of training data. To encourage further
research, the pre-trained model weights and new benchmarks based on
radiological data are made publicly available for use by the scientific
community.
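Since deduplication of the training corpus is one of the factors the paper analyses,
here is a minimal sketch of exact-duplicate removal by hashing normalized text; real
corpus pipelines typically add near-duplicate detection (e.g., MinHash), and this is
only an illustration, not the authors' procedure.

import hashlib

def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each exact duplicate (after whitespace
    and case normalization)."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Befund: unauffällig.", "Befund:  unauffällig.", "Kein Nachweis einer Fraktur."]
print(len(deduplicate(corpus)))  # 2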
Imaging biomarker roadmap for cancer studies.
Imaging biomarkers (IBs) are integral to the routine management of patients with cancer. IBs used daily in oncology include clinical TNM stage, objective response and left ventricular ejection fraction. Other CT, MRI, PET and ultrasonography biomarkers are used extensively in cancer research and drug development. New IBs need to be established either as useful tools for testing research hypotheses in clinical trials and research studies, or as clinical decision-making tools for use in healthcare, by crossing 'translational gaps' through validation and qualification. Important differences exist between IBs and biospecimen-derived biomarkers and, therefore, the development of IBs requires a tailored 'roadmap'. Recognizing this need, Cancer Research UK (CRUK) and the European Organisation for Research and Treatment of Cancer (EORTC) assembled experts to review, debate and summarize the challenges of IB validation and qualification. This consensus group has produced 14 key recommendations for accelerating the clinical translation of IBs, which highlight the role of parallel (rather than sequential) tracks of technical (assay) validation, biological/clinical validation and assessment of cost-effectiveness; the need for IB standardization and accreditation systems; the need to continually revisit IB precision; an alternative framework for biological/clinical validation of IBs; and the essential requirements for multicentre studies to qualify IBs for clinical use.
Development of this roadmap received support from Cancer Research UK and the Engineering and Physical Sciences Research Council (grant references A/15267, A/16463, A/16464, A/16465, A/16466 and A/18097), the EORTC Cancer Research Fund, and the Innovative Medicines Initiative Joint Undertaking (grant agreement number 115151), resources of which are composed of financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies' in-kind contributions.
FUTURE-AI: International consensus guideline for trustworthy and deployable artificial intelligence in healthcare
Despite major advances in artificial intelligence (AI) for medicine and healthcare, the deployment and adoption of AI technologies remain limited in real-world clinical practice. In recent years, concerns have been raised about the technical, clinical, ethical and legal risks associated with medical AI. To increase real-world adoption, it is essential that medical AI tools are trusted and accepted by patients, clinicians, health organisations and authorities. This work describes the FUTURE-AI guideline as the first international consensus framework for guiding the development and deployment of trustworthy AI tools in healthcare. The FUTURE-AI consortium was founded in 2021 and currently comprises 118 interdisciplinary experts from 51 countries representing all continents, including AI scientists, clinicians, ethicists, and social scientists. Over a two-year period, the consortium defined guiding principles and best practices for trustworthy AI through an iterative process comprising an in-depth literature review, a modified Delphi survey, and online consensus meetings. The FUTURE-AI framework was established based on six guiding principles for trustworthy AI in healthcare, i.e. Fairness, Universality, Traceability, Usability, Robustness and Explainability. Through consensus, a set of 28 best practices was defined, addressing technical, clinical, legal and socio-ethical dimensions. The recommendations cover the entire lifecycle of medical AI, from design, development and validation to regulation, deployment, and monitoring. FUTURE-AI is a risk-informed, assumption-free guideline which provides a structured approach for constructing medical AI tools that will be trusted, deployed and adopted in real-world practice. Researchers are encouraged to take the recommendations into account in proof-of-concept stages to facilitate the future translation of medical AI towards clinical practice.