Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification
Recent advances in large language models (LLMs) have shown impressive ability
in biomedical question-answering, but have not been adequately investigated for
more specific biomedical applications. This study investigates the performance
of LLMs such as the ChatGPT family of models (GPT-3.5s, GPT-4) in biomedical
tasks beyond question-answering. Because no patient data can be passed to the
OpenAI API public interface, we evaluated model performance with over 10,000
samples as proxies for two fundamental tasks in the clinical domain:
classification and reasoning. The first task is classifying whether statements
of clinical and policy recommendations in scientific literature constitute
health advice. The second task is causal relation detection from the biomedical
literature. We compared LLMs with simpler models, such as bag-of-words (BoW)
with logistic regression, and fine-tuned BioBERT models. Despite the excitement
around viral ChatGPT, we found that fine-tuning for two fundamental NLP tasks
remained the best strategy. The simple BoW model performed on par with the most
complex LLM prompting. Prompt engineering required significant investment.
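As a point of reference for the baseline mentioned above, the following is a minimal,
illustrative sketch of a bag-of-words classifier with logistic regression for
health-advice detection; the example texts, labels, and hyperparameters are invented
and are not those of the study.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentences standing in for statements from scientific abstracts.
texts = [
    "Patients should receive annual influenza vaccination.",   # health advice
    "The cohort included 1,204 adults with type 2 diabetes.",  # no advice
    "Clinicians ought to counsel smokers about cessation.",    # health advice
    "Mean follow-up was 4.2 years across both study arms.",    # no advice
]
labels = [1, 0, 1, 0]

# Bag-of-words features (unigrams and bigrams) feeding a logistic regression.
bow_clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
bow_clf.fit(texts, labels)
print(bow_clf.predict(["We recommend statin therapy for high-risk patients."]))

Such a baseline is cheap to train and, as the study reports, can match far more
elaborate prompting pipelines on well-defined classification tasks.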
Enrichment of the NLST and NSCLC-Radiomics computed tomography collections with AI-derived annotations
Public imaging datasets are critical for the development and evaluation of
automated tools in cancer imaging. Unfortunately, many do not include
annotations or image-derived features, complicating their downstream analysis.
Artificial intelligence-based annotation tools have been shown to achieve
acceptable performance and thus can be used to automatically annotate large
datasets. As part of the effort to enrich public data available within NCI
Imaging Data Commons (IDC), here we introduce AI-generated annotations for two
collections of computed tomography images of the chest, NSCLC-Radiomics and
the National Lung Screening Trial (NLST). Using publicly available AI algorithms, we
derived volumetric annotations of thoracic organs at risk, their corresponding
radiomics features, and slice-level annotations of anatomical landmarks and
regions. The resulting annotations are publicly available within IDC, where the
DICOM format is used to harmonize the data and achieve FAIR principles. The
annotations are accompanied by cloud-enabled notebooks demonstrating their use.
This study reinforces the need for large, publicly accessible curated datasets
and demonstrates how AI can be used to aid in cancer imaging.
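As an illustration of the kind of radiomics feature extraction described above, the
sketch below uses the open-source pyradiomics package on an image/segmentation pair;
the file names are hypothetical, and the study's actual pipeline and parameters may
differ (the IDC data themselves are distributed in DICOM format and would first need
conversion to a SimpleITK-readable format such as NRRD).

from radiomics import featureextractor

# Configure an extractor; binWidth controls intensity discretization.
extractor = featureextractor.RadiomicsFeatureExtractor(binWidth=25)
extractor.enableFeatureClassByName("firstorder")
extractor.enableFeatureClassByName("shape")

# CT volume and organ-at-risk segmentation; paths are placeholders.
features = extractor.execute("ct_volume.nrrd", "heart_segmentation.nrrd")
for name, value in features.items():
    if not name.startswith("diagnostics_"):
        print(name, value)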
LongHealth: A Question Answering Benchmark with Long Clinical Documents
Background: Recent advancements in large language models (LLMs) offer
potential benefits in healthcare, particularly in processing extensive patient
records. However, existing benchmarks do not fully assess LLMs' capability in
handling real-world, lengthy clinical data.
Methods: We present the LongHealth benchmark, comprising 20 detailed
fictional patient cases across various diseases, with each case containing
5,090 to 6,754 words. The benchmark poses 400 multiple-choice questions in
three categories: information extraction, negation, and sorting, requiring
LLMs to extract and interpret information from large clinical documents.
Results: We evaluated nine open-source LLMs supporting context lengths of at
least 16,000 tokens and also included OpenAI's proprietary, cost-efficient
GPT-3.5 Turbo for
comparison. The highest accuracy was observed for Mixtral-8x7B-Instruct-v0.1,
particularly in tasks focused on information retrieval from single and multiple
patient documents. However, all models struggled significantly in tasks
requiring the identification of missing information, highlighting a critical
area for improvement in clinical data interpretation.
Conclusion: While LLMs show considerable potential for processing long
clinical documents, their current accuracy levels are insufficient for reliable
clinical use, especially in scenarios requiring the identification of missing
information. The LongHealth benchmark provides a more realistic assessment of
LLMs in a healthcare setting and highlights the need for further model
refinement for safe and effective clinical application.
We make the benchmark and evaluation code publicly available.
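To make the evaluation protocol concrete, here is a hedged sketch of how accuracy on
a multiple-choice benchmark of this kind could be scored; `ask_model` is a hypothetical
wrapper around any LLM API, and the case structure is assumed rather than taken from
the released code.

from typing import Callable

def evaluate(cases: list[dict], ask_model: Callable[[str], str]) -> float:
    """Each case holds a long patient document, a question, lettered answer
    options, and the letter of the correct answer."""
    correct = 0
    for case in cases:
        prompt = (
            f"{case['document']}\n\nQuestion: {case['question']}\n"
            + "\n".join(f"{letter}: {text}" for letter, text in case["options"].items())
            + "\nAnswer with a single letter."
        )
        reply = ask_model(prompt)
        prediction = reply.strip()[:1].upper()  # take the first letter of the reply
        correct += prediction == case["answer"]
    return correct / len(cases)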
Repeatability of Multiparametric Prostate MRI Radiomics Features
In this study we assessed the repeatability of the values of radiomics
features for small prostate tumors using test-retest Multiparametric Magnetic
Resonance Imaging (mpMRI) images. The premise of radiomics is that quantitative
image features can serve as biomarkers characterizing disease. For such
biomarkers to be useful, repeatability is a basic requirement: a feature's
value must remain stable between two scans acquired under stable conditions. We
investigated repeatability of radiomics features under various preprocessing
and extraction configurations including various image normalization schemes,
different image pre-filtering, 2D vs 3D texture computation, and different bin
widths for image discretization. Image registration, as a means of
re-identifying regions of interest across time points, was evaluated against
human-expert segmentations at both time points. Even though we found many radiomics
features and preprocessing combinations with a high repeatability (Intraclass
Correlation Coefficient (ICC) > 0.85), our results indicate that overall the
repeatability is highly sensitive to the processing parameters (under certain
configurations, the ICC can fall below 0.0). Image normalization, using a variety of
approaches considered, did not result in consistent improvements in
repeatability. There was also no consistent improvement of repeatability
through the use of pre-filtering options, or by using image registration
between timepoints to improve consistency of the region of interest
localization. Based on these results we urge caution when interpreting
radiomics features and advise paying close attention to the processing
configuration details of reported results. Furthermore, we advocate reporting
all processing details in radiomics studies and strongly recommend making the
implementation available.
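For readers unfamiliar with the repeatability metric used above, the following is a
small, self-contained computation of the two-way random-effects, absolute-agreement
ICC(2,1) on synthetic test-retest data; it illustrates the metric only and is not the
study's code.

import numpy as np

def icc_2_1(y: np.ndarray) -> float:
    """y has shape (n_subjects, n_scans); test-retest corresponds to n_scans = 2."""
    n, k = y.shape
    grand = y.mean()
    ss_total = ((y - grand) ** 2).sum()
    ss_rows = k * ((y.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((y.mean(axis=0) - grand) ** 2).sum()   # between scans
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

rng = np.random.default_rng(0)
test = rng.normal(10.0, 2.0, size=50)
retest = test + rng.normal(0.0, 0.5, size=50)    # a fairly repeatable feature
print(icc_2_1(np.column_stack([test, retest])))  # close to 1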
Natural language processing to automatically extract the presence and severity of esophagitis in notes of patients undergoing radiotherapy
Radiotherapy (RT) toxicities can impair survival and quality of life, yet
remain under-studied. Real-world evidence holds potential to improve our
understanding of toxicities, but toxicity information is often only in clinical
notes. We developed natural language processing (NLP) models to identify the
presence and severity of esophagitis from notes of patients treated with
thoracic RT. We fine-tuned statistical and pre-trained BERT-based models for
three esophagitis classification tasks: Task 1) presence of esophagitis, Task
2) severe esophagitis or not, and Task 3) no esophagitis vs. grade 1 vs. grade
2-3. Transferability was tested on 345 notes from patients with esophageal
cancer undergoing RT.
Fine-tuning PubMedBERT yielded the best performance. The best macro-F1 was
0.92, 0.82, and 0.74 for Task 1, 2, and 3, respectively. Selecting the most
informative note sections during fine-tuning improved macro-F1 by over 2% for
all tasks. Silver-labeled data improved the macro-F1 by over 3% across all
tasks. For the esophageal cancer notes, the best macro-F1 was 0.73, 0.74, and
0.65 for Task 1, 2, and 3, respectively, without additional fine-tuning.
To our knowledge, this is the first effort to automatically extract
esophagitis toxicity severity according to CTCAE guidelines from clinic notes.
The promising performance provides proof-of-concept for NLP-based automated
detailed toxicity monitoring in expanded domains.
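To give a concrete sense of the fine-tuning setup described above, here is a minimal
sketch using the Hugging Face transformers library for the three-class task (no
esophagitis vs. grade 1 vs. grade 2-3); the checkpoint name, example notes, and
hyperparameters are assumptions for illustration, not the study's configuration.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Toy notes; labels: 0 = no esophagitis, 1 = grade 1, 2 = grade 2-3.
notes = Dataset.from_dict({
    "text": ["No esophagitis reported this visit.",
             "Mild odynophagia, tolerating solids, continue current diet.",
             "Severe esophagitis, requires IV fluids and opioid analgesia."],
    "label": [0, 1, 2],
})
notes = notes.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="esophagitis-grading", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=notes,
    tokenizer=tokenizer,
)
trainer.train()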
Large Language Models to Identify Social Determinants of Health in Electronic Health Records
Social determinants of health (SDoH) have an important impact on patient
outcomes but are incompletely collected from the electronic health records
(EHR). This study investigated the ability of large language models to extract
SDoH from free text in EHRs, where they are most commonly documented, and
explored the role of synthetic clinical text for improving the extraction of
these scarcely documented, yet extremely valuable, clinical data. A total of 800 patient
notes were annotated for SDoH categories, and several transformer-based models
were evaluated. The study also experimented with synthetic data generation and
assessed models for algorithmic bias. Our best-performing models were fine-tuned
Flan-T5 XL (macro-F1 0.71) for any SDoH, and Flan-T5 XXL (macro-F1 0.70). The
benefit of augmenting fine-tuning with synthetic data varied across model
architecture and size, with smaller Flan-T5 models (base and large) showing the
greatest improvements in performance (delta F1 +0.12 to +0.23). Model
performance was similar on the in-hospital system dataset but worse on the
MIMIC-III dataset. Our best-performing fine-tuned models outperformed zero- and
few-shot performance of ChatGPT-family models for both tasks. These fine-tuned
models were less likely than ChatGPT to change their prediction when
race/ethnicity and gender descriptors were added to the text, suggesting less
algorithmic bias (p<0.05). At the patient level, our models identified 93.8% of
patients with adverse SDoH, while ICD-10 codes captured 2.0%. Our method can
effectively extract SDoH information from clinical notes, performing better
than GPT models in zero- and few-shot settings. These models could enhance
real-world evidence on SDoH and aid in identifying patients needing social
support.
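As an illustration of the two quantities reported above (macro-F1 and sensitivity to
demographic descriptors), the sketch below shows how each could be computed;
`classify`, the labels, and the example texts are hypothetical stand-ins, not the
study's code or data.

from sklearn.metrics import f1_score

# Macro-F1 over SDoH categories on toy predictions.
y_true = ["housing", "none", "transportation", "none", "employment"]
y_pred = ["housing", "none", "none", "none", "employment"]
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))

def flip_rate(texts, classify):
    """Fraction of notes whose predicted label changes when a demographic
    descriptor is prepended, a simple probe of algorithmic bias."""
    flips = 0
    for text in texts:
        original = classify(text)
        perturbed = classify("The patient is a Black woman. " + text)
        flips += original != perturbed
    return flips / len(texts)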
MEDBERT.de: A Comprehensive German BERT Model for the Medical Domain
This paper presents medBERT.de, a pre-trained German BERT model specifically
designed for the German medical domain. The model has been trained on a large
corpus of 4.7 million German medical documents and has been shown to achieve
new state-of-the-art performance on eight different medical benchmarks covering
a wide range of disciplines and medical document types. In addition to
evaluating the overall performance of the model, this paper also conducts a
more in-depth analysis of its capabilities. We investigate the impact of data
deduplication on the model's performance, as well as the potential benefits of
using more efficient tokenization methods. Our results indicate that
domain-specific models such as medBERT.de are particularly useful for longer
texts, and that deduplication of training data does not necessarily lead to
improved performance. Furthermore, we found that efficient tokenization plays
only a minor role in improving model performance, and attribute most of the
improved performance to the large amount of training data. To encourage further
research, the pre-trained model weights and new benchmarks based on
radiological data are made publicly available for use by the scientific
community.
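Since deduplication of the training corpus is one of the factors the paper analyses,
here is a minimal sketch of exact-duplicate removal by hashing normalized text; real
corpus pipelines typically add near-duplicate detection (e.g., MinHash), and this is
only an illustration, not the authors' procedure.

import hashlib

def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each exact duplicate (after whitespace
    and case normalization)."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Befund: unauffällig.", "Befund:  unauffällig.", "Kein Nachweis einer Fraktur."]
print(len(deduplicate(corpus)))  # 2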
Imaging biomarker roadmap for cancer studies.
Imaging biomarkers (IBs) are integral to the routine management of patients with cancer. IBs used daily in oncology include clinical TNM stage, objective response and left ventricular ejection fraction. Other CT, MRI, PET and ultrasonography biomarkers are used extensively in cancer research and drug development. New IBs need to be established either as useful tools for testing research hypotheses in clinical trials and research studies, or as clinical decision-making tools for use in healthcare, by crossing 'translational gaps' through validation and qualification. Important differences exist between IBs and biospecimen-derived biomarkers and, therefore, the development of IBs requires a tailored 'roadmap'. Recognizing this need, Cancer Research UK (CRUK) and the European Organisation for Research and Treatment of Cancer (EORTC) assembled experts to review, debate and summarize the challenges of IB validation and qualification. This consensus group has produced 14 key recommendations for accelerating the clinical translation of IBs, which highlight the role of parallel (rather than sequential) tracks of technical (assay) validation, biological/clinical validation and assessment of cost-effectiveness; the need for IB standardization and accreditation systems; the need to continually revisit IB precision; an alternative framework for biological/clinical validation of IBs; and the essential requirements for multicentre studies to qualify IBs for clinical use.
Development of this roadmap received support from Cancer Research UK and the Engineering and Physical Sciences Research Council (grant references A/15267, A/16463, A/16464, A/16465, A/16466 and A/18097), the EORTC Cancer Research Fund, and the Innovative Medicines Initiative Joint Undertaking (grant agreement number 115151), resources of which are composed of financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies' in-kind contributions.
FUTURE-AI: International consensus guideline for trustworthy and deployable artificial intelligence in healthcare
Despite major advances in artificial intelligence (AI) for medicine and healthcare, the deployment and adoption of AI technologies remain limited in real-world clinical practice. In recent years, concerns have been raised about the technical, clinical, ethical and legal risks associated with medical AI. To increase real-world adoption, it is essential that medical AI tools are trusted and accepted by patients, clinicians, health organisations and authorities. This work describes the FUTURE-AI guideline as the first international consensus framework for guiding the development and deployment of trustworthy AI tools in healthcare. The FUTURE-AI consortium was founded in 2021 and currently comprises 118 interdisciplinary experts from 51 countries representing all continents, including AI scientists, clinicians, ethicists, and social scientists. Over a two-year period, the consortium defined guiding principles and best practices for trustworthy AI through an iterative process comprising an in-depth literature review, a modified Delphi survey, and online consensus meetings. The FUTURE-AI framework was established based on six guiding principles for trustworthy AI in healthcare, i.e. Fairness, Universality, Traceability, Usability, Robustness and Explainability. Through consensus, a set of 28 best practices was defined, addressing technical, clinical, legal and socio-ethical dimensions. The recommendations cover the entire lifecycle of medical AI, from design, development and validation to regulation, deployment, and monitoring. FUTURE-AI is a risk-informed, assumption-free guideline which provides a structured approach for constructing medical AI tools that will be trusted, deployed and adopted in real-world practice. Researchers are encouraged to take the recommendations into account in proof-of-concept stages to facilitate the future translation of medical AI towards clinical practice.