Open-Source Clinical Machine Learning Models: Critical Appraisal of Feasibility, Advantages, and Challenges
Machine learning applications promise to augment clinical capabilities, and at least 64 models have already been approved by the US Food and Drug Administration. These tools are developed, shared, and used in an environment in which regulations and market forces remain immature. An important consideration when evaluating this environment is the introduction of open-source solutions, in which innovations are freely shared; such solutions have long been a facet of digital culture. We discuss the feasibility and implications of open-source machine learning in a health care infrastructure built upon proprietary information. The decreased cost of development compared to drugs and devices, a longstanding culture of open-source products in other industries, and the beginnings of machine learning–friendly regulatory pathways together allow for the development and deployment of open-source machine learning models. Such tools have distinct advantages, including enhanced product integrity, customizability, and lower cost, leading to increased access. However, significant engineering questions about implementation infrastructure and model safety, a lack of incentives from intellectual property protection, and nebulous liability rules complicate the development of such open-source models. Ultimately, reconciling open-source machine learning with the proprietary information–driven health care environment requires that policymakers, regulators, and health care organizations actively craft a conducive market in which innovative developers will continue to both work and collaborate.
Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model
Advances in large language models (LLMs) provide new opportunities in healthcare for improved patient care, clinical decision-making, and enhancement of physician and administrator workflows. However, the potential of these models depends critically on their ability to generalize effectively across clinical environments and populations, a challenge often underestimated in early development. To better understand the reasons for these challenges and to inform mitigation approaches, we evaluated ClinicLLM, an LLM trained on [HOSPITAL]'s clinical notes, analyzing its performance on 30-day all-cause readmission prediction with a focus on variability across hospitals and patient characteristics. We found poorer generalization particularly in hospitals with fewer samples, among patients with government and unspecified insurance, the elderly, and those with high comorbidities. To understand the reasons for this lack of generalization, we investigated sample sizes for fine-tuning, note content (number of words per note), patient characteristics (comorbidity level, age, insurance type, borough), and health system aspects (hospital, all-cause 30-day readmission rate, and mortality rate), using descriptive statistics and supervised classification to identify relevant features. We found that, along with sample size, patient age, number of comorbidities, and the number of words per note are all important factors related to generalization. Finally, we compared local (hospital-specific) fine-tuning, instance-based augmented fine-tuning, and cluster-based fine-tuning for improving generalization. Among these, local fine-tuning proved most effective, increasing AUC by 0.25% to 11.74%, and was most helpful in settings with limited data. Overall, this study provides new insights for enhancing the deployment of large language models in the societally important domain of healthcare and for improving their performance for broader populations.
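To make the pooled-versus-local comparison concrete, the following is a minimal sketch of that evaluation logic, not the ClinicLLM pipeline itself: a logistic regression stands in for the fine-tuned LLM, and the column names (`hospital_id`, `readmit_30d`) are hypothetical placeholders for a per-patient table.

```python
# Minimal sketch (not the ClinicLLM pipeline): contrast pooled training with
# local, hospital-specific training for 30-day readmission prediction and
# score each hospital separately by AUC. A logistic regression stands in for
# the fine-tuned LLM; column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def auc_by_hospital(df, feature_cols, label="readmit_30d", site="hospital_id"):
    """Return per-hospital AUC for a pooled model vs. a locally trained model."""
    rows = {}
    for hospital, grp in df.groupby(site):
        train, test = train_test_split(
            grp, test_size=0.3, stratify=grp[label], random_state=0
        )

        # Pooled model: trained on all hospitals except this one's held-out rows.
        pooled_train = df.drop(index=test.index)
        pooled = LogisticRegression(max_iter=1000).fit(
            pooled_train[feature_cols], pooled_train[label]
        )

        # Local model: trained only on this hospital's own data, analogous to
        # the hospital-specific fine-tuning the study found most effective.
        local = LogisticRegression(max_iter=1000).fit(train[feature_cols], train[label])

        rows[hospital] = {
            "n_patients": len(grp),
            "pooled_auc": roc_auc_score(test[label], pooled.predict_proba(test[feature_cols])[:, 1]),
            "local_auc": roc_auc_score(test[label], local.predict_proba(test[feature_cols])[:, 1]),
        }
    return pd.DataFrame(rows).T
```

A per-hospital table like this is one simple way to surface the pattern reported above: smaller sites tend to show the largest gap between pooled and local performance.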
Classifying publications from the clinical and translational science award program along the translational research spectrum: a machine learning approach
BACKGROUND:
Translational research is a key area of focus of the National Institutes of Health (NIH), as demonstrated by the substantial investment in the Clinical and Translational Science Award (CTSA) program. The goal of the CTSA program is to accelerate the translation of discoveries from the bench to the bedside and into communities. Different classification systems have been used to capture the spectrum of basic to clinical to population health research, with substantial differences in the number of categories and their definitions. Evaluation of the effectiveness of the CTSA program and of translational research in general is hampered by the lack of rigor in these definitions and their application. This study adds rigor to the classification process by creating a checklist to evaluate publications across the translational spectrum and operationalizes these classifications by building machine learning-based text classifiers to categorize these publications.
METHODS:
Based on collaboratively developed definitions, we created a detailed checklist for categories along the translational spectrum from T0 to T4. We applied the checklist to CTSA-linked publications to construct a set of coded publications for use in training machine learning-based text classifiers to classify publications within these categories. The training sets combined T1/T2 and T3/T4 categories due to low frequency of these publication types compared to the frequency of T0 publications. We then compared classifier performance across different algorithms and feature sets and applied the classifiers to all publications in PubMed indexed to CTSA grants. To validate the algorithm, we manually classified the articles with the top 100 scores from each classifier.
RESULTS:
The definitions and checklist facilitated classification and resulted in good inter-rater reliability for coding publications for the training set. Very good performance was achieved for the classifiers, as represented by the area under the receiver operating characteristic curve (AUC), with an AUC of 0.94 for the T0 classifier, 0.84 for T1/T2, and 0.92 for T3/T4.
CONCLUSIONS:
The combination of definitions agreed upon by five CTSA hubs, a checklist that facilitates more uniform definition interpretation, and algorithms that perform well in classifying publications along the translational spectrum provides a basis for establishing and applying uniform definitions of translational research categories. The classification algorithms allow publication analyses that would not be feasible with manual classification, such as assessing the distribution and trends of publications across the CTSA network and comparing the categories of publications and their citations to assess knowledge transfer across the translational research spectrum.
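The abstract does not reproduce the classifiers themselves, so the sketch below is only an illustrative stand-in, not the authors' algorithm or feature set: a TF-IDF text pipeline for one binary category (e.g. T0 vs. not-T0), evaluated by cross-validated AUC as in the paper, with `texts` and `labels` standing for the manually coded CTSA-linked publications.

```python
# Illustrative sketch, not the published pipeline: a TF-IDF text classifier for
# one translational category (e.g. T0 vs. not-T0), evaluated by cross-validated
# AUC. `texts` and `labels` stand for the manually coded training set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def evaluate_category_classifier(texts, labels, cv=5):
    """Cross-validated AUC for a single category classifier (one-vs-rest)."""
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2, stop_words="english"),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )
    return cross_val_score(clf, texts, labels, scoring="roc_auc", cv=cv).mean()


# Hypothetical usage: one classifier per category, with T1/T2 and T3/T4 merged
# as in the training sets described above.
# t0_auc = evaluate_category_classifier(publication_texts, is_t0_label)
```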
A dynamic risk score for early prediction of cardiogenic shock using machine learning
Myocardial infarction and heart failure are major cardiovascular diseases that affect millions of people in the US, and morbidity and mortality are highest among patients who develop cardiogenic shock. Early recognition of cardiogenic shock is critical: prompt implementation of treatment measures can prevent the deleterious spiral of ischemia, low blood pressure, and reduced cardiac output caused by cardiogenic shock. However, early identification of cardiogenic shock has been challenging due to human providers' inability to process the enormous amount of data in the cardiac intensive care unit (ICU) and the lack of an effective risk stratification tool. We developed a deep learning-based risk stratification tool, called CShock, to predict the onset of cardiogenic shock in patients admitted to the cardiac ICU with acute decompensated heart failure and/or myocardial infarction. To develop and validate CShock, we annotated cardiac ICU datasets with physician-adjudicated outcomes. CShock achieved an area under the receiver operating characteristic curve (AUROC) of 0.820, substantially outperforming CardShock (AUROC 0.519), a well-established risk score for cardiogenic shock prognosis. CShock was externally validated in an independent patient cohort, where it achieved an AUROC of 0.800, demonstrating its generalizability to other cardiac ICUs.
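The abstract specifies the evaluation (AUROC against an established score) but not the CShock architecture, so the following is a hedged sketch of that comparison only: a small neural risk model trained on tabular cardiac ICU features and scored against an existing clinical risk score (a stand-in for CardShock) on the same held-out, physician-adjudicated cohort. All variable names are hypothetical.

```python
# Hedged sketch of the evaluation step only (the CShock architecture is not
# described in the abstract): train a small neural risk model on tabular
# cardiac ICU features and compare its AUROC with an existing clinical risk
# score on the same held-out cohort. Names are hypothetical.
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_risk_models(X_train, y_train, X_test, y_test, baseline_score_test):
    """Return AUROC of a learned risk model vs. an existing risk score.

    y_* are adjudicated cardiogenic shock labels; baseline_score_test holds the
    existing score's values for the test cohort (higher score = higher risk).
    """
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
    )
    model.fit(X_train, y_train)
    learned_auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    baseline_auroc = roc_auc_score(y_test, baseline_score_test)
    return {"learned_auroc": learned_auroc, "baseline_auroc": baseline_auroc}
```

Because AUROC depends only on the ranking of risk values, the existing score can be compared directly without refitting, which mirrors how a learned model and a points-based score can be evaluated on one cohort.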