Self-Supervised Time-to-Event Modeling with Structured Medical Records
Time-to-event (TTE) models are used in medicine and other fields for
estimating the probability distribution of the time until a specific event
occurs. TTE models provide many advantages over classification using fixed time
horizons, including naturally handling censored observations, but require more
parameters and are challenging to train in settings with limited labeled data.
Existing approaches, e.g. proportional hazards or accelerated failure time,
employ distributional assumptions to reduce parameters but are vulnerable to
model misspecification. In this work, we address these challenges with MOTOR
(Many Outcome Time Oriented Representations), a self-supervised model that
leverages temporal structure found in collections of timestamped events in
electronic health records (EHR) and health insurance claims. MOTOR uses a TTE
pretraining objective that predicts the probability distribution of times when
events occur, making it well-suited to transfer learning for medical prediction
tasks. Having pretrained on EHR and claims data of up to 55M patient records
(9B clinical events), we evaluate performance after finetuning for 19 tasks
across two datasets. Task-specific models built using MOTOR improve
time-dependent C statistics by 4.6% over state-of-the-art while greatly
improving sample efficiency, achieving comparable performance to existing
methods using only 5% of available task data.
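To make the pretraining objective concrete: a discrete-time survival likelihood is one standard way to predict the probability distribution of times when events occur while handling censoring. The sketch below is a generic illustration under that assumption, not MOTOR's implementation; the tensor names and binning scheme are hypothetical.

```python
import torch

def discrete_tte_loss(hazard_logits, event_bin, observed):
    """Generic discrete-time TTE negative log-likelihood (illustrative).

    hazard_logits: (batch, n_bins) raw scores; sigmoid gives the per-bin
                   hazard P(event in bin t | survived through bin t-1)
    event_bin:     (batch,) long tensor, bin of the event (or of censoring)
    observed:      (batch,) float tensor, 1.0 if observed, 0.0 if censored
    """
    hazards = torch.sigmoid(hazard_logits).clamp(1e-6, 1 - 1e-6)
    bins = torch.arange(hazard_logits.shape[1], device=hazard_logits.device)

    # Every record survives all bins strictly before its event/censoring bin.
    before = (bins[None, :] < event_bin[:, None]).float()
    log_survival = (before * torch.log1p(-hazards)).sum(dim=1)

    # Only observed events pay the log-hazard of the bin they land in.
    event_hazard = hazards.gather(1, event_bin[:, None]).squeeze(1)
    log_event = observed * torch.log(event_hazard)

    return -(log_survival + log_event).mean()
```

Censored records contribute only survival terms, which is how TTE models naturally handle censored observations without discarding those patients.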
Clinical Utility Gains from Incorporating Comorbidity and Geographic Location Information into Risk Estimation Equations for Atherosclerotic Cardiovascular Disease
Objective: There are several efforts to re-learn the 2013 ACC/AHA pooled
cohort equations (PCE) for patients with specific comorbidities and geographic
locations. With over 363 customized risk models in the literature, we aim to
evaluate such revised models to determine if the performance improvements
translate to gains in clinical utility.
Methods: We re-train a baseline PCE using the ACC/AHA PCE variables and
revise it to incorporate subject-level geographic location and comorbidity
information. We apply fixed effects, random effects, and extreme gradient
boosting models to handle the correlation and heterogeneity induced by
locations. Models are trained using 2,464,522 claims records from Optum
Clinformatics Data Mart and validated in the hold-out set (N=1,056,224). We
evaluate models' performance overall and across subgroups defined by the
presence or absence of chronic kidney disease (CKD) or rheumatoid arthritis
(RA) and geographic locations. We evaluate models' expected net benefit using
decision curve analysis and models' statistical properties using several
discrimination and calibration metrics.
Results: The baseline PCE is miscalibrated overall, in patients with CKD or
RA, and in locations with small populations. Our revised models improved both the
overall (GND P-value=0.41) and subgroup calibration but only enhanced net
benefit in the underrepresented subgroups. The gains were larger in the
subgroups with comorbidities and were heterogeneous across geographic locations.
Conclusions: Revising the PCE with comorbidity and location information
significantly enhanced models' calibration; however, such improvements do not
necessarily translate to clinical gains. Thus, we recommend that future work
quantify the consequences of using risk calculators to guide clinical
decisions.
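The expected net benefit used in decision curve analysis has a standard closed form (Vickers and Elkin): at threshold probability p_t, net benefit = TP/N - (FP/N) * p_t/(1 - p_t). A minimal sketch of that standard computation, not the authors' code, with all variable names illustrative:

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold
    (standard decision curve analysis)."""
    treat = y_prob >= threshold
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))  # treated patients who had the event
    fp = np.sum(treat & (y_true == 0))  # treated patients who did not
    return tp / n - (fp / n) * (threshold / (1 - threshold))

# A decision curve sweeps thresholds and compares the model against the
# treat-all and treat-none strategies, e.g.:
# curve = [net_benefit(y, p, t) for t in np.linspace(0.05, 0.30, 26)]
```

A model can be better calibrated yet show no net-benefit gain if its treatment decisions rarely change at the thresholds clinicians actually use, which is exactly the gap between statistical performance and clinical utility the abstract highlights.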
The Shaky Foundations of Clinical Foundation Models: A Survey of Large Language Models and Foundation Models for EMRs
The successes of foundation models such as ChatGPT and AlphaFold have spurred
significant interest in building similar models for electronic medical records
(EMRs) to improve patient care and hospital operations. However, recent hype
has obscured critical gaps in our understanding of these models' capabilities.
We review over 80 foundation models trained on non-imaging EMR data (i.e.
clinical text and/or structured data) and create a taxonomy delineating their
architectures, training data, and potential use cases. We find that most models
are trained on small, narrowly-scoped clinical datasets (e.g. MIMIC-III) or
broad, public biomedical corpora (e.g. PubMed) and are evaluated on tasks that
do not provide meaningful insights on their usefulness to health systems. In
light of these findings, we propose an improved evaluation framework for
measuring the benefits of clinical foundation models that is more closely
grounded in metrics that matter in healthcare.
INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis
Synthesizing information from multiple data sources plays a crucial role in
the practice of modern medicine. Current applications of artificial
intelligence in medicine often focus on single-modality data due to a lack of
publicly available, multimodal medical datasets. To address this limitation, we
introduce INSPECT, which contains de-identified longitudinal records from a
large cohort of patients at risk for pulmonary embolism (PE), along with ground
truth labels for multiple outcomes. INSPECT contains data from 19,402 patients,
including CT images, radiology report impression sections, and structured
electronic health record (EHR) data (i.e. demographics, diagnoses, procedures,
vitals, and medications). Using INSPECT, we develop and release a benchmark for
evaluating several baseline modeling approaches on a variety of important PE
related tasks. We evaluate image-only, EHR-only, and multimodal fusion models.
Trained models and the de-identified dataset are made available for
non-commercial use under a data use agreement. To the best of our knowledge,
INSPECT is the largest multimodal dataset integrating 3D medical imaging and
EHR for reproducible methods evaluation and research.
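As a rough illustration of what a multimodal fusion baseline on such a dataset might look like, the following late-fusion head concatenates precomputed image and EHR embeddings. It is an assumption about the setup, not the released INSPECT baseline, and the dimensions are made up.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Toy late-fusion classifier over frozen unimodal embeddings."""

    def __init__(self, img_dim=512, ehr_dim=256, n_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + ehr_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, img_emb, ehr_emb):
        # Concatenate CT-image and structured-EHR representations, classify.
        return self.mlp(torch.cat([img_emb, ehr_emb], dim=-1))
```

Image-only and EHR-only baselines then correspond to feeding a single modality's embedding into a head of the same shape.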
Instability in clinical risk stratification models using deep learning
While it has been well known in the ML community that deep learning models
suffer from instability, the consequences for healthcare deployments are
under-characterised. We study the stability of different model architectures trained
on electronic health records, using a set of outpatient prediction tasks as a
case study. We show that repeated training runs of the same deep learning model
on the same training data can result in significantly different outcomes at a
patient level even though global performance metrics remain stable. We propose
two stability metrics for measuring the effect of randomness of model training,
as well as mitigation strategies for improving model stability.
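The abstract does not specify the two proposed metrics, but one natural instance of a patient-level stability measure, shown purely as an illustration, is the fraction of patients whose predicted label flips across repeated training runs that differ only in random seed:

```python
import numpy as np

def decision_flip_rate(pred_runs, threshold=0.5):
    """Fraction of patients whose thresholded prediction disagrees across
    repeated training runs (illustrative metric, not the paper's).

    pred_runs: (n_runs, n_patients) predicted probabilities, one row per seed.
    """
    labels = pred_runs >= threshold
    flipped = labels.min(axis=0) != labels.max(axis=0)  # any disagreement
    return float(flipped.mean())
```

A model family can show near-identical AUROC across seeds while this rate is substantial, which is precisely the gap between global performance metrics and patient-level outcomes the paper describes.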
AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature
The diagnosis of Mendelian disorders requires labor-intensive literature
research. Trained clinicians can spend hours looking for the right
publication(s) supporting a single gene that best explains a patient’s disease.
AMELIE (Automatic Mendelian Literature Evaluation) greatly accelerates this
process. AMELIE parses all 29 million PubMed abstracts and downloads and
further parses hundreds of thousands of full-text articles in search of
information supporting the causality and associated phenotypes of most
published genetic variants. AMELIE then prioritizes patient candidate variants
for their likelihood of explaining any patient’s given set of phenotypes.
Diagnosis of singleton patients (without relatives’ exomes) is the most
time-consuming scenario, and AMELIE ranked the causative gene at the very top
for 66% of 215 diagnosed singleton Mendelian patients from the Deciphering
Developmental Disorders project. Evaluating only the top 11 AMELIE-scored
genes of 127 (median) candidate genes per patient resulted in a rapid
diagnosis in more than 90% of cases. AMELIE-based evaluation of all cases was
3 to 19 times more efficient than hand-curated database-based approaches. We
replicated these results on a retrospective cohort of clinical cases from
Stanford Children’s Health and the Manton Center for Orphan Disease Research.
An analysis web portal with our most recent update, programmatic interface,
and code is available at AMELIE.stanford.edu.
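The headline numbers reduce to a top-k diagnostic-yield computation over ranked candidate lists; a minimal sketch with hypothetical variable names, not AMELIE's code:

```python
def top_k_yield(ranked_genes_per_case, causative_genes, k=11):
    """Fraction of cases whose causative gene appears in the top k of the
    ranked candidate list, e.g. the >90% reported for AMELIE at k=11."""
    hits = sum(
        causative in ranked[:k]
        for ranked, causative in zip(ranked_genes_per_case, causative_genes)
    )
    return hits / len(causative_genes)
```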