Multimodal deep learning foundation models can learn the relationship between
images and text. In the context of medical imaging, mapping images to language
concepts reflects the clinical task of diagnostic image interpretation. However,
current general-purpose foundation models do not perform well in this context
because their training corpora contain limited medical text and images. To address
this challenge and account for the range of cardiac physiology, we leverage
1,032,975 cardiac ultrasound videos and corresponding expert interpretations to
develop EchoCLIP, a multimodal foundation model for echocardiography. EchoCLIP
displays strong zero-shot performance (without task-specific training) in cardiac
function assessment (external validation left ventricular ejection fraction
mean absolute error (MAE) of 7.1%) and identification of implanted intracardiac
devices (areas under the curve (AUC) between 0.84 and 0.98 for pacemakers and
artificial heart valves). We also developed a long-context variant (EchoCLIP-R)
with a custom echocardiography report text tokenizer. EchoCLIP-R can accurately
identify unique patients across multiple videos (AUC of 0.86), identify
clinical changes such as orthotopic heart transplants (AUC of 0.79) or cardiac
surgery (AUC of 0.77), and enable robust image-to-text search (mean cross-modal
retrieval rank in the top 1% of candidate text reports). These emergent
capabilities can be used for preliminary assessment and summarization of
echocardiographic findings.
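
As an illustration of the image-to-text search metric mentioned above, the following is a minimal sketch of how a mean cross-modal retrieval rank could be computed from paired embeddings. The function names `cosine` and `mean_retrieval_rank` are hypothetical, the toy vectors stand in for real video and report embeddings, and the use of cosine similarity with strict-inequality ranking is an assumption; the exact evaluation protocol is not described in this text.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_retrieval_rank(video_embs, text_embs):
    """Mean rank (1 = best) of the matching report for each video.

    Assumes video_embs[i] and text_embs[i] form a matched pair.
    """
    ranks = []
    for i, video in enumerate(video_embs):
        sims = [cosine(video, text) for text in text_embs]
        # Rank = 1 + number of candidate reports scoring strictly
        # higher than the true matching report.
        ranks.append(1 + sum(s > sims[i] for s in sims))
    return sum(ranks) / len(ranks)

# Toy example: each video embedding is closest to its own report.
videos = [[1.0, 0.0], [0.0, 1.0]]
reports = [[1.0, 0.0], [0.0, 1.0]]
rank = mean_retrieval_rank(videos, reports)
```

Under this reading, a mean rank "in the top 1%" would correspond to the mean rank divided by the number of candidate reports being at most 0.01.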