Advances in self-supervised learning (SSL) have shown that SSL
pretraining on medical imaging data can provide a strong initialization for
downstream supervised classification and segmentation. Given the difficulty of
obtaining expert labels for medical image recognition tasks, such an
"in-domain" SSL initialization is often desirable due to its improved label
efficiency over standard transfer learning. However, most SSL methods for
medical imaging are not adapted to video-based modalities. To address this
gap, we developed a self-supervised contrastive learning approach, EchoCLR,
tailored to echocardiogram videos with the goal of learning
strong representations for efficient fine-tuning on downstream cardiac disease
diagnosis. EchoCLR leverages (i) distinct videos of the same patient as
positive pairs for contrastive learning and (ii) a frame re-ordering pretext
task to enforce temporal coherence. When fine-tuned on small portions of
labeled data (as few as 51 exams), EchoCLR pretraining significantly improved
classification performance for left ventricular hypertrophy (LVH) and aortic
stenosis (AS) over other transfer learning and SSL approaches across internal
and external test sets. For example, when fine-tuning on 10% of available
training data (519 studies), an EchoCLR-pretrained model achieved 0.72 AUROC
(95% CI: [0.69, 0.75]) on LVH classification, compared to 0.61 AUROC (95% CI:
[0.57, 0.64]) with a standard transfer learning approach. Similarly, using 1%
of available training data (53 studies), EchoCLR pretraining achieved 0.82
AUROC (95% CI: [0.79, 0.84]) on severe AS classification, compared to 0.61
AUROC (95% CI: [0.58, 0.65]) with transfer learning. EchoCLR is unique in its
ability to learn representations of medical videos and demonstrates that SSL
can enable label-efficient disease classification from small, labeled datasets.
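
The two ingredients described above, positive pairs drawn from distinct videos of the same patient and a frame re-ordering pretext task, can be illustrated with a minimal sketch. This is not the EchoCLR implementation. The function names, the dictionary layout `videos_by_patient`, the permutation-index formulation of the re-ordering label, and the use of an NT-Xent (SimCLR-style) contrastive loss with a temperature of 0.5 are all assumptions made for illustration, and plain Python lists stand in for video tensors and learned embeddings.

```python
import itertools
import math
import random


def sample_positive_pair(videos_by_patient, patient_id, rng):
    """Return two distinct videos of one patient to use as a positive pair."""
    videos = videos_by_patient[patient_id]
    if len(videos) < 2:
        raise ValueError("a positive pair needs at least two videos per patient")
    first, second = rng.sample(videos, 2)  # sampling without replacement
    return first, second


def make_reordering_example(frames, rng):
    """Shuffle the frames of a short clip and return (shuffled_frames, label).

    The label indexes the applied permutation, so a classifier head could be
    trained to recover the temporal order (assumed formulation). The number
    of permutations grows factorially, so this suits short clips only.
    """
    permutations = list(itertools.permutations(range(len(frames))))
    label = rng.randrange(len(permutations))
    order = permutations[label]
    return [frames[i] for i in order], label


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss where z_a[i] and z_b[i] embed the two videos of a pair."""
    z = list(z_a) + list(z_b)
    n = len(z_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive for anchor i
        # Scaled similarities to every other embedding in the batch.
        logits = {
            k: cosine_similarity(z[i], z[k]) / temperature
            for k in range(2 * n)
            if k != i
        }
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denominator - logits[j]
    return total / (2 * n)
```

In this sketch, the loss is lower when the embeddings of two videos from the same patient are closer to each other than to embeddings of videos from other patients in the batch. In practice the contrastive and re-ordering objectives would be computed on encoder outputs and combined during pretraining; how they are weighted is not specified here.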