Foundation vision-language models are currently transforming computer vision
and are on the rise in medical imaging, fueled by their promising
generalization capabilities. However, initial attempts to transfer this new
paradigm to medical imaging have shown less impressive performance than that
observed in other domains, due to the significant domain shift and the complex,
expert domain knowledge inherent to medical-imaging tasks. Motivated by the
need for domain-expert foundation models, we present FLAIR, a pre-trained
vision-language model for universal retinal fundus image understanding. To this
end, we compiled 37 open-access, mostly categorical fundus imaging datasets
from various sources, with up to 97 different target conditions and 284,660
images. We integrate expert domain knowledge in the form of descriptive
textual prompts, during both pre-training and zero-shot inference, enhancing
the less-informative categorical supervision of the data. This textual
expert knowledge, which we compiled from the relevant clinical literature and
community standards, describes the fine-grained features of the pathologies as
well as the hierarchies and dependencies between them. We report comprehensive
evaluations, which illustrate the benefit of integrating expert knowledge and
the strong generalization capabilities of FLAIR under difficult scenarios with
domain shifts or unseen categories. When adapted with a lightweight linear
probe, FLAIR outperforms fully trained, dataset-focused models, particularly in
few-shot regimes. Interestingly, FLAIR outperforms more generalist,
larger-scale image-language models by a large margin, which emphasizes the potential
of embedding experts' domain knowledge and the limitations of generalist models
in medical imaging.

Comment: The pre-trained model is available at:
https://github.com/jusiro/FLAI
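
As an illustration of the zero-shot inference with descriptive textual prompts mentioned above, the following is a minimal sketch, not the FLAIR implementation. It assumes that an image encoder and a text encoder have already produced embeddings; the toy 3-dimensional vectors, the two class names, the function names, and the temperature of 0.07 are all hypothetical choices made for the example. Each class is represented by the average of several prompt embeddings, and the image is assigned to the class whose prototype has the highest cosine similarity.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def zero_shot_scores(image_emb, prompt_embs_by_class, temperature=0.07):
    """Softmax over cosine similarities between an image and class prototypes.

    Each class prototype is the normalized mean of the normalized embeddings
    of its descriptive prompts. The temperature value is an assumption.
    """
    image_emb = l2_normalize(image_emb)
    logits = {}
    for name, prompt_embs in prompt_embs_by_class.items():
        prototype = l2_normalize(
            mean_vector([l2_normalize(p) for p in prompt_embs])
        )
        cosine = sum(a * b for a, b in zip(image_emb, prototype))
        logits[name] = cosine / temperature
    # Subtract the maximum logit for a numerically stable softmax.
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


# Toy embeddings standing in for encoder outputs (hypothetical values).
# Each class has two descriptive prompts, e.g. different textual
# descriptions of the same condition.
prompts = {
    "diabetic retinopathy": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "glaucoma": [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]],
}
image = [0.85, 0.15, 0.05]

scores = zero_shot_scores(image, prompts)
prediction = max(scores, key=scores.get)
print(prediction, scores)
```

Averaging several prompt embeddings per class is one common way to use more than one description per category; whether FLAIR combines its expert-knowledge prompts in this way is not stated in the abstract.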