This paper explores training medical vision-language models (VLMs) -- where
the visual and language inputs are embedded into a common space -- with a
particular focus on scenarios where training data is limited, as is often the
case in clinical datasets. We explore several candidate methods to improve
low-data performance, including: (i) adapting generic pre-trained models to
novel image and text domains (i.e. medical imaging and reports) via unimodal
self-supervision; (ii) using local (e.g. GLoRIA) and global (e.g. InfoNCE)
contrastive loss functions, as well as a combination of the two (the global
objective is sketched below); (iii) extra
supervision during VLM training, via: (a) image- and text-only
self-supervision, and (b) creating additional positive image-text pairs for
training through augmentation and nearest-neighbour search.
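For concreteness, the following is a minimal, illustrative sketch of the global, CLIP-style InfoNCE objective referred to in (ii), not the paper's implementation; the tensor names, the symmetric formulation, and the fixed temperature value are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def clip_style_infonce(image_emb: torch.Tensor,
                       text_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) projections into the shared space.
    Matching pairs share the same row index; all other rows act as negatives.
    """
    # Cosine similarities via L2-normalised dot products.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch)

    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

Local objectives such as GLoRIA additionally contrast word-level text features against image sub-regions, but the batch-level pairing logic follows the same pattern.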
Using text-to-image retrieval as a benchmark, we evaluate the performance of
these methods on training datasets of varying size consisting of paired chest
X-rays and radiological reports. Combined, they significantly improve retrieval
compared to fine-tuning CLIP, roughly equivalent to training with substantially
more data. A similar pattern is found in the downstream task of classifying
CXR-related conditions, where our method outperforms both CLIP and BioVIL, a
strong CXR VLM benchmark, in the zero-shot and linear-probing settings. We
conclude with a set
of recommendations for researchers aiming to train vision-language models on
other medical imaging modalities when training data is scarce. To facilitate
further research, we will make our code and models publicly available.
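As an illustration of the text-to-image retrieval benchmark described above, the sketch below ranks a gallery of image embeddings against a single report embedding by cosine similarity; zero-shot classification can be viewed as the same procedure with class-prompt embeddings scored against each image. The function and variable names are placeholders, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def retrieve_images(query_text_emb: torch.Tensor,
                    gallery_image_emb: torch.Tensor,
                    top_k: int = 10) -> torch.Tensor:
    """Rank a gallery of image embeddings against one text query embedding.

    query_text_emb:    (dim,) embedding of a report or prompt.
    gallery_image_emb: (num_images, dim) embeddings of candidate images.
    Returns the indices of the top_k most similar images.
    """
    query = F.normalize(query_text_emb, dim=-1)
    gallery = F.normalize(gallery_image_emb, dim=-1)
    similarities = gallery @ query          # cosine similarity per image
    return similarities.topk(top_k).indices
```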