Recently, multi-modal vision-language foundation models have gained
significant attention in the medical field. While these models offer great
opportunities, they still face a number of challenges, such as the requirement
for fine-grained knowledge understanding in computer-aided diagnosis and
capability of utilizing very limited or no task-specific labeled data in
real-world clinical applications. In this study, we present MaCo, a novel
multi-modal medical foundation model that explores masked contrastive learning
to achieve granular alignment and zero-shot learning for a variety of medical
imaging tasks. MaCo incorporates a correlation weighting mechanism to adjust
the correlation between masked image patches and their corresponding reports,
thereby enhancing the representation learning capabilities. We evaluate MaCo on
six well-known open-source X-ray datasets, and the experimental results show it
outperforms seven state-of-the-art approaches for classification, segmentation,
and zero-shot phase grounding, demonstrating its great potential to promote a
wide range of medical image analysis tasks