Medical report generation requires the automatic creation of coherent and precise
descriptions of medical images. However, the scarcity of labelled medical
image-report pairs poses formidable challenges to developing large-scale neural
networks capable of harnessing the potential of artificial intelligence, as
exemplified by large language models. This study builds upon the
state-of-the-art vision-language pre-training and fine-tuning approach, BLIP-2,
to customize general large-scale foundation models. By integrating adapter tuning
and a medical knowledge enhancement loss, our model significantly improves the
accuracy and coherence of the generated reports. Validation on the
ImageCLEFmedical 2023 dataset demonstrates our model's effectiveness, achieving
the best averaged results against several state-of-the-art methods. Significant
improvements in ROUGE and CIDEr scores underscore our method's efficacy,
highlighting promising outcomes for the rapid medical-domain adaptation of
vision-language foundation models in addressing the challenges posed by data
scarcity.
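
To make the adapter-tuning setup concrete, the following is a minimal sketch,
assuming a PyTorch-style implementation in which small trainable bottleneck
adapters augment a frozen backbone and the captioning loss is combined with a
weighted knowledge enhancement term. The class and function names, the
bottleneck size, and the weighting factor alpha are illustrative assumptions,
not the paper's actual configuration.

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        # Residual bottleneck adapter inserted after a frozen transformer layer.
        # Hidden size and bottleneck dimension are illustrative assumptions.
        def __init__(self, dim: int = 768, bottleneck: int = 64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.act = nn.GELU()
            self.up = nn.Linear(bottleneck, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Only the small down/up projections are trained; the backbone stays frozen.
            return x + self.up(self.act(self.down(x)))

    def combined_loss(lm_loss: torch.Tensor,
                      knowledge_loss: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
        # Weighted sum of the standard captioning (language-modelling) loss and a
        # medical knowledge enhancement term; alpha is a hypothetical weight.
        return lm_loss + alpha * knowledge_loss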