Explaining Deep Learning models is becoming increasingly important as new
multimodal models emerge daily, particularly in safety-critical domains such as
medical imaging. However, the lack of detailed investigation into how
explainability methods perform on these models is widening the gap
between their development and safe deployment. In this work, we analyze the
performance of various explainable AI methods on a vision-language model,
MedCLIP, to demystify its inner workings. We also provide a simple methodology
to overcome the shortcomings of these methods. Our work offers a new
perspective on the explainability of a recent, well-known VLM in the medical
domain, and our assessment method generalizes to other current and future
VLMs.