We propose a novel deep neural network architecture to learn interpretable
representations for medical image analysis. Our architecture generates global
attention for the region of interest, and then learns bag-of-words style deep
feature embeddings with local attention. The global and local feature maps are
combined using a contemporary transformer architecture for highly accurate
Gallbladder Cancer (GBC) detection from Ultrasound (USG) images. Our
experiments indicate that the detection accuracy of our model exceeds even that
of human radiologists, which supports its use as a second reader for GBC
diagnosis. The bag-of-words embeddings allow our model to be probed to generate
interpretable
explanations for GBC detection consistent with the ones reported in medical
literature. We show that the proposed model not only helps in understanding the
decisions of neural network models but also aids in the discovery of new visual
features relevant to the diagnosis of GBC. Source code and model will be
available at
https://github.com/sbasu276/RadFormer

Comment: To Appear in Elsevier Medical Image Analysis