Interpretability is highly desired for deep neural network-based classifiers,
especially when addressing high-stakes decisions in medical imaging. Commonly
used post-hoc interpretability methods are limited in that they can produce
plausible yet differing interpretations of the same model, leading to
ambiguity about which one to choose. To address this problem, a novel
decision-theory-motivated approach is investigated to establish a
self-interpretable model, given a pretrained deep binary black-box medical
image classifier. This approach employs a self-interpretable encoder-decoder
model in conjunction with a single-layer fully connected network with unity
weights. The model is trained to estimate the test statistic of the given
trained black-box deep binary classifier so that a similar classification
accuracy is maintained. The decoder output, referred to as an equivalency map,
is a transformed version of the to-be-classified image that, when processed by
the fixed fully connected layer, produces the same test statistic value as the
original classifier. The equivalency map provides a
visualization of the transformed image features that directly contribute to the
test statistic value and, moreover, permits quantification of their relative
contributions. Unlike traditional post-hoc interpretability methods, the
proposed method is self-interpretable, quantitative, and fundamentally based on
decision theory. Detailed quantitative and qualitative analyses have been
performed with three different medical image binary classification tasks.
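
To make the described setup concrete, the following is a minimal PyTorch sketch of an encoder-decoder followed by a frozen fully connected layer with unity weights, trained to regress the test statistic of a pretrained black-box classifier. The module names, layer sizes, image size, and training loop are illustrative assumptions, not the authors' implementation; `blackbox` is assumed to return a scalar test statistic (e.g., a pre-sigmoid logit) per image.

```python
# Illustrative sketch only; assumes single-channel square images and a frozen
# `blackbox` model that outputs one test-statistic value per image.
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Produces an 'equivalency map' with the same shape as the input image."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.dec(self.enc(x))

class EquivalencyModel(nn.Module):
    """Encoder-decoder followed by a fixed single-layer fully connected network
    with unity weights, so the estimated test statistic is simply the sum of
    the equivalency-map pixel values."""
    def __init__(self, image_size):
        super().__init__()
        self.autoencoder = EncoderDecoder()
        self.fc = nn.Linear(image_size * image_size, 1, bias=False)
        with torch.no_grad():
            self.fc.weight.fill_(1.0)          # unity weights
        self.fc.weight.requires_grad_(False)   # kept fixed during training

    def forward(self, x):
        eq_map = self.autoencoder(x)
        t_hat = self.fc(eq_map.flatten(1))     # estimated test statistic
        return t_hat, eq_map

def train_step(model, blackbox, images, optimizer, loss_fn=nn.MSELoss()):
    """Train the encoder-decoder to reproduce the black-box test statistic."""
    with torch.no_grad():
        t_target = blackbox(images)            # test statistic of the black box
    t_hat, _ = model(images)
    loss = loss_fn(t_hat, t_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the fully connected layer's weights are fixed to unity, the estimated test statistic equals the sum of the equivalency-map pixel values, which is what allows each pixel's relative contribution to the decision to be read off and quantified directly.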