Explaining the decisions of an Artificial Intelligence (AI) model is
increasingly critical in many real-world, high-stake applications. Hundreds of
papers have either proposed new feature attribution methods, discussed or
harnessed these tools in their work. However, despite humans being the target
end-users, most attribution methods were only evaluated on proxy
automatic-evaluation metrics (Zhang et al. 2018; Zhou et al. 2016; Petsiuk et
al. 2018). In this paper, we conduct the first user study to measure
attribution map effectiveness in assisting humans in ImageNet classification
and Stanford Dogs fine-grained classification, and when an image is natural or
adversarial (i.e., contains adversarial perturbations). Overall, feature
attribution is surprisingly not more effective than showing humans nearest
training-set examples. On a harder task of fine-grained dog categorization,
presenting attribution maps to humans does not help, but instead hurts the
performance of human-AI teams compared to AI alone. Importantly, we found
automatic attribution-map evaluation measures to correlate poorly with the
actual human-AI team performance. Our findings encourage the community to
rigorously test their methods on the downstream human-in-the-loop applications
and to rethink the existing evaluation metrics.Comment: NeurIPS 2021; 10 pages of Main text; 28 pages of Appendi