Interpretation of Neural Networks is Fragile
In order for machine learning to be deployed and trusted in many
applications, it is crucial to be able to reliably explain why the machine
learning algorithm makes certain predictions. For example, if an algorithm
classifies a given pathology image to be a malignant tumor, then the doctor may
need to know which parts of the image led the algorithm to this classification.
How to interpret black-box predictors is thus an important and active area of
research. A fundamental question is: how much can we trust the interpretation
itself? In this paper, we show that interpretation of deep learning predictions
is extremely fragile in the following sense: two perceptibly indistinguishable
inputs with the same predicted label can be assigned very different
interpretations. We systematically characterize the fragility of several
widely-used feature-importance interpretation methods (saliency maps, relevance
propagation, and DeepLIFT) on ImageNet and CIFAR-10. Our experiments show that
even small random perturbations can change the feature importance, and new
systematic perturbations can lead to dramatically different interpretations
without changing the label. We extend these results to show that
interpretations based on exemplars (e.g. influence functions) are similarly
fragile. Our analysis of the geometry of the Hessian matrix gives insight into
why fragility could be a fundamental challenge to current interpretation
approaches.
Comment: Published as a conference paper at AAAI 201
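The fragility described here can be probed with a simple experiment. The sketch below is a rough illustration, not the authors' code: it assumes a pretrained torchvision ResNet and a preprocessed input tensor, computes a plain gradient saliency map before and after a small random perturbation, and compares the overlap of the top-ranked pixels.

```python
# Minimal sketch of measuring saliency-map fragility under a small random
# perturbation (illustrative only; model and input are stand-ins).
import torch
import torchvision.models as models

model = models.resnet18(pretrained=True).eval()

def saliency(x):
    """Gradient of the top-class score w.r.t. the input (simple saliency map)."""
    x = x.clone().requires_grad_(True)
    score = model(x).max(dim=1).values.sum()
    score.backward()
    return x.grad.detach().abs()

x = torch.rand(1, 3, 224, 224)            # stand-in for a preprocessed image
x_pert = x + 0.01 * torch.randn_like(x)   # small random perturbation

s, s_pert = saliency(x), saliency(x_pert)

# Fragility proxy: overlap of the top-k most important pixels before/after.
k = 1000
top = lambda m: set(m.flatten().topk(k).indices.tolist())
overlap = len(top(s) & top(s_pert)) / k
same_label = (model(x).argmax(1) == model(x_pert).argmax(1)).item()
print(f"label unchanged: {same_label}, top-{k} saliency overlap: {overlap:.2f}")
```

A large drop in overlap while the predicted label stays fixed is the kind of fragility the abstract refers to.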
Meaningfully Explaining Model Mistakes Using Conceptual Counterfactuals
Understanding and explaining the mistakes made by trained models is critical
to many machine learning objectives, such as improving robustness, addressing
concept drift, and mitigating biases. However, this is often an ad hoc process
that involves manually looking at the model's mistakes on many test samples and
guessing at the underlying reasons for those incorrect predictions. In this
paper, we propose a systematic approach, conceptual counterfactual
explanations (CCE), that explains why a classifier makes a mistake on a
particular test sample in terms of human-understandable concepts (e.g., this
zebra is misclassified as a dog because of faint stripes). We base CCE on two
prior ideas: counterfactual explanations and concept activation vectors, and
validate our approach on well-known pretrained models, showing that it explains
the models' mistakes meaningfully. In addition, for new models trained on data
with spurious correlations, CCE accurately identifies the spurious correlation
as the cause of model mistakes from a single misclassified test sample. On two
challenging medical applications, CCE generated useful insights, confirmed by
clinicians, into biases and mistakes the model makes in real-world settings.
The code for CCE is publicly available and can easily be applied to explain
mistakes in new models.
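CCE builds on concept activation vectors (CAVs). The sketch below is an illustrative toy, not the released CCE code, and uses random stand-in embeddings: it fits a linear probe separating concept examples from random examples in a model's embedding space, then scores how strongly a misclassified sample expresses that concept.

```python
# Minimal sketch of the concept activation vector (CAV) step that CCE builds on
# (toy data; real usage would extract embeddings from a fixed layer of the model).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb_dim = 512

# Stand-ins for embeddings of concept images (e.g. "stripes") vs. random images.
concept_embs = rng.normal(1.0, 1.0, size=(100, emb_dim))
random_embs  = rng.normal(0.0, 1.0, size=(100, emb_dim))

X = np.vstack([concept_embs, random_embs])
y = np.array([1] * 100 + [0] * 100)

clf = LogisticRegression(max_iter=1000).fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])   # unit concept direction

# Score a misclassified sample: a low projection onto the "stripes" CAV would be
# consistent with an explanation like "misclassified because of faint stripes".
mistake_emb = rng.normal(0.2, 1.0, size=emb_dim)
print("concept score:", float(mistake_emb @ cav))
```

In the full method, counterfactual reasoning over such concept directions is what lets CCE name which missing or spurious concept drove a particular mistake.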