4,544 research outputs found
Interpretation of Neural Networks is Fragile
In order for machine learning to be deployed and trusted in many
applications, it is crucial to be able to reliably explain why the machine
learning algorithm makes certain predictions. For example, if an algorithm
classifies a given pathology image to be a malignant tumor, then the doctor may
need to know which parts of the image led the algorithm to this classification.
How to interpret black-box predictors is thus an important and active area of
research. A fundamental question is: how much can we trust the interpretation
itself? In this paper, we show that interpretation of deep learning predictions
is extremely fragile in the following sense: two perceptively indistinguishable
inputs with the same predicted label can be assigned very different
interpretations. We systematically characterize the fragility of several
widely-used feature-importance interpretation methods (saliency maps, relevance
propagation, and DeepLIFT) on ImageNet and CIFAR-10. Our experiments show that
even small random perturbation can change the feature importance and new
systematic perturbations can lead to dramatically different interpretations
without changing the label. We extend these results to show that
interpretations based on exemplars (e.g. influence functions) are similarly
fragile. Our analysis of the geometry of the Hessian matrix gives insight on
why fragility could be a fundamental challenge to the current interpretation
approaches.Comment: Published as a conference paper at AAAI 201
Are You Tampering With My Data?
We propose a novel approach towards adversarial attacks on neural networks
(NN), focusing on tampering the data used for training instead of generating
attacks on trained models. Our network-agnostic method creates a backdoor
during training which can be exploited at test time to force a neural network
to exhibit abnormal behaviour. We demonstrate on two widely used datasets
(CIFAR-10 and SVHN) that a universal modification of just one pixel per image
for all the images of a class in the training set is enough to corrupt the
training procedure of several state-of-the-art deep neural networks causing the
networks to misclassify any images to which the modification is applied. Our
aim is to bring to the attention of the machine learning community, the
possibility that even learning-based methods that are personally trained on
public datasets can be subject to attacks by a skillful adversary.Comment: 18 page
- …