Interpretation of Neural Networks is Fragile
In order for machine learning to be deployed and trusted in many applications, it is crucial to be able to reliably explain why a machine learning algorithm makes certain predictions. For example, if an algorithm classifies a given pathology image as a malignant tumor, the doctor may need to know which parts of the image led the algorithm to this classification. How to interpret black-box predictors is thus an important and active area of research. A fundamental question is: how much can we trust the interpretation itself? In this paper, we show that interpretation of deep learning predictions is extremely fragile in the following sense: two perceptually indistinguishable inputs with the same predicted label can be assigned very different interpretations. We systematically characterize the fragility of several widely used feature-importance interpretation methods (saliency maps, relevance propagation, and DeepLIFT) on ImageNet and CIFAR-10. Our experiments show that even small random perturbations can change the feature importance, and that new systematic perturbations can lead to dramatically different interpretations without changing the label. We extend these results to show that interpretations based on exemplars (e.g. influence functions) are similarly fragile. Our analysis of the geometry of the Hessian matrix gives insight into why fragility could be a fundamental challenge for current interpretation approaches.
Comment: Published as a conference paper at AAAI 2019
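As an illustration of the fragility test described in this abstract, the following sketch compares a vanilla gradient saliency map before and after a small random perturbation of the input. It is only a rough sketch, not code from the paper; `model` (a trained PyTorch image classifier) and `x` (a preprocessed input batch of shape [1, C, H, W]) are assumed placeholder names.

```python
# Minimal sketch, not the authors' code: vanilla gradient saliency and a
# random-perturbation fragility check. `model` and `x` are assumed names.
import torch

def saliency_map(model, x):
    """Absolute gradient of the predicted-class score w.r.t. the input."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    pred = logits.argmax(dim=1).item()
    logits[0, pred].backward()
    return x.grad.abs().squeeze(0), pred

def fragility_check(model, x, eps=1e-2):
    s_clean, y_clean = saliency_map(model, x)
    # Small random sign perturbation of the input.
    x_pert = x + eps * torch.sign(torch.randn_like(x))
    s_pert, y_pert = saliency_map(model, x_pert)
    # Correlation between the two maps: a low value while the predicted
    # label stays the same is the fragility the paper reports.
    maps = torch.stack([s_clean.flatten(), s_pert.flatten()])
    corr = torch.corrcoef(maps)[0, 1].item()
    return y_clean == y_pert, corr
```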
Adversarial Training for Free!
Adversarial training, in which a network is trained on adversarial examples,
is one of the few defenses that withstands strong adversarial
attacks. Unfortunately, the high cost of generating strong adversarial examples
makes standard adversarial training impractical on large-scale problems like
ImageNet. We present an algorithm that eliminates the overhead cost of
generating adversarial examples by recycling the gradient information computed
when updating model parameters. Our "free" adversarial training algorithm
achieves comparable robustness to PGD adversarial training on the CIFAR-10 and
CIFAR-100 datasets at negligible additional cost compared to natural training,
and can be 7 to 30 times faster than other strong adversarial training methods.
Using a single workstation with 4 P100 GPUs and 2 days of runtime, we can train
a robust model for the large-scale ImageNet classification task that maintains
40% accuracy against PGD attacks. The code is available at
https://github.com/ashafahi/free_adv_train.
Comment: Accepted to NeurIPS 2019
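The core idea above — recycling the gradient of each parameter update to also update the adversarial perturbation, by replaying each minibatch several times — can be sketched roughly as follows. This is a hedged illustration, not the released code (see the linked repository for the reference implementation); `model`, `loader`, `optimizer`, the replay count `m`, and the L-infinity budget `epsilon` are assumed names.

```python
# Rough sketch of "free" adversarial training; not the reference code.
import torch
import torch.nn.functional as F

def free_adv_epoch(model, loader, optimizer, epsilon=8 / 255, m=4, device="cuda"):
    model.train()
    delta = None  # the perturbation is carried over between minibatches
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        if delta is None or delta.shape != images.shape:
            delta = torch.zeros_like(images)
        for _ in range(m):  # replay the same minibatch m times
            delta.requires_grad_(True)
            loss = F.cross_entropy(model((images + delta).clamp(0, 1)), labels)
            optimizer.zero_grad()
            loss.backward()           # one backward pass gives gradients for both
            optimizer.step()          # ... the weights (usual descent step)
            with torch.no_grad():     # ... and the perturbation (ascent step)
                delta = (delta + epsilon * delta.grad.sign()).clamp(-epsilon, epsilon)
            delta = delta.detach()
```

Because the perturbation update reuses a gradient that is computed anyway for the weight update, the adversarial example comes at essentially no extra cost per step, which is where the "free" in the name comes from.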
Efficient Defenses Against Adversarial Attacks
Following the recent adoption of deep neural networks (DNNs) across a wide
range of applications, adversarial attacks against these models have proven to
be an indisputable threat. Adversarial samples are crafted with the deliberate
intention of undermining a system. In the case of DNNs, the lack of a deeper
understanding of how they work has hindered the development of efficient
defenses. In this paper, we propose a new defense method, based on practical
observations, that is easy to integrate into models and performs better than
state-of-the-art defenses. Our proposed solution is meant to reinforce the
structure of a DNN, making its predictions more stable and less likely to be
fooled by adversarial samples. We conduct an extensive experimental study
demonstrating the effectiveness of our method against multiple attacks, comparing it to
numerous defenses, in both white-box and black-box setups. Additionally, the
implementation of our method brings almost no overhead to the training
procedure, while maintaining the prediction performance of the original model
on clean samples.
Comment: 16 pages
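Since the abstract does not spell out the defense mechanism itself, the sketch below is not the authors' method; it only illustrates the kind of white-box robustness evaluation such a study relies on: measuring accuracy on FGSM-perturbed test inputs and comparing it with clean accuracy. `model`, `loader`, and `epsilon` are assumed placeholder names.

```python
# Illustrative white-box robustness check only; not the paper's defense.
import torch
import torch.nn.functional as F

def fgsm_accuracy(model, loader, epsilon=8 / 255, device="cuda"):
    """Accuracy on single-step FGSM adversarial examples (white-box)."""
    model.eval()
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x.requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        (grad,) = torch.autograd.grad(loss, x)
        x_adv = (x + epsilon * grad.sign()).clamp(0, 1).detach()
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```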