1,918 research outputs found
Interpreting Adversarially Trained Convolutional Neural Networks
We attempt to interpret how adversarially trained convolutional neural
networks (AT-CNNs) recognize objects. We design systematic approaches to
interpret AT-CNNs in both qualitative and quantitative ways and compare them
with normally trained models. Surprisingly, we find that adversarial training
alleviates the texture bias of standard CNNs when trained on object recognition
tasks, and helps CNNs learn a more shape-biased representation. We validate our
hypothesis from two aspects. First, we compare the salience maps of AT-CNNs and
standard CNNs on clean images and images under different transformations. The
comparison could visually show that the prediction of the two types of CNNs is
sensitive to dramatically different types of features. Second, to achieve
quantitative verification, we construct additional test datasets that destroy
either textures or shapes, such as style-transferred version of clean data,
saturated images and patch-shuffled ones, and then evaluate the classification
accuracy of AT-CNNs and normal CNNs on these datasets. Our findings shed some
light on why AT-CNNs are more robust than those normally trained ones and
contribute to a better understanding of adversarial training over CNNs from an
interpretation perspective.Comment: To apper in ICML1
- …