Catastrophic overfitting can be induced with discriminative non-robust features
Adversarial training (AT) is the de facto method for building robust neural
networks, but it can be computationally expensive. To mitigate this, fast
single-step attacks can be used, but this may lead to catastrophic overfitting
(CO). This phenomenon appears when networks gain non-trivial robustness during
the first stages of AT, but then reach a breaking point where they become
vulnerable in just a few iterations. The mechanisms that lead to this failure
mode are still poorly understood. In this work, we study the onset of CO in
single-step AT methods through controlled modifications of typical datasets of
natural images. In particular, we show that CO can be induced at much smaller
perturbation budgets (ε) than previously observed, simply by injecting images with
seemingly innocuous features. These features aid non-robust classification but
are not enough to achieve robustness on their own. Through extensive
experiments we analyze this novel phenomenon and discover that the presence of
these easy features induces a learning shortcut that leads to CO. Our findings
provide new insights into the mechanisms of CO and improve our understanding of
the dynamics of AT. The code to reproduce our experiments can be found at
https://github.com/gortizji/co_features. Published in Transactions on Machine Learning Research (TMLR).
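As a loose illustration of the kind of dataset intervention described above (a minimal sketch, not the paper's exact construction; the random per-class patterns, the β scale, and all function names are assumptions), one could inject a small, label-correlated pattern into each training image, making the dataset easier to classify without adding any robust information:

```python
import torch

def inject_easy_features(images, labels, num_classes, beta=0.05, seed=0):
    """Add a small, linearly separable, label-correlated pattern to images.

    Such patterns are discriminative (a linear classifier can exploit
    them) but carry no robust information on their own; beta controls
    how 'easy' the injected feature is.
    """
    gen = torch.Generator().manual_seed(seed)
    c, h, w = images.shape[1:]
    # One fixed random direction per class, normalized to unit L2 norm.
    patterns = torch.randn(num_classes, c, h, w, generator=gen)
    patterns = patterns / patterns.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
    # Add the pattern matching each image's label and keep pixels valid.
    return (images + beta * patterns[labels]).clamp(0.0, 1.0)
```

Comparing single-step (e.g., FGSM) adversarial training on such an injected dataset against the unmodified one is the kind of controlled experiment the abstract alludes to.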
Hold me tight! Influence of discriminative features on deep network boundaries
Important insights towards the explainability of neural networks reside in the characteristics of their decision boundaries. In this work, we borrow tools from the field of adversarial robustness, and propose a new perspective that relates dataset features to the distance of samples to the decision boundary. This enables us to carefully tweak the position of the training samples and measure the induced changes on the boundaries of CNNs trained on large-scale vision datasets. We use this framework to reveal some intriguing properties of CNNs. Specifically, we rigorously confirm that neural networks exhibit a high invariance to non-discriminative features, and show that very small perturbations of the training samples in certain directions can lead to sudden invariances in the orthogonal ones. This is precisely the mechanism that adversarial training uses to achieve robustness.
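The sample-to-boundary distances that this framework builds on can be estimated, for instance, by bisecting on the magnitude of a perturbation along a chosen direction until the predicted class flips (a minimal sketch under assumed names and tolerances, not the authors' implementation):

```python
import torch

@torch.no_grad()
def boundary_distance(model, x, direction, max_scale=10.0, tol=1e-3):
    """Estimate the distance from a single example x (shape (1, C, H, W))
    to the decision boundary along `direction`, via bisection on the
    perturbation magnitude.

    Returns the smallest scale t with model(x + t * d) predicting a
    different label than model(x), or None if no flip occurs within
    max_scale.
    """
    d = direction / direction.norm()
    base_label = model(x).argmax(dim=1)

    # Make sure the boundary is actually crossed within max_scale.
    if model(x + max_scale * d).argmax(dim=1) == base_label:
        return None

    lo, hi = 0.0, max_scale
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if model(x + mid * d).argmax(dim=1) == base_label:
            lo = mid  # still on the original side of the boundary
        else:
            hi = mid  # crossed the boundary; tighten the upper bound
    return hi
```

Sweeping the direction over, e.g., the top principal components of the training data yields a per-direction margin profile, which is one way to relate dataset features to boundary distances.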