913 research outputs found
The Odds are Odd: A Statistical Test for Detecting Adversarial Examples
We investigate conditions under which test statistics exist that can reliably
detect examples, which have been adversarially manipulated in a white-box
attack. These statistics can be easily computed and calibrated by randomly
corrupting inputs. They exploit certain anomalies that adversarial attacks
introduce, in particular if they follow the paradigm of choosing perturbations
optimally under p-norm constraints. Access to the log-odds is the only
requirement to defend models. We justify our approach empirically, but also
provide conditions under which detectability via the suggested test statistics
is guaranteed to be effective. In our experiments, we show that it is even
possible to correct test time predictions for adversarial attacks with high
accuracy
- …