What Learned Representations and Influence Functions Can Tell Us About Adversarial Examples
Adversarial examples, deliberately crafted using small perturbations to fool
deep neural networks, were first studied in image processing and more recently
in NLP. While approaches to detecting adversarial examples in NLP have largely
relied on search over input perturbations, image processing has seen a range of
techniques that aim to characterise adversarial subspaces over the learned
representations.
In this paper, we adapt two such approaches to NLP, one based on nearest
neighbors and influence functions and one on Mahalanobis distances. The former
in particular produces a state-of-the-art detector when compared against
several strong baselines; moreover, the novel use of influence functions
provides insight into how the nature of adversarial example subspaces in NLP
relates to those in image processing, and also how they differ depending on the
kind of NLP task.
Comment: 20 pages, Accepted in IJCNLP_AACL 202
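To make the second, Mahalanobis-based approach mentioned above concrete, the following is a minimal sketch of that style of detector, assuming class-conditional Gaussians with a tied covariance fitted on clean hidden representations; the function names, synthetic features, and regularisation constant are illustrative choices, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): Mahalanobis-distance detection
# over learned representations. Fit class-conditional Gaussians with a tied
# covariance on clean features, then score a new input by its minimum
# class-conditional Mahalanobis distance; unusually large scores suggest an
# off-manifold (possibly adversarial) input.
import numpy as np

def fit_class_gaussians(features, labels):
    """Per-class means and a shared precision matrix estimated on clean data."""
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([features[labels == c] - means[c] for c in classes])
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return means, np.linalg.inv(cov)

def mahalanobis_score(x, means, precision):
    """Minimum squared Mahalanobis distance of x to any class mean."""
    return min(float((x - mu) @ precision @ (x - mu)) for mu in means.values())

# Toy usage with synthetic "hidden representations".
rng = np.random.default_rng(0)
clean_feats = rng.normal(0.0, 1.0, size=(200, 16))
clean_labels = rng.integers(0, 2, size=200)
means, precision = fit_class_gaussians(clean_feats, clean_labels)
suspect = rng.normal(3.0, 1.0, size=16)  # far from both class means
print(mahalanobis_score(suspect, means, precision))
```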
The Odds are Odd: A Statistical Test for Detecting Adversarial Examples
We investigate conditions under which test statistics exist that can reliably
detect examples that have been adversarially manipulated in a white-box
attack. These statistics can be easily computed and calibrated by randomly
corrupting inputs. They exploit certain anomalies that adversarial attacks
introduce, in particular when the attack follows the paradigm of choosing perturbations
optimally under p-norm constraints. Access to the log-odds is the only
requirement to defend models. We justify our approach empirically, but also
provide conditions under which detectability via the suggested test statistics
is guaranteed. In our experiments, we show that it is even
possible to correct test-time predictions for adversarial attacks with high
accuracy.
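As a rough illustration of the kind of test described above, the sketch below corrupts an input with random noise, averages the resulting logit gaps relative to the original prediction, and flags (and optionally corrects) the prediction when another class overtakes it under noise; the paper calibrates per-class statistics on clean data, whereas the noise scale, threshold, and toy linear model here are simplifying assumptions.

```python
# Illustrative sketch (not the authors' code): a noise-based log-odds check.
# Randomly corrupt the input, average the logit gaps relative to the original
# prediction, and flag the input if another class overtakes the prediction
# under noise; that class also serves as a "corrected" label.
import numpy as np

def perturbed_log_odds(logits_fn, x, noise_std=0.1, n_samples=64, seed=0):
    """Mean logit gap (each class minus the originally predicted class) under noise."""
    rng = np.random.default_rng(seed)
    base = logits_fn(x)
    y_hat = int(np.argmax(base))
    gaps = np.zeros_like(base, dtype=float)
    for _ in range(n_samples):
        z = logits_fn(x + rng.normal(0.0, noise_std, size=x.shape))
        gaps += z - z[y_hat]
    return gaps / n_samples, y_hat

def detect_and_correct(logits_fn, x, threshold=0.0):
    """Flag x as suspicious if some class beats the prediction under noise."""
    gaps, y_hat = perturbed_log_odds(logits_fn, x)
    challenger = int(np.argmax(gaps))
    is_suspicious = challenger != y_hat and gaps[challenger] > threshold
    return is_suspicious, (challenger if is_suspicious else y_hat)

# Toy usage: a linear two-class "model" and a point near the decision boundary.
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
logits_fn = lambda x: W @ x
print(detect_and_correct(logits_fn, np.array([0.05, 0.0])))
```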
- …