Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples
Recent efforts have shown that neural text processing models are vulnerable
to adversarial examples, but the nature of these examples is poorly understood.
In this work, we show that adversarial attacks against CNN, LSTM and
Transformer-based classification models perform word substitutions that are
identifiable through frequency differences between replaced words and their
corresponding substitutions. Based on these findings, we propose
frequency-guided word substitutions (FGWS), a simple algorithm exploiting the
frequency properties of adversarial word substitutions for the detection of
adversarial examples. FGWS achieves strong performance by accurately detecting
adversarial examples on the SST-2 and IMDb sentiment datasets, with F1
detection scores of up to 91.4% against RoBERTa-based classification models. We
compare our approach against a recently proposed perturbation discrimination
framework and show that we outperform it by up to 13.0% F1.
Comment: EACL 2021 camera-ready
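The detection idea can be sketched in a few lines: adversarial attacks tend to replace words with rarer substitutes, so swapping low-frequency words back to frequent alternatives and checking how far the model's confidence drops can expose an attack. Below is a minimal, hypothetical sketch of that logic; the frequency table, synonym map, thresholds, and toy classifier are illustrative stand-ins, not the paper's actual resources or models.

```python
# Hypothetical corpus frequencies and synonym map (toy stand-ins).
FREQ = {"good": 1000, "great": 800, "commendable": 5, "movie": 900}
SYNONYMS = {"commendable": ["good", "great"]}

def fgws_transform(tokens, freq_threshold=50):
    """Replace words rarer than freq_threshold with their most
    frequent available synonym."""
    out = []
    for tok in tokens:
        if FREQ.get(tok, 0) < freq_threshold and tok in SYNONYMS:
            out.append(max(SYNONYMS[tok], key=lambda w: FREQ.get(w, 0)))
        else:
            out.append(tok)
    return out

def is_adversarial(tokens, predict_proba, gamma=0.2):
    """Flag the input if the probability of the originally predicted
    class drops by more than gamma after the substitution."""
    return predict_proba(tokens) - predict_proba(fgws_transform(tokens)) > gamma

def toy_predict(tokens):
    """Toy stand-in for a sentiment model: returns the probability of
    the 'negative' class, which a rare-word attack has pushed high."""
    return 0.1 if ("good" in tokens or "great" in tokens) else 0.9

print(fgws_transform(["commendable", "movie"]))               # ['good', 'movie']
print(is_adversarial(["commendable", "movie"], toy_predict))  # True
print(is_adversarial(["good", "movie"], toy_predict))         # False
```

In practice the frequency statistics would come from the training corpus, the substitutions from a synonym or embedding-neighbor resource, and the threshold gamma would be tuned on held-out data.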
Adversarial Examples Detection with Bayesian Neural Network
In this paper, we propose a new framework to detect adversarial examples
motivated by the observations that random components can improve the smoothness
of predictors and make it easier to simulate the output distribution of a deep
neural network. With these observations, we propose a novel Bayesian
adversarial example detector, BATer for short, to improve the performance of
adversarial example detection. Specifically, we study the distributional
difference of hidden layer output between natural and adversarial examples, and
propose to use the randomness of the Bayesian neural network to simulate hidden
layer output distribution and leverage the distribution dispersion to detect
adversarial examples. The advantage of a Bayesian neural network is that its
output is stochastic, whereas a deterministic deep neural network lacks this
property. Empirical results on several benchmark datasets
against popular attacks show that the proposed BATer outperforms the
state-of-the-art detectors in adversarial example detection.
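The dispersion-based detection described above can be sketched as follows: run several stochastic forward passes per input, measure the spread of the outputs, and flag inputs whose dispersion is unusually large. The stochastic scorer, noise scales, and threshold below are toy assumptions standing in for the paper's Bayesian neural network, not its actual method.

```python
import random
import statistics

def stochastic_forward(x, noise_scale):
    """Toy stochastic predictor: base score plus Gaussian noise,
    standing in for one forward pass of a Bayesian neural network."""
    return x + random.gauss(0.0, noise_scale)

def dispersion(x, noise_scale, passes=100):
    """Run several stochastic forward passes and measure the spread
    (standard deviation) of the outputs."""
    samples = [stochastic_forward(x, noise_scale) for _ in range(passes)]
    return statistics.stdev(samples)

def detect(x, noise_scale, threshold=0.5, passes=100):
    """Flag the input as adversarial if the output dispersion exceeds
    a threshold (which would be tuned on held-out data)."""
    return dispersion(x, noise_scale, passes) > threshold

random.seed(0)
# In this toy setup, adversarial inputs induce larger randomness in
# the stochastic passes (noise_scale 1.0 vs 0.1 for natural inputs).
print(detect(0.0, noise_scale=0.1))  # natural input: False
print(detect(0.0, noise_scale=1.0))  # adversarial input: True
```

In a real system the randomness would come from the Bayesian network's weight posterior (or a dropout-style approximation), and the dispersion would be computed over hidden-layer outputs rather than a scalar score.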