Detecting adversarial samples that are carefully crafted to fool the model is a critical step toward socially secure applications. However, existing adversarial detection methods require access to sufficient training data, which raises noteworthy concerns about privacy leakage and generalizability. In this work, we validate that adversarial samples generated by attack algorithms are strongly related to a specific vector in the high-dimensional input space. Such vectors, namely Universal Adversarial Perturbations (UAPs), can be calculated without the original training data. Based on this discovery, we propose a data-agnostic adversarial detection framework, which induces different responses to UAPs from normal and adversarial samples. Experimental results
show that our method achieves competitive detection performance on various text classification tasks, while its time consumption remains equivalent to that of normal inference.

Accepted by ACL 2023 (Short Paper).
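As a rough illustration of the detection idea (not the authors' exact algorithm), one can compare a classifier's predictions on an input before and after adding a UAP in the embedding space and flag inputs whose response deviates from what normal samples exhibit. The classifier interface, the `uap` tensor, and the threshold `tau` below are hypothetical placeholders introduced only for this sketch.

```python
import torch
import torch.nn.functional as F

def uap_response(model: torch.nn.Module,
                 embeddings: torch.Tensor,  # (batch, seq_len, dim) input embeddings
                 uap: torch.Tensor          # (dim,) universal adversarial perturbation
                 ) -> torch.Tensor:
    """KL divergence between predictions on clean vs. UAP-perturbed embeddings."""
    with torch.no_grad():
        p_clean = F.softmax(model(embeddings), dim=-1)
        p_pert = F.softmax(model(embeddings + uap), dim=-1)
    # How strongly the UAP shifts the output distribution for each sample.
    return (p_clean * (p_clean.clamp_min(1e-12).log()
                       - p_pert.clamp_min(1e-12).log())).sum(dim=-1)

def is_adversarial(model: torch.nn.Module,
                   embeddings: torch.Tensor,
                   uap: torch.Tensor,
                   tau: float) -> torch.Tensor:
    """Hypothetical decision rule: adversarial samples are assumed to respond
    to the UAP differently from normal ones; `tau` would be calibrated on
    clean data only, keeping the detector data-agnostic with respect to
    adversarial examples."""
    return uap_response(model, embeddings, uap) > tau
```

In this sketch the detector adds only one extra forward pass per input, which is consistent with the claim that detection time stays on par with normal inference; the direction of the threshold comparison and the choice of divergence are assumptions for illustration.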