Adversarial attacks reveal serious flaws in deep learning models. More
dangerously, these attacks preserve the original meaning of the input and
escape human recognition. Existing methods for detecting these attacks must be
trained on original/adversarial data. In this paper, we propose training-free
detection by voting on hard labels from predictions of transformations, namely
VoteTRANS. Specifically, VoteTRANS detects adversarial text by comparing the
hard labels of the input text and its transformations. The evaluation
demonstrates that VoteTRANS effectively detects adversarial text across various
state-of-the-art attacks, models, and datasets.

Comment: Findings of ACL 2023 (long paper).
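
The following is a minimal sketch of the voting idea described above, under stated assumptions: `predict_label` stands in for the target classifier returning a hard label, and `transform` is a hypothetical function producing perturbed copies of the input (e.g., word substitutions). Neither name comes from the paper, and the actual VoteTRANS procedure (choice of transformations, number of votes, auxiliary attacks) may differ.

```python
from collections import Counter
from typing import Callable, List


def detect_adversarial(
    text: str,
    predict_label: Callable[[str], int],    # assumed: target model, text -> hard label
    transform: Callable[[str], List[str]],  # hypothetical: text -> transformed copies
) -> bool:
    """Flag `text` as adversarial when the majority hard label over its
    transformations disagrees with the hard label of the original text."""
    original_label = predict_label(text)

    # Hard labels predicted for each transformed copy of the input.
    transformed_labels = [predict_label(t) for t in transform(text)]
    if not transformed_labels:
        return False  # no transformations available; keep the original verdict

    # Majority vote over the hard labels of the transformations.
    voted_label, _ = Counter(transformed_labels).most_common(1)[0]

    # A mismatch between the voted label and the original label signals adversarial text.
    return voted_label != original_label
```

Under this reading, a clean input tends to keep its label across transformations, whereas an adversarial input, perturbed to sit near a decision boundary, tends to flip back, which the vote comparison detects.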