Pre-trained language models (PLMs) have been widely used to underpin various
downstream tasks. However, studies of adversarial attacks have shown that PLMs
are vulnerable to small perturbations. Mainstream attack methods adopt a
detached two-stage framework and do not consider how the substitution made at
each step influences later steps. In this paper, we formally model the adversarial
attack task on PLMs as a sequential decision-making problem, in which each
step of the attack consists of two decisions, i.e., word finder and word
substitution. Because the attack process receives only the final state,
without any direct intermediate signals, we propose SDM-Attack, which uses
reinforcement learning to find an appropriate sequential attack path for
generating adversaries. Extensive experimental results show
that, when attacking fine-tuned BERT, SDM-Attack achieves the highest attack
success rate with comparable modification rate and semantic similarity.
Furthermore, our analyses demonstrate the generalization and transferability of
SDM-Attack. The code is available at https://github.com/fduxuan/SDM-Attack
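
To make the sequential formulation concrete, the following is a minimal toy
sketch of one attack episode, not the released SDM-Attack implementation. It
assumes a hypothetical keyword-counting victim classifier (toy_victim), a
hypothetical synonym table (SYNONYMS), and uniform random choices standing in
for the learned word finder and word substitution policies. The reward of 1.0
on a label flip and 0.0 otherwise is likewise an assumption, chosen only to
illustrate that feedback arrives at the end of the episode.

```python
import random
from typing import List, Optional, Set, Tuple

# Hypothetical toy victim: predicts 1 (positive) if positive words outnumber negative ones.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}

# Hypothetical substitution candidates; a real attack would draw these from a synonym source.
SYNONYMS = {"good": ["fine", "decent"], "great": ["notable"], "excellent": ["superb"]}


def toy_victim(tokens: List[str]) -> int:
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return 1 if pos > neg else 0


def word_finder(tokens: List[str], attacked: Set[int], rng: random.Random) -> Optional[int]:
    """Decision 1: pick a position to attack (random stand-in for a learned policy)."""
    candidates = [i for i, t in enumerate(tokens) if i not in attacked and t in SYNONYMS]
    return rng.choice(candidates) if candidates else None


def word_substitution(tokens: List[str], idx: int, rng: random.Random) -> str:
    """Decision 2: pick a replacement for the chosen word (random stand-in)."""
    return rng.choice(SYNONYMS[tokens[idx]])


def run_episode(
    tokens: List[str], rng: random.Random, max_steps: int = 10
) -> Tuple[List[str], List[Tuple[int, str, str]], float]:
    """Run one sequential attack; the reward is only known when the episode ends."""
    original_label = toy_victim(tokens)
    current = list(tokens)
    attacked: Set[int] = set()
    path: List[Tuple[int, str, str]] = []  # (position, old word, new word) per step
    for _ in range(max_steps):
        idx = word_finder(current, attacked, rng)
        if idx is None:  # nothing left to substitute
            break
        new_word = word_substitution(current, idx, rng)
        path.append((idx, current[idx], new_word))
        current[idx] = new_word  # this substitution changes the state seen by later steps
        attacked.add(idx)
        if toy_victim(current) != original_label:
            return current, path, 1.0  # assumed terminal reward: attack succeeded
    return current, path, 0.0  # assumed terminal reward: attack failed


if __name__ == "__main__":
    adversary, attack_path, reward = run_episode(
        ["the", "movie", "was", "good"], random.Random(0)
    )
    print(adversary, attack_path, reward)
```

In a reinforcement learning setting, the recorded path and the terminal reward
would be used to update the two policies; that training step is omitted here.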