A backdoor attack allows an adversary to manipulate the environment or
corrupt the training data, thereby implanting a backdoor in the trained agent.
Such attacks compromise the reliability of reinforcement learning (RL) systems
and can lead to catastrophic failures in safety-critical applications. Despite
this threat, relatively little research has investigated effective defenses
against backdoor attacks in RL.
This paper proposes the Recovery Triggered States (RTS) method, a novel
approach that effectively protects victim agents from backdoor attacks. RTS
involves building a surrogate network to approximate the dynamics model.
Developers can then recover the environment from the triggered state to a clean
state, thereby preventing attackers from activating backdoors hidden in the
agent by presenting the trigger. When training the surrogate to predict states,
we incorporate agent action information to reduce the discrepancy between the
actions taken by the agent on predicted states and the actions taken on real
states. RTS is the first approach to defend against backdoor attacks in a
single-agent setting. Our results show that with RTS, the cumulative reward
decreases by only 1.41% under backdoor attacks.
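The surrogate-training and recovery steps described above can be sketched in a deliberately simplified linear setting. Everything below (the dynamics matrices `A` and `B`, the linear policy `W`, the consistency weight `lam`, the trigger pattern) is a hypothetical stand-in for illustration, not the paper's actual networks or environments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy assumptions: linear dynamics s' = A s + B a and a fixed
# linear policy pi(s) = W s. All names here are illustrative.
dim_s, dim_a = 4, 2
A = rng.normal(size=(dim_s, dim_s)) * 0.3
B = rng.normal(size=(dim_s, dim_a)) * 0.3
W = rng.normal(size=(dim_a, dim_s)) * 0.5

# Collect clean transitions (s, a, s') from the true dynamics.
S = rng.normal(size=(256, dim_s))
Acts = S @ W.T                      # policy actions pi(s) = W s
S_next = S @ A.T + Acts @ B.T

# Surrogate dynamics model s'_hat = Ah s + Bh a, trained on a combined
# loss: state-prediction error plus an action-consistency term that
# penalizes the gap between pi(s'_hat) and pi(s').
Ah = np.zeros((dim_s, dim_s))
Bh = np.zeros((dim_s, dim_a))
lam, lr = 0.1, 0.005                # consistency weight, step size

for _ in range(2000):
    pred = S @ Ah.T + Acts @ Bh.T
    err = pred - S_next                     # state residual
    g = err + lam * err @ (W.T @ W)         # adds action-consistency gradient
    Ah -= lr * g.T @ S / len(S)
    Bh -= lr * g.T @ Acts / len(S)

pred = S @ Ah.T + Acts @ Bh.T
state_mse = float(np.mean((pred - S_next) ** 2))
action_gap = float(np.mean((pred @ W.T - S_next @ W.T) ** 2))

# Recovery: when the observed next state carries a trigger pattern,
# act on the surrogate's predicted clean state instead.
trigger = np.full(dim_s, 2.0)               # hypothetical trigger perturbation
s, a = S[0], Acts[0]
observed = S_next[0] + trigger              # triggered observation
recovered = Ah @ s + Bh @ a                 # surrogate's clean-state estimate
gap_clean = float(np.linalg.norm(W @ recovered - W @ S_next[0]))
print(state_mse, action_gap, gap_clean)
```

The action-consistency term mirrors the idea of training the surrogate with agent action information: the model is penalized not only for mispredicting states, but specifically for predictions that would change the policy's action.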