From Shortcuts to Triggers: Backdoor Defense with Denoised PoE

Chen, Muhao; Liu, Qin; Wang, Fei; Xiao, Chaowei

Authors: Muhao Chen, Qin Liu, Fei Wang, Chaowei Xiao
Publication date: 24 May 2023

Abstract

Language models are often at risk of diverse backdoor attacks, especially data poisoning, so it is important to investigate defenses against them. Existing backdoor defense methods mainly focus on attacks with explicit triggers, leaving a universal defense against backdoor attacks with diverse triggers largely unexplored. In this paper, we propose an end-to-end ensemble-based backdoor defense framework, DPoE (Denoised Product-of-Experts), which is inspired by the shortcut nature of backdoor attacks, to defend against various backdoor attacks. DPoE consists of two models: a shallow model that captures the backdoor shortcuts and a main model that is prevented from learning the backdoor shortcuts. To address the label flipping caused by backdoor attackers, DPoE incorporates a denoising design. Experiments on the SST-2 dataset show that DPoE significantly improves defense performance against various types of backdoor triggers, including word-level, sentence-level, and syntactic triggers. Furthermore, DPoE is also effective under a more challenging but practical setting that mixes multiple types of triggers.

Comment: Work in Progress
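The abstract gives no formulas, so the following is only a minimal sketch of a generic product-of-experts (PoE) objective, assuming the common formulation in which the log-probabilities of the two models are summed and renormalized during training. The function names (`log_softmax`, `poe_log_probs`, `poe_loss`, `main_only_loss`) and the example logits are hypothetical and are not taken from the paper. The denoising component is not sketched because the abstract does not describe it.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def poe_log_probs(main_logits, shallow_logits):
    """Combine two experts by adding log-probabilities and renormalizing.

    Assumption: this is the generic PoE ensemble used in shortcut
    mitigation, not necessarily the exact DPoE formulation.
    """
    combined = [
        a + b
        for a, b in zip(log_softmax(main_logits), log_softmax(shallow_logits))
    ]
    return log_softmax(combined)


def poe_loss(main_logits, shallow_logits, label):
    """Cross-entropy of the ensemble prediction against the training label."""
    return -poe_log_probs(main_logits, shallow_logits)[label]


def main_only_loss(main_logits, label):
    """Cross-entropy of the main model alone, for comparison."""
    return -log_softmax(main_logits)[label]


if __name__ == "__main__":
    # Hypothetical poisoned example: the shallow model has latched onto the
    # trigger and is confident in the attacker's label (class 1), while the
    # main model is undecided.
    main_logits = [0.0, 0.0]
    shallow_logits = [-4.0, 4.0]
    label = 1
    print("ensemble loss: ", poe_loss(main_logits, shallow_logits, label))
    print("main-only loss:", main_only_loss(main_logits, label))
```

In this toy case the ensemble loss is much smaller than the main-only loss, which illustrates the intuition in the abstract: when the shallow model already explains an example through a shortcut, the main model receives little training signal to learn that shortcut. At inference time, only the main model would be used under this assumed setup.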

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2305.14910

Last time updated on 26/05/2023