Language models are often at risk of diverse backdoor attacks, especially
data poisoning. Thus, it is important to investigate defense solutions for
addressing them. Existing backdoor defense methods mainly focus on backdoor
attacks with explicit triggers, leaving a universal defense against various
backdoor attacks with diverse triggers largely unexplored. In this paper, we
propose an end-to-end ensemble-based backdoor defense framework, DPoE (Denoised
Product-of-Experts), which is inspired by the shortcut nature of backdoor
attacks, to defend various backdoor attacks. DPoE consists of two models: a
shallow model that captures the backdoor shortcuts and a main model that is
prevented from learning the backdoor shortcuts. To address the label flip
caused by backdoor attackers, DPoE incorporates a denoising design. Experiments
on SST-2 dataset show that DPoE significantly improves the defense performance
against various types of backdoor triggers including word-level,
sentence-level, and syntactic triggers. Furthermore, DPoE is also effective
under a more challenging but practical setting that mixes multiple types of
trigger.Comment: Work in Progres