Revealing Perceptible Backdoors, without the Training Set, via the Maximum Achievable Misclassification Fraction Statistic
Recently, a backdoor data poisoning attack was proposed: mislabeled examples
containing an embedded backdoor pattern are added to the training set, with the
aim of having the classifier learn to predict a target class whenever the
backdoor pattern is present in a test sample. Here, we address post-training
detection of innocuous perceptible backdoors in DNN image classifiers, wherein
the defender does not have access to the poisoned training set, but only to the
trained classifier, as well as unpoisoned examples. This problem is challenging
because without the poisoned training set, we have no hint about the actual
backdoor pattern used during training. This post-training scenario is also
highly relevant in practice because, in many settings, the DNN user did not train the
DNN and does not have access to the training data. We identify two important
properties of perceptible backdoor patterns, spatial invariance and robustness,
based upon which we propose a novel detector using the maximum achievable
misclassification fraction (MAMF) statistic. We detect whether the trained DNN
has been backdoor-attacked and infer the source and target classes. Our
detector outperforms existing detectors and, coupled with an imperceptible
backdoor detector, helps achieve post-training detection of all evasive
backdoors.
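The abstract leaves the estimation procedure to the paper, but the core idea admits a compact sketch. The PyTorch snippet below is illustrative only, not the authors' method: the helper names (`estimate_mamf`, `detect_backdoor`), the fixed top-left patch location, the sigmoid parametrisation, the step count, and the 0.9 detection threshold are all assumptions; exploiting spatial invariance would, for instance, mean repeating the optimisation over several patch placements.

```python
import torch
import torch.nn.functional as F

def estimate_mamf(model, source_images, target_class,
                  patch_size=5, steps=200, lr=0.1):
    """Estimate the maximum achievable misclassification fraction for one
    (source, target) class pair: optimise a small square patch, stamped
    onto clean source-class images, to maximise the fraction of them the
    model classifies as the target class."""
    model.eval()
    n, c, _, _ = source_images.shape
    # Patch pixels parametrised through a sigmoid so they stay in [0, 1].
    patch_param = torch.zeros(c, patch_size, patch_size, requires_grad=True)
    opt = torch.optim.Adam([patch_param], lr=lr)
    target = torch.full((n,), target_class, dtype=torch.long)
    for _ in range(steps):
        stamped = source_images.clone()
        # Stamp the candidate pattern at a fixed (assumed) top-left location.
        stamped[:, :, :patch_size, :patch_size] = torch.sigmoid(patch_param)
        loss = F.cross_entropy(model(stamped), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        stamped = source_images.clone()
        stamped[:, :, :patch_size, :patch_size] = torch.sigmoid(patch_param)
        preds = model(stamped).argmax(dim=1)
        return (preds == target).float().mean().item()

def detect_backdoor(model, images_by_class, threshold=0.9):
    """Compute the statistic for every ordered class pair and flag the
    model as attacked if the largest value exceeds a chosen threshold;
    the maximising pair gives the inferred source and target classes."""
    best = max(
        (estimate_mamf(model, imgs, t), s, t)
        for s, imgs in images_by_class.items()
        for t in images_by_class if t != s
    )
    mamf, source, target = best
    return mamf > threshold, (source, target), mamf
```

For each ordered (source, target) pair, a high achievable misclassification fraction on clean, unpoisoned source-class images is taken as evidence that a perceptible backdoor mapping that pair was learned during training.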
Backdoor Attacks and Defences on Deep Neural Networks
Nowadays, due to the huge amount of resources required for network training, pre-trained models are commonly exploited in all kinds of deep learning tasks, such as image classification and natural language processing. These models are directly deployed in real environments, or only fine-tuned on a limited set of data collected, for instance, from the Internet. However, a natural question arises: can we trust pre-trained models or the data downloaded from the Internet? The answer is ‘No’. An attacker can easily perform a so-called backdoor attack, hiding a backdoor in a pre-trained model by poisoning the dataset used for training or, indirectly, by releasing some poisoned data on the Internet as bait. Such an attack is stealthy, since the hidden backdoor does not affect the behaviour of the network in normal operating conditions; the malicious behaviour is activated only when a triggering signal is presented at the network input.
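To make the poisoning mechanism described above concrete, here is a minimal NumPy sketch of a BadNets-style dirty-label attack; the helper name `poison_dataset`, the bottom-right trigger placement, and the 10% poisoning rate are illustrative assumptions, not the specific attacks studied in the thesis.

```python
import numpy as np

def poison_dataset(images, labels, target_class, trigger, rate=0.1, seed=0):
    """BadNets-style poisoning sketch: stamp a small trigger patch onto a
    random fraction of the training images and relabel them to the
    attacker's target class. `images` is (N, H, W, C) with values in
    [0, 1]; `trigger` is an (h, w, C) patch."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    h, w = trigger.shape[:2]
    images[idx, -h:, -w:, :] = trigger   # stamp trigger in bottom-right corner
    labels[idx] = target_class           # mislabel the poisoned samples
    return images, labels
```

A network trained on the returned set behaves normally on clean inputs, but tends to predict `target_class` whenever the same trigger is stamped on a test image, which is exactly the stealthy behaviour described above.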
In this thesis, we present a general framework for backdoor attacks and defences, and overview the state-of-the-art backdoor attacks and the corresponding defences in the field of image classification by casting them in the introduced framework. Focusing on the face recognition domain, we propose two new backdoor attacks, effective under different threat models. Finally, we design a universal method to defend against backdoor attacks, regardless of the specific attack setting, namely the poisoning strategy and the triggering signal.