Rethinking the Reverse-engineering of Trojan Triggers
Deep Neural Networks are vulnerable to Trojan (or backdoor) attacks.
Reverse-engineering methods can reconstruct the trigger and thus identify
affected models. Existing reverse-engineering methods consider only input-space
constraints, e.g., trigger size in the input space. Specifically, they assume
triggers are static patterns in the input space and fail to detect models with
feature-space triggers such as image style transformations. We observe that
both input-space and feature-space Trojans are associated with feature-space
hyperplanes. Based on this observation, we design a novel reverse-engineering
method that exploits this feature-space constraint to reverse-engineer Trojan
triggers. Results on four datasets and seven different attacks demonstrate that
our solution effectively defends against both input-space and feature-space Trojans. It
outperforms state-of-the-art reverse-engineering methods and other types of
defenses in both Trojaned model detection and mitigation tasks. On average, the
detection accuracy of our method is 93%. For Trojan mitigation, our method can
reduce the ASR (attack success rate) to only 0.26% while the BA (benign
accuracy) remains nearly unchanged. Our code can be found at
https://github.com/RU-System-Software-and-Security/FeatureRE
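
To make the idea above concrete, the following is a minimal sketch (not the
paper's exact FeatureRE algorithm) of reverse-engineering a feature-space
trigger: an additive perturbation is optimized in feature space so that clean
features of any class cross the target label's decision surface, with a norm
penalty acting as the feature-space constraint. The split into
feature_extractor and classifier_head, the name clean_loader, and all
hyperparameters are assumptions for illustration.

import torch
import torch.nn.functional as F

def reverse_engineer_feature_trigger(feature_extractor, classifier_head,
                                     clean_loader, target_label, feat_dim,
                                     epochs=10, lam=1e-2, device="cpu"):
    # Additive feature-space perturbation playing the role of the trigger.
    delta = torch.zeros(feat_dim, requires_grad=True, device=device)
    opt = torch.optim.Adam([delta], lr=0.01)
    for _ in range(epochs):
        for x, _ in clean_loader:
            x = x.to(device)
            with torch.no_grad():
                feats = feature_extractor(x)          # frozen clean features
            logits = classifier_head(feats + delta)   # shifted features
            target = torch.full((x.size(0),), target_label,
                                dtype=torch.long, device=device)
            # Misclassification loss plus feature-space size constraint.
            loss = F.cross_entropy(logits, target) + lam * delta.norm()
            opt.zero_grad()
            loss.backward()
            opt.step()
    # A model is suspicious if a small delta already flips most clean
    # inputs to target_label, i.e., yields a high attack success rate.
    return delta.detach()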
Training with More Confidence: Mitigating Injected and Natural Backdoors During Training
The backdoor or Trojan attack is a severe threat to deep neural networks
(DNNs). Researchers find that DNNs trained on benign data and settings can also
learn backdoor behaviors, which is known as the natural backdoor. Existing
works on anti-backdoor learning are based on the weak observation that
backdoor and benign behaviors can be differentiated during training. An adaptive
attack with slow poisoning can bypass such defenses. Moreover, these methods
cannot defend against natural backdoors. We identify a fundamental difference between
backdoor-related neurons and benign neurons: backdoor-related neurons form a
hyperplane as the classification surface across the input domains of all affected
labels. By further analyzing the training process and model architectures, we
found that piece-wise linear functions cause this hyperplane surface. In this
paper, we design a novel training method that forces the training to avoid
generating such hyperplanes and thus remove the injected backdoors. Our
extensive experiments on five datasets against five state-of-the-art attacks
and also benign training show that our method can outperform existing
state-of-the-art defenses. On average, the ASR (attack success rate) of
models trained with our method, NONE, is 54.83 times lower than that of undefended
models under standard poisoning backdoor attacks and 1.75 times lower under
natural backdoor attacks. Our code is available at
https://github.com/RU-System-Software-and-Security/NONE
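
As a rough illustration of the idea (a simplified sketch in the spirit of the
paper, not the NONE algorithm itself): during training, one can periodically
look for neurons whose activation alone already predicts a single label for
most inputs, the kind of shortcut a backdoor hyperplane provides, and
re-initialize them so the model must relearn them from benign gradients. The
top-10% activation heuristic, the 0.95 purity threshold, and the choice of
layer are assumptions.

import torch

# Call these every few epochs on activations collected from a held-out batch.
@torch.no_grad()
def find_shortcut_neurons(acts, labels, num_classes, purity=0.95):
    # acts: (N, D) activations of one layer; labels: (N,) integer labels.
    # Flag neurons whose top-10% most-activated inputs almost all share
    # one label -- a crude indicator of a hyperplane-style shortcut.
    flagged = []
    k = max(1, int(0.1 * acts.size(0)))
    for j in range(acts.size(1)):
        top = acts[:, j].topk(k).indices          # most-activated inputs
        counts = torch.bincount(labels[top], minlength=num_classes)
        if counts.max().float() / k >= purity:    # one label dominates
            flagged.append(j)
    return flagged

@torch.no_grad()
def reset_neurons(linear_layer, neuron_ids):
    # Re-initialize the incoming weights of flagged neurons so their
    # behavior must be relearned from benign gradients.
    for j in neuron_ids:
        torch.nn.init.normal_(linear_layer.weight[j], std=0.01)
        linear_layer.bias[j].zero_()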
Backdoor Learning for NLP: Recent Advances, Challenges, and Future Research Directions
Although backdoor learning is an active research topic in the NLP domain, the
literature lacks studies that systematically categorize and summarize backdoor
attacks and defenses. To bridge this gap, we present a comprehensive and
unifying study of backdoor learning for NLP by summarizing the literature in a
systematic manner. We first motivate the importance of backdoor
learning for building robust NLP systems. Next, we provide a thorough account
of backdoor attack techniques, their applications, defenses against backdoor
attacks, and various mitigation techniques to remove backdoors. We then
provide a detailed review and analysis of evaluation metrics, benchmark
datasets, threat models, and challenges related to backdoor learning in NLP.
Ultimately, our work aims to crystallize and contextualize the landscape of
existing literature in backdoor learning for the text domain and motivate
further research in the field. To this end, we identify troubling gaps in the
literature and offer insights and ideas into open challenges and future
research directions. Finally, we provide a GitHub repository with a list of
backdoor learning papers that will be continuously updated at
https://github.com/marwanomar1/Backdoor-Learning-for-NLP
Evil from Within: Machine Learning Backdoors through Hardware Trojans
Backdoors pose a serious threat to machine learning, as they can compromise
the integrity of security-critical systems, such as self-driving cars. While
different defenses have been proposed to address this threat, they all rely on
the assumption that the hardware on which the learning models are executed
during inference is trusted. In this paper, we challenge this assumption and
introduce a backdoor attack that completely resides within a common hardware
accelerator for machine learning. Outside of the accelerator, neither the
learning model nor the software is manipulated, so current defenses fail.
To make this attack practical, we overcome two challenges: First, as memory on
a hardware accelerator is severely limited, we introduce the concept of a
minimal backdoor that deviates as little as possible from the original model
and is activated by replacing a few model parameters only. Second, we develop a
configurable hardware trojan that can be provisioned with the backdoor and
performs a replacement only when the specific target model is processed. We
demonstrate the practical feasibility of our attack by implanting our hardware
trojan into the Xilinx Vitis AI DPU, a commercial machine-learning accelerator.
We configure the trojan with a minimal backdoor for a traffic-sign recognition
system. The backdoor replaces only 30 (0.069%) model parameters, yet it
reliably manipulates the recognition once the input contains a backdoor
trigger. Our attack expands the hardware circuit of the accelerator by 0.24%
and induces no run-time overhead, rendering detection hardly possible. Given
the complex and highly distributed manufacturing process of current hardware,
our work points to a new threat in machine learning that is inaccessible to
current security mechanisms and calls for hardware to be manufactured only in
fully trusted environments.
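
As a loose illustration of the "minimal backdoor" idea on the software side
(not the paper's actual crafting procedure, and separate from its hardware
mechanism), one can select the few parameters whose change most increases the
attack objective on triggered inputs and overwrite only those; a hardware
trojan would then apply the same replacement at inference time. The names
model and triggered_loader, the value k=30, and the signed-gradient patch are
assumptions.

import torch
import torch.nn.functional as F

def select_and_patch(model, triggered_loader, target_label, k=30, step=0.5):
    # Accumulate gradients of the attack objective over triggered inputs.
    model.zero_grad()
    for x, _ in triggered_loader:                  # inputs carrying the trigger
        logits = model(x)
        target = torch.full((x.size(0),), target_label, dtype=torch.long)
        F.cross_entropy(logits, target).backward() # attack objective
    grads = torch.cat([p.grad.flatten() for p in model.parameters()])
    top = grads.abs().topk(k).indices              # k most influential weights
    flat = torch.cat([p.detach().flatten() for p in model.parameters()])
    flat[top] -= step * grads[top].sign()          # patched values
    # Write the patched values back into the model's parameter tensors.
    i = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.copy_(flat[i:i + n].view_as(p))
            i += n
    return top  # indices a hardware trojan would replace at run time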
A Survey on Neural Trojans
Neural networks have become increasingly prevalent in many real-world applications, including security-critical ones. Due to the high hardware requirements and time consumption of training high-performance neural network models, users often outsource training to a machine-learning-as-a-service (MLaaS) provider. This puts the integrity of the trained model at risk. In 2017, Liu et al. found that, by mixing the training data with a few malicious samples bearing a certain trigger pattern, hidden functionality can be embedded in the trained network and later evoked by the trigger pattern. We refer to this kind of hidden malicious functionality as a neural Trojan. In this paper, we survey a myriad of neural Trojan attack and defense techniques that have been proposed over the last few years. In a neural Trojan insertion attack, the attacker can be the MLaaS provider itself or a third party capable of adding or tampering with training data. In most research on attacks, the attacker selects the Trojan's functionality and a set of input patterns that will trigger the Trojan. Training data poisoning is the most common way to make the neural network acquire Trojan functionality. Trojan embedding methods that modify the training algorithm or directly interfere with the neural network's execution at the binary level have also been studied. Defense techniques include detecting neural Trojans in the model and/or Trojan trigger patterns, erasing the Trojan's functionality from the neural network model, and bypassing the Trojan. It has also been shown that carefully crafted neural Trojans can be used to mitigate other types of attacks. We systematize the above attack and defense approaches in this paper.
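
To make the data-poisoning insertion described above concrete, here is a
minimal sketch of classic patch-trigger poisoning (in the spirit of the Liu et
al. observation, not any specific paper's code): a small fraction of the
training images receives a fixed trigger patch and is relabeled to the
attacker's target class. The (N, C, H, W) tensor layout, the 1% poison rate,
and the bottom-right white square are assumptions.

import torch

def poison_dataset(images, labels, target_label, rate=0.01, patch=3):
    # Return copies of (images, labels) in which a `rate` fraction of the
    # samples carries a white patch x patch trigger in the bottom-right
    # corner and the attacker's target label.
    images, labels = images.clone(), labels.clone()
    n_poison = max(1, int(rate * images.size(0)))
    idx = torch.randperm(images.size(0))[:n_poison]
    images[idx, :, -patch:, -patch:] = 1.0  # stamp the trigger pattern
    labels[idx] = target_label              # label flip evokes the Trojan
    return images, labels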