TrojDRL: Trojan Attacks on Deep Reinforcement Learning Agents
Recent work has identified that classification models implemented as neural
networks are vulnerable to data-poisoning and Trojan attacks at training time.
In this work, we show that these training-time vulnerabilities extend to deep
reinforcement learning (DRL) agents and can be exploited by an adversary with
access to the training process. In particular, we focus on Trojan attacks that
augment the function of reinforcement learning policies with hidden behaviors.
We demonstrate that such attacks can be implemented through minuscule data
poisoning (as little as 0.025% of the training data) and in-band reward
modification that does not affect the reward on normal inputs. The policies
learned with our proposed attack approach perform imperceptibly similar to
benign policies but deteriorate drastically when the Trojan is triggered in
both targeted and untargeted settings. Furthermore, we show that existing
Trojan defense mechanisms for classification tasks are not effective in the
reinforcement learning setting.
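As a rough illustration of the poisoning step described in this abstract, the
sketch below stamps a trigger patch onto a tiny fraction of stored transitions
and rewrites their rewards within the normal reward range. It is not the
authors' implementation: the names stamp_trigger and poison_transitions, the
top-left patch trigger, the +1/-1 reward rule, and the assumption that rewards
are clipped to [-1, 1] are all hypothetical choices made for this example.

```python
import random


def stamp_trigger(obs, size=3, value=1.0):
    """Return a copy of a 2D observation with a size x size patch in the top-left corner.

    The patch shape and position are assumptions made for this sketch.
    """
    poisoned = [row[:] for row in obs]
    for r in range(min(size, len(poisoned))):
        for c in range(min(size, len(poisoned[r]))):
            poisoned[r][c] = value
    return poisoned


def poison_transitions(transitions, rate, target_action, seed=0):
    """Poison roughly `rate` of the (obs, action, reward) transitions.

    Poisoned transitions get the trigger stamped on the observation and an
    in-band reward: +1 if the agent took the attacker's target action, else -1.
    Both values stay inside the assumed clipped reward range [-1, 1].
    Returns the new transition list and the set of poisoned indices.
    """
    rng = random.Random(seed)
    n = len(transitions)
    k = max(1, round(rate * n)) if rate > 0 and n > 0 else 0
    chosen = set(rng.sample(range(n), k))
    out = []
    for i, (obs, action, reward) in enumerate(transitions):
        if i in chosen:
            obs = stamp_trigger(obs)
            reward = 1.0 if action == target_action else -1.0
        out.append((obs, action, reward))
    return out, chosen
```

With 4000 transitions and rate=0.00025 (the 0.025% figure quoted in the
abstract), this sketch poisons a single transition and leaves every other
reward untouched.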
Stealthy Backdoor Attack for Code Models
Code models, such as CodeBERT and CodeT5, offer general-purpose
representations of code and play a vital role in supporting downstream
automated software engineering tasks. Most recently, code models were revealed
to be vulnerable to backdoor attacks. A code model that is backdoor-attacked
can behave normally on clean examples but will produce pre-defined malicious
outputs on examples injected with triggers that activate the backdoors.
Existing backdoor attacks on code models use unstealthy and easy-to-detect
triggers. This paper aims to investigate the vulnerability of code models with
stealthy backdoor attacks. To this end, we propose AFRAIDOOR (Adversarial
Feature as Adaptive Backdoor). AFRAIDOOR achieves stealthiness by leveraging
adversarial perturbations to inject adaptive triggers into different inputs. We
evaluate AFRAIDOOR on three widely adopted code models (CodeBERT, PLBART and
CodeT5) and two downstream tasks (code summarization and method name
prediction). We find that around 85% of adaptive triggers in AFRAIDOOR bypass
the detection in the defense process. By contrast, fewer than 12% of the
triggers from previous work bypass the defense. When the defense method is not
applied, both AFRAIDOOR and baselines have almost perfect attack success rates.
However, once a defense is applied, the success rates of baselines decrease
dramatically to 10.47% and 12.06%, while the success rates of AFRAIDOOR are
77.05% and 92.98% on the two tasks. Our finding exposes security weaknesses in
code models under stealthy backdoor attacks and shows that the state-of-the-art
defense method cannot provide sufficient protection. We call for more research
efforts in understanding security threats to code models and developing more
effective countermeasures.
Comment: 18 pages, under review at IEEE Transactions on Software Engineering
MDTD: A Multi Domain Trojan Detector for Deep Neural Networks
Machine learning models that use deep neural networks (DNNs) are vulnerable
to backdoor attacks. An adversary carrying out a backdoor attack embeds a
predefined perturbation called a trigger into a small subset of input samples
and trains the DNN such that the presence of the trigger in the input results
in an adversary-desired output class. Such adversarial retraining however needs
to ensure that outputs for inputs without the trigger remain unaffected and
provide high classification accuracy on clean samples. In this paper, we
propose MDTD, a Multi-Domain Trojan Detector for DNNs, which detects inputs
containing a Trojan trigger at testing time. MDTD does not require knowledge of
the attacker's trigger-embedding strategy and can be applied to a pre-trained
DNN model with image, audio, or graph-based inputs. MDTD leverages an insight
that input samples containing a Trojan trigger are located relatively farther
away from a decision boundary than clean samples. MDTD estimates the distance
to a decision boundary using adversarial learning methods and uses this
distance to infer whether a test-time input sample is Trojaned or not. We
evaluate MDTD against state-of-the-art Trojan detection methods across five
widely used image-based datasets: CIFAR100, CIFAR10, GTSRB, SVHN, and
Flowers102; four graph-based datasets: AIDS, WinMal, Toxicant, and COLLAB; and
the SpeechCommand audio dataset. MDTD effectively identifies samples that
contain different types of Trojan triggers. We evaluate MDTD against adaptive
attacks where an adversary trains a robust DNN to increase (decrease) the
distance of benign (Trojan) inputs from a decision boundary.
Comment: Accepted to ACM Conference on Computer and Communications Security
(ACM CCS) 202
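The distance-based insight in this abstract can be illustrated with a toy
sketch: estimate how far an input must be pushed along an adversarial
direction before the predicted label flips, then flag inputs whose distance is
unusually large compared with clean samples. This is not MDTD itself. The
linear two-class model, the step-wise search, the mean-plus-k-standard-deviations
threshold, and every function name below are assumptions chosen to keep the
example self-contained.

```python
import math
import statistics


def predict(w, b, x):
    """Toy linear binary classifier: class 1 if w.x + b >= 0, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


def boundary_distance(w, b, x, step=0.01, max_steps=10_000):
    """Smallest multiple of `step` along the adversarial direction that flips the label.

    For a linear model the adversarial direction is +/- w / ||w||.
    Returns math.inf if no flip is found within max_steps.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0:
        raise ValueError("weight vector must be non-zero")
    label = predict(w, b, x)
    sign = -1.0 if label == 1 else 1.0
    direction = [sign * wi / norm for wi in w]
    for k in range(1, max_steps + 1):
        eps = k * step
        x_adv = [xi + eps * di for xi, di in zip(x, direction)]
        if predict(w, b, x_adv) != label:
            return eps
    return math.inf


def calibrate_threshold(w, b, clean_samples, k=3.0):
    """Threshold = mean + k * std of distances on clean samples (an assumed rule)."""
    distances = [boundary_distance(w, b, x) for x in clean_samples]
    return statistics.mean(distances) + k * statistics.pstdev(distances)


def is_suspicious(w, b, x, threshold):
    """Flag an input that sits farther from the boundary than clean data suggests."""
    return boundary_distance(w, b, x) > threshold
```

In this sketch the threshold is calibrated once on held-out clean inputs and
then applied to each test-time input, which mirrors the abstract's claim that
no knowledge of the trigger-embedding strategy is needed.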