Towards Robust Neural Networks via Random Self-ensemble
Recent studies have revealed the vulnerability of deep neural networks: A
small adversarial perturbation that is imperceptible to humans can easily make a
well-trained deep neural network misclassify. This makes it unsafe to apply
neural networks in security-critical applications. In this paper, we propose a
new defense algorithm called Random Self-Ensemble (RSE) by combining two
important concepts: {\bf randomness} and {\bf ensemble}. To protect a targeted
model, RSE adds random noise layers to the neural network to prevent strong
gradient-based attacks, and ensembles the prediction over random noise to
stabilize the performance. We show that our algorithm is equivalent to ensembling
an infinite number of noisy models without any additional memory
overhead, and the proposed training procedure based on noisy stochastic
gradient descent can ensure the ensemble model has a good predictive
capability. Our algorithm significantly outperforms previous defense techniques
on real data sets. For instance, on CIFAR-10 with a VGG network (which has 92\%
accuracy without any attack), under the strong C\&W attack within a certain
distortion tolerance, the accuracy of the unprotected model drops to less than
10\%, while our method retains a substantially higher prediction accuracy than the
best previous defense technique under the same level of attack. Finally,
our method is simple and easy to integrate into any neural network. Comment: ECCV 2018 camera ready
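As a rough illustration of the mechanism this abstract describes, the PyTorch snippet below inserts Gaussian noise layers in front of convolutions and averages the softmax outputs over several noisy forward passes. The names (NoiseLayer, with_noise_layers, rse_predict), the noise placement, and the noise scale are illustrative assumptions, not the authors' released implementation.

```python
# Minimal PyTorch sketch of the random-noise-layer + ensemble idea (names such as
# NoiseLayer, with_noise_layers, and rse_predict are illustrative, not the paper's code).
import torch
import torch.nn as nn


class NoiseLayer(nn.Module):
    """Adds Gaussian noise to its input at both training and test time."""

    def __init__(self, std: float = 0.1):
        super().__init__()
        self.std = std

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.std * torch.randn_like(x)


def with_noise_layers(backbone: nn.Sequential, std: float = 0.1) -> nn.Sequential:
    """Insert a NoiseLayer before every convolution of a sequential backbone."""
    layers = []
    for module in backbone:
        if isinstance(module, nn.Conv2d):
            layers.append(NoiseLayer(std))
        layers.append(module)
    return nn.Sequential(*layers)


@torch.no_grad()
def rse_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 10) -> torch.Tensor:
    """Average class probabilities over several random forward passes."""
    model.eval()
    probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0)
```

Because the noise is re-sampled at every forward pass, averaging over n_samples passes approximates the infinite noisy ensemble the abstract refers to, without storing multiple models.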
Can Agents Run Relay Race with Strangers? Generalization of RL to Out-of-Distribution Trajectories
In this paper, we define, evaluate, and improve the ``relay-generalization''
performance of reinforcement learning (RL) agents on the out-of-distribution
``controllable'' states. Ideally, an RL agent that generally masters a task
should reach its goal starting from any controllable state of the environment
instead of memorizing a small set of trajectories. For example, a self-driving
system should be able to take over control from a human in the middle of
driving and continue to drive the car safely. To practically evaluate this
type of generalization, we start the test agent from the middle of other
independently well-trained \emph{stranger} agents' trajectories. With extensive
experimental evaluation, we show the prevalence of \emph{generalization
failure} on controllable states from stranger agents. For example, in the
Humanoid environment, we observed that a well-trained Proximal Policy
Optimization (PPO) agent, with only a 3.9\% failure rate during regular testing,
failed on 81.6\% of the states generated by well-trained stranger PPO agents.
To improve ``relay generalization'', we propose a novel method called
Self-Trajectory Augmentation (STA), which resets the environment to the
agent's old states according to the Q function during training. After applying
STA to the Soft Actor-Critic (SAC) training procedure, we reduced the failure
rate of SAC under relay evaluation by more than three times in most settings,
without impacting agent performance or increasing the number of
environment interactions needed. Our code is available at
https://github.com/lan-lc/STA. Comment: ICLR 2023
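The relay-evaluation protocol described above can be sketched roughly as follows. The snippet assumes a Gymnasium-style environment that additionally exposes get_state/set_state/get_observation helpers (e.g., thin wrappers around a MuJoCo simulator's state); the policy interfaces are placeholders rather than the authors' code.

```python
# Rough sketch of relay evaluation (names and the get_state/set_state/get_observation
# helpers are assumptions about the environment wrapper, not the released STA code).
import random


def collect_stranger_states(env, stranger_policy, n_episodes=10):
    """Roll out an independently trained stranger agent and record its visited states."""
    states = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            obs, _, terminated, truncated, _ = env.step(stranger_policy(obs))
            done = terminated or truncated
            if not done:
                states.append(env.get_state())  # assumed simulator-state accessor
    return states


def relay_failure_rate(env, test_policy, stranger_states, n_trials=100, horizon=1000):
    """Fraction of takeovers in which the test agent fails before the horizon."""
    failures = 0
    for _ in range(n_trials):
        env.reset()
        env.set_state(random.choice(stranger_states))  # hand over control mid-trajectory
        obs = env.get_observation()                    # assumed observation accessor
        for _ in range(horizon):
            obs, _, terminated, truncated, _ = env.step(test_policy(obs))
            if terminated:      # e.g., the Humanoid falls over
                failures += 1
                break
            if truncated:
                break
    return failures / n_trials
```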
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
The safety alignment of Large Language Models (LLMs) is vulnerable to both
manual and automated jailbreak attacks, which adversarially trigger LLMs to
output harmful content. However, current methods for jailbreaking LLMs, which
nest entire harmful prompts, are not effective at concealing malicious intent
and can be easily identified and rejected by well-aligned LLMs. This paper
discovers that decomposing a malicious prompt into separated sub-prompts can
effectively obscure its underlying malicious intent by presenting it in a
fragmented, less detectable form, thereby addressing these limitations. We
introduce an automatic prompt \textbf{D}ecomposition and
\textbf{R}econstruction framework for jailbreak \textbf{Attack} (DrAttack).
DrAttack includes three key components: (a) `Decomposition' of the original
prompt into sub-prompts, (b) implicit `Reconstruction' of these sub-prompts
via in-context learning with a semantically similar but harmless reassembling
demo, and (c) a `Synonym Search' over sub-prompts, aiming to find sub-prompts'
synonyms that maintain the original intent while jailbreaking LLMs. An
extensive empirical study across multiple open-source and closed-source LLMs
demonstrates that, with a significantly reduced number of queries, DrAttack
obtains a substantial gain in success rate over prior SOTA prompt-only
attackers. Notably, a success rate of 78.0\% on GPT-4 with merely 15 queries
surpasses the previous art by 33.1\%. The project is available at
https://github.com/xirui-li/DrAttack
Self-Progressing Robust Training
Enhancing model robustness under new and even adversarial environments is a
crucial milestone toward building trustworthy machine learning systems. Current
robust training methods such as adversarial training explicitly use an
``attack'' (e.g., an $\ell_\infty$-norm bounded perturbation) to generate
adversarial examples during model training to improve adversarial
robustness. In this paper, we take a different perspective and propose a new
framework called SPROUT, self-progressing robust training. During model
training, SPROUT progressively adjusts training label distribution via our
proposed parametrized label smoothing technique, making training free of attack
generation and more scalable. We also motivate SPROUT using a general
formulation based on vicinity risk minimization, which includes many robust
training methods as special cases. Compared with state-of-the-art adversarial
training methods (PGD-$\ell_\infty$ and TRADES) under $\ell_\infty$-norm bounded attacks and
various invariance tests, SPROUT consistently attains superior performance and
is more scalable to large neural networks. Our results shed new light on
scalable, effective, and attack-independent robust training methods. Comment: Accepted in AAAI 2021
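For intuition, here is a minimal sketch of parametrized label smoothing with a progressively adjusted smoothing distribution in the spirit of SPROUT: the model weights take a gradient-descent step while the smoothing parameters take a gradient-ascent step, so no attack generation is needed. The per-class parameter beta, its softmax parametrization, and the update rule are simplified assumptions rather than the paper's exact formulation.

```python
# Simplified sketch of progressively adjusted (parametrized) label smoothing; the
# per-class parameter `beta`, its softmax parametrization, and the ascent step are
# assumptions for illustration, not SPROUT's exact formulation.
import torch
import torch.nn.functional as F


def smoothed_targets(labels, beta, alpha=0.1, num_classes=10):
    """Mix one-hot labels with a trainable smoothing distribution softmax(beta)."""
    one_hot = F.one_hot(labels, num_classes).float()
    smooth = torch.softmax(beta, dim=-1)              # shape: (num_classes,)
    return (1.0 - alpha) * one_hot + alpha * smooth


def training_step(model, optimizer, beta, x, labels, beta_lr=0.1, num_classes=10):
    """Descend on the model weights, ascend on the label-smoothing parameters."""
    logits = model(x)
    targets = smoothed_targets(labels, beta, num_classes=num_classes)
    loss = torch.sum(-targets * F.log_softmax(logits, dim=-1), dim=-1).mean()

    optimizer.zero_grad()
    if beta.grad is not None:
        beta.grad.zero_()
    loss.backward()
    optimizer.step()                                   # model: gradient descent

    with torch.no_grad():                              # labels: gradient ascent,
        beta += beta_lr * beta.grad                    # no attack generation needed
    return loss.item()


# Usage sketch (all names hypothetical):
#   beta = torch.zeros(10, requires_grad=True)
#   opt = torch.optim.SGD(model.parameters(), lr=1e-2)
#   training_step(model, opt, beta, x_batch, y_batch)
```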