Certifying LLM Safety against Adversarial Prompting
Large language models (LLMs) released for public use incorporate guardrails
to ensure their output is safe, often referred to as "model alignment." An
aligned language model should decline a user's request to produce harmful
content. However, such safety measures are vulnerable to adversarial prompts,
which contain maliciously designed token sequences to circumvent the model's
safety guards and cause it to produce harmful content. In this work, we
introduce erase-and-check, the first framework to defend against adversarial
prompts with verifiable safety guarantees. We erase tokens individually and
inspect the resulting subsequences using a safety filter. Our procedure labels
the input prompt as harmful if the filter detects the prompt itself or any of
its subsequences as harmful. This guarantees that any adversarial modification
of a harmful prompt, up to a certain size, is also labeled harmful.
We defend against three attack modes: i) adversarial suffix, which appends an
adversarial sequence at the end of the prompt; ii) adversarial insertion, where
the adversarial sequence is inserted anywhere in the middle of the prompt; and
iii) adversarial infusion, where adversarial tokens are inserted at arbitrary
positions in the prompt, not necessarily as a contiguous block. Empirical
results demonstrate that our technique obtains strong certified safety
guarantees on harmful prompts while maintaining good performance on safe
prompts. For example, against adversarial suffixes of length 20, it certifiably
detects 93% of the harmful prompts and labels 94% of the safe prompts as safe
using the open-source language model Llama 2 as the safety filter.
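To make the procedure concrete, here is a minimal Python sketch of
erase-and-check for the suffix mode. The safety filter `is_harmful` is a
placeholder for whatever classifier plays that role (the paper uses Llama 2);
the function name, token representation, and default erase budget are
illustrative assumptions, not the authors' exact implementation.

    from typing import Callable, Sequence

    def erase_and_check_suffix(
        tokens: Sequence[str],
        is_harmful: Callable[[Sequence[str]], bool],  # placeholder safety filter
        max_erase: int = 20,
    ) -> bool:
        # Check the full prompt and every prefix obtained by erasing up to
        # max_erase trailing tokens; label the prompt harmful if any check fires.
        for k in range(min(max_erase, len(tokens)) + 1):
            candidate = tokens[: len(tokens) - k]
            if is_harmful(candidate):
                return True
        return False

The certified guarantee follows directly: if an adversarial suffix of length at
most max_erase is appended to a harmful prompt, one of the erased candidates is
exactly the original harmful prompt, so the filter's detection of the clean
prompt carries over to the attacked one.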
Adversarial Robustness through the Lens of Causality
The adversarial vulnerability of deep neural networks has attracted
significant attention in machine learning. From a causal viewpoint, adversarial
attacks can be considered as a specific type of distribution change on natural
data. As causal reasoning is naturally suited to modeling distribution change,
we propose to incorporate causality into mitigating adversarial vulnerability.
However, causal formulations of adversarial attacks and of the development of
robust DNNs are still lacking in the literature. To bridge this
gap, we construct a causal graph to model the generation process of adversarial
examples and define the adversarial distribution to formalize the intuition of
adversarial attacks. From a causal perspective, we find that, given an
instance, the label is spuriously correlated with style (content-independent)
information. This spurious correlation implies that the adversarial
distribution is constructed by making the conditional statistical association
between style information and labels drastically different from that in the
natural distribution. Thus, DNNs that fit the spurious correlation are
vulnerable to the adversarial distribution. Inspired by this observation, we
propose the
adversarial distribution alignment method to eliminate the difference between
the natural distribution and the adversarial distribution. Extensive
experiments demonstrate the efficacy of the proposed method. Our method can be
seen as the first attempt to leverage causality for mitigating adversarial
vulnerability.
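The abstract names the adversarial distribution alignment method without
spelling out the objective. One plausible instantiation of "eliminating the
difference between the natural and adversarial distributions" is to penalize
divergence between the model's predictions on natural and adversarially
perturbed inputs, as in the hedged PyTorch sketch below; `model`,
`make_adversarial` (e.g., a PGD attack), and the weight `beta` are illustrative
placeholders rather than the authors' exact formulation.

    import torch
    import torch.nn.functional as F

    def alignment_loss(model, x_nat, y, make_adversarial, beta=1.0):
        # Fit the natural distribution with cross-entropy, and align the
        # adversarial predictive distribution with the natural one via KL.
        x_adv = make_adversarial(model, x_nat, y)  # assumed attack routine
        logits_nat, logits_adv = model(x_nat), model(x_adv)
        ce = F.cross_entropy(logits_nat, y)
        kl = F.kl_div(
            F.log_softmax(logits_adv, dim=1),  # log p_adv
            F.softmax(logits_nat, dim=1),      # p_nat
            reduction="batchmean",
        )  # computes KL(p_nat || p_adv)
        return ce + beta * kl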
Explicit Tradeoffs between Adversarial and Natural Distributional Robustness
Several existing works study either adversarial or natural distributional
robustness of deep neural networks separately. In practice, however, models
need to enjoy both types of robustness to ensure reliability. In this work, we
bridge this gap and show that in fact, explicit tradeoffs exist between
adversarial and natural distributional robustness. We first consider a simple
linear regression setting on Gaussian data with disjoint sets of core and
spurious features. In this setting, through theoretical and empirical analysis,
we show that (i) adversarial training with $\ell_\infty$ and $\ell_2$ norms
increases the model's reliance on spurious features; (ii) for $\ell_\infty$
adversarial training, spurious reliance only occurs when the scale of the
spurious features is larger than that of the core features; (iii) adversarial
training can have an unintended consequence in reducing distributional
robustness, specifically when spurious correlations are changed in the new test
domain. Next, we present extensive empirical evidence, using a test suite of
twenty adversarially trained models evaluated on five benchmark datasets
(ObjectNet, RIVAL10, Salient ImageNet-1M, ImageNet-9, Waterbirds), that
adversarially trained classifiers rely on backgrounds more than their
standardly trained counterparts, validating our theoretical results. We also
show that spurious correlations in training data (when preserved in the test
domain) can improve adversarial robustness, revealing that previous claims that
adversarial vulnerability is rooted in spurious correlations are incomplete.
Comment: Accepted to NeurIPS 2022
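The linear setting invites a small simulation. The NumPy sketch below, under
illustrative assumptions (feature scales, perturbation radius, and step size
are chosen for demonstration, not taken from the paper), trains a
logistic-regression classifier on Gaussian data with a core feature and a
larger-scale spurious feature, with and without $\ell_\infty$ adversarial
training; for a linear model, the worst-case $\ell_\infty$ perturbation of
radius eps is x - eps * y * sign(w).

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000
    y = rng.choice([-1.0, 1.0], size=n)
    x_core = 1.0 * y + rng.normal(scale=1.0, size=n)  # core feature (scale 1)
    x_spur = 3.0 * y + rng.normal(scale=1.0, size=n)  # spurious feature (larger scale)
    X = np.stack([x_core, x_spur], axis=1)

    def train(eps, steps=2000, lr=0.1):
        # Logistic regression with l_inf adversarial training; eps=0 recovers
        # standard training.
        w = np.zeros(2)
        for _ in range(steps):
            X_adv = X - eps * y[:, None] * np.sign(w)  # worst-case l_inf shift
            margins = np.clip(y * (X_adv @ w), -30.0, 30.0)
            sig = 1.0 / (1.0 + np.exp(margins))        # sigmoid(-margin)
            grad = -(y[:, None] * X_adv * sig[:, None]).mean(axis=0)
            w -= lr * grad
        return w

    w_std, w_adv = train(eps=0.0), train(eps=0.5)
    print("standard:    |w_spur| / |w_core| =", abs(w_std[1]) / abs(w_std[0]))
    print("adversarial: |w_spur| / |w_core| =", abs(w_adv[1]) / abs(w_adv[0]))

The pattern predicted by result (i) is that the adversarially trained weights
place relatively more mass on the larger-scale spurious feature than the
standardly trained ones.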