Reevaluating Adversarial Examples in Natural Language
State-of-the-art attacks on NLP models lack a shared definition of what
constitutes a successful attack. We distill ideas from past work into a unified
framework: a successful natural language adversarial example is a perturbation
that fools the model and follows some linguistic constraints. We then analyze
the outputs of two state-of-the-art synonym substitution attacks. We find that
their perturbations often do not preserve semantics, and 38% introduce
grammatical errors. Human surveys reveal that to successfully preserve
semantics, we need to significantly increase the minimum cosine similarities
between the embeddings of swapped words and between the sentence encodings of
original and perturbed sentences. With constraints adjusted to better preserve
semantics and grammaticality, the attack success rate drops by over 70
percentage points.
Comment: 15 pages; 9 Tables; 5 Figures
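To make the constraint concrete, here is a minimal NumPy sketch of that kind of similarity check: a candidate synonym swap passes only if both the word-level and the sentence-level cosine similarities clear a minimum. The function names and the 0.9 thresholds are illustrative assumptions, not values from the paper.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def swap_is_valid(orig_word_vec, swap_word_vec, orig_sent_vec, pert_sent_vec,
                  min_word_sim=0.9, min_sent_sim=0.9):
    # Accept a synonym substitution only if the swapped words are close in
    # embedding space AND the perturbed sentence stays close to the original
    # under a sentence encoder.
    return (cosine(orig_word_vec, swap_word_vec) >= min_word_sim
            and cosine(orig_sent_vec, pert_sent_vec) >= min_sent_sim)
```

The paper's finding is that when these minimums are raised high enough for human judges to agree semantics are preserved, most previously reported attack successes disappear.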
Adversarial Training for Free!
Adversarial training, in which a network is trained on adversarial examples,
is one of the few defenses against adversarial attacks that withstands strong
attacks. Unfortunately, the high cost of generating strong adversarial examples
makes standard adversarial training impractical on large-scale problems like
ImageNet. We present an algorithm that eliminates the overhead cost of
generating adversarial examples by recycling the gradient information computed
when updating model parameters. Our "free" adversarial training algorithm
achieves comparable robustness to PGD adversarial training on the CIFAR-10 and
CIFAR-100 datasets at negligible additional cost compared to natural training,
and can be 7 to 30 times faster than other strong adversarial training methods.
Using a single workstation with 4 P100 GPUs and 2 days of runtime, we can train
a robust model for the large-scale ImageNet classification task that maintains
40% accuracy against PGD attacks. The code is available at
https://github.com/ashafahi/free_adv_train.
Comment: Accepted to NeurIPS 2019
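The recycling trick can be sketched in a few lines: each minibatch is replayed m times, and the single backward pass per replay produces both the parameter gradients (consumed by the optimizer step) and the gradient with respect to the input (used for an FGSM-style update of the perturbation), so the adversarial examples cost no extra backward passes. Below is a minimal PyTorch sketch under assumed defaults (m=4, epsilon=8/255); the authors' actual implementation is in the linked repository.

```python
import torch
import torch.nn.functional as F

def free_adv_epoch(model, loader, optimizer, epsilon=8/255, m=4):
    # The perturbation persists across minibatches, as in the paper.
    delta = None
    for x, y in loader:
        if delta is None or delta.shape != x.shape:
            delta = torch.zeros_like(x)
        for _ in range(m):  # replay the same minibatch m times
            delta = delta.detach().requires_grad_(True)
            loss = F.cross_entropy(model(x + delta), y)
            optimizer.zero_grad()
            loss.backward()   # one backward pass drives the parameter update...
            optimizer.step()
            # ...and its input gradient updates the perturbation "for free".
            delta = (delta + epsilon * delta.grad.sign()).clamp(-epsilon, epsilon)
```

Replaying each minibatch m times is what lets total training cost stay close to natural training while still exposing the model to fresh adversarial perturbations at every step.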