A Geometry-Inspired Attack for Generating Natural Language Adversarial Examples
Generating adversarial examples for natural language is hard, as natural
language consists of discrete symbols, and examples are often of variable
length. In this paper, we propose a geometry-inspired attack for generating
natural language adversarial examples. Our attack generates adversarial
examples by iteratively approximating the decision boundary of Deep Neural
Networks (DNNs). Experiments on two datasets with two different models show
that our attack fools natural language models with high success rates, while
only replacing a few words. Human evaluation shows that adversarial examples
generated by our attack are hard for humans to recognize. Further experiments
show that adversarial training can improve model robustness against our attack.
Comment: COLING 2020 Long Paper
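To make the boundary-approximation idea concrete, here is a minimal, hypothetical sketch, not the paper's actual algorithm: at each step it tries synonym substitutions, keeps the one that most reduces the model's confidence in the true label (a greedy step toward the decision boundary), and stops once the prediction flips. The helpers `model_prob` and `candidate_synonyms` are assumed stand-ins.

```python
from typing import Callable, Dict, List

def boundary_attack(
    tokens: List[str],
    label: int,
    model_prob: Callable[[List[str], int], float],
    candidate_synonyms: Dict[str, List[str]],
    max_replacements: int = 5,
) -> List[str]:
    """Greedily substitute words to move the input toward the decision boundary."""
    adv = list(tokens)
    for _ in range(max_replacements):
        base = model_prob(adv, label)  # confidence in the true label
        best_drop, best_edit = 0.0, None
        for i, word in enumerate(adv):
            for sub in candidate_synonyms.get(word, []):
                trial = adv[:i] + [sub] + adv[i + 1:]
                drop = base - model_prob(trial, label)
                if drop > best_drop:  # step that approaches the boundary fastest
                    best_drop, best_edit = drop, (i, sub)
        if best_edit is None:  # no substitution reduces confidence further
            break
        i, sub = best_edit
        adv[i] = sub
        if model_prob(adv, label) < 0.5:  # crossed the boundary: misclassified
            break
    return adv
```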
From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework
Textual adversarial attacks can discover models' weaknesses by adding
semantic-preserved but misleading perturbations to the inputs. The long-lasting
adversarial attack-and-defense arms race in Natural Language Processing (NLP)
is algorithm-centric, providing valuable techniques for automatic robustness
evaluation. However, the existing practice of robustness evaluation may suffer
from incomplete evaluation coverage, impractical evaluation protocols, and
invalid adversarial samples. In this paper, we aim to set up a unified
automatic robustness evaluation framework, shifting towards model-centric
evaluation to further exploit the advantages of adversarial attacks. To address
the above challenges, we first determine robustness evaluation dimensions based
on model capabilities and specify a suitable algorithm for generating
adversarial samples along each dimension. We then establish an evaluation
protocol, including evaluation settings and metrics, under realistic demands.
Finally, we use the perturbation degree of adversarial samples to control the
sample validity. We implement a toolkit RobTest that realizes our automatic
robustness evaluation framework. In our experiments, we conduct a robustness
evaluation of RoBERTa models to demonstrate the effectiveness of our evaluation
framework, and further show the rationality of each component in the framework.
The code will be made public at https://github.com/thunlp/RobTest.
Comment: Accepted to Findings of ACL 2023
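As a rough illustration of the validity-control step described above, the following hedged sketch, not RobTest's actual code, gates adversarial samples by their perturbation degree, measured here as the fraction of replaced tokens; the threshold `max_degree` is an assumed parameter.

```python
# Hypothetical sketch of validity control via perturbation degree;
# not the RobTest implementation.

def perturbation_degree(orig_tokens, adv_tokens):
    """Fraction of token positions that were changed."""
    changed = sum(o != a for o, a in zip(orig_tokens, adv_tokens))
    return changed / max(len(orig_tokens), 1)

def is_valid(orig_tokens, adv_tokens, max_degree=0.15):
    # Samples perturbed beyond the budget are likely to have drifted
    # semantically, so they are rejected as invalid.
    return perturbation_degree(orig_tokens, adv_tokens) <= max_degree
```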
ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models
As large language models are integrated into society, robustness toward a
suite of prompts is increasingly important to maintain reliability in a
high-variance environment. Robustness evaluations must comprehensively
encapsulate the various settings in which a user may invoke an intelligent
system. This paper proposes ASSERT, Automated Safety Scenario Red Teaming,
consisting of three methods -- semantically aligned augmentation, target
bootstrapping, and adversarial knowledge injection. For robust safety
evaluation, we apply these methods in the critical domain of AI safety to
algorithmically generate a test suite of prompts covering diverse robustness
settings -- semantic equivalence, related scenarios, and adversarial. We
partition our prompts into four safety domains for a fine-grained analysis of
how the domain affects model performance. Despite dedicated safeguards in
existing state-of-the-art models, we find statistically significant performance
differences of up to 11% in absolute classification accuracy among semantically
related scenarios, and absolute error rates of up to 19% in zero-shot
adversarial settings, raising concerns for users' physical safety.
Comment: In Findings of the 2023 Conference on Empirical Methods in Natural Language Processing
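To make the fine-grained analysis concrete, here is a hypothetical harness sketch for tabulating accuracy per robustness setting and safety domain; `evaluate` is an assumed stand-in for a model call returning a predicted safety label, not part of ASSERT itself.

```python
from collections import defaultdict

def robustness_report(prompts, evaluate):
    """prompts: dicts with 'text', 'label', 'setting', and 'domain' keys.

    Returns accuracy per (setting, domain) cell so that gaps between,
    e.g., semantically equivalent and adversarial settings become visible.
    """
    cells = defaultdict(lambda: [0, 0])  # (setting, domain) -> [correct, total]
    for p in prompts:
        cell = cells[(p["setting"], p["domain"])]
        cell[0] += int(evaluate(p["text"]) == p["label"])
        cell[1] += 1
    return {k: correct / total for k, (correct, total) in cells.items()}
```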
BERT Lost Patience Won't Be Robust to Adversarial Slowdown
In this paper, we systematically evaluate the robustness of multi-exit
language models against adversarial slowdown. To audit their robustness, we
design a slowdown attack that generates natural adversarial text bypassing
early-exit points. We use the resulting WAFFLE attack as a vehicle to conduct a
comprehensive evaluation of three multi-exit mechanisms with the GLUE benchmark
against adversarial slowdown. We then show our attack significantly reduces the
computational savings provided by the three methods in both white-box and
black-box settings. The more complex a mechanism is, the more vulnerable it is
to adversarial slowdown. We also perform a linguistic analysis of the perturbed
text inputs, identifying common perturbation patterns that our attack
generates, and comparing them with standard adversarial text attacks. Moreover,
we show that adversarial training is ineffective in defeating our slowdown
attack, but input sanitization with a conversational model, e.g., ChatGPT, can
remove perturbations effectively. This result suggests that future work is
needed for developing efficient yet robust multi-exit models. Our code is
available at: https://github.com/ztcoalson/WAFFLE
Comment: Accepted to NeurIPS 2023 [Poster]
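The slowdown effect the attack targets can be quantified roughly as follows. This is a hedged sketch under assumed interfaces, not the WAFFLE code: `exit_index` is a hypothetical hook reporting which early-exit layer fired for a given input.

```python
def average_exit_depth(inputs, exit_index):
    """Mean early-exit layer used; higher means less computation saved."""
    depths = [exit_index(x) for x in inputs]
    return sum(depths) / len(depths)

def slowdown_ratio(clean_inputs, adv_inputs, exit_index):
    # A ratio above 1 indicates the perturbed inputs bypass early exits
    # and push computation toward later layers.
    return (average_exit_depth(adv_inputs, exit_index)
            / average_exit_depth(clean_inputs, exit_index))
```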