Defending Substitution-Based Profile Pollution Attacks on Sequential Recommenders
While sequential recommender systems achieve significant improvements in
capturing user dynamics, we argue that they are vulnerable to
substitution-based profile pollution attacks. To demonstrate our
hypothesis, we propose a substitution-based adversarial attack algorithm, which
modifies the input sequence by selecting certain vulnerable elements and
substituting them with adversarial items. In both untargeted and targeted
attack scenarios, we observe significant performance deterioration using the
proposed profile pollution algorithm. Motivated by such observations, we design
an efficient adversarial defense method called Dirichlet neighborhood sampling.
Specifically, we sample item embeddings from a convex hull constructed by
multi-hop neighbors to replace the original items in input sequences. During
sampling, a Dirichlet distribution is used to approximate the probability
distribution in the neighborhood such that the recommender learns to combat
local perturbations. Additionally, we design an adversarial training method
tailored for sequential recommender systems. In particular, we represent
selected items with one-hot encodings and perform gradient ascent on the
encodings to search for the worst-case linear combination of item embeddings in
training. As such, the embedding function learns robust item representations
and the trained recommender is resistant to test-time adversarial examples.
Extensive experiments show the effectiveness of both our attack and defense
methods, which consistently outperform baselines by a significant margin across
model architectures and datasets.
Comment: Accepted to RecSys 202
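A minimal sketch of the Dirichlet neighborhood sampling step described above, assuming precomputed multi-hop neighbor lists and an illustrative concentration parameter; the sizes and hyperparameters here are assumptions, not the paper's settings:

```python
# Minimal sketch: replace each item embedding with a convex combination of its
# multi-hop neighbors, with weights drawn from a Dirichlet distribution.
import torch

num_items, dim, k_neighbors = 1000, 64, 8
item_emb = torch.nn.Embedding(num_items, dim)

# neighbors[i] holds k multi-hop neighbor ids of item i (assumed precomputed,
# e.g. from item-item co-occurrence); random here just to keep the sketch runnable.
neighbors = torch.randint(0, num_items, (num_items, k_neighbors))

def dirichlet_neighborhood_sample(item_ids, alpha=1.0):
    """Sample embeddings from the convex hull of each item's neighbors."""
    neigh_ids = neighbors[item_ids]                         # (batch, seq, k)
    neigh_emb = item_emb(neigh_ids)                         # (batch, seq, k, dim)
    conc = torch.full(neigh_ids.shape, alpha)               # Dirichlet concentration
    weights = torch.distributions.Dirichlet(conc).sample()  # (batch, seq, k), sums to 1
    return (weights.unsqueeze(-1) * neigh_emb).sum(dim=-2)  # (batch, seq, dim)

# Usage: feed the sampled embeddings to the sequential recommender during training
# so it learns to combat local perturbations around each item.
seq = torch.randint(0, num_items, (32, 50))                 # a batch of item sequences
perturbed = dirichlet_neighborhood_sample(seq)              # (32, 50, 64)
```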
Textual Manifold-based Defense Against Natural Language Adversarial Examples
Recent studies on adversarial images have shown that they tend to leave the
underlying low-dimensional data manifold, making it significantly more
challenging for current models to classify them correctly. This so-called
off-manifold conjecture has inspired a novel line of defenses against
adversarial attacks on images. In this study, we find a similar phenomenon
occurs in the contextualized embedding space induced by pretrained language
models, in which adversarial texts tend to have their embeddings diverge from
the manifold of natural ones. Based on this finding, we propose Textual
Manifold-based Defense (TMD), a defense mechanism that projects text embeddings
onto an approximated embedding manifold before classification. It reduces the
complexity of potential adversarial examples, which ultimately enhances the
robustness of the protected model. Through extensive experiments, our method
consistently and significantly outperforms previous defenses under various
attack settings without trading off clean accuracy. To the best of our
knowledge, this is the first NLP defense that leverages the manifold structure
against adversarial attacks. Our code is available at
https://github.com/dangne/tmd
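One simple way to approximate such a projection, sketched below under assumed dimensions, is to train an autoencoder on clean contextual embeddings and pass inputs through its bottleneck before classification; this illustrates the idea only and is not TMD's exact manifold construction:

```python
# Hedged sketch: approximate the manifold of clean embeddings with an autoencoder
# and "project" inputs by reconstruction before the downstream classifier.
import torch
import torch.nn as nn

emb_dim, latent_dim = 768, 64   # e.g. BERT [CLS] embeddings (assumed sizes)

class ManifoldProjector(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(emb_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, emb_dim)

    def forward(self, x):
        # Reconstruction through the bottleneck acts as a projection onto the
        # learned low-dimensional manifold.
        return self.decoder(self.encoder(x))

projector = ManifoldProjector()       # assumed trained beforehand on clean embeddings
classifier = nn.Linear(emb_dim, 2)    # the protected downstream classifier

def defended_predict(text_embedding):
    projected = projector(text_embedding)   # pull the embedding back toward the manifold
    return classifier(projected).argmax(dim=-1)

preds = defended_predict(torch.randn(4, emb_dim))   # dummy batch of embeddings
```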
Grey-box Adversarial Attack And Defence For Sentiment Classification
We introduce a grey-box adversarial attack and defence framework for
sentiment classification. We address the issues of differentiability, label
preservation and input reconstruction for adversarial attack and defence in one
unified framework. Our results show that, once trained, the attacking model
generates high-quality adversarial examples roughly an order of magnitude
faster than state-of-the-art attacking methods.
These examples also preserve the original sentiment according to human
evaluation. Additionally, our framework produces an improved classifier that is
robust in defending against multiple adversarial attacking methods. Code is
available at: https://github.com/ibm-aur-nlp/adv-def-text-dist
ANTONIO: Towards a Systematic Method of Generating NLP Benchmarks for Verification
Verification of machine learning models used in Natural Language Processing
(NLP) is known to be a hard problem. In particular, many known neural network
verification methods that work for computer vision and other numeric datasets
do not work for NLP. Here, we study the technical reasons that underlie this
problem. Based on this analysis, we propose practical methods and heuristics
for preparing NLP datasets and models in a way that renders them amenable to
known verification methods based on abstract interpretation. We implement these
methods as a Python library called ANTONIO that links to the neural network
verifiers ERAN and Marabou. We evaluate the tool on the R-U-A-Robot dataset,
which has been suggested as a benchmark for verifying legally critical NLP
applications. We hope that, thanks to its general applicability, this work will
open novel possibilities for including NLP verification problems into neural
network verification competitions, and will popularise NLP problems within this
community.
Comment: To appear in the proceedings of the 6th Workshop on Formal Methods for ML-Enabled Autonomous Systems (affiliated with CAV 2023)
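As a rough illustration of how NLP inputs can be prepared for abstract-interpretation-based verifiers (this is not the ANTONIO API, and the encoder and names below are assumptions), one can embed a sentence together with its admissible perturbations and form a per-dimension bounding box for the verifier to check:

```python
# Generic sketch: build a hyper-rectangle (per-dimension lower/upper bounds) over
# the embeddings of a sentence and its perturbations; verifiers such as ERAN and
# Marabou consume input regions of this kind.
import numpy as np

def embed(sentence):
    # Stand-in for a real sentence encoder; a fixed random projection of a
    # bag-of-characters keeps the example self-contained.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((256, 32))
    vec = np.zeros(256)
    for ch in sentence.lower():
        vec[ord(ch) % 256] += 1.0
    return vec @ proj

def input_region(sentence, perturbations, eps=0.01):
    """Per-dimension lower/upper bounds over the perturbation set, padded by eps."""
    points = np.stack([embed(s) for s in [sentence, *perturbations]])
    return points.min(axis=0) - eps, points.max(axis=0) + eps

lower, upper = input_region(
    "are you a robot",
    ["r u a robot", "are you a bot"],
)
# `lower`/`upper` define the box over which a verifier would check the network.
```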
Masked Language Model Based Textual Adversarial Example Detection
Adversarial attacks are a serious threat to the reliable deployment of
machine learning models in safety-critical applications. They can mislead
current models into incorrect predictions by slightly modifying the inputs.
Recently, substantial work has shown that adversarial examples tend to deviate
from the underlying data manifold of normal examples, whereas pre-trained
masked language models can fit the manifold of normal NLP data. To explore how
to use the masked language model in adversarial detection, we propose a novel
textual adversarial example detection method, namely Masked Language
Model-based Detection (MLMD), which can produce clearly distinguishable signals
between normal examples and adversarial examples by exploring the changes in
manifolds induced by the masked language model. MLMD offers plug-and-play
usage (i.e., no need to retrain the victim model) for adversarial defense and
is agnostic to the classification task, the victim model's architecture, and
the to-be-defended attack method. We evaluate MLMD on various benchmark textual
datasets, widely studied machine learning models, and state-of-the-art (SOTA)
adversarial attacks. Experimental results show
that MLMD can achieve strong performance, with detection accuracy up to 0.984,
0.967, and 0.901 on AG-NEWS, IMDB, and SST-2 datasets, respectively.
Additionally, MLMD is superior, or at least comparable to, the SOTA detection
defenses in detection accuracy and F1 score. Among many defenses based on the
off-manifold assumption of adversarial examples, this work offers a new angle
for capturing the manifold change. The code for this work is openly accessible
at https://github.com/mlmddetection/MLMDdetection.
Comment: 13 pages, 3 figures
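A hedged sketch of the mask-and-reconstruct intuition: mask each word, let a masked language model refill it, and measure how much the victim classifier's output distribution shifts between the original and refilled texts; adversarial inputs are expected to shift more. The model names, scoring rule, and threshold below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilroberta-base")
victim = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    return_all_scores=True,
)

def label_scores(text):
    out = victim(text)
    items = out[0] if isinstance(out[0], list) else out   # flatten per-input nesting
    return np.array([d["score"] for d in sorted(items, key=lambda d: d["label"])])

def prediction_shift(text):
    """Average change in the victim's score distribution under MLM reconstruction."""
    base = label_scores(text)
    words, shifts = text.split(), []
    for i in range(len(words)):
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        refilled = fill_mask(masked, top_k=1)[0]["sequence"]   # MLM reconstruction
        shifts.append(np.abs(label_scores(refilled) - base).sum())
    return float(np.mean(shifts))

# A larger average shift suggests the input is adversarial (0.5 is an illustrative threshold).
is_adversarial = prediction_shift("the movie was surprisingly good") > 0.5
```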
Detecting Textual Adversarial Examples through Randomized Substitution and Vote
A line of work has shown that natural text processing models are vulnerable
to adversarial examples. Correspondingly, various defense methods are proposed
to mitigate the threat of textual adversarial examples, e.g., adversarial
training, input transformations, detection, etc. In this work, we treat the
optimization process of synonym-substitution-based textual adversarial attacks
as a specific sequence of word replacements, in which each replaced word
mutually influences the others. We observe that we can break this mutual
interaction and eliminate the adversarial perturbation by randomly substituting
a word with one of its synonyms. Based on this observation, we propose a novel textual
adversarial example detection method, termed Randomized Substitution and Vote
(RS&V), which votes the prediction label by accumulating the logits of k
samples generated by randomly substituting the words in the input text with
synonyms. The proposed RS&V is generally applicable to any existing neural
networks without modification on the architecture or extra training, and it is
orthogonal to prior work on making the classification network itself more
robust. Empirical evaluations on three benchmark datasets demonstrate that our
RS&V detects textual adversarial examples more successfully than existing
detection methods while maintaining high classification accuracy on benign
samples.
Comment: Accepted by UAI 2022, code is available at https://github.com/JHL-HUST/RS
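A minimal sketch of the vote-by-accumulated-logits idea: build k randomized copies of the input by swapping words for WordNet synonyms, sum the victim model's logits over the copies, and take the argmax as the voted label. The `victim_logits` function is a hypothetical stand-in for the protected classifier, and the substitution rate and k are illustrative, not the paper's settings:

```python
import random
import numpy as np
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def random_synonym_substitution(words, rate=0.3):
    """Randomly replace a fraction of words with one of their WordNet synonyms."""
    out = list(words)
    for i, w in enumerate(words):
        lemmas = {l.name().replace("_", " ") for s in wn.synsets(w) for l in s.lemmas()}
        lemmas.discard(w)
        if lemmas and random.random() < rate:
            out[i] = random.choice(sorted(lemmas))
    return out

def victim_logits(text):
    # Hypothetical classifier returning logits over 2 classes; replace with the
    # real model being defended.
    return np.array([len(text) % 3 - 1.0, 1.0])

def rs_and_v_predict(text, k=8):
    words = text.split()
    votes = np.zeros(2)
    for _ in range(k):
        substituted = " ".join(random_synonym_substitution(words))
        votes += victim_logits(substituted)    # accumulate logits over the k samples
    return int(votes.argmax())                 # voted prediction label

label = rs_and_v_predict("the plot is clever and the acting is strong")
```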