Interpreting Adversarial Examples with Attributes
The vulnerability of deep computer vision systems to imperceptible, carefully
crafted noise has raised questions about the robustness of their decisions. We
take a step back and approach this problem from an orthogonal
direction. We propose to enable black-box neural networks to justify their
reasoning both for clean and for adversarial examples by leveraging attributes,
i.e. visually discriminative properties of objects. We rank attributes based on
their class relevance, i.e. how the classification decision changes when the
input is visually slightly perturbed, as well as image relevance, i.e. how well
the attributes can be localized on both clean and perturbed images. We present
comprehensive experiments for attribute prediction, adversarial example
generation, adversarially robust learning, and their qualitative and
quantitative analysis using predicted attributes on three benchmark datasets.
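As a rough illustration of the ranking idea (not the paper's exact scoring), the sketch below combines a per-attribute class-relevance score with an image-relevance (grounding) score and ranks attributes by their product; the score arrays, the product combination, and the attribute count are all assumptions made for the example.

```python
import numpy as np

# Hypothetical inputs:
#   class_rel[i]: how much the class decision shifts when the input is slightly
#                 perturbed with respect to attribute i (larger = more relevant)
#   image_rel[i]: how well attribute i can be localized on the (clean or
#                 perturbed) image, e.g. a grounding score in [0, 1]
def rank_attributes(class_rel: np.ndarray, image_rel: np.ndarray, top_k: int = 5):
    """Rank attributes by a combined class- and image-relevance score."""
    combined = class_rel * image_rel          # simple product as the combination
    order = np.argsort(-combined)             # descending order of relevance
    return order[:top_k], combined[order[:top_k]]

# Toy usage with made-up scores for six attributes.
idx, scores = rank_attributes(np.array([0.9, 0.1, 0.4, 0.7, 0.2, 0.5]),
                              np.array([0.8, 0.9, 0.3, 0.6, 0.1, 0.7]))
print(idx, scores)
```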
PatchGuard: A Provably Robust Defense against Adversarial Patches via Small Receptive Fields and Masking
Localized adversarial patches aim to induce misclassification in machine
learning models by arbitrarily modifying pixels within a restricted region of
an image. Such attacks can be realized in the physical world by attaching the
adversarial patch to the object to be misclassified, and defending against such
attacks remains an open problem. In this paper, we propose a general
defense framework called PatchGuard that can achieve high provable robustness
while maintaining high clean accuracy against localized adversarial patches.
The cornerstone of PatchGuard involves the use of CNNs with small receptive
fields to impose a bound on the number of features corrupted by an adversarial
patch. Given a bounded number of corrupted features, the problem of designing
an adversarial patch defense reduces to that of designing a secure feature
aggregation mechanism. Towards this end, we present our robust masking defense
that robustly detects and masks corrupted features to recover the correct
prediction. Notably, we can prove the robustness of our defense against any
adversary within our threat model. Our extensive evaluation on ImageNet,
ImageNette (a 10-class subset of ImageNet), and CIFAR-10 datasets demonstrates
that our defense achieves state-of-the-art performance in terms of both
provable robust accuracy and clean accuracy. Comment: USENIX Security Symposium 2021; extended technical report
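A very rough sketch of the secure-aggregation idea follows, under the stated assumption that each location of a small-receptive-field CNN produces non-negative per-class evidence; the window search and masking rule here are simplified stand-ins for PatchGuard's actual robust-masking procedure, and the window size is illustrative.

```python
import numpy as np

def robust_masking(local_evidence: np.ndarray, window: int = 3) -> int:
    """Simplified sketch of secure feature aggregation by masking.
    local_evidence: (H, W, C) non-negative per-class evidence from a CNN with a
    small receptive field, so a localized patch can only corrupt features
    inside a bounded window."""
    H, W, C = local_evidence.shape
    pred = int(local_evidence.sum(axis=(0, 1)).argmax())   # tentative prediction
    # Find the window contributing the most evidence to the tentative class;
    # under a patch attack, corrupted features concentrate in such a window.
    best, best_ij = -1.0, (0, 0)
    for i in range(H - window + 1):
        for j in range(W - window + 1):
            s = local_evidence[i:i + window, j:j + window, pred].sum()
            if s > best:
                best, best_ij = s, (i, j)
    masked = local_evidence.copy()
    i, j = best_ij
    masked[i:i + window, j:j + window, :] = 0.0             # mask suspicious window
    return int(masked.sum(axis=(0, 1)).argmax())            # re-aggregate
```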
Towards Robustness against Unsuspicious Adversarial Examples
Despite the remarkable success of deep neural networks, significant concerns
have emerged about their robustness to adversarial perturbations to inputs.
While most attacks aim to ensure that these are imperceptible, physical
perturbation attacks typically aim to be unsuspicious, even if perceptible.
However, there is no universal notion of what it means for adversarial examples
to be unsuspicious. We propose an approach for modeling suspiciousness by
leveraging cognitive salience. Specifically, we split an image into foreground
(salient region) and background (the rest), and allow significantly larger
adversarial perturbations in the background, while ensuring that cognitive
salience of background remains low. We describe how to compute the resulting
non-salience-preserving dual-perturbation attacks on classifiers. We then
experimentally demonstrate that our attacks indeed do not significantly change
perceptual salience of the background, but are highly effective against
classifiers robust to conventional attacks. Furthermore, we show that
adversarial training with dual-perturbation attacks yields classifiers that are
more robust to these than state-of-the-art robust learning approaches, and
comparable in terms of robustness to conventional attacks. Comment: v2.
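To make the dual-perturbation idea concrete, here is a minimal PGD-style sketch that enforces a small L_inf budget on the salient foreground and a larger one on the background; the salience term that keeps the background unsuspicious is omitted, and all parameter values and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dual_perturbation_attack(model, x, y, fg_mask, eps_fg=2/255, eps_bg=16/255,
                             alpha=2/255, steps=10):
    """Sketch of a dual-perturbation attack with region-dependent budgets.
    x: (B, C, H, W) images in [0, 1]; fg_mask: float mask broadcastable to x,
    1 on the salient foreground and 0 on the background."""
    eps = eps_fg * fg_mask + eps_bg * (1 - fg_mask)       # per-pixel L_inf budget
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta = delta + alpha * grad.sign()           # ascend the loss
            delta = torch.max(torch.min(delta, eps), -eps)  # region-dependent clip
            delta = torch.clamp(x + delta, 0, 1) - x      # keep pixels valid
        delta.requires_grad_(True)
    return (x + delta).detach()
```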
Attribution-driven Causal Analysis for Detection of Adversarial Examples
Attribution methods have been developed to explain the decision of a machine
learning model on a given input. We use the Integrated Gradient method for
finding attributions to define the causal neighborhood of an input by
incrementally masking high attribution features. We study the robustness of
machine learning models on benign and adversarial inputs in this neighborhood.
Our study indicates that benign inputs are robust to the masking of high
attribution features but adversarial inputs generated by the state-of-the-art
adversarial attack methods such as DeepFool, FGSM, CW and PGD, are not robust
to such masking. Further, our study demonstrates that this concentration of
high-attribution features responsible for the incorrect decision is more
pronounced in physically realizable adversarial examples. This difference in
attribution of benign and adversarial inputs can be used to detect adversarial
examples. Such a defense approach is independent of training data and attack
method, and we demonstrate its effectiveness on digital and physically
realizable perturbations. Comment: 11 pages, 6 figures
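A minimal sketch of the masking test described above, assuming attribution maps (e.g. from Integrated Gradients) are already computed; the masking fractions and the zero-baseline masking are illustrative choices rather than the paper's exact protocol.

```python
import torch

def masking_stability(model, x, attributions, fractions=(0.05, 0.10, 0.20)):
    """Benign inputs tend to keep their label when the highest-attribution
    features are masked; adversarial inputs tend not to.
    x, attributions: tensors of the same shape, batch dimension first."""
    base_pred = model(x).argmax(dim=1)
    flat = attributions.abs().flatten(1)
    stable = torch.ones_like(base_pred, dtype=torch.bool)
    for frac in fractions:                                 # incremental masking
        k = max(1, int(frac * flat.shape[1]))
        topk = flat.topk(k, dim=1).indices                 # most attributed features
        mask = torch.ones_like(flat)
        mask.scatter_(1, topk, 0.0)                        # zero them out
        x_masked = (x.flatten(1) * mask).view_as(x)
        stable &= model(x_masked).argmax(dim=1).eq(base_pred)
    return stable        # False suggests the input may be adversarial
```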
Local Gradients Smoothing: Defense against localized adversarial attacks
Deep neural networks (DNNs) have shown vulnerability to adversarial attacks,
i.e., carefully perturbed inputs designed to mislead the network at inference
time. Recently introduced localized attacks, Localized and Visible Adversarial
Noise (LaVAN) and Adversarial patch, pose a new challenge to deep learning
security by adding adversarial noise only within a specific region without
affecting the salient objects in an image. Driven by the observation that such
attacks introduce concentrated high-frequency changes at a particular image
location, we have developed an effective method to estimate the noise location in
the gradient domain and transform those high-activation regions caused by
adversarial noise in the image domain, while having minimal effect on the salient
object that is important for correct classification. Our proposed Local
Gradients Smoothing (LGS) scheme achieves this by regularizing gradients in the
estimated noisy region before feeding the image to the DNN for inference. We have
shown the effectiveness of our method in comparison to other defense methods
including Digital Watermarking, JPEG compression, Total Variance Minimization
(TVM) and Feature squeezing on ImageNet dataset. In addition, we systematically
study the robustness of the proposed defense mechanism against Backward Pass
Differentiable Approximation (BPDA), a state-of-the-art attack recently
developed to break defenses that transform an input sample to minimize the
adversarial effect. Compared to other defense mechanisms, LGS is by far the
most resistant to BPDA in the localized adversarial attack setting. Comment: Accepted at WACV-2019
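The sketch below conveys the LGS idea, estimating where gradient energy concentrates and suppressing the image there, under simplifying assumptions (grayscale input, block-wise thresholding); the block size, threshold, and smoothing factor are illustrative, not the paper's tuned values.

```python
import numpy as np

def local_gradients_smoothing(img: np.ndarray, block: int = 15,
                              thresh: float = 0.1, lam: float = 2.0) -> np.ndarray:
    """Rough sketch of LGS: locate concentrated high-frequency (high-gradient)
    regions and attenuate the image there before classification.
    img: grayscale image with values in [0, 1]."""
    gy, gx = np.gradient(img)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    mag = mag / (mag.max() + 1e-12)                 # normalized gradient magnitude
    suppress = np.zeros_like(img)
    for i in range(0, img.shape[0], block):
        for j in range(0, img.shape[1], block):
            patch = mag[i:i + block, j:j + block]
            if patch.mean() > thresh:               # likely localized noise
                suppress[i:i + block, j:j + block] = patch
    return np.clip(img * (1.0 - lam * suppress), 0.0, 1.0)
```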
PeerNets: Exploiting Peer Wisdom Against Adversarial Attacks
Deep learning systems have become ubiquitous in many aspects of our lives.
Unfortunately, it has been shown that such systems are vulnerable to
adversarial attacks, making them prone to potential unlawful uses. Designing
deep neural networks that are robust to adversarial attacks is a fundamental
step in making such systems safer and deployable in a broader variety of
applications (e.g. autonomous driving), but more importantly is a necessary
step to design novel and more advanced architectures built on new computational
paradigms rather than marginally building on the existing ones. In this paper
we introduce PeerNets, a novel family of convolutional networks alternating
classical Euclidean convolutions with graph convolutions to harness information
from a graph of peer samples. This results in a form of non-local forward
propagation in the model, where latent features are conditioned on the global
structure induced by the graph, making the model up to 3 times more robust to a
variety of white- and black-box adversarial attacks compared to conventional
architectures, with almost no drop in accuracy.
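As a simplified illustration of the peer-regularization idea (not the exact PeerNets layer, which uses graph attention over per-location neighborhoods), the sketch below replaces each feature vector by an attention-weighted average of its k nearest peer features; the shapes and the softmax-over-negative-distance weighting are assumptions for the example.

```python
import torch

def peer_regularization(feat: torch.Tensor, peer_feat: torch.Tensor, k: int = 5):
    """Non-local smoothing over a bank of peer features.
    feat:      (N, D) features of the current image (e.g. flattened pixels)
    peer_feat: (M, D) features gathered from peer images"""
    dist = torch.cdist(feat, peer_feat)                   # (N, M) pairwise distances
    knn_dist, knn_idx = dist.topk(k, dim=1, largest=False)
    weights = torch.softmax(-knn_dist, dim=1)             # closer peers weigh more
    neighbors = peer_feat[knn_idx]                        # (N, k, D)
    return (weights.unsqueeze(-1) * neighbors).sum(dim=1) # (N, D) smoothed features
```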
Robust Adversarial Learning via Sparsifying Front Ends
It is by now well-known that small adversarial perturbations can induce
classification errors in deep neural networks. In this paper, we take a
bottom-up signal processing perspective to this problem and show that a
systematic exploitation of sparsity in natural data is a promising tool for
defense. For linear classifiers, we show that a sparsifying front end is
provably effective against ℓ∞-bounded attacks, reducing output
distortion due to the attack by a factor of roughly K/N, where N is the data
dimension and K is the sparsity level. We then extend this concept to deep
networks, showing that a "locally linear" model can be used to develop a
theoretical foundation for crafting attacks and defenses. We also devise
attacks based on the locally linear model that outperform the well-known FGSM
attack. We supplement our theoretical results with experiments on the MNIST
handwritten digit database, showing the efficacy of the proposed sparsity-based
defense schemes. Comment: 16 pages, 12 figures. Submitted to IEEE Transactions on Signal Processing
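A small sketch of a sparsifying front end, under the assumption of an orthonormal basis in which the data are (approximately) K-sparse: project, keep the K largest coefficients, and reconstruct. The random basis and noise level in the usage example are purely illustrative.

```python
import numpy as np

def sparsifying_front_end(x: np.ndarray, basis: np.ndarray, k: int) -> np.ndarray:
    """Project x onto an orthonormal basis, keep the k largest-magnitude
    coefficients, and reconstruct. With k-sparse data this attenuates
    ℓ∞-bounded perturbations roughly in proportion to the retained fraction k/N.
    x: (N,) signal; basis: (N, N) orthonormal matrix whose columns are atoms."""
    coeffs = basis.T @ x
    keep = np.argsort(-np.abs(coeffs))[:k]
    sparse = np.zeros_like(coeffs)
    sparse[keep] = coeffs[keep]
    return basis @ sparse

# Toy usage: a signal that is 5-sparse in a random orthonormal basis.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
x_clean = Q[:, :5] @ rng.standard_normal(5)
x_adv = x_clean + 0.1 * np.sign(rng.standard_normal(64))   # ℓ∞-bounded noise
print(np.linalg.norm(sparsifying_front_end(x_adv, Q, 5) - x_clean))
```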
Interpreting Adversarial Examples by Activation Promotion and Suppression
It is widely known that convolutional neural networks (CNNs) are vulnerable
to adversarial examples: images with imperceptible perturbations crafted to
fool classifiers. However, interpretability of these perturbations is less
explored in the literature. This work aims to better understand the roles of
adversarial perturbations and provide visual explanations from pixel, image and
network perspectives. We show that adversaries have a promotion-suppression
effect (PSE) on neurons' activations and can be primarily categorized into
three types: i) suppression-dominated perturbations that mainly reduce the
classification score of the true label, ii) promotion-dominated perturbations
that focus on boosting the confidence of the target label, and iii) balanced
perturbations that play a dual role in suppression and promotion. We also
provide image-level interpretability of adversarial examples. This links PSE of
pixel-level perturbations to class-specific discriminative image regions
localized by class activation mapping (Zhou et al. 2016). Further, we examine
the adversarial effect through network dissection (Bau et al. 2017), which
offers concept-level interpretability of hidden units. We show that there
exists a tight connection between the units' sensitivity to adversarial attacks
and their interpretability on semantic concepts. Lastly, we provide some new
insights from our interpretation to improve the adversarial robustness of
networks.
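A minimal sketch of the promotion-suppression bookkeeping: compare how much the perturbation lowers the true-class logit versus raises the target-class logit. The factor-of-two threshold used to categorize examples is an illustrative choice, not the paper's criterion.

```python
import torch

def promotion_suppression(model, x, x_adv, true_label: int, target_label: int):
    """Quantify the promotion-suppression effect of a perturbation.
    x, x_adv: single-example batches of shape (1, C, H, W)."""
    with torch.no_grad():
        z_clean, z_adv = model(x)[0], model(x_adv)[0]
    suppression = (z_clean[true_label] - z_adv[true_label]).item()   # true class pushed down
    promotion = (z_adv[target_label] - z_clean[target_label]).item() # target class pushed up
    if suppression > 2 * promotion:
        kind = "suppression-dominated"
    elif promotion > 2 * suppression:
        kind = "promotion-dominated"
    else:
        kind = "balanced"
    return suppression, promotion, kind
```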
Defense Methods Against Adversarial Examples for Recurrent Neural Networks
Adversarial examples are known to mislead deep learning models to incorrectly
classify them, even in domains where such models achieve state-of-the-art
performance. Until recently, research on both attack and defense methods
focused on image recognition, primarily using convolutional neural networks
(CNNs). In recent years, adversarial example generation methods for recurrent
neural networks (RNNs) have been published, demonstrating that RNN classifiers
are also vulnerable to such attacks. In this paper, we present a novel defense
method, termed sequence squeezing, to make RNN classifiers more robust against
such attacks. Our method differs from previous defense methods which were
designed only for non-sequence based models. We also implement four additional
RNN defense methods inspired by recently published CNN defense methods. We
evaluate our methods against state-of-the-art attacks in the cyber security
domain where real adversaries (malware developers) exist, but our methods can
be applied against other discrete sequence based adversarial attacks, e.g., in
the NLP domain. Using our methods we were able to decrease the effectiveness of
such attacks from 99.9% to 15%. Comment: Submitted as a conference paper to Euro S&P 202
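By analogy with feature squeezing, a sequence-squeezing style detector can compare predictions on the raw and squeezed sequences; the sketch below uses a simple duplicate-collapsing squeezer and a disagreement threshold, both of which are illustrative assumptions rather than the paper's exact transformations.

```python
def squeeze_sequence(tokens):
    """One simple squeezer (illustrative): collapse consecutive duplicate tokens."""
    out = []
    for t in tokens:
        if not out or out[-1] != t:
            out.append(t)
    return out

def detect_by_squeezing(classify, tokens, threshold=0.3):
    """Flag an input as adversarial if the prediction changes substantially
    between the raw and squeezed sequence.
    classify: maps a token sequence to the positive-class probability."""
    p_raw = classify(tokens)
    p_squeezed = classify(squeeze_sequence(tokens))
    return abs(p_raw - p_squeezed) > threshold
```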
Structured Adversarial Attack: Towards General Implementation and Better Interpretability
When generating adversarial examples to attack deep neural networks (DNNs),
Lp norm of the added perturbation is usually used to measure the similarity
between original image and adversarial example. However, such adversarial
attacks perturbing the raw input spaces may fail to capture structural
information hidden in the input. This work develops a more general attack
model, i.e., the structured attack (StrAttack), which explores group sparsity
in adversarial perturbations by sliding a mask through images, aiming to
extract key spatial structures. An ADMM (alternating direction method of
multipliers)-based framework is proposed that can split the original problem
into a sequence of analytically solvable subproblems and can be generalized to
implement other attacking methods. Strong group sparsity is achieved in
adversarial perturbations even with the same level of Lp norm distortion as the
state-of-the-art attacks. We demonstrate the effectiveness of StrAttack by
extensive experimental results on MNIST, CIFAR-10, and ImageNet. We also show
that StrAttack provides better interpretability (i.e., better correspondence
with discriminative image regions) through adversarial saliency maps (Papernot et
al., 2016b) and class activation maps (Zhou et al., 2016). Comment: Published as a conference paper at ICLR 2019
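To illustrate the group-sparsity ingredient, the sketch below applies block-wise soft-thresholding, the proximal operator of a group-lasso penalty, to a perturbation; in StrAttack this kind of step sits inside an ADMM loop together with the attack-loss updates, which are omitted here, and the block size and threshold are illustrative.

```python
import numpy as np

def group_soft_threshold(delta: np.ndarray, block: int = 4, tau: float = 0.05):
    """Block-wise soft-thresholding: zero out perturbation groups with small
    energy and shrink the rest, yielding group-sparse perturbations.
    delta: (H, W) perturbation; tau: shrinkage threshold."""
    out = np.zeros_like(delta)
    for i in range(0, delta.shape[0], block):
        for j in range(0, delta.shape[1], block):
            g = delta[i:i + block, j:j + block]
            norm = np.linalg.norm(g)
            if norm > tau:                                # keep only strong groups
                out[i:i + block, j:j + block] = (1.0 - tau / norm) * g
    return out
```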