33 research outputs found
CNN or ViT? Revisiting Vision Transformers Through the Lens of Convolution
The success of the Vision Transformer (ViT) has been widely reported across a
broad range of image recognition tasks. The merit of ViT over CNN has been largely
attributed to large training datasets or auxiliary pre-training. Without
pre-training, the performance of ViT on small datasets is limited because the
global self-attention has limited capacity in local modeling. Towards boosting
ViT on small datasets without pre-training, this work improves its local
modeling by applying a weight mask on the original self-attention matrix. A
straightforward way to locally adapt the self-attention matrix can be realized
by an element-wise learnable weight mask (ELM), for which our preliminary
experiments show promising results. However, this simple element-wise learnable
weight mask not only induces a non-trivial additional parameter overhead but
also increases the optimization complexity. To this end, this work proposes a
novel Gaussian mixture mask (GMM) in which one mask only has two learnable
parameters and it can be conveniently used in any ViT variants whose attention
mechanism allows the use of masks. Experimental results on multiple small
datasets demonstrate the effectiveness of our proposed Gaussian mask for
boosting ViTs for free (almost zero additional parameter or computation cost).
Our code will be publicly available at
https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention
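The idea of weighting a self-attention matrix with a Gaussian mask can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the use of squared patch-grid distance as the Gaussian argument, the (amplitude, sigma) parameterization of each mixture component, and the choice to multiply the mask into the post-softmax attention without renormalizing. In a real model the amplitudes and sigmas would be learnable parameters; here they are plain numbers.

```python
import numpy as np

def patch_distances_sq(grid: int) -> np.ndarray:
    """Squared Euclidean distance between every pair of patches on a grid x grid layout."""
    ys, xs = np.divmod(np.arange(grid * grid), grid)      # row and column of each patch
    coords = np.stack([ys, xs], axis=1).astype(float)     # (N, 2)
    diff = coords[:, None, :] - coords[None, :, :]        # (N, N, 2)
    return (diff ** 2).sum(-1)                            # (N, N)

def gaussian_mixture_mask(grid: int, amplitudes, sigmas) -> np.ndarray:
    """Sum of Gaussians over patch distance; each component has two parameters (hypothetical form)."""
    d2 = patch_distances_sq(grid)
    mask = np.zeros_like(d2)
    for a, s in zip(amplitudes, sigmas):
        mask += a * np.exp(-d2 / (2.0 * s ** 2))
    return mask

def masked_attention(q, k, v, mask):
    """Single-head attention whose softmax output is weighted element-wise by the mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    attn = attn * mask                                     # assumption: no renormalization afterwards
    return attn @ v

# Usage: a 4x4 patch grid with a two-component mixture.
rng = np.random.default_rng(0)
mask = gaussian_mixture_mask(4, amplitudes=[1.0, 0.5], sigmas=[1.0, 3.0])
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
out = masked_attention(q, k, v, mask)
```

Because the mask decays with distance on the patch grid, nearby patches keep more of their attention weight than distant ones, which is one way to bias global attention toward local structure.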
Robustness of Segment Anything Model (SAM) for Autonomous Driving in Adverse Weather Conditions
Segment Anything Model (SAM) has gained considerable interest in recent times
for its remarkable performance and has emerged as a foundational model in
computer vision. It has been integrated in diverse downstream tasks, showcasing
its strong zero-shot transfer capabilities. Given its impressive performance,
there is a strong desire to apply SAM in autonomous driving to improve the
performance of vision tasks, particularly in challenging scenarios such as
driving under adverse weather conditions. However, its robustness under adverse
weather conditions remains uncertain. In this work, we investigate the
application of SAM in autonomous driving and specifically explore its
robustness under adverse weather conditions. Overall, this work aims to enhance
understanding of SAM's robustness in challenging scenarios before integrating
it into autonomous driving vision tasks, providing valuable insights for future
applications.
Segment Anything Meets Universal Adversarial Perturbation
As Segment Anything Model (SAM) becomes a popular foundation model in
computer vision, its adversarial robustness has become a concern that cannot be
ignored. This work investigates whether it is possible to attack SAM with
image-agnostic Universal Adversarial Perturbation (UAP). In other words, we
seek a single perturbation that can fool the SAM to predict invalid masks for
most (if not all) images. We demonstrate that the conventional image-centric attack
framework is effective for image-independent attacks but fails for universal
adversarial attack. To this end, we propose a novel perturbation-centric
framework that results in a UAP generation method based on self-supervised
contrastive learning (CL), where the UAP is set to the anchor sample and the
positive sample is augmented from the UAP. The representations of negative
samples are obtained from the image encoder in advance and saved in a memory
bank. The effectiveness of our proposed CL-based UAP generation method is
validated by both quantitative and qualitative results. On top of the ablation
study to understand various components in our proposed method, we shed light on
the roles of positive and negative samples in making the generated UAP
effective for attacking SAM.
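The contrastive setup described above (UAP as anchor, an augmented UAP as positive, precomputed image features as negatives) can be sketched with a standard InfoNCE loss. This is a minimal sketch under stated assumptions: the abstract does not name the exact loss, so InfoNCE, the cosine similarity, the temperature value, and the function names are all illustrative choices. The feature vectors would come from the image encoder; here they are plain arrays.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor, one positive, and M negatives.

    anchor, positive: (D,) feature vectors (e.g. encoder features of the UAP
    and of an augmented copy of it). negatives: (M, D) features read from a
    memory bank. All names and shapes are assumptions for illustration.
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    n = l2_normalize(negatives)
    pos = a @ p / temperature                 # scalar similarity to the positive
    neg = n @ a / temperature                 # (M,) similarities to the negatives
    logits = np.concatenate([[pos], neg])
    logits = logits - logits.max()            # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

# Usage: the loss is small when the anchor matches the positive and
# differs from the negatives, and large in the opposite case.
anchor = np.array([1.0, 0.0])
bank = np.array([[-1.0, 0.0], [0.0, 1.0]])
low = info_nce(anchor, anchor, bank)
high = info_nce(anchor, -anchor, np.array([[1.0, 0.0]]))
```

Minimizing such a loss with respect to the perturbation pulls the UAP's representation toward its own augmentations and away from the stored image representations.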
Text-to-image Diffusion Model in Generative AI: A Survey
This survey reviews text-to-image diffusion models, in a context where
diffusion models have become popular for a wide range of generative
tasks. As a self-contained work, this survey starts with a brief introduction
of how a basic diffusion model works for image synthesis, followed by how
condition or guidance improves learning. Based on that, we present a review of
state-of-the-art methods on text-conditioned image synthesis, i.e.,
text-to-image. We further summarize applications beyond text-to-image
generation: text-guided creative generation and text-guided image editing.
Beyond the progress made so far, we discuss existing challenges and promising
future directions.
Comment: First survey on the recent progress of text-to-image generation based
on the diffusion model (under progress)
Universal Adversarial Perturbations Through the Lens of Deep Steganography: Towards A Fourier Perspective
The booming interest in adversarial attacks stems from a misalignment between
human vision and a deep neural network (DNN), i.e. a human imperceptible
perturbation fools the DNN. Moreover, a single perturbation, often called
universal adversarial perturbation (UAP), can be generated to fool the DNN for
most images. A similar misalignment phenomenon has recently also been observed
in the deep steganography task, where a decoder network can retrieve a secret
image back from a slightly perturbed cover image. We attempt to explain the
success of both in a unified manner from the Fourier perspective. We perform
task-specific and joint analysis and reveal that (a) frequency is a key factor
that influences their performance based on the proposed entropy metric for
quantifying the frequency distribution; (b) their success can be attributed to
a DNN being highly sensitive to high-frequency content. We also perform feature
layer analysis for providing deep insight on model generalization and
robustness. Additionally, we propose two new variants of universal
perturbations: (1) Universal Secret Adversarial Perturbation (USAP) that
simultaneously achieves attack and hiding; (2) high-pass UAP (HP-UAP) that is
less visible to the human eye.
Comment: Accepted to AAAI 202
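To make the notion of a high-pass perturbation concrete, the sketch below removes low-frequency content from a 2D perturbation with a Fourier-domain mask. It is only an illustration of high-pass filtering in general: the abstract does not state how HP-UAP is constructed, and the hard radial cutoff, the cutoff value, and the function name are assumptions.

```python
import numpy as np

def high_pass(perturbation: np.ndarray, cutoff: float) -> np.ndarray:
    """Zero out all spatial frequencies whose radius is below `cutoff`.

    `cutoff` is in cycles per pixel (0 to about 0.707). A hard radial mask
    is an assumption; smoother filters are equally possible.
    """
    h, w = perturbation.shape
    fy = np.fft.fftfreq(h)[:, None]           # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]           # horizontal frequencies
    radius = np.sqrt(fy ** 2 + fx ** 2)
    keep = radius >= cutoff                   # boolean mask of retained frequencies
    spectrum = np.fft.fft2(perturbation)
    return np.real(np.fft.ifft2(spectrum * keep))

# Usage: a constant image is pure low frequency and is removed entirely,
# while a checkerboard (the highest representable frequency) passes unchanged.
flat = np.ones((8, 8))
ii, jj = np.indices((8, 8))
checker = (-1.0) ** (ii + jj)
flat_filtered = high_pass(flat, cutoff=0.25)
checker_filtered = high_pass(checker, cutoff=0.25)
```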
CD-UAP: Class Discriminative Universal Adversarial Perturbation
A single universal adversarial perturbation (UAP) can be added to all natural
images to change most of their predicted class labels. It is of high practical
relevance for an attacker to have flexible control over the targeted classes to
be attacked; however, the existing UAP method attacks samples from all classes.
In this work, we propose a new universal attack method to generate a single
perturbation that fools a target network to misclassify only a chosen group of
classes, while having limited influence on the remaining classes. Since the
proposed attack generates a universal adversarial perturbation that is
discriminative to targeted and non-targeted classes, we term it class
discriminative universal adversarial perturbation (CD-UAP). We propose one
simple yet effective algorithm framework, under which we design and compare
various loss function configurations tailored for the class discriminative
universal attack. The proposed approach has been evaluated with extensive
experiments on various benchmark datasets. Additionally, our proposed approach
achieves state-of-the-art performance for the original task of UAP attacking
all classes, which demonstrates the effectiveness of our approach.
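A class-discriminative objective of the kind described above can be sketched as a two-part loss: one term that encourages misclassification on samples from the chosen classes and one that preserves correct predictions on the rest. This is a hypothetical configuration for illustration only. The abstract says several loss configurations were compared but does not specify them, so the particular terms, the weights `alpha` and `beta`, and the function names are assumptions.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise log-softmax with the usual max-shift for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def cd_uap_loss(logits_adv, labels, targeted_classes, alpha=1.0, beta=1.0):
    """Loss to minimize over the perturbation (hypothetical configuration).

    logits_adv: (B, C) model outputs on perturbed images.
    labels: (B,) ground-truth class indices.
    targeted_classes: classes the perturbation should fool.
    """
    logp_true = log_softmax(logits_adv)[np.arange(len(labels)), labels]
    is_targeted = np.isin(labels, list(targeted_classes))
    # Targeted samples: lower the log-probability of the true class.
    loss_t = logp_true[is_targeted].mean() if is_targeted.any() else 0.0
    # Non-targeted samples: ordinary cross-entropy keeps predictions correct.
    loss_nt = -logp_true[~is_targeted].mean() if (~is_targeted).any() else 0.0
    return float(alpha * loss_t + beta * loss_nt)

def project_linf(delta: np.ndarray, eps: float) -> np.ndarray:
    """Keep the universal perturbation inside an L-infinity ball of radius eps."""
    return np.clip(delta, -eps, eps)

# Usage: class 0 is targeted, class 1 is not.
labels = np.array([0, 1])
desired = np.array([[0.0, 5.0], [0.0, 5.0]])    # class-0 sample fooled, class-1 sample correct
undesired = np.array([[5.0, 0.0], [5.0, 0.0]])  # class-0 sample correct, class-1 sample fooled
loss_desired = cd_uap_loss(desired, labels, {0})
loss_undesired = cd_uap_loss(undesired, labels, {0})
```

The targeted term as written is unbounded below, so a practical variant would typically clamp it or use a bounded surrogate; the sketch keeps it simple to show the two-sided structure.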