A Dive into SAM Prior in Image Restoration
The goal of image restoration (IR), a fundamental issue in computer vision,
is to restore a high-quality (HQ) image from its degraded low-quality (LQ)
observation. Because the problem is ill-posed, multiple HQ solutions may
correspond to a single LQ input, creating an ambiguous solution space. This motivates the
investigation and incorporation of prior knowledge in order to effectively
constrain the solution space and enhance the quality of the restored images. In
spite of the pervasive use of hand-crafted and learned priors in IR, limited
attention has been paid to the incorporation of knowledge from large-scale
foundation models. In this paper, we for the first time leverage the prior
knowledge of the state-of-the-art segment anything model (SAM) to boost the
performance of existing IR networks in a parameter-efficient tuning manner. In
particular, SAM is chosen for its robustness to image degradations, which
allows HQ semantic masks to be extracted even from degraded inputs. In order to leverage
semantic priors and enhance restoration quality, we propose a lightweight SAM
prior tuning (SPT) unit. This plug-and-play component allows us to effectively
integrate semantic priors into existing IR networks, resulting in significant
improvements in restoration quality. As the only trainable module in our
method, the SPT unit has the potential to improve both efficiency and
scalability. We demonstrate the effectiveness of the proposed method in
enhancing a variety of methods across multiple tasks, such as image
super-resolution and color image denoising.
Comment: Technical Report
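To make the plug-and-play idea concrete, here is a minimal PyTorch sketch of an SPT-style unit that modulates frozen IR features with a projection of SAM-derived masks. The class name SPTUnit, the channel sizes, and the scale-and-shift fusion are illustrative assumptions rather than the paper's exact design; in this setup only the SPT unit would be trained while the host IR network stays frozen.

```python
import torch
import torch.nn as nn

class SPTUnit(nn.Module):
    """Hypothetical plug-and-play block injecting SAM semantic priors into IR features."""
    def __init__(self, feat_channels: int, prior_channels: int, hidden: int = 16):
        super().__init__()
        # Project stacked SAM masks into a small latent space.
        self.proj = nn.Conv2d(prior_channels, hidden, kernel_size=1)
        # Predict per-pixel scale and shift used to modulate the IR features.
        self.to_scale = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)
        self.to_shift = nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        p = torch.relu(self.proj(prior))
        return feat * (1 + self.to_scale(p)) + self.to_shift(p)

# Usage: freeze the pre-trained IR backbone and train only the SPT units.
feat = torch.randn(1, 64, 32, 32)       # intermediate features of a frozen IR network
sam_masks = torch.randn(1, 8, 32, 32)   # SAM masks resized to the feature resolution
spt = SPTUnit(feat_channels=64, prior_channels=8)
out = spt(feat, sam_masks)              # same shape as feat
```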
Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models
Fine-tuning pre-trained vision-language models (VLMs), e.g., CLIP, for
open-world generalization has gained increasing popularity due to its practical
value. However, performance advancements are limited when relying solely on
intricate algorithmic designs for a single model, even one exhibiting strong
performance, e.g., CLIP-ViT-B/16. This paper, for the first time, explores the
collaborative potential of leveraging much weaker VLMs to enhance the
generalization of a robust single model. The affirmative findings motivate us
to address the generalization problem from a novel perspective, i.e., ensemble
of pre-trained VLMs. We introduce three customized ensemble strategies, each
tailored to one specific scenario. Firstly, we introduce the zero-shot
ensemble, automatically adjusting the logits of different models based on their
confidence when only pre-trained VLMs are available. Furthermore, for scenarios
with extra few-shot samples, we propose the training-free and tuning ensemble,
offering flexibility based on the availability of computing resources. The
proposed ensemble strategies are evaluated on zero-shot, base-to-new, and
cross-dataset generalization, achieving new state-of-the-art performance.
Notably, this work represents an initial stride toward enhancing the
generalization performance of VLMs via ensembling. The code is available at
https://github.com/zhiheLu/Ensemble_VLM.git.
Comment: Technical report
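A minimal sketch of the confidence-based logit fusion might look like the following, assuming each model is weighted by its mean maximum-softmax confidence over the batch; the function name zero_shot_ensemble and the exact weighting scheme are assumptions for illustration, not the paper's formulation.

```python
import torch

def zero_shot_ensemble(logits_list):
    """Fuse per-model logits [B, C] via per-model confidence weights (illustrative)."""
    weights = []
    for logits in logits_list:
        probs = logits.softmax(dim=-1)
        weights.append(probs.max(dim=-1).values.mean())  # scalar confidence per model
    weights = torch.stack(weights)
    weights = weights / weights.sum()                    # normalize across models
    stacked = torch.stack(logits_list)                   # [M, B, C]
    return (weights.view(-1, 1, 1) * stacked).sum(dim=0)

# Example with a stronger and a weaker model (random logits for illustration).
strong = torch.randn(4, 10) * 2.0
weak = torch.randn(4, 10)
fused = zero_shot_ensemble([strong, weak])               # [4, 10]
```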
Backdoor Attack on Hash-based Image Retrieval via Clean-label Data Poisoning
A backdoored deep hashing model is expected to behave normally on original
query images and return the images with the target label when a specific
trigger pattern is present. To this end, we propose the confusing
perturbations-induced backdoor attack (CIBA). It injects a small number of
poisoned images with the correct label into the training data, which makes the
attack hard to detect. To craft the poisoned images, we first propose the
confusing perturbations to disturb the hashing code learning. As such, the
hashing model can learn more about the trigger. The confusing perturbations are
imperceptible and generated by optimizing the intra-class dispersion and
inter-class shift in the Hamming space. We then employ the targeted adversarial
patch as the backdoor trigger to improve the attack performance. We have
conducted extensive experiments to verify the effectiveness of our proposed
CIBA. Our code is available at https://github.com/KuofengGao/CIBA.
Comment: Accepted by BMVC 202
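As a rough sketch of what optimizing intra-class dispersion and inter-class shift on relaxed hash codes could look like, the snippet below defines one plausible objective; the function name confusing_loss, the centroid-based distances, and the sign convention are assumptions for illustration and may differ from the paper's actual formulation.

```python
import torch

def confusing_loss(codes_target: torch.Tensor, codes_other: torch.Tensor) -> torch.Tensor:
    """codes_*: [N, K] tanh-relaxed hash codes in [-1, 1] (illustrative objective)."""
    # Intra-class dispersion: spread target-class codes away from their own centroid.
    center_t = codes_target.mean(dim=0, keepdim=True)
    dispersion = ((codes_target - center_t) ** 2).sum(dim=1).mean()
    # Inter-class shift: move the target-class centroid toward other classes' centroid.
    center_o = codes_other.mean(dim=0, keepdim=True)
    shift = ((center_t - center_o) ** 2).sum()
    # Minimizing this loss increases dispersion and reduces the shift, disturbing
    # the hash-code learning for the poisoned class.
    return -dispersion + shift
```

In the attack itself, the imperceptible perturbation added to the poisoned images would then be optimized (e.g., by gradient descent under a small norm budget) against such an objective, before the targeted adversarial patch is applied as the trigger.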
GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph
Adapter-style efficient transfer learning (ETL) has shown excellent
performance in the tuning of vision-language models (VLMs) under the low-data
regime, where only a few additional parameters are introduced to excavate the
task-specific knowledge based on the general and powerful representation of
VLMs. However, most adapter-style works face two limitations: (i) modeling
task-specific knowledge with a single modality only; and (ii) overlooking the
exploitation of the inter-class relationships in downstream tasks, thereby
leading to sub-optimal solutions. To mitigate these limitations, we propose an
effective adapter-style tuning strategy, dubbed GraphAdapter, which builds a
textual adapter by explicitly modeling the dual-modality structure knowledge (i.e., the
correlation of different semantics/classes in textual and visual modalities)
with a dual knowledge graph. In particular, the dual knowledge graph is
established with two sub-graphs, i.e., a textual knowledge sub-graph, and a
visual knowledge sub-graph, where the nodes and edges represent the
semantics/classes and their correlations in two modalities, respectively. This
enables the textual feature of each prompt to leverage the task-specific
structure knowledge from both textual and visual modalities, yielding a more
effective classifier for downstream tasks. Extensive experimental results on 11
benchmark datasets reveal that our GraphAdapter significantly outperforms
previous adapter-based methods. The code will be released at
https://github.com/lixinustc/GraphAdapter.
Comment: Accepted by NeurIPS 2023. The manuscript will be further revised based on the reviews.
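The following sketch shows one way the dual-sub-graph idea could be instantiated: class-to-class similarity matrices built from textual and visual class features act as the two sub-graphs, and the text features are refined by propagating over both. The function names, the cosine-similarity edges, and the single propagation step with residual blending are illustrative assumptions, not the published GraphAdapter layer.

```python
import torch
import torch.nn.functional as F

def build_graph(class_feats: torch.Tensor) -> torch.Tensor:
    """class_feats: [C, D] -> row-normalized class-similarity (adjacency) matrix [C, C]."""
    normed = F.normalize(class_feats, dim=-1)
    return (normed @ normed.T).softmax(dim=-1)

def dual_graph_adapt(text_feats, visual_protos, alpha=0.5, beta=0.2):
    """Refine per-class text features with textual and visual knowledge sub-graphs."""
    A_text = build_graph(text_feats)        # textual knowledge sub-graph
    A_vis = build_graph(visual_protos)      # visual knowledge sub-graph
    propagated = alpha * (A_text @ text_feats) + (1 - alpha) * (A_vis @ text_feats)
    return (1 - beta) * text_feats + beta * propagated   # residual blend

# Example: 10 classes with 512-dim CLIP-like features.
text_feats = torch.randn(10, 512)        # class prompt embeddings from the text encoder
visual_protos = torch.randn(10, 512)     # e.g., mean few-shot image features per class
adapted = dual_graph_adapt(text_feats, visual_protos)    # [10, 512] adapted classifier
```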
Improving Vision Transformers by Revisiting High-frequency Components
Transformer models have shown promising effectiveness on
various vision tasks. However, compared with training Convolutional Neural
Network (CNN) models, training Vision Transformer (ViT) models is more
difficult and relies on large-scale training sets. To explain this
observation, we hypothesize that ViT models are less effective than CNN models
at capturing the high-frequency components of images, and we verify this
hypothesis through a frequency analysis. Inspired by this finding, we first investigate the
effects of existing techniques for improving ViT models from a new frequency
perspective, and find that the success of some techniques (e.g., RandAugment)
can be attributed to the better usage of the high-frequency components. Then,
to compensate for this deficiency of ViT models, we propose HAT,
which directly augments high-frequency components of images via adversarial
training. We show that HAT can consistently boost the performance of various
ViT models (e.g., +1.2% for ViT-B, +0.5% for Swin-B), and in particular lift
the advanced VOLO-D5 model to 87.3% accuracy using only ImageNet-1K data. This
superiority is also maintained on out-of-distribution data and transfers to
downstream tasks.
Comment: 18 pages, 7 figures
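To illustrate what operating on high-frequency components means in practice, the sketch below splits an image with an FFT low-pass mask and perturbs only the high-frequency part. Note that the perturbation here is random for brevity, whereas HAT crafts it adversarially against the model; the radius and strength values are arbitrary assumptions.

```python
import torch

def split_frequencies(img: torch.Tensor, radius: float = 0.1):
    """img: [B, C, H, W] in [0, 1] -> (low-frequency part, high-frequency part)."""
    _, _, H, W = img.shape
    freq = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    ys = torch.linspace(-0.5, 0.5, H).view(-1, 1)
    xs = torch.linspace(-0.5, 0.5, W).view(1, -1)
    low_mask = ((ys ** 2 + xs ** 2).sqrt() <= radius).to(img.dtype)  # centered low-pass
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * low_mask, dim=(-2, -1))).real
    return low, img - low

def augment_high_freq(img: torch.Tensor, strength: float = 0.1) -> torch.Tensor:
    low, high = split_frequencies(img)
    # Rescale the high-frequency residual; HAT would choose this perturbation adversarially.
    return (low + (1 + strength * torch.randn_like(high)) * high).clamp(0, 1)

x = torch.rand(2, 3, 64, 64)
x_aug = augment_high_freq(x)   # same shape, high-frequency content perturbed
```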