
    A Dive into SAM Prior in Image Restoration

    The goal of image restoration (IR), a fundamental problem in computer vision, is to recover a high-quality (HQ) image from its degraded low-quality (LQ) observation. Because the problem is ill-posed, multiple HQ solutions may correspond to a single LQ input, creating an ambiguous solution space. This motivates the investigation and incorporation of prior knowledge to effectively constrain the solution space and enhance the quality of the restored images. Despite the pervasive use of hand-crafted and learned priors in IR, limited attention has been paid to incorporating knowledge from large-scale foundation models. In this paper, we for the first time leverage the prior knowledge of the state-of-the-art segment anything model (SAM) to boost the performance of existing IR networks in a parameter-efficient tuning manner. In particular, SAM is chosen for its robustness to image degradations, so that HQ semantic masks can be extracted even from LQ inputs. To leverage these semantic priors and enhance restoration quality, we propose a lightweight SAM prior tuning (SPT) unit. This plug-and-play component allows us to effectively integrate semantic priors into existing IR networks, resulting in significant improvements in restoration quality. As the only trainable module in our method, the SPT unit has the potential to improve both efficiency and scalability. We demonstrate the effectiveness of the proposed method in enhancing a variety of methods across multiple tasks, such as image super-resolution and color image denoising.
    Comment: Technical Report
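
    As a rough illustration of how such a plug-and-play prior-tuning unit could be wired up, the PyTorch sketch below modulates frozen IR-backbone features with a small trainable head conditioned on SAM masks. The class name, fusion rule, and channel sizes are illustrative assumptions, not the paper's exact SPT design.

        # Minimal sketch of a SAM-prior tuning unit (assumed design): a frozen IR
        # backbone feature map is modulated by a lightweight, trainable block
        # conditioned on SAM semantic masks; only this unit would be trained.
        import torch
        import torch.nn as nn

        class SAMPriorTuningUnit(nn.Module):
            def __init__(self, feat_channels: int, mask_channels: int, hidden: int = 16):
                super().__init__()
                # Small trainable head turning SAM masks into per-pixel scale/shift.
                self.mask_encoder = nn.Sequential(
                    nn.Conv2d(mask_channels, hidden, 3, padding=1),
                    nn.GELU(),
                    nn.Conv2d(hidden, 2 * feat_channels, 3, padding=1),
                )

            def forward(self, feat: torch.Tensor, sam_masks: torch.Tensor) -> torch.Tensor:
                # feat: (B, C, H, W) frozen backbone features; sam_masks: (B, M, H, W).
                scale, shift = self.mask_encoder(sam_masks).chunk(2, dim=1)
                # Residual modulation keeps the frozen path intact when scale/shift are small.
                return feat * (1.0 + scale) + shift

        # Usage: the backbone and SAM stay frozen; only the SPT-style unit is trained.
        spt = SAMPriorTuningUnit(feat_channels=64, mask_channels=8)
        feat = torch.randn(1, 64, 48, 48)
        masks = torch.rand(1, 8, 48, 48)
        out = spt(feat, masks)  # (1, 64, 48, 48)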

    Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models

    Fine-tuning pre-trained vision-language models (VLMs), e.g., CLIP, for open-world generalization has gained increasing popularity due to its practical value. However, performance advancements are limited when relying solely on intricate algorithmic designs for a single model, even one exhibiting strong performance, e.g., CLIP-ViT-B/16. This paper, for the first time, explores the collaborative potential of leveraging much weaker VLMs to enhance the generalization of a robust single model. The affirmative findings motivate us to address the generalization problem from a novel perspective, i.e., an ensemble of pre-trained VLMs. We introduce three customized ensemble strategies, each tailored to one specific scenario. First, we introduce the zero-shot ensemble, which automatically adjusts the logits of different models based on their confidence when only pre-trained VLMs are available. Furthermore, for scenarios with extra few-shot samples, we propose the training-free and tuning ensembles, offering flexibility based on the availability of computing resources. The proposed ensemble strategies are evaluated on zero-shot, base-to-new, and cross-dataset generalization, achieving new state-of-the-art performance. Notably, this work represents an initial stride toward enhancing the generalization performance of VLMs via ensembles. The code is available at https://github.com/zhiheLu/Ensemble_VLM.git.
    Comment: Technical report
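
    A minimal sketch of what a confidence-weighted zero-shot ensemble might look like is given below; the per-sample max-softmax weighting is one plausible reading of the abstract, not necessarily the paper's exact rule.

        # Hedged sketch of a confidence-weighted zero-shot ensemble over several VLMs.
        import torch

        def zero_shot_ensemble(logits_list):
            """logits_list: list of (B, num_classes) logits from different pre-trained VLMs."""
            weighted = []
            for logits in logits_list:
                probs = logits.softmax(dim=-1)
                conf = probs.max(dim=-1, keepdim=True).values  # per-sample confidence
                weighted.append(conf * logits)
            return torch.stack(weighted, dim=0).sum(dim=0)

        # Example with two stand-in models (e.g., a strong CLIP-ViT-B/16 and a weaker ViT-B/32).
        strong = torch.randn(4, 10)
        weak = torch.randn(4, 10)
        ensembled = zero_shot_ensemble([strong, weak])
        pred = ensembled.argmax(dim=-1)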

    Backdoor Attack on Hash-based Image Retrieval via Clean-label Data Poisoning

    A backdoored deep hashing model is expected to behave normally on original query images and return images with the target label when a specific trigger pattern is present. To this end, we propose the confusing perturbations-induced backdoor attack (CIBA). It injects a small number of poisoned images with correct labels into the training data, which makes the attack hard to detect. To craft the poisoned images, we first propose confusing perturbations to disturb the hashing code learning, so that the hashing model learns more about the trigger. The confusing perturbations are imperceptible and generated by optimizing the intra-class dispersion and inter-class shift in the Hamming space. We then employ a targeted adversarial patch as the backdoor trigger to improve the attack performance. We have conducted extensive experiments to verify the effectiveness of the proposed CIBA. Our code is available at https://github.com/KuofengGao/CIBA.
    Comment: Accepted by BMVC 202
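
    The sketch below illustrates one possible form of the intra-class dispersion / inter-class shift objective, using a tanh relaxation of the Hamming distance; the loss shape and weighting are assumptions, not the paper's exact formulation.

        # Assumed formulation: push codes of perturbed images away from their own class
        # center (intra-class dispersion) and toward other class centers (inter-class shift).
        import torch

        def hamming_relaxed(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
            # a, b: (..., K) continuous codes in [-1, 1]; 0.5 * (K - <a, b>) approximates Hamming distance.
            k = a.shape[-1]
            return 0.5 * (k - (a * b).sum(dim=-1))

        def confusing_loss(codes, own_center, other_centers):
            # codes: (B, K); own_center: (K,); other_centers: (C-1, K).
            dispersion = hamming_relaxed(codes, own_center)             # want this large
            shift = hamming_relaxed(codes.unsqueeze(1), other_centers)  # (B, C-1); want this small
            return (-dispersion + shift.mean(dim=1)).mean()

        codes = torch.tanh(torch.randn(8, 48, requires_grad=True))  # relaxed hash codes
        own = torch.sign(torch.randn(48))
        others = torch.sign(torch.randn(5, 48))
        loss = confusing_loss(codes, own, others)
        loss.backward()  # gradients would drive the perturbation update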

    GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph

    Adapter-style efficient transfer learning (ETL) has shown excellent performance in tuning vision-language models (VLMs) under the low-data regime, where only a few additional parameters are introduced to excavate task-specific knowledge based on the general and powerful representation of VLMs. However, most adapter-style works face two limitations: (i) they model task-specific knowledge with a single modality only; and (ii) they overlook the inter-class relationships in downstream tasks, thereby leading to sub-optimal solutions. To mitigate these limitations, we propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which builds the textual adapter by explicitly modeling dual-modality structure knowledge (i.e., the correlation of different semantics/classes in the textual and visual modalities) with a dual knowledge graph. In particular, the dual knowledge graph is established with two sub-graphs, i.e., a textual knowledge sub-graph and a visual knowledge sub-graph, where the nodes and edges represent the semantics/classes and their correlations in the two modalities, respectively. This enables the textual feature of each prompt to leverage task-specific structure knowledge from both the textual and visual modalities, yielding a more effective classifier for downstream tasks. Extensive experimental results on 11 benchmark datasets reveal that our GraphAdapter significantly outperforms previous adapter-based methods. The code will be released at https://github.com/lixinustc/GraphAdapter
    Comment: Accepted by NeurIPS 2023. The manuscript will be further revised based on the reviews.
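
    The following sketch shows one way a dual-sub-graph adapter of this kind could refine class (prompt) features by propagating them over textual and visual class-similarity graphs; the propagation rule, residual blend, and names are illustrative assumptions rather than the released GraphAdapter code.

        # Assumed design: one message-passing step on each sub-graph, then a residual blend.
        import torch
        import torch.nn.functional as F

        def graph_propagate(node_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
            # Soft adjacency from cosine similarity among class nodes, then aggregation.
            sim = F.normalize(ref_feats, dim=-1) @ F.normalize(ref_feats, dim=-1).t()
            adj = sim.softmax(dim=-1)  # (C, C) row-normalized edges
            return adj @ node_feats    # one step of message passing

        def graph_adapter(text_feats, visual_proto, alpha=0.7):
            # text_feats: (C, D) prompt/class embeddings; visual_proto: (C, D) visual prototypes.
            t_branch = graph_propagate(text_feats, text_feats)    # textual knowledge sub-graph
            v_branch = graph_propagate(text_feats, visual_proto)  # visual knowledge sub-graph
            refined = 0.5 * (t_branch + v_branch)
            return alpha * text_feats + (1 - alpha) * refined     # residual blend

        text = torch.randn(11, 512)   # e.g., 11 classes, 512-d CLIP text features
        proto = torch.randn(11, 512)
        classifier = F.normalize(graph_adapter(text, proto), dim=-1)  # refined class weights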

    Improving Vision Transformers by Revisiting High-frequency Components

    Transformer models have shown promising effectiveness on various vision tasks. However, compared with training Convolutional Neural Network (CNN) models, training Vision Transformer (ViT) models is more difficult and relies on large-scale training sets. To explain this observation, we hypothesize that ViT models are less effective than CNN models at capturing the high-frequency components of images, and we verify this by a frequency analysis. Inspired by this finding, we first examine the effects of existing techniques for improving ViT models from a new frequency perspective, and find that the success of some techniques (e.g., RandAugment) can be attributed to better usage of the high-frequency components. Then, to compensate for this insufficient ability of ViT models, we propose HAT, which directly augments the high-frequency components of images via adversarial training. We show that HAT can consistently boost the performance of various ViT models (e.g., +1.2% for ViT-B, +0.5% for Swin-B), and in particular raises the advanced model VOLO-D5 to 87.3% accuracy using only ImageNet-1K data; the gains are also maintained on out-of-distribution data and transfer to downstream tasks.
    Comment: 18 pages, 7 figures
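
    The sketch below shows one way high-frequency-only adversarial augmentation could be implemented with an FFT mask; the cutoff radius, single-step attack, and helper names are assumptions, not the paper's exact HAT procedure.

        # Assumed procedure: take a loss-ascending direction, keep only its high-frequency
        # band via an FFT mask, and add that band back to the image as augmentation.
        import torch
        import torch.nn.functional as F

        def high_freq_mask(h, w, radius=8):
            # 1 outside a low-frequency square around the (fft-shifted) spectrum center.
            yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
            low = (yy - h // 2).abs().le(radius) & (xx - w // 2).abs().le(radius)
            return (~low).float()

        def hf_adversarial_augment(model, images, labels, eps=4 / 255):
            images = images.clone().requires_grad_(True)
            loss = F.cross_entropy(model(images), labels)
            grad = torch.autograd.grad(loss, images)[0]
            # Keep only the high-frequency part of the adversarial direction.
            spec = torch.fft.fftshift(torch.fft.fft2(grad.sign()), dim=(-2, -1))
            mask = high_freq_mask(images.shape[-2], images.shape[-1]).to(images.device)
            hf_dir = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
            return (images + eps * hf_dir).clamp(0, 1).detach()

        # Tiny stand-in classifier just to exercise the function.
        net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
        x, y = torch.rand(2, 3, 32, 32), torch.tensor([1, 3])
        x_hf = hf_adversarial_augment(net, x, y)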