
    Novel Concepts and Designs for Adversarial Attacks and Defenses

    Albeit displaying remarkable performance across a range of tasks, Deep Neural Networks (DNNs) are highly vulnerable to adversarial examples, which are carefully crafted to deceive these networks. This thesis first demonstrates that DNNs are vulnerable to adversarial attacks even when the attacker is unaware of the model architecture or the training data used to train the model, and then proposes a number of novel approaches to improve the robustness of DNNs against challenging adversarial perturbations. On the attack side, our work shows how targeted and untargeted adversarial functions can be learned without access to the original data distribution, training mechanism, or label space of an attacked computer vision system. We demonstrate state-of-the-art cross-domain transferability of adversarial perturbations learned from paintings, cartoons, and medical scans to models trained on natural image datasets (such as ImageNet). In this manner, our work highlights an important vulnerability of deep neural networks that makes their deployment challenging in real-world scenarios. Against the threat of these adversarial attacks, we develop novel defense mechanisms that can be deployed with or without retraining the deep neural networks. To this end, we design two plug-and-play defense methods that protect off-the-shelf pre-trained models without retraining. Specifically, we propose the Neural Representation Purifier (NRP) and Local Gradient Smoothing (LGS) to defend against constrained and unconstrained attacks, respectively. NRP learns to purify adversarial noise spread across the entire input image; however, it still struggles against unconstrained attacks in which an attacker hides an adversarial sticker, preferably in the background, without disturbing the original salient image information. We therefore develop a mechanism that smooths local gradients in the input image to suppress the abnormal adversarial patterns introduced by unconstrained attacks such as an adversarial patch. Robustifying the model's parameter space, that is, retraining the model on adversarial examples, is of equal importance. However, current adversarial training methods not only lose performance on clean image samples (images without adversarial noise) but also generalize poorly to natural image corruptions. We propose a style-based adversarial training that enhances model robustness to adversarial attacks as well as natural corruptions. A model trained with our proposed stylized adversarial training shows better generalization to shifts in data distribution, including natural image corruptions such as fog, rain, and contrast. One drawback of adversarial training is the loss of accuracy on clean image samples, especially when the model size is small. To address this limitation, we design a meta-learning-based approach that exploits universal (instance-agnostic) as well as local (instance-specific) perturbations to train small neural networks with feature regularization, leading to better robustness with a minimal drop in performance on clean image samples. Adversarial training is useful only if it holds up against unseen adversarial attacks. However, evaluating a given adversarial training mechanism remains challenging due to gradient masking, a phenomenon in which adversarial robustness appears high because the attack optimization fails. Finally, we develop a generic attack algorithm based on a novel guidance mechanism to expose any illusory robustness caused by gradient masking.
In short, this thesis outlines new methods to expose the vulnerability of DNNs to adversarial perturbations and then sets out to propose novel defense techniques with special advantages over state-of-the-art methods, e.g., task-agnostic behavior, good performance against natural perturbations, and a smaller impact on model accuracy on clean image samples.
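
    As a rough illustration of the Local Gradient Smoothing idea described above, the sketch below suppresses image regions whose local gradients are abnormally dense, as an adversarial patch tends to be, before the image reaches the model. The window size, threshold, and smoothing factor are illustrative choices, not the thesis' exact values.

```python
import torch
import torch.nn.functional as F

def local_gradient_smoothing(x, window=15, threshold=0.1, smoothing=2.3):
    """x: image batch of shape (B, C, H, W) with values in [0, 1]."""
    gray = x.mean(dim=1, keepdim=True)                      # (B, 1, H, W)
    # First-order image gradients via simple finite differences.
    dx = F.pad(gray[:, :, :, 1:] - gray[:, :, :, :-1], (0, 1))
    dy = F.pad(gray[:, :, 1:, :] - gray[:, :, :-1, :], (0, 0, 0, 1))
    mag = torch.sqrt(dx ** 2 + dy ** 2)
    # Average gradient magnitude per local window -> local "gradient density".
    density = F.avg_pool2d(mag, window, stride=1, padding=window // 2)
    density = density / (density.amax(dim=(2, 3), keepdim=True) + 1e-8)
    # Scale down pixels in high-density regions; leave the rest untouched.
    scale = torch.clamp(1.0 - smoothing * density, 0.0, 1.0)
    scale = torch.where(density > threshold, scale, torch.ones_like(scale))
    return x * scale

# Usage: purified = local_gradient_smoothing(images); logits = model(purified)
```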

    FLIP: Cross-domain Face Anti-spoofing with Language Guidance

    Face anti-spoofing (FAS) or presentation attack detection is an essential component of face recognition systems deployed in security-critical applications. Existing FAS methods have poor generalizability to unseen spoof types, camera sensors, and environmental conditions. Recently, vision transformer (ViT) models have been shown to be effective for the FAS task due to their ability to capture long-range dependencies among image patches. However, adaptive modules or auxiliary loss functions are often required to adapt pre-trained ViT weights learned on large-scale datasets such as ImageNet. In this work, we first show that initializing ViTs with multimodal (e.g., CLIP) pre-trained weights improves generalizability for the FAS task, which is in line with the zero-shot transfer capabilities of vision-language pre-trained (VLP) models. We then propose a novel approach for robust cross-domain FAS by grounding visual representations with the help of natural language. Specifically, we show that aligning the image representation with an ensemble of class descriptions (based on natural language semantics) improves FAS generalizability in low-data regimes. Finally, we propose a multimodal contrastive learning strategy to boost feature generalization further and bridge the gap between source and target domains. Extensive experiments on three standard protocols demonstrate that our method significantly outperforms the state-of-the-art methods, achieving better zero-shot transfer performance than five-shot transfer of adaptive ViTs. Code: https://github.com/koushiksrivats/FLIP. Comment: Accepted to ICCV-2023. Project Page: https://koushiksrivats.github.io/FLIP
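
    A rough sketch of the "ensemble of class descriptions" idea: the image embedding is scored against the averaged text embedding of several natural-language descriptions per class (here, real vs. spoof). The encoders and the prompt wording are placeholders, not the released FLIP code.

```python
import torch
import torch.nn.functional as F

class_descriptions = {
    "real":  ["a photo of a real face", "a bona fide face captured by a camera"],
    "spoof": ["a photo of a spoof face", "a face replayed on a screen or printed on paper"],
}

def classify(image_features, encode_text):
    """image_features: (B, D) from a CLIP-style image encoder.
    encode_text: callable mapping a list of strings to (N, D) embeddings."""
    image_features = F.normalize(image_features, dim=-1)
    scores = []
    for cls, prompts in class_descriptions.items():
        text = F.normalize(encode_text(prompts), dim=-1)       # (N, D)
        proto = F.normalize(text.mean(dim=0), dim=-1)          # ensemble prototype
        scores.append(image_features @ proto)                  # cosine similarity
    return torch.stack(scores, dim=-1)                         # (B, num_classes)

# Example with random stand-ins for the encoders:
img = torch.randn(4, 512)
txt = lambda prompts: torch.randn(len(prompts), 512)
print(classify(img, txt).argmax(dim=-1))
```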

    CLIP2Protect: Protecting Facial Privacy using Text-Guided Makeup via Adversarial Latent Search

    The success of deep learning based face recognition systems has given rise to serious privacy concerns due to their ability to enable unauthorized tracking of users in the digital world. Existing methods for enhancing privacy fail to generate naturalistic images that can protect facial privacy without compromising user experience. We propose a novel two-step approach for facial privacy protection that relies on finding adversarial latent codes in the low-dimensional manifold of a pretrained generative model. The first step inverts the given face image into the latent space and finetunes the generative model to achieve an accurate reconstruction of the given image from its latent code. This step produces a good initialization, aiding the generation of high-quality faces that resemble the given identity. Subsequently, user-defined makeup text prompts and identity-preserving regularization are used to guide the search for adversarial codes in the latent space. Extensive experiments demonstrate that faces generated by our approach have stronger black-box transferability with an absolute gain of 12.06% over the state-of-the-art facial privacy protection approach under the face verification task. Finally, we demonstrate the effectiveness of the proposed approach for commercial face recognition systems. Our code is available at https://github.com/fahadshamshad/Clip2Protect. Comment: Accepted in CVPR 2023. Project page: https://fahadshamshad.github.io/Clip2Protect
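
    An illustrative sketch (not the released implementation) of the second step described above: the latent code of a pretrained generator is optimized so that the rendered face matches a makeup text prompt under a CLIP-style similarity, stays close to the inverted code for identity preservation, and is pushed away from the original identity embedding of a surrogate face recognizer (an untargeted objective is used here for simplicity). `generator`, `clip_sim`, and `face_encoder` are placeholders.

```python
import torch
import torch.nn.functional as F

def adversarial_latent_search(w_init, generator, clip_sim, face_encoder,
                              makeup_prompt, steps=50, lr=0.01, lam_id=0.1):
    orig_embed = F.normalize(face_encoder(generator(w_init)), dim=-1).detach()
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        face = generator(w)
        # Encourage the generated face to match the makeup description.
        loss_makeup = -clip_sim(face, makeup_prompt)
        # Keep the latent close to the inverted code (visual/identity preservation).
        loss_reg = lam_id * F.mse_loss(w, w_init)
        # Push the face-recognition embedding away from the original identity.
        adv_embed = F.normalize(face_encoder(face), dim=-1)
        loss_adv = (adv_embed * orig_embed).sum(dim=-1).mean()
        loss = loss_makeup + loss_reg + loss_adv
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach()
```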

    Enhancing Novel Object Detection via Cooperative Foundational Models

    In this work, we address the challenging and emergent problem of novel object detection (NOD), focusing on the accurate detection of both known and novel object categories during inference. Traditional object detection algorithms are inherently closed-set, limiting their capability to handle NOD. We present a novel approach to transform existing closed-set detectors into open-set detectors. This transformation is achieved by leveraging the complementary strengths of pre-trained foundational models, specifically CLIP and SAM, through our cooperative mechanism. Furthermore, by integrating this mechanism with state-of-the-art open-set detectors such as GDINO, we establish new benchmarks in object detection performance. Our method achieves 17.42 mAP in novel object detection and 42.08 mAP for known objects on the challenging LVIS dataset. Adapting our approach to the COCO OVD split, we surpass the current state-of-the-art by a margin of 7.2 AP50 for novel classes. Our code is available at https://github.com/rohit901/cooperative-foundational-models. Comment: Code: https://github.com/rohit901/cooperative-foundational-models
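
    A simplified sketch of one way such a cooperative mechanism can be wired together: class-agnostic region proposals (e.g., boxes derived from SAM masks or from a closed-set detector) are cropped and scored against an open vocabulary with a CLIP-style encoder. `propose_boxes`, `encode_image`, and `encode_text` are placeholders, not the authors' exact pipeline.

```python
import torch
import torch.nn.functional as F

def open_vocabulary_detect(image, class_names, propose_boxes, encode_image,
                           encode_text, score_thresh=0.5):
    """image: tensor of shape (C, H, W); class_names: list of known + novel names."""
    boxes = propose_boxes(image)                              # (N, 4) class-agnostic proposals
    text = F.normalize(encode_text(class_names), dim=-1)      # (K, D)
    detections = []
    for box in boxes:
        x0, y0, x1, y1 = [int(v) for v in box]
        crop = image[:, y0:y1, x0:x1]
        feat = F.normalize(encode_image(crop), dim=-1)        # (D,)
        sims = feat @ text.t()                                # (K,) similarity to every class
        score, idx = sims.softmax(dim=-1).max(dim=-1)
        if score >= score_thresh:
            detections.append((box, class_names[idx.item()], score.item()))
    return detections
```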

    Stylized Adversarial Defense

    Deep Convolutional Neural Networks (CNNs) can easily be fooled by subtle, imperceptible changes to the input images. To address this vulnerability, adversarial training creates perturbation patterns and includes them in the training set to robustify the model. In contrast to existing adversarial training methods that only use class-boundary information (e.g., using a cross-entropy loss), we propose to exploit additional information from the feature space to craft stronger adversaries that are in turn used to learn a robust model. Specifically, we use the style and content information of the target sample from another class, alongside its class-boundary information, to create adversarial perturbations. We apply our proposed multi-task objective in a deeply supervised manner, extracting multi-scale feature knowledge to create maximally separating adversaries. Subsequently, we propose a max-margin adversarial training approach that minimizes the distance between the source image and its adversary and maximizes the distance between the adversary and the target image. Our adversarial training approach demonstrates strong robustness compared to state-of-the-art defenses, generalizes well to naturally occurring corruptions and data distributional shifts, and retains the model accuracy on clean examples. Comment: Code is available at https://github.com/Muzammal-Naseer/SA
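
    A hedged sketch of the max-margin training term described above: in feature space, the adversary is pulled back toward its source image and pushed away from the target image it was crafted toward. The margin value and the use of an L2 feature distance are illustrative choices.

```python
import torch
import torch.nn.functional as F

def max_margin_loss(f_src, f_adv, f_tgt, margin=1.0):
    """f_src, f_adv, f_tgt: feature embeddings of shape (B, D)."""
    d_src_adv = F.pairwise_distance(f_src, f_adv)   # pull adversary toward its source
    d_adv_tgt = F.pairwise_distance(f_adv, f_tgt)   # push adversary away from the target
    return F.relu(d_src_adv - d_adv_tgt + margin).mean()

# During training, a term like this would be combined with the usual cross-entropy
# on the adversarial examples; the adversaries themselves are crafted using the
# style and content of a target sample from another class.
```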

    Learnable Weight Initialization for Volumetric Medical Image Segmentation

    Hybrid volumetric medical image segmentation models, combining the advantages of local convolution and global attention, have recently received considerable attention. While mainly focusing on architectural modifications, most existing hybrid approaches still use conventional data-independent weight initialization schemes which restrict their performance due to ignoring the inherent volumetric nature of the medical data. To address this issue, we propose a learnable weight initialization approach that utilizes the available medical training data to effectively learn the contextual and structural cues via the proposed self-supervised objectives. Our approach is easy to integrate into any hybrid model and requires no external training data. Experiments on multi-organ and lung cancer segmentation tasks demonstrate the effectiveness of our approach, leading to state-of-the-art segmentation performance. Our source code and models are available at: https://github.com/ShahinaKK/LWI-VMS. Comment: Technical Report
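
    The abstract does not spell out the exact self-supervised objectives, so the sketch below uses a generic masked-volume reconstruction pretext purely to illustrate data-dependent initialization: the segmentation encoder is first trained to restore masked regions of the unlabeled training volumes, and the resulting weights initialize the supervised run. `encoder` and `decoder` are placeholder modules.

```python
import torch
import torch.nn.functional as F

def pretrain_init(encoder, decoder, volumes, mask_ratio=0.3, epochs=10, lr=1e-4):
    """volumes: iterable of tensors shaped (B, 1, D, H, W) from the training set."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for vol in volumes:
            mask = (torch.rand_like(vol) > mask_ratio).float()   # 1 = kept voxel
            recon = decoder(encoder(vol * mask))
            # Reconstruct only the masked voxels from the visible context.
            loss = F.mse_loss(recon * (1 - mask), vol * (1 - mask))
            opt.zero_grad(); loss.backward(); opt.step()
    return encoder.state_dict()   # used to initialize the segmentation model
```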

    Videoprompter: an ensemble of foundational models for zero-shot video understanding

    Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations. Recently, large language models (LLMs) have been used to enrich the text-based class labels by enhancing the descriptiveness of the class names. However, these improvements are restricted to the text-based classifier only, and the query visual features are not considered. In this paper, we propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models. We introduce two key modifications to the standard zero-shot setting. First, we propose language-guided visual feature enhancement and employ a video-to-text model to convert the query video to its descriptive form. The resulting descriptions contain vital visual cues of the query video, such as what objects are present and their spatio-temporal interactions. These descriptive cues provide additional semantic knowledge to VLMs to enhance their zero-shot performance. Second, we propose video-specific prompts to LLMs to generate more meaningful descriptions to enrich class label representations. Specifically, we introduce prompt techniques to create a Tree Hierarchy of Categories for class names, offering a higher-level action context for additional visual cues. We demonstrate the effectiveness of our approach in video understanding across three different zero-shot settings: 1) video action recognition, 2) video-to-text and text-to-video retrieval, and 3) time-sensitive video tasks. Consistent improvements across multiple benchmarks and with various VLMs demonstrate the effectiveness of our proposed framework. Our code will be made publicly available.
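
    A schematic sketch of the zero-shot scoring described above: the query video's visual embedding is complemented by the embedding of an automatically generated description of the video, and both are compared against class prototypes built from LLM-enriched label descriptions. All encoders and the fusion weight are placeholders, not the paper's released pipeline.

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(video_feat, video_caption, class_prompts, encode_text, alpha=0.5):
    """video_feat: (D,) VLM visual embedding of the query video.
    video_caption: string produced by a video-to-text model.
    class_prompts: dict mapping class name -> list of LLM-generated descriptions."""
    v = F.normalize(video_feat, dim=-1)
    c = F.normalize(encode_text([video_caption])[0], dim=-1)
    query = F.normalize(alpha * v + (1 - alpha) * c, dim=-1)   # fuse visual + caption cues
    scores = {}
    for name, prompts in class_prompts.items():
        proto = F.normalize(encode_text(prompts).mean(dim=0), dim=-1)
        scores[name] = (query @ proto).item()
    return scores
```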

    A Self-supervised Approach for Adversarial Robustness

    Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN) based vision systems, e.g., for classification, segmentation, and object detection. The vulnerability of DNNs to such attacks can prove a major roadblock towards their real-world deployment. The transferability of adversarial examples demands generalizable defenses that can provide cross-task protection. Adversarial training, which enhances robustness by modifying the target model's parameters, lacks such generalizability. On the other hand, different input-processing based defenses fall short in the face of continuously evolving attacks. In this paper, we take the first step to combine the benefits of both approaches and propose a self-supervised adversarial training mechanism in the input space. By design, our defense is a generalizable approach and provides significant robustness against unseen adversarial attacks (e.g., by reducing the success rate of the translation-invariant ensemble attack from 82.6% to 31.9% compared to the previous state of the art). It can be deployed as a plug-and-play solution to protect a variety of vision systems, as we demonstrate for the case of classification, segmentation, and detection. Code is available at: https://github.com/Muzammal-Naseer/NRP. Comment: CVPR-2020 (Oral). Code: https://github.com/Muzammal-Naseer/NRP
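
    A minimal sketch of how such a plug-and-play, input-space defense is used at inference time: incoming images pass through the purifier network before reaching whatever task model needs protecting, which is left untouched. Loading of the purifier weights is omitted; see the linked repository for the released model and scripts.

```python
import torch

def protected_forward(x, purifier, task_model):
    """x: input image batch; purifier: the trained purifier network;
    task_model: any off-the-shelf classifier, segmenter, or detector."""
    with torch.no_grad():
        x_clean = purifier(x)      # remove adversarial noise in the input space
    return task_model(x_clean)     # the protected model itself is not retrained
```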