
    I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models

    Modern image-to-text systems typically adopt the encoder-decoder framework, which comprises two main components: an image encoder, responsible for extracting image features, and a transformer-based decoder, used for generating captions. Taking inspiration from the analysis of neural networks' robustness against adversarial perturbations, we propose a novel gray-box algorithm for creating adversarial examples in image-to-text models. Unlike image classification tasks that have a finite set of class labels, finding visually similar adversarial examples in an image-to-text task poses greater challenges because the captioning system allows for a virtually infinite space of possible captions. In this paper, we present a gray-box adversarial attack on image-to-text models, both untargeted and targeted. We formulate the process of discovering adversarial perturbations as an optimization problem that uses only the image-encoder component, meaning the proposed attack is language-model agnostic. Through experiments conducted on the ViT-GPT2 model, which is the most-used image-to-text model on Hugging Face, and the Flickr30k dataset, we demonstrate that our proposed attack successfully generates visually similar adversarial examples, with both untargeted and targeted captions. Notably, our attack operates in a gray-box manner, requiring no knowledge about the decoder module. We also show that our attacks fool the popular open-source platform Hugging Face.
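    The attack described above optimizes a perturbation against the image encoder alone. The sketch below is a minimal illustration of that idea under stated assumptions, not the authors' implementation: it runs PGD-style gradient steps on a stand-in encoder, pushing the adversarial image's features away from (untargeted) or toward (targeted) a chosen feature vector, while the decoder is never queried.

```python
# Illustrative sketch only: an encoder-only, PGD-style perturbation search.
# The encoder below is a toy stand-in; in the paper's setting it would be the
# image encoder of an image-to-text model (e.g. the ViT in ViT-GPT2).
import torch
import torch.nn as nn
import torch.nn.functional as F

def encoder_attack(encoder, image, target_image=None, eps=8 / 255, alpha=1 / 255, steps=100):
    """Search for a perturbation of `image` within an L-infinity budget `eps`.

    Untargeted (target_image is None): push the encoder features away from the
    clean image's features. Targeted: pull them toward `target_image`'s
    features, so an unseen decoder tends to caption the target instead.
    """
    with torch.no_grad():
        source_feat = encoder(image)
        target_feat = encoder(target_image) if target_image is not None else None

    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        feat = encoder((image + delta).clamp(0, 1))
        if target_feat is None:
            loss = -F.mse_loss(feat, source_feat)   # maximize distance from clean features
        else:
            loss = F.mse_loss(feat, target_feat)    # minimize distance to target features
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()      # signed gradient step on the loss
            delta.clamp_(-eps, eps)                 # project back into the L-inf ball
            delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()

if __name__ == "__main__":
    # Tiny stand-in encoder so the sketch runs end to end without downloads.
    toy_encoder = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.Flatten())
    clean, target = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
    adv = encoder_attack(toy_encoder, clean, target_image=target)
    print("max perturbation:", (adv - clean).abs().max().item())
```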

    Open Sesame! Universal Black Box Jailbreaking of Large Language Models

    Large language models (LLMs), designed to provide helpful and safe responses, often rely on alignment techniques to align with user intent and social guidelines. Unfortunately, this alignment can be exploited by malicious actors seeking to manipulate an LLM's outputs for unintended purposes. In this paper, we introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that -- when combined with a user's query -- disrupts the attacked model's alignment, resulting in unintended and potentially harmful outputs. Our novel approach systematically reveals a model's limitations and vulnerabilities by uncovering instances where its responses deviate from expected behavior. Through extensive experiments we demonstrate the efficacy of our technique, thus contributing to the ongoing discussion on responsible AI development by providing a diagnostic tool for evaluating and enhancing alignment of LLMs with human intent. To our knowledge, this is the first automated universal black-box jailbreak attack.
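    As a rough illustration of the genetic-algorithm search described above (a sketch under assumed parameters, not the authors' code), the snippet below evolves a population of candidate suffixes of token ids. The fitness function here is a toy stand-in for the black-box step of querying the target LLM with prompt-plus-suffix and scoring how far the response drifts from aligned behavior.

```python
# Toy GA sketch for a universal adversarial suffix; all constants are assumptions.
import random

VOCAB = list(range(1000))     # placeholder token vocabulary
SUFFIX_LEN = 20
POP_SIZE = 30
GENERATIONS = 50

def fitness(suffix, prompts):
    """Stand-in black-box score: higher means the suffix is 'more adversarial'.
    In a real run this would query the target model on prompt + suffix."""
    return sum((tok * (i + 1)) % 7 for i, tok in enumerate(suffix)) / len(prompts)

def crossover(a, b):
    cut = random.randrange(1, SUFFIX_LEN)          # single-point crossover
    return a[:cut] + b[cut:]

def mutate(suffix, rate=0.1):
    return [random.choice(VOCAB) if random.random() < rate else t for t in suffix]

def ga_search(prompts):
    population = [[random.choice(VOCAB) for _ in range(SUFFIX_LEN)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        ranked = sorted(population, key=lambda s: fitness(s, prompts), reverse=True)
        elite = ranked[: POP_SIZE // 5]            # keep the best suffixes
        children = [mutate(crossover(*random.sample(elite, 2)))
                    for _ in range(POP_SIZE - len(elite))]
        population = elite + children
    return max(population, key=lambda s: fitness(s, prompts))

if __name__ == "__main__":
    best = ga_search(prompts=["user query 1", "user query 2"])
    print("best universal suffix (token ids):", best[:5], "...")
```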

    Foiling Explanations in Deep Neural Networks

    Deep neural networks (DNNs) have greatly impacted numerous fields over the past decade. Yet despite exhibiting superb performance over many problems, their black-box nature still poses a significant challenge with respect to explainability. Indeed, explainable artificial intelligence (XAI) is crucial in several fields, wherein the answer alone -- sans a reasoning of how said answer was derived -- is of little value. This paper uncovers a troubling property of explanation methods for image-based DNNs: by making small visual changes to the input image -- hardly influencing the network's output -- we demonstrate how explanations may be arbitrarily manipulated through the use of evolution strategies. Our novel algorithm, AttaXAI, a model-agnostic adversarial attack on XAI algorithms, requires access only to the output logits of a classifier and to the explanation map; these weak assumptions render our approach highly useful where real-world models and data are concerned. We compare our method's performance on two benchmark datasets -- CIFAR100 and ImageNet -- using four different pretrained deep-learning models: VGG16-CIFAR100, VGG16-ImageNet, MobileNet-CIFAR100, and Inception-v3-ImageNet. We find that the XAI methods can be manipulated without the use of gradients or other model internals. Our novel algorithm is successfully able to manipulate an image in a manner imperceptible to the human eye, such that the XAI method outputs a specific explanation map. To our knowledge, this is the first such method in a black-box setting, and we believe it has significant value where explainability is desired, required, or legally mandatory.
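    The sketch below illustrates the general flavor of such an evolution-strategy search, assuming only query access to the logits and the explanation map; the classifier and explainer are toy stand-ins, and the update rule (a simple NES-style estimator) is an assumption for illustration rather than the AttaXAI algorithm itself.

```python
# Illustrative ES sketch: steer the explanation map toward a target while
# keeping the logits (and hence the prediction) close to the original.
import numpy as np

def es_attack(image, logits_fn, explain_fn, target_map,
              sigma=0.01, lr=0.005, pop=20, iters=200, lam=1.0):
    """NES-style search; no gradients of the model are ever used."""
    base_logits = logits_fn(image)
    x = image.copy()
    for _ in range(iters):
        noise = np.random.randn(pop, *image.shape)
        scores = np.empty(pop)
        for i in range(pop):
            cand = np.clip(x + sigma * noise[i], 0.0, 1.0)
            expl_loss = np.mean((explain_fn(cand) - target_map) ** 2)   # match target explanation
            pred_loss = np.mean((logits_fn(cand) - base_logits) ** 2)   # keep the output similar
            scores[i] = -(expl_loss + lam * pred_loss)                  # higher is better
        scores = (scores - scores.mean()) / (scores.std() + 1e-8)       # normalize fitness
        grad_est = (scores.reshape(-1, *([1] * image.ndim)) * noise).mean(axis=0) / sigma
        x = np.clip(x + lr * grad_est, 0.0, 1.0)                        # ascend the estimated gradient
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((3, 16, 16))
    # Toy stand-ins: a fixed random linear "classifier" and a saliency-like map.
    W = rng.standard_normal((10, img.size))
    logits_fn = lambda im: W @ im.ravel()
    explain_fn = lambda im: im.mean(axis=0)          # crude per-pixel "explanation"
    target = np.zeros((16, 16))
    target[4:8, 4:8] = 1.0
    adv = es_attack(img, logits_fn, explain_fn, target)
    print("max pixel change:", np.abs(adv - img).max())
```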

    Functional immunomics: Microarray analysis of IgG autoantibody repertoires predicts the future response of NOD mice to an inducer of accelerated diabetes

    One's present repertoire of antibodies encodes the history of one's past immunological experience. Can the present autoantibody repertoire be consulted to predict resistance or susceptibility to the future development of an autoimmune disease? Here we developed an antigen microarray chip and used bioinformatic analysis to study a model of type 1 diabetes developing in non-obese diabetic (NOD) male mice, in which the disease was accelerated and synchronized by exposing the mice to cyclophosphamide at 4 weeks of age. We obtained sera from 19 individual mice, treated the mice to induce cyclophosphamide-accelerated diabetes (CAD), and found, as expected, that 9 mice became severely diabetic while 10 mice permanently resisted diabetes. We again obtained serum from each mouse after CAD induction. We then analyzed the patterns of antibodies in the individual mice to 266 different antigens spotted on the antigen chip. We identified a select panel of 27 different antigens (10% of the array) that revealed a pattern of IgG antibody reactivity in the pre-CAD sera that discriminated between the mice resistant or susceptible to CAD with 100% sensitivity and 82% specificity (p=0.017). Surprisingly, the set of IgG antibodies that was informative before CAD induction did not separate the resistant and susceptible groups after the onset of CAD; new antigens became critical for post-CAD repertoire discrimination. Thus, at least for a model disease, present antibody repertoires can predict future disease; predictive and diagnostic repertoires can differ; and decisive information about immune system behavior can be mined by bioinformatic technology. Repertoires matter. Comment: See Advanced Publication on the PNAS website for the final version.
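    For readers wanting a concrete picture of the kind of analysis described above, the following is a generic, illustrative sketch (not the authors' pipeline, and run here on synthetic data): antigens are ranked by how well their pre-induction IgG reactivity separates the two outcome groups, a small panel is kept, and discrimination is checked with leave-one-out classification.

```python
# Generic panel-selection sketch on synthetic data; all modeling choices here
# (Welch t-statistic ranking, nearest-centroid classifier) are illustrative
# assumptions, not the study's actual bioinformatic method.
import numpy as np

def t_statistic(a, b):
    """Welch t-statistic per antigen (columns are antigens)."""
    na, nb = len(a), len(b)
    return (a.mean(0) - b.mean(0)) / np.sqrt(a.var(0, ddof=1) / na + b.var(0, ddof=1) / nb)

def select_panel(X, y, k=27):
    """Keep the k antigens whose pre-induction reactivity best separates
    susceptible (y=1) from resistant (y=0) mice."""
    t = np.abs(t_statistic(X[y == 1], X[y == 0]))
    return np.argsort(t)[::-1][:k]

def loo_nearest_centroid(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier on the panel."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        c0 = X[mask & (y == 0)].mean(0)
        c1 = X[mask & (y == 1)].mean(0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        hits += pred == y[i]
    return hits / len(y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((19, 266))                 # 19 mice x 266 spotted antigens (synthetic)
    y = np.array([1] * 9 + [0] * 10)          # 9 became diabetic, 10 resisted
    panel = select_panel(X, y, k=27)
    print("panel size:", len(panel), "LOO accuracy:", loo_nearest_centroid(X[:, panel], y))
```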

    An Evolutionary, Gradient-Free, Query-Efficient, Black-Box Algorithm for Generating Adversarial Instances in Deep Convolutional Neural Networks

    Deep neural networks (DNNs) are sensitive to adversarial data in a variety of scenarios, including the black-box scenario, where the attacker is only allowed to query the trained model and receive an output. Existing black-box methods for creating adversarial instances are costly, often using gradient estimation or training a replacement network. This paper introduces the Query-Efficient Evolutionary Attack (QuEry Attack), an untargeted, score-based, black-box attack. QuEry Attack is based on a novel objective function that can be used in gradient-free optimization problems. The attack only requires access to the output logits of the classifier and is thus not affected by gradient masking. No additional information is needed, rendering our method more suitable to real-life situations. We test its performance with three different, commonly used, pretrained image-classification models (Inception-v3, ResNet-50, and VGG-16-BN) against three benchmark datasets: MNIST, CIFAR10, and ImageNet. Furthermore, we evaluate QuEry Attack's performance against non-differentiable transformation defenses and robust models. Our results demonstrate the superior performance of QuEry Attack, both in terms of accuracy score and query efficiency.
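    To make the score-based, gradient-free setting concrete, here is a minimal evolutionary-attack sketch under assumed hyperparameters (population size, mutation rate, toy linear classifier). It illustrates the general approach, not the published QuEry Attack objective or code: candidate perturbations inside an L-infinity ball are evolved using only the logits returned by the model, with fitness given by the classification margin of the true class.

```python
# Illustrative score-based evolutionary attack; the classifier is a stub and
# the fitness is a simple margin, not the paper's objective function.
import numpy as np

def margin_fitness(logits, true_label):
    """Negative margin of the true class: higher is better for an untargeted attack."""
    other = np.max(np.delete(logits, true_label))
    return other - logits[true_label]

def evolutionary_attack(logits_fn, image, true_label, eps=0.05,
                        pop=20, gens=100, mut_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    population = rng.uniform(-eps, eps, size=(pop, *image.shape))
    for _ in range(gens):
        fits = np.array([margin_fitness(logits_fn(np.clip(image + d, 0, 1)), true_label)
                         for d in population])
        if fits.max() > 0:                                   # misclassification reached
            break
        order = np.argsort(fits)[::-1]
        elite = population[order[: pop // 4]]                # keep the fittest quarter
        children = []
        while len(children) < pop - len(elite):
            a, b = elite[rng.integers(len(elite), size=2)]
            child = np.where(rng.random(image.shape) < 0.5, a, b)                   # uniform crossover
            mask = rng.random(image.shape) < mut_rate
            child = np.where(mask, rng.uniform(-eps, eps, image.shape), child)      # mutation
            children.append(child)
        population = np.concatenate([elite, np.stack(children)])
    fits = [margin_fitness(logits_fn(np.clip(image + d, 0, 1)), true_label) for d in population]
    return np.clip(image + population[int(np.argmax(fits))], 0, 1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.standard_normal((10, 3 * 32 * 32))               # toy linear "classifier"
    logits_fn = lambda im: W @ im.ravel()
    x = rng.random((3, 32, 32))
    label = int(np.argmax(logits_fn(x)))
    adv = evolutionary_attack(logits_fn, x, label)
    print("L-inf distance:", np.abs(adv - x).max(), "new label:", int(np.argmax(logits_fn(adv))))
```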