I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models
Modern image-to-text systems typically adopt the encoder-decoder framework,
which comprises two main components: an image encoder, responsible for
extracting image features, and a transformer-based decoder, used for generating
captions. Taking inspiration from the analysis of neural networks' robustness
against adversarial perturbations, we propose a novel gray-box algorithm for
creating adversarial examples in image-to-text models. Unlike image
classification tasks that have a finite set of class labels, finding visually
similar adversarial examples in an image-to-text task poses greater challenges
because the captioning system allows for a virtually infinite space of possible
captions. In this paper, we present a gray-box adversarial attack on
image-to-text models, in both untargeted and targeted variants. We formulate the process of
discovering adversarial perturbations as an optimization problem that uses only
the image-encoder component, meaning the proposed attack is language-model
agnostic. Through experiments conducted on the ViT-GPT2 model, the most widely
used image-to-text model on Hugging Face, and the Flickr30k dataset, we
demonstrate that our proposed attack successfully generates visually similar
adversarial examples, with both untargeted and targeted captions. Notably, our
attack operates in a gray-box manner, requiring no knowledge about the decoder
module. We also show that our attack fools the popular open-source platform
Hugging Face.
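
The following is a minimal sketch of the encoder-only optimization idea described above; it is not the authors' implementation. It assumes a differentiable image encoder `encoder` (e.g., the vision tower of an encoder-decoder captioner) and illustrative names and hyperparameters (`grey_box_attack`, `epsilon`, `target_features`).

```python
# Minimal sketch (illustrative, not the paper's code): optimize a bounded
# perturbation using only the image encoder, so the attack never touches
# the language-model decoder.
import torch

def grey_box_attack(encoder, image, target_features=None,
                    epsilon=8 / 255, steps=200, lr=1e-2):
    # Start from small random noise so the untargeted loss has a nonzero gradient.
    delta = (1e-3 * torch.randn_like(image)).requires_grad_()
    optimizer = torch.optim.Adam([delta], lr=lr)
    with torch.no_grad():
        clean_features = encoder(image)

    for _ in range(steps):
        features = encoder((image + delta).clamp(0, 1))
        if target_features is not None:
            # Targeted: pull the perturbed image's features toward the target.
            loss = torch.nn.functional.mse_loss(features, target_features)
        else:
            # Untargeted: push the features away from the clean image's features.
            loss = -torch.nn.functional.mse_loss(features, clean_features)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Keep the perturbation visually small (L-infinity ball).
        delta.data.clamp_(-epsilon, epsilon)

    return (image + delta).detach().clamp(0, 1)
```

In the targeted case, `target_features` would be the encoder features of an image whose caption one wants to induce; in the untargeted case the perturbation simply drives the features away from those of the clean image.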
Open Sesame! Universal Black Box Jailbreaking of Large Language Models
Large language models (LLMs), designed to provide helpful and safe responses,
often rely on alignment techniques to align with user intent and social
guidelines. Unfortunately, this alignment can be exploited by malicious actors
seeking to manipulate an LLM's outputs for unintended purposes. In this paper
we introduce a novel approach that employs a genetic algorithm (GA) to
manipulate LLMs when model architecture and parameters are inaccessible. The GA
attack works by optimizing a universal adversarial prompt that -- when combined
with a user's query -- disrupts the attacked model's alignment, resulting in
unintended and potentially harmful outputs. Our novel approach systematically
reveals a model's limitations and vulnerabilities by uncovering instances where
its responses deviate from expected behavior. Through extensive experiments we
demonstrate the efficacy of our technique, thus contributing to the ongoing
discussion on responsible AI development by providing a diagnostic tool for
evaluating and enhancing alignment of LLMs with human intent. To our knowledge,
this is the first automated universal black-box jailbreak attack.
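
A minimal sketch of a genetic algorithm of this kind is shown below; it is illustrative only, not the paper's implementation. It assumes a black-box scoring function `score_fn` that queries the target LLM with a prompt suffix and returns a scalar fitness (higher meaning the model's alignment is more disrupted), and a token list `vocab`.

```python
# Minimal sketch (illustrative): genetic-algorithm search for a universal
# adversarial suffix against a black-box LLM. Only forward queries are used.
import random

def ga_jailbreak(score_fn, vocab, suffix_len=20, pop_size=50,
                 generations=100, mutation_rate=0.1, elite=5):
    # Random initial population of token suffixes.
    population = [[random.choice(vocab) for _ in range(suffix_len)]
                  for _ in range(pop_size)]

    for _ in range(generations):
        scored = sorted(population, key=score_fn, reverse=True)
        next_gen = scored[:elite]                      # elitism: keep the best
        while len(next_gen) < pop_size:
            p1, p2 = random.sample(scored[:pop_size // 2], 2)
            cut = random.randrange(1, suffix_len)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [random.choice(vocab) if random.random() < mutation_rate
                     else tok for tok in child]        # random token mutation
            next_gen.append(child)
        population = next_gen

    return max(population, key=score_fn)
```

In practice the fitness evaluation dominates the cost, since each call to `score_fn` is one (or more) queries to the attacked model across a set of user prompts.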
Foiling Explanations in Deep Neural Networks
Deep neural networks (DNNs) have greatly impacted numerous fields over the
past decade. Yet despite exhibiting superb performance over many problems,
their black-box nature still poses a significant challenge with respect to
explainability. Indeed, explainable artificial intelligence (XAI) is crucial in
several fields, wherein the answer alone -- sans a reasoning of how said answer
was derived -- is of little value. This paper uncovers a troubling property of
explanation methods for image-based DNNs: by making small visual changes to the
input image -- hardly influencing the network's output -- we demonstrate how
explanations may be arbitrarily manipulated through the use of evolution
strategies. Our novel algorithm, AttaXAI, a model-agnostic, adversarial attack
on XAI algorithms, only requires access to the output logits of a classifier
and to the explanation map; these weak assumptions render our approach highly
useful where real-world models and data are concerned. We compare our method's
performance on two benchmark datasets -- CIFAR100 and ImageNet -- using four
different pretrained deep-learning models: VGG16-CIFAR100, VGG16-ImageNet,
MobileNet-CIFAR100, and Inception-v3-ImageNet. We find that the XAI methods can
be manipulated without the use of gradients or other model internals. Our novel
algorithm is successfully able to manipulate an image in a manner imperceptible
to the human eye, such that the XAI method outputs a specific explanation map.
To our knowledge, this is the first such method in a black-box setting, and we
believe it has significant value where explainability is desired, required, or
legally mandatory.
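
The sketch below illustrates the black-box setting described above with a generic evolution-strategy update; AttaXAI's actual algorithm and objective differ, and `model`, `explain`, `target_map`, and all hyperparameters here are assumptions for illustration. The key point is that only forward queries (logits and explanation maps) are used, never gradients.

```python
# Minimal sketch (illustrative, not AttaXAI itself): a query-only,
# NES-style evolution strategy that nudges an image so its explanation map
# approaches a target map while keeping the classifier's logits nearly unchanged.
import numpy as np

def attack_explanation(model, explain, image, target_map,
                       sigma=0.01, lr=0.5, iters=500, pop=20, alpha=1e-3):
    x = image.copy()
    base_logits = model(image)

    def fitness(candidate):
        # Lower is better: match the target explanation, preserve the logits.
        map_loss = np.mean((explain(candidate) - target_map) ** 2)
        logit_drift = np.mean((model(candidate) - base_logits) ** 2)
        return map_loss + alpha * logit_drift

    for _ in range(iters):
        noises = [np.random.randn(*x.shape) for _ in range(pop)]
        scores = np.array([fitness(x + sigma * n) for n in noises])
        # Fitness shaping: better (lower-score) samples get larger positive weights.
        weights = (scores.mean() - scores) / (scores.std() + 1e-8)
        grad_est = sum(w * n for w, n in zip(weights, noises)) / (pop * sigma)
        x = np.clip(x + lr * sigma * grad_est, 0, 1)

    return x
```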
Functional immunomics: Microarray analysis of IgG autoantibody repertoires predicts the future response of NOD mice to an inducer of accelerated diabetes
One's present repertoire of antibodies encodes the history of one's past
immunological experience. Can the present autoantibody repertoire be consulted
to predict resistance or susceptibility to the future development of an
autoimmune disease? Here we developed an antigen microarray chip and used
bioinformatic analysis to study a model of type 1 diabetes developing in
non-obese diabetic (NOD) male mice in which the disease was accelerated and
synchronized by exposing the mice to cyclophosphamide at 4 weeks of age. We
obtained sera from 19 individual mice, treated the mice to induce
cyclophosphamide-accelerated diabetes (CAD), and found, as expected, that 9
mice became severely diabetic while 10 mice permanently resisted diabetes. We
again obtained serum from each mouse after CAD induction. We then analyzed the
patterns of antibodies in the individual mice to 266 different antigens spotted
on the antigen chip. We identified a select panel of 27 different antigens (10%
of the array) that revealed a pattern of IgG antibody reactivity in the pre-CAD
sera that discriminated between the mice resistant or susceptible to CAD with
100% sensitivity and 82% specificity (p=0.017). Surprisingly, the set of IgG
antibodies that was informative before CAD induction did not separate the
resistant and susceptible groups after the onset of CAD; new antigens became
critical for post-CAD repertoire discrimination. Thus, at least for a model
disease, present antibody repertoires can predict future disease; predictive
and diagnostic repertoires can differ; and decisive information about immune
system behavior can be mined by bioinformatic technology. Repertoires matter.
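
As a loose illustration of this kind of analysis (not the paper's actual pipeline), the sketch below shows how a small antigen panel might be selected and evaluated by leave-one-out classification on a mice-by-antigens reactivity matrix, with sensitivity and specificity computed from the held-out predictions. The panel-selection method, classifier, and variable names are placeholders.

```python
# Minimal sketch (illustrative only): leave-one-out discrimination of
# resistant vs. susceptible mice from an IgG reactivity matrix, with a
# simple univariate antigen-panel selection step inside each fold.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

def loo_predictions(reactivity, labels, panel_size=27):
    """reactivity: (n_mice, n_antigens) array; labels: 1 = became diabetic."""
    model = make_pipeline(SelectKBest(f_classif, k=panel_size),
                          KNeighborsClassifier(n_neighbors=3))
    preds = np.empty_like(labels)
    for train_idx, test_idx in LeaveOneOut().split(reactivity):
        # Panel selection is refit inside each fold to avoid information leakage.
        model.fit(reactivity[train_idx], labels[train_idx])
        preds[test_idx] = model.predict(reactivity[test_idx])
    return preds

def sensitivity_specificity(labels, preds):
    tp = np.sum((preds == 1) & (labels == 1))
    tn = np.sum((preds == 0) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    return tp / (tp + fn), tn / (tn + fp)
```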
An Evolutionary, Gradient-Free, Query-Efficient, Black-Box Algorithm for Generating Adversarial Instances in Deep Convolutional Neural Networks
Deep neural networks (DNNs) are sensitive to adversarial data in a variety of scenarios, including the black-box scenario, where the attacker is only allowed to query the trained model and receive an output. Existing black-box methods for creating adversarial instances are costly, often using gradient estimation or training a replacement network. This paper introduces Query-Efficient Evolutionary Attack -- QuEry Attack -- an untargeted, score-based, black-box attack. QuEry Attack is based on a novel objective function that can be used in gradient-free optimization problems. The attack only requires access to the output logits of the classifier and is thus not affected by gradient masking. No additional information is needed, rendering our method more suitable to real-life situations. We test its performance with three different, commonly used, pretrained image-classification models -- Inception-v3, ResNet-50, and VGG-16-BN -- against three benchmark datasets: MNIST, CIFAR10, and ImageNet. Furthermore, we evaluate QuEry Attack's performance on non-differentiable transformation defenses and robust models. Our results demonstrate the superior performance of QuEry Attack, both in terms of accuracy and query efficiency.
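
For intuition, here is a minimal sketch of a score-based, gradient-free evolutionary attack that only queries the classifier's logits; it is not QuEry Attack's exact objective function or evolutionary operators. `logits_fn`, the margin-based fitness, and all hyperparameters are assumptions for illustration.

```python
# Minimal sketch (illustrative): a simple score-based evolutionary attack.
# `logits_fn(x)` returns the class scores for an image in [0, 1]; the fitness
# is the true class's margin over the best other class (< 0 = misclassified).
import numpy as np

def evolutionary_attack(logits_fn, image, true_label,
                        epsilon=0.05, pop=30, generations=200, sigma=0.02):
    def project(x):
        # Stay inside the L-infinity ball around the clean image and in [0, 1].
        return np.clip(np.clip(x, image - epsilon, image + epsilon), 0.0, 1.0)

    def fitness(x):
        scores = logits_fn(x)
        return scores[true_label] - np.max(np.delete(scores, true_label))

    best = project(image + np.random.uniform(-epsilon, epsilon, image.shape))
    best_fit = fitness(best)
    for _ in range(generations):
        for _ in range(pop):
            # Mutate the current best candidate; keep any fitter offspring.
            candidate = project(best + sigma * np.random.randn(*image.shape))
            cand_fit = fitness(candidate)
            if cand_fit < best_fit:
                best, best_fit = candidate, cand_fit
        if best_fit < 0:
            break  # adversarial example found
    return best
```

Each fitness evaluation costs one model query, so the total query budget here is at most `generations * pop` plus one for the initial candidate.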