[Re] Double Sampling Randomized Smoothing
This paper is a contribution to the reproducibility challenge in the field of
machine learning, specifically addressing the issue of certifying the
robustness of neural networks (NNs) against adversarial perturbations. The
proposed Double Sampling Randomized Smoothing (DSRS) framework overcomes the
limitations of existing methods by using an additional smoothing distribution
to improve the robustness certification. The paper provides a concrete
instantiation of DSRS for a generalized family of Gaussian smoothing
distributions and a computationally efficient implementation. Experiments on
MNIST and CIFAR-10 demonstrate the effectiveness of DSRS, which consistently
certifies larger robust radii than competing methods. Various ablation studies
further analyze the effects of hyperparameters and adversarial training methods
on the radii certified by the proposed framework.
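The framework builds on randomized smoothing, where the certified radius grows with the smoothed classifier's confidence under noise. The following is a minimal Python sketch of the standard single-distribution certification that DSRS extends; the model callable and the omitted confidence-bound machinery are assumptions, and DSRS's second smoothing distribution and optimization step are not shown.

    import numpy as np
    from scipy.stats import norm

    def certify_radius(model, x, sigma=0.5, n_samples=1000):
        # model: callable mapping a batch of noisy inputs to integer class
        # ids (an assumption for illustration). Returns (top_class, radius).
        noise = np.random.randn(n_samples, *x.shape) * sigma
        preds = model(x[None, ...] + noise)
        counts = np.bincount(preds)
        top = int(counts.argmax())
        p_a = counts[top] / n_samples      # estimated top-class probability
        if p_a <= 0.5:                     # abstain: nothing certifiable
            return top, 0.0
        # Cohen et al.-style L2 radius; a real certifier would lower-bound
        # p_a with a confidence interval rather than use the raw estimate.
        return top, sigma * norm.ppf(p_a)

DSRS additionally draws samples from a second smoothing distribution and combines both estimates in an optimization to certify a larger radius than this baseline quantity.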
Fixed Inter-Neuron Covariability Induces Adversarial Robustness
The vulnerability to adversarial perturbations is a major flaw of Deep Neural
Networks (DNNs) that raises questions about their reliability in real-world
scenarios. On the other hand, human perception, which DNNs are supposed to
emulate, is highly robust to such perturbations, indicating that there may be
certain features of the human perception that make it robust but are not
represented in the current class of DNNs. One such feature is that the activity
of biological neurons is correlated and the structure of this correlation tends
to be rather rigid over long spans of time, even if it hampers performance and
learning. We hypothesize that integrating such constraints on the activations
of a DNN would improve its adversarial robustness, and, to test this
hypothesis, we have developed the Self-Consistent Activation (SCA) layer, which
consists of neurons whose activations are consistent with each other, as they
conform to a fixed, but learned, covariability pattern. When evaluated on image
and sound recognition tasks, the models with an SCA layer achieved high
accuracy, and exhibited significantly greater robustness than multi-layer
perceptron models to state-of-the-art Auto-PGD adversarial attacks, without
being trained on adversarially perturbed data.
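A rough sense of the idea can be conveyed in code. The sketch below is a hypothetical PyTorch layer, not the authors' implementation: it constrains activations by projecting them onto a fixed, learned low-rank basis, so that outputs conform to a rigid covariability pattern; all names and the exact projection are assumptions.

    import torch
    import torch.nn as nn

    class SelfConsistentActivation(nn.Module):
        # Hypothetical reading of an SCA-style layer: activations are
        # projected onto a learned subspace, forcing neurons to covary
        # according to a fixed pattern.
        def __init__(self, n_neurons: int, rank: int):
            super().__init__()
            # Learned basis spanning the allowed covariability subspace.
            self.basis = nn.Parameter(torch.randn(n_neurons, rank) / rank ** 0.5)

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # Least-squares projection h -> B (B^T B)^{-1} B^T h, keeping
            # only the component of h consistent with the learned pattern.
            B = self.basis
            gram_inv = torch.linalg.inv(B.T @ B)   # (rank, rank)
            return h @ B @ gram_inv @ B.T          # (batch, n_neurons)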
And/or trade-off in artificial neurons: impact on adversarial robustness
Since its discovery in 2013, the phenomenon of adversarial examples has
attracted a growing amount of attention from the machine learning community. A
deeper understanding of the problem could lead to a better comprehension of how
information is processed and encoded in neural networks and, more generally,
could help to solve the issue of interpretability in machine learning. Our idea
to increase adversarial resilience starts with the observation that artificial
neurons can be divided into two broad categories: AND-like neurons and OR-like
neurons. Intuitively, the former are characterised by a relatively low number
of combinations of input values which trigger neuron activation, while for the
latter the opposite is true. Our hypothesis is that the presence in a network
of a sufficiently high number of OR-like neurons could lead to classification
"brittleness" and increase the network's susceptibility to adversarial attacks.
After constructing an operational definition of a neuron's AND-like behaviour,
we
proceed to introduce several measures to increase the proportion of AND-like
neurons in the network: L1-norm weight normalisation; application of an input
filter; and a comparison between each neuron's output distribution on the
actual data set and the distribution obtained when the network is fed a
randomised version of it, called a "scrambled data set". Tests performed on the
MNIST data set suggest that the proposed measures are an interesting direction
to explore.
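Of the measures listed, the first is the easiest to illustrate. Below is a minimal, hedged Python sketch of per-neuron L1 weight normalisation; the tensor shapes and names are assumptions, not the paper's code.

    import torch

    def l1_normalize_rows(weight: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # Rescale each row (one neuron's incoming weights) to unit L1 norm,
        # one of the proposed measures for increasing the proportion of
        # AND-like neurons in the network.
        l1 = weight.abs().sum(dim=1, keepdim=True)
        return weight / (l1 + eps)

In a training loop this would presumably be applied to each linear layer's weight matrix after every update, though where exactly the normalisation sits is an assumption here.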
Open Sesame! Universal Black Box Jailbreaking of Large Language Models
Large language models (LLMs), designed to provide helpful and safe responses,
often rely on alignment techniques to keep their outputs consistent with user
intent and social
guidelines. Unfortunately, this alignment can be exploited by malicious actors
seeking to manipulate an LLM's outputs for unintended purposes. In this paper,
we introduce a novel approach that employs a genetic algorithm (GA) to
manipulate LLMs when model architecture and parameters are inaccessible. The GA
attack works by optimizing a universal adversarial prompt that -- when combined
with a user's query -- disrupts the attacked model's alignment, resulting in
unintended and potentially harmful outputs. Our novel approach systematically
reveals a model's limitations and vulnerabilities by uncovering instances where
its responses deviate from expected behavior. Through extensive experiments, we
demonstrate the efficacy of our technique, thus contributing to the ongoing
discussion on responsible AI development by providing a diagnostic tool for
evaluating and enhancing the alignment of LLMs with human intent. To our
knowledge, this is the first automated universal black-box jailbreak attack.
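To make the mechanics concrete, here is a minimal, self-contained Python sketch of a black-box genetic search over adversarial suffixes in the spirit of the described attack. The fitness function is a placeholder assumption; in a real attack it would query the target LLM and score how far the response deviates from a refusal.

    import random

    VOCAB = [f"tok{i}" for i in range(1000)]     # stand-in token vocabulary

    def fitness(suffix):
        # Placeholder (assumption): higher = response deviates more from
        # aligned behavior. A real attack queries the black-box model here.
        return random.random()

    def mutate(ind, rate=0.1):
        return [random.choice(VOCAB) if random.random() < rate else t
                for t in ind]

    def crossover(a, b):
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:]

    def ga_attack(pop_size=20, suffix_len=16, generations=50):
        pop = [[random.choice(VOCAB) for _ in range(suffix_len)]
               for _ in range(pop_size)]
        for _ in range(generations):
            ranked = sorted(pop, key=fitness, reverse=True)
            elite = ranked[: pop_size // 4]      # keep the fittest quarter
            children = [mutate(crossover(random.choice(elite),
                                         random.choice(elite)))
                        for _ in range(pop_size - len(elite))]
            pop = elite + children
        return max(pop, key=fitness)             # best suffix found

Because only model responses are scored, no gradients or parameters are needed, which is what makes the search black-box.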