Neural Network Robustness Verification on GPUs by Müller, Christoph et al.
Neural Network Robustness Verification on GPUs
Christoph Mu¨ller, Gagandeep Singh, Markus Pu¨schel, Martin Vechev
Department of Computer Science
ETH Zurich, Switzerland
{christoph.mueller,gsingh,pueschel,martin.vechev}@inf.ethz.com
Abstract—Certifying the robustness of neural networks against
adversarial attacks is critical to their reliable adoption in real-
world systems including autonomous driving and medical diagno-
sis. Unfortunately, state-of-the-art verifiers either do not scale to
larger networks or are too imprecise to prove robustness, which
limits their practical adoption.
In this work, we introduce GPUPoly, a scalable verifier that
can prove the robustness of significantly larger deep neural
networks than possible with prior work. The key insight behind
GPUPoly is the design of custom, sound polyhedra algorithms for
neural network verification on a GPU. Our algorithms leverage
the available GPU parallelism and the inherent sparsity of the
underlying neural network verification task. GPUPoly scales to
very large networks: for example, it can prove the robustness of
a 1M neuron, 34-layer deep residual network in about 1 minute.
We believe GPUPoly is a promising step towards the practical
verification of large real-world networks.
I. INTRODUCTION
With the widespread adoption of deep neural networks
in many real-world applications such as face recognition,
autonomous driving, and medical diagnosis, it is critical to
ensure that they behave reliably on a wide range of inputs.
However, recent studies [1] have shown that trained networks
are vulnerable to adversarial examples.
In Fig. 1 we illustrate the concept of an adversarial example.
Here, a fully connected feedforward neural network (FFN) f
has been trained to classify images from the popular CIFAR10
dataset [2]. The network classifies image I0 from the test set
correctly as a car. However, an adversary can increase the
intensity of each pixel in I0 by a small imperceptible amount
to produce a new image I that still looks like a car to a human
but that the network wrongly classifies as a bird.
Neural network robustness. Given the susceptibility to
adversarial examples, there is growing interest in automated
methods for certifying the robustness of neural networks, that
is, proving that adversarial examples cannot occur within a
specified adversarial region. More formally, robustness certifi-
cation defines an adversarial region L∞(I0, ) as an L∞ ball
of radius  ∈ R around a correctly classified image I0 in the
test set. The goal then is to certify that the neural network
classifies all images in L∞(I0, ) correctly. The set L∞(I0, )
contains an exponential (in the image size) number of images,
which makes exhaustive enumeration infeasible. For example,
the CIFAR10 image I0 in Fig. 1 contains 3, 072 pixels. If
we consider an L∞ ball of radius  = 1/255 around I0,
then the number of images in L∞(I0, ) is 33072. We note
here that the  values used in our experiments (Section V) are
significantly larger than 1/255. Next, we discuss prior work on
𝐼𝑜
Car
𝐼 = 𝐼𝑜 +
1
255
Bird
Neural Network 𝑓
Neural Network 𝑓
Fig. 1: Image I0 (top) is classified correctly as a car by
the neural network, however image I (bottom), obtained by
increasing the intensity of each pixel by 1/255 is wrongly
classified as a bird.
neural network verification and highlight the key challenges
involved in the design of such verifiers.
Key challenge: Scalable and precise verification. Because
enumeration is infeasible, neural network verifiers compute the
output for all inputs in L∞(I0, ) symbolically. These verifiers
can be broadly classified as either exact or inexact. Exact
verifiers typically employ mixed-integer linear programming
(MILP) [3], SMT solvers [4]–[6] and Lipschitz optimization
[7]. They are computationally expensive and do not scale to
networks of the size considered in our work.
To address this scalability issue, inexact verifiers compute an
overapproximation of the network output corresponding to the
inputs in L∞(I0, ). Due to overapproximation, there can be
false positives, i.e., the verifier may fail to prove the network
robust when it actually is. The inexact verifiers are typically
based on abstract interpretation [8]–[11], duality [12], [13],
linear approximations [14]–[20], and semidefinite relaxations
[21]. There are also methods [22]–[24] that combine both exact
and inexact approaches aiming to be more scalable than exact
methods while improving the precision of inexact methods.
There is a tradeoff between the scalability and the degree of
overapproximation of inexact verifiers. More precise inexact
verifiers [8], [10], [11], [13]–[18], [21]–[23] scale to medium-
sized networks (≈ 100K neurons) but cannot handle the net-
works considered in our work (≈ 1M neurons). On the other
hand, more approximate verifiers [9], [19], [20] scale to larger
ar
X
iv
:2
00
7.
10
86
8v
1 
 [c
s.L
G]
  2
0 J
ul 
20
20
Input1 FC2 ReLU3 FC4 ReLU5 FC6 Output7
Input1 Conv2 ReLU3 Conv4 ReLU5 FC6 Output7
Input1 Conv2 ReLU3
Conv4
Conv5
ReLU6 Conv7 ReLU8 FC9 Output10
Fig. 2: Sketches of considered neural network architectures.
networks but lose too much precision (that is, may induce too
many false positives) which limits their applicability. Thus, a
key challenge today is the design of neural network verifiers
that scale to larger networks and maintain sufficient precision
to provide desired network robustness guarantees.
Scalable and precise network verification on a GPU. In
this work, we present our neural network verifier GPUPoly that
leverages the parallel processing power of GPUs for precise
and scalable verification. Designing such a verifier for GPUs
is challenging for three reasons: (i) the verification algorithm
has to offer the fine-grain data parallelism needed to benefit
from the GPU’s processing power, (ii) since GPU memory is
much smaller than CPU memory, it has to be memory efficient,
and (iii) it has to be sound for floating point arithmetic, i.e.,
it should capture all results possible under different rounding
modes and orders of execution of floating point operations.
To address these challenges, we build our GPU verifier
based on the recently introduced DeepPoly polyhedra ap-
proximations [11] and design custom algorithms for running
it efficiently on a GPU. The DeepPoly approximation is
among the more precise inexact methods and its underlying
algorithms are highly parallelizable. It is possible to implement
this approximation on a GPU using off-the-shelf libraries
such as PyTorch [25] and Tensorflow [26] as in [18], but
these cannot exploit the sparsity patterns produced by the
DeepPoly analysis. Thus, the resulting implementations lack
the performance and memory efficiency to scale to large
networks considered in our work (Section V). Indeed, the
existing scalable GPU-based verifiers [27], [28] using these
libraries sacrifice significant precision for scalability.
In this work, we design new algorithms for running Deep-
Poly approximation fast, precise, and sound on a GPU while
also ensuring memory efficiency by exploiting the sparsity
inherent in the DeepPoly analysis. Further, we design new
algorithms for handling the challenging residual layers which
DeepPoly does not support. GPUPoly is sound w.r.t. floating-
point arithmetic which is essential for verifier correctness [29].
Main contributions. Our main contributions are:
• Custom algorithms with explicit memory management to
efficiently parallelize the state-of-the-art precise Deep-
Poly approximation on GPUs, enabling fast and precise
verification of networks with up to ≈ 1M neurons.
• A novel approximation for handling residual layers that
balances precision and scalability.
• A complete floating-point-sound CUDA implementation
of our approach in a verifier called GPUPoly that handles
fully-connected, convolutional, and residual networks.
• An experimental evaluation of GPUPoly demonstrating
its effectiveness for handling large neural networks. Our
results show that GPUPoly can prove the robustness of
neural networks beyond the reach of prior work.
We next provide the necessary background on DeepPoly
analysis and GPUs in Section II. We describe our core
contributions for adapting DeepPoly analysis on the GPU in
Section III and Section IV.
II. BACKGROUND
We consider networks with fully-connected (FC), convolu-
tional (Conv), and ReLU layers. Fig. 2 shows sketches of the
considered neural network architectures. The fully-connected
and convolutional networks consist of a sequence of layers
with affine transformations and ReLUs. The residual networks
additionally have branches containing convolutional layers. At
the merging location of two branches the combined result
is obtained by adding the branch outputs neuronwise. The
networks used in our experiments have significantly more
layers than shown in Fig. 2. We number the layers as shown
in the superscripts in Fig. 2.
Next, we provide background on the neural network ver-
ification problem and the inner workings of the DeepPoly
approximation [11]. We conclude with a brief overview of
the parallelization mechanism offered by CUDA on GPUs.
A. DeepPoly approximation
We now introduce the commonly used L∞-norm based
robustness property and then show how DeepPoly approxima-
tions proves such a property on a toy fully-connected network.
L∞-norm based robustness properties. Given an image
I0 that is correctly classified by the neural network, the
adversarial region defined by L∞(I0, ),  > 0, includes all
images I such that the difference between the intensity of
each pixel in I and the corresponding pixel in I0 is ≤ .
The DeepPoly verifier tries to prove that the neural network
classifies all images I ∈ L∞(I0, ) correctly. If it succeeds
then all images are classified correctly. Otherwise, because of
approximations, the robustness of the network is unknown.
Verification of a small fully-connected network. We
use the word ”neuron” interchangeably for the abstract node
in a layer of the network and the concrete value of that
neuron during an operation (for example 0.24). We use xqi
to refer to the i-th neuron in layer q. Fig. 3 shows a fully-
connected network with 3 fully-connected layers (each an
x11
x12
x21
x22
x31
x32
x41
x42
x51
x52
x61
x62
1
-1
1
1
max(0, x21)
max(0, x22)
1
-1
1
1
max(0, x41)
max(0, x42)
[0,0.5]
[0,0.5]
1
1
0
1
〈x11 ≥ 0,
x
1
1 ≤ 0.5,
l
1
1 = 0, u
1
1 = 0.5〉
〈x12 ≥ 0,
x
1
2 ≤ 0.5,
l
1
2 = 0, u
1
2 = 0.5〉
〈x21 ≥ x
1
1 + x
1
2,
x
2
1 ≤ x
1
1 + x
1
2,
l
2
1 = 0, u
2
1 = 1〉
0
〈x22 ≥ x
1
1 − x
1
2,
x
2
2 ≤ x
1
1 − x
1
2,
l
2
2 = −0.5, u
2
2 = 0.5〉
0
〈x31 ≥ x
2
1,
x
3
1 ≤ x
2
1,
l
3
1 = 0, u
3
1 = 1〉
〈x32 ≥ 0,
x
3
2 ≤ 0.5 · x
2
2 + 0.25,
l
3
2 = 0, u
3
2 = 0.5〉
〈x41 ≥ x
3
1 + x
3
2,
x
4
1 ≤ x
3
1 + x
3
2,
l
4
1 = 0, u
4
1 = 1.25〉
0
〈x42 ≥ x
3
1 − x
3
2,
x
4
2 ≤ x
3
1 − x
3
2,
l
4
2 = −0.25, u
4
2 = 1〉
0
〈x51 ≥ x
4
1,
x
5
1 ≤ x
4
1,
l
5
1 = 0, u
5
1 = 1.25〉
〈x52 ≥ x
4
2,
x
5
2 ≤ 0.8 · x
4
2 + 0.2,
l
5
2 = −0.25, u
5
2 = 1〉
〈x61 ≥ x
5
1 + x
5
2 + 1,
x
6
1 ≤ x
5
1 + x
5
2 + 1,
l
6
1 = 1, u
6
1 = 3.05〉
1
〈x62 ≥ x
5
2,
x
6
2 ≤ x
5
2,
l
6
2 = −0.25, u
6
2 = 1〉
0
Fig. 3: Robustness verification of a simple fully connected network with DeepPoly.
affine transformation) and 2 ReLU layers with 2 neurons per
layer. The weights for the fully-connected layers are shown
on the blue edges and the biases are shown above and below
neurons. The original input to the network is the 2-D point
(0.25, 0.25) with the corresponding network output x61 = 2
and x62 = 0.5 satisfying x
6
1 > x
6
2. Now, we consider the set of
inputs where both input neurons x11 and x
1
2 can take any value
in the range [0, 0.5]. The verification problem is then to prove
that for all such inputs, the network outputs satisfy x61 > x
6
2.
The DeepPoly approximation associates four constraints
with every neuron xqi : (i) lower and upper polyhedral con-
straints of the form
∑
j aj ·xkj+c ≤ xqi and xqi ≤
∑
j a
′
j ·xkj+c′
respectively where 1 ≤ k < q, and (ii) concrete lower
and upper bounds lqi ≤ xqi ≤ uqi . Here, we have that
aj , a
′
j , c, c
′, lqi , u
q
i ∈ R. We refer to
∑
j aj · xkj + c and∑
j a
′
j ·xkj + c′ as the lower and upper polyhedral expressions
respectively. DeepPoly first sets constraints 0 ≤ x11, x12 ≤ 0.5,
l11 = l
1
2 = 0, and u
1
1 = u
1
2 = 0.5 for x
1
1, x
1
2. The first
fully-connected layer corresponds to the affine transformations
x21 := x
1
1 + x
1
2 and x
2
2 := x
1
1 − x12 which adds:
x11 + x
1
2 ≤ x21 ≤ x11 + x12, l21 = 0, u21 = 1,
x11 − x12 ≤ x22 ≤ x11 − x12, l22 = −0.5, u22 = 0.5,
Next, the ReLU layer is handled which corresponds to the
assignments x31 := max(0, x
2
1) and x
3
2 := max(0, x
2
2). The
neuron x21 takes only non-negative values as l
2
1 = 0 and thus
the ReLU activation passes all values of x21 to x
3
1 and hence
the verifier adds x21 ≤ x31 ≤ x21, l31 = 0, u31 = 1 for x31. Since
l22 < 0 and u
2
2 > 0, the neuron x
2
2 can take both positive and
negative values. As a result, ReLU sets x32 = 0 when x
2
2 < 0
and x32 = x
2
2 when x
2
2 ≥ 0. Capturing both cases is expensive
as one needs to consider both paths (x22 < 0 and x
2
2 ≥ 0).
DeepPoly approximates all possible values of x32 by adding:
0 ≤ x32 ≤ 0.5 · x22 + 0.25, l32 = 0, u32 = 0.5.
Notice that the polyhedral expressions in the constraints added
by DeepPoly for ReLU are sparse as they contain only the
input of the ReLU. We note that handling ReLU activations
is not a bottleneck for DeepPoly and thus we do not discuss
it in further detail. We refer the reader to [11] for detailed
discussion about handling ReLU activations with DeepPoly.
The next fully-connected layer adds the constraints:
x31 + x
3
2 ≤ x41 ≤ x31 + x32, x31 − x32 ≤ x42 ≤ x31 − x32,
for x41 and x
4
2, respectively.
Bottleneck backsubstitution. We note that x41 and x42 are
inputs to the next ReLU layer. The precision of the DeepPoly
ReLU approximations depends upon the tightness of the
concrete bounds l41, u
4
1, l
4
2, u
4
2. A straightforward substitution of
the concrete bounds l31, u
3
1, l
3
1, u
3
2 into the relational constraints
for x41 and x
4
2 results in imprecise values for l
4
1, u
4
1, l
4
2, u
4
2. This
is because the substitution ignores the relational information
from the polyhedral constraints.
DeepPoly employs backsubstitution for obtaining more pre-
cise concrete bounds for neurons xqi input to a ReLU layer.
Backsubstitution iteratively computes new polyhedral bounds
for each xqi by using the relational constraints of neurons in
the previous layers. Each iteration starts with the polyhedral
constraint for xqi expressed in terms of the neurons in layer k
with 2 ≤ k < q. The backsubstitution substitutes polyhedral
expressions of neurons in layer k defined in terms of neurons
in the layer k − 1 into the polyhedral constraint for xqi . The
substituted expressions usually have common terms that cancel
out, which, in turn, results in more precise bounds.
We next describe the computation of u41 with backsubstitu-
tion, other bounds are computed similarly. We substitute the
upper polyhedral expressions of the ReLU layer constraints
x31 ≤ x21 and x32 ≤ 0.5 · x22 + 0.25 for x31, x32 into the upper
polyhedra constraint x41 ≤ x31 + x32 obtaining x41 ≤ x21 +
0.5 · x22 + 0.25. Next we substitute the polyhedra expressions
of the constraints from the affine layer x21 ≤ x11 + x12 and
x22 ≤ x11 − x12 into x41 ≤ x21 + 0.5 · x22 + 0.25 obtaining
1.5 · x11 + 0.5 · x12 + 0.25. Notice that the terms containing x12
when substituting expressions for x21 and x
2
2 cancel out. We
finally substitute x11, x
1
2 ≤ 0.5 into x41 ≤ 1.5·x11+0.5·x12+0.25
obtaining u41 = 1.25.
Backsubstitution as matrix multiplication We consider
an iteration of the backsubstitution to compute bounds for the
neurons in layer q. The left matrix Mk in Fig. 4 encodes
Fig. 4: Backsubstitution in constraints for q-th layer neurons.
We omit the constant term in the constraints for simplicity.
constraints for q-th layer neurons with polyhedral expressions
defined over the neurons in layer 2 ≤ k < q. The center
matrix F k encodes constraints for k-th layer neurons defined
over the neurons in layer k−1. We focus on the computation of
the entry (h2, j2) (shown in blue) in the result matrix Mk−1.
The corresponding entries of Mk and F k used for computing
(h2, j2) are also shown in blue. The entry (h2, j2) encodes
the coefficient for neuron j2 of layer k − 1 in the constraint
for xqh2. The substitution (as defined above) computes (h2, j2)
by multiplying each entry (h2, i) in Mk with the entry (i, j2)
of F k where 1 ≤ i ≤ s. Each multiplication result represents
a term involving xk−1j2 obtained by substituting the expression
for xki in the constraint for x
q
h2. The results are then summed
which causes cancellation. This computation can be seen as
multiplying h2-th row of Mk with j2-th column of F k and
the overall computation of Mk−1 thus involves multiplying
Mk with F k.
If layer k is a ReLU layer, then the F k is a square
diagonal matrix as the corresponding expressions contain only
one neuron. Thus the backsubstitution through ReLU layers
can be implemented as cheap matrix-vector multiplication.
We note that this step is not a bottleneck. If layer k is
an affine layer (fully-connected, convolutional, residual), then
the polyhedral expressions in F k corresponds to the weight
matrix for transformation from layer k − 1 to layer k. The
backsubstitution in this case is more expensive and is the
focus of our work. We design custom algorithms with memory
management tailored to exploit the sparsity patterns observed
when handling convolutional and residual layers. We note
that while matrix multiplication can be easily parallelized
on GPUs, the standard algorithms [18] are not memory and
compute efficient for our task and run out of memory on
medium-sized benchmarks (≈ 100K neurons). Further, to
ensure floating point soundness we perform all computations
in interval arithmetic, which prevents the use of existing
libraries. We will provide more details on this later.
Proving the robustness property The analysis proceeds
and we obtain the constraints for all neurons in the network
shown in Fig. 3. The DeepPoly verifier computes the lower
bound for x61−x62 using backsubstitution (recall that our goal
was to prove that x61−x62 > 0). This final computation results
in the bound 1, which is sufficient to prove the property.
Asymptotic cost Consider a neural network with n affine
layers and with each layer containing at most N neurons.
The backsubstitution tasks for all neurons at an intermediate
layer q ≤ n perform a matrix multiplication of worst-case
asymptotic cost O(N3) for all preceding affine layers (we
ignore the quadratic cost of the ReLU layers) resulting in an
overall cost O(q ·N3). Because backsubstitution is performed
for every layer of the network, the total asymptotic cost of the
DeepPoly algorithm is O(n2 ·N3).
Since the backsubstitution through ReLU layers is cheap,
we will ignore these for the remainder of this paper. Without
loss of generality, we will present neural networks as just a
sequence of affine layers.
B. Parallelization in CUDA
We briefly introduce key features of GPUs that we exploit
in GPUPoly when redesigning DeepPoly for GPUs. We will
use the CUDA C++ language extension. There are two dif-
ferent concepts of parallelism defined for CUDA: blocks and
threads. CUDA blocks are similar to cores of a CPU and are
independent compute units. Thus, different machine code can
be executed on different blocks. A modern Nvidia graphics
card contains hundreds of these blocks.
CUDA threads are a SIMD (single instruction, multiple
data) mechanism allowing the execution of the same instruc-
tion sequence on multiple values of the same data type in
parallel. A block can execute 32 SIMD threads (called a warp)
in parallel and its is possible to allocate up to 32 warps to
each block to enable latency hiding. For optimal performance,
the number of threads per block should be divisible by 32.
Additionally, one needs to balance between providing the GPU
with a sufficient number of warps for context switching while
also not overusing block-wide resources, e.g., the number of
registers allocated per thread. The ideal number of threads per
block is often 128 or 256 in practice.
III. ROBUSTNESS VERIFICATION ON GPUS: CONCEPTS
AND ALGORITHMS
To utilize the highly parallel GPU hardware effectively,
an algorithm must provide a sufficient number of compute
tasks that can be distributed over all blocks and threads, and
its memory requirements must not exceed the comparatively
small GPU memory (compared to CPU memory). In this
section, we introduce concepts and techniques for designing
and implementing an efficient DeepPoly-based GPU algorithm
for verifying deep neural networks.
A. Network DAG and dependence set
For networks with convolutional layers, each row of the
matrices Mk, F k,Mk−1 (Fig. 4) usually contains only a few
non-zero entries. We exploit this sparsity for making DeepPoly
backsubstitution more compute and memory efficient. We now
introduce our concept of dependence set that we use for com-
puting a sparse representation of the matrices Mk, F k,Mk−1.
Let neuron xqi be the i-th neuron of layer q. The dependence
set of xqi contains the neurons in preceding layers k < q
affecting the value of xqi . Only the columns corresponding to
these neurons have non-zero entries in the row for xqi and are
Fig. 5: D1(xqi ) and D2(xqi ) for a neuron xqi with two preceding
convolutional layers.
therefore sufficient for defining polyhedral expressions in the
corresponding matrices. We first provide an example of the
dependence sets and then give their formal definition.
Example 1. We consider convolutional layers with filters of
size 3× 3 and a stride of 1 in both height (h) and width (w)
direction. Fig. 5 shows the subset of the neurons in layer q−1
influencing the value of neuron xqi . Note that the dimensions
3× 3× 4 of this subset directly correspond to the filter sizes.
The neurons in layer q − 2 that influences xqi contains all
neurons that influence the value of neurons in the computed
subset for layer q − 1. The size of this subset is 5× 5× 6.
To formally describe dependence sets, we first define the
network DAG (directed, acyclic graph; an example is Fig. 3)
associated with a neural network. As before we denote with xqi ,
the neuron i in layer q. In a network DAG (V, E), V is the set
of all neurons. Two neurons are connected by a directed edge
(xkj , x
q
i ) ∈ E if xkj is directly needed to compute xqi . More
formally, (xkj , x
q
i ) ∈ E if layer k is an immediate predecessor
(contains inputs) of layer q and layer q is fully-connected or
one of the following applies, depending on the type of layer
q (in parenthesis):
• (ReLU) j = i,
• (Convolutional) xkj is in the window for computing x
q
i ,
Note that for fully-connected and convolutional architectures
(top two rows in Fig. 2), we have that k = q − 1 while for
a residual architecture (last row in Fig. 2), layer q can have
multiple immediate predecessors k < q.
The first dependence set of a neuron xqi collects all its
immediate predecessors in the network DAG:
D1(xqi ) = {xkj |(xkj , xqi ) ∈ E}, (1)
Similarly for a set of neurons X q in the same layer q:
D1(Xq) =
⋃
xqi∈Xq
D1(xqi ) (2)
We extend this concept recursively. The m-th dependence
set, m ≥ 2, of xqi is the first dependence set of Dm−1(xqi ):
Dm(xqi ) = D1(Dm−1(xqi )) (3)
and the definition of Dm(X q) is analogous. We also define
the zeroth dependence set D0(xqi ) = {xqi }.
Finally, we define the full dependence set of a neuron xqi to
be the union of the dependence sets of all levels:
Dfull(xqi ) =
⋃
m∈{1...t}
Dm(xqi ) (4)
The full dependence set of a neuron xqi is the set of all neurons
in the preceding layers with a (computational) path to xqi .
Example 2. In the previous example of Fig. 3, Dfull(x61) =
{x11, x12, x21, x22, x31, x32, x41, x42, x51, x52}. Neuron x51 is included
even though the coefficient of x51 in the corresponding affine
transformation is 0. The reason is that our definition of the
network DAG does not consider coefficients of the affine
transformations for defining edges.
During DeepPoly analysis, all neurons appearing in the
polyhedral expressions when backsubstituting recursively on
the polyhedral constraints for xqi are available in the different
subsets Dq−k(xqi ) with k = q − 1, . . . , 0 of Dfull(xqi ). The
expression in the initial constraint contains neurons from
D1(xqi ) and we call it step 1 of the backsubstitution. Step
2 substitutes for each neuron in D1(xqi ), the polyhedral ex-
pression defined over the neurons in D2(xqi ) resulting in the
constraint for xqi to be defined over the neurons in D2(xqi ).
Continuing similarly, we see that Dq−k(xqi ) contains neurons
appearing in the expressions after q − k steps. In Section IV,
we exploit the structure of the convolutional layers to derive
recursive expressions for computing Dq−k(xqi ) that enable
fast computation with negligible overhead. Next we discuss
the backsubstitution for the different network types in greater
detail.
B. Fully-connected networks
In Section II-A we represented the backsubstitution step of
the DeepPoly analysis starting at layer q as a sequence of
matrix multiplications through all preceding layers k < q. For
fully-connected networks, the dependence sets Dq−k(xqi ) for
xqi are dense and include all neurons in layer k. Thus we use
dense matrix multiplication parallelized over the blocks of the
GPU and SIMD parallelized over the threads. The coefficients
in the polyhedral expressions are accordingly stored as dense
matrices. However, as we explain later, the computation is
done in interval arithmetic to ensure floating point soundness
(Section IV-D). Thus we could not use existing libraries and
our implementation is from scratch.
C. Convolutional networks
Naively using the same algorithm as for fully-connected
layers for the backsubstitution starting at a convolutional layer
q does not utilize the GPU efficiently. First, the majority
of the computations are not needed since the filters in the
convolutional layers are usually sparse and thus the filter
matrix F k of Fig. 4 consists of mostly zeroes. Additionally,
it is not memory efficient since many coefficients in matrices
Mk and Mk−1 of Fig. 4 will be zero during backsubstitution.
Key ideas. The neurons in the polyhedral expression for xqi
after k < q backsubstitution steps (starting from q) are in the
dependence set Dq−k(xqi ). For an efficient implementation on
GPUs, we use these sets in two ways:
• Using the set we can flatten the needed coefficients into
a dense matrix to perform the backsubstitution again
efficiently as matrix multiplication.
• All the backsubstitution tasks for a given layer (one per
neuron) are independent of each other. The sizes of the
dependence sets enable us to predict the memory needed
per task. This, in turn, enables us to determine the number
of tasks that can be allocated in the GPU memory without
exceeding its limits.
Size prediction and management As in Fig. 5, the first few
dependence sets of xqi are usually very small compared to the
size of the layers. If the number of layers is small, there is no
memory problem. For example, the total memory footprint of
the analysis of a 6-layer 230K neuron convolutional network
(e.g., ConvLarge in Table I later) fits into 16 GB of RAM.
For deeper networks the k-th dependence set will ultimately
grow to the size of the full layer. For example, assuming a
realistic layer size of 100K neurons, each backsubstitution task
then requires at least 16 · 100KB (16 since one double is read
and one written) per neuron after k steps. Doing so for all
neurons in layer q then amounts to 16 ·100 ·100KB = 160GB.
The complete algorithm has four times the memory require-
ment since both, the upper and lower, polyhedral constraints
are backsubstituted and the floating point soundness property
discussed in Section IV-D requires double the memory yet
again. The resulting 640GB exceed the memory of any GPU.
Our solution in this case is to predict the maximal mem-
ory footprint µ of one backsubstitution and then allocate t
backsubstitutions in the memory and distribute them over
the blocks of the GPU, where t is the largest number with
64 · t · µ ≤ λ where λ is the GPU memory size in bytes.
For each backsubstitution step of a single task, there are two
sets of coefficients in memory corresponding to the neurons
in Dq−k(xqi ) of the current layer k which we read (a row of
the matrix Mk in Fig. 4) and the ones in Dq−k+1(xqi ) of the
previous layer k − 1 which we compute (a row of the matrix
Mk−1 in Fig. 4). The largest memory footprint that occurs
during a single backsubstitution task is obtained by finding
the two neighboring layers k and k−1 with the biggest added
size of their dependence sets:
µ = max
k<q
(|Dq−k(xqi )|+ |Dq−k+1(xqi )|) (5)
We compute µ in advance, and use it to determine t, thus effi-
ciently using both, available parallelism and GPU memory. We
discuss the details of the actual implementation in Section IV.
D. Residual networks
To simplify the exposition of our ideas and without loss of
generality, we assume that the width of the residual network is
two, i.e., a layer has no more than two immediate predecessors
or successors. We further assume that the two branches of
a residual block contain one convolutional layer each. An
example of such an architecture is in Fig. 6 which shows
the residual block of the bottom network of Fig. 2 with all
ReLU layers removed for simplicity. The point-wise nature
of ReLU layers allows us to ignore them with respect to the
dependence set. The two branches of the residual block contain
one convolutional layer each.
. . . Conv2
Conv4
Conv5
Conv7 . . .
a
b
a
b
Fig. 6: Simplified residual architecture without ReLU layers.
We call the two branches a and b. In Fig. 6 branch a and b
contains Conv4 and Conv5 layer respectively. Naturally, the
layer at the head of the residual block (Conv2 in Fig. 6) has
two successors while the one at exit (Conv7 in Fig. 6) has
two predecessors.
The first dependence set of a neuron xqi in a layer at the exit
of a residual block (e.g. Conv7 in Fig. 6) contains neurons
from both branches (subsets of layers Conv4 and Conv5 in
Fig. 6). The resulting dependence set can be written as:
D1(xqi ) = D(1,a)(xqi ) ∪ D(1,b)(xqi ), (6)
where D(1,a)(xqi ) and D(1,b)(xqi ) are the first dependence
sets of xqi with respect to the branches a and b respectively.
The second dependence set of xqi is a subset of the layer
at the head of the residual block (Conv2 in Fig. 6) and can
then be expressed as the union of the first dependence sets of
D(1,a)(xqi ) and D(1,b)(xqi ):
D2(xqi ) = D1
(
D(1,a)(xqi ) ∪ D(1,b)(xqi )
)
(7)
= D1
(
D(1,a)(xqi )
)
∪ D1
(
D(1,b)(xqi )
)
(8)
For a residual block, we backsubstitute through both
branches independently and then join the independent back-
substitutions at the head of the block by adding the coefficients
of the expressions neuron by neuron.
For size prediction, we modify (5) for handling residual
blocks. The backsubstitution through the two branches is
performed in series and then the results are added. This
requires keeping a copy of the values of neurons in the layer
at the exit of the residual block (Conv7 in Fig. 6) while
backsubstituting through the a branch first.
Similarly we need to keep the results of the a branch (subset
of Conv4) while performing the backsubstitution through the
b branch. Thus compared to (5) we have to keep neuron
values of three dependence sets in memory at the same time.
The following equation examplifies this for a backsubstitution
starting from the exit layer of the residual block (Conv7 in
our example), where we retain a copy of the coefficients in
the exit layer while performing the backsubstitution through
branch a (from Conv4 to Conv2 in our example):
µr,a = |D(1,a)(xqi )|+ |D(2,a)(xqi )|+ |D0(xqi )|. (9)
In order to derive the µ for the size prediction (similar to
equation (5)), the maximum over all posible sizes over all
layers then has to be taken, which includes µr,a and also
includes µr,b associated with branch b.
IV. GPUPOLY
In this section we explain in greater detail the algorithms
and our implementation of the GPUPoly verifier. We first
explain how to compute the size and the elements of the
dependence set Dq−k(xqi ) of xqi in layer k < q. Then we
show our parallel algorithm for the backsubstitution in Deep-
Poly analysis. We then contrast this algorithm to the parallel
algorithm that was used for the CPU version of DeepPoly
[11] and to the well-known backpropagation algorithm [30]
used for training DNNs. Finally, we explain how we maintain
floating point soundness using interval arithmetic. As before,
we ignore the ReLU layers.
We handle the backsubstitution for the fully-connected
networks by parallelizing dense matrix-matrix multiplication
on a GPU as mentioned in Section III. We thus focus on the
most challenging convolutional and residual networks.
We will use the following notation for convolutional layers.
We represent the neurons in a convolutional layer k as a three-
dimensional block, shown as gray in Fig. 5. A neuron xki has a
unique three-dimensional index i = (w, h, d), where w and h
are the horizontal and vertical indices and d is the filter index.
To simplify the exposition and without loss of generality, we
assume that all parameters of convolutional layers have the
same value in horizontal h and vertical w directions, such as
filter sizes fkw = f
k
h = f
k, strides skw = s
k
h = s
k or the
padding pkw = p
k
h = p
k. We further assume the following
values for all layers: fk = 3, sk = 1 and pk = 0 (zero
padding). The remaining cases can be handled analogously.
A. Computing dependence sets for convolutional networks
In Fig. 5 we have seen examples of the first and second
dependence set of a neuron in a convolutional layer. Now we
derive the general equations for the size and location (or offset)
of the elements of (q − k)-th dependence set Dq−k(xqi ) as a
subset of the neurons in a convolutional layer k < q.
Dq−k(xqi ) is a cuboid and we compute the size of the
set Dq−k(xqi ) along the height, width and depth direction
separately. We note that a dependence set is always dense
in the depth direction for convolutional layers, so the size of
Dq−k(xqi ) in the depth direction is equal to the number of
filters of layer k. Because of the symmetry assumption for the
w and h direction of convolutional layers, the width and the
height of Dq−k(xqi ) are equal, and we denote it with W q−k.
The following recurrence computes W q−k+1 given W q−k:
W 0 = 1,
W q−k+1 = (W q−k − 1) · sk + fk, k = q . . . 1. (10)
For example, in Fig. 5 we obtain W 1 = (W 0− 1) · 1+ 3 = 3
for the first dependence set and W 2 = (W 1 − 1) · 1 + 3 = 5
for the second dependence set. The overall size of Dq−k is:
|Dq−k(xqi )| =W q−k ·W q−k · Ck, k = q − 1 . . . 0. (11)
where Ck is the number of channels of layer k. We note that
(11) can be plugged into (5) and (9) for size prediction. We
compute the neuron indexes next.
The indexes depend on the location of xqi in layer q. We only
need to derive the position in the width and the height direction
as all the corresponding channels of layer k are in Dq−k(xqi ).
Let the position of xqi in layer q be i = (w
q, hq, dq). Then the
w- and h-positions of the neuron with the smallest coordinates
in Dq−k(xqi ) are:
wq−k = Sq−k · wq, (12)
hq−k = Sq−k · hq, k = q − 1 . . . 0. (13)
where we introduced the quantity Sq−k, which we call accu-
mulated stride. It is computed using the following recurrence:
S0 = 1, (14)
Sq−k+1 = sk · Sq−k, k = q . . . 1. (15)
The extension to padding other than zero padding is analogous.
Using the the size and the position of Dq−k(xqi ) in layer
k, we can now compute, recursively, for k = q − 1 . . . 0 the
associated coefficients of the neurons in Dq−k(xqi ) occurring
in the backsubstituted expression. We store these in a dense
vector called Mq−k(xqi ). In each step these get modified by
the backsubstitution:
M1(xqi ) = (a1, a2, . . . , a|D1(xqi )|),
Mq−k+1(xqi ) = GBC(M
q−k(xqi ),Dq−k(xqi ),Dq−k+1(xqi ), F k).
M1(xqi ) contains the coefficients corresponding to the neurons
in the first dependence set in the initial polyhedra expression
before the backsubstitution starts bounding xqi . We ignore the
constant in the constraint for simplifying our exposition. GBC
(GPUPoly Backsubstitution for Convolution) is our algorithm
for handling a single step of a backsubstitution task in convolu-
tional networks, shown in Algorithm 1 and explained below.
F k is the constraint matrix between the neurons in layer k
and k − 1 generated during DeepPoly analysis (Fig. 4). As
in Section II, F k corresponds to the filter for convolutional
layers. We next explain GBC in greater detail.
B. Our algorithm for convolutional networks
In this section we describe our algorithm for verifying
convolutional networks on GPUs. We recall that the back-
substitution at layer q is done as one task per neuron, and all
these tasks are independent. We first use our size prediction
algorithm from Section III-C to determine the the number of
tasks t to run in parallel. Next, our algorithm distributes these
independent tasks over the GPU blocks. Since convolutional
layers usually contain tens of thousands of neurons and a
modern GPU consists of a few hundred blocks we can be
sure that all blocks are populated.
Next, we describe our mapping of a single backsubstitution
task for xqi on the SIMD thread-parallelized structure of a
single block. We focus on the backsubstitution step from
layer k to layer k − 1 (computation for the other steps is
similar). This step can be seen as a map from Mq−k(xqi ) to
Mq−k+1(xqi ). For convenience, we assume that both vectors
are reshaped as per their corresponding 3-D dependent sets.
Algorithm 1 Single backsubstitution step for xqi through
convolutional layers
1: function GBC(Mq−k, Dq−k,Dq−k+1,F k)
2: // Compute the polyhedra expression Mq−k+1 bounding xqi
3: Mq−k,Mq−k+1 ← submatrix for layer k and k − 1
4: Dq−k ← (q − k)-th dependence set of xqi
5: (W q−k,W q−k, Ck)← dimensions of Dq−k
6: Dq−k+1 ← (q − k + 1)-th dependence set of xqi
7: (W q−k+1,W q−k+1, Ck−1)← dimensions of Dq−k+1
8: (fk, fk)← filter size in w and h directions of layer k
9: (sk, sk)← strides in w and h directions for layer k
10: F k ← 4-D filter weight tensor of layer k
11: // Thread-parallelization dimension is consecutive in memory
12: (Ck, fk, fk, Ck−1)← dimensions of F k
13: // Serial loop
14: for (w, h) ∈ (0 :W q−k, 0 :W q−k) do
15: // Zero, one or two of these loops can be thread parallelized
16: // depending on if Ck−1 is big enough. If any of these is
17: // parallelized, then thread-sync is needed after going
18: // through 0..Ck loop
19: for (f, g) ∈ (0 : fk, 0 : fk) do
20: a = w · sk + f
21: b = h · sk + g
22: // Thread-parallelized loop
23: for c ∈ (0 : Ck−1) do
24: Mq−k+1[a][b][c] = 0
25: // Serial loop, reduction in every thread
26: for d ∈ (0 : Ck) do
27: // Coalesced read from F k
28: // Coalesced write to Dq−k+1
29: Mq−k+1[a][b][c] +=Mq−k[w][h][d]·F k[d][f ][g][c]
Algorithm 1 shows our algorithm for a single step (q−k →
q−k+1) of a backsubstitution task through convolutional lay-
ers. To avoid notational clutter, we write Dq−k(xqi ),Mq−k(xqi )
as simply Dq−k,Mq−k respectively for the rest of this section.
We use Matlab-style notation (i : j) = {i, . . . , j}
The two outermost loops of the algorithm are the iteration
on the w- and h-dimension of Dq−k at line 14. For a given
(w, h), we obtain a 1 × 1 × Ck subset of Dq−k. Since the
convolutional layers are fully-connected in the d-direction,
each neuron in this subset has the same first dependence set,
which is a fk × fk × Ck−1 block in layer k − 1. Fig. 5
shows such an example where the first dependence set of every
neuron in the orange 1× 1× 4 set in layer q − 1 is the same
3× 3× 6 set of neurons in layer q − 2.
We target thread-parallelization of the backsubstitution from
the 1 × 1 × Ck subset in Dq−k to the fk × fk × Ck−1
subset of Dq−k+1. Since the convolutional filter F k has
four dimensions, we have four loops, the two loops over
the filter sizes 0 : fk in w- and h-direction and the loops
(0 : Ck), (0 : Ck−1) over the d directions of layer k and k−1
respectively (lines 19–26).
We have multiple candidates for arranging the four loops. To
maximize the benefits from thread-parallelization, we require
coalescent reads and writes. Since we sum over Ck dimension
at line 29, this loop is not suited for thread parallelism. The two
(0 : fk) loops at line 19 can be thread-parallelized, however,
this requires a synchronization before moving to the next 1×
1× Ck block in layer k, otherwise there are race conditions.
The loop best suited for thread-parallelization is the loop
(0 : Ck−1) at line 23 and thus we parallelize over it. If Ck−1 <
128, then our parallelization of this loop does not fully utilize
the full power of thread parallelization. Thus our algorithm
additionally thread-parallelizes over either one or both of the
(0 : fk) loops at line 19 so that the total number of threads is
at least 128. This enables us to gain more parallelism which
improves performance at the cost of synchronization overhead.
Mixed layer backsubstitution. We discussed the backsub-
stitution algorithm in Section III-B where the starting layer q
and all preceding layers are fully-connected. The specialized
algorithm where all layers are convolutional was explained
above. Running GPUPoly on the convolutional network in
Fig. 2 (middle) requires handling the backsubstitution starting
from the fully-connected layer 6 with steps through the two
convolutional layers 2 and 4.
Let k0 be the first fully-connected layer following k0 − 1
convolutional layers in a convolutional neural network and let
all layers q ≥ k0 be fully-connected. For a backsubstitution
starting at layer q ≥ k0, we apply the algorithm for fully-
connected layers introduced in Section III-B for the first
q − k0 layers encountered during backsubstitution and apply
Algorithm 1 when backsubstituting through the remaining
convolutional layers. We note here that for all neurons xqi
in layer q the dependence set Dq−k corresponding to the
convolutional layer k contains all neurons of layer k.
Comparison to the parallel CPU implementation. Since
the number of CPU cores on modern machines is at least an or-
der of magnitude less than the number of blocks on a GPU, the
parallelized CPU implementation [11] of DeepPoly processes
smaller number of backsubstitution tasks than GPUPoly. Thus
the CPU implementation does not require explicit memory
management as the chunk size is already small (CPU memory
is also usually larger than GPU) but this also makes the imple-
mentation slow. The CPU implementation exploits sparsity in
the polyhedral expressions when performing backsubstitution
from the convolutional layers by storing the polyhedral ex-
pressions with a sparse representation, storing neuron indexes
and the corresponding coefficient. This representation does
not exploit the structure of the convolutional layers and is
not suitable for thread parallelization. In contrast, we exploit
structured sparsity in convolutional layers via dependent sets
which allows us to create smaller dense submatrices that are
suitable for thread parallelization.
Comparison to standard backpropagation We now com-
pare the backsubstitution used in DeepPoly to the standard
backpropagation algorithm [30] used for training neural net-
works. The latter is fundamentally different from our back-
substitution because it computes a scalar loss function and
propagates it back to update the network weights while the
DeepPoly backsubstitution propagates constraints backwards.
Further, the backpropagation is usually only performed starting
from the last layer which typically contains fewer neurons than
the intermediate layers. In contrast, the DeepPoly backsubsti-
tution is performed starting from all layers in the network.
Thus, we also have to backsubstitute starting from intermediate
convolutional or residual layers which typically contain orders
of magnitude more neurons than the last layer which makes
balancing the compute and the memory efficiency of our back-
substitution on GPUs more challenging (Section III). Overall,
based on the above factors, the DeepPoly backsubstitution is
mathematically different, computationally more expensive, and
more memory-demanding than the backpropagation.
C. Dependence sets and algorithm for residual networks
In Section III-D we discussed that backsubstitution through
a residual block involves independent backsubstitutions
through the two branches followed by neuronwise addition
of the two resulting coefficient vectors. We first note as per
(11), for a purely convolutional neural network the dependence
sets are cuboidal subsets of the convolutional layers. This is
not necessarily the case for residual blocks. We discuss the
computation of the dependence sets when backsubstituting
through a residual block next.
For the neuron xqi at the exit of a residual block (e.g. Conv
7
in Fig. 6), the two parts of the first dependence set D1(xqi ) =
D(1,a)(xqi ) ∪ D(1,b)(xqi ) in the branches a and b are cuboidal
just as in the purely convolutional case. However, the second
dependence set D2(xqi ) = D(2,a)(xqi ) ∪ D(2,b)(xqi ) (see also
(8)), which is a subset of layer at the head of the block (Conv2
in Fig. 6), is guaranteed to be cuboidal only if one of the two
following equations holds:
D(2,a)(xqi ) ⊆ D(2,b)(xqi ) or, (16)
D(2,b)(xqi ) ⊆ D(2,a)(xqi ), (17)
that is if one is a subset of the other or vice versa. We
note that the above condition is true for the networks used
in our experiments. For the general case, one can compute a
cuboid containing D(2,a)(xqi ) ∪ D(2,b)(xqi ) and continue the
backsubstitution with the resulting cuboid.
We use Algorithm 1 for the two independent backsubsti-
tutions through the two branches, however, the merging of
the two resulting dense vectors M (2,a)(xqi ) and M
(2,b)(xqi )
requires particular attention. Assuming that D2(xqi ) is cuboid,
we use equations (11) and (15) to calculate the relative
offset of the vectors with respect to each other: (w(2,a) −
w(2,b), h(2,a) − h(2,b)). The two dense coefficient vectors
M (2,a)(xqi ) and M
(2,b)(xqi ) representing the two resulting ex-
pressions are then combined by adding the values neuronwise.
TABLE I: Neural networks used in our experiments.
Dataset Model Type #Neurons #Layers Training
MNIST 6× 500 FC 3,010 6 Normal
ConvBig Conv 48K 6 DiffAI
ConvSuper Conv 88K 6 Normal
CIFAR10 6× 500 FC 3,010 6 Normal
ConvBig Conv 62K 6 DiffAI
ConvLarge Conv 230K 6 DiffAI
ResNetTiny Res 311K 12 PGD
ResNet18 Res 558K 18 PGD
ResNetTiny Res 311K 12 DiffAI
ResNet18 Res 558K 18 DiffAI
SkipNet18 Res 558K 18 DiffAI
ResNet34 Res 967K 34 DiffAI
Since a backsubstitution is performed starting from every
convolutional layer in the network, the last case we discuss is
a backsubstitution starting from a convolutional layer which
is part of a residual block, for example Conv4 of branch a
in Fig. 6. In this case we backsubstitute until reaching the
head of the block. Since there is no contribution from the b
branch, for a neuron xqi in this convolutional layer we simply
get D(1,a)(xqi ) = D1(xqi ) for the first dependence set and
M (1,a)(xqi ) = M
1(xqi ) for the corresponding dense vector.
The backsubstitution continues as in the purely convolutional
case based on Algorithm 1.
D. Floating point soundness
An essential property and major challenge is to ensure
floating point soundness of our analysis. This means that the
results of our analysis will always contain all results possible
under different rounding modes and under different orders
of computations. To guarantee this property, the coefficients
of all polyhedra constraints are stored as intervals and com-
putations such as matrix-matrix multiplication are performed
on intervals, as done, e.g., in [31], which also prevents the
use of existing CUDA libraries. Interval arithmetic doubles
the memory requirement and more than doubles the cost of
computations. For example, multiplying two intervals requires
four multiplications and determining the minimum and max-
imum of the results. Further, for soundness, this minimum
and maximum have to be rounded differently, namely towards
−∞ and +∞, respectively. In CUDA this is done using the
appropriate rounding intrinsics. We note that existing libraries
such as Tensorflow [26] and Pytorch [25] do not allow setting
the rounding mode on a per-operation basis, thus they cannot
be used for ensuring floating point soundness.
V. EXPERIMENTAL EVALUATION
We now demonstrate the effectiveness of GPUPoly for the
verification of large neural networks in terms of both precision
and performance. GPUPoly is implemented in C, supports 64-
bit double precision, and uses the CUDA library for GPU
support. We compare the effectiveness of GPUPoly against
two state-of-the-art verifiers: the CPU parallelized version of
DeepPoly [11] and HBox [9]. We note that DeepPoly has the
same precision as GPUPoly, however GPUPoly is up to 170x
TABLE II: Experimental results for 500 images on medium size neural networks: HBox [9] and DeepPoly [11] vs. GPUPoly
Dataset Model #Neurons Training  #Candidates Average runtime [s] #Verified
HBox DeepPoly GPUPoly HBox DeepPoly GPUPoly
MNIST 6× 500 3,010 Normal 8/255 493 0.001 8.3 0.08 0 334 334
ConvBig 48K DiffAI 3/10 487 0.05 12 0.35 439 441 441
ConvSuper 88K Normal 8/255 495 0.05 271 1.6 0 428 428
CIFAR10 6× 500 3,010 Normal 1/500 282 0.06 15 0.11 1 219 219
ConvBig 62K DiffAI 8/255 226 0.09 38 0.5 121 127 127
ConvLarge 230K DiffAI 8/255 232 0.27 309 2.5 131 138 138
TABLE III: Experimental results for 500 images on large neural networks: HBox [9] vs. GPUPoly
Dataset Model #Neurons Training  #Candidates Average runtime [s] #Verified
HBox GPUPoly HBox GPUPoly
CIFAR10 ResNetTiny 311K PGD 1/500 391 0.29 20 0 322
ResNet18 558K PGD 1/500 419 6.8 1021 0 324
ResNetTiny 311K DiffAI 8/255 184 0.29 5.0 118 127
SkipNet18 558K DiffAI 8/255 168 6.0 41 130 140
ResNet18 558K DiffAI 8/255 193 6.3 26 129 139
ResNet34 967K DiffAI 8/255 174 16 59 103 114
faster than DeepPoly on the networks that are small enough
for DeepPoly and can scale to larger networks. HBox is
implemented for GPUs using PyTorch and uses the Zonotope
approximations [32] for neural network verification. HBox
is more precise than interval based verifiers [20] and more
scalable than other GPU based verifiers [13], [18], [27]. We
note that [18] is a GPU implementation of DeepPoly based
on Pytorch but it runs out of memory on our smallest con-
volutional benchmark. As a result we do not compare against
it. Thus HBox is the most precise existing verifier that can
scale to the large neural networks used in our experiments. We
note that unlike GPUPoly and DeepPoly, HBox is not floating
point sound thus its verification results can be incorrect due
to floating point errors.
Our experimental results show that GPUPoly improves over
the state of the art by providing the most precise and scalable
verification results on all our benchmarks. We believe that the
extra scalability and precision of GPUPoly can also benefit
state-of-the-art robust training methods [18], [33] in the future
as they depend on approximate verifiers for training.
Neural networks. We used 12 deep neural networks in
our experiments. These networks were trained on the popular
MNIST [34] and CIFAR10 [2] datasets as shown in Table I.
Out of these, 3 are MNIST-based and 9 are CIFAR10-based.
Table I specifies the network architecture, the number of
neurons, the number of layers and the training method for each
network. There are 2 fully connected (FC), 4 convolutional
(Conv) and 6 residual (Res) architectures in Table I. The
largest neural network used in our experiments is ResNet34:
it has 34 layers and contains ≈1M neurons.
Regarding training, (i) 7 of our networks were trained using
DiffAI [9], [28], which performs provably robust adversarial
training, (ii) 2 of our CIFAR10 networks were trained using
Projected Gradient Descent (PGD) [35], [36], which amounts
to experimentally robust adversarial training, and (iii) the
remaining 3 networks were trained in a standard manner. Both
methods, (i) and (ii), aim to increase the robustness of the
resulting neural network.
Experimental setup. All our experiments for HBox and
GPUPoly were performed on a 2.2 GHZ 10 core Intel Xeon
Silver 4114 CPU with 512GB of main memory. The sizes
of the L1, L2, and L3 caches were 1KB, 32KB, and 14MB
respectively. The GPU on this machine was an Nvidia Tesla
V100 GPU with 16GB of memory. The PyTorch version
used for running HBox was 0.4.1 and the CUDA version
for GPUPoly was 10.2. The experiments for the (prior) CPU
version of DeepPoly were performed on a 2.6 GHz 14 core
Intel Xeon CPU E5-2690 with 512GB of memory. The sizes
of the L1, L2, and L3 caches were 32KB, 256KB, and 35MB
respectively. Note that the CPU for running the CPU version
of DeepPoly was faster than the one for HBox and GPUPoly.
Benchmarks. For each MNIST and CIFAR10 neural net-
work, we selected the first 500 images from the respective
test set and filtered out the images that were not classified
correctly. We call the correctly classified images from the test
set candidate images. The number of candidate images for
each neural network are shown in Table II and Table III.
For each candidate image I0, we defined the L∞(I0, )
based adversarial region by selecting challenging values of 
that are commonly used for testing the precision and scalability
of neural network verifiers. The  values used for defining
L∞(I0, ) for each neural network are shown in Table II
and Table III. We used larger values of  for testing DiffAI
trained networks since these networks can be proven robust
for larger values compared to the PGD and normally trained
networks. However, DiffAI-trained networks can suffer from a
substantial drop in test accuracy (see #Candidates in Table III).
Furthermore, these networks are also easier to verify and thus
even imprecise verifiers like HBox are able to verify large
number of robustness properties on these networks. We also
note that the  values for MNIST based networks are larger
than for CIFAR10 based networks for the same reason.
A. Results on medium sized networks
Table II compares the effectiveness of GPUPoly vs. Deep-
Poly and HBox on the robustness verification of our medium
sized fully connected and convolutional networks. All of these
networks contain 6 layers. The largest network among these
contains 230K neurons. We compare the precision of the three
verifiers by comparing the number of L∞(I0, ) regions where
the network is proven to be robust by the verifier. We also
compare the performance by reporting the average runtime in
seconds of the verifiers per L∞(I0, ) region.
Precision. Table II shows that both DeepPoly and GPUPoly
verify the network to be robust on more images than HBox.
The difference in precision is higher on the normally trained
networks than on the DiffAI trained ones. In total, on the
3 normally trained networks, both DeepPoly and GPUPoly
verify 981/1270 properties whereas HBox verifies only 1.
The reason why HBox verifies more properties on DiffAI
trained networks is because all inexact verifiers only sacrifice
precision for scalability on neurons that are input to a ReLU
and can take both positive and negative values during analysis
(e.g., neurons x22 and x
4
2 in Fig. 3). The number of such
neurons for DiffAI trained networks is lower than for normally
trained networks. On the 3 DiffAI trained networks, both
DeepPoly and GPUPoly verify 13 properties more than HBox.
Performance. HBox is the fastest verifier and has an aver-
age runtime of < 0.3 seconds on all benchmarks in Table II.
The speed of HBox comes at the cost of imprecision. The CPU
implementation of DeepPoly is much more precise than HBox
but is orders of magnitude slower than HBox and takes > 5
minutes on average for proving one robustness property on
the largest ConvLarge CIFAR10 network with 230K neurons
in Table II. GPUPoly in contrast achieves the same precision as
DeepPoly but is between 35x and 170x faster on all networks.
B. Results on large networks
Table III compares GPUPoly against HBox for verifying
our large CIFAR10 residual networks that are out of range for
the CPU version of DeepPoly. For example, DeepPoly takes
> 2 hours (when we timed out) per image on the ResNet18
network trained with PGD. One can observe that the PGD
trained networks classify more images correctly than DiffAI,
however they are less provably robust and we use smaller 
values for defining the robustness properties on these.
Precision. For the DiffAI trained networks, we noticed that
backsubstitution up to the first layer does not bring significant
precision gains over backsubstituting only until the end of
the first residual layer encountered during backsubstitution.
Thus for DiffAI trained networks, we only backsubstitute until
the end of the first encountered residual layer for improving
performance. We note that GPUPoly still proves 40 more
robustness properties on these networks than HBox with the
highest increase in precision (11 more properties verified) over
HBox taking place on our largest neural networks ResNet34
which is 34 layers deep and contains ≈ 1M neurons.
For the PGD trained networks, we backsubstitute until the
input layer otherwise significant precision is lost. HBox does
not prove any of the 810 properties on the two PGD trained
networks in Table III while GPUPoly proves 646.
Performance. It can be seen in Table III that HBox is faster
than GPUPoly due to the much simpler polyhedra abstraction
(zonotopes). GPUPoly runs slower when backsubstituting up
to the input layer on PGD trained networks. On all DiffAI
trained networks, limited backsubstitution allows GPUPoly to
verify robustness properties in <60 seconds.
Our work advances the state-of-the-art by precisely veri-
fying significantly larger CIFAR10 networks, with up to 1M
neurons, than possible with prior work. Based on our results,
we believe that our work is a step in the direction towards
scaling to even larger models such as those for ImageNet
which cannot be handled currently. Another limitation of our
work is that we cannot handle GNNs or GANs.
VI. CONCLUSION
We presented a scalable neural network verifier, called
GPUPoly, for verifying the robustness of various types of
deep neural networks on GPUs. GPUPoly leverages GPU
parallelization, sparsity in convolutional and residual networks,
and uses an adaptation mechanism that matches the memory
footprint to the available GPU memory. As a result, GPUPoly
scales precise, polyhedra-based verification to large neural
networks with ≈ 1M neurons. Our experimental results show
that GPUPoly verifies robustness properties of deep neural net-
works beyond the reach of existing state-of-the-art verifiers.
REFERENCES
[1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow,
and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint
arXiv:1312.6199, 2013.
[2] A. Krizhevsky, “Learning multiple layers of features from tiny images,”
Tech. Rep., 2009.
[3] V. Tjeng, K. Y. Xiao, and R. Tedrake, “Evaluating robustness of neural
networks with mixed integer programming,” in International Conference
on Learning Representations, (ICLR), 2019.
[4] G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer,
“Reluplex: An efficient SMT solver for verifying deep neural networks,”
in Computer Aided Verification - 29th International Conference, CAV
2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I, 2017.
[5] R. Ehlers, “Formal verification of piece-wise linear feed-forward neu-
ral networks,” in Automated Technology for Verification and Analysis
(ATVA), 2017.
[6] R. Bunel, I. Turkaslan, P. H. Torr, P. Kohli, and M. P. Kumar, “A unified
view of piecewise linear neural network verification,” in Proc. Advances
in Neural Information Processing Systems (NeurIPS), 2018, pp. 4795–
4804.
[7] W. Ruan, X. Huang, and M. Kwiatkowska, “Reachability analysis of
deep neural networks with provable guarantees,” in Proc. International
Joint Conference on Artificial Intelligence, (IJCAI), 2018.
[8] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, and
M. Vechev, “AI2: Safety and robustness certification of neural networks
with abstract interpretation,” in Proc. IEEE Symposium on Security and
Privacy (SP), vol. 00, 2018, pp. 948–963.
[9] M. Mirman, T. Gehr, and M. Vechev, “Differentiable abstract inter-
pretation for provably robust neural networks,” in Proc. International
Conference on Machine Learning (ICML), 2018, pp. 3575–3583.
[10] G. Singh, T. Gehr, M. Mirman, M. Pu¨schel, and M. Vechev, “Fast
and effective robustness certification,” in Proc. Advances in Neural
Information Processing Systems (NeurIPS), 2018, pp. 10 825–10 836.
[11] G. Singh, T. Gehr, M. Pu¨schel, and M. Vechev, “An abstract domain
for certifying neural networks,” Proc. ACM Program. Lang., vol. 3, no.
POPL, pp. 41:1–41:30, 2019.
[12] K. Dvijotham, R. Stanforth, S. Gowal, T. Mann, and P. Kohli, “A dual
approach to scalable verification of deep networks,” in Proc. Uncertainty
in Artificial Intelligence (UAI), 2018, pp. 162–171.
[13] E. Wong and Z. Kolter, “Provable defenses against adversarial exam-
ples via the convex outer adversarial polytope,” in Proc. International
Conference on Machine Learning (ICML), 2018, pp. 5286–5295.
[14] L. Weng, H. Zhang, H. Chen, Z. Song, C.-J. Hsieh, L. Daniel, D. Boning,
and I. Dhillon, “Towards fast computation of certified robustness for
ReLU networks,” in Proc. International Conference on Machine Learn-
ing (ICML), 2018, pp. 5276–5285.
[15] H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, and L. Daniel, “Ef-
ficient neural network robustness certification with general activation
functions,” in Proc. Advances in Neural Information Processing Systems
(NeurIPS), 2018.
[16] A. Boopathy, T.-W. Weng, P.-Y. Chen, S. Liu, and L. Daniel, “Cnn-cert:
An efficient framework for certifying robustness of convolutional neural
networks,” in AAAI, Jan 2019.
[17] H. Salman, G. Yang, H. Zhang, C. Hsieh, and P. Zhang, “A convex
relaxation barrier to tight robustness verification of neural networks,”
CoRR, vol. abs/1902.08722, 2019.
[18] H. Zhang, H. Chen, C. Xiao, S. Gowal, R. Stanforth, B. Li, D. Boning,
and C.-J. Hsieh, “Towards stable and efficient training of verifiably
robust neural networks,” in International Conference on Learning Rep-
resentations (ICLR), 2020.
[19] S. Wang, K. Pei, J. Whitehouse, J. Yang, and S. Jana, “Formal security
analysis of neural networks using symbolic intervals,” in Proc. USENIX
Conference on Security Symposium, ser. SEC’18, pp. 1599–1614.
[20] S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato,
R. Arandjelovic, T. A. Mann, and P. Kohli, “On the effectiveness of
interval bound propagation for training verifiably robust models,” CoRR,
vol. abs/1810.12715, 2018.
[21] A. Raghunathan, J. Steinhardt, and P. S. Liang, “Semidefinite relaxations
for certifying robustness to adversarial examples,” in Advances in Neural
Information Processing Systems (NeurIPS), 2018, pp. 10 877–10 887.
[22] S. Wang, K. Pei, J. Whitehouse, J. Yang, and S. Jana, “Efficient
formal safety analysis of neural networks,” in Proc. Advances in Neural
Information Processing Systems (NeurIPS), 2018, pp. 6369–6379.
[23] G. Singh, T. Gehr, M. Pu¨schel, and M. Vechev, “Boosting robustness
certification of neural networks,” in International Conference on Learn-
ing Representations (ICLR), 2019.
[24] G. Singh, R. Ganvir, M. Pu¨schel, and M. Vechev, “Beyond the single
neuron convex barrier for neural network certification,” in Advances in
Neural Information Processing Systems (NeurIPS), 2019.
[25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,
A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in
pytorch,” 2017.
[26] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow,
A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,
M. Kudlur, J. Levenberg, D. Mane´, R. Monga, S. Moore, D. Murray,
C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar,
P. Tucker, V. Vanhoucke, V. Vasudevan, F. Vie´gas, O. Vinyals,
P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng,
“TensorFlow: Large-scale machine learning on heterogeneous systems,”
2015. [Online]. Available: http://tensorflow.org/
[27] E. Wong, F. R. Schmidt, J. H. Metzen, and J. Z. Kolter, “Scaling provable
adversarial defenses,” in Proc. Neural Information Processing Systems
(NeurIPS), 2018, pp. 8410–8419.
[28] M. Mirman, G. Singh, and M. T. Vechev, “A provable defense for deep
residual networks,” CoRR, vol. abs/1903.12519, 2019.
[29] K. Jia and M. Rinard, “Exploiting verified neural networks via floating
point numerical error,” 2020.
[30] F. Grund, “Rall, louis b., automatic differentiation: Techniques and
applications. lecture notes in computer science 120,” ZAMM - Journal
of Applied Mathematics and Mechanics, vol. 62, no. 7, 1982.
[31] K. Ozaki, T. Ogita, S. M. Rump, and S. Oishi, “Fast algorithms for
floating-point interval matrix multiplication,” Journal of Computational
and Applied Mathematics, vol. 236, no. 7, pp. 1795 – 1814, 2012.
[32] K. Ghorbal, E. Goubault, and S. Putot, “The zonotope abstract domain
taylor1+,” in Proc. Computer Aided Verification (CAV), 2009, pp. 627–
633.
[33] M. Balunovic and M. Vechev, “Adversarial training and provable de-
fenses: Bridging the gap,” in International Conference on Learning
Representations (ICLR), 2020.
[34] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” in Proc. of the IEEE, 1998, pp. 2278–
2324.
[35] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “To-
wards deep learning models resistant to adversarial attacks,” in Proc.
International Conference on Learning Representations (ICLR), 2018.
[36] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting
adversarial attacks with momentum,” in Proc. Computer Vision and
Pattern Recognition (CVPR), 2018.
