Reversible designs for extreme memory cost reduction of CNN training by Hascoet, Tristan et al.
Reversible designs for extreme memory cost reduction of CNN
training
Tristan Hascoet∗†
Kobe University
tristan@people.kobe-u.ac.jp
Quentin Febvre∗
Kobe University
Sicara
quentin.febvre@gmail.com
Yasuo Ariki
Kobe University
ariki@kobe-u.ac.jp
Tetsuya Takiguchi
Kobe University
takigu@kobe-u.ac.jp
∗Equal contribution
†Corresponding author
Abstract
Training Convolutional Neural Networks (CNNs) is a resource-intensive task that requires specialized hardware for efficient computation. One of the most limiting bottlenecks of CNN training is the memory cost of storing the activation values of hidden layers, which are needed to compute the weight gradients during the backward pass of the backpropagation algorithm. Recently, reversible architectures have been proposed to reduce the memory cost of training large CNNs by reconstructing the input activation values of hidden layers from their outputs during the backward pass, circumventing the need to accumulate these activations in memory during the forward pass. In this paper, we push this idea to the extreme and analyze reversible network designs yielding a minimal training memory footprint. We investigate the propagation of numerical errors in long chains of invertible operations and analyze their effect on training. We introduce the notion of pixel-wise memory cost to characterize the memory footprint of model training, and propose a new model architecture able to efficiently train arbitrarily deep neural networks with a minimum memory cost of 352 bytes per input pixel. This new kind of architecture enables training large neural networks on very limited memory, opening the door to neural network training on embedded devices or non-specialized hardware. For instance, we demonstrate training of our model to 93.3% accuracy on the CIFAR10 dataset within 67 minutes on a low-end Nvidia GTX750 GPU with only 1GB of memory.
1 Introduction
Over the last few years, Convolutional Neural Net-
works (CNN) have enabled unprecedented progress
on a wide array of computer vision tasks. One dis-
advantage of these approaches is their resource con-
sumption: Training deep models within a reasonable
amount of time requires special Graphical Processing
Units (GPU) with numerous cores and large mem-
ory capacity. Given the practical importance of these
models, a lot of research effort has been directed to-
wards algorithmic and hardware innovations to im-
prove their resource efficiency such as low-precision
arithmetic [1], network pruning for inference [2], or
efficient stochastic optimization algorithms [3].
arXiv:1910.11127v1 [cs.CV] 24 Oct 2019
In this paper, we focus on a particular aspect of resource efficiency: optimizing the memory cost of training CNNs. We envision several potential benefits from the ability to train large neural networks within limited memory:
Democratization of Deep Learning research: Training large CNNs requires special GPUs with large memory capacity. The memory capacity of typical desktop GPUs is too small for training large CNNs. As a result, getting into deep learning research comes with the barrier cost of either buying specialized hardware or renting live instances from cloud service providers. Reducing the memory cost of deep model training would allow training deep nets on standard graphics cards without the need for specialized hardware, effectively removing this barrier cost. In this paper, we demonstrate efficient training of a CNN on the CIFAR10 dataset (93.3% accuracy within 67 minutes) on an Nvidia GTX750 with only 1GB of memory.
On-device training: With mobile applications, a lot of attention has been given to optimizing inference on edge devices with limited computation resources. Training state-of-the-art CNNs on embedded devices, however, has received little attention so far. Efficient on-device training is challenging because of the power efficiency, computation and memory optimization problems it involves. As such, CNN training has thus far been relegated to large cloud servers, and trained CNNs are typically deployed to embedded device fleets over the network. On-device training would allow bypassing these server-client interactions over the network. We can think of several potential applications of on-device training, including:
• Life-long learning: Autonomous systems de-
ployed in evolving environments like drones,
robots or sensor networks might benefit from
continuous life-long learning to adapt to their
changing environment. On-device training
would enable such applications without the expensive
communication burden of having edge
devices continuously sending their data to re-
mote servers over the network. It would also
provide resilience to network failures in critical
application scenarios.
• In privacy-critical applications such as biomet-
ric mobile phone authentication, users might
not want to have their data sent over the net-
work. On-device training would allow fine-
tuning recognition models on local data without
sending sensitive data over the network.
In this work, we propose an architecture with min-
imal training memory cost requirements which en-
ables training within the tight memory constraints of
embedded devices.
Research in optimization: Recent works on
stochastic optimization algorithms have highlighted
the benefits of large batch training [4, 5]. For exam-
ple, in Imagenet, linear speed-ups in training have
been observed with increasing batch sizes up to tens
of thousands of samples [5]. Optimizing the mem-
ory cost of CNN training may allow further research
on the optimization trade-offs of large batch training.
For small datasets like MNIST or CIFAR10, we are able to process the full dataset in 14 and 18 GB of memory respectively. Although large batch training on such small datasets is very computationally inefficient with current stochastic optimization algorithms [5], the ability to process the full dataset in one pass makes it easy to train CNNs on the true gradient of the error. Memory optimization techniques thus have the potential to facilitate research on optimization techniques outside the realm of Stochastic Gradient Descent.
In this paper, we build on recent works on re-
versible networks [6, 7] and ask the question: how far
can we reduce CNN training memory cost using re-
versible designs with minimal impact on the accuracy
and computational cost? To do so, we take as a start-
ing point the Resnet-18 architecture and analyze its
training memory requirements. We then analyze the
memory cost reduction of invertible designs succes-
sively introduced in the RevNet and iRevNet archi-
tectures. We identify the memory bottleneck of such
architectures, which leads us to introduce a layer-wise
invertible architecture. However, we observe that
layer-wise invertible networks accumulate numerical
errors across their layers, which leads to numerical in-
stabilities impacting model accuracy. We character-
ize the accumulation of numerical errors within long
chains of invertible operations and investigate their
effect on model accuracy. To mitigate the impact
of these numerical errors on the model accuracy, we
propose both a reparameterization of invertible lay-
ers and a hybrid architecture combining the benefits
of layer-wise and residual-block-wise reversibility to
stabilize training.
Our main result is a new architecture that can efficiently train a CNN with a minimal memory cost of 352 bytes per pixel. We demonstrate the efficiency of our method by training a model to 93.3% accuracy on the CIFAR10 dataset within 67 minutes on a low-end Nvidia GTX750 with only 1GB of VRAM.
2 Related Work
2.1 Reversibility
Reversible network designs have been proposed for
various purposes including generative modeling, vi-
sualization, solving inverse problems, or theoretical
analysis of hidden representations.
Flow-based generative models use analytically invertible transformations to compute the change of variable formula. Invertibility is achieved through channel partitioning schemes (NICE [8], Real-NVP [9]), weight matrix factorization (GLOW [10]), or by constraining layer architectures to easily invertible unitary operations (Normalizing flows [11]).
Neural ODEs [12] take a drastically different approach
to invertibility: They leverage the analogy between
residual networks and the Euler method to define
continuous hidden state systems. The conceptual
shift from a finite set of discrete transformations to
a continuous regime gives them invertibility for free.
The computational efficiency of this approach, how-
ever, remains to be demonstrated.
The RevNet model [6] was inspired by the Real-
NVP generative model. They adapt the idea of chan-
nel partitioning and propose an efficient architecture
for discriminative learning. The iRevNet [7] model
builds on the RevNet architecture: they propose to
replace the irreversible max-pooling operation with
an invertible operation that reshapes the hidden activation
states so as to compensate for the loss of spatial
resolution by an increase in the channel dimension.
By preserving the volume of activations, their pool-
ing operation allows for exact reconstruction of the
inverse. In their original work, the authors focus on
the analysis of the representations learned by invert-
ible models rather than resource efficiency. From a
resource optimization point of view, one downside of
their method is that the proposed invertible pooling
scheme drastically increases the number of channels
in upper layers. As the size of the convolution kernel
weights grows quadratically in the number of chan-
nels, the memory cost associated with storing the
model weights becomes a major memory bottleneck.
We address this issue in our proposed architecture.
In [13], the authors use these reversible architectures
to study undesirable invariances in feature space.
In [14], the authors propose a unified architecture
performing well on both generative and discrimina-
tive tasks. They enforce invertibility by regularizing
the weights of residual blocks so as to guarantee the
existence of an inverse operation. However, the com-
putation of the inverse operation is performed with
power iteration methods which are not optimal from
a computational perspective.
Finally, [15] propose to reconstruct the input ac-
tivations of normalization and activation layers us-
ing their inverse function during the backward pass.
We propose a similar method for layer-wise invert-
ible networks. However, as their model does not in-
vert convolution layers, it does not feature long chains
of invertible operations so that they do not need to
account for numerical instabilities. Instead, our pro-
posed model features long chains of invertible opera-
tions so that we need to characterize numerical errors
in order to stabilize training.
2.2 Resource efficiency
Research into resource optimization of CNNs covers
a wide array of techniques, most of which are orthog-
onal to our work. We briefly present some of these
works:
On the architectural side, Squeezenet [16] was first
proposed as an efficient neural architecture reduc-
ing the number of model parameters while maintain-
ing high classification accuracy. MobileNet [17] uses
depth-wise separable convolutions to further reduce
the computational cost of inference for embedded de-
vice applications.
Network pruning [2] is a set of techniques devel-
oped to decrease the model weight size and com-
putational complexity. Network pruning works by
removing the network weights that contribute the
least to the model output. Pruning deep models has
been shown to drastically reduce the memory cost
and computational cost of inference without significantly
hurting model accuracy. Although pruning has mainly
been concerned with optimizing the resource cost of
inference, the recently proposed lottery ticket hypothesis
[18] has shown that specifically pruned networks
could be trained from scratch to high accuracy. This
may be an interesting and complementary line of
work to investigate in the future to reduce training
memory costs.
Low precision arithmetic has been proposed as a
means to reduce both memory consumption and computation
time of deep learning models. Mixed precision
training [19] combines float16 with float32 operations
to avoid numerical instabilities due to either
overflow or underflow. For inference, integer quanti-
zation [1, 20] has been shown to drastically improve
the computation and memory efficiency and has been
successfully deployed on both edge devices and data
centers. Integrating mixed-precision training into our
proposed architecture would allow us to further reduce
training memory costs.
Most related to our work, gradient checkpointing
was introduced as a means to reduce the memory cost
of deep neural network training. Gradient check-
pointing, first introduced in [21], trades off memory
for computational complexity by storing only a sub-
set of the activations during the forward pass. Dur-
ing the backward pass, missing activations are recom-
puted from the stored activations as needed by the
backpropagation algorithm. Follow-up work [22] has
since built on the original gradient checkpointing al-
gorithm to improve this memory/computation trade-
off. However, reversible models like RevNet have
been shown to offer better computational complex-
ity than gradient checkpointing, at the cost of con-
straining the model architecture to invertible residual
blocks.
3 Preliminaries
In this section, we analyze the memory footprint of
training architectures with different reversibility pat-
terns. We start by introducing some notations and
briefly review the backpropagation algorithm in or-
der to characterize the training memory consump-
tion of deep neural networks. In our analysis, we use
a Resnet-18 as a reference baseline and analyze its
training memory footprint. We then gradually aug-
ment the baseline architecture with reversible designs
and analyze their impact on computation and mem-
ory consumption.
3.1 Backpropagation & Notations
Let us consider a model F made of N sequential lay-
ers trained to minimize the error e defined by a loss
function L for an input x and ground-truth label y¯:
$$F : x \rightarrow y \tag{1a}$$
$$y = f_N \circ \dots \circ f_2 \circ f_1(x) \tag{1b}$$
$$e = L(y, \bar{y}) \tag{1c}$$
During the forward pass, each layer fi takes as in-
put the activations zi−1 from the previous layer and
outputs activation features zi = fi(zi−1), with z0 = x
and zN = y being the input and output of the net-
work respectively.
During the backward pass, the gradients of the loss with respect to the hidden activations are propagated backward through the layers of the network using the chain rule:
$$\frac{\delta L}{\delta z_{i-1}} = \frac{\delta L}{\delta z_i} \times \frac{\delta z_i}{\delta z_{i-1}} \tag{2}$$
Before propagating the loss gradient with respect
to its input to the previous layer, each parameterized
layer computes the gradient of the loss with respect to
its parameters. In vanilla SGD, for a given learning
rate η, the weight gradients are subsequently used to
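update the weight values as:
$$\frac{\delta L}{\delta \theta_i} = \frac{\delta L}{\delta z_i} \times \frac{\delta z_i}{\delta \theta_i} \tag{3a}$$
$$\theta_i \leftarrow \theta_i - \eta \times \frac{\delta L}{\delta \theta_i} \tag{3b}$$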
However, the analytical forms of the weight gradients are functions of the layer's input activations $z_{i-1}$. In convolution layers, for instance, the weight gradients can be computed as the convolution of the input activations with the output's gradient:
$$\frac{\delta L}{\delta \theta_i} = z_{i-1} \star \frac{\delta L}{\delta z_i} \tag{4}$$
Hence, computing the derivative of the loss with re-
spect to each layer’s parameters θi requires knowledge
of the input activation values zi−1. In the standard
backpropagation algorithm, hidden layers activations
are stored in memory upon computation during the
forward pass. Activations accumulate in live memory
buffers until used for the weight gradient computation
in the backward pass. Once the weight gradients
have been computed in the backward pass, the hidden
activation buffers can be freed from live memory. However,
the accumulation of activation values stored within
each parameterized layer along the forward pass cre-
ates a major bottleneck in GPU memory.
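To make the dependence on stored inputs concrete, the following minimal pure-Python sketch evaluates Eq. (4) for a single-channel 1-D convolution. The function names `conv1d_valid` and `conv1d_weight_grad` are illustrative and do not come from the paper, and the convolution is assumed to be an unpadded cross-correlation, as is common in deep learning frameworks.

```python
def conv1d_valid(z_in, theta):
    """Forward pass: slide kernel `theta` over `z_in` without padding."""
    k = len(theta)
    return [sum(z_in[j + m] * theta[m] for m in range(k))
            for j in range(len(z_in) - k + 1)]


def conv1d_weight_grad(z_in, grad_out):
    """Eq. (4): correlate the layer input with the gradient of its output.

    `z_in` plays the role of z_{i-1}; it must still be available (stored
    or reconstructed) when the backward pass reaches this layer.
    """
    k = len(z_in) - len(grad_out) + 1  # kernel size implied by the shapes
    return [sum(z_in[j + m] * grad_out[j] for j in range(len(grad_out)))
            for m in range(k)]


z_prev = [1.0, 2.0, 3.0, 4.0]   # input activations z_{i-1}
theta = [0.5, -1.0]             # kernel weights
z_next = conv1d_valid(z_prev, theta)
grad_next = [1.0, 0.0, 2.0]     # an arbitrary upstream gradient dL/dz_i
print(conv1d_weight_grad(z_prev, grad_next))
```

Without `z_prev`, the weight gradient cannot be formed, which is why standard backpropagation keeps every such input in memory until the backward pass reaches the layer.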
The idea behind reversible designs is to constrain
the network architecture to feature invertible trans-
formations. Doing so, activations zi in lower layers
can be recomputed through inverse operations from
the activations zj>i of higher layers. In such architectures,
activations do not need to be kept in memory
during the forward pass as they can be recomputed
from higher layer activations during the backward
pass, effectively freeing up the GPU live memory.
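The idea can be sketched with a toy chain of invertible layers. The `AffineLayer` class and `reversible_backward` function below are hypothetical illustrations (an elementwise map z_out = a * z_in + b), not the layers proposed in this paper; the point is only that each layer input is rebuilt from its output during the backward pass instead of being read from a stored buffer.

```python
class AffineLayer:
    """Toy invertible layer: z_out = a * z_in + b, elementwise, with a != 0."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def forward(self, z_in):
        return [self.a * v + self.b for v in z_in]

    def inverse(self, z_out):
        return [(v - self.b) / self.a for v in z_out]


def reversible_backward(layers, y, grad_y):
    """Backpropagate through `layers`, reconstructing inputs on the fly.

    Returns the per-layer gradients (dL/da, dL/db) and the gradient with
    respect to the network input. Only the final output `y` is needed.
    """
    grads = []
    z_out, g = y, grad_y
    for layer in reversed(layers):
        z_in = layer.inverse(z_out)                     # recomputed, not stored
        grad_a = sum(gi * zi for gi, zi in zip(g, z_in))
        grad_b = sum(g)
        grads.append((grad_a, grad_b))
        g = [gi * layer.a for gi in g]                  # gradient w.r.t. z_in
        z_out = z_in
    return list(reversed(grads)), g


layers = [AffineLayer(2.0, 1.0), AffineLayer(0.5, -3.0)]
z = [1.0, 2.0]
for layer in layers:          # forward pass keeps only the latest activation
    z = layer.forward(z)
param_grads, grad_x = reversible_backward(layers, z, [1.0, 1.0])
print(param_grads, grad_x)
```

In exact arithmetic this yields the same gradients as standard backpropagation; in floating point, each `inverse` call can add a small reconstruction error.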
3.2 Memory footprint
We denote the memory footprint of training a neural
network as a value M in bytes. Given an input x
and ground truth label y¯, the memory footprint rep-
resents the peak memory consumption during an iter-
ation of training including the forward and backward
pass. We divide the total training memory footprint
M into several memory cost factors: the cost Mθ of
storing the model weights, the hidden activations Mz,
and the gradients Mg:
M = Mθ +Mz +Mg (5)
In the following subsections, we detail the mem-
ory footprint of existing architectures with different
reversibility patterns. To help us formalize these
memory costs, we further introduce the following
notations: let n(x) denote the number of elements
in a tensor x, e.g., if x is an h × w matrix, then
n(x) = h × w. Let bpe be the memory cost in bytes
per element at a given precision, so that the actual
memory cost of storing an h × w matrix is n(x) × bpe.
For instance, float32 tensors have a memory cost per
element bpe = 4. We use bs to denote the batch size,
and ci to denote the number of channels at layer i.
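As a quick illustration of this notation, the helpers below compute n(x) and n(x) × bpe from a tensor shape. The names `n_elements`, `tensor_bytes` and the example shape are assumptions made for this sketch.

```python
from math import prod

# Bytes per element (bpe) for a few common floating-point precisions.
BYTES_PER_ELEMENT = {"float16": 2, "float32": 4, "float64": 8}


def n_elements(shape):
    """n(x): number of elements of a tensor with the given shape."""
    return prod(shape)


def tensor_bytes(shape, dtype="float32"):
    """Memory cost n(x) * bpe of storing one tensor."""
    return n_elements(shape) * BYTES_PER_ELEMENT[dtype]


# Hypothetical activation tensor laid out as (bs, c, h, w).
activation_shape = (32, 64, 240, 240)
print(tensor_bytes(activation_shape))
```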
3.3 Vanilla ResNet
The architecture of a vanilla ResNet-18 is shown in
Figure 1. Vanilla ResNets do not use reversible com-
putations so that the input activations of all param-
eterized layers need to be accumulated in memory
during the forward pass for the computation of the
weight gradients to be done in the backward pass.
Hence the peak memory footprint of training a
vanilla ResNet happens at the beginning of the back-
ward pass when the top layer’s activation gradients
need to be stored in memory in addition to the full
stack of hidden activation values.
Let us denote by P ⊂ N the subset of parameterized layers of a network F (i.e., convolutions and batch normalization layers, excluding activation functions and pooling layers). The memory cost associated with storing the hidden activation values is given by:
$$M_z = \sum_{i \in P} n(z_i) \times bpe \tag{6a}$$
$$= \sum_{i \in P} bs \times c_i \times h_i \times w_i \times bpe \tag{6b}$$
where $h_i$ and $w_i$ represent the spatial dimensions of the activation values at layer i. $h_i$ and $w_i$ are determined by the input image size h × w and the pooling factor $p_i$ of layer i, so we can factor out both the spatial dimensions and the batch size from this equation, yielding a memory cost per input pixel $M'_z$:
$$M_z = \sum_{i \in P} bs \times h \times w \times p_i \times c_i \times bpe \tag{7a}$$
$$= bs \times h \times w \times \sum_{i \in P} p_i \times c_i \times bpe \tag{7b}$$
$$M'_z = \frac{M_z}{bs \times h \times w} \tag{7c}$$
$$= \sum_{i \in P} p_i \times c_i \times bpe \tag{7d}$$
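The per-pixel cost of Eq. (7d) is straightforward to evaluate for a given layer list. In the sketch below, the pooling factor p_i is read as the fraction of input pixels remaining at layer i (1 before any pooling, 1/4 after one 2 × 2 pooling), which is how it enters Eq. (7a); the function names and the three-layer configuration are hypothetical and are not the architecture of Table 1.

```python
def activation_bytes_per_pixel(layers, bpe=4):
    """Eq. (7d): sum of p_i * c_i * bpe over the parameterized layers.

    `layers` is a list of (pooling_factor, channels) pairs.
    """
    return sum(p * c * bpe for p, c in layers)


def activation_bytes(layers, h, w, bs, bpe=4):
    """Eq. (7b): total activation memory for a batch of bs images of size h x w."""
    return bs * h * w * activation_bytes_per_pixel(layers, bpe)


# Hypothetical network: two full-resolution layers with 32 channels,
# then one layer with 64 channels after a single 2 x 2 pooling.
toy_layers = [(1.0, 32), (1.0, 32), (0.25, 64)]
print(activation_bytes_per_pixel(toy_layers))
print(activation_bytes(toy_layers, h=240, w=240, bs=32))
```

Because the per-pixel cost is independent of the input resolution and batch size, it characterizes the architecture itself.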
The memory footprint of the weights is given by:
$$M_\theta = \sum_{i \in P} n(\theta_i) \times bpe \tag{8}$$
The memory footprint of the gradients corresponds to the size of the gradient buffers at the time of peak memory usage. In a vanilla ResNet18 model, this peak memory usage happens during the backward pass through the last convolution of the network. Hence, the memory footprint of the gradients corresponds to the memory cost of storing the gradients with respect to either the input or the output of this layer, which also depends on the input pixel size:
$$M_g = \max(n(g_{i-1}), n(g_i)) \times bpe \tag{9a}$$
$$= h \times w \times bs \times p_i \times \max(c_{i-1}, c_i) \times bpe \tag{9b}$$
$$M'_g = p_i \times \max(c_{i-1}, c_i) \times bpe \tag{9c}$$
Figure 1 illustrates the peak memory consumption
of a ResNet-like architecture. For a ResNet parame-
terized following Table 1, the peak memory consump-
tion can then be computed as:
$$M = M_\theta + M_z + M_g \tag{10a}$$
$$= M_\theta + (M'_z + M'_g) \times (h \times w \times bs) \tag{10b}$$
$$= 12.5 \times 10^6 + 1928 \times (h \times w \times bs) \tag{10c}$$
For example, a training iteration over a typical
batch of 32 images of resolution 240 × 240 requires
12.5 MB of memory to store the model weights and
3.8 GB of memory to store the hidden layers activa-
tions and gradients for a total of M = 3.81 GB of
VRAM. The memory cost of the hidden activations
is thus the main memory bottleneck of CNN train-
ing as the cost associated with the model weights is
negligible in comparison.
3.4 RevNet
The RevNet architecture introduces reversible blocks
as drop-in replacements of the residual blocks of the
ResNet architecture. Reversible blocks have analyti-
cal inverses that allow for the computation of both
their input and hidden activation values from the
value of their output activations. Two factors cre-
ate memory bottlenecks in training RevNet architec-
tures, which we refer to as the local and global bot-
tlenecks.
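The block equations are not restated here. As an assumption, the sketch below uses the additive coupling of the original RevNet paper [6]: the input is split into two halves (x1, x2), with y1 = x1 + F(x2) and y2 = x2 + G(y1). The residual functions `F` and `G` are arbitrary placeholders chosen for this example, not the convolutional modules of the actual architecture.

```python
import math


def rev_block_forward(x1, x2, F, G):
    """Additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = [a + b for a, b in zip(x1, F(x2))]
    y2 = [a + b for a, b in zip(x2, G(y1))]
    return y1, y2


def rev_block_inverse(y1, y2, F, G):
    """Analytical inverse: x2 = y2 - G(y1), then x1 = y1 - F(x2)."""
    x2 = [a - b for a, b in zip(y2, G(y1))]
    x1 = [a - b for a, b in zip(y1, F(x2))]
    return x1, x2


def F(v):
    # Placeholder residual function; it does not need to be invertible.
    return [math.tanh(t) for t in v]


def G(v):
    # Second placeholder residual function.
    return [0.5 * t * t for t in v]


x1, x2 = [0.3, -1.2], [2.0, 0.7]
y1, y2 = rev_block_forward(x1, x2, F, G)
print(rev_block_inverse(y1, y2, F, G))
```

Note that the inverse re-evaluates `F` and `G`, which is the source of the extra computation in the backward pass.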
First, the RevNet architecture features non-volume
preserving max-pooling layers, for which the inverse
cannot be computed. As these layers do not have an-
alytical inverses, their input must be stored in mem-
ory during the forward pass for the reconstruction
of lower layer’s activations to be computed during
the backward pass. We refer to the memory cost as-
sociated with storing these activations as the global
bottleneck, since these activations need to be accu-
mulated during the forward pass through the full ar-
chitecture.
The local memory bottleneck has to do with the
synchronization of the reversible block computations:
While activations values are computed by a forward
pass through the reversible block modules, gradients
computations flow backward through these modules
so that the activations and gradient computations
cannot be performed simultaneously. Figure 2 illus-
trates the process of backpropagating through a re-
versible block: First, the input activation values of
the parameterized hidden layers within the reversible
blocks are recomputed from the output. Once the
full set of activations has been computed and stored
in GPU memory, the backpropagation of the gradi-
ents through the reversible block can begin. We refer
to the accumulation of the hidden activation values
within the reversible block as the local memory bot-
tleneck.
For a typical parameterization of a RevNet, as
summarized in Table 1, the local bottleneck of lower
layers actually outweighs the global memory bottle-
neck introduced by non-reversible pooling layers. In-
deed, as the spatial resolution decreases with pooling
operations, the cost associated with storing the input
activations of higher layers becomes negligible com-
pared to the cost of storing activation values in lower
layers. Hence, surprisingly, the peak memory con-
sumption of the RevNet architecture, as illustrated in
Figure 3, happens in the backward pass through the
first reversible block, in which the local memory bot-
tleneck is maximum. For the architecture described
in Table 1, the peak memory consumption can be
computed as:
$$M = M_\theta + M_z + M_g \tag{11a}$$
$$= M_\theta + (M'_z + M'_g) \times (h \times w \times bs) \tag{11b}$$
$$= 12.7 \times 10^6 + 640 \times (h \times w \times bs) \tag{11c}$$
Following our previous example, a RevNet archi-
tecture closely mimicking the ResNet-18 architecture
requires M = 1.19 GB of VRAM for a training iteration
over a batch of 32 images of resolution 240 × 240.
Finally, the memory savings allowed by the re-
versible block come with the additional computa-
tional cost of computing the hidden activations dur-
ing the backward pass. As noted in the original pa-
per, this computational cost is equivalent to perform-
ing one additional forward pass.
3.5 iRevNet
The iRevNet model builds on the RevNet architec-
ture: they replace the irreversible max-pooling oper-
ation with an invertible operation that reshapes the
hidden activation states so as to compensate for the
loss of spatial resolution by an increase in the chan-
nel dimension. As such, the iRevNet architecture is
fully invertible, which alleviates the global memory
bottleneck of the RevNet architecture.
This pooling operation works by stacking the neighboring elements of the pooling regions along the channel dimension, e.g., for a 2D pooling operation with a 2 × 2 pooling window, the number of output channels is four times the number of input channels. Unfortunately, the size of a volume-preserving convolution kernel grows quadratically in the number of input channels:
$$M(\theta) = c_{in} \times c_{out} \times k_h \times k_w \tag{12a}$$
$$= c^2 \times k_h \times k_w \tag{12b}$$
Consider an iRevNet network with initial channel size 32. After three levels of 2 × 2 pooling, the effective channel size becomes $32 \times 4^3 = 2048$. A typical 3 × 3 convolution layer kernel for higher layers of such a network would have $n(\theta) = 2048^2 \times 3 \times 3 \approx 37$M parameters. At this point, the memory cost of the network weights $M_\theta$ becomes an additional memory bottleneck.
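The channel growth can be reproduced with a small sketch of such an invertible pooling on nested lists indexed as [channel][row][col]. The function names and the ordering of the stacked channels are assumptions of this example; the actual iRevNet implementation may order them differently.

```python
def invertible_pool(x, k=2):
    """Stack each k x k spatial block along the channel axis.

    The output has c * k * k channels and spatial size (h / k, w / k),
    so the number of elements is preserved.
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    return [[[x[ch][i * k + di][j * k + dj] for j in range(w // k)]
             for i in range(h // k)]
            for ch in range(c) for di in range(k) for dj in range(k)]


def invertible_unpool(y, k=2):
    """Exact inverse of `invertible_pool`."""
    c_out, h, w = len(y), len(y[0]), len(y[0][0])
    x = [[[None] * (w * k) for _ in range(h * k)]
         for _ in range(c_out // (k * k))]
    idx = 0
    for ch in range(len(x)):
        for di in range(k):
            for dj in range(k):
                for i in range(h):
                    for j in range(w):
                        x[ch][i * k + di][j * k + dj] = y[idx][i][j]
                idx += 1
    return x


def conv_kernel_params(c_in, c_out, kh=3, kw=3):
    """Number of weights of a convolution kernel, as in Eq. (12a)."""
    return c_in * c_out * kh * kw


channels = 32 * 4 ** 3                 # three levels of 2 x 2 pooling
print(channels, conv_kernel_params(channels, channels))
```

With float32 weights (4 bytes per element), a single such kernel already occupies roughly 151 MB.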
Furthermore, the iRevNet architecture does not
address the local memory bottleneck of the reversible
blocks. Figure 4 illustrates such an architecture. For an initial channel size of 32, as summarized in Table 1, the peak memory consumption is given by:
$$M = M_\theta + M_z + M_g \tag{13a}$$
$$= M_\theta + (M'_z + M'_g) \times (h \times w \times bs) \tag{13b}$$
$$= 171 \times 10^6 + 640 \times (h \times w \times bs) \tag{13c}$$
Training such an architecture for an iteration over batches of 32 images of resolution 240 × 240 would require M = 1.35 GB of VRAM. In the next section, we introduce both layer-wise reversibility and a variant of this pooling operation to address the local memory bottleneck of reversible blocks and the weight memory bottleneck respectively.
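The three peak-memory formulas of this section share the same form, a weight term plus a per-pixel term. The sketch below evaluates them for the running example, using the coefficients as printed in Eqs. (10c), (11c) and (13c) and assuming 1 GB = 10^9 bytes; the function and dictionary names are illustrative.

```python
def peak_training_bytes(weight_bytes, bytes_per_pixel, h, w, bs):
    """M = M_theta + (M'_z + M'_g) * (h * w * bs)."""
    return weight_bytes + bytes_per_pixel * (h * w * bs)


# (M_theta in bytes, per-pixel cost M'_z + M'_g in bytes)
COEFFICIENTS = {
    "ResNet-18": (12.5e6, 1928),
    "RevNet": (12.7e6, 640),
    "iRevNet": (171e6, 640),
}

for name, (m_theta, per_pixel) in COEFFICIENTS.items():
    total = peak_training_bytes(m_theta, per_pixel, h=240, w=240, bs=32)
    print(f"{name}: {total / 1e9:.2f} GB")
```

With these inputs, the RevNet and iRevNet totals round to the 1.19 GB and 1.35 GB quoted in the text. For the ResNet-18 coefficients as printed in Eq. (10c), the same arithmetic gives about 3.57 GB, somewhat below the quoted 3.81 GB.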
4 Method
RevNet and iRevNet architectures implement re-
versible transformations at the level of residual
blocks. As we have seen in the previous section,
the design of these reversible blocks creates a local
memory bottleneck as all hidden activations within a
reversible block need to be computed before the gra-
dients are backpropagated through the block. In or-
der to circumvent this local bottleneck, we introduce
layer-wise invertible operations. However, these in-
vertible operations introduce numerical error, which
we characterize in the following subsections. In Sec-
tion 5, we will show that these numerical errors
lead to instabilities that degrade the model accu-
racy. Hence, in section 4.2, we propose a hybrid
model combining layer-wise and residual block-wise
reversible operations to stabilize training while resolving
the local memory bottleneck at a small additional
computational cost.
4.1 Layer-wise Invertibility
In this section, we present invertible layers that act
as drop-in replacement for convolution, batch nor-
malization, pooling and non-linearity layers. We
then characterize the numerical instabilities arising
from the invertible batch normalization and non-
linearities.
4.1.1 Invertible batch normalization
As batch normalization is not a bijective operation,
it does not admit an analytical inverse. However, the
inverse reconstruction of a batch normalization layer
can be realized with minimal memory cost. Given
first and second order moment parameters β and γ,
the forward f and inverse f−1 operation of an in-
vertible batch normalization layer can be computed
as follows:
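$$y = f(x) = \gamma \times \frac{x - \hat{x}}{\sqrt{\dot{x} + \epsilon}} + \beta \tag{14a}$$
$$x = f^{-1}(y, \hat{x}, \dot{x}) = \sqrt{\dot{x} + \epsilon} \times \frac{y - \beta}{\gamma} + \hat{x} \tag{14b}$$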
Where xˆ and x˙ represent the mean and variance of
x respectively. Hence, the input activation x can be
recovered from y through f−1 at the minimal memory
cost of storing the input activation statistics xˆ and x˙.
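A minimal single-channel sketch of Eqs. (14a) and (14b) is given below. The function names are illustrative, and the statistics are computed over a plain list of values rather than over the batch and spatial axes of a real activation tensor.

```python
import math


def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Eq. (14a). Returns the output and the two statistics to keep."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    y = [gamma * (v - mean) * inv_std + beta for v in x]
    return y, mean, var


def batchnorm_inverse(y, mean, var, gamma, beta, eps=1e-5):
    """Eq. (14b): rebuild the input from the output and the stored statistics."""
    std = math.sqrt(var + eps)
    return [std * (v - beta) / gamma + mean for v in y]


x = [0.5, -1.0, 2.0, 3.5]
y, mean, var = batchnorm_forward(x, gamma=1.5, beta=0.2)
x_rec = batchnorm_inverse(y, mean, var, gamma=1.5, beta=0.2)
print(max(abs(a - b) for a, b in zip(x, x_rec)))
```

Only two scalars per channel (`mean` and `var`) are kept between the forward and backward passes, instead of the full input tensor.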
Let us consider the accumulation of numerical errors arising from the inverse computation of an invertible batch normalization layer. During the backward pass, the invertible batch norm layer is supposed to compute its input $x = f^{-1}(y, \hat{x}, \dot{x})$ from the output $y$. In reality, however, the output recovered by upstream invertible layers is a noisy estimate $\tilde{y} = y + \epsilon^y$ of the true output due to numerical errors introduced by upstream layers; inverting it yields a noisy input estimate $\tilde{x} = x + \epsilon^x$. Let us define the signal to noise ratio (SNR) of the input and output signals as follows:
$$snr_o = \frac{|y|^2}{|\epsilon^y|^2} \tag{15a}$$
$$snr_i = \frac{|x|^2}{|\epsilon^x|^2} \tag{15b}$$
We are interested in characterizing the factor α of reduction of the SNR through the inverse reconstruction:
$$\alpha = \frac{snr_i}{snr_o} \tag{16}$$
To illustrate the mechanism through which the
batch normalization inverse operation reduces the
SNR, let us consider a toy layer with only two chan-
nels and parameters β = [0, 0] and γ = [1, ρ]. For
simplicity, let us consider an input signal x indepen-
dently and identically distributed across both chan-
nels with zero mean and standard deviation 1 so that,
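in the forward pass, we have:
$$y = [y_0, y_1] \tag{17a}$$
$$= [x_0, x_1 \times \rho] \tag{17b}$$
$$|y|^2 = |x_0|^2 + |x_1|^2 \times \rho^2 \tag{17c}$$
$$= \frac{1}{2} \times |x|^2 + \frac{1}{2} \times |x|^2 \times \rho^2 \tag{17d}$$
$$= \frac{|x|^2}{2} \times (1 + \rho^2) \tag{17e}$$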
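Here we used the assumption that x is independently and identically distributed across both channels to factorize $|x_0|^2 = |x_1|^2 = \frac{1}{2} \times |x|^2$ in equation (17d).
During the backward pass, the noisy estimate $\tilde{y} = y + \epsilon^y$ is fed back as input to the inverse operation. Similarly, let us suppose a noise $\epsilon^y$ identically distributed across both channels so that we have:
$$\tilde{y} = [\tilde{y}_0, \tilde{y}_1] \tag{18a}$$
$$= [x_0 + \epsilon^y_0, x_1 \times \rho + \epsilon^y_1] \tag{18b}$$
$$\tilde{x} = [\tilde{y}_0, \frac{\tilde{y}_1}{\rho}] \tag{18c}$$
$$= [x_0 + \epsilon^y_0, x_1 + \frac{\epsilon^y_1}{\rho}] \tag{18d}$$
$$\epsilon^x = \tilde{x} - x \tag{18e}$$
$$= [\epsilon^y_0, \frac{\epsilon^y_1}{\rho}] \tag{18f}$$
$$|\epsilon^x|^2 = |\epsilon^y_0|^2 + \frac{|\epsilon^y_1|^2}{\rho^2} \tag{18g}$$
$$= \frac{1}{2} \times |\epsilon^y|^2 + \frac{1}{2} \times \frac{|\epsilon^y|^2}{\rho^2} \tag{18h}$$
$$= \frac{|\epsilon^y|^2}{2} \times (1 + \frac{1}{\rho^2}) \tag{18i}$$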
Using the above formulation, the SNR reduction
factor α can be expressed as:
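$$\alpha = \frac{snr_i}{snr_o} \tag{19a}$$
$$= \frac{|x|^2}{|\epsilon^x|^2} \times \frac{|\epsilon^y|^2}{|y|^2} \tag{19b}$$
$$= \frac{4}{(1 + \frac{1}{\rho^2}) \times (1 + \rho^2)} \tag{19c}$$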
Figure 5 shows the expected evolution of α through our toy layer for different values of the factor ρ. To validate our formula, we empirically evaluate α for Gaussian inputs x and output noise $\epsilon^y$ and find it to closely match the theoretical results given by Equation (19).
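An empirical check of this kind can be sketched in a few lines. The sample count, noise level and random seed below are arbitrary choices for this example, not the settings used to produce Figure 5, and the function names are illustrative.

```python
import random


def alpha_theory(rho):
    """Closed form of Eq. (19c)."""
    return 4.0 / ((1.0 + 1.0 / rho ** 2) * (1.0 + rho ** 2))


def alpha_empirical(rho, n=50_000, noise_std=1e-3, seed=0):
    """Monte Carlo estimate of snr_i / snr_o for the two-channel toy layer."""
    rng = random.Random(seed)
    sig_x = sig_y = err_x = err_y = 0.0
    for _ in range(n):
        x0, x1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        e0, e1 = rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std)
        y0, y1 = x0, rho * x1                         # forward, gamma = [1, rho]
        x0_rec, x1_rec = y0 + e0, (y1 + e1) / rho     # inverse of the noisy output
        sig_x += x0 * x0 + x1 * x1
        sig_y += y0 * y0 + y1 * y1
        err_y += e0 * e0 + e1 * e1
        err_x += (x0_rec - x0) ** 2 + (x1_rec - x1) ** 2
    snr_i = sig_x / err_x
    snr_o = sig_y / err_y
    return snr_i / snr_o


for rho in (0.25, 1.0, 4.0):
    print(rho, alpha_theory(rho), alpha_empirical(rho))
```

The closed form is symmetric in ρ and 1/ρ and reaches its maximum of 1 at ρ = 1, so any imbalance between the two channel gains reduces the SNR.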
In essence, numerical instabilities in the inverse computation of the batch normalization layer arise from the fact that the signals in different channels i and j are amplified by different factors $\gamma_i$ and $\gamma_j$. While the signal amplifications in the forward and inverse paths cancel each other out ($x = f^{-1}(f(x))$), the noise only gets amplified in the backward pass.
In the above demonstration, we have used a toy pa-
rameterization of the invertible batch normalization
layer to illustrate the mechanism behind the SNR
degradation. For arbitrarily parameterized batch
normalization layers, the SNR degradation factor be-
comes:
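$$\alpha = \frac{snr_i}{snr_o} \tag{20a}$$
$$= \frac{|x|^2}{|\epsilon^x|^2} \times \frac{|\epsilon^y|^2}{|y|^2} \tag{20b}$$
$$= \frac{|x|^2}{|y|^2} \times \frac{|\epsilon^y|^2}{|\epsilon^x|^2} \tag{20c}$$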
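Assuming a noise $\epsilon^y$ equally distributed across all c channels, the noise ratio can be computed as follows:
$$\tilde{y}_i = \gamma_i \times \frac{x_i - \hat{x}_i}{\sqrt{\dot{x}_i + \epsilon}} + \beta_i + \epsilon^y_i \tag{21a}$$
$$\tilde{x}_i = \sqrt{\dot{x}_i + \epsilon} \times \frac{\tilde{y}_i - \beta_i}{\gamma_i} + \hat{x}_i \tag{21b}$$
$$= x_i + \frac{\sqrt{\dot{x}_i + \epsilon}}{\gamma_i} \times \epsilon^y_i \tag{21c}$$
$$\epsilon^x_i = \tilde{x}_i - x_i \tag{21d}$$
$$= \frac{\sqrt{\dot{x}_i + \epsilon}}{\gamma_i} \times \epsilon^y_i \tag{21e}$$
$$\frac{|\epsilon^y|^2}{|\epsilon^x|^2} = \frac{|\epsilon^y|^2}{\frac{|\epsilon^y|^2}{c} \times \sum_i \frac{\dot{x}_i + \epsilon}{\gamma_i^2}} \tag{21f}$$
$$= \frac{c}{\sum_i \frac{\dot{x}_i + \epsilon}{\gamma_i^2}} \tag{21g}$$
The squared noise gain $(\dot{x}_i + \epsilon) / \gamma_i^2$ in (21f) follows from squaring (21e); for the toy layer with $\gamma = [1, \rho]$ and unit variance, (21g) reduces to the factor $2 / (1 + \frac{1}{\rho^2})$ of equation (18i).
Assuming an input x following a Gaussian distribution with channel-wise mean $\hat{x}_i$ and variance $\dot{x}_i$, the SNR reduction factor α becomes:
$$\frac{|x|^2}{|y|^2} = \frac{\sum_i |x_i|^2}{\sum_i |y_i|^2} \tag{22a}$$
$$= \frac{\sum_i (\hat{x}_i^2 + \dot{x}_i)}{\sum_i (\gamma_i^2 + \beta_i^2)} \tag{22b}$$
$$\alpha = \frac{|x|^2}{|y|^2} \times \frac{|\epsilon^y|^2}{|\epsilon^x|^2} \tag{22c}$$
$$= \frac{\sum_i (\hat{x}_i^2 + \dot{x}_i)}{\sum_i (\gamma_i^2 + \beta_i^2)} \times \frac{c}{\sum_i \frac{\dot{x}_i + \epsilon}{\gamma_i^2}} \tag{22d}$$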
Finally, we propose the following modification, introducing the hyperparameter $\epsilon_i$, to the invertible batch normalization layer:
$$y = f(x) = |\gamma + \epsilon_i| \times \frac{x - \hat{x}}{\sqrt{\dot{x} + \epsilon}} + \beta \tag{23a}$$
$$x = f^{-1}(y) = \sqrt{\dot{x} + \epsilon} \times \frac{y - \beta}{|\gamma + \epsilon_i|} + \hat{x} \tag{23b}$$
The introduction of the $\epsilon_i$ hyperparameter serves two purposes: First, it stabilizes the numerical errors described above by lower bounding the smallest γ parameters. Second, it prevents numerical instabilities that would otherwise arise from the inverse computation as γ parameters tend towards zero.
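The effect on the inverse pass can be illustrated by the per-channel factor that multiplies the output noise in Eq. (23b). The function name and the numerical values below (γ = 1e-4, ε_i = 0.1) are arbitrary choices for this sketch, not settings reported in the paper.

```python
import math


def inverse_noise_gain(gamma, var, eps=1e-5, eps_i=0.0):
    """Factor by which the inverse of Eq. (23b) scales output noise.

    With eps_i = 0 and gamma > 0 this is the gain of the unmodified
    inverse of Eq. (14b).
    """
    return math.sqrt(var + eps) / abs(gamma + eps_i)


small_gamma = 1e-4
print(inverse_noise_gain(small_gamma, var=1.0))              # unmodified layer
print(inverse_noise_gain(small_gamma, var=1.0, eps_i=0.1))   # modified layer
```

For a channel whose γ has collapsed towards zero, the offset keeps the noise amplification of the inverse pass several orders of magnitude smaller in this example.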
4.1.2 Invertible activation function
A good invertible activation function must be bijec-
tive (to guarantee the existence of an inverse func-
tion) and non-saturating (for numerical stability).
For these reasons, we focus our attention on Leaky ReLUs, whose forward $f$ and inverse $f^{-1}$ computations are defined, for a negative slope parameter n, as follows:
$$y = f(x) = \begin{cases} x, & \text{if } x > 0 \\ x/n, & \text{otherwise} \end{cases} \tag{24a}$$
$$x = f^{-1}(y) = \begin{cases} y, & \text{if } y > 0 \\ y \times n, & \text{otherwise} \end{cases} \tag{24b}$$
The analysis of the numerical errors yielded by the
invertible Leaky ReLU follows a reasoning similar to
the toy batch normalization example, with an additional
subtlety. As in that example, we can think of the
Leaky ReLU as artificially splitting the input x across
two different channels: one channel leaves its input
unchanged, and the other divides its input by a factor n
during the forward pass and multiplies its output by
a factor n during the backward pass.
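The two-channel picture can be made concrete with a short NumPy sketch of Equations (24a) and (24b). The function names, the value n = 4 and the size of the perturbation are assumptions chosen for illustration.

```python
import numpy as np

def leaky_forward(x, n):
    """Eq. (24a): identity for positive inputs, division by n otherwise."""
    return np.where(x > 0, x, x / n)

def leaky_inverse(y, n):
    """Eq. (24b): identity for positive outputs, multiplication by n otherwise."""
    return np.where(y > 0, y, y * n)

n = 4.0
x = np.array([2.0, -2.0])        # one value per artificial channel
y = leaky_forward(x, n)          # [2.0, -0.5]

noise = 1e-3                     # same perturbation on both outputs
x_rec = leaky_inverse(y + noise, n)
err = x_rec - x                  # positive channel: noise, negative channel: n * noise
print(err)
```

The error on the negative channel is n times larger than on the positive channel, while the signal itself is reconstructed at its original scale.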
However, these artificial channels are defined by
the sign of the input and output during the forward
and backward pass respectively. Hence, we need to
consider the cases in which the noise flips the sign
of the output activations, which leads to different
behaviors of the invertible Leaky ReLU across four
cases:
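\begin{equation}
y = \begin{cases}
y_{nn} & \text{if } \tilde{y} < 0 \text{ and } y < 0 \\
y_{np} & \text{if } \tilde{y} \geq 0 \text{ and } y < 0 \\
y_{pp} & \text{if } \tilde{y} \geq 0 \text{ and } y \geq 0 \\
y_{pn} & \text{if } \tilde{y} < 0 \text{ and } y \geq 0
\end{cases} \tag{25a}
\end{equation}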
Where the index np, for instance, represents neg-
ative activations whose reconstructions have become
positive due to the added noise. The signal to noise
ratio of the input and outputs can be expressed re-
spectively as:
In the case where $y \gg \epsilon^y$, the probability of sign
flips ($y_{np}$, $y_{pn}$) is negligible, so that the output signal
y is evenly split between $y_{pp}$ and $y_{nn}$. In this regime,
the degradation of the SNR obeys a formula similar
to the toy batch normalization example:
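\begin{align}
y &= [y_{pp}, y_{nn}] \tag{26a} \\
&= \left[x_{pp}, \frac{x_{nn}}{n}\right] \tag{26b} \\
|y|^2 &= \frac{1}{2} \times |x|^2 + \frac{1}{2} \times \frac{|x|^2}{n^2} \tag{26c} \\
&= \frac{|x|^2}{2} \times \left(1 + \frac{1}{n^2}\right) \tag{26d}
\end{align}
\begin{align}
\tilde{y} &= [\tilde{y}_{pp}, \tilde{y}_{nn}] \tag{27a} \\
&= \left[x_{pp} + \epsilon^y_{pp}, \frac{x_{nn}}{n} + \epsilon^y_{nn}\right] \tag{27b} \\
\tilde{x} &= [\tilde{y}_{pp}, \tilde{y}_{nn} \times n] \tag{27c} \\
&= [x_{pp} + \epsilon^y_{pp}, x_{nn} + \epsilon^y_{nn} \times n] \tag{27d} \\
\epsilon^x &= \tilde{x} - x \tag{27e} \\
&= [\epsilon^y_{pp}, \epsilon^y_{nn} \times n] \tag{27f} \\
|\epsilon^x|^2 &= \frac{1}{2} \times |\epsilon^y|^2 + \frac{1}{2} \times |\epsilon^y|^2 \times n^2 \tag{27g} \\
&= \frac{|\epsilon^y|^2}{2} \times (1 + n^2) \tag{27h}
\end{align}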
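Using the above formulation, the signal to noise
ratio reduction factor α can be expressed as:
\begin{align}
\alpha &= \frac{\mathrm{snr}_i}{\mathrm{snr}_o} \tag{28a} \\
&= \frac{|x|^2}{|\epsilon^x|^2} \times \frac{|\epsilon^y|^2}{|y|^2} \tag{28b} \\
&= \frac{4}{\left(1 + \frac{1}{n^2}\right) \times (1 + n^2)} \tag{28c}
\end{align}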
Hence numerical errors can be controlled by set-
ting the value of the negative slope n. As n tends
towards 1, α converges to 1, yielding minimum signal
degradation. However, as n tends towards 1, the net-
work tends toward a linear behavior, which hurts the
model expressivity. Figure 6 shows the evolution of
the SNR degradation α for different negative slopes
n; and, in Section 5.1, we investigate the impact of
the negative slope parameter on the model accuracy.
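The closed form (28c) can be compared against a small simulation. The sketch below uses the setting described in the caption of Figure 6 (Gaussian input with zero mean and unit standard deviation, white Gaussian noise of standard deviation $10^{-5}$); the function names, the sample size and the chosen values of n are assumptions made for illustration.

```python
import numpy as np

def alpha_theory(n):
    """Eq. (28c)."""
    return 4.0 / ((1.0 + 1.0 / n**2) * (1.0 + n**2))

def alpha_empirical(n, size=200_000, noise_std=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=size)
    y = np.where(x > 0, x, x / n)                        # Eq. (24a)
    eps_y = rng.normal(scale=noise_std, size=size)
    y_noisy = y + eps_y
    x_rec = np.where(y_noisy > 0, y_noisy, y_noisy * n)  # Eq. (24b)
    eps_x = x_rec - x
    snr_i = np.sum(x**2) / np.sum(eps_x**2)
    snr_o = np.sum(y**2) / np.sum(eps_y**2)
    return snr_i / snr_o

results = {n: (alpha_theory(n), alpha_empirical(n)) for n in (1.0, 2.0, 10.0)}
for n, (theory, measured) in results.items():
    print(f"n={n:5.1f}  theory={theory:.4f}  empirical={measured:.4f}")
```

Note that (28c) is unchanged when n is replaced by 1/n, so the degradation depends only on how far the slope is from 1, not on the direction.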
When the noise reaches an amplitude similar to
or greater than the activation signal, the effects of
sign flips complicate the equation. However, in this
regime, the signal to noise ratio becomes too low for
training to converge, as numerical errors prevent any
useful weight update, so we leave the problem of char-
acterizing this regime open.
4.1.3 Invertible convolutions
Invertible convolution layers can be defined in sev-
eral ways. The inverse operation of a convolution is
often referred to as deconvolution, and is defined for
a subspace of the kernel weight space.
However, deconvolutions are computationally expensive
and subject to numerical errors. Instead, we
choose to implement invertible convolutions using the
same channel partitioning scheme as the reversible block
design, for its simplicity, numerical stability and
computational efficiency. Hence, invertible convolutions,
in our architecture, can be seen as minimal reversible
blocks in which both modules consist of a single
convolution. Gomez et al. [6] found the numerical errors
introduced by reversible blocks to have no impact on
the model accuracy. Similarly, we found reversible
blocks to be extremely stable, yielding negligible numerical
errors compared to the invertible batch normalization
and Leaky ReLU layers.
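A minimal sketch of such a block is shown below, assuming the additive coupling of the reversible block of Gomez et al. [6] (y1 = x1 + F(x2), y2 = x2 + G(y1)). For brevity, each module is a 1 × 1 convolution written as a channel-mixing matrix product; this stand-in, the function names and the tensor sizes are assumptions for illustration and are not taken from the authors' code.

```python
import numpy as np

def conv1x1(x, w):
    """Stand-in for a single convolution: mixes channels at each pixel.
    x: (bs, c_in, h, w), w: (c_out, c_in)."""
    return np.einsum("oc,bchw->bohw", w, x)

def rev_forward(x, w_f, w_g):
    x1, x2 = np.split(x, 2, axis=1)      # channel partitioning
    y1 = x1 + conv1x1(x2, w_f)
    y2 = x2 + conv1x1(y1, w_g)
    return np.concatenate([y1, y2], axis=1)

def rev_inverse(y, w_f, w_g):
    y1, y2 = np.split(y, 2, axis=1)
    x2 = y2 - conv1x1(y1, w_g)           # undo the second module first
    x1 = y1 - conv1x1(x2, w_f)
    return np.concatenate([x1, x2], axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))
w_f = rng.normal(size=(4, 4))
w_g = rng.normal(size=(4, 4))

y = rev_forward(x, w_f, w_g)
x_rec = rev_inverse(y, w_f, w_g)
max_err = float(np.max(np.abs(x_rec - x)))
print(f"max reconstruction error: {max_err:.2e}")
```

The inverse only uses additions and subtractions of module outputs, so the convolution weights never need to be inverted. This is why the scheme is defined for any kernel, unlike a deconvolution.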
4.1.4 Pooling
In [7], the authors propose an invertible pooling op-
eration that operates by stacking the neighboring ele-
ments of the pooling regions along the channel dimen-
sion. As noted in Section 3.5, the increase in channel
size at each pooling level induces a quadratic increase
in the number of parameters of upstream convolution,
which creates a new memory bottleneck.
To circumvent this quadratic increase in the memory
cost of the weights, we propose a new pooling layer
that stacks the elements of neighboring pooling regions
along the batch dimension instead of the channel dimension.
We refer to these two kinds of pooling as channel
pooling Pc and batch pooling Pb, respectively, depending
on the dimension along which activation features are
stacked. Given a 2 × 2 pooling region and an input
activation tensor x of dimensions bs × c × h × w,
where bs refers to the batch size, c to the number of
channels and h × w to the spatial resolution, the
reshaping operation performed by both pooling layers
can be formalized as follows:
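\begin{align}
P_c &: x \rightarrow y \tag{29a} \\
&: \mathbb{R}^{bs \times c \times h \times w} \rightarrow \mathbb{R}^{bs \times 4c \times \frac{h}{2} \times \frac{w}{2}} \tag{29b} \\
P_b &: x \rightarrow y \tag{29c} \\
&: \mathbb{R}^{bs \times c \times h \times w} \rightarrow \mathbb{R}^{4bs \times c \times \frac{h}{2} \times \frac{w}{2}} \tag{29d}
\end{align}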
Channel pooling gives us a way to perform volume-preserving
pooling operations while increasing the
number of channels at a given layer of the architecture,
while batch pooling gives us a way to perform
volume-preserving pooling operations while keeping
the number of channels constant. By alternating between
channel and batch pooling, we can control
the number of channels at each pooling level of the
model's architecture.
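Both reshaping operations in (29) can be sketched with NumPy reshapes and transposes. The text only specifies the input and output shapes, so the exact ordering of the stacked elements below, like the function names, is an assumption made for illustration.

```python
import numpy as np

def channel_pool(x):
    """P_c: (bs, c, h, w) -> (bs, 4c, h/2, w/2)."""
    bs, c, h, w = x.shape
    x = x.reshape(bs, c, h // 2, 2, w // 2, 2)
    x = x.transpose(0, 1, 3, 5, 2, 4)        # (bs, c, 2, 2, h/2, w/2)
    return x.reshape(bs, 4 * c, h // 2, w // 2)

def batch_pool(x):
    """P_b: (bs, c, h, w) -> (4bs, c, h/2, w/2)."""
    bs, c, h, w = x.shape
    x = x.reshape(bs, c, h // 2, 2, w // 2, 2)
    x = x.transpose(3, 5, 0, 1, 2, 4)        # (2, 2, bs, c, h/2, w/2)
    return x.reshape(4 * bs, c, h // 2, w // 2)

def channel_unpool(y):
    bs, c4, h2, w2 = y.shape
    y = y.reshape(bs, c4 // 4, 2, 2, h2, w2)
    y = y.transpose(0, 1, 4, 2, 5, 3)        # (bs, c, h/2, 2, w/2, 2)
    return y.reshape(bs, c4 // 4, 2 * h2, 2 * w2)

def batch_unpool(y):
    bs4, c, h2, w2 = y.shape
    y = y.reshape(2, 2, bs4 // 4, c, h2, w2)
    y = y.transpose(2, 3, 4, 0, 5, 1)        # (bs, c, h/2, 2, w/2, 2)
    return y.reshape(bs4 // 4, c, 2 * h2, 2 * w2)

x = np.arange(2 * 3 * 4 * 4, dtype=np.float32).reshape(2, 3, 4, 4)
yc = channel_pool(x)
yb = batch_pool(x)
print(yc.shape, yb.shape)
```

Both operations only move elements around, so they are exactly invertible and introduce no numerical error of their own.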
4.1.5 Layer-wise invertible architecture
Putting together the above building blocks, Figure
7 illustrates a layer-wise invertible architecture. The
peak memory usage for a training iteration of this ar-
chitecture, as parameterized in Table 1, can be com-
puted as follows:
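\begin{align}
M &= M_\theta + M_z + M_g \tag{30a} \\
&= M_\theta + (M'_z + M'_g) \times (h \times w \times bs) \tag{30b} \\
&= 29.6 \times 10^6 + 320 \times (h \times w \times bs) \tag{30c}
\end{align}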
Training an iteration over a typical batch of 32 images
with resolution 240 × 240 would require about
590MB of VRAM for the pixel-wise term, or about 620MB
once the 29.6MB parameter term Mθ is included. Similar
to the RevNet architecture, the reconstruction of the
hidden activations by inverse transformations during
the backward pass comes with an additional
computational cost similar to a forward pass.
4.2 Hybrid architecture
In section 3, we saw that layer-wise activation and
normalization layers degrade the signal to noise ra-
tio of the reconstructed activations. In section 5.1,
we will quantify the accumulation of numerical er-
rors through long chains of layer-wise invertible op-
erations and show that numerical errors negatively
impact model accuracy.
To prevent these numerical instabilities, we intro-
duce a hybrid architecture, illustrated in Figure 8,
combining reversible residual blocks with layer-wise
invertible functions. Conceptually, the role of the
residual level reversible block is to reconstruct the in-
put activation of residual blocks with minimal errors,
while the role of the layer-wise invertible layers is to
efficiently recompute the hidden activations within
the reversible residual blocks at the same time as the
gradient propagates to circumvent the local memory
bottleneck of the reversible module.
The backward pass through these hybrid reversible
blocks is illustrated in Figure 9 and proceeds as follows:
First, the input x is computed from the output y
through the analytical inverse of the reversible
block. These computations are made without storing
the hidden activation values of the sub-modules. Second,
the gradients of the activations are propagated
backward through the modules of the reversible block.
As each layer within these modules is invertible,
the hidden activation values are computed using the
layer-wise inverses alongside the gradient.
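The second phase, in which hidden activations are reconstructed layer by layer while the gradient flows backward, can be sketched as follows. This is a simplified illustration and not the authors' implementation: the layers are element-wise (a fixed scaling and the invertible Leaky ReLU), parameter gradients are omitted, and all class and function names are hypothetical.

```python
import numpy as np

class Scale:
    def __init__(self, s): self.s = s
    def forward(self, x): return self.s * x
    def inverse(self, y): return y / self.s
    def vjp(self, x, g): return self.s * g            # d(out)/d(in) = s

class LeakyReLU:
    def __init__(self, n): self.n = n
    def forward(self, x): return np.where(x > 0, x, x / self.n)
    def inverse(self, y): return np.where(y > 0, y, y * self.n)
    def vjp(self, x, g): return np.where(x > 0, g, g / self.n)

def backward_with_reconstruction(layers, y, grad_y):
    """Walk the layers in reverse, keeping a single activation in memory."""
    for layer in reversed(layers):
        x = layer.inverse(y)           # reconstruct this layer's input
        grad_y = layer.vjp(x, grad_y)  # backpropagate through the layer
        y = x                          # the output is no longer needed
    return y, grad_y

rng = np.random.default_rng(0)
layers = [Scale(1.5), LeakyReLU(4.0), Scale(0.5), LeakyReLU(4.0)]
x = rng.normal(size=16)

# Reference: forward pass that stores every layer input.
stored, h = [], x
for layer in layers:
    stored.append(h)
    h = layer.forward(h)
grad_ref = np.ones_like(h)
for layer, inp in zip(reversed(layers), reversed(stored)):
    grad_ref = layer.vjp(inp, grad_ref)

# Memory-light variant: only the final output h is kept.
x_rec, grad_rec = backward_with_reconstruction(layers, h, np.ones_like(h))
print(np.allclose(grad_rec, grad_ref), np.allclose(x_rec, x))
```

The reference loop holds one stored activation per layer, whereas the reconstructing loop holds one activation in total, which is the memory saving the layer-wise inverses are meant to provide.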
The analytical inverse of the residual level re-
versible blocks is used to propagate hidden activa-
tions with minimal reconstruction error to the lower
modules, while layer-wise inversion allows us to alle-
viate the local bottleneck of the reversible block by
computing the hidden activation values together with
the backward flow of the gradients. As layer-wise
inverses are only used for hidden feature computa-
tions within the scope of the reversible block, and
reversible blocks are made of relatively short chains
of operations, numerical errors do not accumulate up
to a damaging degree.
The peak memory consumption of our proposed
architecture, as illustrated in Figure 8 and parame-
terized in Table 1, can be computed as
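\begin{align}
M &= M_\theta + M_z + M_g \tag{31a} \\
&= M_\theta + (M'_z + M'_g) \times (h \times w \times bs) \tag{31b} \\
&= 14.8 \times 10^6 + 352 \times (h \times w \times bs) \tag{31c}
\end{align}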
Training an iteration over a batch of 32 images of
resolution 240 × 240 would require about 648MB of
VRAM for the pixel-wise term, or about 664MB once the
14.8MB parameter term Mθ is included.
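Equations (30c) and (31c) can be evaluated directly for this batch. In the sketch below the helper name is hypothetical, and the constants are read as bytes, following the figure of 352 bytes per input pixel quoted in the abstract.

```python
def peak_memory_bytes(m_theta, per_pixel, h, w, bs):
    """M = M_theta + (M'_z + M'_g) * (h * w * bs), all in bytes."""
    return m_theta + per_pixel * h * w * bs

h, w, bs = 240, 240, 32

# Pixel-wise terms only.
layerwise_pixels = 320 * h * w * bs      # 589,824,000 bytes
hybrid_pixels = 352 * h * w * bs         # 648,806,400 bytes

# Totals including the parameter term M_theta.
layerwise_total = peak_memory_bytes(29_600_000, 320, h, w, bs)
hybrid_total = peak_memory_bytes(14_800_000, 352, h, w, bs)

print(layerwise_pixels, layerwise_total)
print(hybrid_pixels, hybrid_total)
```

For inputs of this size the pixel-wise term accounts for more than 95% of the total in both architectures, so the per-pixel cost is the quantity that matters when scaling the resolution or the batch size.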
It should be noted, however, that this architec-
ture adds an extra computational cost as both the
reversible block inverse and layer-wise inverse need to
be computed. Hence, instead of one additional for-
ward pass, as in the RevNet and layer-wise architec-
tures, our hybrid architecture comes with a compu-
tational cost equivalent to performing two additional
forward passes during the backward pass.
5 Results and Discussion
We use the CIFAR10 dataset as a benchmark for
our experiments. The CIFAR10 dataset is complex
enough to require efficient architectures to reach high
accuracy, yet small enough to enable us to rapidly it-
erate over different architectural designs. We start by
analyzing numerical errors arising in layer-wise in-
vertible and hybrid architectures, and outline their
impact on accuracy. This analysis motivates our
choice of architecture and hyperparameters. We then
summarize the benefits and drawbacks of our pro-
posed architecture in comparison to different baseline
architectures.
5.1 Impact of Numerical stability
5.1.1 Layer-wise Invertible Architecture
In this section, we quantify the accumulation of nu-
merical errors in layer-wise invertible architectures
and analyze their impact on the accuracy. The ar-
chitecture of these models is illustrated in Figure 7.
We investigate the evolution of numerical errors, and
their impact on accuracy, for networks of different
depth and different hyper-parameter values. Figure
10 illustrates the degradation of the signal-to-noise
ratio along the layers of one such model.
We found the two most influential parameters to be
the depth N of the network and the negative slope
n of the activation function. Figure 11 shows the
evolution of the numerical errors with both of these
parameters.
Next, we investigate the impact of numerical errors
on the accuracy. In order to isolate the impact of the
numerical errors, we compare the accuracy reached
by the same architecture with and without inverse
reconstruction of the hidden layers activations. With-
out reconstruction, the hidden activation values are
stored along the forward pass and the gradient up-
dates are computed from the true, noiseless activa-
tion values, so that the only difference between both
settings is the noise introduced by the inverse recon-
structions.
In Figure 12, we compare the evolution of the accuracy
in both settings for different depths and negative
slopes. For small depths (or high negative slopes), in
which the numerical errors are minimal, both models
yield similar accuracy. However, as the numerical
errors grow, the accuracy of the model goes down,
while the accuracy of the ideal baseline keeps increas-
ing, which can be seen with both depth and negative
slopes. This loss in accuracy is the direct result of
numerical errors, which prevent the model from con-
verging to higher accuracies.
5.1.2 Hybrid Invertible Architecture
In section 4.2, we introduced a hybrid architecture,
illustrated in Figure 8, to prevent the impact of nu-
merical errors on accuracy. Figure 13 shows the prop-
agation of the signal to noise ratio through the layers
of such hybrid architecture. As can be seen in this
figure, the hybrid architecture is much more robust to
numerical errors as activations are propagated from
one reversible block to the other using the reversible
block inverse computations instead of layer-wise in-
versions.
Figure 14 shows the evolution of the SNR with in-
creasing depth N and for different values of negative
slope n. This figure shows a much more stable evo-
lution of the signal to noise ratio than the layer-wise
architecture.
Figure 15 compares the evolution of the accuracy
reached by this hybrid architecture with noisy activa-
tions and noiseless ideal activations as depth and neg-
ative slope increase. The negative impacts of numer-
ical errors observed in the layer-wise architecture are
gone, confirming that the numerical stability brought
by the hybrid architecture effectively stabilizes train-
ing.
5.2 Model comparison
Table 1 summarizes our main results. In this ta-
ble, we compare architectures with different patterns
of reversibility. To allow for a fair comparison, we
have tweaked each architecture to keep the number
of parameters as close as possible, with the notable
exception of the i-RevNet architecture. The
i-RevNet pooling scheme enforces a quadratic growth
of its parameters with each level of pooling. In order
to keep the number of parameters of the i-RevNet
close to the other baselines, we would have to drastically
reduce the number of channels of lower layers,
which we found to yield poor performance. Furthermore,
it should be noted that the i-RevNet architecture
we present slightly differs from the original
i-RevNet model: our implementation uses RevNet-like
reversible modules with one module per channel
split, for similarity with the other architectures we
evaluate, instead of the single module used in the
original architecture.
All models were trained for 50 epochs of stochas-
tic gradient descent with cyclical learning rate and
momentum [23] with minimal image augmentation.
The parameters of our proposed architecture are
given in Table 1. This architecture was selected as
the best performing architecture from an extensive
architecture search on a constrained weight budget.
Compared to the original ResNet architecture, our
model drastically cuts the memory cost of training.
These drastic memory cuts come at the cost of a small
degradation in accuracy.
Furthermore, our hybrid architecture requires the
computational equivalent of two additional forward
passes within each backward pass. The computational
complexity, however, remains reasonable: In
Table 2, we compare the time needed to train our
proposed architecture to 93.3% accuracy on a high-end
Nvidia GTX 1080Ti and a low-end Nvidia GTX750. The
GTX750 only has 1GB of VRAM, which results in
roughly 400MB of available memory after the ini-
tialization of various frameworks. Training a vanilla
ResNet with large batch sizes on such limited mem-
ory resources is impractical, while our architecture
allows for efficient training.
6 Conclusion
Convolutional Neural Networks form the backbone of
modern computer vision systems. However, the accuracy
of these models comes at the cost of resource
intensive training and inference procedures. While
tremendous efforts have been put into the optimization
of the inference step on resource-limited devices,
relatively little work has focused on algorithmic
solutions for limited resource training. In this paper, we
have presented an architecture able to yield high ac-
curacy classifications within very tight memory con-
straints. We highlighted several potential applica-
tions of memory-efficient training procedures, such
as on-device training, and illustrated the efficiency of
our approach by training a CNN to 93.3% accuracy
on a low-end GPU with only 1GB of memory.
7 FIGURE LEGEND
References
[1] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang,
M., Howard, A., Adam, H., Kalenichenko, D.:
Quantization and training of neural networks
for efficient integer-arithmetic-only inference. In:
Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pp. 2704–
2713 (2018)
[2] Molchanov, P., Tyree, S., Karras, T., Aila,
T., Kautz, J.: Pruning convolutional neural
networks for resource efficient inference. arXiv
preprint arXiv:1611.06440 (2016)
| Model | Accuracy | #Params | Channels | Pooling | Mθ | M′z + M′g | M |
|---|---|---|---|---|---|---|---|
| ResNet | 94.7% | 3.1M | 32-64-128-256 | Max Pooling | 12.5M | 1928 | 1.01G |
| RevNet | 94.5% | 3.1M | 40-80-256-320 | Max Pooling | 12.7M | 640 | 348M |
| i-RevNet | 93.8% | 42.8M | 32-128-512-2048 | Pc-Pc-Pc | 171M | 640 | 500M |
| Ours | 93.3% | 3.7M | 32-128-512-512 | [Pc,Pc,Pb] | 14.8M | 352 | 200M |

Table 1: Summary of architectures with different levels of reversibility
| GPU | Accuracy | Time |
|---|---|---|
| GTX750 | 93.3% | 67 min |
| GTX 1080Ti | 93.3% | 35 min |

Table 2: Training statistics on different hardware
[3] Kingma, D.P., Ba, J.: Adam: A method
for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
[4] Shallue, C.J., Lee, J., Antognini, J., Sohl-
Dickstein, J., Frostig, R., Dahl, G.E.: Measur-
ing the effects of data parallelism on neural net-
work training. arXiv preprint arXiv:1811.03600
(2018)
[5] McCandlish, S., Kaplan, J., Amodei, D.,
OpenAI Dota Team: An empirical model of large-
batch training. arXiv preprint arXiv:1812.06162
(2018)
[6] Gomez, A.N., Ren, M., Urtasun, R., Grosse,
R.B.: The reversible residual network: Back-
propagation without storing activations. In: Ad-
vances in Neural Information Processing Sys-
tems, pp. 2214–2224 (2017)
[7] Jacobsen, J.-H., Smeulders, A., Oyallon, E.:
i-RevNet: Deep invertible networks. arXiv preprint
arXiv:1802.07088 (2018)
[8] Dinh, L., Krueger, D., Bengio, Y.: NICE:
Non-linear independent components estimation.
arXiv preprint arXiv:1410.8516 (2014)
[9] Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density
estimation using Real NVP. arXiv preprint
arXiv:1605.08803 (2016)
[10] Kingma, D.P., Dhariwal, P.: Glow: Generative
flow with invertible 1x1 convolutions. In: Ad-
vances in Neural Information Processing Sys-
tems, pp. 10215–10224 (2018)
[11] Rezende, D.J., Mohamed, S.: Variational in-
ference with normalizing flows. arXiv preprint
arXiv:1505.05770 (2015)
[12] Chen, T.Q., Rubanova, Y., Bettencourt, J., Du-
venaud, D.K.: Neural ordinary differential equa-
tions. In: Advances in Neural Information Pro-
cessing Systems, pp. 6571–6583 (2018)
[13] Jacobsen, J.-H., Behrmann, J., Zemel, R.,
Bethge, M.: Excessive invariance causes
adversarial vulnerability. arXiv preprint
arXiv:1811.00401 (2018)
[14] Behrmann, J., Duvenaud, D., Jacobsen, J.-
H.: Invertible residual networks. arXiv preprint
arXiv:1811.00995 (2018)
[15] Rota Bulò, S., Porzi, L., Kontschieder, P.:
In-place activated batchnorm for memory-
optimized training of DNNs. In: Proceedings of
the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 5639–5647 (2018)
[16] Iandola, F.N., Han, S., Moskewicz, M.W.,
Ashraf, K., Dally, W.J., Keutzer, K.:
SqueezeNet: AlexNet-level accuracy with
50x fewer parameters and <0.5MB model size.
arXiv preprint arXiv:1602.07360 (2016)
[17] Howard, A.G., Zhu, M., Chen, B., Kalenichenko,
D., Wang, W., Weyand, T., Andreetto, M.,
Adam, H.: MobileNets: Efficient convolutional
neural networks for mobile vision applications.
arXiv preprint arXiv:1704.04861 (2017)
[18] Frankle, J., Carbin, M.: The lottery ticket hy-
pothesis: Finding sparse, trainable neural net-
works. arXiv preprint arXiv:1803.03635 (2018)
[19] Micikevicius, P., Narang, S., Alben, J., Di-
amos, G., Elsen, E., Garcia, D., Ginsburg, B.,
Houston, M., Kuchaiev, O., Venkatesh, G., et
al.: Mixed precision training. arXiv preprint
arXiv:1710.03740 (2017)
[20] Wu, S., Li, G., Chen, F., Shi, L.: Training and
inference with integers in deep neural networks.
arXiv preprint arXiv:1802.04680 (2018)
[21] Martens, J., Sutskever, I.: Training deep and
recurrent networks with Hessian-free optimization.
In: Neural Networks: Tricks of the Trade, pp.
479–535. Springer (2012)
[22] Chen, T., Xu, B., Zhang, C., Guestrin, C.:
Training deep nets with sublinear memory cost.
arXiv preprint arXiv:1604.06174 (2016)
[23] Smith, L.N., Topin, N.: Super-convergence:
Very fast training of neural networks using large
learning rates. arXiv preprint arXiv:1708.07120
(2017)
Figure 1: Illustration of the ResNet-18 architecture
and its memory requirements. Modules contribut-
ing to the peak memory consumption are shown in
red. These modules contribute to the memory cost
by storing their input in memory. The green annota-
tion represents the extra memory cost of storing the
gradient in memory. The peak memory consumption
happens in the backward pass through the last con-
volution so that this layer is annotated with an addi-
tional gradient memory cost. At this step of the com-
putation, all lower parameterized layers have stored
their input in memory, which constitutes the memory
bottleneck.
Figure 2: Illustration of the backpropagation pro-
cess through a reversible block. In the forward pass
(left), activations are propagated forward from top to
bottom. The activations are not kept in live memory
as they are to be recomputed in the backward pass so
no memory bottleneck occurs. The backward pass is
made of two phases: First the hidden and input acti-
vations are recomputed from the output through an
additional forward pass through both modules (middle).
Once the activations are recomputed, the activation
gradients are propagated backward through both
modules of the reversible block (right). Because the
activation and gradient computations flow in oppo-
site directions through both modules, both computa-
tions cannot be efficiently overlapped, which results
in the local memory bottleneck of storing all hidden
activations within the reversible block before the gra-
dient backpropagation step.
Figure 3: Illustration of the RevNet architecture
and its memory consumption. Modules contributing
to the peak memory consumption are shown in red.
The peak memory consumption happens during the
backward pass through the first reversible block. At
this step of the computations, all hidden activations
within the reversible block are stored in memory si-
multaneously.
Figure 4: Illustration of the i-RevNet architecture
and its memory consumption. The peak memory con-
sumption happens during the backward pass through
the top reversible block. In addition to this local
memory bottleneck, the cost of storing the top layers
weights (in orange) becomes a new memory bottle-
neck as the weight kernel size grows quadratically in
the number of channels.
Figure 5: Illustration of the numerical errors aris-
ing from batch normalization layers. Comparison of
the theoretical and empirical evolution of the α ratio
for different ρ values in our toy example. Empiri-
cal values were computed for a Gaussian input signal
with zero mean and standard deviation 1 and a white
Gaussian noise of standard deviation $10^{-5}$.
Figure 6: Illustration of the numerical errors arising
from invertible activation layers. Comparison of the
theoretical and empirical evolution of the α ratio for
different negative slopes n. Empirical values were
computed for a Gaussian input signal with zero mean
and standard deviation 1 and a white Gaussian noise
of standard deviation $10^{-5}$.
Figure 7: Illustration of a layer-wise invertible ar-
chitecture and its memory consumption.
Figure 8: Illustration of a hybrid architecture and
its peak memory consumption.
Figure 9: Illustration of the backpropagation pro-
cess through a reversible block of our proposed hy-
brid architecture. In the forward pass (left), acti-
vations are propagated forward from top to bottom.
The activations are not kept in live memory as they
are to be recomputed in the backward pass so that
no memory bottleneck occurs. The backward pass
is made of two phases: First the input activations
are recomputed from the output using the reversible
block analytical inverse (middle). This step reconstructs
the input activations with minimal reconstruction
error. During this step, hidden activations
are not kept in live memory so as to avoid the local
memory bottleneck of the reversible block. Once
the input activations are recomputed, the gradients are
propagated backward through both modules of the
reversible block (right). During this second phase,
hidden activations are recomputed backward through
each module using the layer-wise inverse operations,
yielding a minimal memory footprint.
Figure 10: Evolution of the SNR through the layers
of a layer-wise invertible model. Color boxes illus-
trate the span of two consecutive convolutional blocks
(Convolution-normalization-activation layers). The
SNR gets continuously degraded throughout each
block of the network, resulting in numerical insta-
bilities.
Figure 11: Illustration of the impact of depth (in
number of layers N) and negative slope n on the
numerical errors. Both figures show the evolution of
the SNR at the lowest layer of a layer-wise invertible
network with increasing depth and negative slopes.
The lower the SNR, the larger the numerical errors
of the inverse reconstructions. (Left): The SNR
decreases exponentially with depth until it
reaches an SNR value of 1. At this point, the noise is
of the same scale as the signal, and no learning can
happen. These results were computed with a negative
slope of n = 2. (Right): This figure shows the
evolution of the SNR with different negative slopes
n for a layer-wise reversible model of depth 3. On a
log-log scale, this figure shows an almost linear
relationship between negative slope and SNR. It is
striking that, at a depth of only three layers, a negative
slope of n = $10^{-3}$ reaches an SNR superior to 1. With
such a parameterization, even the shallowest models
are not capable of learning.
Figure 12: Impact of the numerical errors on the ac-
curacy of layer-wise invertible models. (Left): Evolu-
tion of a 6-layer model accuracy with and without in-
verse reconstructions with the negative slope. With-
out reconstruction, the model accuracy benefits from
smaller negative slopes. With inverse reconstruc-
tions, the model similarly benefits from smaller neg-
ative slopes as n decreases from 1 to 0.1. For smaller
negative slopes, however, the accuracy sharply de-
creases toward lower values due to numerical errors.
(Right): Evolution of the accuracy with depth for a
negative slope n = 0.2 with and without inverse
reconstructions. Without reconstruction, the model
accuracy benefits from depth. With inverse
reconstructions, the model similarly benefits from depth
as the number of layers grows from 3 to 7. For N > 7,
however, the accuracy sharply decreases toward lower
values due to numerical errors.
Figure 13: Evolution of the SNR through the layers
of a hybrid architecture model. The span of two
consecutive reversible blocks is shown with color boxes.
Within reversible blocks, the SNR quickly degrades
due to the numerical errors introduced by invertible
layers. However, the signal propagated to the input
of each reversible block is recomputed using the
reversible block inverse, which is much more stable.
Hence, we can see a sharp decline of the SNR within
the reversible blocks, but the SNR rises back almost
to its original level at the input of each reversible
block.
Figure 14: Illustration of the impact of depth (in
number of layers N) and negative slope n on the
numerical errors. Both figures show the evolution of
the SNR at the lowest layer of our hybrid architecture
with increasing depth and negative slopes. Our
hybrid architecture greatly reduces the impact of both
depth and negative slopes on the numerical errors.
Figure 15: Impact of the numerical errors on the accuracy
of hybrid architecture models. Our proposed
hybrid architecture greatly stabilizes the numerical
errors, which results in smaller effects of the depth
and negative slope on accuracy.
