BPPSA: Scaling Back-propagation by Parallel Scan Algorithm by Wang, Shang et al.
BPPSA: SCALING BACK-PROPAGATION BY PARALLEL SCAN ALGORITHM
Shang Wang 1 Yifan Bai 2 Gennady Pekhimenko 1
ABSTRACT
In an era when the performance of a single compute device plateaus, software must be designed to scale on a
massively parallel system for better runtime performance. However, in the context of training deep learning
models, the commonly used back-propagation (BP) algorithm imposes a strong sequential dependency in the
process of gradient computation. Under model parallelism, BP has a theoretical step complexity of Θ(n) which
hinders its scalability in a parallel computing environment, where n represents the number of compute devices
into which a model is partitioned.
Scan is a primitive operation that performs an in-order aggregation on a sequence of values and returns the partial
result at each step. Parallel algorithms (e.g., Blelloch scan) have been developed to scale the scan operation on
massively parallel systems. In this work, in order to improve the scalability of BP, we reformulate BP into a scan
operation which is then scaled by our modified version of the Blelloch scan algorithm with a theoretical step
complexity of Θ(log n). We evaluate our approach on a vanilla Recurrent Neural Network training with synthetic
datasets, and demonstrate up to 2.75× speedup in terms of the overall training time and 8.8× speedup on the
backward pass alone.
1 INTRODUCTION
The training of deep learning models demands more and
more compute resources as the models become more pow-
erful and complex with an increasing number of layers in
recent years (Krizhevsky et al., 2012; Szegedy et al., 2015;
Simonyan & Zisserman, 2015; He et al., 2015; Huang et al.,
2016). For example, ResNet can have more than a thousand
layers (He et al., 2016), and ResNet-152 takes days to train
on eight state-of-the-art GPUs (Coleman et al., 2017). Now
that the performance of a single compute device plateaus
(Esmaeilzadeh et al., 2011), training has to be designed to
scale on a massively parallel system.
Data parallelism (Shallue et al., 2018) is the most popular
way to scale training by partitioning the training data among
multiple devices, where each device contains a full replica
of the model. As the number of devices increases, data
parallelism faces the trade-off between the synchronization
cost in synchronous parameter updates and staleness in asyn-
chronous parameter updates (Ben-Nun & Hoefler, 2018).
Moreover, data parallelism cannot be applied when a model
does not fit into one device due to memory constraints (e.g.,
1Department of Computer Science, University of Toronto
2Department of Electrical Engineering and Computer Sciences,
University of California, Berkeley; work done while at the
University of Toronto. Correspondence to: Shang Wang
<wangsh46@cs.toronto.edu>.
Preprint. Under review.
caused by deep network architecture, large batch size, or
high input resolution (Rhu et al., 2016; Zhu et al., 2018)).
Model parallelism (Krizhevsky, 2014; Huang et al., 2018;
Narayanan et al., 2019) is another approach to distributed
training which partitions a model and distributes its parts
among devices. It covers a wide spectrum of training deep
learning models where data parallelism does not suffice.
Naïve training under model parallelism does not scale well
with the number of devices due to under-utilization of the
hardware resources, since at most one device can be utilized
at any given point in time (Narayanan et al., 2019). To
address the aforementioned issue, prior works on pipeline
parallelism, including PipeDream (Narayanan et al., 2019)
and GPipe (Huang et al., 2018), propose pipelining across
devices for better resource utilization; however, as the num-
ber of layers and devices increases, pipeline parallelism still
faces the trade-off between the resource utilization in syn-
chronous parameter updates and staleness in asynchronous
parameter updates (Narayanan et al., 2019). Moreover, to
fully fill the pipeline with useful computation, each device
needs to store the activations at the partition boundaries
of all mini-batches that enters the pipeline. Therefore, the
maximum number of devices that pipeline parallelism can
support is limited by the memory capacity of a single device.
Algorithmically, the fundamental reason for this scalabil-
ity limitation observed from prior works is that the back-
propagation (BP) algorithm (Rumelhart et al., 1988) im-
poses a strong sequential dependency between layers during
ar
X
iv
:1
90
7.
10
13
4v
2 
 [c
s.L
G]
  1
0 S
ep
 20
19
BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
Figure 1: BP as a scan operation, scaled by our modified
version of the Blelloch scan algorithm.
the gradient computation. Since computing systems evolve
to have more and more parallel nodes (Esmaeilzadeh et al.,
2011), in this work, we aim at exploring the following ques-
tion: How can BP scale efficiently when the number of layers
and devices keeps increasing in the foreseeable future?
To answer this question, we utilize a primitive operation
called Scan (Blelloch, 1990) that performs an in-order ag-
gregation on a sequence of values and returns the partial
result at each step. Parallel algorithms (Hillis & Steele,
1986; Blelloch, 1990) have been developed to scale the scan
operation on massively parallel systems. We observe that
BP is mathematically similar to a scan operation on the trans-
posed Jacobian matrix (Weisstein, 2019) of each layer and
the gradient vector of the output from the last layer. Inspired
by this key observation, we restructure the strong sequen-
tial dependency of BP, and present a new method to scale
Back-propagation by Parallel Scan Algorithm (BPPSA).
Our major contributions are summarized below.
We reformulate BP as a scan operation and modify the Blel-
loch scan algorithm (Blelloch, 1990) to efficiently scale
BP in a distributed computing environment. Our method
has a theoretical step complexity1 of Θ(log n), where n
represents the number of devices into which a model is par-
titioned, compared to Θ(n) of the naïve implementation of
model parallelism. Moreover, our algorithm does not have
the theoretical scalability limits by the memory capacity of
a single device as pipeline parallelism does. As an example,
Figure 1 shows how BP for training a network composed of
7 layers (blue cubes) can be reformulated into a scan opera-
tion on the transposed Jacobian matrices (blue squares) of
this network and the final gradient vector (yellow squares),
as well as how this scan operation can be scaled by BPPSA.
Generating, storing and processing full Jacobian matrices
are usually considered to be prohibitively expensive. How-
ever, we observe that the Jacobian of many layers can
be extremely sparse where traditionally we can leverage
sparse matrix format (Saad, 2003) to reduce the runtime
and storage costs; more importantly, the positions of input-
1Step complexity (detailed in Section 3.6) quantifies the run-
time of a parallel algorithm.
independent zeros in the Jacobian are deterministic, which
leads to potentially more optimized implementations of
sparse matrix libraries. Based on these observations, we
develop routines to efficiently generate sparse transposed
Jacobian for various operators.
As a proof of concept, we evaluate BPPSA on a vanilla
Recurrent Neural Network (RNN) (Elman, 1990) training
with synthetic datasets. We achieve a maximum 2.75×
speedup in terms of the overall (end-to-end) training time
and a 8.8× speedup on the backward pass, compared to
the baseline BP approach which under-utilizes the GPU.
Moreover, we demonstrate that the retraining of pruned
networks (Han et al., 2015; See et al., 2016; He et al., 2017)
can also be a practical use case of BPPSA.
2 BACKGROUND AND MOTIVATION
2.1 Problem Formulation
We conceptualize a deep learning model as a vector function
f composed of sub-functions ~xi = fi(~xi−1 ; ~θi):
f(. ; ~θ1, ..., ~θn) = f1(. ; ~θ1) ◦ ... ◦ fn(. ; ~θn) (1)
where ~θi, i ∈ {1, ..., n} are the parameters of the
model. The model is evaluated by an objective function
l(f(~x0 ; ~θi, i ∈ {1, ..., n})), where ~x0 is the initial input
to the model. Figure 2 visualizes a convolutional neural
network conceptualized in this formulation.
To train the model f , a first-order optimizer requires the
gradients∇~θi l, which are derived from the gradients∇~xi l:
[∇~θ1 l, ...,∇~θn l]← [(
∂~x1
∂~θ1
)T∇~x1 l, ..., (
∂~xn
∂~θn
)T∇~xn l] (2)
where ∂~xi
∂~θi
is the Jacobian matrix of the output ~xi of fi to its
parameters ~θi. To derive∇~xi l given∇~xn l, BP (Rumelhart
et al., 1988) solves the following recursive equation, from
i = n− 1 to i = 1:
∇~xi l← (
∂~xi+1
∂~xi
)T∇~xi+1 l,∀i ∈ {n− 1, ..., 1} (3)
where ∂~xi+1∂~xi is the Jacobian matrix of the output ~xi+1 of
fi+1 to its input ~xi. Equation 2 itself does not have de-
pendency along i; therefore, the computation of ∇~θi l can
be parallelized if∇~xi l are available. However, Equation 3
imposes a strong sequential dependency along i where the
computation of ∇~xi l can not begin until the computation of
∇~xi+1 l finishes, and therefore, hinders the scalability when
multiple workers (as an abstraction of devices) are available.
2.2 Prior Works
To increase the utilization of hardware resources in model
parallelism, prior works, e.g., PipeDream (Narayanan et al.,
BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
Figure 2: A visualization of the formulation in Section 2.1 on convolutional neural networks. Different parts of the model
can be distributed to different devices (workers).
Figure 3: Timing diagram of the forward pass when dis-
tributing a model via pipeline parallelism. Each color repre-
sents an individual batch.
2019) and GPipe (Huang et al., 2018), propose to pipeline
the computation in the forward and backward passes across
devices. However, these solutions are not “silver bullets” to
scalability due to the following reasons.
First, both PipeDream (Narayanan et al., 2019) and GPipe
(Huang et al., 2018) require storing activations and/or multi-
ple versions of weights for all batches that enter the pipeline.
Although GPipe’s re-materialization (Chen et al., 2016) can
mitigate the memory usage, the theoretical space complex-
ity still grows linearly with the length of the pipeline (i.e.,
the number of devices). As a result, the maximum number
of devices that pipeline parallelism can support is limited
by the memory capacity of a single device (e.g., the GPU
global memory), and such memory capacity is not expected
to grow significantly in a foreseeable future (Mutlu, 2013).
Space Complexity of GPipe Using the notations consis-
tent with GPipe, with re-materialization enabled, each de-
vice reserves Θ(L/K) space for re-computing the interme-
diate activations of each sample in a “micro-batch", where
L and K are the length of the network and the number of
devices in the pipeline correspondingly. As we show in Fig-
ure 3, to fully fill the pipeline with useful computation, the
number of “micro-batches" entering the pipeline (the solid
black box) should be equal to the length of the pipeline (the
dashed black box); thus, each device needs to store at least
Θ(K) activations at the partition boundary for each sample,
resulting in a Θ(L/K +K) per-device space complexity.
Second, if the parameter updates are partially asynchronous,
as proposed in PipeDream (Narayanan et al., 2019), stale-
ness will be introduced. Although Narayanan et al. argue
that the staleness produced by their method does not affect
the update step for a vanilla SGD optimizer (Narayanan
et al., 2019), such an argument would be invalid when com-
bined with other techniques commonly used in first-order op-
timizers (e.g., momentum in Adam (Kingma & Ba, 2015)).
If, however, the gradient updates are fully synchronized as
proposed in GPipe (Huang et al., 2018), the “bubble of idle-
ness” between the forward and backward passes increases
linearly with the length of the pipeline, thus, linearly re-
ducing the hardware utilization and decimating the original
purpose of pipelining.
Our approach fundamentally differs from these key prior
works (Narayanan et al., 2019; Huang et al., 2018) in the
following major ways. First, instead of following the depen-
dency of BP, we reformulate BP so that scaling is achieved
via the Blelloch scan algorithm (Blelloch, 1990) which is
designed for parallelism. Second, the original BP is recon-
structed exactly without introducing new sources of errors
(e.g., staleness); therefore, our method is agnostic to the
exact first-order optimizer being used. Third, our approach
becomes more advantageous as the number of devices in-
creases, instead of diminishing returns or hitting scalability
limits due to linear per-device space complexity.
2.3 Definition of the Scan Operation
For a binary and associative operator ⊕ with an identity
value I , the exclusive scan (a.k.a., prescan) on an input array
[a0, a1, a2, ..., an−1] produces an output array [I, a0, a0 ⊕
a1, a0⊕a1⊕a2, ..., a0⊕...⊕an−2] (Blelloch, 1990). Parallel
scan algorithms were developed as both the importance of
the scan operation and the parallelism of the hardware and
systems increase (Hillis & Steele, 1986; Blelloch, 1990).
3 PROPOSED METHOD: BPPSA
3.1 Back-propagation as a Scan Operation
We define a binary, associative, and non-commutative op-
erator A  B = BA, whose identity value is the identity
matrix I , whereA can be either a matrix or a vector andB is
a matrix. Using operator , we can reformulate Equation 3
BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
as calculation of the following array:
[∇~xn l,∇~xn l  (
∂~xn
∂~xn−1
)
T
,∇~xn l  (
∂~xn
∂~xn−1
)
T  (∂~xn−1
∂~xn−2
)
T
,
...,∇~xn l  (
∂~xn
∂~xn−1
)
T  ...  (∂~x2
∂~x1
)
T
]
(4)
Equation 4 can be interpreted as an exclusive scan operation
of  on the input array:
[∇~xn l, (
∂~xn
∂~xn−1
)T , (
∂~xn−1
∂~xn−2
)T , ..., (
∂~x2
∂~x1
)T , (
∂~x1
∂~x0
)T ] (5)
3.2 Scaling Back-propagation with the Blelloch Scan
Algorithm
We parallelize the computation of Equation 4 on multiple
workers with the Blelloch scan algorithm (Blelloch, 1990),
formally described in Algorithm 1. The algorithm contains
two phases: up-sweep and down-sweep. As an example,
Figure 4 visualizes this algorithm applied on the convolution
layers of VGG-11 (Simonyan & Zisserman, 2015) with
levels L0-L4 as the up-sweep and levels L5-L10 as the down-
sweep. Only the up-sweep phase contains matrix-matrix
multiplications. Due to the non-commutative property of the
operator , we have to reverse the order of operands for 
during the down-sweep phase. This modification is reflected
on line 13 of Algorithm 1 and visualized in Figure 5b.
Algorithm 1 Modified Blelloch Scan Algorithm
Input: a = [∇~xn l, ( ∂~xn∂~xn−1 )
T , ..., (
∂~x1
∂~x0
)T ] . Input array of Equation 5
Output: a = [I,∇~xn l, ...,∇~x1 l] .∇~xi l for Equation 2; computed in-place
1: for d← 0 to dlog(n+ 1)e − 2 do . Up-sweep Phase
2: for all i← 0 to (n− 2d) by 2d+1 do in parallel
3: (l, r)← (i+ 2d − 1,min(i+ 2d+1 − 1, n))
4: a[r]← a[l]  a[r]
5: end for
6: end for
7: a[n]← I
8: for d← dlog(n+ 1)e − 1 to 0 do . Down-sweep Phase
9: for all i← 0 to (n− 2d) by 2d+1 do in parallel
10: (l, r)← (i+ 2d − 1,min(i+ 2d+1 − 1, n))
11: T ← a[l]
12: a[l]← a[r]
13: a[r]← a[r]  T .Modification: Reverse the operands of .
14: end for
15: end for
3.3 Jacobian Matrices in Sparse Format
A full Jacobian matrix ∂~xi+1∂~xi of fi+1(. ;
~θi+1) can be too
expensive to generate, store, and process. In fact, the Ja-
cobian matrix of the first convolution operator in VGG-11
(Simonyan & Zisserman, 2015) processing a 32 × 32 im-
age can occupy 768 MB of memory if it is stored as a full
matrix, which is prohibitively large. Fortunately, Jacobian
matrices of major operators (such as convolution, ReLU,
and max-pooling) are usually extremely sparse as shown in
Figure 6. In comparison, representing the data contained
in the same Jacobian of the aforementioned convolution
operator in the Compressed Sparse Row (CSR) (Saad, 2003)
format shrinks the memory consumption down to only 6.5
MB. We can observe that there are two reasons for zeros to
appear in an operator’s Jacobian: guaranteed zeros that are
input (~x0) invariant (e.g., zeros that are not on the diagonal
of the ReLU’s Jacobian) and related to the model’s archi-
tecture; and possible zeros that depend on the input (e.g.,
zeros on the diagonal of the ReLU’s Jacobian). For any
operator, the positions of guaranteed zeros (named as the
sparsity pattern for brevity) in the Jacobian is deterministic
with the model architecture and known ahead of time. Thus,
mapping non-zero elements in the input matrices to each
non-zero element in the product matrix (e.g., calculating
the number of non-zeros and index merging in CSR matrix-
matrix multiplication (Kunchum et al., 2017)) can be per-
formed prior to training and removed from a generic sparse
matrix multiplication routine (e.g., cuSPARSE (NVIDIA,
2018)) to achieve significantly better performance during
the training phase. The sparsity of guaranteed zeros (de-
fined as the fraction over all elements in a matrix) for various
operators is listed in Table 1. In our implementation, the
transposed Jacobian matrices are represented in the CSR
format since it is the most straightforward and commonly
used sparse matrix format; however, any other sparse matrix
format can be used as an alternative, including a potentially
more efficient customized sparse matrix format that utilizes
the deterministic property of the current sparsity pattern.
3.4 Generating Jacobian Matrix in CSR Analytically
To practically generate the Jacobian for an operator, instead
of generating one column at a time either numerically (Ma-
haffy, 2019) or via automatic differentiation (Paszke et al.,
2017; PyTorch, 2019a), we develop analytical routines to
generate the transposed Jacobian directly into the CSR for-
mat. For example2, Algorithm 2, Algorithm 3 and Algo-
rithm 4 show how to generate the CSR indptr, indices
and data arrays (Varoquaux et al., 2019) respectively for
the transposed Jacobian of a convolution operator that has
a 3 × 3 filter and padding size of 1.3 The last column of
Table 1 shows the speedup on analytical generation of the
transposed Jacobian for different operators in VGG-11 (Si-
monyan & Zisserman, 2015). To build a mature framework
with automatic differentiation capability that performs train-
ing via our method, one would need to build an equivalent
of the cuDNN library (NVIDIA, 2019a) which possesses a
“sparse transposed Jacobian operator" in place of a backward
operator for each forward operator.
2We develop the routines of generating the transposed Jaco-
bian in CSR for ReLU and max-pooling as well. The formal
descriptions are omitted due to space constraints.
3Although the example uses a specific configuration of the
convolution operator, deriving a generic routine is doable.
BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
Figure 4: Applying our algorithm on the convolution layers of VGG-11 (Simonyan & Zisserman, 2015). Blue, orange, and
green squares represent transposed Jacobian matrices, gradient vectors, and symbolic identity matrices respectively. Blue
solid lines, orange solid lines, and yellow dash lines represent matrix-matrix multiplications, matrix-vector multiplications,
and logical data movements (that do not have to be performed explicitly) respectively.
Table 1: The sparsity of guaranteed zeros for various operators.
Operator Filter/Kernel Size Input Size Output Size Sparsity Examples † Analytical Generation Speedup §
Convolution co × ci × hf × wf ci × hi × wi co × ho × wo 1− hfwfhiwi
‡ 0.99157 8.3× 103×
ReLU N/A c× h× w c× h× w 1− 1
chw
0.99998 1.2× 106×
Max-pooling hf × wf ci × hi × wi co × ho × wo 1− hfwfcihiwi 0.99994 1.5× 10
5×
† The examples of sparsity for the first convolution, ReLU and max-pooling operators of VGG-11 (Simonyan & Zisserman, 2015)
operating on 32× 32 images are shown in the second last column of the table.
‡ Approximation when hi and wi are much greater than the padding size.
§ Over generating the transposed Jacobian through PyTorch’s Autograd (Paszke et al., 2017) one column at a time; measured on a Ryzen
Threadripper 1950X (AMD, 2019b) machine; averaged across 1000 trials.
(a) Up-sweep: B ← A B. (b) Down-sweep: A,B ←
B,B A.
Figure 5: Visualizations of the primitive operations per-
formed in the up-sweep and the down-sweep phases.
3.5 Convergence
Theoretically, our algorithm is a reconstruction of BP in-
stead of an approximation, and hence, expected to reproduce
the exact same outputs. However, in practice, numerical dif-
ferences could be introduced due to the change in the order
of matrix multiplications. We apply our algorithm to train
LeNet-5 (Lecun et al., 1998) on CIFAR-10 (Krizhevsky,
2009) to demonstrate that such numerical differences would
not affect model convergence. We use a mini-batch size of
256 and the SGD (Qian, 1999) optimizer with a learning
rate of 0.001 and a momentum of 0.9. The experiments
are seeded with the same constant. Figure 7 shows that the
orange lines overlap with the blue lines for both training
and test losses, which means our algorithm has negligible
impact on the convergence compared to the original BP.4
3.6 Complexity Analysis
Runtime Complexity We leverage the following defini-
tions to quantify the complexity of a parallel algorithm: (1)
step complexity (S) which evaluates the minimum number
of steps to finish the execution on the critical path (end-
to-end) given the number of parallel workers; (2) per-step
complexity (P ) which evaluates the runtime of a single step;
and (3) work complexity (W ) which evaluates the number of
total steps executed by all workers. For brevity, we refer to
4We can derive the same conclusion from Figure 9.
BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
Algorithm 2 Compute the CSR indptr array for the trans-
posed Jacobian of a 3× 3 convolution.
Input: input channels ci, output channels co, input height hi, input width wi
Output: indptr← malloc(cihiwi + 1)
1: for all i← 0 to (cihiwi) do in parallel
2: a← bi/(hiwi)c
3: b← i mod (hiwi)
4: if b 6 wi then
5: indptr[i]← aco(3wi(3hi − 2)) + 6cob
6: else if b 6 wi(hi − 1) then
7: indptr[i]← aco(3wi(3hi − 2)) + 6cowi + 9co(b− wi)
8: else
9:
indptr[i]←aco(3wi(3hi − 2)) + 6cowi + 9co(wi(hi − 2))
+ 6co(b− wi(hi − 1))
10: end if
11: end for
Algorithm 3 Compute the CSR indices array for the trans-
posed Jacobian of a 3× 3 convolution.
Input: input channels ci, output channels co, input height hi, input width wi,
indptr computed from Algorithm 2
Output: indices← malloc(3wi(3hi − 2)cico)
1: for all i← 0 to (cihiwi − 1) do in parallel
2: r ← i mod (hiwi)
3: base← malloc(9co)
4: for all j ← 0 to (co − 1) do in parallel
5: for all k ← 0 to 2 do in parallel
6:
base[9j + 3k : 9j + 3(k + 1)]← (
[−1, 0, 1] + (jhi + k − 1)wi + r) mod (cohiwi)
7: end for
8: end for
9: if r < wi or r > wi(hi − 1) then
10: row← malloc(6co)
11: (left, right)← (3, 9) if r < wi; (0, 6) otherwise
12: for all j ← 0 to (co − 1) do in parallel
13: row[6j : 6j + 6]← base[9j + left : 9j + right]
14: end for
15: else
16: row← base
17: end if
18: indices[indptr[i] : indptr[i+ 1]]← sorted(row)
19: end for
(a) Convolution (b) Max-pooling.
(c) ReLU.
Figure 6: Transposed Jacobian for various operators. Yellow
dots represents locations of non-zero elements in the matrix.
Algorithm 4 Compute the CSR data array for the trans-
posed Jacobian of a 3× 3 convolution.
Input: input channels ci, output channels co, input height hi, input width wi, filter
weights, indptr computed from Algorithm 2
Output: data← malloc(3wi(3hi − 2)cico)
1: for all i← 0 to (cihiwi − 1) do in parallel
2: r ← i mod (hiwi)
3: m← bi/(hiwi)c
4:
range←(1::-1) if (r < wi);
(2:0:-1) if (r > wi(hi − 1));
(2::-1) otherwise
5:
data[indptr[i] : indptr[i+ 1]]← flatten(
weights[:,m, range, ::-1])
6: Fix corner cases when (i mod wi) = 0 or (i mod wi) = (wi − 1).
7: end for
performing the scan operation serially as linear scan, which
is essentially emulating BP by using the transposed Jacobian
and multiplying it with the gradient (as shown in Equation 3)
explicitly. Assuming the system can be conceptualized as
a parallel random-access machine (PRAM) (Kruskal et al.,
1990), the number of workers is p and the size of the input
array in Equation 5 is n+ 1, the step and work complexity
of our algorithm can be derived as:
SBlelloch(n) =
{
Θ(log n) p > n
Θ(n/p+ log p) otherwise
(6)
WBlelloch(n) = Θ(n) (7)
compared to SLinear(n) = Θ(n),WLinear(n) = Θ(n) of
the linear scan (which has the same step and work complex-
ity as BP). Therefore, in an ideal scenario where there is an
unbounded number of workers with unit per-step complex-
ity, our algorithm reduces the runtime of BP from Θ(n)
to Θ(logn). If, however, we consider the difference in
per-step complexity between our algorithm (PBlelloch) and
the baseline (PLinear) due to runtime difference between
matrix-matrix and matrix-vector multiplications, our al-
gorithm has a runtime of Θ(log n)PBlelloch compared to
Θ(n)PLinear in the baseline. There are two approaches to
make our algorithm achieve a lower runtime and better scal-
ing than the baseline. First, we can reduce PBlelloch, which
is reflected in leveraging the sparsity in the transposed Ja-
cobian as analyzed in Section 4.2 and Section 5.2. Second,
without a lower PBlelloch, our algorithm can still outperform
the baseline if PBlelloch/PLinear < Θ(n/ log n). This can
occur when n/ log n grows larger than the dimension of ~xi.
The performance benefit of such case is demonstrated in
Section 4.1 and Section 5.1.
Space Complexity Assuming space of storing a trans-
posed Jacobian matrix is bounded by MJacob and storing
~xi is bounded by M~x (note that MJacob  O(M~x2) due
to sparse matrix formats; both MJacob and M~x are not
functions of p), in our method, each worker requires the
space of MBlelloch(n) = Θ(max(np , 1))MJacob which re-
duces as p increases until a constant MJacob, comparing to
MPipeline = Θ(
n
p + p)M~x of pipeline parallelism which
increases linearly as p increases. Therefore, our method
does not have the limitation of scalability on p, as long as
each worker has the memory capacity of at least MJacob.
BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
0 2000 4000 6000 8000
Iterations (# of batches)
1.2
1.4
1.6
1.8
2.0
2.2
Lo
ss
Baseline, train
Blelloch, train
(a) Training loss per iteration.
0 2000 4000 6000 8000
Iterations (# of batches)
1.4
1.6
1.8
2.0
2.2
Lo
ss
Baseline, test
Blelloch, test
(b) Test loss per iteration.
Figure 7: Training and test loss per iteration for training
LeNet-5 on CIFAR-10. The orange solid lines represent
training via the PyTorch Autograd baseline, while the blue
dash lines represent training via BPPSA.
4 METHODOLOGY
Although there have been many prior works in the indus-
try of training deep learning models on thousands of de-
vices (MLPerf, 2019) which result in p being on the similar
scale of n, setting up an experiment for such a high number
of devices would require a data center of GPUs and re-
implementing/optimizing our entire experiment framework,
which requires both monetary and engineering resources out
of reach for a typical academic research group. Therefore,
we setup a set of small-scale experiments that can reflect the
large-scale workloads to demonstrate the potential perfor-
mance benefits of our method.
Environment Setup Our experiments are performed on
two platforms with RTX 2070 (NVIDIA, 2019b) and RTX
2080Ti (NVIDIA, 2019c) respectively (both are Turing ar-
chitecture GPUs) whose specifications are listed in Table 2.
Baselines We evaluate our method against PyTorch Auto-
grad (Paszke et al., 2017) with cuDNN backend (NVIDIA,
2019a) which is a widely adopted and state-of-the-art imple-
mentation of BP. 5
5We also attempted to prototype a pipeline parallelism imple-
Table 2: Specifications of our experiment platforms.
GPU RTX 2070 RTX 2080Ti
Number of Streaming
Multiprocessors (SMs) 36 68
CUDA (Nickolls et al., 2008) 10.0.130 10.0.130
cuDNN (Chetlur et al., 2014) 7.5.1 7.6.2
PyTorch (Paszke et al., 2017) 1.1.0 1.2.0
CPU Ryzen Threadripper1950X (AMD, 2019b)
EPYC
7601 (AMD, 2019a)
Host Memory 32GB, 2400MHz 128GB, 2133MHz
Linux Kernel (Torvalds, 2019) 4.15.0-55 4.4.0-142
Metrics We use three metrics to quantify the results from
our evaluations: 1) wall-clock time which measures the
system-wide actual time spent on a process, 2) speedup
which is the ratio of the wall-clock time spent on the baseline
over our method, and 3) FLOP which represents the number
of floating-point operations executed.
We leverage two types of benchmarks to empirically evalu-
ate BPPSA: (1) an end-to-end benchmark of a vanilla RNN
training on synthesized datasets to demonstrate the scalabil-
ity benefits of BPPSA on long sequential dependency, and
(2) a micro-benchmark of a pruned VGG-11 (Simonyan &
Zisserman, 2015) to evaluate the feasibility of using sparse
matrix format to reduce the per-step complexity of BPPSA.
4.1 RNN End-to-end Benchmark
We setup experiments of training a RNN (Elman, 1990) on
sequential data, which is a classical example of workloads
where the runtime performance (in terms of the wall-clock
time) is limited due to the strong sequential dependency.
The large number of operators n is modeled through a large
sequence length T , and the large number of workers p is
reflected in the total number of CUDA threads that can be ex-
ecuted concurrently in all SMs of a single GPU normalized
by the mini-batch size B.
Datasets We synthesize the datasets X = {(x(k), c(k))}
of 32000 training samples (i.e., k ∈ {0, 31999}) for the
task of bitstream classification. Each sample consists of a
class label c(k) where c(k) ∈ {0, ..., 9} and a bitstream x(k)
where the value x(k)t at each time step t ∈ {0, ..., T − 1} is
sampled from the Bernoulli distribution (Bernoulli, 1713;
Evans & Rosenthal, 2009):
x
(k)
t ∼ Bernoulli(0.05 + c(k) × 0.1) (8)
Equivalently, each bitstream x(k) can be viewed as a bi-
nomial experiment (Bernoulli, 1713; Evans & Rosenthal,
2009) of class c(k). A set of examples is visualized in Fig-
ure 8. The objective of this task is to classify each bitstream
x(k) into its corresponding class c(k) correctly. We syn-
mentation via CUDA streams in our experiments; however, such
implementation incurs significant overhead, thus performs much
worse than our baseline.
BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
Figure 8: Examples of bitstreams in the bitstream clas-
sification task when the sequence length T = 10. The
expectation of the number of ones in the bitstream x(k) is
T × (0.05 + c(k) × 0.1).
thesize eight datasets with different T , where T increases
up to 30000. In reality, long sequences of input can often
be found in audio signals such as speech (Nagrani et al.,
2017; Baumann et al., 2018; Barker et al., 2018) or music
(Bertin-Mahieux et al., 2011; Benzi et al., 2016).
Model We leverage a vanilla RNN (Elman, 1990) (de-
scribed in Equation 9) to solve the aforementioned task
since RNN is an intuitive, yet classical, deep learning model
and often used to process sequential data:
~h
(k)
t = tanh(Wihx
(k)
t +
~bih +Whh~h
(k)
t−1 +~bhh) (9)
where~h(k)t ,~bih,~bhh ∈ R20. The output classes are predicted
via the softmax function (Bridle, 1990) applied on a linear
transformation to the last hidden states ~h(k)T−1. The cross
entropy (Goodfellow et al., 2016) is used as the loss func-
tion which is optimized in training via the Adam optimizer
(Kingma & Ba, 2015) with the learning rate of 3 × 10−5.
The computation of∇~h(k)t l during the backward pass carries
the strong sequential dependency which is the target for
acceleration via BPPSA.
Implementation Our modified version of the Blelloch
scan algorithm is implemented as two custom CUDA ker-
nels for the up-sweep and down-sweep phases respectively,
along with a few other CUDA kernels for the preparation of
the input transposed Jacobian matrices. Each level during
the up-/down-sweep phase requires a single CUDA ker-
nel launch, therefore synchronization is ensured between
two consecutive levels. Each thread block is responsible
for the  operation (i.e. multiplication in reverse) of two
matrices as well as moving the intermediate results, and
the shared memory is leveraged for caching input and out-
put matrices. Our custom CUDA kernels are integrated
into the Python front-end where the RNN and the train-
ing procedure are defined through PyTorch’s Custom C++
and CUDA Extensions (Goldsborough, 2019). For the for-
ward pass and the baseline of PyTorch Autograd (Paszke
et al., 2017), we are using the PyTorch’s RNN module (Py-
Torch, 2019b) directly which calls into the cuDNN’s RNN
implementations (cudnnRNNForwardTraining and
cudnnRNNBackwardData) (NVIDIA, 2019a), there-
fore, our baseline is already much faster than implementing
RNN in Python using PyTorch’s RNNCell module (Py-
Torch, 2019b) due to GEMM streaming and operation fu-
sions (Appleyard et al., 2016).
4.2 Pruned VGG-11 Micro-benchmark
Despite the recent advances in network pruning algorithms
(Han et al., 2015; See et al., 2016; He et al., 2017), there
is no existing widely adopted software or hardware plat-
form that can exploit performance benefits from pruning, as
most techniques are evaluated through “masking simulation”
which leads to the same (if not worse) runtime and mem-
ory usage. In contrast, in this work, we discover that the
retraining of pruned networks could benefit from BPPSA,
since the values in the Jacobian of a convolution operator
only depend on the filter weights based on Algorithm 4,
and pruning the weights can lead to a higher sparsity in the
Jacobian, which then reduces the per-step complexity of
sparse matrix-matrix multiplications.
To evaluate the feasibility of leveraging the sparsity in the
transposed Jacobian of each operator, we setup a benchmark
on VGG-11 (Simonyan & Zisserman, 2015): training on
CIFAR-10 (Krizhevsky, 2009), pruning away 97% of the
weights in all convolution and linear operators using the
technique proposed by See et al. (See et al., 2016), and
retraining the pruned network. We choose this pruning
percentage so that a similar validation accuracy is reached
(90.1% v.s. 88.9%) after retraining for the same number
of epochs (100) as training. We then apply BPPSA on the
convolutional layers of VGG-11 to compute Equation 3.
Since the sparsity pattern of the transposed Jacobian can
be determined ahead of time from the model architecture
(as we show in Section 3.3), the currently available sparse
matrix libraries which target generic cases are sub-optimal
for our method. For example, cuSPARSE (NVIDIA, 2018)
calculates the number of non-zeros in the product matrix
and merges the indices of the input matrices before it can
perform the multiplication. Such preparations do not need
to repeat across iterations in BPPSA’s case and could be
performed ahead of time due to the deterministic nature
of the sparsity pattern. This, in turn, saves considerable
amount of execution time. Therefore, due to the lack of a fair
implementation, we perform our experiments by calculating
the FLOPs needed for each step in our method and the
baseline implementation through static analysis.
BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
0 500 1000 1500 2000 2500 3000
Wall-clock Time (s)
1.0
1.2
1.4
1.6
1.8
2.0
2.2
Tr
ai
ni
ng
 L
os
s
PyTorch/cuDNN
BPPSA
Figure 9: Training loss across wall-clock time when the
RNN is trained via BPPSA (blue curve) and the PyTorch
Autograd baseline with cuDNN’s RNN backend (red curve).
5 RESULTS
In this section, we present the results from the RNN end-to-
end benchmark and the pruned VGG-11 micro-benchmark
as described in Section 4.1 and Section 4.2.
5.1 RNN End-to-end Benchmark
Figure 9 shows the training curves of loss values with re-
spect to wall-clock time when we train the RNN for 50
epochs on the RTX 2070 GPU with the mini-batch size
B = 16 and the sequence length T = 1000. This experi-
ment can be viewed as the simplest mechanism to process
sequential data such as audio signals. We observe that the
blue curve (BPPSA) is roughly equivalent to the red curve
(PyTorch/cuDNN baseline) scaled down by 54% along the
horizontal axis. We conclude that, in this setting, training
the RNN through BPPSA reconstructs the original back-
propagation algorithm while achieving a 2.17× speedup on
the overall training time and 4.53× on the BP runtime.
Sensitivity Analysis We measure the performance vari-
ation as the sequence length T and the batch size B vary,
since those two parameters represent the total number of
operators n and the number of (normalized) workers p re-
spectively — the key variables in the theoretical runtime of
our method. To estimate the speedup on the overall training
time, we measure the wall-clock time of training via BPPSA
for a single epoch, and take the average of 10 measurements
from different epochs. We then compare against training
via the PyTorch/cuDNN baseline measured in the same way.
We can also derive the speedup on only the backward pass
by measuring the runtime of the training procedure with-
out actually performing the backward pass, and subtracting
from the total runtime (including the overhead of preparing
the input transposed Jacobian matrices).
Figure 10a and Figure 10b show how changing the sequence
10 30 100 300 1k 3k 10k 30k
Sequence Length
0
1
2
3
4
5
6
Sp
ee
du
p 
ov
er
 B
as
el
in
e 
(P
yT
or
ch
/c
uD
NN
)
RTX 2070
RTX 2080Ti
(a) The speedup on the backward pass as the sequence length T
increases.
10 30 100 300 1k 3k 10k 30k
Sequence Length
0.0
0.5
1.0
1.5
2.0
2.5
Sp
ee
du
p 
ov
er
 B
as
el
in
e 
(P
yT
or
ch
/c
uD
NN
)
RTX 2070
RTX 2080Ti
(b) The overall speedup as the sequence length T increases.
256 128 64 32 16 8 4 2
Mini-batch Size
0
2
4
6
8
Sp
ee
du
p 
ov
er
 B
as
el
in
e 
(P
yT
or
ch
/c
uD
NN
)
RTX 2070
RTX 2080Ti
(c) The speedup on the backward pass as the batch sizeB decreases
(equivalently, as the “effective" p increases).
256 128 64 32 16 8 4 2
Mini-batch Size
0.0
0.5
1.0
1.5
2.0
2.5
Sp
ee
du
p 
ov
er
 B
as
el
in
e 
(P
yT
or
ch
/c
uD
NN
)
RTX 2070
RTX 2080Ti
(d) The overall speedup as the batch sizeB decreases (equivalently,
as the “effective" p increases).
Figure 10: The speedup in training of a single epoch (mea-
sured and averaged across 10 epochs) of our method against
the baseline as the sequence length T and the batch size B
varies. The blue and orange bars represent speedup on the
RTX 2070 and 2080Ti GPUs. The red dash line represents
the PyTorch Autograd (Paszke et al., 2017) baseline with
cuDNN’s RNN backend (Appleyard et al., 2016).
BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
length T affects the backward pass and overall training
time respectively. We make three observations from these
two figures. First, our method scales as n increases when
n is relatively in the same range as p. Second, when n
increases to be much larger than p, the performance starts
to be bounded by p. Third, even in the range of overly large
n, our method still achieves better utilization on massively
parallel hardware than the baseline.
Figure 10c and Figure 10d show how changing the batch
size B affects the backward pass and overall training time
respectively. We can conclude that BPPSA scales as the
“effective" number of workers p per sample increases (equiv-
alently, as the batch sizeB decreases, since the total number
of SMs in the GPU is constant). In reality, determining the
appropriate mini-batch size can be nontrivial: training with
large batch can lead to “generalization gap" (Keskar et al.,
2016); while training with small batch would under-utilize
the hardware resources and lead to longer training time. In
this experiment, BPPSA can be viewed as offering an alter-
native to train with smaller mini-batch while utilizing the
hardware resources more efficiently than BP.
By comparing the speedup in Figure 10a and Figure 10c
between RTX 2070 and RTX 2080Ti, since RTX 2080Ti
has a higher number of SMs than RTX 2070 (68 v.s. 36
(NVIDIA, 2019c;b)), we can observe that: 1) RTX 2080Ti
achieves its maximum speedup at a higher sequence length
compared with RTX 2070; 2) as the batch size B increases,
the speedup of BPPSA on RTX 2080Ti drops at a slower rate
than RTX 2070. These two observations are consistent with
the aforementioned conclusions regrading the performance
variation with the number of workers p. The maximum
backward pass and overall speedup can be observed from
RTX 2080Ti, which are 2.75× and 8.8× respectively.
5.2 Pruned VGG-11 Micro-benchmark
Since the sparsity of the product matrix might reduce after
each multiplication, the per-step complexity might increase
as the up-sweep phase progresses into deeper levels. Fortu-
nately, we can adopt BPPSA to balance the number of levels
in the up-/down-sweep phases according to the sparsity of
the products on each level to achieve an overall speedup.
Specifically, in this experiment, BPPSA performs the up-
sweep phase from L0 to L2 (consistent with the notations
in Figure 4), calculates the partial results that are needed
for the down-sweep phase through linear scan, and then
performs the down-sweep phase from L7 to L10.
Assuming the sparse transposed Jacobian matrices are en-
coded in the CSR format, Figure 11 shows the calculated
FLOP of each step in BPPSA and each “gradient opera-
tor" in the baseline (BP) for re-training pruned VGG-11 on
CIFAR-10. We can observe that the green circles (base-
line) have similar expected performance as the other circles
106 107 108 109 1010 1011 1012 1013 1014
m x n x k
102
103
104
105
106
107
108
FL
OP
mv
mm
mv, critical
mm, critical
baseline (BP)
Figure 11: Measuring FLOP for each step when retraining
pruned VGG-11 on CIFAR-10. The orange and blue circles
represent matrix-vector and matrix-matrix multiplications
in BPPSA respectively. A filled circle indicates that the step
is on the critical path. The x-axis represents the theoretical
runtime complexity of the step if the transposed Jacobian
were not encoded in a sparse format. The green circles
represent the FLOP estimated for each “gradient operator"
in the baseline of the original BP.
(BPPSA). Thus, we can conclude that exploiting the sparsity
in the transposed Jacobian is an efficient strategy that re-
duces the per-step complexity of our method PBlelloch to a
level similar with the baseline PLinear. This strategy makes
the overall scalability to be “guaranteed" algorithmically.
6 CONCLUSION
In this work, we explore a novel direction to scale BP by
challenging its fundamental limitation of the strong sequen-
tial dependency. We reformulate BP into a scan operation
which is scaled by our modified version of the Blelloch scan
algorithm. Our proposed algorithm, BPPSA, achieves a
logarithmic runtime complexity rather than linear. In addi-
tion, BPPSA has a constant per-device space complexity;
hence, its scalability is not limited by the memory capacity
of each device. In our detailed evaluations, we demonstrate
that overall speedup can be already achieved in two im-
portant use cases. First, for the case where there is a long
dependency in BP, we evaluate BPPSA on training a RNN
with synthetic datasets where our method achieves up to
2.75× speedup on the overall (end-to-end) training time
and 8.8× speedup on the backward pass alone. Second,
we can reduce the per-step complexity by leveraging the
sparsity in the Jacobian itself. To this end, we develop ef-
ficient routines to generate the transposed Jacobian in the
CSR format, and demonstrate that the retraining of pruned
networks can potentially benefit from BPPSA (as we show
for a pruned VGG-11 benchmark when re-training on the
CIFAR-10 dataset). We hope that our work will inspire
radically new ideas and designs to improve distributed DNN
training beyond the existing theoretical framework.
BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
REFERENCES
AMD. Epyc 7601 dual amd server processors.
https://www.amd.com/en/products/cpu/
amd-epyc-7601, 2019a. Accessed: 2019-08-28.
AMD. Ryzen threadripper 1950x processor.
https://www.amd.com/en/products/cpu/
amd-ryzen-threadripper-1950x, 2019b.
Accessed: 2019-08-28.
Appleyard, J., Kociský, T., and Blunsom, P. Optimizing
performance of recurrent neural networks on gpus. CoRR,
abs/1604.01946, 2016. URL http://arxiv.org/
abs/1604.01946.
Barker, J., Watanabe, S., Vincent, E., and Trmal, J. The
fifth ’chime’ speech separation and recognition challenge:
Dataset, task and baselines. CoRR, abs/1803.10609, 2018.
URL http://arxiv.org/abs/1803.10609.
Baumann, T., Köhn, A., and Hennig, F. The spoken
wikipedia corpus collection: Harvesting, alignment
and an application to hyperlistening. Language Re-
sources and Evaluation, Jan 2018. doi: 10.1007/
s10579-017-9410-y. URL https://doi.org/10.
1007/s10579-017-9410-y.
Ben-Nun, T. and Hoefler, T. Demystifying parallel and dis-
tributed deep learning: An in-depth concurrency analysis.
CoRR, abs/1802.09941, 2018. URL http://arxiv.
org/abs/1802.09941.
Benzi, K., Defferrard, M., Vandergheynst, P., and Bres-
son, X. FMA: A dataset for music analysis. CoRR,
abs/1612.01840, 2016. URL http://arxiv.org/
abs/1612.01840.
Bernoulli, J. Jacobi Bernoulli, ... Ars conjectandi, opus
posthumum. Accedit Tractatus de seriebus infinitis, et
epistola Gallice scripta De ludo pilae reticularis. impen-
sis Thurnisiorum, fratrum, 1713.
Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere,
P. The million song dataset. In Proceedings of the 12th
International Conference on Music Information Retrieval
(ISMIR 2011), 2011.
Blelloch, G. E. Prefix sums and their applications. Technical
Report CMU-CS-90-190, School of Computer Science,
Carnegie Mellon University, Nov. 1990.
Bridle, J. S. Probabilistic interpretation of feedforward clas-
sification network outputs, with relationships to statistical
pattern recognition. In Soulié, F. F. and Hérault, J. (eds.),
Neurocomputing, pp. 227–236, Berlin, Heidelberg, 1990.
Springer Berlin Heidelberg.
Chen, T., Xu, B., Zhang, C., and Guestrin, C. Train-
ing deep nets with sublinear memory cost. CoRR,
abs/1604.06174, 2016. URL http://arxiv.org/
abs/1604.06174.
Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran,
J., Catanzaro, B., and Shelhamer, E. cudnn: Efficient
primitives for deep learning. CoRR, abs/1410.0759, 2014.
URL http://arxiv.org/abs/1410.0759.
Coleman, C. A., Narayanan, D., Kang, D., Zhao, T. J.,
Zhang, J., Nardi, L., Bailis, P., Olukotun, K., Re, C. B.,
and Zaharia, M. A. Dawnbench : An end-to-end deep
learning benchmark and competition. 2017.
Elman, J. L. Finding structure in time. Cognitive Science, 14
(2):179–211, 1990. doi: 10.1207/s15516709cog1402\_1.
Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam,
K., and Burger, D. Dark silicon and the end of multicore
scaling. In Proceedings of the 38th Annual International
Symposium on Computer Architecture, ISCA ’11, pp.
365–376. ACM, 2011.
Evans, M. and Rosenthal, J. Probability and Statistics: The
Science of Uncertainty. W. H. Freeman, 2009. ISBN
9781429281270.
Goldsborough, P. Custom c++ and cuda exten-
sions. https://pytorch.org/tutorials/
advanced/cpp_extension.html, 2019. Ac-
cessed: 2019-08-28.
Goodfellow, I., Bengio, Y., and Courville, A. Deep
Learning. MIT Press, 2016. http://www.
deeplearningbook.org.
Han, S., Pool, J., Tran, J., and Dally, W. Learning both
weights and connections for efficient neural network. In
Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M.,
and Garnett, R. (eds.), Advances in Neural Information
Processing Systems 28, pp. 1135–1143. 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-
ing for image recognition. CoRR, abs/1512.03385, 2015.
URL http://arxiv.org/abs/1512.03385.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings
in deep residual networks. CoRR, abs/1603.05027, 2016.
URL http://arxiv.org/abs/1603.05027.
He, Y., Zhang, X., and Sun, J. Channel pruning
for accelerating very deep neural networks. CoRR,
abs/1707.06168, 2017. URL http://arxiv.org/
abs/1707.06168.
Hillis, W. D. and Steele, Jr., G. L. Data parallel algorithms.
Commun. ACM, 29(12):1170–1183, December 1986.
BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
Huang, G., Liu, Z., and Weinberger, K. Q. Densely con-
nected convolutional networks. CoRR, abs/1608.06993,
2016. URL http://arxiv.org/abs/1608.
06993.
Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le,
Q. V., and Chen, Z. Gpipe: Efficient training of gi-
ant neural networks using pipeline parallelism. CoRR,
abs/1811.06965, 2018. URL http://arxiv.org/
abs/1811.06965.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy,
M., and Tang, P. T. P. On large-batch training for deep
learning: Generalization gap and sharp minima. CoRR,
abs/1609.04836, 2016. URL http://arxiv.org/
abs/1609.04836.
Kingma, D. P. and Ba, J. Adam: A method for stochastic op-
timization. In 3rd International Conference on Learning
Representations, ICLR 2015, 2015.
Krizhevsky, A. Learning multiple layers of features from
tiny images. Technical report, 2009.
Krizhevsky, A. One weird trick for parallelizing convo-
lutional neural networks. CoRR, abs/1404.5997, 2014.
URL http://arxiv.org/abs/1404.5997.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet
classification with deep convolutional neural networks. In
Proceedings of the 25th International Conference on Neu-
ral Information Processing Systems - Volume 1, NIPS’12,
pp. 1097–1105, 2012.
Kruskal, C. P., Rudolph, L., and Snir, M. A complex-
ity theory of efficient parallel algorithms. Theoretical
Computer Science, 71(1):95 – 132, 1990. doi: https:
//doi.org/10.1016/0304-3975(90)90192-K.
Kunchum, R., Chaudhry, A., Sukumaran-Rajam, A., Niu,
Q., Nisa, I., and Sadayappan, P. On improving perfor-
mance of sparse matrix-matrix multiplication on gpus. In
Proceedings of the International Conference on Super-
computing, ICS ’17, pp. 14:1–14:11, 2017.
Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-
based learning applied to document recognition. In Pro-
ceedings of the IEEE, pp. 2278–2324, 1998.
Mahaffy, J. Numerical evaluation of jacobians - per-
sonal.psu.edu. http://www.personal.psu.
edu/jhm/ME540/lectures/NumJacobian.
html, 2019. Accessed: 2019-08-28.
MLPerf. Mlperf training v0.6 results. https://mlperf.
org/training-results-0-6/, 2019. Accessed:
2019-08-28.
Mutlu, O. Memory scaling: A systems architecture per-
spective. pp. 21–25, 05 2013. doi: 10.1109/IMW.2013.
6582088.
Nagrani, A., Chung, J. S., and Zisserman, A. Vox-
celeb: a large-scale speaker identification dataset. CoRR,
abs/1706.08612, 2017. URL http://arxiv.org/
abs/1706.08612.
Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V.,
Devanur, N., Granger, G., Gibbons, P., and Zaharia, M.
Pipedream: Generalized pipeline parallelism for dnn
training. In SOSP 2019, October 2019.
Nickolls, J., Buck, I., Garland, M., and Skadron, K. Scalable
parallel programming with cuda. Queue, 6(2):40–53,
March 2008. ISSN 1542-7730.
NVIDIA. cusparse :: Cuda toolkit documentation,
2018. URL https://docs.nvidia.com/cuda/
cusparse/index.html. [Accessed: 2018-11-06].
NVIDIA. cudnn developer guide :: Deep
learning sdk documentation. https://
docs.nvidia.com/deeplearning/sdk/
cudnn-developer-guide/index.html, 2019a.
Accessed: 2019-08-28.
NVIDIA. Geforce rtx 2070 graphics card | nvidia.
https://www.nvidia.com/en-us/geforce/
graphics-cards/rtx-2070/, 2019b. Accessed:
2019-08-28.
NVIDIA. Geforce rtx 2080 ti graphics card | nvidia.
https://www.nvidia.com/en-us/geforce/
graphics-cards/rtx-2080-ti/, 2019c. Ac-
cessed: 2019-08-28.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,
DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer,
A. Automatic differentiation in PyTorch. In NIPS Autodiff
Workshop, 2017.
PyTorch. How to compute jacobian matrix in py-
torch? https://discuss.pytorch.org/t/
how-to-compute-jacobian-matrix-in-pytorch/
14968, 2019a. Accessed: 2019-08-28.
PyTorch. torch.nn — pytorch master documenta-
tion. https://pytorch.org/docs/stable/
nn.html, 2019b. Accessed: 2019-08-28.
Qian, N. On the momentum term in gradient descent learn-
ing algorithms. Neural Netw., 12(1):145–151, January
1999.
Rhu, M., Gimelshein, N., Clemons, J., Zulfiqar, A., and
Keckler, S. W. Virtualizing deep neural networks
BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
for memory-efficient neural network design. CoRR,
abs/1602.08124, 2016. URL http://arxiv.org/
abs/1602.08124.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Neuro-
computing: Foundations of research. chapter Learning
Representations by Back-propagating Errors, pp. 696–
699. MIT Press, 1988.
Saad, Y. Iterative Methods for Sparse Linear Systems. Soci-
ety for Industrial and Applied Mathematics, Philadelphia,
PA, USA, 2nd edition, 2003.
See, A., Luong, M., and Manning, C. D. Compression of
neural machine translation models via pruning. CoRR,
abs/1606.09274, 2016. URL http://arxiv.org/
abs/1606.09274.
Shallue, C. J., Lee, J., Antognini, J. M., Sohl-Dickstein,
J., Frostig, R., and Dahl, G. E. Measuring the effects
of data parallelism on neural network training. CoRR,
abs/1811.03600, 2018. URL http://arxiv.org/
abs/1811.03600.
Simonyan, K. and Zisserman, A. Very deep convolutional
networks for large-scale image recognition. In 3rd Inter-
national Conference on Learning Representations, ICLR
2015, 2015.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich,
A. Going deeper with convolutions. In Computer Vision
and Pattern Recognition (CVPR), 2015.
Torvalds, L. Linux kernel source tree. https://github.
com/torvalds/linux, 2019. Accessed: 2019-08-
28.
Varoquaux, G., Gouillart, E., and Vahtras, O. Com-
pressed sparse row format (csr), 2019. URL
https://scipy-lectures.org/advanced/
scipy_sparse/csr_matrix.html. Accessed:
2019-05-23.
Weisstein, E. W. "jacobian." from mathworld–a wol-
fram web resource. http://mathworld.wolfram.
com/Jacobian.html, 2019. Accessed: 2019-08-28.
Zhu, H., Akrout, M., Zheng, B., Pelegris, A., Jayarajan,
A., Phanishayee, A., Schroeder, B., and Pekhimenko, G.
Benchmarking and analyzing deep neural network train-
ing. In 2018 IEEE International Symposium on Workload
Characterization, IISWC 2018, pp. 88–100, 2018. doi:
10.1109/IISWC.2018.8573476.
