











Manuscript version: Author’s Accepted Manuscript 
The version presented in WRAP is the author’s accepted manuscript and may differ from the 
published version or Version of Record. 
 
Persistent WRAP URL: 
http://wrap.warwick.ac.uk/151073                     
 
How to cite: 
Please refer to published version for the most recent bibliographic citation information.  
If a published version is known, the repository item page linked above will contain 
details on accessing it. 
 
Copyright and reuse: 
The Warwick Research Archive Portal (WRAP) makes this work by researchers of the 
University of Warwick available open access under the following conditions.  
 
Copyright © and all moral rights to the version of the paper presented here belong to the 
individual author(s) and/or other copyright owners. To the extent reasonable and 
practicable the material made available in WRAP has been checked for eligibility before 
being made available. 
 
Copies of full items can be used for personal research or study, educational, or not-for-profit 
purposes without prior permission or charge, provided that the authors, title and full 
bibliographic details are credited, a hyperlink and/or URL is given for the original metadata 
page, and the content is not changed in any way. 
 
Publisher’s statement: 
Please refer to the repository item page, publisher’s statement section, for further 
information. 
 
For more information, please contact the WRAP Team at: wrap@warwick.ac.uk. 
 











Partitioning Sparse Deep Neural Networks for Scalable Training and Inference

Gunduz Vehbi Demirci and Hakan Ferhatosmanoglu

ABSTRACT
The state-of-the-art deep neural networks (DNNs) have significant
computational and data management requirements, and the sizes of
both training data and models continue to increase. Sparsification
and pruning methods have been shown to be effective in removing a large
fraction of connections in DNNs. The resulting sparse networks
present unique challenges to further improve the computational
efficiency of training and inference in deep learning. Both the feed-
forward (inference) and backpropagation steps in stochastic gra-
dient descent (SGD) algorithm for training sparse DNNs involve
consecutive sparse matrix-vector multiplications (SpMVs). We first
introduce a distributed-memory parallel SpMV-based solution for
the SGD algorithm to improve its scalability. The parallelization
approach is based on row-wise partitioning of weight matrices
that represent neuron connections between consecutive layers. We
then propose a novel hypergraph model for partitioning weight
matrices to reduce the total communication volume and ensure
computational load-balance among processors. Experiments per-
formed on sparse DNNs demonstrate that the proposed solution
is highly efficient and scalable. By utilizing the proposed matrix
partitioning scheme, the performance of our solution is further
improved significantly.
CCS CONCEPTS
• Computing methodologies → Parallel algorithms; Artifi-
cial intelligence; Machine learning; Distributed computing
methodologies.
KEYWORDS
Scalable Deep Learning, Sparse Deep Neural Networks, Distributed
Stochastic Gradient Descent, Hypergraph Partitioning, Sparse Ma-
trix Vector Multiplication
ACM Reference Format:
Gunduz Vehbi Demirci and Hakan Ferhatosmanoglu. 2021. Partitioning
Sparse Deep Neural Networks for Scalable Training and Inference. In 2021
International Conference on Supercomputing (ICS ’21), June 14–17, 2021, Vir-
tual Event, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.
1145/3447818.3460372
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
ICS ’21, June 14–17, 2021, Virtual Event, USA
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8335-6/21/06. . . $15.00
https://doi.org/10.1145/3447818.3460372
1 INTRODUCTION
Deep neural networks (DNNs) have been extensively utilized in
computer vision, speech recognition, and natural language process-
ing [12, 21, 40]. The state-of-the-art DNN architectures demand
high storage and computational resources due to the large numbers
of parameters (i.e., connection weights) trained over huge datasets.
For instance, AlexNet [40], Deepface [56], VGG16 [54] and GPT-
3 [5] consist of 60M, 120M, 138M and 175B parameters, respectively.
As both the number of parameters and the size of training datasets
continue to increase, it is essential to develop scalable training and
inference solutions.
Neural network pruning and sparsification methods are success-
fully applied to address the storage and computational challenges of
DNNs [19, 24, 35, 42, 46, 55]. These approaches aim at reducing the
amount of memory and computation required to propagate values
through the network, typically by removing unimportant connec-
tions. They improve the efficiency, scalability, and practical feasibility
of DNNs, especially for dynamic applications with low-latency
requirements [64]. Research studies demonstrate that DNNs are
tolerant to the sparsification process [28, 47]. For instance, removal
of 90% of the connections in ResNet-50 [25] incurs only 3% accuracy
loss [18], when trained over ImageNet [17].
Stochastic gradient descent (SGD) is a widely used method for
training DNNs. To achieve large-scale learning tasks, parallel SGD
algorithms for distributed computing systems (e.g., HPC systems,
GPU clusters, TPU pods) are considered in the literature [3, 6, 11,
16, 43, 62, 63]. SGD algorithms that exploit sparsity patterns of
networks should be developed to attain efficient training of sparse
DNNs and retraining of pruned DNNs. Inference (feedforward)
and backpropagation phases of SGD involve consecutive matrix-
vector multiplications in such a way that the output vector of one
layer is fed as input to the next layer. Matrices in each layer store
connection weight parameters between neurons and are updated
during the course of training. In the case of sparse DNNs, these
matrices become sparse so that computations in each layer heavily
depend on sparse matrix-vector multiplications (SpMV).
For large-scale sparse DNNs, we introduce a distributed-memory
parallel SGD solution based on efficient parallelization of SpMVs
performed in feedforward and backpropagation phases. To perform
parallel SpMVs in each layer, matrices and input-output vectors are
row-wise partitioned among processors. This partitioning strategy
achieves model-wise parallelism. This is in contrast to data-parallel
approaches, which require the entire model to be stored by every
processor and which face high bandwidth costs and a memory bottleneck
when performing parameter updates [59]. Our solution reduces mem-
ory requirements and performs efficient parameter updates via
model-wise parallelization and utilizes sparse point-to-point com-
munication operations to alleviate bandwidth and latency costs.
We then propose a hypergraph model for partitioning matri-
ces to further scale and improve the efficiency of parallel SpMV
computations by reducing the communication costs and achiev-
ing computational load-balance among processors. The proposed
model utilizes partitioning with fixed vertices to correctly encode
the communication requirements of processors and dependencies
between successive layers. The partitioning objective of minimizing
the cut size in the hypergraph directly encodes the minimization
of the total communication volume, and load-balancing constraints
enable computational balance among processors.
To evaluate the performance of the proposed training solution
with the hypergraph partitioning model, we conduct extensive ex-
periments on several sparse DNN models provided by the Sparse
Deep Neural Network Graph Challenge [36] and the MNIST data-
base of handwritten digits [41]. Experimental results show that
the parallel SpMV-based sparse DNN training algorithm is highly
efficient and scales to large processor counts, and that the
proposed hypergraph partitioning model provides further perfor-
mance improvements and scalability by significantly reducing both
the bandwidth and latency costs of communication.
The contributions of the paper are as follows:
• We introduce a distributed memory-parallel SGD algorithm
specifically designed for sparse DNNs to achieve model-wise
parallelism.
• To improve parallelization efficiency, we propose a novel
hypergraph-based sparse DNN partitioning model which
reduces communication costs and achieves a computational
balance among processors.
• On a set of sparse DNNs from a benchmark comprising realis-
tic representatives of real-world applications, we performed
extensive experiments to analyze the scalability and effec-
tiveness of the proposed algorithm and partitioning model.
The rest of the paper is organized as follows. Section 2 presents
related work. Section 3 presents preliminaries. Section 4 describes
the proposed distributed-memory parallel SpMV-based SGD solu-
tion for sparse DNNs. Section 5 describes our hypergraph model for
partitioning sparse DNNs. Section 6 presents experimental results
for performance evaluation. Finally, Section 7 concludes the paper.
2 RELATED WORK
Efficient parallel SpMV algorithms for distributed-memory and
shared-memory systems are developed in the literature [2, 53, 60].
Several graph/hypergraph partitioning models are proposed to
improve the performance of parallel SpMV by reducing the com-
munication costs and achieving the load-balance among proces-
sors [7, 26, 34, 38]. Existing approaches, however, are mostly suitable
for cases in which an input matrix is repeatedly multiplied by
a vector and the sparsity pattern of the input matrix does not
change across iterations. Hence, these partitioning models
and parallel SpMV algorithms are not applicable for sparse DNNs,
since each layer is associated with a sparse matrix with different
nonzero patterns. Research is needed to design solutions that ad-
dress the challenges introduced by sparse DNNs and improve their
performance.
Motivated by the computational advantages and reduced sizes
to handle very large data and models, efficient inference compu-
tation on sparse DNNs has attracted significant attention [36].
Parallel algorithms for sparse computations on shared-memory
systems are recently proposed (e.g., GPUs [4, 27, 50, 58], multi-
processors [15, 48, 51]). Since these approaches implement only
inference computation and are not used for training, each input data
vector can be independently processed and distributed parallelism
can be achieved by just splitting the input dataset and replicating
DNN models among multiple compute nodes. Recently, novel tiling
strategies for sparse DNNs are developed to utilize dense matrix
kernels for GPUs [23].
Data-parallel methods are widely used to achieve scalability via
distributed SGD. In these methods, the dataset is partitioned among
multiple compute nodes and local portions of the dataset are pro-
cessed in terms of batches. Additionally, each compute node stores
a local copy of the whole DNN model and depending on the imple-
mentation, synchronous or asynchronous updates are performed
on the model parameters. Data-parallel SGD algorithms necessitate
a large volume of communication between processors since whole
model parameters are transferred at each iteration. Therefore, to
make data-parallel approaches more feasible, the batch size needs to
be increased, but larger batch sizes hurt the training performance of
the SGD algorithm. Additionally, to alleviate the high communica-
tion cost, gradient compression methods are proposed [1, 45]. More
recently, FFT-based gradient sparsification and range-based quanti-
zation methods are applied together to reduce the communication
volume in data-parallel training algorithms [57].
In synchronous data-parallel methods [3, 20, 29, 44, 62], each
node computes local gradients independently, and all processors
collectively perform an All-reduce communication to receive the
average of the gradients and update their local parameters. Recently, com-
munication algorithms to improve the efficiency of All-reduce op-
eration on NVLink-enabled dense GPU systems are proposed [10].
To achieve efficiency in synchronous SGD, larger batch sizes should
be considered, which may result in lower test accuracy. Methods
are proposed to reduce the loss of accuracy due to the use of larger
batch sizes [20, 61].
The requirement for processors to synchronize gradient updates
after processing each batch causes a significant limitation for the
scalability of synchronized SGD. Techniques to overlap communi-
cation and computation are proposed to reduce the overheads of
synchronization [13, 20]. To achieve further performance improve-
ments, asynchronous methods which differ in communication and
update rules are proposed [9, 16, 32, 63]. In asynchronous methods,
at each step, a master node (i.e., parameter server) receives local
gradients from a worker node, updates the global model param-
eters, and then sends the updated model back to that worker;
worker nodes are served in arbitrary order. Federated learning al-
gorithms [8, 39] also fall under this category, where a
subset of clients downloads the most recent model from a central
server and computes updates to it. The clients then send
their model updates to the central server, which aggregates them,
typically by averaging, to improve the global model.
Alternative to data-parallel methods, training approaches that
aim at model-wise parallelism are also considered [30, 31]. For ex-
ample, FlexFlow [31] searches different parallelization strategies
by performing simulations before training. However, this tool is
mainly designed for GPU clusters and does not provide a parti-
tioning on sparse DNNs. The proposed model-wise parallelism
in our SGD algorithm offers inherent scalability, whereas in data-
parallel approaches, each processor holds the whole set of parame-
ters and broadcasts gradients for these parameters to all processors.
Therefore, in data-parallel approaches, the total communication
volume significantly increases with increasing number of proces-
sors, and the local memory size of processors limits the size of
neural networks. As validated in the experiments, the proposed
SGD algorithm reduces the total communication volume, since each
processor only keeps a small set of parameters and broadcasts their
gradients to a small subset of processors.
3 PRELIMINARIES
3.1 Hypergraph Partitioning
Let H = (V, N) denote a hypergraph where V and N are the vertex
and net sets, respectively. Each net nj ∈ N may connect multiple
vertices, and the set of vertices connected by nj is denoted by
pins(nj). Each vertex vi ∈ V is associated with a weight w(vi) and
each net nj ∈ N with a cost cost(nj). A P-way partition of H
is defined as Π = {V1, V2, . . . , VP}, consisting of mutually disjoint and
exhaustive subsets of vertices, i.e., Vm ∩ Vn = ∅ if m ≠ n,
Vm ≠ ∅ for all Vm ∈ Π, and ⋃m Vm = V.
Under a partition Π, a net nj connects to a part Vm if pins(nj) ∩ Vm ≠ ∅.
The set of parts connected by net nj is defined as the
connectivity set Λ(nj), and the number of parts connected
by net nj is defined as its connectivity λ(nj) = |Λ(nj)|. A net nj is said
to be cut if it connects to multiple parts (i.e., λ(nj) > 1) and uncut
otherwise. The cutsize of a partition Π is defined as

cutsize(Π) = ∑_{nj ∈ N} cost(nj) × (λ(nj) − 1)    (1)
The weight of a part Vm ∈ Π is defined as W(Vm) = ∑_{vi ∈ Vm} w(vi).
The partition Π is balanced if it satisfies

W(Vm) ≤ Wavg(1 + ϵ), ∀Vm ∈ Π    (2)

where Wavg = ∑_{vi ∈ V} w(vi)/P is the average part weight and ϵ is
the maximum allowed imbalance ratio.
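To make the cutsize metric in (1) and the balance constraint in (2) concrete, the following small C++ sketch (our own illustrative code; the data layout and helper names are assumptions and not the interface of any partitioning tool) computes the connectivity-1 cutsize and checks balance for a given P-way partition:

#include <set>
#include <vector>

// Connectivity-1 cutsize of Eq. (1): sum over nets of cost(n_j) * (lambda(n_j) - 1).
double cutsize(const std::vector<std::vector<int>>& pins,  // pins(n_j): vertex ids per net
               const std::vector<double>& cost,            // cost(n_j) per net
               const std::vector<int>& part)               // part[v_i] in {0,...,P-1}
{
  double cut = 0.0;
  for (std::size_t j = 0; j < pins.size(); ++j) {
    std::set<int> lambda;                                  // connectivity set Λ(n_j)
    for (int v : pins[j]) lambda.insert(part[v]);
    cut += cost[j] * (static_cast<double>(lambda.size()) - 1.0);
  }
  return cut;
}

// Balance constraint of Eq. (2): W(V_m) <= W_avg * (1 + eps) for every part.
bool balanced(const std::vector<double>& w, const std::vector<int>& part, int P, double eps)
{
  std::vector<double> W(P, 0.0);
  double total = 0.0;
  for (std::size_t i = 0; i < w.size(); ++i) { W[part[i]] += w[i]; total += w[i]; }
  const double Wavg = total / P;
  for (int m = 0; m < P; ++m)
    if (W[m] > Wavg * (1.0 + eps)) return false;
  return true;
}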
The hypergraph partitioning problem for finding a P-way par-
tition with the objective of minimizing the cut size given in (1)
and satisfying balancing constraints in (2) is NP-Hard. There exist
tools that produce quality results for the hypergraph partitioning
problem [7, 33]. These tools also support partitioning hypergraphs
with fixed vertices where some vertices can be assigned to parts
prior to partitioning.
3.2 Stochastic Gradient Descent
Stochastic gradient descent (SGD) is an optimization technique
which is commonly used for training neural networks to iteratively
minimize a loss function over an input dataset. SGD is usually
implemented in two main phases which heavily depend on SpMVs:
(1) Feedforward (inference) phase, (2) Backpropagation phase.
Consider a DNN composed of L layers, where the connection weights
in each layer k = 1, 2, . . . , L are represented by a matrix Wk such
that the weight of the connection from the ith neuron in layer k to the
jth neuron in layer k+1 is stored in the nonzero entry Wk(j, i). In
the inference phase, an input vector x0 is sent through the network
layers to compute an output vector xL. Formally, the inference step
can be given as

xk = f(Wk xk−1 + bk)    (3)

where bk denotes the bias vector and f(·) is a nonlinear activation
function applied to each element of a vector. The bias vector bk can
be embedded in matrix Wk as its first column and the first entry
of each vector xk can be set to one (i.e., the dimension of xk
increases by one). In this simpler form, the feedforward computation
in each layer k becomes xk = f(Wk xk−1).
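The bias-embedding step can be written out explicitly; in LaTeX notation, with the bias placed in the first column of an augmented matrix,

\tilde{W}^k = \begin{bmatrix} b^k & W^k \end{bmatrix}, \qquad
\tilde{x}^{k-1} = \begin{bmatrix} 1 \\ x^{k-1} \end{bmatrix}, \qquad
\tilde{W}^k \, \tilde{x}^{k-1} = W^k x^{k-1} + b^k ,

so a single matrix-vector product recovers the affine map of (3).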
In the backpropagation phase, output vector xL of the inference
step is used for computing gradient vector δL which is backpropa-
gated to compute gradients δk in preceding layers k = 1, 2, . . . ,L−1.
The ith component δk(i) of vector δk denotes the partial derivative of
a loss function J(xL, y) with respect to the total input activation of
the ith neuron in layer k. Vector y is the true label for input vector
x0; the loss function depends on both vectors. Each nonzero entry of
the gradient matrix ∇Wk is computed as

∇Wk(j, i) = δk(j) xk−1(i)    (4)

and the weight parameters are updated by the rule

Wk ← Wk − η∇Wk    (5)

where η denotes the learning rate. The gradient vector δL in the
final layer L is computed as

δL = ∇xL J ⊙ f′(zL)    (6)
where ∇xL J is the vector of derivatives of the loss function J with re-
spect to the outputs of the activation functions in the final layer (i.e.,
xL), f′(zL) is the vector of derivatives of those outputs with re-
spect to the input activations zL = WL xL−1 (i.e., local gradients) in
layer L, and the symbol “⊙” denotes element-wise multiplication. Gra-
dients for layers k = 1, 2, . . . , L−1 are computed by the recursive
formula

δk−1 = (Wk)T δk ⊙ f′(zk−1).    (7)
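For completeness, the recursion in (7) follows from the chain rule applied to the input activations of two consecutive layers; in LaTeX, using zk(j) = ∑_i Wk(j, i) xk−1(i) and xk−1(i) = f(zk−1(i)),

\delta^{k-1}(i) = \frac{\partial J}{\partial z^{k-1}(i)}
= \sum_{j} \frac{\partial J}{\partial z^{k}(j)}
  \frac{\partial z^{k}(j)}{\partial x^{k-1}(i)}
  \frac{\partial x^{k-1}(i)}{\partial z^{k-1}(i)}
= \Big(\sum_{j} W^{k}(j,i)\,\delta^{k}(j)\Big) f'\!\big(z^{k-1}(i)\big),

which is the ith entry of (Wk)T δk ⊙ f′(zk−1).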
Algorithm 1 SGD
Require: T , {Wk }
1: for each x0 ∈ T do
2: for k = 1, 2, . . . ,L do
3: zk =Wkxk−1
4: xk = f (zk )
5: δL = ∇xL J ⊙ f′(zL)
6: for k = L,L−1, . . . , 1 do
7: δk−1 = (Wk )Tδk ⊙ f ′(zk−1)
8: ∇Wk = δk ⊗ xk−1
9: Wk ←Wk − η∇Wk
Figure 1: Illustration of sequences of SpMVs performed in SpFF (Top) and SpBP (Bottom).

Algorithm 1 displays the overall execution of SGD. The for loop
in lines 1–9 is executed over all input vectors in the training dataset
T in such a way that for each input vector x0 ∈ T, the feedforward
and backpropagation steps are executed. Lines 2–4 correspond to the
inference (feedforward) step where repeated SpMVs of the form
zk = Wkxk−1 are performed. Between two consecutive layers,
nonlinear activation function f (·) is applied to each component of
vector zk and the output vector f (zk ) of layer k is fed as input to
the next layer k+1. In line 5, the gradient vector δL is computed
using the output vector xL and the input activations zL of the
final layer L. Lines 6–9 correspond to backpropagation step where
repeated SpMVs of the form δk−1 = (Wk )Tδk are performed to
backpropagate gradient vectors. In line 8, the outer product of gradient
vector δk with vector xk−1 is performed to produce the matrix ∇Wk,
which is used in line 9 to update the weight matrix Wk by the gradient update
rule.
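As an illustration of the feedforward step in lines 2–4 of Algorithm 1, the following C++ sketch performs one layer over a CSR-stored weight matrix; the CSR layout and the sigmoid activation are our own assumptions for the example (the paper does not prescribe a storage format):

#include <cmath>
#include <vector>

// Minimal CSR representation of a sparse weight matrix W^k (an assumed layout).
struct Csr {
  std::vector<int> rowptr;     // size nrows+1
  std::vector<int> colind;     // column index of each nonzero
  std::vector<double> val;     // W^k(i,j) values
};

// One feedforward layer (Algorithm 1, lines 3-4): z^k = W^k x^{k-1}, x^k = f(z^k).
void feedforward_layer(const Csr& W, const std::vector<double>& x_prev,
                       std::vector<double>& z, std::vector<double>& x)
{
  const int nrows = static_cast<int>(W.rowptr.size()) - 1;
  z.assign(nrows, 0.0);
  x.assign(nrows, 0.0);
  for (int i = 0; i < nrows; ++i) {
    for (int p = W.rowptr[i]; p < W.rowptr[i + 1]; ++p)
      z[i] += W.val[p] * x_prev[W.colind[p]];   // row-times-vector inner product
    x[i] = 1.0 / (1.0 + std::exp(-z[i]));       // sigmoid activation f(.)
  }
}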
4 DISTRIBUTED SGD ALGORITHM FOR
SPARSE DNNS
In order to achieve a parallel training algorithm (i.e., parallel SGD)
for sparse DNNs, we develop parallel SpMV-based feedforward
and backpropagation algorithms in Sections 4.1 and 4.2, respec-
tively. That is, parallel sparse feedforward (SpFF) algorithm achieves
parallelization of lines 2–4, whereas parallel sparse backpropaga-
tion (SpBP) algorithm achieves parallelization of lines 5–9 in Algo-
rithm 1.
Figure 1 displays the general execution of the parallel SGD algo-
rithm together with its weight matrix partitioning scheme. In the fig-
ure, only the sequences of SpMV operations are displayed; the
remaining computations are omitted for ease of exposition. In the in-
ference phase, the input vector x0 and the weight matrices W1, W2, . . . , WL
are row-wise partitioned among four processors. For each layer
k = 1, 2, . . . , L, processors perform communication to receive non-
local entries of vector xk−1 and collectively perform the SpMV Wk xk−1
to compute vector xk. The output vector xk computed in layer k is
used as input in the next layer. This process is repeated
until the final layer L, where the gradient vector δL is computed.
In the backpropagation phase, the gradient vector δL is row-wise par-
titioned among processors, whereas the transposes (W1)T, (W2)T, . . . , (WL)T
of the weight matrices are column-wise partitioned.
Algorithm 2 SpFF
Require: x0m , {Wkm }, {Xsendkm }, {Xrecvkm }
1: for all processors Pm in parallel do
2: for k = 1, 2, . . . ,L do
3: for each (Pn , x̄k−1mn ) ∈ Xsendkm do
4: Update x̄k−1mn with entries in xk−1m
5: Non-blocking send x̄k−1mn to Pn
6: zkm =Wkmxk−1m
7: for each (Pn , x̂k−1nm ) ∈ Xrecvkm do
8: Receive nonzero entries x̂k−1nm from process Pn
9: zkm ← zkm +Wkm x̂k−1nm
10: xkm = f (zkm )
The row-wise partitioning of the weight matrices induces the column-wise
partitioning of their transposes. For each layer k = L, L−1, . . . , 1, proces-
sors collectively perform SpMV (Wk )Tδk . Here, since matrices are
column-wise partitioned, processors communicate partial products
contributing to the same nonzero entries of output vector δk−1
instead of communicating entries of input vector δk . These partial
products are summed by processors to get the final values of entries
in gradient vector δk−1 which is used as input in the next layer.
4.1 Parallel Sparse Feedforward
The parallel sparse feedforward (SpFF) performs repeated parallel
SpMV in the form ofWkxk−1 for each layer k=1, 2, . . . ,L. Paral-
lelism is achieved through row-wise partitioning of weight matrices
Wk and input/output vectors xk−1 among processors.
Algorithm 2 displays the overall execution of the proposed SpFF
algorithm. In the algorithm, each processor Pm for m = 1, 2, . . . , P
stores the row blocks Wkm and xk−1m of matrix Wk and vector xk−1,
respectively. Additionally, each processor Pm is provided with maps
Xsendkm and Xrecvkm that map row indices of vector xk−1m to pro-
cessor ids. In this way, each processor knows which xk−1-vector
entries are to be communicated with which processor. Formally, these
sets are defined as
Xsendkm = {(Pn, x̄k−1mn) | x̄k−1mn = xk−1m[rows(xk−1m) ∩ cols(Wkn)], n ≠ m}    (8)

Xrecvkm = {(Pn, x̂k−1nm) | x̂k−1nm = xk−1n[rows(xk−1n) ∩ cols(Wkm)], n ≠ m}    (9)
where cols(·) and rows(·) respectively denote the indices of columns
and rows that contain at least one nonzero entry in a given ma-
trix/vector. xk−1m [·] and xk−1n [·] denote subvectors that are com-
posed of given row indices of vectors xk−1m and xk−1n , respectively.
Hence, for each (Pn , x̄k−1mn ) ∈Xsendkm , processor Pm sends subvec-
tor x̄k−1mn to processor Pn whereas for each (Pn , x̂k−1nm ) ∈ Xrecvkm ,
processor Pm receives subvector x̂k−1nm from processor Pn .
Sets Xsendk and Xrecvk are precomputed by using the sparsity
patterns of weight matrices (i.e., neuron connections) and the row
partitioning of weight matrices among processors. The row-wise
partitioning of weight matrices induces neuron partitioning in each
layer so that all computations related to a neuron are performed by
a single processor. As shown in (8) and (9), to perform Wkmxk−1,
processor Pm needs to receive all xk−1-vector rows corresponding
to column indices in cols(Wkm ). It is important to note that vectors
x̄k−1mn and x̂k−1nm are placeholders that keep coordinates of nonzero
entries. Hence, nonzero entries of these vectors are updated before
used in any operation. For instance, before sending to processor
Pn , nonzero entries of vector x̄k−1mn are updated (i.e., line 4) with the
corresponding entries in xk−1m locally computed in the preceding
layer. Similarly, nonzero entries of vector x̂k−1nm must be received
from processor Pn before it is multiplied by weight matrix Wkm (i.e.,
lines 8–9).
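A possible preprocessing step for these maps (a sketch under our own assumptions about the data structures; the paper only states that the sets are derived from the sparsity patterns and the row partitioning) could look as follows in C++:

#include <map>
#include <set>
#include <vector>

// Derive, for processor P_m, which x^{k-1} entries it must send to / receive from
// every other processor: owner[j] is the processor owning x^{k-1}(j) (i.e., row j
// of W^{k-1}), and cols[n] holds the nonzero column indices cols(W^k_n).
void build_xchg_maps(int P, int m,
                     const std::vector<int>& owner,
                     const std::vector<std::set<int>>& cols,
                     std::map<int, std::vector<int>>& xsend,   // Xsend^k_m: dest -> entry ids
                     std::map<int, std::vector<int>>& xrecv)   // Xrecv^k_m: src  -> entry ids
{
  for (int n = 0; n < P; ++n) {
    if (n == m) continue;
    for (int j : cols[n])                     // P_n needs x^{k-1}(j); P_m owns it -> send
      if (owner[j] == m) xsend[n].push_back(j);
    for (int j : cols[m])                     // P_m needs x^{k-1}(j); P_n owns it -> receive
      if (owner[j] == n) xrecv[n].push_back(j);
  }
}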
In the algorithm, for each layer k = 1, 2, . . . ,L, the for loop in
line 2 is executed in parallel by all processors: In lines 3–5, each
processor Pm performs a non-blocking communication for each
tuple (Pn , x̄k−1mn ) ∈Xsendkm to send its local nonzero entries in x̄k−1mn
to processor Pn. To overlap communication with computation, each
processor performs its local SpMV computation zkm = Wkm xk−1m with-
out waiting for the messages to be received by the recipient processors.
Entries of zk store the total activation values incoming to neurons.
For instance, nonzero entry zk(i) stores the total activation of the
ith neuron in layer k. After the local SpMV computations are performed,
for each tuple (Pn, x̂k−1nm) ∈ Xrecvkm, processor Pm receives vector
x̂k−1nm from processor Pn and multiplies it by Wkm to update the corre-
sponding entries in vector zkm (i.e., lines 7–9). Finally, a nonlinear
activation function (e.g., ReLU or sigmoid) is applied to zkm and
the respective output elements in xkm are computed.
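To make the communication/computation overlap of Algorithm 2 concrete, the following C++/MPI sketch performs one SpFF layer on processor Pm. It is an illustrative simplification under several assumptions (a CSR row block with global column indices, a dense scratch vector indexed by global ids, and an ownership mask), not the authors' implementation:

#include <mpi.h>
#include <cmath>
#include <map>
#include <vector>

struct Csr { std::vector<int> rowptr, colind; std::vector<double> val; };  // local row block W^k_m

// send_rows / recv_rows play the role of Xsend^k_m / Xrecv^k_m (processor id -> global entry ids).
void spff_layer(const Csr& Wm,
                std::vector<double>& x,                  // x^{k-1}, global indexing; non-local entries stale
                const std::vector<char>& is_local,       // is_local[j] == 1 iff P_m owns x^{k-1}(j)
                const std::map<int, std::vector<int>>& send_rows,
                const std::map<int, std::vector<int>>& recv_rows,
                std::vector<double>& z, std::vector<double>& x_out,  // z^k_m and x^k_m (local rows)
                MPI_Comm comm)
{
  // 1) Post non-blocking sends of locally owned entries needed elsewhere (Alg. 2, lines 3-5).
  std::vector<MPI_Request> reqs;
  std::map<int, std::vector<double>> sendbuf;
  for (const auto& kv : send_rows) {
    std::vector<double>& buf = sendbuf[kv.first];
    for (int j : kv.second) buf.push_back(x[j]);
    MPI_Request r;
    MPI_Isend(buf.data(), static_cast<int>(buf.size()), MPI_DOUBLE, kv.first, 0, comm, &r);
    reqs.push_back(r);
  }

  // 2) Local SpMV using only locally owned x entries, overlapping the sends (Alg. 2, line 6).
  const int nloc = static_cast<int>(Wm.rowptr.size()) - 1;
  z.assign(nloc, 0.0);
  for (int i = 0; i < nloc; ++i)
    for (int p = Wm.rowptr[i]; p < Wm.rowptr[i + 1]; ++p)
      if (is_local[Wm.colind[p]]) z[i] += Wm.val[p] * x[Wm.colind[p]];

  // 3) Receive non-local entries and add their contributions (Alg. 2, lines 7-9).
  for (const auto& kv : recv_rows) {
    std::vector<double> buf(kv.second.size());
    MPI_Recv(buf.data(), static_cast<int>(buf.size()), MPI_DOUBLE, kv.first, 0, comm, MPI_STATUS_IGNORE);
    for (std::size_t t = 0; t < kv.second.size(); ++t) x[kv.second[t]] = buf[t];
  }
  for (int i = 0; i < nloc; ++i)
    for (int p = Wm.rowptr[i]; p < Wm.rowptr[i + 1]; ++p)
      if (!is_local[Wm.colind[p]]) z[i] += Wm.val[p] * x[Wm.colind[p]];

  // 4) Apply the activation (Alg. 2, line 10) and complete the outstanding sends.
  x_out.resize(nloc);
  for (int i = 0; i < nloc; ++i) x_out[i] = 1.0 / (1.0 + std::exp(-z[i]));
  MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
}

In an actual implementation the local rows would be remapped to a compact local numbering; the globally indexed scratch vector is used here only to keep the sketch short.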
4.2 Parallel Sparse Backpropagation
The parallel sparse backpropagation (SpBP) works similarly to the SpFF
algorithm: SpBP performs repeated SpMVs of the form
δk−1 = (Wk)T δk in the reverse order of that performed by SpFF.
Since the weight matrices are row-wise partitioned among proces-
sors, each processor Pm for m = 1, 2, . . . , P stores the column block
(Wkm)T of matrix (Wk)T and the row block δkm of gradient vector δk,
respectively. Therefore, each processor Pm multiplies its local gradi-
ent vector δkm by the transpose (Wkm)T of its local weight matrix Wkm
in each layer k.
Algorithm 3 displays the overall execution of SpBP. As a first
step, each processor Pm locally computes gradient vector δLm ac-
cording to Eq. (6) by using the output vector xLm computed in the
inference phase. By executing the for loop in lines 3–13 in parallel,
vector δL is backpropagated through the layers L,L−1, . . . , 1. To
backpropagate vector δk to the preceding layer k−1, an SpMV of the
form skm = (Wkm)T δkm is performed in line 4. Vector skm may contain
partial derivatives contributing to neuron outputs computed on
different processors as well as to local neuron outputs. Nonzeros of
vector skm that are contributing to neurons located on different pro-
cessors are sent to the corresponding processors. Nonzeros that are
contributing to the local neuron outputs are summed with the par-
tial derivatives received from other processors, before multiplying
with local gradients f ′(zk−1). That is, communication operations
are performed on nonzero entries of skm .
Algorithm 3 SpBP
Require: xLm, {(Wkm)T}, {Ssendkm}, {Srecvkm}
1: for all processors Pm in parallel do
2: δLm = ∇xLm J ⊙ f′(zLm)
3: for k = L,L−1, . . . , 1 do
4: skm = (Wkm)T δkm
5: for each (Pn, s̄kmn) ∈ Ssendkm do
6: Update s̄kmn with corresponding entries in skm
7: Non-blocking send s̄kmn to Pn
8: ∇Wkm = δkm ⊗ xk−1m
9: Wkm ← Wkm − η∇Wkm
10: for each (Pn, ŝknm) ∈ Srecvkm do
11: Receive nonzero entries ŝknm from process Pn
12: skm ← skm + ŝknm
13: δk−1m = skm ⊙ f′(zk−1m)
As in SpFF algorithm, each processor is provided with maps
Ssendk and Srecvk , where for each tuple (Pn , s̄kmn ) ∈Ssendkm there
exists (Pn , x̂k−1nm ) ∈Xrecvkm and rows(s̄kmn )=rows(x̂k−1nm ). Similarly,
for each tuple (Pn , ŝknm ) ∈Srecvkm there exists (Pn , x̄k−1mn ) ∈Xsendkm
and rows(ŝknm ) = rows(x̄k−1mn ). That is, if processor Pm receives a
nonzero xk−1(i) from processor Pn , then Pm sends the correspond-
ing gradient contribution sk (i) to Pn . Similarly, if processor Pm
sends a nonzero xk−1(j) to processor Pn , then Pm receives the cor-
responding gradient contribution sk (j) from Pn .
In lines 5–7, each processor Pm performs a non-blocking com-
munication for each tuple (Pn , s̄kmn ) ∈ Ssendkm to send nonzero
entries in s̄kmn to processor Pn. To overlap communication with com-
putation, each processor locally performs the outer product δkm ⊗ xk−1m
without waiting for the messages to be received by recipient pro-
cessors. It is important to note that xk−1m contains nonzero entries
received from other processors in the inference phase. The outer
product produces matrix ∇Wkm which is used to update weight
matrix Wkm in lines 8–9. After updating weight matrices, for each
tuple (Pn , ŝknm ) ∈ Srecvkm , processor Pm receives nonzero entries
in ŝknm from processor Pn and sums the received nonzero entries
with the corresponding entries in skm to compute the final partial
derivatives for the local neuron outputs (i.e., lines 10–12). Finally,
nonzero entries of skm are multiplied with local gradients in line 13
and gradient vector δk−1m for the preceding layer k−1 is computed.
It is important to highlight that only the nonzero entries of skm,
which correspond to rows(xk−1m), are multiplied by the local gradients
f′(zk−1m) and carried into vector δk−1m.
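The local computation of one SpBP layer (Algorithm 3, lines 4 and 8–9) can be sketched in C++ as below; the exchange and accumulation of the partial sums mirrors the x-exchange shown for SpFF and is omitted. The CSR layout and the fused loop are our own simplifications, and the partial-sum vector is assumed to be zero-initialized by the caller:

#include <vector>

struct Csr { std::vector<int> rowptr, colind; std::vector<double> val; };  // local row block W^k_m

// s accumulates (W^k_m)^T delta; the weight update applies the outer product only
// on the sparsity pattern of W^k_m, i.e., only existing connections are updated.
void spbp_layer_local(Csr& Wm,
                      const std::vector<double>& delta,   // local gradient vector δ^k_m
                      const std::vector<double>& x_prev,  // x^{k-1} entries (incl. ones received in SpFF)
                      std::vector<double>& s,             // partial sums s^k_m, global column indexing
                      double eta)                         // learning rate η
{
  const int nloc = static_cast<int>(Wm.rowptr.size()) - 1;
  for (int i = 0; i < nloc; ++i) {
    for (int p = Wm.rowptr[i]; p < Wm.rowptr[i + 1]; ++p) {
      const int j = Wm.colind[p];
      s[j] += Wm.val[p] * delta[i];              // column-wise SpMV: s^k_m = (W^k_m)^T δ^k_m
      Wm.val[p] -= eta * delta[i] * x_prev[j];   // W^k_m(i,j) -= η δ^k_m(i) x^{k-1}(j), Eqs. (4)-(5)
    }
  }
}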
5 HYPERGRAPH PARTITIONING MODEL
FOR SPARSE DNNS
We propose a hypergraph model for partitioning rows of weight
matrices (i.e., neural network) among processors to optimize com-
munication costs of parallel SpMV operations performed by SpFF
and SpBP algorithms. The proposed model adopts a multi-phase
and fixed vertex partitioning approach to correctly encode commu-
nication patterns of processors between consecutive layers.
Our partitioning model consists of L phases ϕk for k=1, 2, . . . ,L.
In each phase ϕk , rows of matrix Wk are partitioned into P parts.
Note that the row-wise partitioning of weight matrix Wk induces
column-wise partitioning of (Wk )T in backpropagation phase. For
each phase ϕk, we define a hypergraph H(ϕk) = (Vk ∪ Fk, Nk),
where for each matrix row Wk(i, :) there exists one vertex vi ∈ Vk,
and for each column Wk(:, j) there exist one fixed vertex vfj ∈ Fk and
one net nj ∈ Nk.
Each vertex vi ∈V k represents row Wk (i, :) (i.e., the ith neuron)
and all computations associated with that row. In the inference
phase, vertex vi represents the task of computing the inner product

zk(i) = Wk(i, :) xk−1 = ∑_{Wk(i,j) ∈ Wk(i,:)} Wk(i, j) xk−1(j)    (10)
which corresponds to the computation of the ith neuron's total
input activation. In the backpropagation phase, vertex vi represents
column (Wk)T(:, i) and the task of computing the multiplications in the
sparse SAXPY/DAXPY operations sk(j) = sk(j) + (Wk)T(j, i) δk(i)
for each nonzero row index j in column (Wk)T(:, i). Additionally,
vertex vi also represents the gradient update operations
Wk (i, :) ←Wk (i, :) − η∇Wk (i, :) (11)
associated with the links connected to the ith neuron in layer k .
Therefore, each vertex is associated with a computational weight
equal to the number of nonzeros in row Wk (i, :) (i.e., number of
links connecting to the ith neuron). Fixed vertices in set Fk do not
represent any computation and are introduced to connect nets to
prespecified parts for correctly encoding input-output dependencies
between consecutive layers in multi-phase partitioning framework.
A P-way partitioning ΠP(ϕk) = {Vk1, Vk2, . . . , VkP} on hyper-
graph H(ϕk) denotes that all tasks corresponding to vertices in
part Vkm ∈ ΠP(ϕk) are assigned to processor Pm. For instance, if
a vertex vi is assigned to part V km , then processor Pm stores row
Wk (i, :) and performs all computation associated with this row. Par-
titioning ΠP induces a partial reordering so that the matrix rows
belonging to the same part can be reordered consecutively (in any
order) to form a row block Wkm which is assigned to processor Pm .
Net set N k simultaneously encodes the total communication
volume of processors during inference and backpropagation phases.
In the inference phase, each net nj ∈ N k represents the set of
tasks (vertices) that need nonzero entry xk−1(j), whereas in the
backpropagation phase, each net nj represents the set of tasks that
contribute to the computation of nonzero entry sk (j). Hence, net
nj connects each vertex vi ∈V k for which the corresponding row
Wk (i, :) has a nonzero entry in the jth column.
In order to satisfy input-output dependencies between succes-
sive layers, each net nj ∈ Nk connects only one fixed vertex vfj ∈ Fk,
and fixed vertex vfj connects only net nj. Fixed vertex vfj repre-
sents nonzero xk−1(j) and is fixed to the same part/processor
to which row Wk−1(j, :) is assigned in the preceding phase ϕk−1
(i.e., the part containing vertex vj in Π(ϕk−1)). This fixing en-
sures that, after partitioning in phase ϕk, net nj ∈ Nk connects
the part/processor that is given the responsibility of computing
nonzero xk−1(j). Formally, the pins of net nj are defined as

pins(nj) = {vi ∈ Vk | j ∈ cols(Wk(i, :))} ∪ {vfj}.    (12)

Figure 2: Cut net n1 encoding communication operations between the neuron represented by vf1 in layer k−1 and neurons v2, v3, v4, v5 in layer k.
In the inference phase, a cut net nj ∈ Nk whose fixed vertex
vfj is assigned to a part Vkm ∈ Λ(nj) implies that nonzero xk−1(j) is
computed by processor Pm and will be sent from Pm to all proces-
sors corresponding to the parts in Λ(nj)\{Vkm}. Hence, net nj incurs
a communication volume of |Λ(nj)| − 1 words in the kth layer of SpFF. In the back-
propagation phase, each processor Pn ∈ Λ(nj)\{Vkm} computes its
contribution to nonzero sk(j) and sends it to processor Pm. Therefore,
net nj incurs a communication volume of |Λ(nj)| − 1 words
in the backpropagation phase as well. As seen here, if a proces-
sor Pm sends a nonzero xk−1(j) to a processor Pn in the inference
phase, processor Pm receives the corresponding gradient contribu-
tion sk(j) from processor Pn. Therefore, the total communication
volume incurred by net nj during the kth layers of SpFF and SpBP
together is 2 × (|Λ(nj)| − 1).
Therefore, if each net nj ∈ Nk is associated with cost(nj) = 2, the
partitioning objective of minimizing the cutsize in phase ϕk en-
codes the minimization of the total communication volume incurred
during the execution of SpFF and SpBP in layer k. Note that every net is
associated with the same cost(nj) = 2, which encodes the number of
nonzeros transferred during the inference and backpropagation phases;
any uniform cost assignment is equally valid for partitioning.
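Putting the pieces together, the hypergraph H(ϕk) of a layer can be assembled directly from the CSR pattern of Wk and the partition of the previous phase. The following C++ sketch uses an illustrative in-memory representation of our own (it is not the input format of PaToH or any other partitioner):

#include <vector>

struct Csr { std::vector<int> rowptr, colind; std::vector<double> val; };

// Hypergraph of phase ϕ_k: free vertices 0..N-1 (rows of W^k), fixed vertices N..2N-1.
struct PhaseHypergraph {
  std::vector<std::vector<int>> pins;   // pins(n_j) for nets j = 0..N-1
  std::vector<int> vwgt;                // w(v_i) = nnz(W^k(i,:)); fixed vertices have weight 0
  std::vector<int> ncost;               // cost(n_j) = 2 for every net
  std::vector<int> fixed_part;          // part id for fixed vertices, -1 for free vertices
};

PhaseHypergraph build_phase_hypergraph(const Csr& Wk, const std::vector<int>& prev_part /* Π(ϕ_{k-1}) */)
{
  const int N = static_cast<int>(Wk.rowptr.size()) - 1;   // neurons per layer
  PhaseHypergraph H;
  H.pins.assign(N, std::vector<int>());
  H.vwgt.assign(2 * N, 0);
  H.ncost.assign(N, 2);
  H.fixed_part.assign(2 * N, -1);
  for (int i = 0; i < N; ++i) {
    H.vwgt[i] = Wk.rowptr[i + 1] - Wk.rowptr[i];          // computational weight of row/neuron i
    for (int p = Wk.rowptr[i]; p < Wk.rowptr[i + 1]; ++p)
      H.pins[Wk.colind[p]].push_back(i);                  // v_i in pins(n_j) iff W^k(i,j) != 0
  }
  for (int j = 0; j < N; ++j) {
    H.pins[j].push_back(N + j);                           // fixed vertex v^f_j connects only net n_j
    H.fixed_part[N + j] = prev_part[j];                   // fix to the owner of x^{k-1}(j)
  }
  return H;
}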
Figure 2 displays an illustrative example where a cut net n1 with
connectivity set Λ(n1) = {Vkx, Vky, Vkz} is given. In the figure, fixed vertex vf1 is pre-
assigned to part Vkx by partitioning Π(ϕk−1), so that the task of
computing xk−1(1) is given to processor Px (i.e., v1 ∈ Vk−1x). Hence,
in the kth step of the inference phase, processor Px sends xk−1(1) to
processors Py and Pz, since the output of the neuron represented by vf1 is connected
to neurons v2, v3 and v4. Here, neuron v5 does not incur communi-
cation since it is assigned to the same processor Px by partitioning
Π(ϕk). Even though the output of neuron vf1 (i.e., neuron v1 in
layer k−1) is connected to two neurons v3 and v4 on processor
Py, nonzero xk−1(1) is sent only once to this processor. So net n1
encodes a communication volume of |Λ(n1)| − 1 = 2 words during
SpFF in layer k. Similarly, in the backpropagation phase, processors
Py and Pz send their partial gradient contributions for sk(1) to processor
Px. The partial gradient contribution of vertex v5 is locally summed
by Px and does not contribute to the total communication volume.
Note that Py sums the partial gradient contributions of its
vertices v3 and v4 before sending a single value to Px. Hence, as
in the inference phase, net n1 encodes the same communication
volume of |Λ(n1)| − 1 = 2 words during SpBP in layer k.
Figure 3 displays an illustrative example of the proposed hyper-
graph partitioning model. The sparse DNN in the top left in the
figure consists of three layers each of which contains four neu-
rons (i.e., x0 corresponds to the input layer). Weight matricesW1
andW2 are displayed in the top right of the figure where connec-
tions between neurons are denoted by nonzero entries. For instance,
neuron 2 in the first layer is represented by rowW1(2, :) where the
columns 1, 2 and 3 have nonzero entries, since neuron 2 connects
neurons 1, 2 and 3 in the input layer. The two subfigures of the lower
part display hypergraphs H (ϕ1) and H (ϕ2) which contain four ver-
tices and four nets corresponding to rows and columns of matrices
W1 andW2, respectively. Additionally, H (ϕ2) contains four fixed
vertices which correspond to the rows of x1. Fixed vertices vf3 and vf4
are preassigned to part V21, whereas fixed vertices vf1 and vf2 are preassigned to
part V22, since v3 and v4 are assigned to V11 whereas v1 and v2 are
assigned to V12 by Π(ϕ1) in the previous layer. That is, nonzeros
x1(3) and x1(4) are computed by the processor P1 whereas the rows
x1(1) and x1(2) are computed by the processor P2. Therefore, in
layer 2, nonzeros x1(3) and x1(4)will be sent from P1 to P2, and the
rows x1(1) and x1(2) will be sent from P2 to P1. In the figure, rows
of the input vector x0 can be assigned to processors with respect to
net connectivities. For instance, row x0(1) can be stored by one of
the processors P1 and P2, since net n1 connects both parts V11 and
V12. On the other hand, net n2 connects only V12 and hence x0(2) is
stored locally by P2 and is not communicated.
The running time complexity of the partitioning phase depends
on the sizes of hypergraphs built in each phase and the partitioning
algorithm/tool used. The sizes of hypergraphs are all linear in the
number of rows, columns and nonzero entries of weight matrices
in each layer. Hence, the complexity of generating hypergraph
H(ϕk) for layer k can be given as Θ(N + nnz(Wk)), where N and
nnz(·) respectively denote the number of neurons per layer (i.e.,
the number of rows and columns of Wk) and the number of nonzero
entries (i.e., connections) in a matrix.
5.1 Discussion
One challenge inherent in the parallel SGD algorithm is that pro-
cessors perform communication between each consecutive layer,
introducing a synchronization barrier. To alleviate synchronization
overheads and improve the parallelization efficiency, input vectors
can be processed in batches at each iteration (i.e., minibatch SGD
can be performed instead of SGD). By simply modifying SpFF, batch
processing can be enabled in such a way that instead of forwarding
a single vector xk between each consecutive layer, multiple vec-
tors can be simultaneously processed in batches. That is, sparse
matrix-matrix multiplications (SpMM) of the form WkXk−1 can
be performed in each layer where Xk−1 is formed by placing mul-
tiple xk−1 vectors as columns in Xk−1. Hence, the main iteration
of the inference step becomes Xk = f (WkXk−1). The gradient vec-
tor δL in the final layer is computed as the averages of gradients
obtained over the vectors in the current batch. The SpBP algo-
rithm is executed in the same way, since a single gradient vector
is backpropagated to update weight parameters. Additionally, the
proposed hypergraph partitioning is still applicable without any
modifications, since the proposed model depends only on the DNN
network structure.
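A sketch of the batched kernel mentioned above, with the batch Xk−1 stored as a dense row-major N×B block (the layout and the batch size B are assumptions for the example):

#include <vector>

struct Csr { std::vector<int> rowptr, colind; std::vector<double> val; };

// Z^k = W^k X^{k-1}: the SpMM form of the feedforward step for a minibatch of B input vectors.
void spmm(const Csr& W, const std::vector<double>& X, int B, std::vector<double>& Z)
{
  const int nrows = static_cast<int>(W.rowptr.size()) - 1;
  Z.assign(static_cast<std::size_t>(nrows) * B, 0.0);
  for (int i = 0; i < nrows; ++i)
    for (int p = W.rowptr[i]; p < W.rowptr[i + 1]; ++p) {
      const int j = W.colind[p];
      const double w = W.val[p];
      for (int b = 0; b < B; ++b)       // each nonzero is reused across all B batch columns
        Z[static_cast<std::size_t>(i) * B + b] += w * X[static_cast<std::size_t>(j) * B + b];
    }
}

Reusing each nonzero across all B batch columns improves arithmetic intensity and amortizes the per-layer synchronization, which is the motivation for minibatch processing described above.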
The proposed hypergraph partitioning model can also be uti-
lized for hybrid systems that provide both shared- and distributed-
memory parallelism such as GPU or multiprocessor clusters. Im-
plementations that utilize “MPI+CUDA” or “MPI+OpenMP” can
benefit from the proposed hypergraph partitioning approach to
reduce communication costs between compute nodes that are con-
nected by slower network connections. In this respect, our local
SpMV computations can be replaced by more efficient libraries
that utilize thread-level parallelism in multiprocessor and GPU ar-
chitectures [4, 15]. Additionally, the proposed hypergraph models
can also be utilized for heterogeneous computation systems by
enforcing different target part weights to distribute different sized
computational loads to processors.
The proposed hypergraph partitioning model and the SpMV-
based SGD can also be utilized for convolution/pooling layers,
which are widely utilized in popular convolutional neural net-
work (CNN) architectures. These layers can be implemented as
matrix-vector multiplications through constructing Toeplitz matri-
ces [22], that capture convolution operation, and converting input
data to vectors. Application of sparsification/pruning to CNNs in-
duces sparsification on the corresponding Toeplitz matrices, making
the proposed hypergraph model applicable to such cases.
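As a small worked example of this construction, a 1-D convolution with a 3-tap kernel (w1, w2, w3) applied to a length-5 input can be written as a banded (Toeplitz-structured) matrix-vector product; pruning a kernel weight zeros the corresponding diagonal, making the matrix sparser:

\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} =
\begin{bmatrix}
w_1 & w_2 & w_3 & 0 & 0 \\
0 & w_1 & w_2 & w_3 & 0 \\
0 & 0 & w_1 & w_2 & w_3
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}.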
6 EXPERIMENTS
6.1 Experimental Setup
We evaluate the performance of the proposed parallel SGD algo-
rithm and hypergraph partitioning model on a benchmark pro-
vided by Sparse Deep Neural Network Graph Challenge1 [36].
The benchmark uses synthetically generated sparse DNN mod-
els and MNIST database of handwritten digits [41]. These sparse
networks are shown to be effective in terms of their training per-
formance [37, 52]. We refer to the parallel training algorithm as
H-SGD if the proposed hypergraph partitioning model is used to
partition the neural networks. Otherwise, we refer to the algorithm
as SGD to denote that random partitioning is utilized where neu-
rons are assigned to processors uniformly at random in each layer.
Random partitioning evenly splits weight matrices by assigning
rows to processors uniformly at random and provides competitive
computation/communication balance.
Sparse DNNs are generated by RadiX-Net synthetic sparse DNN
generator [37] which takes two parameters: the number of layers
and the number of neurons per layer. We used four different sized
sparse DNNs consisting of 120 layers, where the numbers of neurons per
layer are selected as N = 1024, 4096, 16384, and 65536, respectively.

1 https://graphchallenge.mit.edu/data-sets

Figure 3: Top: A 3-layer sparse DNN and the corresponding weight matrices W1 and W2. Bottom: Hypergraphs H(ϕ1) and H(ϕ2) built for the weight matrices. Partitions Π(ϕ1) and Π(ϕ2) induce a 2-way partitioning on the weight matrices. [·] next to a vertex or a part shows the weight associated with that vertex or part, respectively.
The MNIST database consists of 60,000 images of size 28×28 pixels
and these images are scaled to 32×32, 64×64, 128×128 and 256×256.
The scaled images are thresholded and flattened into 0-1 column
vectors to be conformable with the input layers of sparse DNNs.
Running time experiments are performed on a high-performance
computing system in which the compute nodes are Lenovo NeXtScale
nx360 M5 servers with 2× Intel Xeon E5-2630 v3 2.4 GHz (Haswell)
8-core processors (16 cores per node, 203 nodes, 3488 cores, 64 GB
DDR4 memory per node, 4 GB per core). The system provides at most
32 compute nodes (512 cores) to run our parallel codes. Compute
nodes are connected via QLogic TrueScale InfiniBand. To test the
effectiveness of the proposed hypergraph partitioning model as well
as the scalability of the parallel SpMV-based training algorithm,
we performed strong scaling experiments for H-SGD and SGD
on numbers of processors P = 32, 64, 128, 256 and 512. Our SGD
algorithm currently supports single-thread execution where we
assign a single core to each MPI process and run a single thread per
MPI rank. In our HPC system, the total memory of a compute node
is not sufficient to store the whole DNN model for N =65536 (Data-
parallel approaches fail due to memory constraints). Therefore, our
strong scaling experiments start from 32 cores (i.e., 2 nodes).
We implemented the parallel sparse SGD algorithm in C++ and
implemented the inter-process communication operations via the Mes-
sage Passing Interface (MPI). We used the sigmoid function as the nonlinear
activation function f(·) and the mean squared error as the loss function J.
Initial connection weights of the sparse DNNs are chosen uniformly
at random from the interval [−1, 1] and the learning rate is set to
η = 0.01. The proposed hypergraph model is partitioned using
PaToH [7], where the maximum allowed imbalance ratio is set to
ϵ = 0.01 in each layer.
Algorithms that only perform inference computations on sparse
DNNs [4, 15, 27, 49, 51] are not applicable in our general experi-
mental setting. The best performing sparse DNN inference algo-
rithms are generally designed for GPU-based systems and adopt
data-parallelism. In these solutions, the backpropagation phase and
weight update operations are not implemented. Data-parallel SGD
solutions independently process input vectors in parallel and can
not parallelize the computations associated with a single input
vector, which limits the scalability by the batch size in training.
Our solution achieves model-wise parallelism and can process a
single input vector in parallel. Due to this fundamental difference
of objectives and functionalities, we omit comparison against data-
parallel solutions. To the best of our knowledge, our SGD solution
is the first parallel SpMV-based training algorithm that achieves
model-wise parallelism to train sparse DNNs on high-performance
computing systems.
6.2 Performance Results
Table 1 compares the performance of SGD and H-SGD in terms
of the communication volume and message count metrics, which
relate to the bandwidth and latency overheads of parallelization. The
table displays both the average and maximum volume and number of
messages sent by a processor. For each P, the first row displays the ratios of the
respective values attained by H-SGD to those attained by SGD, whereas the
second and third rows display the actual values attained by H-SGD and SGD, respectively.
Table 1: Performance comparison of SGD and H-SGD. For each P, the first H row gives the ratio of H-SGD values to SGD values; the second H row and the R row give the actual values for H-SGD and SGD (random partitioning), respectively.

                          N = 1024                                N = 4096
            Volume        Messages                  Volume        Messages
  P         Avg    Max    Avg    Max    imb         Avg    Max    Avg    Max    imb
  32   H    0.34   0.34   0.96   0.96               0.22   0.22   0.94   0.94
       H    50     52     7      7      1.01        130    134    7      7      1.01
       R    149    154    7      7      1.05        594    603    7      7      1.04
  64   H    0.31   0.32   0.82   0.83               0.23   0.23   0.93   0.93
       H    29     31     12     12     1.01        84     87     14     14     1.01
       R    94     98     15     15     1.05        375    388    15     15     1.05
  128  H    0.29   0.28   0.54   0.55               0.23   0.23   0.75   0.76
       H    15     16     13     14     1.01        48     50     23     23     1.01
       R    53     57     24     25     1.08        212    222    30     30     1.08
  256  H    0.39   0.36   0.49   0.48               0.20   0.20   0.42   0.43
       H    9      10     11     12     1.03        23     24     21     22     1.03
       R    23     26     22     24     1.17        113    119    51     52     1.17
  512  H    0.62   0.53   0.64   0.58               0.25   0.25   0.33   0.34
       H    6      6      9      9      1.05        12     13     15     17     1.05
       R    10     12     14     16     1.24        47     52     46     49     1.24

                          N = 16384                               N = 65536
            Volume        Messages                  Volume        Messages
  P         Avg    Max    Avg    Max    imb         Avg    Max    Avg    Max    imb
  32   H    0.17   0.17   0.92   0.93               0.15   0.15   0.91   0.91
       H    407    412    7      7      1.01        1,439  1,454  7      7      1.01
       R    2,365  2,377  7      7      1.05        9,419  9,454  7      7      1.05
  64   H    0.16   0.16   0.92   0.93               0.13   0.13   0.91   0.91
       H    240    245    14     14     1.01        786    796    14     14     1.01
       R    1,491  1,512  15     15     1.05        5,938  5,973  15     15     1.05
  128  H    0.15   0.16   0.90   0.90               0.12   0.13   0.91   0.91
       H    130    134    27     27     1.01        417    424    27     28     1.02
       R    842    859    30     30     1.08        3,355  3,386  30     30     1.08
  256  H    0.15   0.15   0.64   0.65               0.12   0.12   0.87   0.88
       H    69     71     39     40     1.03        216    221    53     53     1.03
       R    448    462    61     61     1.17        1,786  1,820  61     61     1.17
  512  H    0.14   0.14   0.32   0.33               0.12   0.12   0.57   0.58
       H    32     34     33     34     1.05        109    112    69     70     1.05
       R    231    243    103    105    1.24        922    944    122    122    1.24
In the table, the last column shows the computational imbalance, where the computational
load is measured as the number of floating-point operations.
As seen in Table 1, on all processor counts, H-SGD incurs 38–
71%, 75–80%, 83–86% and 85–88% less average/total communication
volume for sparse DNNs with N = 1024, 4096, 16384 and 65536,
respectively. Similarly, H-SGD incurs 47–72%, 75–80%, 83–86% and
85–88% less maximum send volume. The decrease in the bandwidth-
related costs increases as the size of DNNs increases. The average
and maximum communication volumes of processors are close
to each other which denotes that communication balance is also
achieved via the hypergraph partitioning.
In terms of the message count metrics, H-SGD achieves 4–51%,
6–67%, 8–68% and 9–43% smaller average message counts and 4–
52%, 6–66%, 7–67% and 9–42% smaller maximum message counts.
As the number of processors increases, the performance gap be-
tween the message count metrics of H-SGD and SGD increases
in favor of H-SGD.

Figure 4: Strong scaling of SGD and H-SGD on different sized sparse DNNs consisting of 120 layers, with the number of neurons per layer N = 1024, 4096, 16384 and 65536, respectively.

In terms of computational load balance, H-
SGD provides consistently better performance than SGD. These
results demonstrate the effectiveness of the proposed hypergraph
partitioning model since both the bandwidth- and latency-related
costs are considerably minimized. Moreover, as the number of lay-
ers in sparse DNNs increases, performance improvement of the
hypergraph partitioning model is expected to be higher due to
optimizations achieved in each layer.
Figure 4 shows strong scaling of SGD and H-SGD. On each
processor count, running times are measured as the average time
required to process an input vector by H-SGD and SGD, where the
averages are taken over 10³ randomly selected input vectors. For all
processor counts, H-SGD considerably improves the parallelization
efficiency and runs 2.01–2.37x, 1.97–2.96x, 2.10–3.39x and 2.88–
3.37x faster than SGD on sparse DNNs with N =1024, 4096, 16384
and 65536, respectively. The best speedup is achieved on N =65536
and P = 512 where H-SGD runs 3.39x faster than SGD. H-SpBP
achieves the ideal speedup up to 512 processors on DNN with
N =65536.
The synchronization barrier due to the communication oper-
ations between successive layers constitutes the main source of
latency overheads of the parallel SGD algorithm. As seen in Figure 4,
the efficiency of parallel SGD algorithm considerably improves with
the increasing number of neurons per layer, since latency overheads
are considerably amortized on larger networks.

Figure 5: Breakdown of the running time of H-SGD (solid) and SGD (tiled). “SpMV” corresponds to the time spent on local sparse matrix-vector multiplications, “Updt” to the time spent on gradient update operations, and “Comm” to the time spent on communication operations.

Additionally, the
performance improvement achieved on running time by hyper-
graph partitioning increases with the increasing sizes of DNNs as
well as the increasing number of processors.
In Figure 5, to better analyze the effects of the hypergraph par-
titioning on the performance of SGD, we break down the total
time spent on communication and computation. As seen in the fig-
ure, the proportion of communication time to the overall running
time increases with the increasing number of processors, whereas
the proportion of time spent on local SpMV and gradient update
computations decreases together with the total running time. For
example, when N =65536, the proportion of communication time
respectively increases from 26% to 67% and 40% to 80% for H-SGD
and SGD as the number of processors increases from p = 32 to
p = 512. Hence, the improvements of hypergraph partitioning on
the communication costs become more significant on the overall
Table 2: Throughput results

  Neurons   Layers   H-SpFF Throughput   GB Throughput   Speedup
  1024      120      4.90E+10            7.11E+10        0.69
  1024      480      5.41E+10            8.55E+10        0.63
  1024      1920     5.57E+10            8.89E+10        0.63
  4096      120      3.87E+10            7.38E+10        0.52
  4096      480      3.71E+10            8.58E+10        0.43
  4096      1920     3.63E+10            8.70E+10        0.42
  16384     120      8.20E+10            5.13E+10        1.60
  16384     480      7.91E+10            5.60E+10        1.41
  16384     1920     7.81E+10            5.61E+10        1.39
  65536     120      9.01E+10            2.80E+10        3.21
  65536     480      8.57E+10            2.85E+10        3.01
  65536     1920     8.55E+10            2.85E+10        3.00
running time on larger processor counts. As the number of proces-
sors increases, the ratio of improvement in communication time to
the improvement in the overall execution time gradually increases
from 48% to 82% and 50% to 80% for N = 16384 and N = 65536,
respectively. This can be attributed to the fact that, on larger pro-
cessor counts, communication costs become more dominant in the
overall parallelization overheads, so the reductions achieved by
hypergraph partitioning in the communication volume and message
count metrics become considerably more effective.
We also observe that the hypergraph partitioning improves the
performance of local SpMV and gradient update computations.
Specifically, H-SGD reduces the running time of local computations
by 1.9–2.7x on all processor counts as compared to SGD. The per-
formance improvement on the local computations arises because
hypergraph partitioning consistently achieves better computational
balance and temporal cache-locality than random partitioning. The
hypergraph partitioning assigns weight matrix rows, that are ac-
cessing similar input vector entries, to the same processor, which
provides temporal cache locality in accessing input vector entries
during local SpMV and gradient update computations. We refer
the reader to [2] for a detailed explanation of how temporal cache-
locality is achieved.
6.3 Inference-only Computations
For inference-only computations, we enhanced SpFF by implement-
ing local sparse matrix operations via SuiteSparse:GraphBLAS li-
brary [14]. The enhanced SpFF implementation supports batch pro-
cessing and multi-thread execution. We also use the proposed hy-
pergraph partitioning model and hence, we refer to SpFF as H-SpFF
here. We compare H-SpFF against a data-parallel solution (GB) [15],
which was one of the Graph Challenge 2019 champions. GB uti-
lizes SuiteSparse:GraphBLAS library to achieve shared-memory
parallelism and is able to run on a single compute node. Similar to
GB, H-SpFF processes all input vectors in a single batch.
Table 2 compares throughput values achieved by H-SpFF and
GB for all sparse DNN configurations. Throughput corresponds
to the number of input vectors times the number of
connections in a DNN, divided by the execution time (i.e., the number of
edges processed per second). The best throughput values of H-SpFF
are measured on 512 cores with 128 MPI processes, where we assign
4 cores to each MPI process and run 4 threads per MPI rank. We
run GB on a single node of our local HPC system; the last two
columns in the table display the throughput and relative speedup
values measured on this system. The memory of the standard nodes
was not sufficient for GB; hence we used fat nodes, which are fewer
in number but contain the same CPU configuration with more
memory.
As seen in Table 2, H-SpFF performs slightly worse than GB for
small networks, whereas its performance considerably improves for
larger networks, providing higher speedup values. For network con-
figurations with N =16384, 65536 and L=120, H-SpFF achieves 1.6x
and 3.2x speedups over GB, respectively. This can be attributed to
the fact that the latency overheads introduced by the synchroniza-
tion barrier between successive layers reduce the parallelization
efficiency. The latency overheads are considerably amortized as
the number of neurons per layer increases and the number of lay-
ers decreases. Therefore, H-SpFF is expected to perform better for
network configurations with higher number of neurons and lower
number of layers.
6.4 Partitioning Times
The preprocessing overhead of the partitioning is easily amortized,
since this overhead is independent of the number of
input vectors (i.e., the training data size) fed into the sparse DNNs, whereas
the communication costs and the performance improvement at-
tained by the hypergraph partitioning model increase with the
increasing number of input vectors. Partitioning is performed once
for each layer. The sets Xsend and Xrecv are computed at partitioning
time and are not modified afterwards; hence they do not affect the training runtime. Table 3 dis-
plays the partitioning times for the L=120-layer sparse DNNs used in
our experiments.

Table 3: Partitioning times (secs)

  P     N=1024   N=4096   N=16384   N=65536
  32    2.48     10.93    52.61     344.79
  64    3.41     12.57    63.09     355.03
  128   3.89     13.46    67.46     387.56
  256   4.97     16.77    71.59     408.48
  512   5.63     20.85    77.91     423.17

As seen in the table, as the number of parts and the
number of neurons per layer increase, the partitioning times increase.
Partitioning times are measured on a server with 2× Intel Xeon W-
2245 3.90 GHz 8-core processors and 500 GB DDR4 main memory.
7 CONCLUSION
We first introduced a distributed-memory parallel sparse DNN infer-
ence/training algorithm for high-performance computing systems.
The solution is based on efficient parallelization of consecutive
SpMV operations and achieves model-wise parallelism, which largely
avoids the memory and bandwidth bottlenecks inherent in data-parallel
approaches. We then proposed a novel hypergraph partitioning-based
solution to address the latency overheads caused by the communication
operations between consecutive layers. The hypergraph partitioning
model considerably reduces communication overheads by decreasing the
total communication volume and the number of messages exchanged between
processors while maintaining computational balance. Extensive
experiments suggest that the proposed model-wise parallel solution
scales to large processor counts, especially when the proposed
hypergraph partitioning is utilized. With an increasing number of
neurons per layer and a decreasing number of layers, the latency
overheads between consecutive layers are considerably amortized.
Therefore, in cases where the whole DNN model cannot fit into main
memory and data-parallel approaches are not feasible, the model-wise
parallel inference/training algorithm combined with the hypergraph
partitioning model offers a viable alternative for distributed-memory
systems.
8 ACKNOWLEDGMENTS
Computing resources used were provided by the Scientific Computing
Research Technology Platform (https://warwick.ac.uk/research/rtp/sc/)
at the University of Warwick.
REFERENCES
[1] Alham Fikri Aji and Kenneth Heafield. 2017. Sparse communication for dis-
tributed gradient descent. arXiv preprint arXiv:1704.05021 (2017).
[2] Kadir Akbudak, Enver Kayaaslan, and Cevdet Aykanat. 2013. Hypergraph parti-
tioning based models and methods for exploiting cache locality in sparse matrix-
vector multiplication. SIAM Journal on Scientific Computing 35, 3 (2013), C237–
C262.
[3] Ammar Ahmad Awan, Khaled Hamidouche, Jahanzeb Maqbool Hashmi, and
Dhabaleswar K Panda. 2017. S-caffe: Co-designing mpi runtimes and caffe for
scalable deep learning on modern gpu clusters. In Proceedings of the 22nd ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM,
193–205.
[4] Mauro Bisson and Massimiliano Fatica. 2019. A GPU Implementation of the
Sparse Deep Neural Network Graph Challenge. In 2019 IEEE High Performance
Extreme Computing Conference (HPEC). IEEE, 1–8.
[5] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, et al. 2020. Language models are few-shot learners. arXiv preprint
arXiv:2005.14165 (2020).
[6] Adrián Castelló, Manuel F Dolz, Enrique S Quintana-Ortí, and José Duato. 2019.
Analysis of model parallelism for distributed neural networks. In Proceedings of
the 26th European MPI Users’ Group Meeting. 1–10.
[7] Umit V Catalyurek and Cevdet Aykanat. 1999. Hypergraph-partitioning-based
decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions
on parallel and distributed systems 10, 7 (1999), 673–693.
[8] Zheng Chai, Ahsan Ali, Syed Zawad, Stacey Truex, Ali Anwar, Nathalie Baracaldo,
Yi Zhou, Heiko Ludwig, Feng Yan, and Yue Cheng. 2020. Tifl: A tier-based
federated learning system. In Proceedings of the 29th International Symposium on
High-Performance Parallel and Distributed Computing. 125–136.
[9] Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman.
2014. Project adam: Building an efficient and scalable deep learning training
system. In 11th {USENIX} Symposium on Operating Systems Design and Imple-
mentation ({OSDI} 14). 571–582.
[10] Ching-Hsiang Chu, Pouya Kousha, Ammar Ahmad Awan, Kawthar Shafie Khoras-
sani, Hari Subramoni, and Dhabaleswar K Panda. 2020. Nv-group: link-efficient
reduction for distributed deep learning on modern dense gpu systems. In Pro-
ceedings of the 34th ACM International Conference on Supercomputing. 1–12.
[11] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Ng
Andrew. 2013. Deep learning with COTS HPC systems. In International conference
on machine learning. 1337–1345.
[12] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu,
and Pavel Kuksa. 2011. Natural language processing (almost) from scratch.
Journal of machine learning research 12, Aug (2011), 2493–2537.
[13] Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidya-
nathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, and Pradeep Dubey.
2016. Distributed deep learning using synchronous stochastic gradient descent.
arXiv preprint arXiv:1602.06709 (2016).
[14] Timothy A Davis. 2019. Algorithm 1000: SuiteSparse: GraphBLAS: Graph algo-
rithms in the language of sparse linear algebra. ACM Transactions on Mathemati-
cal Software (TOMS) 45, 4 (2019), 1–25.
[15] Timothy A Davis, Mohsen Aznaveh, and Scott Kolodziej. 2019. Write quick,
run fast: Sparse deep neural network in 20 minutes of development time via
SuiteSparse: GraphBLAS. In 2019 IEEE High Performance extreme Computing
Conference (HPEC). IEEE, 1–6.
[16] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao,
Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. 2012. Large
scale distributed deep networks. In Advances in neural information processing
systems. 1223–1231.
[17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet:
A large-scale hierarchical image database. In 2009 IEEE conference on computer
vision and pattern recognition. IEEE, 248–255.
[18] Trevor Gale, Erich Elsen, and Sara Hooker. 2019. The state of sparsity in deep
neural networks. arXiv preprint arXiv:1902.09574 (2019).
[19] Tong Geng, Tianqi Wang, Chunshu Wu, Chen Yang, Wei Wu, Ang Li, and Martin C
Herbordt. 2019. O3BNN: An out-of-order architecture for high-performance
binarized neural network inference with fine-grained pruning. In Proceedings of
the ACM International Conference on Supercomputing. 461–472.
[20] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski,
Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate,
large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677
(2017).
[21] Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification
with bidirectional LSTM and other neural network architectures. Neural networks
18, 5-6 (2005), 602–610.
[22] Robert M Gray. 2006. Toeplitz and circulant matrices: A review. now publishers
inc.
[23] Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang,
Xiaoying Jia, Xipeng Li, Minyi Guo, and Yuhao Zhu. 2020. Accelerating sparse
DNN models without hardware-support via tile-wise sparsity. In Proceedings of
the International Conference for High Performance Computing, Networking, Storage
and Analysis. 1–15.
[24] Babak Hassibi and David G Stork. 1993. Second order derivatives for network
pruning: Optimal brain surgeon. In Advances in neural information processing
systems. 164–171.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on computer
vision and pattern recognition. 770–778.
[26] B Hendrickson and TG Kolda. [n.d.]. Partitioning Rectangular and Structurally
Nonsymmetric Sparse Matrices for Parallel Processing, submitted to SIAM Journal
of Scientific Computing.
[27] Mert Hidayetoğlu, Carl Pearson, Vikram Sharma Mailthody, Eiman Ebrahimi,
Jinjun Xiong, Rakesh Nagi, and Wen-mei Hwu. 2020. At-Scale Sparse Deep
Neural Network Inference With Efficient GPU Implementation. In 2020 IEEE High
Performance Extreme Computing Conference (HPEC). IEEE, 1–7.
[28] Sara Hooker, Aaron Courville, Yann Dauphin, and Andrea Frome. 2019. Selective
Brain Damage: Measuring the Disparate Impact of Model Pruning. arXiv preprint
arXiv:1911.05248 (2019).
[29] Forrest N Iandola, Matthew W Moskewicz, Khalid Ashraf, and Kurt Keutzer. 2016.
Firecaffe: near-linear acceleration of deep neural network training on compute
clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2592–2600.
[30] Zhihao Jia, Sina Lin, Charles R Qi, and Alex Aiken. 2018. Exploring hidden
dimensions in parallelizing convolutional neural networks. arXiv preprint
arXiv:1802.04924 (2018).
[31] Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond Data and Model
Parallelism for Deep Neural Networks. SysML 2019 (2019).
[32] Peter H Jin, Qiaochu Yuan, Forrest Iandola, and Kurt Keutzer. 2016. How to scale
distributed deep learning? arXiv preprint arXiv:1611.04581 (2016).
[33] George Karypis. 1998. hMETIS 1.5: A hypergraph partitioning package.
http://www.cs.umn.edu/~metis (1998).
[34] Oguz Kaya and Bora Uçar. 2015. Scalable sparse tensor decompositions in dis-
tributed memory systems. In SC’15: Proceedings of the International Conference
for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–11.
[35] Jeremy Kepner, Simon Alford, Vijay Gadepally, Michael Jones, Lauren Milechin,
Albert Reuther, Ryan Robinett, and Sid Samsi. 2020. GraphChallenge.org Sparse
Deep Neural Network Performance. arXiv preprint arXiv:2004.01181 (2020).
[36] Jeremy Kepner, Simon Alford, Vijay Gadepally, Michael Jones, Lauren Milechin,
Ryan Robinett, and Sid Samsi. 2019. Sparse deep neural network graph challenge.
In 2019 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–7.
[37] Jeremy Kepner and Ryan Robinett. 2019. RadiX-Net: Structured Sparse Matrices
for Deep Neural Networks. In 2019 IEEE International Parallel and Distributed
Processing Symposium Workshops (IPDPSW). IEEE, 268–274.
[38] Tamara G Kolda. 1998. Partitioning sparse rectangular matrices for parallel
processing. In International Symposium on Solving Irregularly Structured Problems
in Parallel. Springer, 68–79.
[39] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik,
Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies
for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
[40] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classifica-
tion with deep convolutional neural networks. In Advances in neural information
processing systems. 1097–1105.
[41] Yann LeCun. 1998. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998).
[42] Yann LeCun, John S Denker, and Sara A Solla. 1990. Optimal brain damage. In
Advances in neural information processing systems. 598–605.
[43] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed,
Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling
distributed machine learning with the parameter server. In 11th {USENIX} Sym-
posium on Operating Systems Design and Implementation ({OSDI} 14). 583–598.
[44] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li,
Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. [n.d.]. PyTorch
Distributed: Experiences on Accelerating Data Parallel Training. Proceedings of
the VLDB Endowment 13, 12 ([n. d.]).
[45] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. 2017. Deep
gradient compression: Reducing the communication bandwidth for distributed
training. arXiv preprint arXiv:1712.01887 (2017).
[46] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky.
2015. Sparse convolutional neural networks. In Proceedings of the IEEE conference
on computer vision and pattern recognition. 806–814.
[47] Christos Louizos, Max Welling, and Diederik P Kingma. 2017. Learning Sparse
Neural Networks through L_0 Regularization. arXiv preprint arXiv:1712.01312
(2017).
[48] Mohammad Hasanzadeh Mofrad, Rami Melhem, Yousuf Ahmad, and Mohammad
Hammoud. 2019. Multithreaded Layer-wise Training of Sparse Deep Neural
Networks using Compressed Sparse Column. In 2019 IEEE High Performance
Extreme Computing Conference (HPEC). IEEE, 1–6.
[49] Mohammad Hasanzadeh Mofrad, Rami Melhem, Yousuf Ahmad, and Mohammad
Hammoud. 2020. Studying the effects of hashing of sparse deep neural networks
on data and model parallelisms. In 2020 IEEE High Performance Extreme Computing
Conference (HPEC). IEEE, 1–7.
[50] Lin Ning and Xipeng Shen. 2019. Deep reuse: streamline CNN inference on the
fly via coarse-grained computation reuse. In Proceedings of the ACM International
Conference on Supercomputing. 438–448.
[51] Filip Pawłowski, Rob H Bisseling, Bora Uçar, and AN Yzelman. 2020. Combinato-
rial Tiling for Sparse Neural Networks. In 2020 IEEE High Performance Extreme
Computing Conference (HPEC). IEEE, 1–7.
[52] Ameya Prabhu, Girish Varma, and Anoop Namboodiri. 2018. Deep expander net-
works: Efficient deep networks from graph theory. In Proceedings of the European
Conference on Computer Vision (ECCV). 20–35.
[53] Gerald Schubert, Georg Hager, Holger Fehske, and Gerhard Wellein. 2011. Parallel
sparse matrix-vector multiplication as a test case for hybrid MPI+OpenMP
programming. In 2011 IEEE International Symposium on Parallel and Distributed
Processing Workshops and Phd Forum. IEEE, 1751–1758.
[54] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[55] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from
overfitting. The journal of machine learning research 15, 1 (2014), 1929–1958.
[56] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. 2014. Deepface:
Closing the gap to human-level performance in face verification. In Proceedings
of the IEEE conference on computer vision and pattern recognition. 1701–1708.
[57] Linnan Wang, Wei Wu, Junyu Zhang, Hang Liu, George Bosilca, Maurice Herlihy,
and Rodrigo Fonseca. 2020. FFT-based Gradient Sparsification for the Distributed
Training of Deep Neural Networks. In Proceedings of the 29th International Sym-
posium on High-Performance Parallel and Distributed Computing. 113–124.
[58] Xiaoyun Wang, Zhongyi Lin, Carl Yang, and John D Owens. 2019. Accelerating
DNN Inference with GraphBLAS and the GPU. In 2019 IEEE High Performance
Extreme Computing Conference (HPEC). IEEE, 1–6.
[59] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. 2018. Gradient sparsifica-
tion for communication-efficient distributed optimization. In Advances in Neural
Information Processing Systems. 1299–1309.
[60] Xintian Yang, Srinivasan Parthasarathy, and Ponnuswamy Sadayappan. 2011.
Fast sparse matrix-vector multiplication on GPUs: implications for graph mining.
arXiv preprint arXiv:1103.2405 (2011).
[61] Yang You, Igor Gitman, and Boris Ginsburg. 2017. Scaling sgd batch size to 32k
for imagenet training. arXiv preprint arXiv:1708.03888 6 (2017).
[62] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. 2019.
Fast deep neural network training on distributed systems and cloud TPUs. IEEE
Transactions on Parallel and Distributed Systems 30, 11 (2019), 2449–2462.
[63] Sixin Zhang, Anna E Choromanska, and Yann LeCun. 2015. Deep learning with
elastic averaging SGD. In Advances in neural information processing systems.
685–693.
[64] Michael Zhu and Suyog Gupta. 2017. To prune, or not to prune: exploring
the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878
(2017).
