Batched Sparse Matrix Multiplication for Accelerating Graph
  Convolutional Networks by Nagasaka, Yusuke et al.
 
IEEE Copyright Notice 
 
 
 
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained 
for all other uses, in any current or future media, including reprinting/republishing this material 
for advertising or promotional purposes, creating new collective works, for resale or redistribution 
to servers or lists, or reuse of any copyrighted component of this work in other works. 
 
 
 
Accepted to be Published in: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud 
and Grid Computing (CCGRID), May 14-17, 2019, Larnaca, Cyprus 
 
ar
X
iv
:1
90
3.
11
40
9v
1 
 [c
s.D
C]
  2
7 M
ar 
20
19
Batched Sparse Matrix Multiplication for
Accelerating Graph Convolutional Networks
Yusuke Nagasaka, Akira Nukada
Tokyo Institute of Technology
Tokyo, Japan
nagasaka.y.aa@m.titech.ac.jp
nukada@gsic.titech.ac.jp
Ryosuke Kojima
Kyoto University
Kyoto, Japan
kojima.ryosuke.8e@kyoto-u.ac.jp
Satoshi Matsuoka
RIKEN Center for Computational Science
Kobe, Japan
matsu@acm.org
Abstract—Graph Convolutional Networks (GCNs) are recently
getting much attention in bioinformatics and chemoinformatics
as a state-of-the-art machine learning approach with high accu-
racy. GCNs process convolutional operations along with graph
structures, and GPUs are used to process enormous operations
including sparse-dense matrix multiplication (SpMM) when the
graph structure is expressed as an adjacency matrix with sparse
matrix format. However, the SpMM operation on small graph,
where the number of nodes is tens or hundreds, hardly exploits
high parallelism or compute power of GPU. Therefore, SpMM
becomes a bottleneck of training and inference in GCNs applica-
tions. In order to improve the performance of GCNs applications,
we propose new SpMM algorithm especially for small sparse
matrix and Batched SpMM, which exploits high parallelism of
GPU by processing multiple SpMM operations with single CUDA
kernel. To the best of our knowledge, this is the first work of
batched approach for SpMM. We evaluated the performance of
the GCNs application on TSUBAME3.0 implementing NVIDIA
Tesla P100 GPU, and our batched approach shows significant
speedups of up to 1.59x and 1.37x in training and inference,
respectively.
Index Terms—Graph Convolution, GCN, Batched, SpMM
I. INTRODUCTION
Recently, Deep Learning is getting much attention for its
high accuracy in image recognition and other machine learning
topics. In image recognition, Convolutional Neural Networks
(CNNs), which consist of convolution and pooling layers,
effectively expose the features of images, and show significant
accuracy. On the other hand, Graph Convolutional Networks
(GCNs) have been proposed [1] and can deal with a general
graph structure, not only for a regular grid structure such
as image. The GCNs show high accuracy in bioassay to
predict the characteristics of chemical compounds or protein
by treating the structures of substances as graph [2]–[6].
Furthermore, a knowledge graph is also applied to GCNs [7].
As well as CNNs, GCNs require enormous computational
operations, and GPUs with high compute capability are crucial
component for the GCNs applications. The structure of input
graph is expressed as an adjacency matrix or adjacency list.
As the connectivity between nodes among the graph is usually
sparse, the adjacency matrix is also sparse. In order to achieve
high throughput in training and inference of GCNs appli-
cations, the operations on sparse matrix, especially sparse-
dense matrix multiplication (SpMM), need to be accelerated.
However, the dataset used in GCNs often includes very small
graphs, where the number of nodes is less than one hundred.
SpMM operation on such small matrix hardly exploits massive
parallelism of GPU. Furthermore, it is not clear how to
efficiently process SpMM between small matrices since no
research in high performance computing or other fields focuses
on small sparse matrix. The researchers and developers are
forced to process the training or inference of GCNs with
low throughput SpMM on small matrices. As a result, the
operations on sparse matrix, especially SpMM, become the
bottlenecks in GCNs applications.
We propose Batched SpMM, which increases the paral-
lelism and attains whole computing power of GPU. Firstly,
we devise new efficient SpMM algorithm, named Sub-Warp-
Assigned (SWA) SpMM, for the data structure such as CSR
and SparseTensor in TensorFlow, and integrate cache blocking
optimization for utilizing shared memory. Next, we show how
Batched SpMM assigns computing resources. The Batched
SpMM appropriately applies cache blocking optimization and
assigns threads and shared memory to each SpMM operation
based on the matrix sizes in the batch. We also mention how
to efficiently apply our batched approaches to GCNs appli-
cation, improving the performance of matrix multiplication
and addition besides SpMM. The Batched SpMM launches
single CUDA kernel and executes tens or hundreds of SpMM
operations, corresponding to mini batch size, in parallel. The
Batched SpMM enhances the parallelism and throughput of
SpMM operations, and also reduces the overhead of CUDA
kernel launches.
We have preliminary evaluation of the Batched SpMM on
TSUBAME3.0 implementing NVIDIA Tesla P100 GPU. Our
batched approaches achieve significant speedups of up to 9.27x
compared to non-batched approaches, and attain improvement
of up to 6.09x speedup for larger dense input matrix. We also
evaluate the performance of Batched GEMM of cuBLAS by
treating input sparse matrix as dense matrix, and our Batched
SpMM shows superior performance to the Batched GEMM.
This preliminary evaluation exposes the effectiveness of our
Batched SpMM to specific batch size and matrix size. The
Batched SpMM is integrated to the GCNs application, and
accelerates the training and inference of the application. The
Batched SpMM attains the speedups of up to 1.59x and 1.37x
in training and inference, respectively.
II. BACKGROUND
A. Graph Convolution
A graph convolution is a general convolutional operation,
according to a graph structure, while the conventional con-
volutional operation assumes regular grid structure. A graph
convolutional network consists of multiple graph convolution
layers. Graph structures and input features are given as input
data, and the convolutional operation is executed along with
its graph structure. More specifically, the features of adjacent
nodes to the target node is convoluted with filters. This
operation is formulated as below when a graph is G = {V,E},
feature vector of the node v ∈ V is xv and a set of filter is
expressed as a weight matrix, W .
yi =
∑
j∈V
aijx
T
j W (1)
The adjacency matrix is set as auu = 1, and if an edge from
u to v exists, avu = 1, otherwise 0. Executing the operations
for all nodes among the graph is regarded as the multiplication
between the adjacency matrix A, feature vectors and weight
matrix as below.
Y = AXW (2)
The adjacency matrix representing a graph structure usually
shows sparse property, while the feature vectors and weight
matrix are expressed as dense matrix.
B. Sparse Matrix Format
Sparse matrix is usually compressed by removing zero
elements and holding only non-zero elements, which are
necessary for computation. Sparse matrix format is designed
to reduce both operations and memory usage by compres-
sion, and various sparse matrix formats have been proposed.
Coordinated (COO) and Compressed Sparse Row (CSR) are
widely used. Figure 1 shows an example of each sparse matrix
format. The COO format has the tuple of value, row index
and column index of each non-zero element in the matrix.
The CSR accumulates the non-zero elements with same row
indices, and then manages the non-zero elements by holding
row pointer (rpt), which indicates the beginning point of each
row. The CSR can reduce memory usage compared to the
COO format. In TensorFlow [8], one of the sophisticated
deep learning frameworks, sparse matrix (tensor) is treated
as SparseTensor class object. As Figure 1 shows, the data
structure of SparseTensor is similar to COO. In SparseTensor,
the indices of each non-zero element are stored as the array
of the pairs of the row and column indices.
C. Sparse Matrix Multiplication
Let A, B be input matrices, sparse matrix multiplication
(SpMM) computes C = AB when A is sparse matrix and
B is dense matrix. In the rest of this paper, the number of
non-zero elements, row size and column size of a matrix X is
written in nnzX , mX , nX , respectively. Figure 2 shows the
a
b c
d e
a b c d e
1 0 1 1 2
0 1 3 5
values
col ids
rpt
0 1 1 2 2
1 0 1 1 2
row ids
col ids
a b c d evalues
0 1 1 2 21 0 1 1 2ids
a b c d evalues
SparseTensor
COOCSR
Fig. 1: Example of sparse matrix formats
SPARSETENSORDENSEMATMUL(C,A,B)
1 // set matrix C to O
2 for i ← 0 to nnzA
3 do for j ← 0 to nB
4 do rid← idsA[i ∗ 2]
5 cid← idsA[i ∗ 2 + 1]
6 val← valuesA[i]
7 C[rid][j]← C[rid][j] + val ∗B[cid][j]
Fig. 2: Pseudo code of the multiplication kernel between sparse
tensor and dense matrix provided in TensorFlow. A is stored
as SparseTensor data structure.
pseudo code of SparseTensorDenseMatMul routine, which can
compute SpMM in TensorFlow. The multiplications for each
non-zero element are processed one after another. In CUDA
code designed for GPU, two “for” loops are parallelized by
nnzA ∗nB threads, and each thread computes a multiplication
independently. The product is added to output matrix by using
atomic operation since the memory access to output matrix
and the “add” operation by each thread have a chance of race
condition. When the negative influence of atomic operations is
enough negligible, the work assigned to each thread is about
equal and the SparseTensorDenseMatMul on GPU can achieve
good load balance. However, this approach has poor locality of
memory access in spite of matrix-matrix multiplication. Also,
the performance is limited by the atomic operations on the
global memory of GPU. To make matters worse on small
matrices, the approach with nnzA ∗nB threads hardly exploits
enormous parallelism of GPU.
III. RELATED WORK
A. SpMM for GPU
Va´zquez and et al. proposed new approach of SpMM for
GPU with ELLR-T format, which has favorable property of
memory access on GPU. While the approach of SpMM with
ELLR-T achieves higher performance, the format conversion
from COO or CSR format causes performance degradation
when the calculation on each matrix is executed only once.
Hong and et al. proposed new sparse matrix format and
algorithm for SpMM named Row-Segmented-SpMM (RS-
SpMM) [9]. The sparse matrix is divided into two groups.
One is dense non-zero area and the other is rest of non-
zero elements. Each part is processed by it own kernel.
This approach can improve the locality of memory access
and reduce the memory access to global memory. Also, in
order to identify “dense” area, the performance model is
developed and brings enough high performance even compared
to best case. The format conversion including parameter setting
and preprocessing for DCSR format [10] requires the cost
equivalent to several times of SpMM calculations.
Yang and et al. proposed the state-of-the-art SpMM algo-
rithm for GPU with CSR format [11]. The authors proposed
two kernels, and the heuristic selects between the kernels by
input matrix with high accuracy. One kernel adopts merge-
based approach [12] designed for improving the load-balance
in SpMV, and the other kernel is Row-splitting, performing
coalesced memory access. It is not clear that the proposed
approach is effective for smaller matrices since the approach
is evaluated only on large matrices.
B. Batched BLAS
The calculation on a dense matrix divided into small sub
matrices [13]–[16] or sparse matrix with block structure [17]–
[19] is dealt with the operations on small dense matrices.
The computing kernel on small matrices suffers from the
overhead of repeated kernel launches, especially on GPU,
and the high parallelism and memory bandwidth of GPU are
hardly exploited. Thus, the total performance of applications
is decreased. To overcome these issues, a new basic routine
called Batched BLAS has been proposed handling multiple
data and operations in a single kernel [20]–[22]. Even if
the data sizes in the batch are different to each, Batched
BLAS keeps good load-balance. The approach for GEMM
(multiplication between dense matrices) on small matrices is
also proposed [23], and a lot of GEMM operations between
small matrices can be executed with high throughput. Recently,
Batched SpMV for small matrices has been proposed [24].
Our batched approach is very similar concept to the Batched
SpMV, but the paper about Batched SpMV includes many
assumptions (eg. same non-zero pattern among matrices in
batch), and the Batched SpMV is highly specialized for the
application. Thus, the batched approach is very simple by
adding z-axis parallelism to thread grid in CUDA, and the
load imbalance among a batch is not taken into account.
C. GCNs Application
The GCNs applications are actively developed in bioinfor-
matics or chemoinformatics areas as a practical use case of
machine learning approach. Deepchem, an open source library,
is publicly available, providing the deep-learning approaches
to wide range areas such as drug design and materials chem-
istry [25]. DeepChem manages a graph structure as adjacency
lists. The convolutional operations accumulate the feature
vectors from adjacent nodes of the target node by exploring
the adjacency lists. The convolutional operations in DeepChem
hardly exploit high parallelism of GPU.
A library for deep learning in chemistry area named Chainer
Chemistry is also publicly available [26]. Chainer Chemistry
enables to predict the characteristics of chemical compounds
by applying a molecular structure to the deep learning ap-
proaches. The data with graph structure is applied to GCNs as
input data. Chainer Chemistry, however, manages an adjacency
matrix as dense matrix while the matrix shows sparse property.
This policy causes many redundant zero-related calculations.
Furthermore, it is hard to keep the data on memory or cache
when the graph becomes large.
IV. PROPOSAL
The GCNs application executes convolutional operations in
accordance with graph structure by computing SpMM when
the graph structure is expressed as sparse matrix. Learning
phase of GCNs requires many SpMM executions since the
GCNs need to process a lot of data through the graph convolu-
tion layers. However, SpMM on very small matrices such that
the dimension is tens or hundreds hardly utilizes the high com-
puting capability of GPU. Therefore, the overhead of CUDA
kernel launch for each SpMM operation appears to dominate
the total GCNs performance. We propose Batched SpMM,
which is high-throughput batched approach for processing
many SpMM operations especially on small sparse matrices.
Firstly, we devise new efficient SpMM algorithm, named Sub-
Warp-Assigned (SWA) SpMM, for the data structures such
as CSR and SparseTensor in TensorFlow. Next, we show the
cache blocking optimization to the SWA SpMM for utilizing
fast shared memory. Finally, we demonstrate the batch strategy
for the cache blocking and the assignment of GPU computing
resources such as threads and shared memory. We also mention
how to efficiently integrate our batched approaches to GCNs
application. In our supposed scenario of GCNs application,
the format conversion for all sparse matrices causes expensive
overhead. Thus, we focus on developing algorithms for simple
sparse matrix formats such as COO and CSR. Our target
application is implemented with TensorFlow, and stores sparse
matrix data as SparseTensor data structure. We assume that the
non-zero elements are not sorted by row or column indices
in SparseTensor data structure. The column size of dense
input matrix in SpMM operation is the size of model in the
application. Each column size of dense input in the batch is
same.
A. Sub-Warp-Assigned SpMM
The SparseTensorDenseMatMul in TensorFlow launches
nnzA ∗ nB threads, and one thread computes one
multiplication-and-addition. Thus, nB threads are assigned to
each non-zero element. However, the actual CUDA imple-
mentation requires many operations for each thread to find
which non-zero element or column is assigned, reducing the
efficiency of parallelism. Furthermore, when nB becomes
larger, especially overcomes 32, more threads access to same
non-zero element. This increases memory access demand and
instruction counts. That is why the SparseTensorDenseMatMul
results in lower performance. We propose Sub-Warp-Assigned
SWA SPMM ST(C,A,B, subWarp)
1 // A is stored as SparseTensor data structure
2 // set matrix C to O
3 i ← threadid
4 nzid ← i /subWarp
5 rid← idsA[nzid ∗ 2]
6 cid← idsA[nzid ∗ 2 + 1]
7 val← valuesA[nzid]
8 for j ← (i %subWarp) to nB by subWarp
9 do Atomic(C[rid][j]← C[rid][j] + val ∗B[cid][j])
Fig. 3: Pseudo CUDA code of Sub-Warp-Assigned SpMM for
SparseTensor data structure
(SWA) SpMM, which assigns up to 32 threads (1 warp) to each
non-zero element or row. The subWarp is set based on the
column-size of dense input matrix, i.e. nB .
subWarp =
{
32 (nB > 16)
min 2p s.t. nB ≤ 2p (nB ≤ 16)
The number of assigned threads to each non-zero element is
set as the power of two. This enables to compute division
and modulo operations by executing low-cost bit operations.
Furthermore, by setting the limit of assigning threads up to 32
threads to each non-zero element, the SWA SpMM reduces
redundant memory pressure and instruction counts. We firstly
show the algorithm of SWA SpMM for SparseTensor data
structure, and then show that for CSR format. The algorithm
for SparseTensor is easily applied to COO format.
Figure 3 shows the pseudo CUDA code of SWA SpMM
algorithm for SparseTensor data structure. The SWA SpMM
algorithm assigns subWarp of threads to each non-zero ele-
ment. The memory access to the dense input matrix and output
matrix by threads among same subWarp is coalesced. As well
as SparseTensorDenseMatMul, the SWA SpMM algorithm is
load balanced since the work is parallelized by the number
of non-zero elements. Since another subWarp assigned to
different non-zero element has a chance of accessing same
entry of output matrix, add operation is atomically executed.
Figure 4 shows the pseudo CUDA code of SWA SpMM
algorithm for CSR format. By utilizing the characteristic of
managing non-zero elements by row, the SWA SpMM for
CSR assigns subWarp to each row. As well as SWA SpMM
for SparseTensor, the threads among same subWarp access
to same non-zero element, and the memory access to dense
input and output matrix is coalesced. The significant difference
from the SWA SpMM for SparseTensor is that the algorithm
for CSR causes no data race on accessing output matrix. The
product can be added without atomic operation.
B. Efficient use of shared memory and cache blocking
Utilizing shared memory and improving the locality of
memory access highly contribute to the performance of matrix
multiplication on GPU. Shared memory is implemented on
SWA SPMM CSR(C,A,B, subWarp)
1 // A is stored as CSR format
2 // set matrix C to O
3 i ← threadid
4 rid ← i /subWarp
5 for nzid ← rptA[rid] to rptA[rid+ 1]
6 do cid← colidsA[nzid]
7 val← valuesA[nzid]
8 for j ← (i %subWarp) to nB by subWarp
9 do C[rid][j]← C[rid][j] + val ∗B[cid][j]
Fig. 4: Pseudo CUDA code of Sub-Warp-Assigned SpMM for
CSR format
each SM, and provides fast memory access. Also, hardware
support improves the performance of atomic operations on
shared memory. In our approach for both SparseTensor data
structure and CSR format, output matrix is temporally placed
on shared memory.
For small matrix stored as COO format or SparseTensor
data structure, our approach utilizes shared memory as Fig-
ure 5-(a). Utilizing shared memory for output matrix reduces
the overhead of CUDA kernel launch for initializing output
matrix. As Figure 3 shows, the output matrix needs to be
initialized with zero before actual SpMM operation starts. The
overhead of another CUDA kernel launch for initialization
is not negligible, especially for SpMM on small matrices. If
nB or sparse matrix is large, whole output matrix can not be
placed on shared memory of single streaming multiprocessor
(SM). In this case, cache blocking optimization divides the
output matrix along the column, as Figure 5-(b) shows. The
cache blocking optimization enables SWA SpMM to utilize
fast shared memory and achieve high performance even for
larger input matrices. Not only for output matrix, the locality
of memory access to input dense matrix is also improved.
While each subWarp in SWA SpMM for SparseTensor
specifies only which column of output matrix is assigned
without index information of non-zero element, the SWA
SpMM for CSR can easily specify which row and column of
output matrix is assigned. Thus, shared memory does not need
to keep whole output matrix. A subWarp only needs shared
memory with the size of nB . If TB/subWarp ∗ nB , TB is
thread block size, exceeds the capacity of shared memory, the
cache blocking optimization is applied as Figure 5-(d) shows.
As well as SparseTensor data structure, the output matrix and
input dense matrix are divided along the column.
C. Batched Algorithm for SpMM
Our Batched SpMM manages tens or hundreds of SpMM
operations by single CUDA kernel launch. Based on the
sizes of input sparse and dense matrices, the Batched SpMM
algorithm decides whether the cache blocking is applied and
how many threads are assigned to whole SpMM operations.
The Batched SpMM for SparseTensor decides whether
cache blocking optimization is applied by the maximum size
Threads SM Input Matrix (sparse) Input Matrix (dense) Output Matrix (dense)
Global Memory
Shared Memory
(a) For small matrix (b) Cache blocking
with SparseTensor
(c) CSR for larger sparse matrix (d) CSR for wider dense matrix
Fig. 5: Utilization of shared memory and cache blocking optimization for SpMM
of output matrix in batch, i.e. maxmA ∗ nB . There are three
cases by the sizes of input matrices.
1) Whole output matrix is small enough to be placed on
shared memory (Figure 5-(a))
2) Sub output matrix divided by cache blocking can be
placed on shared memory (Figure 5-(b))
3) A graph (sparse matrix) is too large to use shared
memory even with cache blocking optimization
However, when we assume that the size of shared memory to
each SpMM operation in single precision is 32KB, only the
input sparse matrices with mA > 8192 require the case 3). In
this case, the SpMM operations do not need to be batched,
and it is better option to execute single SpMM operation by
optimized kernel for large input sparse matrix. We focus on
only the cases 1) and 2) although we support the case 3)
as the implementation without using shared memory. The
Batched SpMM algorithm applies cache blocking optimization
to all SpMM operations in the batch if an output matrix
which cannot be placed on shared memory exists. The Batched
SpMM assigns one thread block to each SpMM operation for
whole matrix or sub matrix. For example, we consider that the
Batched SpMM executes one hundred of SpMM operations in
single precision and the capacity of shared memory assigned
to each thread block is 32KB. If any output matrix in the batch
can be placed on shared memory without cache blocking, that
is mA ∗ nB ∗ sizeof(float) ≤ 32 ∗ 1024, the Batched SpMM
launches one hundred thread blocks and each thread block
computes one SpMM operation. If one SpMM operation in
the batch requires cache blocking and the output matrix is
divided into two sub matrices, two hundreds thread blocks
are launched and each thread block is assigned to one SpMM
operation on sub matrix.
The required number of threads for each SpMM operation in
the Batched SpMM for CSR format is simply subWarp∗mA.
As Figure 5-(c) shows, larger input matrix increases the
number of thread blocks launched. The Batched SpMM for
CSR format applies cache blocking optimization only when
nB is large as Figure 5-(d) shows. If the column size of
dense input, nB , is small enough, the Batched SpMM launches
maxmA ∗ subWarp ∗ batch threads as a whole. If wider
input dense matrix needs cache blocking optimization and
the output matrix is divided into p sub matrices, the Batched
SpMM launches maxmA ∗ subWarp ∗ batch ∗ p threads as a
whole. Although the Batched SpMM for CSR launches more
threads than necessary if the batch includes various sizes of
input sparse matrices, these redundant threads do not become
a problem since the threads terminate immediately.
D. Batched SpMM for GCNs Application
Our target GCNs application named “ChemGCN” con-
structs a neural network with multiple graph convolution
layers. The application is written with TensorFlow, but our
batched approach does not limit the framework or implemen-
tation. In our implementation with TensorFlow, we focus on
applying the Batched SpMM for SparseTensor, but the Batched
GRAPHCONVOLUTION(Y,A,X,W, bias)
1 for b← 0 to batchsize
2 do for ch← 0 to channel
3 do U ← MATMUL(X[b],W [ch])
4 B ← ADD(bias[ch], U)
5 C[ch]← SPMM(A[b][ch], B)
6 Y [b]← ELEMENTWISEADD(C)
Fig. 6: Pseudo code of graph convolution layer in GCN
application without batched optimization
SpMM for CSR format can also be utilized in another frame-
work such as Chainer, which supports CSR format. Figure 6
shows the flow of the graph convolution layer in ChemGCN.
The graph convolution layer executes convolutional operation
for each input data. Firstly, the feature vectors on each vertex
within the input graph are multiplied by the weight matrix for
each channel (MatMul). The bias is added to the multiplication
result (Add), and the result is convoluted along with graph
structure (SpMM). Finally, the output of SpMM for each
channel is accumulated. This GraphConvolution layer requires
batchsize∗channel times of kernel launches for MatMul, Add
and SpMM. Since input matrices are small, the computing ker-
nels for these three operations hardly exploit high parallelism
of GPU. The overhead of CUDA kernel launches becomes
critical performance issue. The implementation insufficient for
the performance is simply due to no availability of batched
approach for SpMM.
Figure 7 shows the algorithm of graph convolution layer
with the Batched SpMM. To attain more parallelism, the
operations within mini batch are batched. The Batched SpMM
applied to the graph convolution layer enables MatMul and
Add to be also executed by single kernel, respectively. Due
to the limitation of the framework, the dimensions of input
tensors are changed. In Figure 7, the input dense tensor
is reshaped from RmX×nX×batchsize to R(mX∗batchsize)×nX .
However, this operation just changes the meta data of input
data, and brings only a little overhead. Generating the list
of adjacency matrices for executing Batched SpMM requires
only accumulating the pointers to the objects. In this way, the
operations on the graph convolution layer can be batched with
little overhead. The CUDA kernel launches for computing con-
volution are largely reduced from O(channel ∗ batchsize) to
O(channel). The Batched SpMM is also applied to backward
propagation.
V. PERFORMANCE EVALUATION
We evaluated the performance and effectiveness of the
Batched SpMM. First we conducted the preliminary evalu-
ation on randomly generated matrices to see the performance
difference of SpMM between the non-batched and batched
approaches. Next we evaluated the performance of ChemGCN
application with the Batched SpMM.
GRAPHCONVOLUTIONBATCHED(Y,A,X,W, bias)
1 for ch← 0 to channel
2 do Xr ← RESHAPE(X, (mX ∗ batchsize, nX))
3 U ← MATMUL(Xr,W [ch])
4 B ← ADD(bias[ch], U)
5 Alist ← [A[0][ch], ..., A[batchsize− 1][ch]]
6 C[ch]← BATCHEDSPMM(Alist, B)
7 Y ← ELEMENTWISEADD(C)
Fig. 7: Pseudo code of graph convolution layer in GCN
application with batched optimization
We used TSUBAME3.0 at Tokyo Institute of Technology,
which has two sockets of Intel(R) Xeon(R) CPU E5-2680 v4
@ 2.40GHz and four cards of NVIDIA Tesla P100-SMX2
GPU per computing node. In our evaluation, we used only
one GPU. The Tesla P100 GPU consists of 56 SMs, which
are 3584 of CUDA cores, and has 16GB HBM2 main memory
with 732GB/sec peak memory bandwidth. The shared memory
with 64KB is implemented on each SM, and 4MB L2 cache is
shared among all threads. The OS was SUSE Linux Enterprise
Server 12 (x86 64), and the version of CUDA was 9.0.176.
Our target GCNs application is implemented with TensorFlow.
The version of TensorFlow was 1.8.0, and cuDNN was ver.7.0.
All evaluation in this paper was in single precision.
A. Preliminary Evaluation Results
We evaluated the performance of batched approaches on
randomly generated sparse matrix data. We used SpMM
functions from cuSPARSE library with CSR format. We
evaluated both “csrmm()” and “csrmm2()”, and show the
best one as “cuSPARSE”. We implemented SpMM function
with SparseTensor-like format following SparseTensorDense-
MatMul as the non-batched approach. Compared to the non-
batched approaches, we evaluated Batched SpMM for CSR
and SparseTensor-like formats. In addition to them, we eval-
uated the performance of “gemmBatched()” from cuBLAS
library by treating the input sparse matrix as dense matrix. This
Batched GEMM function is optimized for processing many
GEMM operations by single kernel. The Batched GEMM can
exploit GPU’s computing power while most of operations
are zero-related. Although we just show the throughput of
Batched GEMM as evaluation result, holding sparse matrix
as dense format requires much memory usage. When we con-
sider the practical application scenario, much more memory
requirement by dense format makes it hard to keep the data
on GPU memory, causing more memory transfer between
CPU and GPU. Since our target is graph data, the randomly
generated sparse matrices are square. The row size (dim)
and nnz/row are parameterized in generating matrix, and
the non-zero pattern is different from each other. The column
size of input dense matrix (nB) is also parameterized. The
performance number is showed with FLOPS, calculated by
2
1
2
3
2
5
2
7
2
9
Column size of right hand matrix
2
−5
2
−3
2
−1
2
1
2
3
2
5
2
7
G
FL
O
P
S
cuSPARSE
TensorFlow
cuBLAS
BatchedSpMM (ST)
BatchedSpMM (CSR)
(a) batchsize = 50, dim = 50, nnz/row = 2
2
1
2
3
2
5
2
7
2
9
Column size of right hand matrix
2
−4
2
−2
2
0
2
2
2
4
2
6
G
FL
O
P
S
cuSPARSE
TensorFlow
cuBLAS
BatchedSpMM (ST)
BatchedSpMM (CSR)
(b) batchsize = 100, dim = 50, nnz/row = 3
Fig. 8: Performance of SpMM and Batched GEMM on randomly generated dataset following the one used in target GCNs
application. BatchedSpMM (ST) represents the result of Batched SpMM for Sparse Tensor.
2∗nnzA∗nB/exe time. Since the concern is actual execution
time, the “gemmBatched()” also follows this performance
metric although many zero-related operations are executed.
The Batched SpMM and Batched GEMM require the pointer
to each matrix data to be placed on device memory, while
it is stored on host memory in the non-batched cases. Our
evaluation for batched approaches includes memory transfer of
pointer arrays from host to device. Each performance number
is the average of 10 times executions.
Figure 8 shows the performance of non-batched and batched
approaches for SpMM. The parameter setting for the input
sparse matrices in the evaluation is based on dataset and
configurations of GCNs application showed in later part of
this section. The evaluation result in Figure 8 works as a
proxy for estimating the impact of Batched SpMM applied to
the GCNs application. In both cases in Figure 8, the Batched
SpMM shows best performance on fat dense matrices. The
non-batched approaches, which sequentially process SpMM,
cannot achieve high performance due to the overhead of
repeated CUDA kernel launches and poor parallelism of GPU.
On the other hand, all of the batched approaches acquire the
large performance gain. Compared to the implementation of
SpMM following TensorFlow, the Batched SpMM achieves
significant speedups of up to 9.27x at nB = 64 in Figure 8-
(a) and 6.09x at nB = 512 in Figure 8-(b). To dive into
more detailed analysis, we evaluated the sm efficiency by
using nvprof, which is one of the profiler tools for NVIDIA
GPU. The sm efficiency is the percentage of active streaming
multiprocessors (SMs). While the sm efficiency with non-
batched implementation of SpMM following TensorFlow is
35.51% at nB = 512 in Figure 8-(b), the Batched SpMM
for SparseTensor and for CSR show 89.07% and 87.87%,
respectively. This result simply shows that the Batched SpMM
exploits GPU’s computing resources.
The Batched GEMM routine of cuBLAS also shows high
throughput although it includes many zero-related operations.
This is because the target sparse matrices are small and do
not include so much zero-related operations. It should be
noted, however, that treating as sparse matrix and executing
as SpMM is natural, and the Batched SpMM overcomes
cuBLAS. The performance improvement of Batched SpMM
from cuBLAS is 1.26x at nB = 64 in Figure 8-(a) and 1.43x
at nB = 512 in Figure 8-(b). In the cases with smaller nB ,
the Batched GEMM of cuBLAS shows superior performance
to our Batched SpMM. This is because the actual execution
time on GPU is too short and the overhead for more memory
transfer from CPU to GPU with Batched SpMM compared to
Batched GEMM becomes dominant.
Next, we show the evaluation results of three batched
approaches with changing the parameters of input matrices
and batch size. The Figure 9-(b) and (d) show the performance
difference between batchsize. The results indicate that more
batch size brings more throughput to all batched approaches.
The batched approaches with batchsize = 50 fail in using
all SMs on GPU. While small batch size hardly exploits the
computing power of GPU, larger batch size, batchsize = 100,
ensures enough parallelism. The first row of Figure 9 (i.e.
(a), (b) and (c)) shows the result of dim = 32, 64 and
128, respectively. By increasing the size of input sparse
matrix, the Batched SpMM for CSR and cuBLAS are getting
better performance numbers. This performance boost simply
comes from the improvement of parallelism. The Batched
SpMM for CSR format launches more threads in proportion
to the row size of input sparse matrix. Thus, a larger input
sparse matrix can improve the occupancy of GPU. While the
cuBLAS also increases the parallelism on larger input sparse
matrix, the performance improvement is relatively smaller than
Batched SpMM for CSR since the sparsity is also increased.
The Batched SpMM for SparseTensor shows only slight
2
1
2
3
2
5
2
7
2
9
Column size of right hand matrix
0
25
50
75
100
125
150
175
200
G
FL
O
P
S
cuBLAS
BatchedSpMM (ST)
BatchedSpMM (CSR)
(a) batchsize = 100, dim = 32, nnz/row = 3
2
1
2
3
2
5
2
7
2
9
Column size of right hand matrix
0
25
50
75
100
125
150
175
200
G
FL
O
P
S
cuBLAS
BatchedSpMM (ST)
BatchedSpMM (CSR)
(b) batchsize = 100, dim = 64, nnz/row = 3
2
1
2
3
2
5
2
7
2
9
Column size of right hand matrix
0
25
50
75
100
125
150
175
200
G
FL
O
P
S
cuBLAS
BatchedSpMM (ST)
BatchedSpMM (CSR)
(c) batchsize = 100, dim = 128, nnz/row = 3
2
1
2
3
2
5
2
7
2
9
Column size of right hand matrix
0
25
50
75
100
125
150
175
200
G
FL
O
P
S
cuBLAS
BatchedSpMM (ST)
BatchedSpMM (CSR)
(d) batchsize = 50, dim = 64, nnz/row = 3
2
1
2
3
2
5
2
7
2
9
Column size of right hand matrix
0
10
20
30
40
50
60
70
80
G
FL
O
P
S
cuBLAS
BatchedSpMM (ST)
BatchedSpMM (CSR)
(e) batchsize = 100, dim = 64, nnz/row = 1
2
1
2
3
2
5
2
7
2
9
Column size of right hand matrix
0
50
100
150
200
250
G
FL
O
P
S
cuBLAS
BatchedSpMM (ST)
BatchedSpMM (CSR)
(f) batchsize = 100, dim = 64, nnz/row = 5
Fig. 9: Performance of batched approaches of SpMM on randomly generated dataset
performance change. This is because more cache blocking
in Batched SpMM for SparseTensor causes more memory
pressure to same non-zero element.
The Figure 9-(e) and (f) show the evaluation results of
nnz/row = 1 and 5, respectively. Obviously, the Batched
SpMM works efficiently on sparser matrices, and the cuBLAS
appears to show better performance on denser matrices. For
low nnz/row matrices, the Batched SpMM both for SparseTen-
sor and for CSR are superior to cuBLAS. On the other
hand, the performance improvement by Batched SpMM for
SparseTensor is limited on denser input sparse matrix due to
more race condition by atomic operations. The Batched SpMM
for CSR is atomic-free algorithm, and keeps best performer on
denser input sparse matrices.
Finally, we consider the batch includes various sizes or
densities of input sparse matrices. Figure 10 shows the result
with mixed sizes and densities, batchsize = 100, dim =
[32, 256], nnz/row = [1, 5]. We excluded cuBLAS from this
evaluation since the kernel only processes GEMM operations
with same matrix sizes. The Batched SpMM achieves signif-
icant improvements compared to the non-batched approaches
even though the matrices among batch have different shapes
to each. At nB = 1024, our Batched SpMM achieves up to
3.29x speedup from the non-batched approaches.
B. Evaluation on GCNs Application
We evaluated the impact of the Batched SpMM in
ChemGCN application implemented with TensorFlow. A
graph convolutional network consists of several graph con-
volution layers. After each graph convolution layer, batch
normalization is executed. Table I shows the dataset and the
2
1
2
3
2
5
2
7
2
9
Column size of right hand matrix
2
−3
2
−1
2
1
2
3
2
5
2
7
G
FL
O
P
S
cuSPARSE
TensorFlow
BatchedSpMM (ST)
BatchedSpMM (CSR)
Fig. 10: Performance of SpMM on randomly generated dataset
(batchsize = 100, dim and nnz/row are mixed.)
TABLE I: Dataset and configurations in the evaluation of
ChemGCN application
#Matrices Max dim Epoch Batch size
Tox21 7,862 50 50 50
Reaction100 75,477 50 20 100
configurations for the application. We used Tox21 [27] and
Reaction100 datasets, which predicts the reaction from its own
graph structure of compound structure by selecting 100 kinds
of typical reactions from Reaxys database [28].
Non-Batched
MatMul Add
Batched SpMM
Batched
SparseTensorDenseMatMul
Fig. 11: Visualization of the execution of ChemGCN with Timeline
TABLE II: Training time of ChemGCN [sec]
CPU GPU SpeedupNon-Batched Non-Batched Batched
Tox21 854.51 918.03 723.80 1.18x
Reaction100 16223.98 3029.13 1905.32 1.59x
TABLE III: Inference time of ChemGCN [sec]
CPU GPU SpeedupNon-Batched Non-Batched Batched
Tox21 2.71 2.56 1.97 1.30x
Reaction100 44.66 22.42 16.32 1.37x
The #Matrices in Table I is the number of pairs of adjacency
matrix and feature matrix. The model is trained by K-fold
cross validation, and k=5 in our evaluation. The trained model
is used for the evaluation of inference. The evaluation result of
inference is the execution time for inferring all data of dataset.
In the evaluation of inference, the batch size is set as 200 to
increase the throughput since the batch size does not affect the
accuracy of inference. The evaluation results of both training
and inference are the mean of 5 executions. We used the
Batched SpMM for SparseTensor data structure as the batched
approach. It should be noted that our batched optimization
does not change the configuration or hyper parameters from
non-batched version, and thus no effect on the accuracy
in training. The architecture for Tox21 includes two graph
convolution layers. The column size of each weight matrix
is 64. For Reaction100, three graph convolutional layers are
stacked, and the column size of each weight matrix is 512.
Table II and III show the performance of training and
inference in ChemGCN application, respectively. The training
of Tox21 dataset on CPU finishes in less execution time
compared to that on GPU with non-batched approach. This
is because Tox21 dataset is small enough that all data such
as adjacency matrices and feature matrices can be placed
on last-level cache of CPU. The non-batched approach does
not exploit the computing power of GPU due to the lack
of parallelism, and the overhead of repeated CUDA kernel
launches degrades the performance. In contrast, the batched
approach on GPU significantly improves the performance of
GCNs application. Especially on Reaction100 dataset, which
has more data and adopts larger batch size, the benefit of
batched approach is prominent and the speedup of training
compared to the non-batched approaches achieves 1.59x. We
can maximize the benefit of the batched approach in the
TABLE IV: Execution time of each kernel in Figure 11 [µsec]
Non-Batched Batched
MatMul 1,571 31
Add 1,316 23
SpMM 1,981 190
inference by setting larger batch size. The batched approach
achieves up to 1.39x speedup of inference throughput.
Next, we analyzed the performance of each kernel by using
Timeline, which is a performance profiling tool in TensorFlow.
Figure 11 shows the visualization of each kernel execution in a
graph convolutional layer for one mini batch on Tox21 dataset
with the non-batched or batched approaches. Although the bars
are illustrated with same length, their execution time are com-
pletely different. Table IV shows the “actual” execution time of
each operation. As Figure 11 shows, the non-batched approach
requires batchsize ∗ 3 = 150 times of CUDA kernel launches
while the batched approach requires only three times. The
batched approach can largely reduce the overhead of CUDA
kernel launches. The speedup of SpMM with Batched SpMM
is about 10x. This performance improvement corresponds to
the number showed in Figure 8-(a). Also, the batched approach
significantly reduces the execution time of MatMul and Add
as well as SpMM. This implies that the non-batched approach
does not exploit the parallelism of GPU. We concluded that
the batched approach can significantly increase the occupancy
of GPU not only in SpMM operation but also in other kernels
on small matrices.
VI. CONCLUSION
In the application of machine learning approaches to various
fields including chemistry and biology, GCNs show high
accuracy. However, there are yet less work of GCNs in terms
of throughput, and the current approach for GCNs, which
require many operations on small sparse and dense matrices,
cannot exploit the computing power of GPU. The scenario
requiring small sparse matrix operations, especially SpMM,
does not yet exist, and thus no research or engineering effort
has been put into accelerating for such kind of operations.
In order to accelerate GCNs applications, we propose the
Batched SpMM, which efficiently handles tens or hundreds
of SpMM operations together by single kernel. To exploit
GPU architecture more, we devise new SpMM algorithm,
Sub-Warp-Assigned SpMM, especially for small sparse ma-
trices with cache blocking optimization to dense matrices. We
evaluate the performance of our Batched SpMM on randomly
generated matrices. The Batched SpMM achieves up to 9.27x
speedup compared to the non-batched approaches, and 1.43x
compared to Batched GEMM of cuBLAS. Furthermore, the
batched optimization improves the performance of GCNs
application with 1.59x speedup for training and 1.37x speedup
for inference.
For future work, we are planning to have more analysis
and optimization to improve the GCNs application since the
hotspot of the application moves to other while our batched
approach significantly improves the performance of SpMM-
related operations. Our batched approach contributes not only
to the reduction of the overhead of kernel launches, but also
to increasing the occupancy of computing devices. We are
planning to evaluate whether our batched approach works well
on other architectures such as multi-core CPUs or Intel KNL.
ACKNOWLEDGMENT
This work was partially supported by JST CREST Grant
Number JPMJCR1303 and JPMJCR1687, and performed un-
der the collaboration with DENSO IT Laboratory, inc., and
performed under the auspices of Real-World Big-Data Com-
putation Open Innovation Laboratory, Japan, and based on
results obtained from a project commissioned by the New
Energy and Industrial Technology Development Organization
(NEDO). We would like to thank Mr. Akira Naruse of NVIDIA
for providing much of advices on our research.
REFERENCES
[1] T. N. Kipf and M. Welling, “Semi-supervised classification with graph
convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[2] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel,
A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs
for learning molecular fingerprints,” in Advances in neural information
processing systems, 2015, pp. 2224–2232.
[3] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph
sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
[4] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley,
“Molecular graph convolutions: moving beyond fingerprints,” Journal
of computer-aided molecular design, vol. 30, no. 8, pp. 595–608, 2016.
[5] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E.
Dahl, “Neural message passing for quantum chemistry,” arXiv preprint
arXiv:1704.01212, 2017.
[6] F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E.
Dahl, O. Vinyals, S. Kearnes, P. F. Riley, and O. A. von Lilienfeld,
“Machine learning prediction errors better than dft accuracy,” arXiv
preprint arXiv:1702.05532, 2017.
[7] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov,
and M. Welling, “Modeling relational data with graph convolutional
networks,” in European Semantic Web Conference. Springer, 2018, pp.
593–607.
[8] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow,
A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur,
J. Levenberg, D. Mane´, R. Monga, S. Moore, D. Murray, C. Olah,
M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker,
V. Vanhoucke, V. Vasudevan, F. Vie´gas, O. Vinyals, P. Warden, M. Wat-
tenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale ma-
chine learning on heterogeneous systems,” http://tensorflow.org/, 2015.
[9] C. Hong, A. Sukumaran-Rajam, B. Bandyopadhyay, J. Kim, S. E.
Kurt, I. Nisa, S. Sabhlok, U¨. V. C¸atalyu¨rek, S. Parthasarathy, and
P. Sadayappan, “Efficient sparse-matrix multi-vector product on gpus,” in
Proceedings of the 27th International Symposium on High-Performance
Parallel and Distributed Computing. ACM, 2018, pp. 66–79.
[10] A. Buluc and J. R. Gilbert, “On the representation and multiplication
of hypersparse matrices,” in IEEE International Symposium on Parallel
and Distributed Processing, 2008. IPDPS 2008. IEEE, 2008, pp. 1–11.
[11] C. Yang, A. Buluc, and J. D. Owens, “Design principles for sparse matrix
multiplication on the gpu,” arXiv preprint, no. arXiv:1803.08601, 2018.
[12] D. Merrill and M. Garland, “Merge-based parallel sparse matrix-vector
multiplication,” in SC ’16 Proceedings of the International Conference
for High Performance Computing, Networking, Storage and Analysis.
IEEE Press, 2016, p. 58.
[13] T. Dong, A. Haidar, P. Luszczek, J. A. Harris, S. Tomov, and J. Dongarra,
“Lu factorization of small matrices: Accelerating batched dgetrf on the
gpu.” in HPCC/CSS/ICESS, 2014, pp. 157–160.
[14] A. Haidar, T. T. Dong, S. Tomov, P. Luszczek, and J. Dongarra, “A
framework for batched and gpu-resident factorization algorithms applied
to block householder transformations,” in International Conference on
High Performance Computing. Springer, 2015, pp. 31–47.
[15] A. Charara, D. E. Keyes, and H. Ltaief, “Batched triangular dense linear
algebra kernels for very small matrix sizes on gpus,” ACM Transactions
on Mathematical Software, 2017.
[16] W. H. Boukaram, G. Turkiyyah, H. Ltaief, and D. E. Keyes, “Batched
qr and svd algorithms on gpus with applications in hierarchical matrix
compression,” Parallel Computing, vol. 74, pp. 19–33, 2018.
[17] R. Zheng, W. Wang, H. Jin, S. Wu, Y. Chen, and H. Jiang, “Gpu-
based multifrontal optimizing method in sparse cholesky factorization,”
in 2015 IEEE 26th International Conference on Application-specific
Systems, Architectures and Processors (ASAP). IEEE, 2015, pp. 90–97.
[18] A. Venkat, M. S. Mohammadi, J. Park, H. Rong, R. Barik, M. M. Strout,
and M. Hall, “Automating wavefront parallelization for sparse matrix
computations,” in SC ’16 Proceedings of the International Conference
for High Performance Computing, Networking, Storage and Analysis.
IEEE Press, 2016, p. 41.
[19] X. Li, F. Li, and J. M. Clark, “Exploration of multifrontal method
with gpu in power flow computation,” in 2013 IEEE Power and Energy
Society General Meeting (PES). IEEE, 2013, pp. 1–5.
[20] A. Abdelfattah, A. Haidar, S. Tomov, and J. Dongarra, “Performance,
design, and autotuning of batched gemm for gpus,” in International
Conference on High Performance Computing. Springer, 2016, pp. 21–
38.
[21] R. Nath, S. Tomov, and J. Dongarra, “An improved magma gemm for
fermi graphics processing units,” The International Journal of High
Performance Computing Applications, vol. 24, no. 4, pp. 511–515, 2010.
[22] A. Haidar, T. Dong, P. Luszczek, S. Tomov, and J. Dongarra, “Batched
matrix computations on hardware accelerators based on gpus,” The
International Journal of High Performance Computing Applications,
vol. 29, no. 2, pp. 193–208, 2015.
[23] I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin, J. Falcou,
and J. Dongarra, “High-performance matrix-matrix multiplications of
very small matrices,” in European Conference on Parallel Processing.
Springer, 2016, pp. 659–671.
[24] H. Anzt, G. Collins, J. Dongarra, G. Flegar, and E. S. Quintana-Ortı´,
“Flexible batched sparse matrix-vector product on gpus,” in Proceedings
of the 8th Workshop on Latest Advances in Scalable Algorithms for
Large-Scale Systems. ACM, 2017, p. 3.
[25] “Deepchem,” https://deepchem.io/.
[26] pfnet research, “Chainer chemistry,” https://github.com/pfnet-
research/chainer-chemistry.
[27] N. C. for Advancing Translational Sciences, “Tox21 data challenge
2014,” https://tripod.nih.gov/tox21/challenge/about.jsp.
[28] “Reaxys,” https://www.reaxys.com/.
