Bandwidth-Optimized Parallel Algorithms for Sparse Matrix-Matrix
  Multiplication using Propagation Blocking by Gu, Zhixiang et al.
Bandwidth-Optimized Parallel Algorithms for Sparse Matrix-Matrix
Multiplication using Propagation Blocking
Zhixiang Gu∗, Jose Moreira†, David Edelsohn†, Ariful Azad∗,
E-mail: zg3@iu.edu, jmoreira@us.ibm.com, edelsohn@us.ibm.com, azad@iu.edu
∗Indiana University Bloomington, †IBM Corporation
Abstract—Sparse matrix-matrix multiplication (SpGEMM) is
a widely used kernel in various graph, scientific computing and
machine learning algorithms. It is well known that SpGEMM
is a memory-bound operation, and its peak performance is
expected to be bound by the memory bandwidth. Yet, existing
algorithms fail to saturate the memory bandwidth, resulting
in suboptimal performance under the Roofline model. In this
paper we characterize existing SpGEMM algorithms based on
their memory access patterns and develop practical lower and
upper bounds for SpGEMM performance. We then develop
an SpGEMM algorithm based on outer product matrix mul-
tiplication. The newly developed algorithm called PB-SpGEMM
saturates memory bandwidth by using the propagation blocking
technique and by performing in-cache sorting and merging. For
many practical matrices, PB-SpGEMM runs 20%-50% faster
than the state-of-the-art heap and hash SpGEMM algorithms on
modern multicore processors. Most importantly, PB-SpGEMM
attains performance predicted by the Roofline model, and its
performance remains stable with respect to matrix size and
sparsity.
I. INTRODUCTION
Sparse matrix-matrix multiplication (SpGEMM) is a
widely-used kernel in many graph analytics, scientific com-
puting, and machine learning algorithms. In graph analytics,
SpGEMM is used in betweenness centrality [1], clustering
coefficients, triangle counting [2], multi-source breadth-first
search [3], colored intersection search [4], and cycle detection
[5] algorithms. In scientific computing, SpGEMM is used
in algebraic multigrid [6] and linear solvers. Many machine
learning tasks like dimensionality reduction (e.g., NMF, PCA
[7]), spectral clustering [8], and Markov clustering (MCL) [9])
rely on an efficient SpGEMM algorithm as well. Additionally,
SpGEMM algorithms are applied to evaluate the chained
product of sparse Jacobians [10] and optimize join operations
on modern relational databases [11].
In most data analytics applications, SpGEMM has very
low arithmetic intensity (AI) measured by the ratio of total
floating-point operations to total data movement. For example,
when multiplying two Erdo˝s-Re´nyi random matrices with d
nonzeros per column (matrices with d nonzeros uniformly
distributed in each column) , an algorithm has an AI of just
1
16 flops/byte (see Sec. II-C). At this arithmetic intensity,
SpGEMM is a memory-bound operation, and SpGEMM’s
peak performance has an upper bound of β∗AI , where β is the
memory bandwidth. Assuming 50GB/s bandwidth available on
a multicore processor, the estimated peak performance can be
as high as 50/16 = 3.13 GFLOPS (billions of floating point
operations per second). However, state-of-the art parallel al-
gorithms based on heap and hash merging [12] attain no more
than 500 MFLOPS on a socket of an Intel Skylate processor.
No prior work clearly explained these observed performances
of SpGEMM algorithms as no standard performance model
has been developed to understand SpGEMM’s performance.
In this paper, we rely on the Roofline model [13] and
develop lower and upper performance bounds for SpGEMM
algorithms. In developing practical lower bounds of AI, we
considered the fact that an SpGEMM may read data more than
once. Even with a tight lower bound on AI, an algorithm can
attain peak performance (that is β ∗AI) only if it (a) saturates
the memory bandwidth, (b) does not have high latency cost,
and (c) makes use of full cache lines of data. We show that
these requirements are not satisfied by current column-by-
column algorithms, resulting into lower than attainable peak
FLOPS.
Here, we develop a new algorithm based on the outer prod-
uct of matrices. The goal of this algorithm is to eliminate irreg-
ular data accesses, increase bandwidth utilization, and attain
the performance predicted by the Roofline model. Our algo-
rithm is built upon the expansion-sort-compress paradigm [14],
[15]. Given two n×n matrices, this algorithm performs n outer
products to generate intermediate tuples of row index, column
index, and multiplies values. These tuples are then sorted, and
duplicate row and column indices are merged to get the final
product. To ensure efficient bandwidth utilization, we store
intermediate tuples into partially-sorted bins so that each bin
can be sorted and merged independently. This technique of
binning intermediate data for better bandwidth utilization is
called propagation blocking (PB) [16]. Prior work has used
propagation blocking to improve the bandwidth utilization
as well as the overall performance of sparse matrix-vector
multiplications [16], [17]. Here we use propagation blocking to
regularize data movements in SpGEMM. Hence, our algorithm
is named as PB-SpGEMM.
All three phases of PB-SpGEMM (expand, sort, and com-
press or merge) stream data from memory. Hence, every
phase has no significant latency overhead, utilizes full cache
lines, and attains a bandwidth close the STREAM benchmark.
Therefore, given the bandwidth of a system and input matrices,
PB-SpGEMM’s performance matches the prediction of the
Roofline model.
Similar to any expansion-sort-compress algorithm, PB-
SpGEMM has to store flop intermediate tuples, where flop
ar
X
iv
:2
00
2.
11
30
2v
1 
 [c
s.D
C]
  2
6 F
eb
 20
20
is the number of multiplications needed by SpGEMM. This
can lead to significant data movements when the compression
factor (the ratio of flop to nonzeros in the output) is greater
than four. However, most practical SpGEMM operations have
small compression factors. For example, when squaring ma-
trices from the SuiteSparse Matrix Collection, more than
80% of SpGEMMs have a compression factor less than 3
and about 99% have a compression factor less than 6 [18].
Hence, for most practical scenarios, PB-SpGEMM performs
predictably better than existing heap and hash algorithms. For
multiplications with compression factor greater than four, PB-
SpGEMM’s performance is still predictable, but it can run
slower than the alternatives.
We summarize the key contributions of this paper below:
1) We develop a Roofline performance model for SpGEMM
algorithm. This model can show the limitations of exist-
ing column-by-column SpGEMM algorithms.
2) We develop an outer-product based SpGEMM algorithm
called PB-SpGEMM. We use propagation blocking, in-
cache sorting and merging for better bandwidth utiliza-
tion.
3) Different phases of PB-SpGEMM utilize bandwidth close
to the STREAM benchmark. Given the bandwidth of a
system and input matrices, PB-SpGEMM’s performance
matches the prediction from our Roofline model.
4) For SpGEMMs with compression factor less than four,
PB-SpGEMM is 30% to 50% faster than previous state-
of-the-art algorithms for multicore processors.
Our implementation of PB-SpGEMM is publicly available
at Bitbucket: https://bitbucket.org/azadcse/outerspgemm .
II. A PERFORMANCE MODEL FOR SPGEMM
A. Notations
Given two sparse matrices A∈Rm×k and B∈Rk×n,
SpGEMM computes another potentially sparse matrix
C∈Rm×n. A(:, j) denotes jth column of A, and A(i, j)
denotes ith entry in the jth column. In our analysis, we con-
sider n-by-n matrices for simplicity. We use Erdo˝s-Re´nyi (ER)
random matrices throughout the paper. An ER matrix with
d nonzeros per column has d nonzeros uniformly distributed
in each column. Among many representations for sparse
matrices, we consider three standard data structures in this
paper: Compressed Sparse Row (CSR), Compressed Sparse
Column (CSC), and Coordinate format (COO). See Langr and
Tvrdik [19] for a comprehensive discussion on sparse matrix
storage formats.
Given a matrix A, nnz(A) denotes the number of nonzeros
in A, and d(A) = nnz(A)n denotes average nonzeros in a row
or column. In computing C=AB, flop denotes the number of
multiplications needed. Throughout the paper, floating point
operations only denote multiplications. The compression fac-
tor (cf) denotes the ratio of flop to nonzeros in the output
matrix: cf= flopnnz(C) . Since at least one multiplication is needed
for every output nonzero, cf≥1.
Input Access
Column wise Outer Product
Output Heap/Hash/SPA [12], [20], [21], [22] [23]
Formation ESC [15], [18] [24], this
TABLE I: Classification of SpGEMM algorithms based on
access patterns of input and output matrices. This paper falls
into the lower right cell.
=
Merge
x
MultiplyC"
C A B
Fig. 1: An illustration of column SpGEMM algorithm. To gen-
erateC(:, 5), the algorithm selects 2nd, 4th and 7th columns of
A (shown in red) based on the nonzeros in B(:, 5) (shown in
blue). The selected columns are multiplied by corresponding
entries in B(:, 5), forming entries in Cˆ(:, 5) (shown in yellow).
Duplicate entries in Cˆ(:, 5) are merged to generate C(:, 5)
(shown in green). The merging strategy differs in different
column SpGEMM algorithm
B. Classes of SpGEMM algorithms categorized by data access
patterns
To compute C=AB, most algorithms also manipulate an
expanded matrix Cˆ∈Rm×n that contains flop unmerged en-
tries. The final output C can be obtained by merging entries
in Cˆ with the same row and column indices. Therefore, if we
primarily focus on data accesses, SpGEMM algorithms have
two distinct phases: (a) input matrices are accessed to form Cˆ,
(b) duplicate entries in Cˆ are merged to formC. Here, merging
means adding multiplied values with the same row and column
indices. For input accesses, we have only two options: (1)
access A and B column-by-column (or equivalently row-by-
row) [12], [20], [21], [22], [15], and (2) access A column-
by-column and access B row-by-row for outer product [23],
[24]. For output formations, we have many options. Prior work
used the expand-sort-compress strategy [15], [24], [18] or used
accumulators based on heap [22], hash table [12], and a dense
vector called SPA [20], [25]. Table I summarizes prior work
based on their data access patterns. Next, we briefly summarize
four major classes of algorithms and discuss their data access
patterns.
Column SpGEMM algorithms based on the
heap/hash/SPA accumulator. Some algorithms materialize
only one column of Cˆ, merge duplicate entries in that
column and generate the corresponding column of C. Since
Fig. 2: An illustration of outer product SpGEMM algorithm. Each multiplication between a column from A and a corresponding
row from B generate a sub 1-rank matrix, those sub 1-rank matrices added up yield the final result C
No of Accesses Streamed Access Cache Line Utilization
Algorithm A B Cˆ C A B Cˆ C A B Cˆ C
Column SpGEMM (Heap/Hash/SPA) d 1 0∗ 1 × X X X × (when d < 8 ) X X X
ESC (column-wise) d 1 2 1 × X × X × (when d < 8 ) X X X
ESC (outer product) 1 1 2 1 X X X∗∗ X X X X X
* Column SpGEMM generates one column of Cˆ at a time. ** Using blocking techniques discussed in this paper.
TABLE II: Data access patterns in different SpGEMM algorithms when multiplying two ER matrices with d nonzeros per
column. If an algorithm does not stream data, it has high latency cost.
column-by-column and row-by-row algorithms have similar
computational patterns, we only discuss column SpGEMM
algorithms in this paper. An illustration of the column
SpGEMM algorithm is shown in Fig. 1, where C(:, 5) is
generated by merging a subset of columns in A determined
by nonzeros in B(:, 5). Most column-by-column algorithms
are based on Gustavson’s algorithm [26], and they differ
from one another based on how they merge entries in Cˆ(:, j)
to obtain C(:, j). Prior work have used heap [22], hash
table [27], or a dense vector called SPA [25] for merging
columns. A common characteristic of all column-by-column
algorithms is that they read from B and write to C one
column at a time. However, columns of A may be read
irregularly several times based on the nonzero pattern of B.
We explain the access pattern of A when multiplying two
ER matrices with d nonzeros per column. In the worst case,
each column of A will be read d times from memory because
columns of A are accessed randomly without any spatial
locality. Thus, we read A d times over the execution of the
algorithm. Second, if we store matrices in the CSC format,
column SpGEMM algorithms have good spatial locality for B,
Cˆ andC, but not forA. Hence, we pay heavy latency costs for
irregular accesses ofA’s columns. Third, if a column ofA has
very few entries (e.g., when d is less than 8), we read a column
of A in a cache line, but the whole cache line is not used.
Hence, column SpGEMM waste memory bandwidth for very
sparse matrtices. The first row of Table II summarizes the data
access patterns in column SpGEMM algorithms. Similarly, a
row-by-row algorithm (when matrices are stored in the CSR
format) has good spatial locality for A, Cˆ and C, but not for
B.
The Expand-Sort-Compress (ESC) algorithms. Algo-
rithms based on the ESC technique generate full Cˆ before
merging duplicate entries. The original ESC algorithm [15]
developed for GPUs generates Cˆ(:, i) using B(:, i)1 similar to
the first step of column SpGEMM algorithm shown in Fig. 1.
After the entire Cˆ is constructed, flop tuples in Cˆ are sorted
and merged to generate the final output. Since sorting can be
efficiently performed on GPUs, ESC SpGEMM can performed
better than other algorithms on GPUs [15], [18]. The column
ESC algorithm has access patterns for A, B, and C similar
to the column SpGEMM algorithm. Additionally, Cˆ needs
to access twice (one write after multiplication and one read
before merging). The second row of Table II summarizes the
data access patterns in the column ESC algorithms.
Previous work [15] tried to eliminate reading and writing Cˆ
from memory by partitioning A by rows and multiplying each
partition with B. Thus, this approach generates one partition
of Cˆ at a time, which can fit in cache when a large number
of partitions is used. However, the partitioned ESC algorithm
will need to read Bˆ several times as well as reading Aˆ
multiple times for each partition. Hence, the effectiveness of
partitioning depends on the number of partitions and nonzero
structures of input matrices.
Outer product algorithms. Fig. 2 shows an illustration
of the outer product algorithm. In this formulation, A(:, i) is
multiplied with B(i, :) to form a rank-1 outer product matrix.
The rank-1 matrices can be merged by using a heap or using
ESC. A heap can be used to merge the outer product of A(:, i)
and B(i, :) with the current output after every iteration [23].
However, this algorithm is too expensive as it requires n
merging operations. Hence, we do not elaborate this algorithm
further.
The rank-1 matrices from outer product can also be merged
using the ESC strategy. In this case, input matrices are
streamed only once. Hence, outer product can fully utilize
1Earlier work [15] actually used a row-by-row algorithm with CSR matri-
ces, which is equivalent to column SpGEMM with CSC matrices.
cache lines when reading inputs. To sort and merge unmerged
tuples in Cˆ, we need to read them from memory. This can
significantly increase the memory traffic. Nonetheless, with
an efficient blocking technique discussed in this paper, we
can stream Cˆ whenever it is accessed. Hence, we can utilize
full cache lines and saturate memory bandwidth to offset more
data accesses. The last row of Table II summarizes the data
access patterns in the outer-product-based ESC algorithm.
C. Arithmetic Intensity (AI) of SpGEMM.
AI is the ratio of total floating-point operations to total data
movement (bytes). To compute C=AB, one must read A and
B from memory, and write C to memory2. Assume that on
average, we need b bytes to store a nonzero. Then,
AI ≤ nnz(C) ∗ cf
[nnz(A) + nnz(B) + nnz(C)] ∗ b ≤
cf
b
. (1)
If we use 4 bytes for indices and 8 bytes for values, then b
is 16 bytes (assuming that matrices are stored in the COO
format). Here, cf is a property of input matrices and it varies
from 1 to 8 for most practical sparse matrices [18]. Even in
the best scenario when we read and write matrices just once,
the arithmetic intensity of SpGEMM is very low: often less
than one. Let’s consider multiplying two ER matrices A and
B, since cf for ER matrix is close to 1 in expectation [28],
according to our model, AI will be around 116 flops/byte.
At this AI, SpGEMM’s performance is completely bound by
memory bandwidth3.
Peak performance of an SpGEMM algorithm. Suppose,
a smart algorithm achieves the best AI of cfb . Let β be the
memory bandwidth (BW) of the system measured by the
STREAM benchmark. Then, the performance measured by
FLOPS (floating point operations per second) follows this
inequality:
FLOPS ≤ β cf
b
. (2)
Hence, the peak FLOPS for a given problem on a given
architecture can be at most β cfb assuming that SpGEMM is
bandwidth bound. For example, on an Intel Skylake processor
with 50GB/s memory bandwidth, the peak performance for
multiplying ER matrices can be at most 50 ∗ 1/16 = 3.13
GFLOPS as shown in Fig. 3. However, state-of-the-art column
SpGEMM algorithms achieve less than 20% of this peak
performance as discussed in several recent papers [12].
As discussed before, the primary reasons behind the subop-
timal performance of SpGEMM algorithms are: (1) algorithms
read/write data multiple times, (2) algorithms access data
at random memory locations, (3) algorithms may not fully
utilize cache lines. The first problem is inherent to SpGEMM
and cannot be completely overcome when input matrices
are unstructured. The irregular data access can under utilize
bandwidth, impeding the performance of SpGEMM when the
2We ignore that C may have to first read from memory to cache before
writing.
3On modern processors, SpGEMM can be computation bound only if cf >
1000, it is unrealistic for sparse matrices.
compression factor is small. In this paper, we address this
irregular access problem and develop an algorithm where all
steps of SpGEMM utilize full memory bandwidth. Never-
theless, Equation 1 seems an upper bound that no existing
algorithm can attain. Next, we consider a more practical bound
for AI.
A more practical bound on AI for SpGEMM. In column
SpGEMM algorithm, we read the first input matrix several
times depending on the nonzero pattern of B. To obtain
a lower bound for column SpGEMM, we assume that the
accesses of A have no temporal or spatial locality, and all
accesses of A incur memory traffic. Hence, in the worst case,
the amount of data read from A is flop ∗ b bytes. Hence AI
for column SpGEMM can be approximated as follows:
AI(col) ≥ nnz(C) ∗ cf
[nnz(C) ∗ cf + nnz(B) + nnz(C)] ∗ b
≥ cf
(2 + cf)b
. (3)
By contrast, the outer product based algorithm based on the
ESC strategy generates all unmerged tuples in Cˆ, writes all
nnz(Cˆ) = flop tuples in memory, reads them again for sorting
and merging. Hence, in the worst case, ESC-based algorithms
performs additional 2 ∗ flop memory red-write operations,
giving us the following AI:
AI(outer) ≥ nnz(C) ∗ cf
[nnz(A) + nnz(B) + nnz(Cˆ) + nnz(C)] ∗ b
=
nnz(C) ∗ cf
[nnz(A) + nnz(B) + 2 ∗ flop + nnz(C)] ∗ b
≥ cf
(3 + 2cf)b
. (4)
For ER matrices with cf = 1, Eq. 4 gives us an arithmetic
intensity of 1/80 (assuming b = 16 bytes). Fig. 3 shows
these lower bounds on AI with the corresponding attainable
performance. We will experimentally show that the newly-
developed outer-product-based algorithm can attain the peak
performance based on Eq. 4.
III. THE PB-SPGEMM ALGORITHM
A. Overview of the PB-SpGEMM algorithm
Algorithm 1 provides a high-level description of an
SpGEMM algorithm based on the expand-sort-compress
scheme. In the symbolic step, we estimate flop for the current
multiplication and allocate memory for unmerged tuples in Cˆ.
Then, multiplied tuples are formed and stored in Cˆ, which is
then sorted and merged to form C.
Our algorithm follows the exact same principle, but uses
outer products and propagation blocking for efficient band-
width utilization. Fig. 4 explains the propagation blocking idea
based on two 4 × 4 matrices and two bins. After we expand
tuples, we partially order them in two bins, where the first
bin stores rowid 0 and 1, and the second bin stores rowid
2 and 3. If these bins fit in L2 cache, sorting and merging
can be be performed efficiently in cache by different threads.
0.125
0.25
0.5
1
2
4
8
   1/128    1/64    1/32    1/16    1/8    1/4A
tta
in
ab
le
 P
er
fo
rm
an
ce
 (G
FL
O
P/
s)
Arithmatic Intensity (AI)
Sp
GE
M
M
up
pe
r b
ou
nd
Ou
te
r S
pG
EM
M
Co
lu
m
n
Sp
GE
M
M
𝜷 ∗ 𝑨𝑰
Fig. 3: Using Roofline model to estimate performance when
multiplying two Erdo˝s-Re´nyi matrices on a single-socket Intel
Skylake processor. Memory bandwidth β is 50GB/s as mea-
sured with STREAM benchmark
Algorithm 1: ESC-SpGEMM algorithm
Input: A , B
Output: C
1 Cˆ←Symbolic (A, B) . Create space for Cˆ;
2 Cˆ← Expand (A, B) . Create unmerged tuples ;
3 Sort (Cˆ) . perform radix sort using (rowid, colid) as
keys;
4 C ← Compress (Cˆ) . merge duplicated tuples ;
After generating a tuple, if we directly write it to its designated
global bin, we may not fully utilize the cache line. Hence, each
thread also maintains small local bins that are filled in cache
before flushing to global bins in memory. The use of local
bins is illustrated in Fig. 5. Hence, our algorithm has several
tunable parameters, including (a) nbins: number of global or
local bins, and (b) Lbinwidth: the width of local bins. We
experimentally select these parameters as will be discussed in
the experimental section.
Algorithm 2 described the PB-SpGEMM algorithm. To
facilitate outer product operations, input matrices A and B
are passed as CSC and CSR formats, respectively. Here, we
store C in CSR format, but it can be easily stored in CSC
without any overhead. Finally, the expanded matrix Cˆ is stored
in the COO format. Similar to Algorithm 1, PB-SpGEMM
has four phases (a) symbolic (line 1) (b) Expand (lines 5-
14), (c) Sort (line 16), and (d) Compress (line 17). However,
Algorithm 2 differs from previous ESC algorithms in two
crucial ways: (1) we use outer product to stream data from
input matrices, (2) we use propagation blocking to organize the
expanded matrix into bins so that all phases of the algorithm
saturate the memory bandwidth. We additionally perform a
post-processing step (line 9) to convert the output to CSR
format. Next, we discuss these phases in detail.
B. Symbolic Phase
In the symbolic phase, we estimate the memory requirement
for Cˆ, estimate number of bins and allocate space for global
Algorithm 2: PB-SpGEMM algorithm
Input: A in CSC, B in CSR
Output: C in CSR
1 GBin ← Symbolic(A,B) . expanded tuples (Cˆ)
partitioned into nbins bins;
2 nthreads ← Number of threads ;
3 Lbinwidth ← 512 . Local bin width, default set to 512
bytes ;
4 LBin ← Create a 3D array of size nthreads × nbins ×
Lbinwidth . LBin[Ti] is local bin, private for thread Ti ;
5 for i← 1 to k in parallel do
6 Ti ← The thread executing the ith iteration ;
7 for (rowid, i, aval) ∈ A(:, i) do
8 for (i, colid, bval) ∈ B(i :, ) do
9 binid ← rowid % nbins ;
10 if size(LBin[Ti][binid]) = Lbinwidth then
11 MemCopy (GBin[binid], LBin[Ti][binid])
;
12 LBin[Ti][binid] ← φ ;
13 mval ← aval × bval ;
14 Append (LBin[Ti][binid], (rowid, colid,
mval));
15 for all thread Ti in parallel do
16 for binid ← 1 to nbins do
17 if size(LBin[Ti][binid]) 6= 0 then
18 MemCopy (GBin[binid], LBin[Ti][binid]) ;
19 for binid ← 1 to nbins in parallel do
20 GBin[binid] ← Sort (GBin[binid]) . perform radix
sort in the ith bin using (rowid, colid) as keys;
21 GBin[binid] ← Compress (GBin[binid]) . merge
duplicated tuples by two-pointer method;
22 C← ConvertCSR(GBin);
bins (Gbin). Algorithm 3 describes the symbolic step. We
compute flops for the current multiplication using an outer-
product style computation. The loop in Line 2 accesses A
column by column and B row by row. If there is a nonzero
entry in A(rowid, i), it must be multiplied by all nonzeros in
the ith row of B. Hence, line 5 adds nnz(B(i, :))×nnz(A(:, i))
to the flop count. After we compute flops, we compute the
number bins (line 6) so that each global bin fits in the L2 cache
in the sorting and merging phase. We then allocate memory
for global bins (line 7). Algorithm 3 needs only O(n) time
and attains higher memory bandwidth by streaming just row
and column pointer arrays of B and A, respectively.
Note that Algorithm 3 is much simpler than symbolic
steps used in column SpGEMM algorithms where we need
to estimate nnz(C). An outer product algorithm can be also
developed without a symbolic phase. For example, a linked-
list can be used to dynamically append expanded tuples [24].
However, the dynamic memory allocations in parallel sections
Fig. 4: An example of PB-SpGEMM multiplying two 4×4 matrices with two bins, the first bin blocks propagation of tuples
with rowid 0 and 1, the second blocks 2 and 3
Fig. 5: Using thread-private local bins to utilize cache lines.
If nbins is the total number of global bins, each thread also
creates nbins number of small local bins to store tuples as
they are generated. When a local bin is full, tuples inside will
be flushed to the corresponding global bin. This example is
using two local bins per thread and two global bins
Algorithm 3: Symbolic Phase (Static Schedule)
Input: A in CSC, B in CSR
Output: Gbin, nbins
1 flops ← 0;
2 for i ← 1 to k in parallel do
3 nnz(B(i, :)) ← B.rowptr[i+ 1] - B.rowptr[i];
4 nnz(A(:, i)) ← A.colptr[i+ 1] - A.colptr[i];
5 flops + = nnz(B(i, :))× nnz(A(:, i))
6 nbins ← flops / L2_CACHE_SIZE ;
7 Gbin ← MemAlloc(flops) . allocate shared array to store
tuples, no initialization needed
could result in poor performance.
C. Expand
Lines 5-14 in Algorithm 2 describes the expand phase of
PB-SpGEMM. In the expand phase, a thread reads A(; , i)
and B(i, :) and performs their outer product. The binid is
computed by rowid of the multiplied tuple (rowid comes from
the rowid in A(:, i)). Once a local bin, this thread will flush
the tuples inside to the corresponding global bin (line 10-12).
Then, the newly-formed tuple is appended to its designated
local bin in line 14. After the multiplication, there still could
be some tuples left in the local bins because bins were not
full. Lines 15-18 send the partially full local bins to global
bins.
With the local binning, we always write tuples in multiple
of cache lines. We make sure the number of local bins and
the size of local bins are small. typically, we create 1024 bins
and 512 bytes per local bin so that all local bins for a thread
easily fit in the cache.
D. Sorting
After the multiplication and propagation blocking phase, the
expanded matrix Cˆ is stored in the COO format, partitioned
into several bins. Then, we sort Cˆ to bring tuples with the same
(rowid, colid) pair close to one another for merging in the next
phase. As shown in Algorithm 2, sorting can be performed
independently in each bin because bins do not share tuples
with the same rowid. Hence, a thread can sort tuples in a bin
sequentially, while other threads sort other bins in parallel.
The sorting algorithm uses the (rowid, colid) pairs as keys
and the multiplied values as payloads. For this purpose, we
use an in-place radix sort (similar to American flag sort [29])
that groups the keys by individual bytes sharing the same
significant byte position. In the worst case, this in-place radix
sort needs b passes over the data, where b is the number of
bytes needed to store a key. Hence, Radix sort can be faster
than comparison-based sorting when keys are stored in fewer
bytes.
Preparing integer keys for Radix sort. In our algorithm,
we use 4-byte integers for row and column indices. We
concatenate the rowid and colid to form a combined 8-byte
integer key for radix sort. With 8-byte keys, radix sort may
need 8 passes over the data to sort tuples, which can incur
significant data transfers. We can reduce the key space by
using the fact that bins are already grouped by consecutive
row indices. For example, if the input is a 1M × 1M matrix
and we create 1K bins to block the propagation, the rowids
of tuples within the same bin will in adjacent 1k, then we
only need 10 bits to represent the remainder rather than a full
32-bit integer, and we still have 32-10=22 bits to store colid.
Furthermore, if we assume that matrices have at most 1M rows
and columns, we can use 20 bits for colid and 20-10=10 bits
for rowid (assuming 1K bins). Hence, in most practical cases,
we can potentially squeeze keys into 4-byte integers, needing
four passes over the data for sorting.
In-cache sorting. Since we sort Cˆ containing flop tuples,
four passes over the data directly from memory can be the
performance bottleneck. Fortunately, bins can help in this case
if tuples in a bin fit in L3 or L2 cache. In many practical
problems, we can indeed fit a bin into L2 cache. For example,
consider an ER matrix A with 1M rows, 1M columns and
4M nonzeros. When computing A2, we generate 16M tuples
in expectation. If 1K bins are used, each bin will contain 16K
tuples. If we use 4 byte keys (as described in the previous
paragraph) and 8 byte payloads, we need 192KB memory to
store all tuples in a bin. This can easily fit in L2 cache of most
modern processors. Multiplications with high compression
ratios and matrices with denser rows can create problems for
some bins as tuples may not fit in L2 cache. In these cases,
we either use more bins or create bins with variable ranges
of rows. Hence, our algorithm reads a bin from memory and
perform radix sort on data stored in cache. The sorted data
can then be compressed while it is still in the cache. Hence,
the sorting phase reads (b ∗ flop) bytes.
E. Compression
After we sort each bin, tuples with the same (rowid, colid)
pair are stored in adjacent locations. Then, in the compression
phase, we sum numeric values from tuples with the same
(rowid, colid). As shown in Algorithm 2, compression can be
performed independently in each bin because bins do not share
tuples with the same rowid. Hence, a thread can compress
tuples in a bin sequentially, while other threads compress other
bins in parallel.
As tuples are already sorted within a bin, compression is
done by scanning the tuples in the sorted order. We implement
this using two in-place pointers, which only walk the array
once. The first pointer (p1) walks through the array, the second
pointer (p2) maintains the location to be merged. Every time
when p1 points to a new location it checks with p2, if the keys
of the two tuples are the same, simply add the numeric value
of the first tuple to the second, if not, we move p2 to the next
location and copy the tuple2 there, keep doing this until the
p1 reach the end of the array.
Table II summarizes the computational complexity, band-
width and latency costs, in-cache operations and parallelism
schemes for all phases of PB-SpGEMM.
IV. EXPERIMENT SETUP
A. Software
We evaluate the performance of PB-SpGEMM against
some state-of-the-art column SpGEMM algorithms, namely
HeapSpGEMM, HashSpGEMM and HashVecSpGEMM. All
of these algorithms have memory access patterns of a typical
column SpGEMM algorithm discussed earlier.
• HeapSpGEMM is a column SpGEMM algorithm that uses
heaps to merge columns [22]. To multiply two n×n ER ma-
trices with d average nonzeros per column, HeapSpGEMM
complexity is O(nd2 log d), where the log d term comes
from manipulating heaps. Hence, HeapSpGEMM can be
efficient for matrices with small d, but can be expensive for
relatively dense matrices. Each column of C can be formed
in parallel using thread-private heaps.
• HashSpGEMM uses a hash table to merge columns[27],
[12]. Its complexity is O(nd2) for ER matrices assuming
that hash tables have limited collisions. Each column of C
can be formed in parallel using thread-private hash tables.
• HashVecSpGEMM is a variant of hash algorithm, which
utilizes vector registers for hash probing [12]. HashVec-
SpGEMM may preform better when the collision in the hash
table is high.
Previous work [12] has shown that the optimized heap and
hash algorithms largely outperform Intel MKL and Kokkos-
kernel. Considering this, we will not include those algorithms
and software in our evaluation. All implementations are com-
piled by GCC-8.2.0, with flags ”-fopenmp” ”-O3” ”-m64” ”-
march=native” enabled.
B. Hardware
In our experiments, we use a dual-socket Intel Skylake
system and an IBM POWER9 system as described in Table
IV. Since most of our experiments are conducted on the
Skylake processor, we examine its memory system carefully
using the STREAM benchmark [30]. Table V shows the
sustainable memory bandwidth for Copy, Scale, Add and Triad
benchmarks on single and dual sockets of the Skylake system.
Hence, we expect that PB-SpGEMM should attain ∼ 55
GB/s on a single socket and ∼ 105 GB/s on two sockets
of Skylake. While PB-SpGEMM indeed attains ∼ 55 GB/s
in every phase, its dual socket performance falls short of
STREAM benchmark. Hence, we primarily focus the single-
socket performance because memory bandwidth is harder to
predict in Non-Uniform Memory Access (NUMA) domains.
We also show some results with both sockets and explain
dual socket performance in Sec. V-D. When experimenting
with a single socket, we set ”OMP PLACES” to ”cores”,
”OMP PROC BIND” to ”close”, and restrict the memory
allocation to NUMA node 0 by using ”numactl –membind=0”.
C. Dataset
We closely follow a recent paper [12] to select matrices
to test SpGEMM algorithms. We use the R-MAT recursive
matrix generator to generate synthetic matrices. RMAT has
four parameters a, b, c, and d. ER matrices are generated with
a=b=c=d=0.25, and Graph-500 matrices are generated with
a=0.57, b=c=0.19, d=0.05. Here, we refer to the latter graphs
as RMAT. For random graphs, edge factor denotes the average
number of nonzeros in a row or column. A matrix of scale k
Phase Comp. Complexity Bandwidth cost Latency In-cache operations Parallelism
Expand O(flop) reading (b ∗ (nnz(A) + nnz(B))) bytes Negligible manipulating local bins cols of A andwriting (b ∗ flop) bytes rows of B per thread
Sort O(flop) reading (b ∗ flop) bytes Negligible shuffling (4 ∗ b ∗ flop) bytes bins per thread
Compress O(flop) writing (b ∗ nnz(C)) bytes Negligible read/write (b ∗ flop) bytes bins per thread
TABLE III: Computational complexity and data access patterns in different phases of PB-SpGEMM.
Skylake-SP POWER9
CPU Model Intel Xeon Platinum 8160 IBM POWER9
Architecture x86 64 ppc64le
#Sockets 2 2
#Cores/socket 24 20
Clock 2.1GHz 3.8GHz
L2 cache 1024KB/core 512KB/two cores
L3 cache 33792KB/socket 10240KB/two cores
Memory Size 250GB 1TB
Bandwidth 100GB/s 250GB/s
TABLE IV: The configuration of hardware used for evaluation
Copy Scale Add Triad
single socket 47.40 46.85 54.00 57.04
dual socket 97.73 87.43 107.00 108.42
TABLE V: Stream benchmark result of the evaluation platform
Graph n(A) nnz(A) d(A) flops nnz(C) cf
2cubes sphere 101.5K 1.6M 16.23 27.5M 9.0M 3.06
amazon0505 410.2K 3.4M 8.18 31.9M 16.1M 1.98
cage12 130.2K 2.0M 15.61 34.6M 15.2M 2.14
cant 62.5K 4.0M 64.17 269.5M 17.4M 15.45
hood 220.5K 9.9M 44.87 562.0M 34.2M 16.41
m133 b3 200.2K 800.8K 4.00 3.2M 3.2M 1.01
majorbasis 160.0K 1.8M 10.94 19.2M 8.2M 2.33
mc2depi 525.8K 2.1M 3.99 8.4M 5.2M 1.6
offshore 259.8K 4.2M 16.33 71.3M 69.8M 3.05
patents main 240.5K 560.9K 2.33 2.6M 2.3M 1.14
scircuit 171.0K 958.9K 5.61 8.7M 5.2M 1.66
web-Google 916.4K 5.1M 5.57 60.7M 29.7M 2.04
TABLE VI: A list of real-world graphs used in our evaluation
has 2k rows and columns. We also use 12 real-world matrices
from SuiteSparse Matrix Collection [31] as shown in Table
VI. In our experiments, we randomly generate two random
matrices and multiply them or multiply a real matrix with
itself. These computations cover some representative applica-
tions such as Markov clustering [9] and triangle counting [2].
Due to space limitation, we did not explore other application
scenarios such as multiplying a square matrix by a tall-and-
skinny matrix as needed in betweenness centrality algorithms.
V. RESULTS
A. Select parameters of PB-SpGEMM
In the expand phase, local bins are used to improve data
locality, the propagation of tuples will be blocked into small
bins, and they are moved to global shared memory. A local bin
should be large enough to have good utilization of a cache line
so that it can be send to global bins without wasting memory
(a) Local bin width (b) The number of bins
Fig. 6: Impact of bin width and the number of bins, data based
on ER matrices with scale 20 and edge factor 4
bandwidth. Furthermore, the total size of local bins per thread
should be smaller than the L2 cache 4. Hence, number of
bins and local bin width are two parameters upon which the
performance of PB-SpGEMM depends. Fig. 6a shows our
experiment where we vary the width of local bins to observe
its impact on the performance of the expand phase. Smaller
local bins do not utilize full cache lines, resulting in reduced
sustained bandwidth. Based on this experiment, we used 512
bytes for every thread-private bin in all experiments in the
paper.
The number of bins (nbins), on the other hand, is a tradeoff
between the expand and sort phases. Recall that sorting is
performed independently in each bin. Hence, increasing the
number of bins ensures that tuples in a global bin fit in the
cache. However, increasing the number of bins also reduces the
average bin size, which may reduce the bandwidth utilization
of the expand phase. Fig. 6b shows the impact of nbins over
the expand and sorting phases. With more bins, radix sort can
be entirely performed in L2 cache. Hence, in-cache sorting
bandwidth can be as high as 200 GB/s. Hence, the number of
bins is determined by L2/L3 cache size and total number of
flop, and for most practical matrices, we use 1K or 2K bins.
B. Overall performance of PB-SpGEMM with respect to the
state-of-the-art
At first we discuss the overall performance of PB-SpGEMM
with respect to state-of-the-art column SpGEMM algorithms.
Here, we consider both ER, RMAT, and real matrices as
discussed in Sec IV-C.
Performance with ER random matrices. Fig. 7a compares
the performance of PB-SpGEMM, HeapSpGEMM, Hash-
SpGEMM, and HashVec-SpGEMM with ER matrices of var-
ious scales and edge factors on a single socket of the Skylake
4For system that has multiple threads per core, it should be counted by
core.
(a) Performance scaling with matrix density (b) Sustained bandwidth
Fig. 7: Performance and bandwidth evaluation with ER matrices on a single socket (24 cores) from Skylake
Fig. 8: Performance evaluation with ER matrices on a single socket (20 cores) from Power9
(a) Performance scaling with matrix density (b) Sustainable bandwidth
Fig. 9: Performance and bandwidth evaluation with RMAT matrices on a single socket (24 cores) from Skylake
Fig. 10: Performance evaluation with RMAT matrices on a single socket (20 cores) from Power9
server. Here, we multiply two ER matrices with the same
scale and edge factor. For a given scale, the performance
of PB-SpGEMM is stable (between 700 and 800 GFLOPS)
and is better than column SpGEMM algorithms for all edge
factors considered. This performance of PB-SpGEMM can
be explained by Fig. 7b that reports the sustained bandwidth
of PB-SpGEMM. We observed that the sustained bandwidth
on a single socket is between 40 and 50 GB/s, which are
close to the STREAM benchmark. Since RE matrices have
cf = 1, the lower bound of AI is 180 flops/byte according
to Eq. 4. Hence, the performance of PB-SpGEMM should
be at least 40 ∗ 180 = 500 MFLOPS when the sustained
bandwidth is 40 GB/s and at least 50 ∗ 180 = 625 MFLOPS
when the sustained bandwidth is 50 GB/s. Fig. 7a confirms
that PB-SpGEMM’s performance remains close to this lower
bound estimate. By comparison, hash and heap algorithms
have lower performance primarily because of their irregular
memory accesses. However, as we increase the edge factor,
the performance of column SpGEMM may increase because
of increased utilization of cache lines.
Fig. 11: Performance evaluation with real matrices (matrix squaring) on a single socket of the Skylake server. From left to
right, we sort matrices in the ascending order of the compression factor
Fig. 12: Scalability on ER (left) and R-MAT matrices (right).
Both are scale 16 matrices with edge factor of 16.
Similar performance is observed on the Power9 system
as shown in Fig. 8. As observed on Skylake, PB-SpGEMM
performs better than column SpGEMM algorithms and its
performance remains relatively stable for various matrix size
and sparsity.
Performance with RMAT random matrices. Fig. 9a com-
pares the performance of SpGEMM algorithms with RMAT
matrices of various scales and edge factors on a single socket
of the Skylake server. Here, we multiply two RMAT matrices
with the same scale and edge factor. As with the ER matrices,
the performance of PB-SpGEMM remains between 700 and
900 GFLOPS and is generally better than column SpGEMM
algorithms. However, Fig. 9b shows that PB-SpGEMM attains
lower sustained bandwidth between 30 and 40 GB/s. The
reason behind this lower than STREAM bandwidth is the
skewed degree distributions in RMAT matrices, resulted in
variable-size bins. Load imbalance in different bins makes the
expansion phase less bandwidth efficient. We will discuss more
in the scalability results.
Similar performance is observed on the Power9 system as
shown in Fig. 10. As observed on Skylake, PB-SpGEMM
performs better than column SpGEMM algorithms and its
performance remains relatively stable for various matrix size
and sparsity.
Fig. 13: Scalability breakdown on ER (left) and R-MAT
matrices (right). Both are scale 16 matrices with edge factor
16.
Performance with real matrices. Fig. 11 shows the perfor-
mance of SpGEMM algorithms when squaring real matrices
on a single socket of the Skylake server. As before, the
sustained bandwidth of PB-SpGEMM is between 47 and 55
GB/s and its performance is relatively stable. In Fig. 11, we
sort matrices in the ascending order of the compression factor
(from left to right). PB-SpGEMM is generally faster than its
peers.
C. Scalability
Fig. 12 shows the strong scaling of SpGEMM algorithms
from 1 to 24 threads within a socket of the Skylake proces-
sor. We observe that PB-SpGEMM runs faster that column
SpGEMM on all concurrencies. All SpGEMM algorithms
scale well within a socket. On 24 cores, PB-SpGEMM attains
about 16× speedup for ER matrices and 10× speedup for
RMAT matrices. For RMAT matrices, PB-SpGEMM does not
scale well on high thread counts because of the load imbalance
caused by highly skewed nonzero and flop distributions. We
tried to eliminate the load imbalance by variable length bins,
but this can lead to lower sustained bandwidth as was observed
in Fig. 9b.
Fig. 14: Multi-socket performance of SpGEMM algorithms on
Skylake system
NUMA socket 0 NUMA socket 1
NUMA socket 0 50.26GB/s and 88.1ns 33.36GB/s and 147.4ns
NUMA socket 1 34.06GB/s and 146.7ns 50.12GB/s and 88.3ns
TABLE VII: NUMA local and cross-socket memory band-
width and latency on Skylake
D. Dual-socket Performance
Thus far, we have only considered the performance of
SpGEMM algorithms on a single socket of Skylake and
Power9 processors. Fig. 14 shows the performance of
SpGEMM algorithms on dual socket Skylake processor. PB-
SpGEMM still runs faster for ER matrices, but runs slightly
slower than heap algorithm for RMAT matrices. This lower-
than-expected performance of PB-SpGEMM on NUMA sys-
tems is due to inter-socket communication contentions. If a
bin is allocated on socket-1 in the expand phase and sorted
by a thread from socket-2, the performance of PB-SpGEMM
depends on cross-socket memory bandwidth. We checked
cross-socket memory bandwidth empirically by placing data in
one socket and accessing from another socket in a STREAM
copy kernel. Table VII shows the local and remote access
bandwidth and latency. Memory latency was measured by Intel
Memory Latency Checker. We observe that cross-socket access
is much slower than local access. Hence, PB-SpGEMM’s
performance relies saturating the memory bandwidth, it is
affected by lower cross-socket bandwidth. Note that column
SpGEMM algorithms are not significantly affected by cross-
socket bandwidth because they generate one column at a time,
where the active column usually stays in cache.
In the Master’s thesis of the first author, we tried to improve
the dual socket performance by partitioning A into two n2 ×
n matrices and multiply each part with B independently in
two sockets. This partitioned PB-SpGEMM partially mitigates
the cross-socket bandwidth problem, but it does not perform
uniformly well for all matrices due to the additional cost of
reading B more than once. We did not cite the thesis as per
the double blind policy.
VI. CONCLUSIONS
With the rise of sparse and irregular data, SpGEMM has
emerged as an important operation in many scientific do-
mains. Over the past decade, the state-of-the-art of parallel
SpGEMM algorithms has progressed significantly. However,
understanding the performance of SpGEMM algorithms re-
mains a challenge without an established performance model.
Relying on the fact that SpGEMM is a bandwidth-bound
operation, we used the Roofline model to develop bounds for
SpGEMM algorithms based on column-by-column merging
and the expand-sort-compress strategy. We conclude the paper
with the following key findings:
1) We can estimate the arithmetic intensity (AI) of an
SpGEMM algorithm based on the compression factor of
the multiplication and number of bytes needed to store each
nonzero.
2) The attainable performance of an algorithm is (AI ∗ β),
where β is the memory bandwidth. This peak perfor-
mance can only be attained if the algorithm saturates the
bandwidth. We showed that existing column SpGEMM
algorithms do not attain peak performance according to
the Roofline model because of irregular data accesses and
underutilization of cache lines.
3) We develop a new algorithm based on outer product of
matrices. This algorithm called PB-SpGEMM uses prop-
agation blocking to group multiplied tuples into bins and
then sort and merge tuples independently in each bin.
4) PB-SpGEMM approximately saturates the memory band-
width in all of its three phases and attains performance as
predicted by the Roofline model.
5) On a single socket, PB-SpGEMM performs better than the
best column SpGEMM algorithms for multiplications with
compression factors less than four.
6) For multiplications with compression factors greater than
four, HashSpGEMM is the best performer.
REFERENCES
[1] A. Bulu and J. Gilbert, “On the representation and multiplication
of hypersparse matrices,” in 2008 IEEE International Symposium on
Parallel and Distributed Processing, 04 2008, pp. 1–11.
[2] A. Azad, A. Buluc¸, and J. Gilbert, “Parallel triangle counting and
enumeration using matrix algebra,” in IPDPSW, 2015.
[3] J. R. Gilbert, S. Reinhardt, and V. B. Shah, “High performance graph
algorithms from parallel sparse matrices,” in PARA, 2007, pp. 260–269.
[4] H. Kaplan, M. Sharir, and E. Verbin, “Colored intersection searching
via sparse rectangular matrix multiplication,” in Proceedings of
the Twenty-second Annual Symposium on Computational Geometry,
ser. SCG ’06. ACM, 2006, pp. 52–60. [Online]. Available:
http://doi.acm.org/10.1145/1137856.1137866
[5] R. Yuster and U. Zwick, “Detecting short directed cycles using rectan-
gular matrix multiplication and dynamic programming,” in Proceedings
of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms,
2004.
[6] G. Ballard, C. Siefert, and J. Hu, “Reducing communication costs for
sparse matrix multiplication within algebraic multigrid,” SIAM Journal
on Scientific Computing, vol. 38, no. 3, pp. C203–C231, 2016.
[7] X. Feng, Y. Xie, M. Song, W. Yu, and J. Tang, “Fast randomized pca
for sparse data,” in ACML, 2018.
[8] Y. Jin and J. Ja´Ja´, “A high performance implementation of spectral
clustering on cpu-gpu platforms,” 2016 IEEE International Parallel and
Distributed Processing Symposium Workshops (IPDPSW), pp. 825–834,
2016.
[9] A. Azad, G. A. Pavlopoulos, C. A. Ouzounis, N. C. Kyrpides, and
A. Buluc¸, “HipMCL: A high-performance parallel implementation of
the markov clustering algorithm for large-scale networks,” Nucleic acids
research, 2018.
[10] A. Griewank and U. Naumann, “Accumulating jacobians as chained
sparse matrix products,” Math. Program., vol. 95, pp. 555–571, 03 2003.
[11] R. R. Amossen and R. Pagh, “Faster join-projects and sparse matrix
multiplications,” in Proceedings of the 12th International Conference
on Database Theory, 2009.
[12] Y. Nagasaka, S. Matsuoka, A. Azad, and A. Buluc¸, “Performance
optimization, modeling and analysis of sparse matrix-matrix products
on multi-core and many-core processors,” Parallel Computing, vol. 90,
p. 102545, 2019.
[13] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful
visual performance model for multicore architectures,” Communications
of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
[14] N. Bell, S. Dalton, and L. N. Olson, “Exposing Fine-Grained Parallelism
in Algebraic Multigrid Methods,” SIAM Journal on Scientific Comput-
ing, vol. 34, no. 4, pp. C123–C152, 2012.
[15] S. Dalton, L. Olson, and N. Bell, “Optimizing sparse matrix—matrix
multiplication for the GPU,” ACM Transactions on Mathematical Soft-
ware (TOMS), vol. 41, no. 4, p. 25, 2015.
[16] S. Beamer, K. Asanovic´, and D. Patterson, “Reducing pagerank commu-
nication via propagation blocking,” in 2017 IEEE International Parallel
and Distributed Processing Symposium (IPDPS). IEEE, 2017, pp. 820–
831.
[17] M. M. Ozdal, “Improving efficiency of parallel vertex-centric algorithms
for irregular graphs,” IEEE Transactions on Parallel and Distributed
Systems, vol. 30, no. 10, pp. 2265–2282, 2019.
[18] J. Liu, X. He, W. Liu, and G. Tan, “Register-aware optimizations for
parallel sparse matrix–matrix multiplication,” International Journal of
Parallel Programming, vol. 47, no. 3, pp. 403–417, 2019.
[19] D. Langr and P. Tvrdik, “Evaluation criteria for sparse matrix storage
formats,” IEEE Transactions on parallel and distributed systems, vol. 27,
no. 2, pp. 428–440, 2015.
[20] M. M. A. Patwary, N. R. Satish, N. Sundaram, J. Park, M. J. Anderson,
S. G. Vadlamudi, D. Das, S. G. Pudov, V. O. Pirogov, and P. Dubey,
“Parallel efficient sparse matrix-matrix multiplication on multicore plat-
forms,” in ISC. Springer, 2015, pp. 48–57.
[21] M. Deveci, C. Trott, and S. Rajamanickam, “Performance-portable
sparse matrix-matrix multiplication for many-core architectures,” in
IPDPSW. IEEE, 2017, pp. 693–702.
[22] A. Azad, G. Ballard, A. Buluc¸, J. Demmel, L. Grigori, O. Schwartz,
S. Toledo, and S. Williams, “Exploiting multiple levels of parallelism
in sparse matrix-matrix multiplication,” SIAM Journal on Scientific
Computing, vol. 38, no. 6, pp. C624–C651, 2016.
[23] A. Buluc and J. R. Gilbert, “On the representation and multiplication
of hypersparse matrices,” in 2008 IEEE International Symposium on
Parallel and Distributed Processing. IEEE, 2008, pp. 1–11.
[24] S. Pal, J. Beaumont, D. Park, A. Amarnath, S. Feng, C. Chakrabarti,
H. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, “Outerspace: An outer
product based sparse matrix multiplication accelerator,” in 2018 IEEE
International Symposium on High Performance Computer Architecture
(HPCA), 2018.
[25] J. R. Gilbert, C. Moler, and R. Schreiber, “Sparse matrices in MATLAB:
Design and implementation,” SIAM Journal on Matrix Analysis and
Applications, vol. 13, no. 1, pp. 333–356, 1992.
[26] F. G. Gustavson, “Two fast algorithms for sparse matrices: Multiplication
and permuted transposition,” ACM TOMS, vol. 4, no. 3, pp. 250–269,
1978.
[27] Y. Nagasaka, A. Nukada, and S. Matsuoka, “High-performance and
memory-saving sparse general matrix-matrix multiplication for NVIDIA
Pascal GPU,” in ICPP. IEEE, 2017, pp. 101–110.
[28] G. Ballard, A. Buluc, J. Demmel, L. Grigori, B. Lipshitz, O. Schwartz,
and S. Toledo, “Communication optimal parallel multiplication of sparse
random matrices,” in Proceedings of the twenty-fifth annual ACM
symposium on Parallelism in algorithms and architectures, 2013, pp.
222–231.
[29] P. M. McIlroy, K. Bostic, and M. D. McIlroy, “Engineering radix sort,”
Computing systems, vol. 6, no. 1, pp. 5–27, 1993.
[30] J. D. McCalpin, “Stream: Sustainable memory bandwidth in high
performance computers,” University of Virginia, Tech. Rep., 1991-2007.
[31] T. Davis, “The University of Florida Sparse Matrix Collection.”
[Online]. Available: http://www.cise.ufl.edu/research/sparse/matrices
