A Systematic Survey of General Sparse Matrix-Matrix Multiplication by Gao, Jianhua et al.
A Systematic Survey of General Sparse
Matrix-Matrix Multiplication
Jianhua Gao
School of Computer Science and Technology
Beijing Institute of Technology
Beijing, China
gjh@bit.edu.cn
Weixing Ji
School of Computer Science and Technology
Beijing Institute of Technology
Beijing, China
jwx@bit.edu.cn
Zhaonian Tan
School of Computer Science and Technology
Beijing Institute of Technology
Beijing, China
tan zn@163.com
Yueyan Zhao
School of Information and Electronics
Beijing Institute of Technology
Beijing, China
1120161354@bit.edu.cn
Abstract—SpGEMM (General Sparse Matrix-Matrix Multipli-
cation) has attracted much attention from researchers in fields
of multigrid methods and graph analysis. Many optimization
techniques have been developed for certain application fields and
computing architecture over the decades. The objective of this
paper is to provide a structured and comprehensive overview
of the research on SpGEMM. Existing optimization techniques
have been grouped into different categories based on their target
problems and architectures. Covered topics include SpGEMM
applications, size prediction of result matrix, matrix partitioning
and load balancing, result accumulating, and target architecture-
oriented optimization. The rationales of different algorithms in
each category are analyzed, and a wide range of SpGEMM algo-
rithms are summarized. This survey sufficiently reveals the latest
progress and research status of SpGEMM optimization from
1977 to 2019. More specifically, an experimentally comparative
study of existing implementations on CPU and GPU is presented.
Based on our findings, we highlight future research directions and
how future studies can leverage our findings to encourage better
design and implementation.
Index Terms—SpGEMM, Parallel Optimization, Sparse Matrix
Partitioning, Parallel Architecture.
I. INTRODUCTION
General sparse matrix-matrix multiplication, abbreviated as
SpGEMM, is a fundamental and expensive computational ker-
nel in numerous scientific computing applications and graph
algorithms, such as algebraic multigrid solvers [1] [2] [3]
[4], triangle counting [5] [6] [7] [8], multi-source breadth-
first searching [9] [10] [11], the shortest path finding [12],
colored intersecting [13] [14], and subgraphs [15] [16]. Hence
the optimization of SpGEMM has the potential to impact a
wide variety of applications.
Given two input sparse matrices A and B, SpGEMM
computes a sparse matrix C such that C = A × B, where
A ∈ Rp×q, B ∈ Rq×r, C ∈ Rp×r, and the operator ×
represents the general matrix multiplication. Let aij denote
the element in the ith row and the jth column of matrix A,
and we write ai∗ to denote the vector comprising of the entries
in the ith row of A, also a∗j to denote the vector comprising
of the entries in the jth column of A. In the general matrix-
matrix multiplication (GEMM), the ith row of the result matrix
C can be generated by
ci∗ = (ai∗ · b∗1, ai∗ · b∗2, ..., ai∗ · b∗r), (1)
where the operation · represents the dot product of two vectors
[17].
Compared with GEMM, SpGEMM requires to exploit the
sparsity of the two input matrices and the result matrix because
of huge memory and computational cost for the zero entries
with dense formats. Most of the research on SpGEMM is
proposed based on the row-wise SpGEMM algorithm given
by Gustavson [18]. The i-th row of the result matrix C is
given by ci∗ =
∑
j aij ∗ bj∗, where the operator ∗ represents
the scalar multiplication, and the computation only occurs in
non-zero entries of sparse matrices A and B. The pseudocode
for Gustavson’s SpGEMM algorithm is presented in Algorithm
1.
The goal of this survey paper is to provide a structured
and broad overview of extensive research on SpGEMM design
and implementations spanning multiple application domains
and research areas. We not only strive to provide an overview
of existing algorithms, data structures, and libraries available
but also try to uncover the design decisions behind the
various implementation by the designers. It gives an in-depth
presentation of the many algorithms and libraries available for
computing SpGEMM.
Sparse matrix has been a hot topic of many surveys and
reviews. Duff et al. [19] provide an extensive survey of
sparse matrix research developed before the year of 1976. A
broad review of the sparse matrix and vector multiplication is
presented by Grossman et al. [20]. Over the past decades,
SpGEMM attracts intensive attention in a wide range of
application fields, and many new designs and optimizations are
proposed targeting different architectures. The development of
ar
X
iv
:2
00
2.
11
27
3v
1 
 [c
s.D
C]
  2
6 F
eb
 20
20
Algorithm 1: Pseudocode for the Gustavson’s SpGEMM
algorithm.
Input: A ∈ Rp×q, B ∈ Rq×r
Output: C ∈ Rp×r
1 for ai∗ ∈ A do
2 for aij ∈ ai∗ and aij is non-zero do
3 for bjk ∈ bj∗ and bjk is non-zero do
4 value← aij ∗ bjk;
5 if cik /∈ ci∗ then
6 cik ← 0;
7 end
8 cik ← cik + value;
9 end
10 end
11 end
multi/many-core architectures and parallel programming have
brought new opportunities for the acceleration of SpGEMM,
and also challenging problems emerged need to be considered
intensively. To the best of our knowledge, this is the first
survey paper that overviews the developments of SpGEMM
over the decades. The goal of this survey is to present a
working knowledge of the underlying theory and practice of
SpGEMM for solving large-scale scientific problems, and to
provide an overview of the algorithms, data structures, and
libraries available to solve these problems, so that the readers
can both understand the methods and know how to use them
best. This survey covers the application domains, challeng-
ing problems, architecture-oriented optimization techniques,
and performance evaluation of available implementations for
SpGEMM. We adopt the categories of challenging problems
given in [21] and summarize existing techniques about size
predicting of the result matrix, load balancing, and intermedi-
ate results merging.
The main factors that differentiate our survey from the
existing related work are:
• the systematic process for collecting the data;
• a classified list of studies on both applications and opti-
mization techniques;
• analysis of the collected studies in the view of both
algorithm design and architecture optimization;
• an detailed evaluation of 6 library implementations.
The rest of the paper is organized as follows. Section
2 gives a brief overview of this survey. Section 3 intro-
duces several classical applications of SpGEMM in detail.
In section 4, the related work on three pivotal problems of
SpGEMM computation is presented. In section 5, the related
work of SpGEMM optimization based on different platforms
is introduced in detail. Besides, we perform a performance
evaluation employing different SpGEMM implementations in
section 6. A discussion about the challenges and future work
of SpGEMM is presented in section 7, and followed by our
conclusion in section 8.
II. LITERATURE OVERVIEW
A. Research Questions
This study follows the guidelines of the systematic literature
review proposed by Kitchenham [22], which is initially used in
medical science but later gained interest in other fields as well.
According to the three main phases: planning, conducting,
and reporting, we have formulated the following research
questions:
RQ1: What are the applications of SpGEMM, and how are
they formulated?
RQ2: What is the current state of SpGEMM research?
RQ3: How do the SpGEMM implementations on the off-
the-shelf processors perform?
RQ4: What challenges could be inferred from the current
research effort that will inform future research?
Regarding RQ1, the study will address how SpGEMM is
used in these applications, including the sub-questions like
what are the typical applications of SpGEMM? Is it used in
an iterative algorithm? Do the matrices keep constant in the
computation? This will give an insight into the requirements of
SpGEMM in solving real problems. In RQ2, the study looks
at existing techniques that were proposed in recent decades.
Regarding RQ3, we will perform some evaluations for both
CPU and GPU to have a general idea about the performance
of these implementations. Finally, in RQ4, we will summarize
the challenges and future research directions according to our
investigative results.
B. Search Procedure
There have been some other reviews presented in the related
work section [21]. However, they cover a more limited number
of studies. We evaluate and interpret all high-quality studies
related to SpGEMM in this study.
To have a broad coverage, we perform a systematic literature
survey by indexing papers from several popular digital libraries
(IEEE Explore Digital Library, ACM Digital Library, Elsevier
ScienceDirect, Springer Digital Library, Google Scholar, Web
of Science, DBLP) using the keywords “SpGEMM”, “Sparse
matrix”, “sparse matrix multiplication”, “sparse matrix-matrix
multiplication”. It is an iterative process, and the keywords
are fine-tuned according to the returned results step by step.
Just like other literature survey work, we try to broaden the
search as much possible while maintaining a manageable result
set. In order to do that, only refereed journal and conference
publications were included. Then, we read the titles and
abstracts of these papers and finally include 87 papers of more
than 142 papers we found in this survey.
C. Search Result
It is difficult to give a sufficient and systematic taxonomy
for classifying SpGEMM research because the same topic may
have different meanings in different contexts. For example,
load balance of SpGEMM in distributed computing, shared
memory multi-core computing, single GPU computing, multi-
GPU computing, CPU+GPU computing. We select several
TABLE I: Categories of selected papers.
Category Papers
Application
Multigrid solvers [1] [2] [3] [4] [23] [24] [25]
Triangle counting [5] [6] [7] [8] [26] [27] [28]
MBFS [9] [10] [11] [29]
Other [12] [13] [14] [15] [16] [30] [31]
[32] [33] [34] [35] [36] [37] [38]
[39] [40] [41] [42]
Optimization
Algorithm [18] [21] [43] [44] [45] [46] [47]
[14] [7] [48] [49] [50] [2] [51] [52]
[53] [42] [54] [3] [55] [56] [57]
[58] [59] [60] [61] [62] [63] [64]
[16] [65] [66] [67] [68] [69] [25]
[70] [71] [72] [73] [74] [75] [26]
[41] [76] [?] [77]
Many-core CPU [69] [51] [25] [59] [58] [68]
GPU [21] [57] [47] [58] [68]
Distributed [78] [42] [79] [67] [46]
Heterogeneous [17] [53] [80] [81]
FPGA [82] [83]
Programming models [84] [85] [78] [54]
Surveys [19] [20] [22]
Benchmark [86]
topics that are frequently covered by existing papers, and
present the correlation between papers and topics in Table I.
III. TYPICAL APPLICATION
SpGEMM is mainly used in linear algebra and graph
analytics, and among which algebraic multigrid solver, triangle
counting, and multi-source breadth-first search are three most
classical applications of SpGEMM. In the following subsec-
tions, we will introduce the procedures of these three applica-
tions and how the SpGEMM is applied in these applications
in detail.
A. Algebraic Multigrid Solvers
Algebraic Multigrid (AMG) is a multi-grid method de-
veloped by utilizing some important principles and concepts
of Geometric multigrid (GMG), and it requires no explicit
geometric knowledge of the problem. AMG method itera-
tively solves large and sparse linear systems Ax = b by
automatically constructing a hierarchy of grids and inter-grid
transfer operators. It achieves optimality by employing two
complementary processes: smoothing and coarse-grid correc-
tion. The smoother is generally fixed to be a simple point-
wise method such as Gauss-Seidel or weighted Jacobi for
eliminating the high-frequency error. Coarse-grid correction
involves three phases, including projecting error to a coarser
grid through restriction operator, solving a coarse-grid system
of equations, and then projecting the solution back to the fine
grid through interpolation (also called prolongation) operator
[23] [24].
Algorithm 2 describes the construction of several important
operators in AMG. First, the strength operation identifies the
strongly connected edges in the matrix Al at level l to construct
strength-of-connection matrix Sl (line 3 of Algorithm 2). Next,
the interpolation operator Pl is constructed to transfer informa-
tion from level l+1 to level l (line 4 of Algorithm 2). Finally,
Algorithm 2: Construction of several important operators
in AMG [2].
Input: A,B
Output: A1, ..., AL, P0, ..., PL−1
1 A0 ← A,B0 ← B;
2 for l = 0, ..., L− 1 do
// Strength-of-connection of edges
3 Sl =strength(Al);
// Construct interpolation and
restriction
4 Pl =interpolation(Al, Sl);
// Galerkin product (triple sparse
matrices multiplication)
5 Al+1 = P
T
l AlPl;
6 end
the coarse-grid operator is constructed by a Galerkin triple-
matrix product (line 5 of Algorithm 2), which is implemented
with two sparse matrix-matrix multiplications [1] [2] [3] [4]
[25].
These matrix-matrix multiplications, taking more than 80%
of the total construction time, are the most expensive com-
ponents. Moreover, the construction of operators (and thus
SpGEMM) is a considerable part of the overall execution since
it may occur at every time step (for transient problems) or even
multiple times per time step (for non-linear problems), making
it important to optimize SpGEMM [25].
B. Triangle Counting
Triangle counting is one of the major kernels for graph
analysis and constitutes the core of several critical graph
algorithms, such as triangle enumeration [8] [7] [26], subgraph
isomorphism [30] [31], and k-truss computation [31] [32].
The problem of triangle counting in an undirected graph is
to find the number of triangles in the graph. Given a graph
G = (V,E), it can be represented by a square sparse matrix
A of dimension n × n, where n is the number of vertexes
in graph G: n = |V |, and either the matrix entry aij or
aji represents the edge (i, j) [5]. Azad et al. [8] describe
an adjacency matrix-based triangle counting method based
on Cohen’s MapReduce algorithm [6]. First, they construct
an adjacency matrix A from graph G, with its rows ordered
with the increasing number of non-zeros they contain (equal
to the degree of the corresponding vertex). Assuming that
sparse matrix L is the lower triangular part of A holding edges
(i, j) where i > j, and U is the upper triangular part of A
holding edges (i, j) where i < j, then A = L + U . Next,
all the wedges of (i, j, k) where j is the smallest numbered
vertex are counted via the SpGEMM operation: B = L× U ,
and B(i, k) represents the count for (i, j, k) wedges. Finally,
finding the close wedges by doing element-wise multiplication
with the original matrix: C = A. ∗ B. Algorithm 3 presents
the pseudocode for the serial adjacency matrix-based triangle
counting algorithm, and an example of triangle counting is
presented in Fig. 1.
Algorithm 3: Pseudocode for the serial adjacency matrix-
based triangle counting algorithm [6] [8].
Input: Ordered adjacency matrix A of graph G
Output: Number Nt of triangles in graph G
1 (L,U)←Split(A);
2 B← SpGEMM(L,U);
3 C ←ElementWiseSpGEMM(A,B);
// Sum all entries of matrix C
4 Nt ← Sumi,j(cij)/2;
Besides, Wolf et al. [27] develop a GraphBLAS like
approach for triangle enumeration using an overloaded
SpGEMM of the graph adjacency matrix (A) and incidence
matrix (H): C = A×H , which can also be used for triangle
counting. Although this method is useful for more complex
graph calculations, it is less efficient than the adjacent matrix-
based method. Furthermore, a variant of the adjacency matrix-
based triangle counting algorithm is given by Wolf et al. in
[7], which uses the lower triangle portion of the adjacency
matrix to count triangles.
a
b d
c
G Two triangles
a
b
c
a
d
c
Fig. 1: An example of triangle counting. Graph G includes
two triangles: (a, b, c) and (a, d, c).
Collected from social media, sensor feeds, the World Wide
Web, biological and genetic fields, co-author networks, and
citations, increasing large amounts of data are being analyzed
with graph analytics to reveal the complicated underlying
relationship between different data entities [28] [31]. One of
the fundamental techniques for graph analytics is subgraph
finding, and the triangle is one of the most important sub-
graphs. Moreover, the number of triangles is an important
metric in many network analysis applications, including link
recommendation [33], spam detection [34], social network
analysis [34], and dense neighborhood graph discovery [35].
In the adjacency matrix-based triangle counting algorithm, the
SpGEMM operation B = L×U costs d2n operations, where
d is the average degree and n is the number of vertices [8].
Considering that the computational complexity increases with
the scale of the problem, the optimization of triangle counting
(thus SpGEMM) becomes extremely considerable.
C. Multi-Source Breadth-first Search
Breadth-first search (BFS) is a key and fundamental sub-
routine in many graph analysis algorithms, such as finding
connectivity, finding the shortest path, and finding k-hop
neighbors [36] [37].
The goal of the BFS is to traverse a graph from a given
source vertex, which can be performed by a sparse matrix-
vector multiplication (SpMV) between the adjacency matrix
of a graph and a sparse vector [29].
Let us take an example in a sparse unweighted graph G =
(V,E) represented by sparse matrix A. Assume n = |V |, then
the size of A is n×n. Let x be a sparse vector with x(i) = 1
and all other elements being zero, then the 1-hop vertexes
from source vertex i, denoted as Vi1, can be derived by the
SpMV operation: y = Ax. Repeat the operation from Vi1, we
can receive 1-hop vertexes from Vi1 or 2-hop vertexes from i.
Finally, a complete BFS for graph G from vertex i is yielded
[9].
In contrast, Multi-Source BFS (MS-BFS) executes multiple
independent BFSs concurrently on the same graph from mul-
tiple source vertexes. It can be performed by SpGEMM. A
simple example for MS-BFS is presented in Fig. 2 [10]. Let
{1, 2} be two source vertexes, A is the adjacency matrix of
the graph, and X = (x1, x2) is a rectangular matrix repre-
senting the source vertexes, where x1 and x2 are two sparse
column vectors with x1(1) = 1 and x2(2) = 1 respectively
and all other elements being zero. Let Bi = (bi1, b
i
2), then
B1 = A × X . Obviously, B1 is a sparse rectangular matrix
representing the 1-hop vertexes (denoted as V1) from source
vertexes {1, 2}. Repeat the SpGEMM operation of adjacency
matrix and V1, we can derive the 1-hop vertexes from V1, also
2-hop vertexes from {1, 2} by B2 = A×B1. Finally, we get
the breadth-first traversal results from vertices 1 and 2, which
are {1, 3, 4, 2, 5, 6} and {2, 3, 4, 1, 5, 6}, respectively.
Fig. 2: An example of BFS. Vertexes filling in horizontal lines
and vertical lines are being visited by the BFS starting from
vertex {1} and {2} respectively, and vertexes filling in left
oblique lines have been visited by the BFS starting from vertex
{1} or {2} [10].
Many applications require hundreds or even billion of BFSs
over the same graph. Examples of such applications include
calculating graph centrality, enumerating the neighborhoods
for all vertexes, and solving the all-pairs shortest distance
problem. Comparing with running parallel and distributed
BFSs sequentially, executing multiple BFSs concurrently in
a single core allows us to share the computation between
different BFSs without paying the synchronization cost [9].
As the most expensive part of MS-BFS, SpGEMM is worthy
of more attention to be paid.
In addition, SpGEMM is also one of the most important
components for graph contraction [38], graph matching [39],
graph coloring [13] [14], all pairs shortest path [12], sub-graph
[15] [16], cycle detection or counting [40], and molecular
dynamics [41] [42]. Because of limited space, we will not
introduce these applications and corresponding SpGEMM
formulation in detail.
IV. METHODOLOGY
The development of multi/many-core architecture and par-
allel programming has brought new opportunities for the
acceleration of SpGEMM, and three challenging problems
emerged need to be considered intensively [21], including size
predicting of result matrix, matrix partitioning and load bal-
ancing, and intermediate result accumulating. In the following
subsections, we will introduce related studies in these three
challenges in detail.
A. Size Prediction
The first problem is the unknown number of non-zeros in the
result matrix before real execution. The number of non-zeros
of a sparse matrix usually dominates its memory overhead,
while in SpGEMM, the sparsity of the result matrix is always
unknown in advance. Therefore, precise memory allocation
for the result matrix is impossible before real execution. The
research focusing on the size prediction of the result matrix
can be mainly classified as the following four approaches.
1) Precise Method: The SpGEMM algorithms, which use
precise method to solve size prediction problem, usually
consist of two phases: symbolic and numeric phase. In the
symbolic phase, the precise number of non-zeros in result
matrix C is computed, while the real values are calculated
in the numeric phase.
The implementations of SpGEMM in KokkosKernels [87],
cuSPARSE [88], MKL [89] and RMerge [43] are typical rep-
resentatives of this method. Besides, Demouth [44], Nagasaka
et al. [45] and Demirci et al. [46] also adopt this approach.
In [47], Deveci et al. design a graph compression technique,
similar to the color compression idea [14], to compress matrix
B by packing its columns as bits in order to speedup the
symbolic phase.
Accurate pre-computation of the number of non-zeros not
only saves the memory usage but also enables the sparse
structure of C to be reused for different multiplies with the
same structure of input matrices [47]. Moreover, it presents
significant benefits in graph analytics, because most of them
work only on the symbolic structure, no numeric phase [7].
However, the calculation of two phases also means that it
needs to iterate through the input matrices twice, indicating
the higher computation overhead than the following methods.
2) Probabilistic Method: The second method estimates the
sparse structure of the result matrix based on random sampling
and probability analysis on the input matrices, which is known
as the probabilistic method.
Cohen [48] thinks that the problem of estimating the number
of non-zeros in result matrix can be converted to estimate
the size of reachability sets in a directed graph. Then he
presents a Monte Carlo-based algorithm that estimates the size
of reachability sets with high confidence and small relative
error. The non-zero structure not only can guide efficient
memory allocation for the output matrix but also can be
used to determine the optimal order of multiplications when
computing a chain product of three or more matrices, where
the optimal order means the minimization of a total number
of operations. Therefore, Cohen [48] proposes a method that
estimates the non-zero structure of the result matrix in time
linear based on the previous reachability-set size estimation
algorithm. For any constant error probability δ > 0 and any
the tolerated relative error  > 0, it can compute a 1 ± 
approximation of z of result matrix in time O(n/2), where
z denotes the number of non-zeros in the result matrix. Then,
Amossen et al. [49] improve this method to expected time
O(n) for  > 4/ 4√z. Pagh et al. also present a method
to estimate column/row sizes of the result matrix based on
the non-zeros of input matrices, and a detailed description is
referred to the Lemma 1 in [50].
This type of method does not require SpGEMM-like com-
putation and then yields a low time cost for estimation.
However, considering that this type of method usually gives
an approximate estimation, hence an extra memory allocation
must be performed when the estimation fails.
3) Upper Bound Method: The third method computes an
upper bound of the number of non-zeros in the result matrix
and allocates corresponding memory space, which is known as
the upper bound method. The most commonly used method of
calculating upper bound for C is to count the number of non-
zeros of the corresponding row in matrix B for each non-zero
entry in matrix A. Assume that input matrices A and B are
compressed with CSR format, and I and J are the row pointer
array and column index array, respectively. Then Algorithm 4
gives the pseudocode for computing the upper bound of the
number of non-zeros in each row of the result matrix, which
is stored in array U = {u0, u1, . . . , up−1}.
Bell et al. [2] propose a classical ESC (Expansion, Sorting,
Compression) method to accelerate AMG method, and it
allocates the memory space of the upper-bounded size for
result matrix, which is applied in CUSP [90]. Nagasaka et al.
[51] also count a maximum of scalar non-zero multiplications
per row of the result matrix. Then each thread allocates the
hash table based on the maximum and reuses the hash table
throughout the computation by re-initializing for each row.
This type of method is efficient and straightforward to
implement, while it does not consider the merging of the
values with the same index items in the calculation process,
and usually leads to over-allocation. Especially for some
particular matrices, the estimated size with the upper bound
Algorithm 4: Pseudocode for computing the upper bound
of the number of non-zeros in each row of the result matrix,
and all sparse matrices are compressed with CSR format.
Input: IA, JA, IB .
Output: U = {u0, u1, . . . , up−1}.
1 for i = 0, p− 1 do
2 ui = 0;
3 for k = IA(i), IA(i+ 1)− 1 do
4 j = JA(k);
5 ui+ = IB(j + 1)− IB(j);
6 end
7 end
method and precise size have large difference.
4) Progressive Method: The fourth method, known as the
progressive method, dynamically allocates memory as needed.
It first allocates memory of proper size and then starts matrix-
matrix multiplication. The reallocation of larger memory is
required if the current memory is insufficient.
The implementation of SpGEMM in Matlab [52] is the
representative of this type of method. It first guesses the size
of the result matrix and allocates a new memory space that
is larger by a constant factor (typically 1.5) than the current
space if more space is required at some point. Meanwhile, the
columns which are already computed are required to be copied
into the new memory.
If this type of method estimates successfully in first, then
it has no additional time overhead in SpGEMM. However,
if the first estimation fails, this type of method is extremely
inefficient, especially in GPU. Moreover, if the SpGEMM
algorithm is implemented in thread-level parallel, and threads
are using their local storage, then it must take time to aggregate
all the independent storage into the result matrix.
Besides, the hybrid methods are also proposed to cope with
the problem. Weifeng Liu and Brian Vinter propose a hybrid
method for result matrix pre-allocation [21]. They calculate
the upper bound of the number of non-zeros in each row
of the result matrix and divide all rows into multiple groups
according to the number of non-zeros. Then they allocate space
of the upper-bound size for the short rows and progressively
allocate space for the long rows.
B. Matrix Partition and Load Balancing
The second problem is matrix partitioning and load balanc-
ing. With the development of multi/many-core architecture,
reasonable partitioning of matrices and balanced assigning of
tasks among multiple cores or nodes are critical for making
full use of hardware resources.
1) Formulations of SpGEMM: It is closely related between
matrix partitioning and SpGEMM formulations. There are four
formulations of SpGEMM can be devised.
1) Outer-product formulation [53] [42] [54] [3] [55] [56].
This formulation is based on the column-wise and row-
wise partitioning of input matrices A and B, respectively.
Result matrix C is obtained by summing the outer product
of each column a∗i of A and corresponding row bi∗ of B,
i.e.
C =
q∑
i=1
a∗i ⊗ bi∗, (2)
where the operator ⊗ represents the outer product of
vectors.
2) Inner-product formulation [53] [55] [45] [56]. This
formulation is based on the row-wise and column-wise
partitioning of input matrices A and B, respectively. Result
matrix C consists of the inner product of each row of
matrix A and each column of matrix B, i.e.
cij =
∑
k∈I(i,j)
aik ∗ bkj , (3)
where i = 1, 2, ..., p, j = 1, 2, ..., r, and I(i, j) denotes the
set of indexes k such that both the entries aik and bkj are
non-zero.
3) Row-by-row-product formulation [53] [3] [56] [46]. This
formulation is based on the row-wise partitioning of input
matrices. Each row ci∗ of result matrix C is obtained by
summing the intermediate multiplication results of each
non-zero entry aik of ai∗ and corresponding row bk∗of B.
i.e.
ci∗ =
∑
k∈Ii(A)
aik ∗ bk∗, (4)
where Ii(A) denotes the set of column indexes k of the
non-zero entries in the i-th row of A.
4) Column-by-column-product formulation [3] [56]. This
formulation is based on the column-wise partitioning of
input matrices, which is similar to the row-by-row-product
formulation. Each column c∗j of result matrix C is ob-
tained by summing the intermediate multiplication results
of each non-zero entry bkj of b∗j and corresponding
column a∗k, i.e.
c∗j =
∑
k∈Ij(B)
a∗k ∗ bkj , (5)
where Ij(B) denotes the set of row indexes k of the non-
zero entries in the j-th column of B.
2) Block Partitioning: Four formulations of SpGEMM rep-
resent four different 1D matrix partitioning schemes: column-
by-row, row-by-column, row-by-row, and column-by-column.
However, considering the irregular distribution of non-zero
entries in sparse matrices, plain slice-based matrix parti-
tioning usually results in load unbalancing. Therefore, some
researchers pay attention to load balancing for specific archi-
tectures.
Weifeng Liu and Brian Vinter propose a method that
guarantees a balanced load through assigning rows with a
different number of non-zeros to the different number of bins
[21] for GPU platform. In order to balance the load and
efficiently exploit GPU resources, Nagasaka et al. [45] divide
the rows into groups in the symbolic phase by the number of
intermediate products, which is the upper bound of the number
of non-zeros in output matrix. In the numeric phase, they
divide the rows into groups by the number of non-zeros in each
row. Winter et al. [57] also design a balanced work assigning
scheme for GPU. Each thread block is assigned an equal
number of non-zeros of A while ignoring the row boundaries
of a matrix. Deveci et al. [58] propose two chunking methods
for KNL and GPU. The difference lies in partitioning A in the
column-wise method for KNL, but for GPU partitioning A in
the 2D method because of its multi-level memory property.
Deveci et al. [68] also utilize row-by-row matrix partitioning
but with different tasks assigning scheme (thread-sequential,
team-sequential, thread-parallel, and thread-flat-parallel) for
high performance computing architectures. Chen et al. [59]
design a three-level partitioning scheme to improve the load
balance of SpGEMM on Sunway TaihuLight supercomputer.
The first level represents that A is partitioned into NP sub-
matrices, where NP is the number of core groups applied
in the computation. In the second level, B is partitioned into
NC sub-matrices, where NC is the number of computing
processing element (CPE) cores applied in each core group.
Sub-A and sub-B are further partitioned into several sets in
the third level.
Kurt et al. [60] propose an adaptive task partition scheme
across virtual warps of a thread block in GPU. For example,
for a thread block of size 128 and a virtual warp of size 4, each
thread block will preside over four rows of A simultaneously.
Then 32(=128/4) threads are assigned cyclically to deal with
the corresponding entries of B for each row of A. They
also give a systematic exploration and analysis of upper
data movement requirements for SpGEMM for the first time.
The work distribution scheme is similar to the row merging
method by GPU sub-warps in [43] [61]. Ballard et al. [3]
present a detailed theoretical prediction for communication
cost for SpMM (sparse-dense matrix multiplication) operation
within the Galerkin triple product of smoothed aggregation
AMG methods. They also conclude that the row-by-row-
product algorithm is the best 1D method for the first two
multiplications and that the outer-product algorithm is the best
1D method for the third multiplication.
Van de Geijn et al. [62] propose a 2D block data de-
composition and mapping methods for dense matrix-matrix
multiplication based on the platform with distributed memory.
It is applied in SpGEMM by Buluc et al. [63] [64] [16].
Noted increasing hypersparse matrices after 2D block data
decomposition, they propose a new format DCSC (doubly
compressed sparse column) to exploit the hypersparsity of the
individual sub-matrices. Then, they developed two sequential
SpGEMM algorithms based on the outer-product and column-
by-column-product multiplications, respectively, which mainly
improve the hypersparse sub-matrices multiplication after the
2D partitioning. Finally, they present a parallel efficient sparse
SUMMA [62]. Furthermore, Azad et al. [65] first implement a
3D parallel SpGEMM algorithm, based on the work of Buluc
et al. that exploits both the inter-node parallelism within a third
processor grid dimension and the multi-threading parallelism
within the node.
3) Hypergraph Partitioning: Traditional block-based ma-
trix partitioning schemes seldom consider the communication
problem from the sparse patterns of input or output matri-
ces. Akbudak et al. [42] propose an elementary hypergraph
partitioning (HP) model to partition input matrices for outer-
product parallel SpGEMM. It aims at minimizing communi-
cation cost while maintaining a balanced computational load
by a predetermined maximum allowable imbalance ratio. The
HP model is composed of three components. The first is the
vertex for representing the outer product of each column of A
with the corresponding row of B. The second is the vertex
for each non-zero entry of C to enable 2D nonzero-based
partitioning of the output matrix. The third is the hyperedge
(or net) for each non-zero entry of C to encode the total
volume of communication that occurred in intermediate result
accumulating. Assuming that the partitioning of the input
matrices A and B in the HP model is presented as
Aˆ = AQ = [Ac1 A
c
2 · · · AcK ] and Bˆ = QB =

Br1
Br2
...
BrK
 , (6)
where Q is the permutation matrix induced by the partitioning,
the HP model for outer-product parallel SpGEMM includes
two phases. In the first phase (also called as multiplication
phase), each processor Pk owns column stripe Ack and row
stripe Brk of the permuted matrices, and then finishes the
communication-free local SpGEMM computations Ack × Brk.
The second phase (also called as summation phase) reduces
partial results yielded in the first phase to calculate the final
value of each non-zero entry of C. In order to maintain a
balanced load over two phases, the two-constraint formulation
is proposed. Each vertex is assigned two weights, which rep-
resent the computational overhead associated with the vertex
for two phases, respectively. Moreover, an imbalance ratio is
introduced to direct partitioning for load balancing.
Ballard et al. [66], based on the HP model, present a fine-
grained framework to prove sparsity-dependent communica-
tion lower bounds for both parallel and sequential SpGEMM.
Further, they reduce identifying a communication-optimal al-
gorithm for given input matrices to solve the HP problem.
Akbudak et al. [55] also apply the HP model to improve the
outer-product parallel and inner-product parallel sparse matrix-
matrix multiplication (SpMM) algorithms for the Xeon Phi
architecture. The HP model allocates A-matrix columns and
B-matrix rows that contribute to the same C-matrix entries
into the same parts as much as possible in the inner-product-
parallel SpMM. In the outer-product-parallel SpMM, the HP
model allocates A-matrix rows that require the same rows of
B into the same parts as much as possible. The reason is that
it enables the exploiting of temporal locality.
Besides, Akbudak et al. [56] also apply the graph model
and HP model on the outer-product, inner-product, and row-
by-row-product parallel SpGEMM algorithms. Demirci et al.
[46] adopt a similar bipartite graph model for matrices parti-
tioning on the distributed-memory platform but complement
a 3-constraint edge weight assignment scheme to maintain
a balanced load among tablet servers. Selvitopi et al. [67]
apply the graph model and hypergraph model for simultane-
ous assigning of the map and reduce tasks to improve data
locality and maintain a balanced load among processors in
both map and reduce phases. Then, Considering the similarity
between the map and reduce tasks in the MapReduce and the
accumulation of intermediate results in SpGEMM, they apply
the obtained MapReduce partitioning scheme based on graph
and hypergraph models to improve the data locality and load
balancing in SpGEMM.
C. Result Accumulating
The third problem is the expensive intermediate result
accumulating. In outer-product SpGEMM, result matrix C is
obtained by summing the outer-product result of each row of
A and the corresponding column of B, where the intermediate
outer-product result is usually a sparse matrix. In row-by-row-
product SpGEMM, the result matrix is obtained by summing
the multiplication result of each non-zero entry of A and the
corresponding row of B, where the intermediate multiplication
results are also sparse. The other two formulations are similar.
So no matter which SpGEMM formulation is chosed, the
merging of sparse structures will be involved, and it is a
challenging problem.
1) Accumulator: Here, we call the data structure that holds
intermediate results as the accumulator, and let us take row-
by-row-product SpGEMM as an example to introduce the
accumulator (because most of SpGEMM research is based on
the row-by-row-product formulation). There are two types of
accumulators at present: dense and sparse accumulators.
In dense accumulator, intermediate results for a row are
stored in an array of size r in dense format, where r is
the column dimension of result matrix. It does not require
extra time cost for hashing, sorting or merging operations,
while it may not be suitable to parallelization with large
threads and large r values. Deveci et al. [68] [7] utilize dense
accumulator for CPU or KNL instead of GPU because of its
high memory requirements. Patwary et al. [69] utilize dense
accumulator with a column-wise partitioning of B to reduce
memory overhead. Elliott et al. [25] note that prolongation
matrices (P ) in the AMG method are usually tall and skinny,
so they use a local dense lookup array for each thread and
allocate memory space fast for result matrix by page-aligned
and size-tracked technologies.
In row-by-row-product SpGEMM, each row of result matrix
C can be expressed as ci∗ =
∑
k∈Ii(A) aik ∗ bk∗, which is
also known as sparse accumulator [70]. In sparse accumulator,
intermediate results for a row are stored in the sparse formats
like COO and CSR. Therefore, the insertion operations at
random positions hold expensive time overhead in SpGEMM
with a sparse accumulator. There are two types of methods
for intermediate result merging: sort-based and hash-based
methods.
2) Sort-based: A sort-based merging is performed by sort-
ing the intermediate results according to the column indexes
and then summing the values with the same column indexes.
The difference between sort-based works often lies in the
different sort algorithms utilized, including radix sort, merge
sort, and heap sort.
1) Radix sort based. Bell et al. [2] adopt the radix-sort
method (implemented in [84]) to sort the intermediate
results, and Dalton et al. [71] employ the B40C radix sort
algorithm [72] which allows specifications in the number
and location of the sorting bits to accelerate the sort
operation.
2) Merge sort based. In the GPU-accelerated SpGEMM
given by Gremse et al. [43], the sparse rows of B selected
and weighted by corresponding non-zeros of A, are merged
in a hierarchical way similar to merge sort. At each level,
the number of rows is reduced by half, but the length of
rows becomes larger due to merging. Winter et al. [57]
utilize the ESC method to sort intermediate results yielded
in a chunk while ignoring row boundaries of a matrix.
Simple sorting of the chunks is performed based on their
global chunk order in the merge stage. Final, a prefix
scanning is utilized to merge rows shared between chunks
to generate the final result.
3) Heap sort based. Azad et al. [65] represent the interme-
diate result as a list of triples, and each list of triples
is sorted by the (j, i) pair, where the triple (i, j, val)
stores the row index, column index, and value of a
non-zero entry. A k-way merge on k lists of triples
T1, T2, ..., Tk is performed by maintaining a heap of size
k, which stores the current lexicographically minimum
entry in each list of triples. Then a multi-way merge
routine finds the minimum triple from the heap and
merges it into the current result. Nagasaka et al. [51]
implement a heap sort-based sparse accumulator adopting
the one-phase SpGEMM in KNL and multi-core archi-
tecture. In merge-based sparse accumulator (similar to
heap sort-based accumulator) proposed by Junhong et al.
[70], each thread stores multiple column indexes, and
then performs an intra-thread-min operation to ob-
tain the local minimum column index inta-min. Mean-
while, the intermediate result value for each intra-min
is fetched by each thread. Next, each thread performs
inter-thread-min(intra-min) operation to obtain the
minimum column index min among the threads at the
same warp. Final, the inter-sum(value(min)) is utilized
to obtain the reduction sum of multiple threads.
Besides, Junhong et al. [70] propose a register-aware sort-
based sparse accumulator and a vectorized method for parallel
segmented sum [73], where the sparse accumulator exploits the
N -to-M pattern [74] to sort the data in the register. It is for
the first time that the two methods are implemented for GPU
registers.
3) Hash-based: The hash-based sparse accumulator first
allocates a memory space based on the upper bound estimation
as the hash table and uses the column indexes of the inter-
mediate results as the key. Then the hash table is required
to be shrunk to a dense state after hash functions. At last,
sorting the values of each row of the result matrix according
to their column indexes to obtain the final result matrix
compressed with sparse format [70]. Result accumulation in
cuSPARSE [44] [88] is based on the hash-based sparse accu-
mulator. Deveci et al. [75] [47] [58] design a sparse, two-level,
hashmap-based accumulator which supports parallel insertions
or accumulations from multiple vector lanes. It is stored in a
linked list structure and consists of 4 parallel arrays, column
indexes, and their numerical values in the numeric phase, as
well as beginning indexes of the linked list corresponding to
the hash values and the indexes of the next elements within
the liked list. In [68], they implement linked list-based and
linear probing hashmap accumulators utilizing different data
structures. Junhong et al. [70] optimize the data allocations
of each thread in a hash-based accumulator utilizing GPU
registers. Besides, Yaar et al. [26] utilize hashmap-based ac-
cumulators in triangle counting. Weber et al. [41] use the hash
table to store each row of matrix C based on the distributed
blocked compress sparse row (BCSR) format. Nagasaka et al.
[45] utilize the hash table for counting the number of non-
zeros of each row in the symbolic phase and for calculation of
values of result matrix in the numeric phase. In the hash-based
sparse accumulator given by [45], Nagasaka et al. utilize the
vector registers in KNL and multicore architectures for hash
probing to accelerate the hash-based SpGEMM.
Besides, Weifeng Liu et al. [21] propose a hybrid parallel
result merging method. Heapsort-like, bitonic ESC and merge
methods are employed respectively according to the upper
bound of the number of non-zeros in each row.
4) Discussion: Sort-based join and hash-based join have
been two popular join algorithms, and the debate over which
is the best joint algorithm has been going on for decades.
Kim et al. [76] claim that sort-based join will be faster than
hash-based join when SIMD wide enough through a series of
experiments. Albutiu et al. [?] devise a new massively parallel
sort-merge (MPSM) join algorithm, and the experimental
result shows that the MPSM algorithm outperforms competing
hash join. However, the experiments performed by Balkesen
et al. [77] show that the hash-based join is still superior, and
sort-based joint algorithms outperform hash-based join only
when vast amounts of data are involved. Therefore, there is
no consensus on which join method is the best.
V. ARCHITECTURE ORIENTED OPTIMIZATION
Different computer architectures usually have different com-
puting and storage properties, so the optimization of SpGEMM
is closely related to the architecture used. In following sub-
sections, we will introduce the related research of optimizing
SpGEMM based on the multi/many-core platforms, FPGA
(Field-Programmable Gate Array), heterogeneous computing
platform, and distributed platform respectively.
A. Multi-core and Many-core Platforms
The research on multi-core and many-core platforms mainly
focuses on two problems: memory management and load
balance.
1) CPU-based Optimization: Memory management. To
exploit data locality, Patwary et al. optimize the traditional
sequential SpGEMM algorithm by using dense arrays in [69].
In order to reduce deallocation cost in SpGEMM, Nagasaka
et al. compute the amount of memory required by each thread
and allocate this amount of thread-private memory in each
thread independently on the Intel KNL processor [51]. They
also develop a hash table-based SpGEMM algorithm targeting
shared-memory multi-core and many-core processors. For ar-
chitectures with reasonably large memory and modest thread
counts, Elliott et al. present a single-pass OpenMP variant
of Gustavsons SpGEMM algorithm in [25]. They partition
the rows of the right-hand side matrix and then perform
Gustavson algorithm locally using thread-local storage for
data and column arrays while constructing the row array in
place. To reduce the overhead of memory allocation, they
use page-aligned allocations and track used size and capacity
separately. To implement efficient SpGEMM for many large-
scale applications, Chen et al. propose optimized SpGEMM
kernels based on COO, CSR, ELL, and CSC formats on the
Sunway TaihuLight supercomputer in [59]. In this paper, they
use the DMA transmission to access data for CPE core in
the main memory. Beyond doubt, continuous data transfer
using DMA performs better than discrete memory accesses.
To study the relationship between different multi-level memory
architectures and the performance of SpGEMM kernel, Deveci
et al. design different chunking-based SpGEMM algorithms
for Intel KNL where the memory subsystems have similar
latency [58]. They find that the performance difference of
subsystems decreases with the better temporal and spatial
locality of SpGEMM. Considering that the accesses to B can
be irregular depending on the structure of A, a data placement
method of storing matrix B as much as possible for KNL is
proposed. In [68], Deveci et al. provide heuristics to choose the
best kernel parameters according to problem characteristics,
and present their results for the selection of partitioning
schemes and data structures with trade-offs between memory
access and computational cost. Before the kernel call and
service requests from thousands of threads, a memory pool
is allocated and initialized beforehand. A chunk of memory is
returned to the request thread and it will not be released until
the thread puts the chunk back to the pool. The parameters of
the memory pool are architecture-specific and the number of
chunks is chosen based on the available concurrency of the
target device.
Load balance. In [69], matrices are divided into small par-
titions and dynamic load balancing is used over the partitions.
Experiments show that reducing the size of each partition leads
to better load balance on a system with two Intel Xeon E5-
2697 V3 processors. They find that better results are achieved
when the total number of partitions is 6-10 times the number
of threads. To reduce the overhead of dynamic scheduling,
the static scheduling is used in the SpGEMM algorithm on the
Intel KNL processor [51]. A three-level partitioning scheme is
presented to achieve load balancing based on the architecture
of the Sunway [59]. For the form of C = A × B, on the
first level, the left sparse matrix A is partitioned into multiple
sub-A. Each core group in the SW26010 processor performs
the multiplication of a sub-A and B. On the second level, B
is partitioned into multiple sub-B. Each CPE core in a core
group multiplies the sub-A by a sub-B. On the third level, to
fit the 64-KB local data memory in each CPE core, sub-A and
sub-B are further partitioned into several sets.
2) GPU-based Optimization: Memory management. Liu
et al. summarize the extra irregularity that an efficient parallel
SpGEMM algorithm has to handle in [21]. One of them is
that the number of non-zeros in the result sparse matrix is
unknown in advance. In their paper, memory pre-allocation
for the result matrix is organized using a hybrid approach
that saves a large amount of global memory space. Besides,
the very limited on-chip scratchpad memory is efficiently
utilized. To make full use of the high-speed registers and on
GPU, Liu et al. design different register-aware SpGEMM algo-
rithms for different sparse accumulators respectively in [70].
These sparse accumulators include sort-based, merge-based,
and hash-based, where merge-based is similar to the merge
sort-based accumulator mentioned in Section IV. They fully
utilize the GPU registers to fetch data, finish computations,
and store results out. In [57], Winter et al. propose an adaptive
chunk-based GPU SpGEMM approach (AC-SpGEMM) to cut
down the computational overhead throughout all stages. They
perform chunk-based ESC (explicitly expand all temporary
products, sort them and combine them to yield the final
matrix) to produce deterministic bit-stable results. With the
help of the local task distribution, this stage performs multiple
iterations of ESC, fully utilizing local memory resources
while keeping temporary results in scratchpad memory. To
achieve the performance portability on massively threaded
architectures, Deveci et al. design a hierarchical and memory-
efficient SpGEMM algorithm in [47] by introducing two thread
scalable data structures, a multilevel hashmap and a memory
pool. Since the memory subsystems of NVIDIA GPU differ
significantly in both bandwidth and latency, two different data
placements methods are proposed in [58]. One is to keep the
partitions of A and C in fast memory and stream the partitions
of B to fast memory. Another is to keep the partitions of
B in fast memory and stream the partitions of A and C to
fast memory. Which method to be chosen often depends on
the input matrix properties. To select proper parameters of
the memory pool on GPU, Deveci et al. over-estimate the
concurrency to access memory efficiently. They check the
available memory and reduce the number of chunks if the
memory allocation becomes too expensive on GPUs [68].
Load balancing. To balance the workload of SpGEMM on
GPU, a heuristic method is presented to assign the rows of
the result matrix to multiple bins with different subsequent
computational methods [21]. In this paper, Liu et al. first
allocate 38 bins and put them into five bin groups. Then,
all rows are assigned to corresponding bins according to the
number of the non-zeros. Finally, they allocate a temporary
matrix for the non-zero entries in the result matrix C based on
the size of each bin. Because the rows in the first two bins only
require trivial operations, these two bins are excluded from
subsequent more complex computation on the GPUs. Thus a
better load balancing can be expected. In [57], Winter et al.
split the non-zeros of matrix A uniformly, assigning the same
number of non-zeros to each thread block. Though memory
requirements from A are static, the actual workload for each
block varies based on the generated intermediate products. A
static splitting of A′s non-zeros does not require processing,
but the second stage still needs to associate each entry from
A with a row. Therefore, the global load balancing in parallel
checks A′s row pointers and writes the required information
into an auxiliary array. The cost of this step is negligible
compared to enumerating temporary products.
B. Field-Programmable Gate Array (FPGA)
Related research of SpGEMM on FPGA is quite rare. Lin
et al. are the first to address SpGEMM on FPGAs. In [82],
the authors design and implement the SpGEMM algorithm
on FPGAs. They study the performance of their design in
terms of computational latency, as well as the associated
power-delay and energy-delay trade off. Taking advantage of
the sparsity of the input matrices, their design allows user-
tunable power-delay and energy-delay trade off by employing
the different numbers of processing elements in architecture
design and different block size in blocking decomposition.
Such ability allows designers to employ different on-chip
resources for different system power-delay and energy-delay
requirements. Jamro et al. think that indices comparisons
dominate floating-point operations. Therefore they propose a
highly parallel architecture, which can carry out at least 8×8
indices comparison in a single clock cycle in [83].
C. Heterogeneous Computing Platform
Researchers mainly focus on load balancing problems on
heterogeneous platforms. For heterogeneous systems com-
posing CPUs and GPUs, Siegel et al. develop a task-based
programming model and a runtime-supported execution model
to achieve dynamic load balancing in [80]. They partition
the SpGEMM problem in blocks in the form of tasks to
address the unbalancing introduced by the sparse matrix and
the heterogeneous platform. For a CPU-GPU heterogeneous
computing system, Matam et al. propose several heuristic
methods to identify the proper work division of a subclass
of matrices between CPU and GPU in [53]. Liu et al. propose
a load-balancing method based on output space decomposition
in [17]. They assign rows to a fixed number of bins through
a much faster linear-time traverse on the CPU. Moreover,
they decompose the output space in a more detailed way that
guarantees much more efficient load balancing. Their method
is always load balanced in all stages for maximizing resource
utilization of GPUs. Besides, in [81], Rubensson et al. present
a method for parallel block SpGEMM on distributed memory
clusters based on their previously proposed Chunks and Tasks
model [54]. They use a quad-tree matrix representation to
exploit data locality without prior information about the matrix
sparsity pattern. The quad-tree representation, combined with
the Chunks and Tasks model, leads to favorable weak and
strong scaling of the communication cost with the number
of processes. Matrices are represented as sparse quad-trees
of chunk objects, and the leaves in the hierarchy are block-
sparse sub-matrices. Sparsity is dynamically detected and may
occur at any level in the hierarchy and/or within the sub-matrix
leaves. In case GPU is available, this design uses both CPUs
and GPUs for leaf-level multiplication, thus making full use
of the computing capacity of each node.
D. Distributed Platform
The research of SpGEMM on the cluster system mainly
focuses on the minimizing of inter-node communication. In
[78], Jin et al. conduct a series of experiments to study the data
communication issue of SpGEMM on the PC cluster. They
find that due to communication overheads, it is difficult for
clusters to exploit low-level parallelism. Thus, to successfully
achieve computation parallel, it is critical to exploit high-level
parallelism sufficiently. Increasing the degree of high-level
parallelism makes easier the task of balancing the workload.
Dynamic scheduling algorithms always have less unbalanced
overhead than static scheduling algorithms, no matter uni-
casting or multicasting is used for data communication. In
[42], for outer-product-parallel SpGEMM, Akbudak et al.
propose three hypergraph models that achieve simultaneous
partitioning of input and output matrices without any repli-
cation of input data. All three hypergraph models perform
conformable 1D column-wise and row-wise partitioning of the
input matrices A and B, respectively. The first hypergraph
model performs two-dimensional nonzero-based partitioning
of the output matrix, whereas the second and third models
perform 1D row-wise and column-wise partitioning of the
output matrix, respectively. This partitioning scheme leads to a
two-phase parallel SpGEMM algorithm: communication-free
local SpGEMM computations and the multiple single-node-
accumulation operations on the local SpGEMM results.
In [79], Bethune et al. study the performance of SpGEMM
using the open-source sparse linear algebra library DBCSR
(Distributed Block Compressed Sparse Row) on different
distributed systems. DBCSR matrices are stored in a blocked
compressed sparse row (CSR) format distributed over a two-
dimensional grid of P MPI processes. Inter-process com-
munication is based on the communication-reducing 2.5D
algorithm.
The local multiplication will start as soon as all the
data has arrived at the destination process. The amount of
communicated data by each process scales as O(1/
√
P ).
In [67], Selvitopi et al. achieve static scheduling of “map”
and “reduce” tasks in a MapReduce job based on graph
and hypergraph partitioning. The locality in task assignment
reduces the amount of data transfer in the shuffle phase, and
the balancing of computations leads to faster task execution.
They show the validity of their scheduling on the MapReduce
parallelization of SpGEMM. Their model leads to tremendous
savings in data transfer and consequently achieves up to 4.2x
speedup for SpGEMM compared with random scheduling.
In [46], Demirci et al. design a distributed SpGEMM
algorithm on Accumolo, which is a highly scalable, distributed
key-value store built on the design of Googles Bigtable. The
basic idea of this work is that reducing the values for a key
locally before they are sent to a remote node. They use a tablet
server to iterate through multiple rows in sub-A and create
batches of rows to be processed together before performing
any computation or a batch-scan operation. Batch processing
of rows significantly reduces the latency and communication
volume among servers, because both the number of lookup
operations and the redundant retrieval of rows of B are
reduced.
VI. PERFORMANCE EVALUATION
A. Overview
The performance of the SpGEMM kernel is dominated by
several factors, including sparse format, matrix sparsity, and
computational device. It is extremely challenging to figure out
the exact relationship between the SpGEMM kernel and these
factors.
In this section, we conduct a series of experiments to
compare the SpGEMM performance of several state-of-art
approaches, and the non-zeros of sparse matrices are cached
and involved in calculation in double precision. Our tested
libraries include Intel Math Kernel Library (MKL) [89],
NVIDIA CUDA Sparse Library (cuSPARSE) [88], CSPARSE
[91], CUSP [90], bhSPARSE [17] [21] and KokkosKernels
[87].
Intel MKL is a highly-parallel closed-source library for
Intel processors. One of its sub-library, named spblas, includes
standard C and Fortran API for sparse linear algebra. It
uses a highly encapsulated internal data structure to perform
operations upon sparse matrices. We call mkl sparse sp2m
to perform SpGEMM, and both input matrices are encoded
with either CSR or BSR. This interface supports two-stage
execution. The row pointer array of the output matrix is
calculated in the first stage. In the second stage, the remaining
column indexes and value arrays of the output matrix are
calculated. At the same time, We record the execution time
of these two stages.
cuSPARSE is a closed-source library provided by NVIDIA.
It contains a set of basic linear algebra subroutines used for
handling sparse matrices. In our test program, input matri-
ces encoded with COO format are first converted into CSR
format by calling cusparseXcoo2csr, then we call cusparseXc-
srgemmnnz and cusparseDcsrgemm in turn to complete a two-
stage SpGEMM, which is similar to MKL. Also, the execution
time of each step is recorded.
CSPARSE is an open-source, single-thread sparse matrix
package written in C program. It uses the CSC format store the
sparse matrices. We call cs multiply to calculate the product
TABLE II: Detailed information of evaluated libraries.
Library Version Compiler Open source Size prediction Accumulator Supported format
MKL 2019 icc 19.0.3.199 No precise - CSC/CSR/BSR
cuSPARSE 10.0.130 nvcc 10.0.130 No precise hash-based CSR
CSPARSE 3.1.4 icc 19.0.3.199 Yes progressive dense CSC
CUSP 0.5.1 nvcc 8.0.61 Yes upper bound sort-based COO/CSR
bhsparse Lastest nvcc 8.0.61 Yes hybrid hybrid CSR
KokkosKernels 2.8.00 nvcc 10.0.130 Yes precise CPU:dense; GPU:hash-based CSR
Fig. 3: The detailed information of dataset.
of two sparse matrices and also record the execution time of
this procedure simultaneously.
CUSP is an open-source library for sparse basic linear
algebra and graph computations based on Thrust, C++. It
provides a flexible API for users and is highly encapsulated.
We call multiply to calculate the product of two sparse matrices
encoded with the COO format and record the execution time
of this call procedure.
bhSPARSE is an open-source library that provides sparse
linear algebra interfaces used for sparse matrix computations
on heterogeneous parallel processors. A SpGEMM implemen-
tation based on CSR format is included in this library, and it
is claimed to be efficient and general for the multiplication
of irregular sparse matrices. It is composed of four stages:
calculating upper bound, binning, computing the result matrix,
and arranging data. The detailed introduction can be found in
paper [21].
KokkosKernels implements local computational kernels for
linear algebra and graph operations, using the Kokkos shared-
memory parallel programming model [85]. It also supports
two-stage execution and records the execution time of the
symbolic stage and numeric stage, respectively. Moreover,
we test its performance on the Intel multi-core platform and
NVIDIA GPU platform, respectively.
B. System Setup
The computer used for the experiment runs a 64-bit Ubuntu
operation system. The host memory is up to 16GB. The
CPU of this machine is Intel CoreTM i7-4790K with 4GHz
clock frequency. The GPU is NVIDIA Geforce GTX 1080Ti,
which is based on the Pascal architecture and equipped with
3584 CUDA cores and 11GB device memory. The versions,
compilers, open-source information, size prediction method,
intermediate result accumulator, and support format of evalu-
ated libraries are set out in Table II.
Fig. 4: The proportion of execution time of each phase.
Involved matrices are encoded with CSR format. Format con-
version, stage1, stage2 and format export represent the format
conversion from COO to CSR, calculation of row pointer array
of the output matrix, calculation of remaining column indexes
and value arrays of the output matrix, format export from
internal representation to CSR format, respectively.
Fig. 5: The proportion of execution time of each phase.
Involved matrices are encoded with BSR format. Format con-
version, stage1, stage2 and format export represent the format
conversion from COO to BSR, calculation of row pointer array
of the output matrix, calculation of remaining column indexes
and value arrays of the output matrix, format export from
internal representation to BSR format, respectively.
C. Benchmark
The sparse matrices used in our experiment are from the
SuiteSparse Matrix Collection (formerly the University of
Florida Sparse Matrix Collection) [86]. We choose the square
matrices to perform the SpGEMM kernel C = A × A. The
corresponding dataset is shown in Fig. 3. The dimensions of
matrices range from 14K to 1M, and the number of non-
zeros ranges from 0.4M to 11.5M. These matrices are from
many applications, including Finite Element Method (FEM),
Quantum Chromodynamics (QCD), Protein data, electromag-
netics, Circuit simulation, Directed graph, structural problem,
and web connectivity. Besides, it can be seen from Fig. 3 that
left matrices have a regular distribution, most of their non-zero
elements are distributed near main diagonal, while right sparse
matrices have an irregular non-zeros distribution.
D. Evaluation Result
We conduct three groups of experiments on the benchmark
to evaluate the performance of the ahead mentioned libraries.
1) Sparse Storage Format: Although a variety of storage
formats are currently proposed for sparse matrix, the existing
SpGEMM implementations involve few storage formats. CSR
is the most commonly used sparse format, which is used in
the SpGEMM implementation of CUSP, MKL, cuSPARSE,
bhSPARSE, and KokkosKernels. Besides, MKL supports the
multiplication of two sparse matrices encoded with BSR
format, and CSPARSE only supports CSC format.
In this experiment, we compare the SpGEMM performance
in MKL, where the involved sparse matrices are encoded
with CSR or BSR format. The SpGEMM kernel in MKL
is composed of four phases: format conversion (COO to
CSR/BSR), stage1 (the first stage mentioned earlier), stage2
(the second stage mentioned earlier), and format export (export
CSR/BSR matrix from internal representation).
Fig. 4 and Fig. 5 provide the proportion of execution time
of each phase where all involved sparse matrices are encoded
with CSR or BSR formats. It can be seen from Fig. 4 that when
the involved matrices are stored in CSR format, the execution
time of format conversion for most sparse matrices takes up a
small proportion. The execution time of the last three phases
varies from matrix to matrix. Both stage1 and stage2 account
for a larger proportion than other phases, while the execution
time of format export phase varies greatly. From Fig. 5, we
can see that the execution time of the stage1 phase takes up
a small ratio for most sparse matrices, while the other three
phases account for a larger proportion.
Comparison of the execution time of each phase based on
different sparse formats is set out in Fig. 6. From Fig. 6(a) we
can see that it takes a longer time to convert the matrix from
COO to BSR than to CSR on each sparse matrix, and the time
difference is huge on some sparse matrices. As can be seen
from Fig. 6(b), stage1 based on CSR format, which computes
the row pointer array of output matrix, takes a longer time
than BSR format on most sparse matrices. From Fig. 6(c) we
can see that the SpGEMM based on BSR format gives better
performance than the SpGEMM based on CSR format on most
irregular sparse matrices, while there is no specific winner for
regular sparse matrices. Fig. 6(d) indicates that there is no
absolute advantage on exporting the internal representation to
CSR or BSR.
The final performance comparison between CSR and BSR-
based SpGEMM is presented in Fig. 7. It can be seen that
neither CSR-based nor BSR-based SpGEMM shows the over-
whelming advantage. The specific performance of SpGEMM
has a complex relationship with the chosen sparse format and
the matrix sparsity.
(a) Format conversion. (b) Stage1.
(c) Stage2. (d) Format export.
Fig. 6: Comparison of execution time of each phase in different sparse formats, including CSR and BSR.
Fig. 7: Performance comparison of SpGEMM implementation
in MKL where involved sparse matrices are compressed with
CSR or BSR formats.
2) SpGEMM Implementation: In this section, we compare
the performance of different SpGEMM implementations. First,
we compare the performance of each stage of the libraries
whose SpGEMM implementation predicts the number of non-
zeros of the output matrix using the precise method. In existing
libraries, MKL, cuSPARSE, and KokkosKernels utilize the
precise method in the first symbolic phase.
Fig. 8 presents the speedup of the execution time of cuS-
PARSE and KokkosKernels to MKL (SpGEMM based on
CSR) in the symbolic phase. The SpGEMM in KokkosKer-
nels involved mac econ fwd500, scircuit and webbase-1M
matrices encounter runtime error, so their values are set to
0 (the same below). From Fig. 8 we can see that for most
regular matrices, the implementation of the symbolic phase
in cuSPARSE is better than other libraries, while the CUDA-
based implementation of symbolic phase in KokkosKernels
is the best for most irregular matrices. Besides, the MKL
and OpenMP-based implementation of the symbolic phase in
KokkosKernels show worse performance than other libraries.
The speedup of the execution time of cuSPARSE and
KokkosKernels in the second numeric phase to MKL can
be compared in Fig. 9. It can be seen that the CUDA-
based implementation of the numeric phase in KokkosKernels
outperforms cuSPARSE and OpenMP-based KokkosKernels
on each sparse matrix except three problematic matrices
Fig. 8: The speedup of the execution time of cuSPARSE, CUDA-based and OpenMP-based Kokkoskernels in the symbolic
phase to MKL (based on CSR).
Fig. 9: The speedup of cuSPARSE, CUDA-based and OpenMP-based Kokkoskernels in the numeric phase to MKL (based on
CSR).
and matrix cant. Especially on irregular sparse matrices, its
performance advantage is considerable.
Fig. 10 presents the proportion of execution time of sym-
bolic and numeric phases in different SpGEMM implemen-
tations, where MKL 1 and MKL 2 represent the symbolic
and numeric phases in MKL, respectively, and others are
similar. It can be seen from Fig. 10 that the execution time
of the symbolic phase is less than or equal to the numeric
phase on each sparse matrix in cuSPARSE and OpenMP-
based KokkosKernels. Moreover, there are 8 and 12 matrices
out of the whole 20 matrices whose time proportions of the
symbolic phase are less than or equal to 30% in cuSPARSE
and OpenMP-based SpGEMM, respectively. In MKL and
CUDA-based KokkosKernels, however, there are 11 and 16
out of 17 matrices, respectively, whose time proportions of
symbolic phase are larger than 30%, and the time proportions
of 3 and 7 out of 17 matrices respectively are larger than 50%.
The final performance of SpGEMM implementation in
different libraries is summarised in Fig. 11. It can be seen that
CUDA-based KokkosKernels, cuSPARSE, bhSPARSE, and
OpenMP-based KokkosKernels outperform other SpGEMM
implementations on 12, 5, 2 and 1 sparse matrices, respec-
tively.
VII. CHALLENGES AND FUTURE WORK
SpGEMM has gained much attention in the past decades,
and the work has been conducted in many directions. We
believe that it will be used in many other fields as the
development of IoT, big data, and AI. A systematic literature
(a) MKL. (b) cuSPARSE.
(c) CUDA-based KokkosKernels. (d) OpenMP-based KokkosKernels.
Fig. 10: The proportion of execution time of symbolic and numeric phases in different SpGEMM implementations. MKL 1
and MKL 2 represent the symbolic and numeric phases respectively in MKL, and others are similar.
Fig. 11: Performance comparsion of SpGEMM implementation in CSPARSE, MKL, CUSP, bhSPARSE, cuSPARSE, and
KokkosKernels. To be fair, the statistics in this figure do not include the time cost of format conversion and format exportd,
and the performance of MKL in this figure is based on CSR format.
review sheds light on the potential research directions and
some challenges that require further study.
One of the challenges of SpGEMM is exploring the sparsity
of sparse matrices. Researchers find that the non-zero ele-
ments distribution dominates the performance of SpMV and
SpGEMM on the same architecture. To further explore the
potential performance improvement, automatic format select
algorithms have been designed for SpMV based on machine
learning models. Significant performance improvement has
been observed using auto-tuning and format selection based
on these models. However, the situation for SpGEMM is more
complicated because two sparse matrices are involved. The
performance depends on not only the sparsity of the two
input matrices but also the combination of the sub-matrices
partitioned from the input matrices.
Size prediction of the result matrix is also a challenge
for SpGEMM. Upper-bound prediction may result in memory
over-allocation, and others may lead to memory re-allocation.
Both over-allocation and re-allocation are challenging tasks for
acceleration devices, such as GPU, DSP, FPGA, and TPU. On-
device memory capacity is limited and may not accommodate
such a large amount of data. It also takes time to copy data
to and from device memory because of separated memory
space. Precise size prediction not only reduces the overhead
of memory management but also indirectly improves the
performance of result accumulating. This is because smaller
memory footprint and seldom re-allocation are required for
output matrix manipulation.
Finally, heterogeneous architecture-oriented optimization is
a challenge for diversified systems. CPU is good at processing
complicated logic, while GPU is good at dense computations.
Besides, DSP and FPGA accelerators may be used in different
systems. One of the critical questions of porting SpGEMM
to the heterogeneous system is how to achieve load balance
and minimize the communication traffic. Moreover, each sub-
matrix may have its feature in non-zeros distribution. Ideally,
relatively dense blocks should be mapped to acceleration
devices, while super sparse blocks should be mapped to
CPU. Only in this way can each device give full play to its
architecture and optimize the overall computing performance
of SpGEMM.
VIII. CONCLUSION
The design of an efficient SpGEMM algorithm, as well as
its implementation, are critical to many large-scale scientific
applications. Therefore it is not surprising that SpGEMM has
attracted much attention from researchers over the years. In
this survey, we highlight some of the developments in recent
years and emphasize the application, challenging problems,
architecture-oriented optimizations, and performance evalua-
tion. In concluding, we stress the fact that despite recent
progress, there are still important areas where much work
remains to be done, such as different formats support, hetero-
geneous architecture-oriented optimization, efficient size pre-
diction of target matrix. On the other hand, the heterogeneous
architecture may give rise to new optimization opportunities,
and efficient implementation in the specific architecture is
within reach and can be expected to continue in the new future.
REFERENCES
[1] J. Georgii and R. Westermann, “A streaming approach for sparse matrix
products and its application in galerkin multigrid methods,” Electronic
Transactions on Numerical Analysis, vol. 37, no. 263-275, pp. 3–5, 2010.
[2] N. Bell, S. Dalton, and L. N. Olson, “Exposing fine-grained parallelism
in algebraic multigrid methods,” SIAM J. Scientific Computing, vol. 34,
no. 4, 2012. [Online]. Available: https://doi.org/10.1137/110838844
[3] G. Ballard, C. Siefert, and J. Hu, “Reducing communication costs
for sparse matrix multiplication within algebraic multigrid,” SIAM
Journal on Scientific Computing, vol. 38, no. 3, pp. C203–C231, 2016.
[Online]. Available: https://doi.org/10.1137/15M1028807
[4] A. Bienz, “Reducing communication in sparse solvers,” phdthesis,
University of Illinois at Urbana-Champaign, Sep. 2018.
[5] T. A. Davis, “Graph algorithms via suitesparse: Graphblas: triangle
counting and k-truss,” in 2018 IEEE High Performance extreme Com-
puting Conference (HPEC), Sep. 2018, pp. 1–6.
[6] J. Cohen, “Graph twiddling in a mapreduce world,” Computing in
Science and Engg., vol. 11, no. 4, pp. 29–41, Jul. 2009. [Online].
Available: https://doi.org/10.1109/MCSE.2009.120
[7] M. M. Wolf, M. Deveci, J. W. Berry, S. D. Hammond, and S. Raja-
manickam, “Fast linear algebra-based triangle counting with kokkosker-
nels,” in 2017 IEEE High Performance Extreme Computing Conference
(HPEC), Sep. 2017, pp. 1–7.
[8] A. Azad, A. Bulu, and J. Gilbert, “Parallel triangle counting and
enumeration using matrix algebra,” in 2015 IEEE International Parallel
and Distributed Processing Symposium Workshop, May 2015, pp. 804–
811.
[9] J. R. Gilbert, S. Reinhardt, and V. B. Shah, “High-performance graph
algorithms from parallel sparse matrices,” in International Workshop on
Applied Parallel Computing. Springer, 2006, pp. 260–269.
[10] M. Then, M. Kaufmann, F. Chirigati, T.-A. Hoang-Vu, K. Pham,
A. Kemper, T. Neumann, and H. T. Vo, “The more the merrier:
Efficient multi-source graph traversal,” Proc. VLDB Endow., vol. 8,
no. 4, pp. 449–460, Dec. 2014. [Online]. Available: http://dx.doi.org/
10.14778/2735496.2735507
[11] A. Buluc¸ and J. R. Gilbert, “The combinatorial BLAS: design,
implementation, and applications,” International Journal of High
Performance Computing Applications, vol. 25, no. 4, pp. 496–509,
2011. [Online]. Available: https://doi.org/10.1177/1094342011403516
[12] T. M. Chan, “More algorithms for all-pairs shortest paths in
weighted graphs,” in Proceedings of the Thirty-ninth Annual ACM
Symposium on Theory of Computing, ser. STOC ’07. New
York, NY, USA: ACM, 2007, pp. 590–598. [Online]. Available:
http://doi.acm.org/10.1145/1250790.1250877
[13] H. Kaplan, M. Sharir, and E. Verbin, “Colored intersection searching via
sparse rectangular matrix multiplication,” in Proceedings of the twenty-
second annual symposium on Computational geometry. ACM, 2006,
pp. 52–60.
[14] M. Deveci, E. G. Boman, K. D. Devine, and S. Rajamanickam, “Parallel
graph coloring for manycore architectures,” in 2016 IEEE International
Parallel and Distributed Processing Symposium (IPDPS), May 2016,
pp. 892–901.
[15] V. Vassilevska, R. Williams, and R. Yuster, “Finding heaviest
h-subgraphs in real weighted graphs, with applications,” CoRR,
vol. abs/cs/0609009, 2006. [Online]. Available: http://arxiv.org/abs/cs/
0609009
[16] A. Buluc¸ and K. Madduri, “Parallel breadth-first search on distributed
memory systems,” in Proceedings of 2011 International Conference for
High Performance Computing, Networking, Storage and Analysis, ser.
SC ’11. New York, NY, USA: ACM, 2011, pp. 65:1–65:12. [Online].
Available: http://doi.acm.org/10.1145/2063384.2063471
[17] W. Liu and B. Vinter, “A framework for general sparse matrix-matrix
multiplication on gpus and heterogeneous processors,” J. Parallel
Distrib. Comput., vol. 85, no. C, pp. 47–61, Nov. 2015. [Online].
Available: http://dx.doi.org/10.1016/j.jpdc.2015.06.010
[18] F. G. Gustavson, “Two fast algorithms for sparse matrices: Multiplication
and permuted transposition,” ACM Trans. Math. Softw., vol. 4, pp. 250–
269, 1978.
[19] I. S. Duff, “A survey of sparse matrix research,” Proceedings of the
IEEE, vol. 65, no. 4, pp. 500–535, April 1977.
[20] M. Grossman, C. Thiele, M. Araya-Polo, F. Frank, F. O. Alpak,
and V. Sarkar, “A survey of sparse matrix-vector multiplication
performance on large matrices,” CoRR, vol. abs/1608.00636, 2016.
[Online]. Available: http://arxiv.org/abs/1608.00636
[21] W. Liu and B. Vinter, “An efficient GPU general sparse matrix-matrix
multiplication for irregular data,” in 2014 IEEE 28th International
Parallel and Distributed Processing Symposium, 2014, pp. 370–381.
[Online]. Available: https://doi.org/10.1109/IPDPS.2014.47
[22] B. Kitchenham, “Procedures for performing systematic reviews,” Keele,
UK, Keele Univ., vol. 33, 08 2004.
[23] R. D. Falgout, “An introduction to algebraic multigrid,” Computing in
Science Engineering, vol. 8, no. 6, pp. 24–33, Nov 2006.
[24] W. L. Briggs, V. E. Henson, and S. F. McCormick, A Multigrid Tutorial:
Second Edition. Philadelphia, PA, USA: Society for Industrial and
Applied Mathematics, 2000.
[25] J. J. Elliott and C. M. Siefert, “Low thread-count gustavson: A multi-
threaded algorithm for sparse matrix-matrix multiplication using perfect
hashing,” in 2018 IEEE/ACM 9th Workshop on Latest Advances in
Scalable Algorithms for Large-Scale Systems (scalA), Nov 2018, pp.
57–64.
[26] A. Yas¸ar, S. Rajamanickam, M. Wolf, J. Berry, and U. V. atalyu¨rek,
“Fast triangle counting using cilk,” in 2018 IEEE High Performance
extreme Computing Conference (HPEC), Sep. 2018, pp. 1–7.
[27] M. M. Wolf, J. W. Berry, and D. T. Stark, “A task-based linear algebra
building blocks approach for scalable graph analytics,” in 2015 IEEE
High Performance Extreme Computing Conference (HPEC), Sep. 2015,
pp. 1–6.
[28] L. Wang, Y. Wang, C. Yang, and J. D. Owens, “A comparative study
on exact triangle counting algorithms on the gpu,” in Proceedings
of the ACM Workshop on High Performance Graph Processing, ser.
HPGP ’16. New York, NY, USA: ACM, 2016, pp. 1–8. [Online].
Available: http://doi.acm.org/10.1145/2915516.2915521
[29] Y. Zhang, L. Wu, G. Wei, and S. Wang, “A novel algorithm
for all pairs shortest path problem based on matrix multiplication
and pulse coupled neural network,” Digital Signal Processing,
vol. 21, no. 4, pp. 517 – 521, 2011. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S1051200411000650
[30] D. Eppstein and E. S. Spiro, “The h-index of a graph and its application
to dynamic subgraph statistics,” in Proceedings of the 11th International
Symposium on Algorithms and Data Structures, ser. WADS ’09. Berlin,
Heidelberg: Springer-Verlag, 2009, pp. 278–289.
[31] S. Samsi, V. Gadepally, M. B. Hurley, M. Jones, E. K. Kao,
S. Mohindra, P. Monticciolo, A. Reuther, S. Smith, W. S. Song,
D. Staheli, and J. Kepner, “Static graph challenge: Subgraph
isomorphism,” CoRR, vol. abs/1708.06866, 2017. [Online]. Available:
http://arxiv.org/abs/1708.06866
[32] V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. A. Miller,
and J. Kepner, “Graphulo: Linear algebra graph kernels for nosql
databases,” CoRR, vol. abs/1508.07372, 2015. [Online]. Available:
http://arxiv.org/abs/1508.07372
[33] C. E. Tsourakakis, P. Drineas, E. Michelakis, I. Koutis, and C. Faloutsos,
“Spectral counting of triangles via element-wise sparsification and
triangle-based link recommendation,” Social Network Analysis and
Mining, vol. 1, no. 2, pp. 75–81, Apr 2011. [Online]. Available:
https://doi.org/10.1007/s13278-010-0001-9
[34] A. Pavan, K. Tangwongsan, S. Tirthapura, and K.-L. Wu, “Counting
and sampling triangles from a graph stream,” Proc. VLDB Endow.,
vol. 6, no. 14, pp. 1870–1881, Sep. 2013. [Online]. Available:
http://dx.doi.org/10.14778/2556549.2556569
[35] N. Wang, J. Zhang, K.-L. Tan, and A. K. H. Tung, “On
triangulation-based dense neighborhood graph discovery,” Proc. VLDB
Endow., vol. 4, no. 2, pp. 58–68, Nov. 2010. [Online]. Available:
http://dx.doi.org/10.14778/1921071.1921073
[36] T. Mattson, D. Bader, J. Berry, A. Buluc, J. Dongarra, C. Faloutsos,
J. Feo, J. Gilbert, J. Gonzalez, B. Hendrickson, J. Kepner, C. Leiserson,
A. Lumsdaine, D. Padua, S. Poole, S. Reinhardt, M. Stonebraker,
S. Wallach, and A. Yoo, “Standards for graph algorithm primitives,” in
2013 IEEE High Performance Extreme Computing Conference (HPEC),
Sep. 2013, pp. 1–2.
[37] A. Buluc¸ and J. R. Gilbert, “Highly parallel sparse matrix-matrix
multiplication,” arXiv preprint arXiv:1006.2183, 2010.
[38] J. R. Gilbert, S. Reinhardt, and V. B. Shah, “A unified framework
for numerical and combinatorial computing,” Computing in Science
Engineering, vol. 10, no. 2, pp. 20–25, March 2008.
[39] M. O. Rabin and V. V. Vazirani, “Maximum matchings in general graphs
through randomization,” J. Algorithms, vol. 10, no. 4, pp. 557–567,
Dec. 1989. [Online]. Available: https://doi.org/10.1016/0196-6774(89)
90005-9
[40] K. Censor-Hillel, P. Kaski, J. H. Korhonen, C. Lenzen, A. Paz, and
J. Suomela, “Algebraic methods in the congested clique,” CoRR, vol.
abs/1503.04963, 2015. [Online]. Available: http://arxiv.org/abs/1503.
04963
[41] V. Weber, T. Laino, A. Pozdneev, I. Fedulova, and A. Curioni,
“Semiempirical molecular dynamics (semd) i: Midpoint-based parallel
sparse matrixmatrix multiplication algorithm for matrices with
decay,” Journal of Chemical Theory and Computation, vol. 11,
no. 7, pp. 3145–3152, 2015, pMID: 26575751. [Online]. Available:
https://doi.org/10.1021/acs.jctc.5b00382
[42] K. Akbudak and C. Aykanat, “Simultaneous input and output
matrix partitioning for outer-product–parallel sparse matrix-matrix
multiplication,” SIAM Journal on Scientific Computing, vol. 36, no. 5,
pp. C568–C590, 2014. [Online]. Available: https://doi.org/10.1137/
13092589X
[43] F. Gremse, A. Hfter, L. Schwen, F. Kiessling, and U. Naumann,
“Gpu-accelerated sparse matrix-matrix multiplication by iterative row
merging,” SIAM Journal on Scientific Computing, vol. 37, no. 1, pp.
C54–C71, 2015. [Online]. Available: https://doi.org/10.1137/130948811
[44] J. Demouth, “Sparse matrix-matrix multiplication on the gpu,” in GPU
Technology Conference 2012, 2012.
[45] Y. Nagasaka, A. Nukada, and S. Matsuoka, “High-performance and
memory-saving sparse general matrix-matrix multiplication for nvidia
pascal gpu,” in 2017 46th International Conference on Parallel Pro-
cessing (ICPP), Aug 2017, pp. 101–110.
[46] G. V. Demirci and C. Aykanat, “Scaling sparse matrix-matrix
multiplication in the accumulo database,” Distributed and Parallel
Databases, Jan 2019. [Online]. Available: https://doi.org/10.1007/
s10619-019-07257-y
[47] M. Deveci, C. Trott, and S. Rajamanickam, “Performance-portable
sparse matrix-matrix multiplication for many-core architectures,”
in 2017 IEEE International Parallel and Distributed Processing
Symposium Workshops, IPDPS Workshops 2017, Orlando / Buena
Vista, FL, USA, May 29 - June 2, 2017, 2017, pp. 693–702. [Online].
Available: https://doi.org/10.1109/IPDPSW.2017.8
[48] E. Cohen, “Size-estimation framework with applications to transitive
closure and reachability,” J. Comput. Syst. Sci., vol. 55, no. 3, pp.
441–453, Dec. 1997. [Online]. Available: http://dx.doi.org/10.1006/jcss.
1997.1534
[49] R. R. Amossen, A. Campagna, and R. Pagh, “Better size estimation for
sparse matrix products,” 6 2010.
[50] R. Pagh and M. Sto¨ckel, “The input/output complexity of sparse matrix
multiplication,” CoRR, vol. abs/1403.3551, 2014. [Online]. Available:
http://arxiv.org/abs/1403.3551
[51] Y. Nagasaka, S. Matsuoka, A. Azad, and A. Bulu, “High-performance
sparse matrix-matrix products on intel knl and multicore architectures.”
[52] J. Gilbert, C. Moler, and R. Schreiber, “Sparse matrices in matlab:
Design and implementation,” SIAM Journal on Matrix Analysis and
Applications, vol. 13, no. 1, pp. 333–356, 1992. [Online]. Available:
https://doi.org/10.1137/0613024
[53] K. Matam, S. R. Krishna Bharadwaj Indarapu, and K. Kothapalli,
“Sparse matrix-matrix multiplication on modern architectures,” in 2012
19th International Conference on High Performance Computing, Dec
2012, pp. 1–10.
[54] E. H. Rubensson and E. Rudberg, “Chunks and tasks: A programming
model for parallelization of dynamic algorithms,” Parallel Comput.,
vol. 40, no. 7, pp. 328–343, Jul. 2014. [Online]. Available:
http://dx.doi.org/10.1016/j.parco.2013.09.006
[55] K. Akbudak and C. Aykanat, “Exploiting locality in sparse matrix-
matrix multiplication on many-core architectures,” IEEE Transactions
on Parallel and Distributed Systems, vol. 28, no. 8, pp. 2258–2271,
Aug 2017.
[56] K. Akbudak, O. Selvitopi, and C. Aykanat, “Partitioning models for
scaling parallel sparse matrix-matrix multiplication,” vol. 4, pp. 1–34,
2018.
[57] M. Winter, D. Mlakar, R. Zayer, H.-P. Seidel, and M. Steinberger,
“Adaptive sparse matrix-matrix multiplication on the gpu,” in
Proceedings of the 24th Symposium on Principles and Practice
of Parallel Programming, ser. PPoPP ’19. New York, NY, USA:
ACM, 2019, pp. 68–81. [Online]. Available: http://doi.acm.org/10.1145/
3293883.3295701
[58] M. Deveci, S. D. Hammond, M. M. Wolf, and S. Rajamanickam,
“Sparse matrix-matrix multiplication on multilevel memory architectures
: Algorithms and experiments,” CoRR, vol. abs/1804.00695, 2018.
[Online]. Available: http://arxiv.org/abs/1804.00695
[59] Y. Chen, K. Li, W. Yang, G. Xiao, X. Xie, and T. Li, “Performance-aware
model for sparse matrix-matrix multiplication on the sunway taihulight
supercomputer,” vol. 30, pp. 923–938, 2019.
[60] S. E. Kurt, V. Thumma, C. Hong, A. Sukumaran-Rajam, and P. Sadayap-
pan, “Characterization of data movement requirements for sparse matrix
computations on gpus,” in 2017 IEEE 24th International Conference on
High Performance Computing (HiPC), Dec 2017, pp. 283–293.
[61] F. Gremse, K. Kpper, and U. Naumann, “Memory-efficient sparse
matrix-matrix multiplication by row merging on many-core architec-
tures,” vol. 40, pp. C429–C449, 2018.
[62] R. A. van de Geijn and J. Watts, “Summa: Scalable universal matrix
multiplication algorithm,” Austin, TX, USA, Tech. Rep., 1995.
[63] A. Buluc and J. R. Gilbert, “On the representation and multiplication
of hypersparse matrices,” in 2008 IEEE International Symposium on
Parallel and Distributed Processing. IEEE, 2008, pp. 1–11.
[64] ——, “Challenges and advances in parallel sparse matrix-matrix multi-
plication,” in 2008 37th International Conference on Parallel Processing.
IEEE, 2008, pp. 503–510.
[65] A. Azad, G. Ballard, A. Buluc¸, J. Demmel, L. Grigori, O. Schwartz,
S. Toledo, and S. Williams, “Exploiting multiple levels of parallelism
in sparse matrix-matrix multiplication,” CoRR, vol. abs/1510.00844,
2015. [Online]. Available: http://arxiv.org/abs/1510.00844
[66] G. Ballard, A. Druinsky, N. Knight, and O. Schwartz, “Hypergraph
partitioning for sparse matrix-matrix multiplication,” ACM Trans.
Parallel Comput., vol. 3, no. 3, pp. 18:1–18:34, Dec. 2016. [Online].
Available: http://doi.acm.org/10.1145/3015144
[67] O. Selvitopi, G. V. Demirci, A. Turk, and C. Aykanat, “Locality-
aware and load-balanced static task scheduling for mapreduce,”
Future Generation Computer Systems, vol. 90, pp. 49 – 61,
2019. [Online]. Available: http://www.sciencedirect.com/science/article/
pii/S0167739X17329266
[68] M. Deveci, C. Trott, and S. Rajamanickam, “Multi-threaded sparse
matrix-matrix multiplication for many-core and gpu architectures,”
vol. 78, pp. 33–46, 2018.
[69] M. M. A. Patwary, N. R. Satish, N. Sundaram, J. Park, M. J. Anderson,
S. G. Vadlamudi, D. Das, S. G. Pudov, V. O. Pirogov, and P. Dubey,
Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore
Platforms, 2015.
[70] J. Liu, X. He, W. Liu, and G. Tan, “Register-aware optimizations
for parallel sparse matrix–matrix multiplication,” International Journal
of Parallel Programming, Jan 2019. [Online]. Available: https:
//doi.org/10.1007/s10766-018-0604-8
[71] S. Dalton, L. N. Olson, and N. Bell, “Optimizing sparse matrix - matrix
multiplication for the GPU,” ACM Trans. Math. Softw., vol. 41, no. 4, pp.
25:1–25:20, 2015. [Online]. Available: https://doi.org/10.1145/2699470
[72] D. Merrill and A. Grimshaw, “High performance and scalable radix
sorting: a case study of implementing dynamic parallelism for gpu
computing.” Parallel Processing Letters, vol. 21, pp. 245–272, 06 2011.
[73] G. E. Blelloch, M. A. Heroux, and M. Zagha, “Segmented operations
for sparse matrix computation on vector multiprocessors,” School of
Computer Science, Carnegie Mellon University, Tech. Rep. CMU-CS-
93-173, Aug. 1993.
[74] K. Hou, W. Liu, H. Wang, and W.-c. Feng, “Fast segmented
sort on gpus,” in Proceedings of the International Conference on
Supercomputing, ser. ICS ’17. New York, NY, USA: ACM, 2017, pp.
12:1–12:10. [Online]. Available: http://doi.acm.org/10.1145/3079079.
3079105
[75] M. Deveci, K. Kaya, and U. V. atalyu¨rek, “Hypergraph sparsification and
its application to partitioning,” in 2013 42nd International Conference
on Parallel Processing, Oct 2013, pp. 200–209.
[76] C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish,
J. Chhugani, A. Di Blas, and P. Dubey, “Sort vs. hash revisited:
Fast join implementation on modern multi-core cpus,” Proc. VLDB
Endow., vol. 2, no. 2, pp. 1378–1389, Aug. 2009. [Online]. Available:
https://doi.org/10.14778/1687553.1687564
[77] C. Balkesen, G. Alonso, J. Teubner, and M. T. O¨zsu, “Multi-
core, main-memory joins: Sort vs. hash revisited,” Proc. VLDB
Endow., vol. 7, no. 1, pp. 85–96, Sep. 2013. [Online]. Available:
http://dx.doi.org/10.14778/2732219.2732227
[78] D. Jin and S. G. Ziavras, “A super-programming technique for large
sparse matrix multiplication on pc clusters,” IEICE TRANSACTIONS on
Information and Systems, vol. 87, no. 7, pp. 1774–1781, 2004.
[79] I. Bethune, A. Gloess, J. Hutter, A. Lazzaro, H. Pabst, and F. Reid,
“Porting of the dbcsr library for sparse matrix-matrix multiplications to
intel xeon phi systems.”
[80] J. Siegel, O. Villa, S. Krishnamoorthy, A. Tumeo, and X. Li, “Efficient
sparse matrix-matrix multiplication on heterogeneous high performance
systems,” in 2010 IEEE International Conference On Cluster Computing
Workshops and Posters (CLUSTER WORKSHOPS). IEEE, 2010, pp.
1–8.
[81] E. H. Rubensson and E. Rudberg, “Locality-aware parallel block-sparse
matrix-matrix multiplication using the chunks and tasks programming
model,” Parallel Comput., vol. 57, no. C, pp. 87–106, Sep. 2016.
[Online]. Available: https://doi.org/10.1016/j.parco.2016.06.005
[82] C. Y. Lin, Z. Zhang, N. Wong, and H. K. So, “Design space exploration
for sparse matrix-matrix multiplication on fpgas,” in 2010 International
Conference on Field-Programmable Technology, Dec 2010, pp. 369–
372.
[83] E. Jamro, T. Pabis, P. Russek, and K. Wiatr, “The algorithms for
fpga implementation of sparse matrices multiplication,” Computing and
Informatics, vol. 33, pp. 667–684, 01 2014.
[84] J. Hoberock and N. Bell, Thrust: A parallel template library, 2013,
version 1.7.0. [Online]. Available: http://thrust.github.io/
[85] H. C. Edwards, C. R. Trott, and D. Sunderland, “Kokkos: Enabling
manycore performance portability through polymorphic memory access
patterns,” Journal of Parallel and Distributed Computing, vol. 74, no. 12,
pp. 3202 – 3216, 2014, domain-Specific Languages and High-Level
Frameworks for High-Performance Computing. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0743731514001257
[86] T. A. Davis and Y. Hu, “The university of florida sparse matrix
collection,” ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1–1:25, 2011.
[87] Kokkoskernels. [Online]. Available: https://github.com/kokkos/
kokkos-kernels
[88] NVIDIA. Nvidia cusparse library. [Online]. Available: https://developer.
nvidia.com/cusparse
[89] Intel. Intel math kernel library. [Online]. Available: https://software.
intel.com/en-us/mkl
[90] S. Dalton and N. Bell. Cusp : A c++ templated sparse matrix library.
[Online]. Available: http://cusplibrary.github.com
[91] T. Davis. Csparse: a concise sparse matrix package. [Online]. Available:
http://www.cise.ufl.edu/research/sparse/CSparse
