High-performance sparse matrix-matrix products on Intel KNL and
  multicore architectures by Nagasaka, Yusuke et al.
High-performance sparse matrix-matrix products
on Intel KNL and multicore architectures
Yusuke Nagasaka
Tokyo Institute of Technology
Tokyo, Japan
nagasaka.y.aa@m.titech.ac.jp
Satoshi Matsuoka∗
RIKEN Center for Computational Science
Kobe, Japan
matsu@acm.org
Ariful Azad
Lawrence Berkeley National Laboratory
Berkeley, California, USA
azad@lbl.gov
Aydın Buluç
Lawrence Berkeley National Laboratory
Berkeley, California, USA
abuluc@lbl.gov
ABSTRACT
Sparse matrix-matrix multiplication (SpGEMM) is a computational
primitive that is widely used in areas ranging from traditional
numerical applications to recent big data analysis and machine
learning. Although many SpGEMM algorithms have been
proposed, hardware specific optimizations for multi- and
many-core processors are lacking and a detailed analysis of their
performance under various use cases and matrices is not available.
We firstly identify and mitigate multiple bottlenecks with memory
management and thread scheduling on Intel Xeon Phi (Knights
Landing or KNL). Specifically targeting multi- and many-core
processors, we develop a hash-table-based algorithm and optimize
a heap-based shared-memory SpGEMM algorithm. We examine
their performance together with other publicly available codes.
Different from the literature, our evaluation also includes use cases
that are representative of real graph algorithms, such as
multi-source breadth-first search or triangle counting. Our
hash-table and heap-based algorithms are showing significant
speedups from libraries in the majority of the cases while different
algorithms dominate the other scenarios with different matrix size,
sparsity, compression factor and operation type. We wrap up
in-depth evaluation results and make a recipe to give the best
SpGEMM algorithm for target scenario. A critical finding is that
hash-table-based SpGEMM gets a significant performance boost if
the nonzeros are not required to be sorted within each row of the
output matrix.
KEYWORDS
Sparse matrix, SpGEMM, Intel KNL
1 INTRODUCTION
Multiplication of two sparse matrices (SpGEMM) is a recurrent
kernel in many algorithms in machine learning, data analysis, and
graph analysis. For example, bulk of the computation in
multi-source breadth-first search [17], betweenness centrality [8],
Markov clustering [5], label propagation [27], peer pressure
clustering [30], clustering coefficients [4], high-dimensional
similarity search [1], and topological similarity search [20] can be
expressed as SpGEMM. Similarly, numerical applications such as
∗Alsowith Tokyo Institute of Technology, Department ofMathematical and Computing
Sciences.
scientific simulations also use SpGEMM as a subroutine. Typical
examples include the Algebraic Multigrid (AMG) method for
solving sparse system of linear equations [6], volumetric mesh
processing [24], and linear-scaling electronic structure
calculations [7].
The extensive use of SpGEMM in data-intensive applications
has led to the development of several sequential and parallel
algorithms. Most of these algorithms are based on Gustavson’s
row-wise SpGEMM algorithm [19] where a row of the output
matrix is constructed by accumulating a subset of rows of the
second input matrix (see Figure 1 for details). It is the
accumulation (also called merging) technique that often
distinguishes major classes of SpGEMM algorithms from one
another. Popular data structures for accumulating rows or columns
of the output matrix include heap [3], hash [25], and sparse
accumulator (SPA) [16].
Recently, researchers have developed parallel heap-, hash-, and
SPA-based SpGEMM algorithms for shared-memory
platforms [2, 10, 18, 22, 31]. These algorithms are also packaged in
publicly-available software that can tremendously benefit many
scientific applications. However, when using an SpGEMM
algorithm and implementation for a scientific problem, one needs
answers to the following questions: (a) what is the best
algorithm/implementation for a problem at hand? (b) what is the
best algorithm/implementation for the architecture to be used in
solving the problem? These practically important questions remain
mostly unanswered for many scientific applications running on
highly-threaded architectures. This paper answers both questions
in the context of existing SpGEMM algorithms. That means our
focus is not to develop new parallel algorithms, but to characterize,
optimize and evaluate existing algorithms for real-world
applications on modern multicore and manycore architectures.
This paper addresses the following gaps in the literature.
First, previous algorithmic work did not pay close attention
to architecture-specific optimizations that have big impacts on
the performance of SpGEMM. We fill this gap by characterizing
the performance of SpGEMM on shared-memory platforms and
identifying bottlenecks in memory allocation and deallocation as
well as overheads in thread scheduling. We propose solutions to
mitigate those bottlenecks. Using microbenchmarks that model
SpGEMM access patterns, we also uncover reasons behind the
non-uniform performance boost provided by the MCDRAM on
ar
X
iv
:1
80
4.
01
69
8v
2 
 [c
s.D
C]
  2
6 J
un
 20
18
KNL. These optimizations resulted in efficient heap-based and hash-
table-based SpGEMM algorithms that outperform state-of-the-art
SpGEMM libraries including Intel MKL and Kokkos-Kernels [14]
for many practical problems.
Second, previous work has narrowly focused on one or two real
world application scenarios such as squaring a matrix and
studying SpGEMM in the context of AMG solvers [14, 26].
Different from the literature, our evaluation also includes use cases
that are representative of real graph algorithms, such as the
multiplication of a square matrix with a tall skinny one that
represent multi-source breadth-first search and the multiplication
of triangular matrices that is used in triangle counting. While in
the majority of the cases the hash-table-based SpGEMM algorithm
is dominant, we also find that different algorithms dominate
depending on matrix size, sparsity, compression factor, and
operation type. This in-depth analysis exposes many interesting
features of algorithms, applications, and multithreaded platforms.
Third, while many SpGEMM algorithms keep nonzeros sorted
within each row (or column) in increasing column (or row)
identifiers, this is not universally necessary for subsequent sparse
matrix operations. For example, CSparse [11, 12] assumes none of
the input matrices are sorted. Clearly, if an algorithm accepts its
inputs only in sorted format, then it must also emit sorted output
for fairness. This is the case with the heap-based algorithms.
However, hash based algorithm do not need their inputs sorted. In
this case we see a significant performance benefit due to skipping
the sorting of the output as well.
Based on these architecture- and application-centric
optimizations and evaluations, we make a recipe for selecting the
best-performing algorithm for a specific application scenario. This
recipe is based on empirical results, but also supported by
theoretical performance estimations. Therefore, this paper brings
various SpGEMM algorithms and libraries together, analyzes them
based on algorithm, application, and architecture related features
and provides exhaustive guidelines for SpGEMM-dependent
applications.
2 BACKGROUND AND RELATEDWORK
LetA,B be input matrices, and SpGEMM computes a matrixC such
thatC = AB whereA ∈ Rm×n , B ∈ Rn×k ,C ∈ Rm×k . The input and
output matrices are sparse and they are stored in a sparse format.
The number of nonzeros inmatrixA is denotedwith nnz(A). Figure 1
shows the skeleton of the most commonly implemented SpGEMM
algorithm, which is due to Gustavson [19]. When the matrices
are stored using the Compressed Sparse Rows (CSR) format, this
SpGEMM algorithm proceeds row-by-row on matrix A (and hence
on the output matrix C). Let ai j be the element in i-th row and j-th
column of matrix A and ai∗ be the i-th row of matrix A. The row
of matrix B corresponding to each non-zero element of matrix A is
read, and each non-zero element of output matrix C is calculated.
SpGEMM computation has three critical issues unlike dense
matrix multiplication. First issue is indirect memory access to
matrix data. As seen in Figure 1, the memory access to matrix B
depends on the non-zero elements of matrix A. Therefore, the
memory access to each non-zero element of matrix B is indirect,
and the cache miss may occur frequently. Secondly, the pattern
RowWise_SpGEMM(C,A,B)
1 // set matrix C to ∅
2 for ai∗ in matrix A in parallel
3 do for aik in row ai∗
4 do for bk j in row bk∗
5 do value ← aikbk j
6 if ci j < ci∗
7 then insert (cij ← value)
8 else cij ← cij + value
Figure 1: Pseudo code of Gustavson’s Row-wise SpGEMM
algorithm. The in parallel keyword does not exist in the
original algorithm but is used here to illustrate the common
parallelization pattern of this algorithm used by all known
implementations.
and the number of non-zero elements of output matrix are not
known beforehand. For this reason, the memory allocation of
output matrix becomes hard, and we need to select from two
strategies. One is a two-phase method, which counts the number
of non-zero elements of output matrix first (symbolic phase), and
then allocates memory and computes output matrix (numeric
phase). The other is that we allocate large enough memory space
for output matrix and compute. The former requires more
computation cost, and the latter uses much more memory space.
Finally, third issue is about combining the intermediate products
(value in Fig. 1) to non-zero elements of output matrix. Since the
output matrix is also sparse, it is hard to efficiently accumulate
intermediate products into non-zero elements. This procedure is a
performance bottleneck of SpGEMM computation, and it is
important to devise and select better accumulator for SpGEMM.
Since each row of C can be constructed independently of each
other, Gustavson’s algorithm is conceptually highly parallel. For
accumulation, Gustavson’s algorithm uses a dense vector and a list
of indices that hold the nonzero entries in the current active row.
This particular set of data structures used in accumulation are later
formalized by Gilbert et al. under the name of sparse accumulator
(SPA) [16]. Consequently, a naive parallelization of Gustavson’s
algorithm requires temporary storage of O(n t) where t is the
number of threads. For matrices with large dimensions, a
SPA-based algorithm can still achieve good performance by
“blocking" SPA in order to decrease cache miss rates. Patwary et al.
[26] achieved this by partitioning the data structure of B by
columns.
Sulatycke and Ghose [31] presented the first shared-memory
parallel algorithm for the SpGEMM problem, to the best of our
knowledge. Their parallel algorithm, dubbed IKJ method due to the
order of the loops, has a double-nested loop over the rows and the
column of the matrix A. Therefore, the IKJ method has work
complexity O(n2 + flop) where flop is the number of the
non-trivial scalar multiplications (i.e. those multiplications where
both operands are nonzero) required to compute the product.
Consequently, the IKJ method is only competitive when flop ≥ n2,
which is rare for SpGEMM.
Several GPU algorithms that are also based on the row-by-row
formulation are presented [21, 25]. These algorithms first bin the
2
rows based on their density due to the peculiarities of the GPU
architectures. Then, a poly-algorithm executes a different
specialized kernel for each bin, depending on its density. Two
recent algorithms that are implemented in both GPUs and CPUs
also follow the same row-by-row pattern, only differing on how
they perform the merging operation. ViennaCL [29]
implementation, which was first described for GPUs [18],
iteratively merges sorted lists, similar to merge sort.
KokkosKernels implementation [14], which we also include in our
evaluation, uses a multi-level hash map data structure.
The CSR format is composed of three arrays: row pointers array
(rpts) of length n + 1, column indices (cols) of length nnz, and
values (vals) of length nnz. Array rpts indexes the beginning and
end locations of nonzeros within each row such that the range
cols[rpts[i] . . . rpts[i + 1] − 1] lists the column indices of row i . The
CSR format does not specify whether this range should be sorted
with increasing column indices; that decision has been left to the
library implementation. As we will show in our results, there are
significant performance benefits of operating on unsorted CSR
format. Table 1 lists high-level properties of the codes we study in
this paper. Since MKL code is proprietary, we do not know the
accumulator.
Table 1: Summary of SpGEMM codes studied in this paper
Algorithm Phases Accumulator
Sortedness
(Input/Output)
MKL 2 - Any/Select
MKL-inspector 1 - Any/Unsorted
KokkosKernels 2 HashMap Any/Unsorted
Heap 1 Heap Sorted/Sorted
Hash/HashVector 2 Hash Table Any/Select
3 MICROBENCHMARKS ON INTEL KNL
Our experiments target Intel Xeon and Xeon Phi architectures. We
conducted some preliminary experiments to tune optimization
parameters to expose the performance bottlenecks. These
microbenchmarks are especially valuable for designing an
algorithm of SpGEMM for these architectures. Details of
evaluation environment are summarized in Table 3.
3.1 Scheduling Cost of OpenMP
When parallelizing a loop, OpenMP provides three thread
scheduling choices: static, dynamic and guided. Static scheduling
divides loop iterations equally among threads, dynamic scheduling
allocates iterations to threads dynamically, and guided scheduling
starts with static scheduling with smaller iterations and switch to
dynamic scheduling in the later part. Here, we experimentally
evaluate the scheduling cost of these three scheduling options on
Haswell and KNL processors by running simple program, which
only repeats loop iterations without doing anything in the loop.
We measure the time during loop iterations and the result is
shown in Figure 2. In load-balanced situation with a large number
of iterations, static scheduling takes very little scheduling
overhead than dynamic scheduling on both Haswell and KNL, as
25 27 29 211 213 215 217 219
#iterations
10−2
10−1
100
101
m
ill
i s
ec
on
ds
KNL static
KNL dynamic
KNL guided
Haswell static
Haswell dynamic
Haswell guided
Figure 2: OpenMP Scheduling Cost on Haswell and KNL
expected. The guided scheduling is also as expensive as dynamic
scheduling, especially on the KNL processor. Based on these
evaluations, we opt to use static scheduling in our SpGEMM
algorithms since the scheduling overhead of dynamic or guided
scheduling becomes an bottleneck.
3.2 Memory Allocation/Deallocation
To find a suitable memory allocation/deallocation scheme on KNL,
we performed a simple experiment: allocate a memory space,
access elements on the allocated memory and then deallocate it. To
contrast this “single” memory management scheme, we considered
a “parallel” approach where each thread independently
allocates/deallocates equal portion of the total requested memory
and accesses only its own allocated memory space. The latter
approach is described in Figure 3. We examined three ways to
allocate or deallocate memory; new/delete of C++, aligned
allocation (_mm_malloc/_free), and scalable_malloc/_free
provided by Intel TBB (Thread Building Block). Figure 4 shows the
results of “single” deallocation and “parallel” deallocation with 256
threads on KNL. Since aligned allocation showed nearly same
performance as C++, we show only the results of C++ and TBB.
All “single” allocators have extremely high cost for large memory:
over 100 milliseconds for the deallocation of 1GB memory space.
The “parallel” deallocation for large memory chunks is much
cheaper than “single” deallocation. The cost of “parallel”
deallocation suddenly rises at 8GB (C++) or 64GB (TBB), where
each thread allocates 32MB or 256MB, respectively. These
thresholds match those of “single” deallocation. On the other hand,
the cost of “parallel” deallocation for small memory space becomes
larger than “single” deallocation since “parallel” deallocation
causes the overheads of OpenMP scheduling and synchronization.
From this result, we compute the amount of memory required by
each thread and allocate this amount of thread-private memory in
each thread independently in order to reduce deallocation cost in
SpGEMM computation, which requires temporally memory
allocation and deallocation. In the following experiments in this
paper, the TBB is used for both “single” and “parallel” memory
allocation/deallocation to simply have performance gain.
3.3 DDR vs. MCDRAM
Intel KNL implements MCDRAM, which can accelerate
bandwidth-bound algorithms and applications. While MCDRAM
provides high bandwidth, its memory latency is larger than that of
DDR4. In row-wise SpGEMM (Algorithm 1), there are three main
3
1 eachN ← N /nthreads
2 Allocate(a, nthreads)
3 for tid ← to nthreads in parallel
4 do Allocate(a[tid], eachN )
5 do for i ← to eachN
6 do a[tid][i] ← i
7 do Deallocate(a[tid], eachN )
8 Deallocate(a[tnum])
Figure 3: Experiment of Parallel Memory Management. a is
an array of arrays. Total requested memory is N , and each
thread independently requests equal size, eachN .
21 23 25 27 29 211 213 215
Allocation/Deallocation Array Size [MB]
10−2
100
102
104
m
ill
i s
ec
on
ds
C++ (Single)
TBB (Single)
C++ (Parallel)
TBB (Parallel)
Figure 4: Cost of deallocation on KNL
types of data accesses for the formation of each row of C . First,
there is a unit-stride streaming access pattern arising from access
of the row pointers of A as well as the creation of the sparse
output vector ci∗. Second, access to rows of B follows a stanza-like
memory access pattern where small blocks (stanzas) of consecutive
elements are fetched from effectively random locations in memory.
Finally, updates to the accumulator exhibit different access pattern
depending on the type of the accumulator (a hash table, SPA, or
heap). The streaming access to the input vector is usually the
cheapest of the three and the accumulator access depends on the
data structure used. Hence, stanza access pattern is the most
canonical of the three and provides a decent proxy to study.
To quantify the stanza bandwidth which we expect to be quite
different than STREAM [23], we used a custom microbenchmark
that provides stanza-like memory access patterns (read or update)
with spatial locality varying from 8 bytes (random access) to the
size of the array (i.e. asymptotically the STREAM benchmark).
Figure 5 shows a comparison result between DDR only and use of
MCDRAM as Cache with scaling the length of contiguous memory
access. When the contiguous memory access is wider, both DDR
only and MCDRAM as Cache achieve their peak bandwidth, and
especially MCDRAM as Cache shows over 3.4× superior
bandwidth compared to DDR only. However, the use of MCDRAM
as Cache is incompatible with fine-grained memory access. When
the stanza length is small, there is little benefit of using MCDRAM.
This benchmark hints that it would be hard to get the benefits of
MCDRAM on very sparse matrices.
24 26 28 210 212 214
Contiguous Memory Access Length [Byte]
21
23
25
27
G
B
yt
e/
se
c
DDR only
MCDRAM as Cache
Figure 5: Benchmark result of random memory access with
DDR only or MCDRAM as Cache
4 ARCHITECTURE SPECIFIC OPTIMIZATION
OF SPGEMM
Based on our findings of performance critical bottlenecks on Intel
KNL, we design SpGEMM algorithms taking account in
architecture specific issues. First, we show light-weight thread
scheduling scheme with load-balancing for SpGEMM. Next, we
show the optimization schemes for hash table based SpGEMM,
which is proposed for GPU [25], and heap based shared-memory
SpGEMM algorithms [3]. Additionally, we extend the Hash
SpGEMM with utilizing vector registers of Intel Xeon or Xeon Phi.
Finally, we show which accumulator works well for target scenario
from the theoretical point of view by estimating each
accumulation cost.
4.1 Light-weight Load-balancing Thread
Scheduling Scheme
To achieve good load-balance with static scheduling, the bundle
of rows should be assigned to threads with equal computation
complexity before symbolic or numeric phase. Figure 6 shows how
to assign rows to threads. First, we count flop of each row, then do
prefix sum. Each thread can find the start point of rows by binary
search. lowbnd(vec, value) in line 14 finds the minimum id such
that vec[id] is larger than or equal to value. Each of these operations
can be executed in parallel.
4.2 Symbolic and Numeric Phases
We optimized two approaches of accumulation for KNL, one is
hash-table based algorithm and the other is heap based algorithm.
Furthermore, we add another version of Hash SpGEMM, where
hash probing is vectorized with AVX-512 or AVX2 instructions.
4.2.1 Hash SpGEMM. We use hash table for accumulator in
SpGEMMcomputation, based on GPUwork [25]. Figure 7 shows the
algorithm of Hash SpGEMM for multi- and many-core processors.
The allocation/deallocation scheme of hash table for each thread
is based on “parallel” approach like FIgure 3. We count a flop per
row of output matrix. The upper limit of any thread’s local hash
table size is the maximum number of flop per row within the rows
assigned to it. Each thread once allocates the hash table based
on its own upper limit and reuses that hash table throughout the
computation by reinitializing for each row. Next is about hashing
algorithm we adopted. A column index is inserted into hash table
4
RowsToThreads(offset,A,B)
1 // 1. Set FLOPS vector
2 for i ← 0 to m in parallel
3 do flop[i] ← 0
4 for j ← rptsA[i] to rptsA[i + 1]
5 do rnz ← rptsB[colsA[j] + 1] − rptB[colsA[j]]
6 flop[i] ← flop[i] + rnz
7 // 2. Assign rows to thread
8 flopps ← ParallelPrefixSum(flop)
9 sumflop ← flopps[m]
10 tnum ← omp_get_max_threads()
11 aveflop ← sumflop/tnum
12 offset[0] ← 0
13 for tid ← 1 to tnum in parallel
14 do offset[tid] ← lowbnd(flopps , aveflop ∗ tid)
15 offset[tnum] ← m
Figure 6: Load-balanced Thread Assignment
as key. Since the column index is no less than 0, the hash table is
initialized by storing−1. The column index is multiplied by constant
number and divided by hash table size to compute the remainder.
In order to compute modulus operation efficiently, the hash table
size is set as 2n (n is a integer). The hashing algorithm is based on
linear probing. Figure 8-(a) shows an example of hash probing on
16 entries hash table.
In symbolic phase, it is enough to insert keys to the hash table.
In numeric phase, however, we need to store the resulting value
data. Once the computation on the hash table finishes, the results
are sorted by column indices in ascending order (if necessary), and
stored to memory as output. This Hash SpGEMM for multi/many-
core processors differs from the GPU version as follows. While a
row of output is computed by multiple threads in the GPU version
to exploit massive number of threads on GPU, each row is processed
by a single thread in the present algorithm. Hash SpGEMM on GPU
requires some form of mutual exclusion since multiple threads
access the same entry of the hash table concurrently. We were
able to remove this overhead in our present Hash SpGEMM for
multi/many-core processors.
4.2.2 HashVector SpGEMM. Intel Xeon or Xeon Phi
implements 256 and 512-bit wide vector register, respectively. This
vector register reduces instruction counts and brings large benefit
to algorithms and applications, which require contiguous memory
access. However, sparse matrix computation has indirect memory
access, and hence it is hard to utilize vector registers. In this paper,
we utilize vector register for hash probing in our Hash SpGEMM
algorithm. The vectorization of hash probing is based on Ross [28].
Figure 8-(b) shows how HashVector algorithm works hash probing.
The size of hash table is 16, same as (a), and it is divided into
chunks based on vector width. A chunk consists of 8 entries on
Haswell since a key (= column index) is represented as 32-bit in
our evaluation. In HashVector, the hash indicates the identifier of
target chunk in hash table. In order to examine the keys in the
chunk, we use comparison instruction with vector register. If the
entry with target key is found, the algorithm finishes the probing
for the element in symbolic phase. In numeric phase, the target
Hash_SpGEMM(C,A,B)
1 RowsToThreads(offset,A,B)
2 Allocate(tableid , tnum)
3 Allocate(tableval , tnum)
4 // Determine hash table size for each thread
5 for tid ← 0 to tnum
6 do sizet ← 0
7 for i ← offset[tid] to offset[tid + 1]
8 do sizet ← max(sizet , flop[i])
9 // Required maximum hash table size is Ncol
10 sizet ← min(Ncol , sizet )
11 // Return minimum 2n so that 2n > sizet
12 sizet ← lowest_p2(sizet )
13 Allocate(tableid [tid], sizet )
14 Allocate(tableval[tid], sizet )
15 Allocate(rptsC ,Nrow +1)
16 Symbolic(rptsC ,A,B)
17 Allocate(colsC , rptsC[Nrow])
18 Allocate(valsC , rptsC[Nrow])
19 Numeric(C,A,B)
Figure 7: Hash SpGEMM Pseudocode
1) Check the entry 2) If hash is collided, check next entry
3) If the entry is empty, add the element
1) Check multiple entries
with vector register 
2) If the element is not found and the 
row has empty entry, add the element
(a) Hash
(b) HashVector
: element to be added : non-empty entry : empty entry
Figure 8: Hash Probing in Hash and HashVector SpGEMM
entry in chunk is identified by __builtin_ctz function, which
counts trailing zeros, and the multiplied value is added to the value
of the entry. If the algorithm finds no entry with the key, the
element is pushed to the hash table. In HashVector, new element is
pushed into the table in order from the beginning. The entries in
chunk are compared with the initial value of hash table, -1, by
using vector register. The beginning of empty entries can be found
by counting the number of bit flags of comparison result. When
the chunk is occupied with other keys, the next chunk is to be
checked in accordance with linear probing. Since Hash vector
SpGEMM can reduce the number of probing caused by hash
collision, it can achieve better performance compared to Hash
SpGEMM. However, HashVector requires a few more instructions
for each check. Thus, HashVector may degrade the performance
when the collisions in Hash SpGEMM are rare.
4.2.3 Heap SpGEMM. In another variant of SpGEMM [3], we
use a priority queue (heap) – indexed by column indices – to
accumulate each row of C . To construct ci∗, a heap of size nnz(ai∗)
5
is allocated. For every nonzero aik , the first nonzero entry in bk∗
along with its column index is inserted into the heap. The
algorithm iteratively extracts an entry with the minimum column
index from the heap, accumulates it to ci∗, and inserts the next
nonzero entry from the last extracted row of B into the heap.
When the heap becomes empty, the algorithm moves to construct
the next row of C .
Heap SpGEMM can be more expensive than hash- and SPA-
based algorithms because it requires logarithmic time to extract
elements from the heap. However, from the accumulator point
of view, Heap SpGEMM is space efficient as it uses O(nnz(ai∗))
memory to accumulate ci∗ instead ofO(flop(ci∗)) andO(n)memory
used by hash- and SPA-based algorithms, respectively.
Our implementation of Heap SpGEMM adopts the one-phase
method, which require larger memory usage for temporally
keeping the output. In parallel Heap SpGEMM, because rows of C
are independently constructed by different threads, this temporary
memory use for keeping output is thread-independent and we can
adapt “parallel” approach for memory management.
Thread-private heaps are also managed with “parallel” approach.
As with the Hash algorithm, Heap SpGEMM estimates flop via a
symbolic step and uses it to balance computational load evenly
among threads.
4.2.4 Performance Estimation for A Recipe. To estimate how our
algorithms would perform in practice, we show their computation
costs. As described, Heap SpGEMM requires logarithmic time to
extract elements to the heap. The complexity of Heap SpGEMM is:
Theap =
n∑
i=1
(flop(ci∗) ∗ log nnz(ai∗)) (1)
On the other hand, Hash SpGEMM requires O(1) cost to explore in
its hash table if there is no hash collision. We introduce c , collision
factor, which is an average number of probing to detect/insert
the key. When c = 1, no hash collision occurs during SpGEMM
computation. In addition to this hash probing cost, Hash SpGEMM
requires sorting, which takesO(n logn) time, if an application needs.
The computational complexity of Hash SpGEMM is:
Thash = flop ∗ c +
n∑
i=1
(nnz(ci∗) ∗ log nnz(ci∗)) (2)
From (1) and (2), Hash SpGEMM tends to achieve superior
performance to Heap SpGEMM when nnz(ci∗) or
flop(ci∗)/nnz(ci∗) is large. Denser input matrices make output
matrix denser too. Also, the multiplication of input matrices with
regular non-zero patterns output a regular matrix, and in that case,
flop(ci∗)/nnz(ci∗) is large. Thus, we can guess that Hash becomes
better choice when input matrices are dense or have regular
structure.
5 EXPERIMENTAL RESULTS
Different SpGEMM algorithms can dominate others depending on
the aspect ratio (i.e. ratio of its dimensions), density, sparsity
structure, and size (i.e. dimensions) of its inputs. To provide a
comprehensive and fair assessment, we evaluate SpGEMM codes
under several different scenarios. For the case where input and
output matrices are sorted, we evaluate MKL, Heap and
Table 2: Matrix data used in our experiments (all numbers
are in millions)
Matrix n nnz(A) flop(A2) nnz(A2)
2cubes_sphere 0.101 1.65 27.45 8.97
cage12 0.130 2.03 34.61 15.23
cage15 5.155 99.20 2,078.63 929.02
cant 0.062 4.01 269.49 17.44
conf5_4-8x8-05 0.049 1.92 74.76 10.91
consph 0.083 6.01 463.85 26.54
cop20k_A 0.121 2.62 79.88 18.71
delaunay_n24 16.777 100.66 633.91 347.32
filter3D 0.106 2.71 85.96 20.16
hood 0.221 10.77 562.03 34.24
m133-b3 0.200 0.80 3.20 3.18
mac_econ_fwd500 0.207 1.27 7.56 6.70
majorbasis 0.160 1.75 19.18 8.24
mario002 0.390 2.10 12.83 6.45
mc2depi 0.526 2.10 8.39 5.25
mono_500Hz 0.169 5.04 204.03 41.38
offshore 0.260 4.24 71.34 23.36
patents_main 0.241 0.56 2.60 2.28
pdb1HYS 0.036 4.34 555.32 19.59
poisson3Da 0.014 0.35 11.77 2.96
pwtk 0.218 11.63 626.05 32.77
rma10 0.047 2.37 156.48 7.90
scircuit 0.171 0.96 8.68 5.22
shipsec1 0.141 7.81 450.64 24.09
wb-edu 9.846 57.16 1,559.58 630.08
webbase-1M 1.000 3.11 69.52 51.11
Hash/HashVector, and for the case where they are unsorted we
evaluate MKL, MKL-inspector, KokkosKernels (with ‘kkmem’
option) and Hash/HashVector. Each performance number in the
following part is the average of ten SpGEMM runs.
5.1 Input Types
We use two types of matrices for the evaluation. We generate
synthetic matrix using matrix generator, and take matrices from
SuiteSparse Matrix Collection (Formerly University of Florida
Sparse Matrix Collection) [13]. For the evaluation of unsorted
output, the column indices of input matrices are randomly
permuted. We use 26 sparse matrices used in [4, 14, 21]. The
matrices are listed in Table 2.
We use R-MAT [9], the recursive matrix generator, to generate
two different non-zero patterns of synthetic matrices represented
as ER and G500. ER matrix represents Erdős-Rényi random graphs,
and G500 represents graphs with power-law degree distributions
used for Graph500 benchmark. These matrices are generated with
R-MAT seed parameters; a = b = c = d = 0.25 for ER matrix and
a = 0.57,b = c = 0.19,d = 0.05 for G500 matrix. A scale n matrix
represents 2n -by-2n . The edge factor parameter for these generators
is the average number of non-zero elements per row (or column)
of the matrix. In other words, it is the ratio of nnz to n.
6
Table 3: Overview of Evaluation Environment (Cori system)
Haswell cluster KNL cluster
CPU
Intel Xeon
Processor E5-2698 v3
Intel Xeon Phi
Processor 7250
#Sockets 2 1
#Cores/socket 16 68
Clock 2.3GHz 1.4GHz
L1 cache 32KB/core 32KB/core
L2 cache 256KB/core 1MB/tile
L3 cache 40MB per socket -
Memory
DDR4 128GB 96GB
MCDRAM - 16GB
Software
OS SuSE Linux Enterprise Server
Compiler Intel C++ Compiler (icpc) ver18.0.0
Option -g -O3 -qopenmp
5.2 Experimental Environment
We evaluate the performance of SpGEMM on a single node of
the Cori supercomputer at NERSC. Cori system consists of two
partitions; one is Intel Xeon Haswell cluster (Haswell), and the
other is Intel KNL cluster. We use nodes from both partitions of
Cori. Details are summarized in Table 3.
The Haswell and KNL processors provide hyperthreading with
2 or 4 threads for each core respectively. We set the number of
threads as 68, 136, 204 or 272 for KNL, and 16, 32 or 64 for Haswell.
For the evaluation of Kokkos on KNL, we set 256 threads instead of
272 threads since the execution fails on more than 256 threads. We
show the result with the best thread count. For the evaluation on
KNL, we set “quadrant” cluster mode, and mainly “Cache” memory
mode. To select DDR4 or MCDRAM with “Flat” memory mode, we
use “numactl -p”. The thread affinity is set as
“KMP_AFFINITY=‘granularity=fine’,scatter”.
5.3 Preliminary Evaluation on KNL
5.3.1 Advantage of Performance Optimization on KNL for
SpGEMM. We examined performance difference between OpenMP
scheduling and ways to allocate memory. Figure 9 shows the
performance of Heap SpGEMM for squaring G500 matrices with
edge factor 16. When simply parallelizing SpGEMM by row, we
cannot achieve higher performance because of load imbalance
with static scheduling or expensive scheduling overhead with
dynamic/guided scheduling. On the other hand, our light-weight
load-balancing thread scheduling scheme, ‘balanced’, works well
on SpGEMM. For larger inputs, Heap SpGEMM temporally
requires large memory use, whose deallocation causes
performance degradation. Our “parallel” memory management
scheme ‘balanced parallel’ reduces the overhead of memory
deallocation for temporal memory use, and keeps high
performance on large size inputs.
5.3.2 DDR vs MCDRAM. We examine the benefit of using
MCDRAM over DD4 memory by squaring G500 matrices with or
without using MCDRAM on KNL. Figure 10 shows the speedup
6 8 10 12 14 16 18
Scale
200
400
600
800
M
FL
O
P
S
static
dynamic
guided
balanced single
balanced parallel
Figure 9: Performance of Heap SpGEMM scaling with size of
G500 inputs on KNL with Cache mode
22 23 24 25 26
Edge Factor
0.9
1.0
1.1
1.2
1.3
1.4
S
pe
ed
up
 w
ith
 M
C
D
R
A
M
Heap
Hash
HashVec
Hash (unsorted)
HashVec (unsorted)
Figure 10: Speedups attained with the use of Cache mode
on KNL compared to Flat mode on DDR4. G500 (scale 15)
matrices are used with different edge factors.
attained with the Cache mode against the Flat mode on DDR4 for
various matrix densities. We observe that Hash SpGEMM
algorithms can be benefitted, albeit moderately, from MCDRAM
when denser matrices are multiplied. This observation is
consistent with the benchmark shown in Figure 5. The limited
benefit stems from the fact that SpGEMM frequently requires
indirect fine-grained memory accesses often as small as 8 bytes.
On denser matrices, MCDRAM can still bring benefit from
contiguous memory accesses of input matrices. By contrast, Heap
SpGEMM is not benefitted from high-bandwidth MCDRAM
because of its fine-grained memory accesses. The performance of
Heap SpGEMM even degrades when edge factor is 64 at which
point the memory requirement of Heap SpGEMM surpasses the
capacity of MCDRAM.
5.4 Squaring a matrix
Multiplying a sparse matrix by itself is a well-studied SpGEMM
scenario. Markov clustering is an example of this case, which
requires A2 for a given doubly-stochastic similarity matrix. We
evaluate this case extensively, under using real and synthetically
generated data. For synthetic data, we provide experiments with
varying density (for a fixed sized matrix) and with varying size (for
a fixed density).
7
4 8 16
Edge Factor
0
500
1000
1500
2000
M
FL
O
P
S
KNL (Cache) / ER
MKL Heap Hash HashVec MKL (unsorted) MKL-inspector (unsorted) Kokkos (unsorted) Hash (unsorted) HashVec (unsorted)
4 8 16
Edge Factor
0
250
500
750
1000
1250
1500
1750 KNL (Cache) / G500
4 8 16
Edge Factor
0
500
1000
1500
2000
2500
3000
3500 Haswell / ER
4 8 16
Edge Factor
0
500
1000
1500
2000
2500
3000 Haswell / G500
Figure 11: Scaling with increasing density (scale 16) on KNL (left) and Haswell (right)
8 10 12 14 16 18 20
Scale
0
1000
2000
3000
M
FL
O
P
S
KNL (Cache) / ER
MKL Heap Hash HashVec MKL (unsorted) MKL-inspector (unsorted) Kokkos (unsorted) Hash (unsorted) HashVec (unsorted)
8 10 12 14 16
Scale
0
1000
2000
3000
M
FL
O
P
S
KNL (Cache) / G500
8 10 12 14 16 18 20
Scale
0
2000
4000
6000
M
FL
O
P
S
Haswell / ER
8 10 12 14 16
Scale
0
1000
2000
3000
4000
5000
M
FL
O
P
S
Haswell / G500
Figure 12: Scaling with size on KNL with Cache mode (top) and Haswell (bottom), both with edge factor 16
5.4.1 Scaling with Density. Figure 11 shows the result of
scaling with density. When output is sorted, MKL’s performance
degrades with increasing density. When the output is not sorted,
increased density generally translates into increased performance.
The performance of all codes except MKL increases significantly as
the ER matrices get denser, but such performance gains are not so
pronounced for G500 matrices. For G500 matrices, we see
significant performance difference between KNL and Haswell
results. While Hash shows superior performance on KNL,
HashVector achieves much higher performance on Haswell. Also
for G500 matrices, we see that MKL (both sorted and unsorted)
and MKL-inspector achieves a peak in performance at edgecount 8,
with performance degrading as the matrices get denser or sparser
than that sweet spot. Hash and HashVector might not have peaked
at these density ranges we experimented since they get faster as
the matrices get denser.
5.4.2 Scaling with Input Size. Evaluation is running on ER and
G500 matrices with scaling the size from 7 to 20 or 17 respectively.
The edge factor is fixed as 16. Figure 12 shows the results on KNL
(top) and Haswell (bottom). On KNL, MKL family with unsorted
output shows good performance for ER matrices with small scale.
However, for large scale matrices, MKL goes down, and Heap and
Hash overcome. Especially, Hash and HashVector keep high
performance even for large scale matrices. This performance trend
becomes more clear on Haswell. When the scale is about 13, the
performance gap between sorted and unsorted is large, and it
becomes smaller when the scale is getting large. This is because
the cost of computation with hash table or heap becomes larger,
and the advantage of removing sorting phase becomes relatively
smaller. For G500 matrices, whose non-zero elements of each row
are skewed, the performance of MKL is terrible even if its output is
unsorted. Since there is no issue about load-balance in Heap and
Hash kernels, they show stable performance as ER matrices.
5.4.3 Scaling with Thread Count. Figure 13 shows the
scalability analysis of KNL on ER and G500 matrices with scale=16
8
20 21 22 23 24 25 26 27 28
Number of Threads
23
24
25
26
27
28
29
210
211
M
FL
O
P
S
KNL (Cache) / ER
Heap Hash HashVec MKL (unsorted) MKL-inspector (unsorted) Kokkos (unsorted) Hash (unsorted) HashVec (unsorted)
20 21 22 23 24 25 26 27 28
Number of Threads
23
24
25
26
27
28
29
210
211
KNL (Cache) / G500
Figure 13: Strong scaling with thread count on KNL with Cache mode with ER (left) and G500 inputs (right). Data used is of
scale 16 with edge factor 16
20 21 22 23 24 25
Compression Ratio
28
29
210
211
212
213
M
FL
O
P
S
Sorted
MKL Heap Hash HashVec
20 21 22 23 24 25
Compression Ratio
28
29
210
211
212
213
214
M
FL
O
P
S
Unsorted
MKL
MKL-inspector
Kokkos
Hash
HashVec
Figure 14: Scaling with compression ratio of SuiteSparse matrices on KNL with Cache mode. The algorithms that operate on
sorted matrices (both input & output) are on the left and those that operate on unsorted matrices are on the right.
and edge factor=16. Each kernel is executed with 1, 2, 4, 8, 16, 32,
64, 68, 128, 136, 192, 204, 256 or 272 threads. We do not show the
result of MKL with sorted output since it takes much longer
execution time compared to other kernels. All kernels show good
scalability until around 64 threads, but MKL with unsorted output
has no improvement over 68 threads. On the other hand, Heap and
Hash/HashVector get further improvement over 64 threads.
5.4.4 Sensitivity to Compression Ratio on Real Matrices. We
evaluate SpGEMM performance on 26 real matrices listed in
Table 2 on KNL. Figure 14 shows the result with sorted output and
unsorted output respectively in ascending order of compression
ratio (= flop / number of non-zero elements of output). Lines in the
graph are linear fitting for each kernel. First we discuss the result
with sorted matrices. The performance of Heap is stable regardless
of compression ratio while MKL gets better performance with
higher compression ratio. The matrices about graph processing
with low compression ratio cause load imbalance and performance
degradation on MKL. In contrast, Hash outperforms MKL on most
of matrices, and shows high performance independent from
compression ratio. For unsorted matrices, we add KokkosKernels
to the evaluation, but it underperforms other kernels in this test.
The performance of Hash SpGEMM is best for the matrices with
low compression ratio and MKL-inspector shows significant
improvement especially for the matrices with high compression
ratio.
Comparing sorted and unsorted versions of algorithm that
provide the flexibility, we see consistent performance boost of
keeping the output sorted. In particular, the harmonic mean of the
speedups achieved operating on unsorted data over all real
matrices we have studied from the SuiteSparse collection on KNL
is 1.58× for MKL, 1.63× for Hash, and 1.68× for HashVector.
5.4.5 Profile of Relative Performance of SpGEMM Algorithms.
We compare the relative performance of different SpGEMM
algorithms with performance profile plots [15]. To profile the
relative performance of algorithms, the best performing algorithm
for each problem is identified and assigned a relative score of 1.
Other algorithms are scored relative to the best performing
algorithm, with a higher value denoting inferior performance for
that particular problem. For example, if algorithm A and B solve
the same problem in 1 and 3 seconds, their relative performance
scores will be 1 and 3, respectively. Figure 15 shows the profiles of
9
1 1.5 2 2.5 3 3.5 4 4.5 5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Performance Relative to the Best Algorithm
Fr
ac
tio
n 
of
 P
ro
bl
em
s
Sorted
 
 
Hash
HashVector
MKL
Heap
1 1.5 2 2.5 3 3.5 4
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Performance Relative to the Best Algorithm
Fr
ac
tio
n 
of
 P
ro
bl
em
s
Unsorted
 
 
Hash
HashVector
MKL
MKL−inspector
Kokkos
Figure 15: Performance profiles of SuiteSparse matrices on KNL with Cache mode using sorted (left) and unsorted (right)
algorithms.
10 12 14 16
Scale of Short Side
0
250
500
750
1000
1250
1500
M
FL
O
P
S
Scale of Long Side = 18
Heap Hash HashVec MKL (unsorted) MKL-inspector (unsorted) Kokkos (unsorted) Hash (unsorted) HashVec (unsorted)
10 12 14 16
Scale of Short Side
0
200
400
600
800
1000
1200
Scale of Long Side = 19
10 12 14 16
Scale of Short Side
0
200
400
600
800
1000 Scale of Long Side = 20
Figure 16: SpGEMM between square and tall-skinny matrices on KNL with Cache mode (scales 18, 19, and 20)
relative performance of different SpGEMM algorithms for all 26
matrices from Table 2. Hash is clearly the best performer for sorted
matrices as it outperforms all other algorithms for 70% matrices
and its runtime is always within 1.6× of the best algorithm. Hash
is followed by HashVector, MKL and Heap algorithms in
decreasing order of overall performance. For unsorted matrices,
Hash, HashVector and MKL-inspector all perform equally well for
most matrices (each of them performs the best for about 40%
matrices). They are followed by MKL and KokkosKernels, with the
latter being the worst performer for unsorted matrices.
5.5 Square x Tall-skinny matrix
Many graph processing algorithms perform multiple breadth-first
searches (BFSs) in parallel, an example being Betweenness
Centrality on unweighted graphs. In linear algebraic terms, this
corresponds to multiplying a square sparse matrix with a
tall-skinny one. The left-hand-side matrix represents the graph
and the right-hand-side matrix represent the stack of frontiers,
each column representing one BFS frontier. In the
memory-efficient implementation of the Markov clustering
algorithm [5], a matrix is multiplied with a subset of its column,
representing another use case of multiplying a square matrix with
a tall-skinny matrix. In our evaluations, we generate the
tall-skinny matrix by randomly selecting columns from the graph
itself. Figure 16 shows the result of SpGEMM between square and
tall-skinny matrices. We set scale as 18, 19 or 20 for square matrix,
and as 10, 12, 14 or 16 for short size of tall-skinny matrix. The
non-zero pattern of generated matrix is G500 with edge factor=16.
The result of square x tall-skinny follows that of A2 (upper right in
Figure 12). Both for sorted and unsorted cases, Hash or HashVec is
the best performer.
5.6 Triangle counting
We also evaluate the performance of SpGEMM used in triangle
counting [4]. The original input is the adjacency matrix of an
undirected graph. For optimal performance in triangle counting,
we reorder rows with increasing number of nonzeros. The
algorithm then splits the reordered matrix A as A = L +U , where L
is a lower triangular matrix and U is an upper triangular matrix.
We evaluate the SpGEMM performance of the next step, where
L ·U is computed to generate all wedges. After preprocessing the
input matrix, we compute SpGEMM between L and U . Figure 17
shows the result with sorted output in ascending order of
compression ratio on KNL. Lines in the graph are linear fitting for
each kernel. Basically, the result shows similar performance trend
to that of A2. Hash and HashVector generally overwhelm MKL for
10
20 21 22 23 24
Compression Ratio
26
27
28
29
210
211
212
M
FL
O
P
S
MKL
Heap
Hash
HashVec
Figure 17: The performance of SpGEMM between L and U
triangular matrices when used to count triangles on KNL
with Cache mode
Table 4: Summary of best SpGEMM algorithms on KNL
(a) Real data specified by compression ratio (CR)
High CR (> 2) Low CR (≤ 2)
A x A Sorted Hash Hash
Unsorted MKL-inspector Hash
L x U Sorted Hash Heap
(b) Synthetic data specified by sparsity (edge factor, EF) and non-zero pattern
Sparse (EF ≤ 8) Dense (EF > 8)
Uniform Skewed Uniform Skewed
A x A Sorted Heap Heap Heap Hash
Unsorted HashVec HashVec HashVec Hash
TallSkinny Sorted - Hash - HashVec
Unsorted - Hash - Hash
any compression ratio. One big difference from A2 is that Heap
performs the best for inputs with low compression ratios.
5.7 Empirical Recipe for SpGEMM on KNL
From our evaluation results, best algorithm is different by target use
case and inputs. We summarize the recipe to choose best algorithm
for SpGEMM on KNL in Table 4. The recipe for real data is based
on compression ratio, which effects the dominant code. For low
compression ratios, especially LxU where output is sparser, Heap
shines. In the other cases, Hash and MKL-inspector dominates the
high-compression ratio scenarios. On synthetic data, our hash-table-
based implementations dominate others for almost cases, and Heap
works well for sparser matrices or uniform data. We note thatA2 for
uniform inputmatrices shows low compression ratio. This empirical
recipe was already predicted by our analysis in Section 4.2.4.
6 CONCLUSIONS
We studied the performance of computing the multiplication of
two sparse matrices on multicore and Intel KNL architectures. This
primitive, known as SpGEMM, has recently gained attention in
the GPU community, but there has been relatively less work on
CPUs and other accelerators. We have tried to fill that gap by
evaluating publicly accessible implementations, including those in
proprietary libraries. From architecture point of view, we develop
the optimized Heap and Hash SpGEMM algorithms for multicore
and Intel KNL architectures. Performance evaluation shows that
our optimized SpGEMM algorithms largely overcome Intel MKL
and Kokkos-kernel.
Our work provides multiple recipes. One is for the
implementers of new algorithms on highly-threaded x86
architectures. We have found that the impact of memory allocation
and deallocation to be significant enough to warrant optimization
as without them SpGEMM performance does not scale well with
increasing number of threads. We have also uncovered the impact
of MCDRAM for the SpGEMM primitives. When the matrices are
sparser than a threshold (≈ 4 nonzeros on average per row), the
impact of MCDRAM is minimal because in that regime the
computation becomes close to latency bound. On the other than,
MCDRAM shines as matrices get denser because then SpGEMM
becomes primarily bandwidth bound and can take advantage of
the higher bandwidth available on MCDRAM. The second recipe is
for the users. Our results show that different codes dominate on
different inputs, and we clarify which SpGEMM algorithm works
well from both theoretical and empirical points of view. For
example, MKL is a perfectly reasonable option for small matrices
with uniform nonzero distributions. However, our heap and
hash-table-based implementations dominate others for larger
matrices. Similarly, compression ratio also effects the dominant
code. Our results also highlight the benefits of leaving matrices
(both inputs and output) unsorted whenever possible as the
performance savings are significant for all codes that allow both
options. Finally, this optimization strategy for acquiring these two
recipes is beneficial for optimization of SpGEMM on future
architectures.
REFERENCES
[1] Sandeep R Agrawal, Christopher M Dee, and Alvin R Lebeck. 2016. Exploiting
accelerators for efficient high dimensional similarity search. In PPoPP. ACM.
[2] Pham Nguyen Quang Anh, Rui Fan, and Yonggang Wen. 2016. Balanced Hashing
and Efficient GPU Sparse General Matrix-Matrix Multiplication. In ICS. ACM,
New York, NY, USA, Article 36.
[3] Ariful Azad, Grey Ballard, Aydin Buluç, James Demmel, Laura Grigori, Oded
Schwartz, Sivan Toledo, and Samuel Williams. 2016. Exploiting multiple levels
of parallelism in sparse matrix-matrix multiplication. SIAM Journal on Scientific
Computing 38, 6 (2016), C624–C651.
[4] Ariful Azad, Aydın Buluç, and John Gilbert. 2015. Parallel triangle counting and
enumeration using matrix algebra. In IPDPSW.
[5] Ariful Azad, Georgios A Pavlopoulos, Christos A Ouzounis, Nikos C Kyrpides,
and Aydin Buluç. 2018. HipMCL: a high-performance parallel implementation of
the Markov clustering algorithm for large-scale networks. Nucleic acids research
(2018).
[6] Grey Ballard, Christopher Siefert, and Jonathan Hu. 2016. Reducing
communication costs for sparse matrix multiplication within algebraic multigrid.
SIAM Journal on Scientific Computing 38, 3 (2016), C203–C231.
[7] Nicolas Bock, Matt Challacombe, and Laxmikant V Kalé. 2016. Solvers for O(N)
Electronic Structure in the Strong Scaling Limit. SIAM Journal on Scientific
Computing 38, 1 (2016), C1–C21.
[8] Aydın Buluç and John R. Gilbert. 2011. The Combinatorial BLAS: Design,
Implementation, and Applications. IJHPCA 25, 4 (2011), 496–509.
[9] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. 2004. R-MAT: A
recursive model for graph mining. In Proceedings of the 2004 SIAM International
Conference on Data Mining. SIAM, 442–446.
[10] Steven Dalton, Luke Olson, and Nathan Bell. 2015. Optimizing sparse matrix—
matrix multiplication for the gpu. ACM Transactions on Mathematical Software
(TOMS) 41, 4 (2015), 25.
[11] Timothy A Davis. [n. d.]. private communication. ([n. d.]).
[12] Timothy A Davis. 2006. Direct methods for sparse linear systems. SIAM.
11
[13] Timothy A Davis and Yifan Hu. 2011. The University of Florida sparse matrix
collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 1.
[14] Mehmet Deveci, Christian Trott, and Sivasankaran Rajamanickam. 2017.
Performance-Portable Sparse Matrix-Matrix Multiplication for Many-Core
Architectures. In IPDPSW. IEEE, 693–702.
[15] Elizabeth D Dolan and Jorge J Moré. 2002. Benchmarking optimization software
with performance profiles. Mathematical programming 91, 2 (2002), 201–213.
[16] John R Gilbert, Cleve Moler, and Robert Schreiber. 1992. Sparse matrices in
MATLAB: Design and implementation. SIAM J. Matrix Anal. Appl. 13, 1 (1992),
333–356.
[17] John R. Gilbert, Steve Reinhardt, and Viral B. Shah. 2007. High performance
graph algorithms from parallel sparse matrices. In PARA. 260–269.
[18] Felix Gremse, Andreas Hofter, Lars Ole Schwen, Fabian Kiessling, and Uwe
Naumann. 2015. GPU-Accelerated Sparse Matrix-Matrix Multiplication by
Iterative Row Merging. SIAM Journal on Scientific Computing 37, 1 (2015), C54–
C71.
[19] Fred G Gustavson. 1978. Two fast algorithms for sparse matrices: Multiplication
and permuted transposition. ACM TOMS 4, 3 (1978), 250–269.
[20] Guoming He, Haijun Feng, Cuiping Li, and Hong Chen. 2010. Parallel SimRank
computation on large graphs with iterative aggregation. In SIGKDD. ACM.
[21] Weifeng Liu and Brian Vinter. 2014. An efficient GPU general sparse matrix-
matrix multiplication for irregular data. In IPDPS. IEEE, 370–381.
[22] Kiran Matam, Siva Rama Krishna Bharadwaj Indarapu, and Kishore Kothapalli.
2012. Sparse Matrix-Matrix Multiplication on Modern Architectures. In HiPC.
IEEE.
[23] John D. McCalpin. 1991-2007. STREAM: Sustainable Memory Bandwidth in High
Performance Computers. Technical Report. University of Virginia.
[24] Johannes Sebastian Mueller-Roemer, Christian Altenhofen, and André Stork.
2017. Ternary Sparse Matrix Representation for Volumetric Mesh Subdivision
and Processing on GPUs. In Computer Graphics Forum, Vol. 36.
[25] Yusuke Nagasaka, Akira Nukada, and Satoshi Matsuoka. 2017. High-Performance
and Memory-Saving Sparse General Matrix-Matrix Multiplication for NVIDIA
Pascal GPU. In ICPP. IEEE, 101–110.
[26] Md Mostofa Ali Patwary, Nadathur Rajagopalan Satish, Narayanan Sundaram,
Jongsoo Park, Michael J Anderson, Satya Gautam Vadlamudi, Dipankar Das,
Sergey G Pudov, Vadim O Pirogov, and Pradeep Dubey. 2015. Parallel efficient
sparse matrix-matrix multiplication on multicore platforms. In ISC. Springer,
48–57.
[27] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. 2007. Near linear
time algorithm to detect community structures in large-scale networks. Physical
review E 76, 3 (2007), 036106.
[28] Kenneth A Ross. 2007. Efficient Hash Probes on Modern Processors. In ICDE.
IEEE, 1297–1301.
[29] Karl Rupp, Philippe Tillet, Florian Rudolf, Josef Weinbub, Andreas Morhammer,
Tibor Grasser, Ansgar JuÌĹngel, and Siegfried Selberherr. 2016. ViennaCL—
Linear Algebra Library for Multi-and Many-Core Architectures. SIAM Journal
on Scientific Computing 38, 5 (2016), S412–S439.
[30] Viral B. Shah. 2007. An Interactive System for Combinatorial Scientific Computing
with an Emphasis on Programmer Productivity. Ph.D. Dissertation. University of
California, Santa Barbara.
[31] Peter D Sulatycke and Kanad Ghose. 1998. Caching-efficient multithreaded fast
multiplication of sparse matrices. In IPPS/SPDP. IEEE.
12
