SpArch: Efficient Architecture for Sparse Matrix Multiplication by Zhang, Zhekai et al.
The 26th IEEE International Symposium on High-Performance Computer Architecture (HPCA 2020)
SpArch: Efficient Architecture for Sparse Matrix Multiplication
Zhekai Zhang∗, Hanrui Wang∗, Song Han
EECS
Massachusetts Institute of Technology
Cambridge, MA, US
{zhangzk, hanrui, songhan}@mit.edu
William J. Dally
Electrical Engineering
Stanford University / NVIDIA
Stanford, CA, US
dally@stanford.edu
Abstract—Generalized Sparse Matrix-Matrix Multiplication
(SpGEMM) is a ubiquitous task in various engineering and
scientific applications. However, inner product based SpGEMM
introduces redundant input fetches for mismatched nonzero
operands, while outer product based approach [1] suffers from
poor output locality due to numerous partial product matrices.
Inefficiency in the reuse of either inputs or outputs data leads
to extensive and expensive DRAM access.
To address this problem, this paper proposes an efficient
sparse matrix multiplication accelerator architecture, SpArch,
which jointly optimizes the data locality for both input and
output matrices. We first design a highly parallelized streaming-
based merger to pipeline the multiply and merge stage of partial
matrices so that partial matrices are merged on chip imme-
diately after produced. We then propose a condensed matrix
representation that reduces the number of partial matrices by
three orders of magnitude and thus reduces DRAM access by
5.4×. We further develop a Huffman tree scheduler to improve
the scalability of the merger for larger sparse matrices, which
reduces the DRAM access by another 1.8×. We also resolve the
increased input matrix read induced by the new representation
using a row prefetcher with near-optimal buffer replacement
policy, further reducing the DRAM access by 1.5×. Evaluated
on 20 benchmarks, SpArch reduces the total DRAM access
by 2.8× over previous state-of-the-art. On average, SpArch
achieves 4×, 19×, 18×, 17×, 1285× speedup and 6×, 164×,
435×, 307×, 62× energy savings over OuterSPACE, MKL,
cuSPARSE, CUSP, and ARM Armadillo, respectively.
Keywords-Sparse Matrix Multiplication; Domain-Specific
Architecture; Specialized Accelerators; Huffman-Tree; Data
Reuse
I. INTRODUCTION
Generalized sparse matrix-matrix multiplication (SpG-
EMM) is the key computing kernel for many algorithms
such as compressed deep neural networks [2], [3], [4],
[5], triangle counting [6], Markov clustering [7], searching
algorithms [8], [9], and matching algorithms [10]. It is also
ubiquitous in scientific and engineering applications such as
grammar parsing [11], chemical molecule dynamics [12],
color intersection search [13], linear solver [14] and many
other applications [15], [16], [17], [18], [19], [20].
However, the performance of SpGEMM is memory
bounded on the traditional general-purpose computing plat-
forms such as CPUs and GPUs, because of the irregular
∗Equal Contributions.
Vanilla 
Inner 
Product 
Based =
Poor Input Reuse Good Output ReuseA B
C
Vanilla 
Outer 
Product 
Based
=
Good Input Reuse Poor Output Reuse
{  merge many partial matricesD E
H
G
F
Proposed 
Outer 
Product 
Based
=
Good Input Reuse Good Output Reuse
❸ Huffman Tree Scheduler
❹ Row Prefetcher
❶ Pipelined Multiply and Merge
❷ Matrix Condensing Less Partial Matrices
{less  merge(1)(2)
J
K
ED’
Figure 1. Proposed outer product based SpGEMM architecture jointly
optimizes input and output data reuse. It achieves good output reuse with
pipelined multiply and merge, matrix condensing, Huffman tree scheduler,
good input reuse with row prefetcher.
memory access pattern and poor locality caused by low-
density matrices [21], [22], [23]. For instance, the density
of Twitter’s [24] adjacency matrix is as low as 0.000214%.
As Moore’s Law [25] is slowing down, domain-specific
architecture [26] becomes promising to improve perfor-
mance and energy efficiency. OuterSPACE [1] proposed
outer product based SpGEMM, which has perfect input
reuse compared to inner product based method. However,
outer product has poor output reuse because it produces
a considerable amount of partial matrices (F,G,H in Fig-
ure 1). Those partial matrices need to be stored to DRAM
before merging, which incurs extensive DRAM access and
cancels the benefits of good input reuse. As a result, the
performance of OuterSPACE is only 10.4% of the theoretical
peak.
ar
X
iv
:2
00
2.
08
94
7v
1 
 [c
s.A
R]
  2
0 F
eb
 20
20
Pipelined Multiply 
and Merge Matrix Condensing
Huffman Tree 
Scheduler Row Prefetcher
5.7x slowdown 
5.7x more DRAM access
8.8x speedup 
5.4x less DRAM access
Overall 
4.2x speedup 
2.8x less DRAM access
1.5x speedup 
1.8x less DRAM access
1.8x speedup 
1.7x less DRAM access
Better Output Reuse Better Input Reuse
Figure 2. Four innovations in SpArch. The first slowdown is because of the excessive memory access of partially merged results when the number of
partial matrices is much larger than merge tree parallelism. But it enables the next three optimizations. We save DRAM bandwidth, achieve higher memory
utilization and speedup.
In this paper, we propose SpArch1, a domain-specific
accelerator to jointly optimize input and output data reuse.
We obtain input reuse by outer product, and output reuse by
on-chip partial matrix merging.
To achieve this, we design a highly parallelized merger
to pipeline the two computing stages, multiply and merge.
The multiply stage produces partial matrices, and the merge
stage merges partial matrices into final results. For large
matrices, however, the number of partial matrices exceeds
the merger’s parallelism. Merging only a part of partial
matrices at a time with multiple rounds leads to an increased
amount of memory access for the partially merged results,
which neutralizes the performance gain of pipelined multiple
and merge, making DRAM access even larger (see the first
step of Figure 2). However, it is a prerequisite for the
next three optimizations. Therefore, we propose a condensed
matrix representation for the first input matrix, where all
non-zero elements are pushed to the left, forming much
denser columns. As shown in Figure 1, we condense the
first input matrix D to condensed D′. The condensed matrix
D′ is loaded by condensed column. Implementation-wise,
this is equivalent to storing the matrix in CSR format and
fetching the elements with the same index for all rows. The
second input matrix E is stored in CSR format. The original
left matrix D has three columns and produces three partial
matrices. After condensing, D′ only has two columns and
only two partial matrices. In real-world benchmarks, we save
the number of partial matrices by three orders of magnitude.
Unfortunately, the condensed representation can still pro-
duce a larger amount of partial matrices than the merger’s
parallelism. The merge order impacts the amount of DRAM
access: partial matrices merged early have to be load/stored
to DRAM for every future merges. Therefore, we should
merge matrices with less non-zeros first. We design a
Huffman tree scheduler to decide the near-optimal merge
order. We represent each column of a sparse matrix as a
leaf node. The weight of the node is the number of non-
zeros in that column. For very sparse matrices, the number
of non-zeros after merge is approximately the sum of the
number of non-zeros before merge. Therefore, the weight of
1SpArch is the homophone of word spark, meaning striking together two
hard surfaces such as stone or metal, which we consider as analogous to
matrix-condensing, a critical technique in our architecture.
a parent node is the sum of the children’s weights, exactly
following the convention of a Huffman tree. A Huffman
tree minimizes the sum of all nodes’ weights. We apply
it to minimize the total DRAM traffic. Scheduled by
a Huffman tree, we merge matrices with fewer non-zeros
first and larger ones later (example in Figure 8). The three
techniques discussed above save output DRAM access on
20 benchmarks by 1/5.7 × 5.4 × 1.8 = 1.7× compared to
OuterSPACE (Figure 2).
However, in matrix condensing step, we ruin the perfect
reuse of the second matrix (E in Figure 1) because now one
condensed column of D′ need all the three (green, black and
pink) rows of E. In contrast, one column of D only needs one
row of E. Therefore, we propose a row prefetcher to prefetch
rows of the second matrix E and store to a row buffer. The
buffer replacement policy is near-optimal because we look
at the sequence of the rows we need ahead of time while
streaming in the first matrix, and thus we can replace the line
with farthest next use. The row buffer can achieve a 62%
hit rate, thus reducing DRAM access of the second matrix
by 2.6×, largely recovering the input reuse. With the four
techniques together, we reduce DRAM access by 2.8× over
OuterSPACE.
In summary, SpArch jointly optimizes input and output
reuse of SpGEMM and makes five contributions:
• Pipeline the multiply stage and the merge stage with a
comparator array-based highly parallelized merger.
• Use matrix-condensing to condense the first input ma-
trix, reducing the number of partial matrices by three
orders of magnitude.
• A Huffman tree scheduler that provides a near-optimal
merge order of partial matrices and reduces the DRAM
traffic.
• A row prefetcher that achieves near-optimal buffer
replacement policy for the second input matrix and
resolves the increased DRAM read induced by matrix-
condensing.
• We evaluate SpArch on real-world datasets from
SuiteSparse [27], SNAP [28] and rMAT [29], achieving
4×, 19×, 18×, 17×, 1285× speedup, and 6×, 164×,
435×, 307×, 62× energy saving over OuterSPACE,
MKL, cuSPARSE, CUSP and ARM Armadillo.
0.6
1.3
2.2
1.1
Partial Matrix BPartial Matrix A
0.1 0.5
0.2
1.2
+
0.1 1.1
0.2 1.3
2.2
1.1 1.2
=
Merged
(1, 0.1)(3, 1.1)(4, 0.2) (13, 0.1)(5, 1.3)(10, 2.2)(12, 1.1)
Sorted
(1, 0.1)(3, 0.5) (4, 0.2) (13, 0.1)(3, 0.6) (5, 1.3)(10, 2.2)(12, 1.1)
0 1 2 3 4 5 6 7
Comparator Array
(1,  
0.1)
(3,  
0.5)
(4,  
0.2)
(13, 
1.2)
(3, 0.6)
(5, 1.3)
(10, 2.2)
(12, 1.1)
≥ ≥ <
≥ ≥ ≥
≥ ≥ ≥
≥ ≥ ≥
0
1
2
3
1
2
3
4
2
3
4
5
3
4
5
6,7
<
<
<
<
<
<
<
<
≥ ≥ ≥ ≥
4
5
6
7
4 5 6 7
Figure 3. Comparator Array Based Merger. Each diagonal group outputs
the data on the boundary of ”≥” and ”<” to form a sorted array. The adder
and zero eliminator will further merge the duplicated elements.
II. PROPOSED ARCHITECTURE
A. Comparator Array based Merger
Previous state-of-the-art SpGEMM accelerator [1] pro-
cesses multiply and merge stages separately, which need
to store all partial matrices in DRAM. A better way is
to pipeline the multiply and merge stages and perform an
on-chip merge. The partial matrix is represented in COO
format with [row index, column index, value]. It is sorted
by row index then column index. The merger combines
multiple sorted arrays into a new sorted array. A simple
merger puts each array into a FIFO and selects the smallest
elements from the top of all FIFOs. This method suffers
from low parallelism and thus cannot fully utilize the DRAM
bandwidth. Therefore, we design a parallel merge unit.
1) Parallel Merge Unit: A naı¨ve merger walks two
pointers over two arrays and compares the corresponding
elements. The pointer with the smaller element will move
forward. This only outputs one element per cycle. In order
to increase the parallelism, we replace the pointer by a
sliding window of size N , all elements in the window A
will be compared to all elements in window B by an array
of comparators. After the comparison, we move one of the
window forward by N , improving the throughput by N
times.
A comparator array is the core of our merge unit. As
in Figure 3, the merger contains 4×4 comparators. The two
input matrices are stored in the format of [coordinate, value].
We use the comparator array to compare coordinates of all
A0
B0
< < <
≥ ≥ <
≥ ≥ ≥
0
1
2
1
2
3
2
3
4
A0 
(13, 
1.2)
A1 
(37, 
9.2)
A2 
(58, 
-10.8)
B0  
(12, 1.1)
B1 
(40, 9.2)
B2 
(61, 9.9) D
ec
id
e 
ch
un
k 
P
ai
rs
 fo
r l
ow
 le
ve
l 
co
m
pa
ra
to
r a
rr
ay
s
B1
B1
A2
B1
A2
B2
0
1
2
3
4
Top level 
comparator array 
compares last element  
in each chunk
Low level 
comparator arrays 
compare chunk pairs 
in parallel
A0
A1
A0
B0
Zoom In
(1,  
0.1)
(3,  
0.5)
(4,  
0.2)
(13, 
1.2)
(3, 0.6)
(5, 1.3)
(10, 2.2)
(12, 1.1)
≥ ≥ <
≥ ≥ ≥
≥ ≥ ≥
≥ ≥ ≥
0
1
2
3
1
2
3
4
2
3
4
5
3
4
5
6,7
<
<
<
<
<
<
<
<
≥ ≥ ≥ ≥
4
5
6
7
4 5 6 7
(1,0.1)(3,0.5)(4,0.2)
Chunk A0
(13,1.2) (19,5.1)(22,-3.1)(35,1.2)(37,9.2)
Chunk A1
(42,1.1)(47,9.9)(48,0.3)(58,-10.8)
Chunk A2
Partial Matrix A
(3,0.6)(5,1.3)(10,2.2)(12,1.1)
Chunk B0
(15,4.4)(29,3.9)(35,-1.1)(40,9.2)
Chunk B1
(44,7.1)(52,-1.0)(55,0.2)(61,9.9)
Chunk B2
Partial Matrix B
Figure 4. Hierarchical Comparator Array. We break the input array into
multiple chunks and use a top level comparator array to decide which pairs
of chunks will be fed to low level comparator arrays.
non-zero elements in partial matrix A and those of partial
matrix B. If A<B, then the entry is ’<’, otherwise it is
’≥’. We pad one dummy column of ’<’ to the right and
one dummy row of ’≥’ to the bottom. Then we detect a
boundary between the ’≥’ and ’<’ tiles. The boundary is
defined as below: 1. The left-top corner tile is a boundary.
2. The ’≥’ tiles in the first row are boundaries. 3. If a tile is
’≥’ and its top neighbor is ’<’, it is a boundary. 4. If a tile
is ’<’ and its left neighbor is ’≥’, it is a boundary. With the
rules, we mark the boundary tiles with red in Figure 3. We
further divide all the tiles diagonally into eight groups. The
tiles in the same group have the same group index marked at
the top-left of each tile. Each group will have one and only
one output. The output of one boundary tile is the smaller
coordinate and corresponding value in its two inputs. For
example, for the left-top tile, 1 < 3, therefore (1, 0.1) is the
output of that tile, also the output of group 0.
The output of the whole comparator array is the sorted
results of two input arrays. Let the left input array as a
and top array as b. The correctness is based on the fact
that a boundary tile (i, j) in k-th group always has ’<’
above and ’≥’ on the left. If it is ’≥’, it will output the
top corresponding bj . The ’<’ above indicates a0 to ai−1
from left array a are smaller than bj . Plus b0 to bj−1, there
are exactly i+ j = k items smaller than it. Thus the output
of that tile will be the k-th item in the merged result. The
case of ’<’ is similar to ’≥’.
All the results are generated in one clock cycle because
there is no data dependency between inputs. The 4×4
comparator array can process arbitrary length of inputs. In
each clock cycle, it merges eight inputs, four each from two
matrices. Then we shift-in following inputs and replace the
previous ones, so new merged results can be produced in
Merger (Comparator Array)
26 31 28 42 13 15 14 16
22 24 11 12
Merger (Comparator Array)
DRAM
A B C D
Multiplier Array
Merger (Comparator Array)
24 26 22 28 11 13 12 14
Merger (Comparator Array)
DRAM
A B C D
26 31 28 42
Merger (Comparator Array)
22 24
15 21 16 17
13 14
11 12
Merger (Comparator Array)
DRAM
A B C D
Merger (Comparator Array)
26 31 28 42 21 23 17 19
22 24 15 16
13 14
Merger (Comparator Array)
DRAM
A B C D
11, 12
Multiplier Array
Multiplier ArrayMultiplier Array
A: (24)(26)(31)(52)(54)(56)(57)(58)(73)(75) 
B: (22)(28)(42)(44)(46)(47)(48) 
C: (11)(13)(15)(21)(23)(25)(41)(43)(45) 
D: (12)(14)(16)(17)(18)(32)(34)(36)(37)(38)(72)
Figure 5. A merge tree which merges four partial product matrices. We
only show coordinates of elements here.
each cycle.
2) Hierarchical Parallel Merge Unit: To increase the
merger parallelism, we can increase the size of the com-
parator array, such as from 4×4 to 12×12. Nevertheless, the
number of comparators is O(n2), where n is the side length
of the comparator array, thus consuming a large number
of hardware resources. Therefore, we further propose a
hierarchical merger in Figure 4. We use two levels of
comparator arrays to reduce the total number of comparators.
The hierarchical merge unit contains high-level and low-
level comparator arrays. We divide the input into chunks of
4, as in Figure 4 top. The length of each chunk equals to
the input length of the low level comparator array (4 in the
example). The top level array is used to decide which chunks
need to be compared by low level arrays. The intuition
is that if the largest element of chunk A is still smaller
than chunk B, then we can skip the comparison of the two
chunks. We use the top level comparator array to compare
the last element in each chunk and get the boundary tiles
(marked in red). Since each chunk is already sorted, the last
element is the largest. The boundary tiles indicate the chunk
pairs for the low level comparator arrays. For example, in
Figure 4, the right-bottom tile is a boundary tile, so the A2
and B2 chunk is a chunk pair. We also divide the top level
comparator array into diagonal groups, and each group will
generate one and only one chunk pair. Then we use five
low level comparator arrays to process the chunk pairs in
parallel.
Output of each low level comparator array is limited by
a min-max bound to avoid element duplication. Results not
3zero_count 0 0 1 2 2 2 3
iter1: 
if zero_count[0] 
== 1 
shift left by 1
1 0 0 2 3 0 4 0
1 0 0 2 3 4 0 0
3zero_count 0 0 1 2 2 2 2
iter2: 
if zero_count[1] 
== 1 
shift left by 2
1 0 0 2 3 4 0 0
1 2 3 4 0 0 0 0
Figure 6. Zero Eliminator. Bit 0 of zero count is checked to determine
whether to shift by 1 in the first layer. Bit 1 of zero count is checked to
determine whether to shift by 2 in the second layer. We need logN -cycle-
latency to process input of length N.
in the bounds are set to 0. If the top level boundary tile has
one left/top boundary tile neighbor, then the min bound is
the first element of the top/left input chunk. The min bound
of the first low level array is the smallest element. The max
bound of each low level array is the min bound of the next
one. The upper bound of the last low level array is +∞.
Then the outputs of each low level array are concatenated
together to get the overall output. In this way, we save the
low level comparators in non-boundary tiles in the high level
array. Mathematically, the number of comparators is reduced
to O(n 43 ). If we choose a n 23 × n 23 top level comparator
array and n
1
3 ×n 13 low level comparators, we can process n
elements at a time and uses only (2n
2
3−1)∗(n 13 )2+(n 23 )2 =
O(n 43 ) comparators.
3) Merge Tree: Using the hierarchical merger, we get a
highly parallelized binary streaming merger that can merge
up to 16 elements in each cycle. It merges two arrays into
one array. In order to merge more arrays into one array, we
stack multiple binary mergers and form a merge tree. As
shown in Figure 5, the merge tree is a full binary tree that
each node represents a FIFO on the hardware. Input arrays
are fed to the leaf nodes, and the output array is collected
from the root node. A binary merger merges the elements
from the child FIFO and stores the results to the parent
FIFO. The throughput of the whole tree is bounded by the
root node, which has only one merger. Therefore, each layer
shares one merger to balance the throughput.
Figure 5 illustrates a merge tree of 2 layers, and each layer
is equipped with a 1×1 merger. We only show coordinates
of elements here. Four arrays A, B, C, D are to be merged.
They are loaded into the leaf FIFOs. In the first 4 cycles, the
lower merger merges data from the leaf FIFOs and store the
results to the middle FIFOs. Then the upper merger notices
there is enough data in the middle FIFOs, so it merges them
and stores to the topmost FIFO. When the root FIFO is
full, data is written to DRAM. The merge tree continues to
working streamingly until all data fed to the lowest FIFO
reach the root, and four arrays are merged fully on-chip.
4) Adders and Zero Eliminator: The merger stated above
only merges the elements and leaves alone same-location
elements, i.e., elements with the same row and column index.
However, in SpGEMM, we need to perform add operation
on those elements. Therefore, we connect a slice of adders
right after the merger, and it will add adjacent same-location
elements and set one of the elements to zero. Then we use
M
atrix 
C
ondensing
Round 0Round 1Round 2
Load S
equence
Produces 16 partial matrices Produces 12 partial matrices
Matrix Condensing
Stored in 
CSR; 
Loaded by  
condensed 
column
1st
4th
2nd
3rd
Figure 7. Matrix Condensing. Condense the sparse matrix to the left,
reducing the number of columns, thus reducing the number of partial
matrices. It can be stored naturally using the CSR format.
a Zero Eliminator to compress these zeroes and output the
dense results. The Zero Eliminator consists of two parts. The
first part is a prefix sum module that computes the number of
zeroes (zero count) before each element. The second part
is a modified logN layer shifter. Each layer contains N
MUXs that shift the input array and zero count by 1,2,4,...
positions. Unlike a traditional shifter in which MUXs share
a common control signal, the MUXs in Zero Eliminator
are controlled by the zero count signal of each element.
Figure 6 shows an example of the Zero Eliminator.
B. Matrix Condensing
Due to the limitation of hardware resources, we can only
afford to merge 64 arrays on-chip. However, the intermediate
results generated by the multiplier can be as large as 10,000
to 1,000,000, which requires multiple rounds of 64-way
merging and still consumes a considerable amount of DRAM
bandwidth. Therefore, we propose matrix condensing to
merge multiple columns into a single column, which reduces
the number of partial matrices to 100 to 1,000 in our
benchmarks.
The motivation of matrix condensing is that if we have
two columns a1, a2 from the left matrix that have no
elements sharing the same row index, we can merge them
(i.e., combine them into a new array sorted by the row index)
while keeping the original column index. When we process
the merged column in multiply phase, we multiply each
element by its corresponding row (the same as its original
column index) in the second input matrix B. The result is
the same as the merge result of a1 × B and a2 × B. We
use a cheap merge of the left matrix to replace an expensive
merge of the much longer multiplied results and reduce the
number of partial matrices.
Furthermore, since exchanging elements between different
columns does not affect the final result, we condense all
elements in a row to the leftmost column. In this way, the
number of columns of the condensed left matrix is far less
than the original one. The number of partial product matrices
is equivalent to the number of condensed columns, which is
reduced by three orders of magnitude.
Figure 7 shows a matrix in condensed format. The number
of columns is reduced from 16 to 12, which equals the
length of the longest row in the original matrix. In real-world
datasets, we can reduce it from 100,000 to 100˜1,000, which
is much closer to the size of the merge tree.
We store the left matrix in CSR format. The elements
in CSR directly map to those in condensed format: the ith
element in a CSR row is just in the ith column of condensed
format. CSR format and our condensed format are two
different views of the same data. In the view of DRAM, the
matrix is loaded by the rows, but in the view of a port of
the merge tree, the matrix is loaded by condensed columns.
This is achieved by the data loader, which dispatches data
to different ports according to its condensed column. As in
Figure 7, if the merger has parallelism of 4, we load four
condensed columns together. The dash frames show the load
sequence. After loading an element from the left matrix, we
can use its original column index to fetch a row from the
right matrix and fed them to the multipliers. The multiplied
results will be fed to the ports whose index equals to the
condensed column index (not the original column index).
C. Huffman Tree Scheduler
The merge unit can merge 64 matrices on-chip. After
matrix condensing, the number of partial matrices can still
exceed 64, which needs to be written to DRAM and merged
later. The order of the merge matters: the earlier a matrix
is merged, the more rounds of DRAM read and write it
needs. Intuitively, we should merge sparser partial products
matrices first, since they have less number of non-zeros; even
in-and-out from DRAM for many times, the access number
is smaller. We should leave denser matrices merged later. To
wisely choose the order, we use a Huffman tree scheduler
to minimize memory access during the whole task.
Our scheduler abstracts the entire merge process as a tree.
In Figure 8, we show the 2-way and 4-way Huffman tree
schedulers when the merger can merge 2, and 4 matrices at
the same time. We also show a 2-way sequential scheduler
without the Huffman tree scheduler for comparison.
The leaf nodes in the tree represent the initial multiplied
results of a column in the left matrix with the right matrix.
The internal nodes represent the partially merged results.
The weights of the nodes represent the size of partially
merged results or the size of the initial multiplied results.
For internal nodes, we estimate its weight by adding up the
weights of its children because the matrix is very sparse,
and the additions during the merge stage are relatively rare.
The arrow points from nodes to their parent node represent
a round of multiply-and-merge operation.
The memory access amount of all partially merged results
equals to the sum of all internal node weights. Since the root
node and leaf node weights do not rely on the tree shape,
the optimization goal is to minimize the total weights of
R
8
A 
15
2-Way with Sequential Scheduler
Total weight of all nodes: 365
A 
15
B 
15
C 
13
D 
12
E 
9
F 
7
G 
3
H 
2
I 
2
J 
2
K 
2
L 
2
weight∝memory access
(a)
2-Way with Huffman Tree Scheduler
C
28
B
15 13
A
15
E
9
17
K
2
L
2
4
I
2
J
2
4
8
32
F
75
12
G
3
H
2
12
D
24
52
84
R: Round Total weight of all nodes reduces to: 354
R
0
R
2
R
6
R
1
R
4
R
5
R
7
B 
15
C 
13
D 
12
E 
9
F 
7
G 
3
H 
2
I 
2
J 
2
K 
2
L 
2
(b)
4-Way with Huffman Tree Scheduler
J
2
K
2
L
2
6
C
13
G
3
H
2
I
2
A
15
B
15
D
12
E
9
F
7 13
41
84
R0R1R2R3
A 
15
B 
15
C 
13 
D 
12
E 
9
F 
7
G 
3
H 
2
I 
2
J 
2
K 
2
L 
2
Total weight of all nodes reduces to: 228
(c)
A
4
D
L K
2 7
C
2
I
9
4 13
16 17
33
5
J
2
B
2
G
15 16 15
H
2
F
3
20 31
51
84
E
12
Figure 8. Huffman tree scheduler gives near optimal order of partial matrices merging. The total memory access is minimized. The weights of nodes
represent the number of non-zeros in the partial matrix. The total DRAM access is proportional to the sum of nodes’ weights.
all nodes, which also equals the sum of the weights of leaf
nodes multiplied by their depths in the tree.
A k-ary Huffman Tree is proven to be the optimal solution
to minimize the sum of all nodes. It puts larger leaf nodes
nearer to the root and smaller nodes farther, so the accu-
mulative products of weights and depths can be minimized.
This is exactly what we need to minimize the total DRAM
traffic. In each round of the Huffman Tree construction, we
select k un-merged nodes with minimal weights and merge
them into an internal node except the first round. In the
first round, the number of nodes that need to choose can
be calculated with Formula 1. This guarantees that the last
round always merges k nodes, so the root node of the tree
is always full.
kinit = (numcondensed col−2)mod(nummerger way−1)+2
(1)
In the example (Figure 8), a 2-way Huffman scheduler
reduced the total weights from 365 to 354. A 4-way Huff-
man scheduler further reduces it to 228. The weight is
proportional to the DRAM traffic. Therefore total weight
is reduced when the parallelism of the merger increases.
However, high parallelism usually comes at the cost of high
power consumption and large area. We choose a 64-way
Huffman tree scheduler, and will discuss the trade-offs in
Section III.
In our real implementation, the Huffman tree is built on
the fly with a priority queue. The implementation reflects
the algorithm of building a Huffman tree. We firstly add
the weights of leaf nodes to the queue and sort them. For a
m−way merger, in each iteration, the first m partial matrices
are merged, and the weight of the merged matrix is added
to the queue, and then we sort the queue again.
D. Row Prefetcher
Matrix condensing reduces the number of partial matrices
but ruins the input reuse of the second operand matrix since
one column after condensing requires the elements from
different rows of the second matrix. We solve the problem
by a row prefetcher with near-optimal buffer replacement.
The prefetcher is motivated by the fact that we can predict
the access order of the right matrix ahead of time. As
mentioned in section II-B, we access the left matrix from
top to down. Using the column index of the left matrix, we
can deduce the order of the right matrix, which can be used
by the prefetcher to load data before they are consumed by
the multipliers.
The prefetcher has two functions: first, fetching the data
used by multipliers ahead of time to hide the latency of
DRAM; second, caching the fetched data in a buffer for
future reuse to reduce the amount of DRAM reading.
The first function is achieved by multiple data fetchers.
We use a data fetcher for each DRAM channel. Accesses to
different DRAM channels and banks are overlapped; thus,
the DRAM latency can be hidden.
The second function is achieved by an on-chip buffer,
which stores the rows of the right matrix. When new data
is fetched, using the predicted access order, we can replace
the line with the furthest next use, which can achieve near-
optimal reuse if the prediction is accurate.
In Figure 9, we show a simplified example. The left
column is one condensed column from the condensed first
matrix. Each element is marked with its original column.
The first three are original column 1, 0, and 2. On the top,
we show row 0 to row 4 of the second matrix and divide
them according to the length of the buffer line. The central
part of the figure shows the buffer at different time steps.
The red frames highlight the new contents after each time
step.
In time step 0, we load Row 1. In time step 1, we load
Row 0. In time step 2, we firstly store R2-0. Then we need
to spill some buffer lines to continue to load R2. We spill
R0 first because R0 will be used in 7 time steps later, and
R1 will be used in 3 time steps later. However, after spilling
all R0 lines, R2 still has remaining parts. So we have to spill
R1. Spilling a row line by line instead of as a whole can
R0-0R0 R0-1 R0-2 R0-3 R0-4 R0-5 R0-6
R1-0R1 R1-1 R1-2 R1-3
R2-0R2 R2-1 R2-2 R2-3 R2-4 R2-5 R2-6 R2-7 R2-8 R2-9
R3-0R3 R3-1 R3-2
R4-0R4 R4-1 R4-2 R4-3 R4-4 R4-5
Replace R0 
then 
Replace R1
R1-0
R1-1
R1-2
R1-3
C1
R1-0
R1-1
R1-2
R1-3
R0-0
R0-1
R0-2
R0-3
R0-4
R0-5
R0-6
C0
R2-8
R2-9
R1-2
R1-3
R2-1
R2-2
R2-3
R2-4
R2-5
R2-6
R2-7
R2-0
C2
R2-8
R2-9
R3-0
R3-1
R2-1
R2-2
R2-3
R2-4
R2-5
R2-6
R2-7
R3-2
C3
Replace R1 
then 
Replace R2
R2-8
R2-9
R2-0
R3-1
R2-1
R2-2
R2-3
R2-4
R2-5
R2-6
R2-7
R3-2
C2
Replace 
R3
R2-8
R2-9
R1-0
R3-1
R1-1
R1-2
R1-3
R2-4
R2-5
R2-6
R2-7
R3-2
C1
Replace 
R2
R4-4
R4-5
R1-0
R3-1
R1-1
R1-2
R1-3
R4-0
R4-1
R4-2
R4-3
R3-2
C4
Replace 
R2
R4-4
R4-5
R1-0
R3-1
R1-1
R1-2
R1-3
R4-0
R4-1
R4-2
R4-3
R3-2
C1
Replace 
R4
R4-4
R4-5
R1-0
R3-1
R1-1
R1-2
R1-3
R3-0
R4-1
R4-2
R4-3
R3-2
C3
Replace R3 
then 
Replace R4
R0-6
R4-5
R1-0
R0-1
R1-1
R1-2
R1-3
R0-0
R0-3
R0-4
R0-5
R0-2
C0
R0-6
R4-5
R1-0
R0-1
R1-1
R1-2
R1-3
R0-0
R0-3
R0-4
R0-5
R0-2
C1
R0-6
R4-5
R1-0
R0-1
R1-1
R1-2
R1-3
R0-0
R0-3
R0-4
R0-5
R0-2
C0
C1
C0
C2
C3
C2
C1
C4
C1
C3
C0
C1
C0
C4
C4
C1
C0
D
istance: 7 
D
istance: 3 Step0 Step1 Step2 Step3 Step4 Step5 Step6 Step7 Step8 Step9 Step10 Step11
Figure 9. The row prefetcher and the buffer replacement policy. The prefetcher fetches the rows of the second matrix in advance to hide the DRAM
latency. The buffer lines are replaced according to the pre-calculated sequence of required rows thus can achieve near Be´la´dy optimal.
H
B
M
MatB Row 
Prefetcher
M
ultiplier 
A
rray
MatA Column 
Fetcher
Distance List 
Builder
Look-Ahead 
FIFO
Required 
Row
Row 
Distance
MatA

Element
MatB 
Row
Merger
Merger
New 
PMat
Merger 
FIFO
Partial Mat 
Fetcher
Partial Mat Addr
Partial Mat 
Writer
Old 
PMat
Figure 10. System architecture of proposed SpArch.
bring benefits. For example, from the time step 7 to time
step 8, we only need to load R3-0 and no need for R3-1
and R3-2 because they are already in the buffer.
E. System Architecture
The system architecture of SpArch is shown in Figure 10.
The left matrix A is stored in CSR and fetched by condensed
column, and the right matrix B is stored in CSR format in
HBM. The MatA Column Fetcher receives control instruc-
tions from the software scheduler, calculates the addresses of
data in the selected columns, and fetches the elements from
the left matrix. Then the fetched elements will be sent to a
look-ahead FIFO. The Distance List Builder will process the
look-ahead FIFO and calculates the next use time of each
row. The row index and next use time are provided to MatB
Row Prefetcher to prefetch rows and decide which buffer
line to spill, as illustrated in Section II-D. To perform the
associative search, we use a hash table to map row indexes to
positions in the buffer and a reduction tree of next use time
to decide which line to spill. As the width of the hash table
is much lower than the buffer itself, and the next use time
changes very few (at most one line per cycle), the power
consumption of the logic is minimal.
After prefetching rows, the multiplier array will conduct
the outer product and generate partial matrices in the COO
format and send them to the merge tree. There is a MUX
before each lowest level FIFO of the merge tree and can be
configured to accept data either from the Multiplier Array
or the Partial Matrix Fetcher. If the MUXs before each
lowest level FIFOs are configured to Partial Matrix Fetcher,
the scheduler will send the address of corresponding
partially merged results. It will fetch the requested matrix
once the FIFO is near empty. The output of the merge tree
will be connected to the Partial Matrix Writer. It buffers
partial matrices before they are written to DRAM and also
converts the internal COO format to the CSR format if it is
the final result.
III. EVALUATION
A. Methodology
To evaluate the performance of our design, we built a cycle-
accurate simulator in C++ to model the exact behavior of the
hardware. Each module is abstracted as a class with a clock
update method updating the internal state of this module
in each cycle, and a clock apply method, which simulates
Array Merger 16×16 hierarchical merger (4×4 top level + 4×4 low level) with 64-bit index (32 bits for row and 32 bits for column)and 64-bit value, running under 1GHz clock frequency.
Merge Tree 6 layers of array merger provides the ability of merging at most 64 arrays simultaneously.
Multiplier 2 groups, each consists of 8 double precision floating point multipliers.
MatA Column Fetcher Looks ahead buffer of 8192 elements. 64 fetchers support 64 columns of left matrix.
MatB Row Prefetcher A prefetch buffer of 1024 lines×48 elements per line×12 bytes per element. 16 fetchers that load data from 16 DRAMchannels. Each can prefetch up to 48 rows before used.
Partial Matrix Fetcher Support fetching up to 64 partial results.
Partial Matrix Writer A FIFO of 1024 elements before they are stored into DRAM.
Main Memory 16×64-bit HBM channels, each channel provides 8GB/s bandwidth.
Table I
ARCHITECTURAL SETUP OF SPARCH.
SpArch OuterSPACE
Technology 40nm 32nm
Area 28.49 mm2 87 mm2
Power 9.26W 12.39W
DRAM HBM@128GB/s HBM@128GB/s
Bandwidth 68.6% 48.3%Utilization
Table II
COMPARISON WITH PRIOR STATE-OF-THE-ART OUTERSPACE ON
AREA, POWER, AND MEMORY BANDWIDTH UTILIZATION.
the flip-flops in the circuit to make sure signals are updated
correctly. The parameters during the simulation are listed in
Table I.
To measure the area and power consumption, we model all
the logic on the data path, which consists of the array merg-
ers, arithmetic units, FIFOs, row prefetcher, and DRAM.
We implement the array merger in Verilog and synthesize it
using Synopsys Design Compiler (DC) under TSMC 40nm
library. We generate the annotated toggle rate using the xsim
RTL simulator with the real input data extracted from the
simulator and dump it into Switching Activity Interchange
Format (SAIF) to estimate the power consumption with
Design Compiler. We then use the area and power data from
[30] with the number of multiplications and additions from
the simulator to model the arithmetic units (floating-point
multipliers and adders). We also dumped the size, width, and
the amount of reading/writing of each SRAM/FIFO from the
simulator and use CACTI to estimate the energy and area of
FIFOs and prefetch buffers in the circuit. From the simulator,
we also got the exact amount of DRAM read and write.
We use them to estimate our DRAM power consumption
according to [31], [32].
We use multiple platforms as our baselines, including Desk-
top CPU, GPU, mobile CPU, and the state-of-art ASIC
accelerator OuterSPACE. For desktop CPU, we use Intel
MKL [33] on a 6-core Intel Core-i7 5930K CPU and mea-
sure its power with pcm-power tool. MKL provides math
routines parallelized by OpenMP for various computational
Energy (nJ/FLOP) Area (mm2)
Outer- SpArch Outer- SpArchSPACE SPACE
Computation 3.19 0.26 49.1 4.1
SRAM 0.35 0.34 37.5 24.4
DRAM 1.20 0.29 N/A N/A
Crossbar 0.21 N/A 0.1 N/A
Overall 4.95 0.89 86.7 28.5
Table III
ENERGY AND POWER BREAKDOWN. SAVINGS OF SPARCH ARE FROM
LESS DRAM ACCESS AND MORE EFFICIENT LOGIC.
workloads. For GPU, we use both cuSPARSE [34] and
CUSP [35], [36] and run them on an NVIDIA TITAN
Xp GPU. cuSPARSE parallelizes the computation between
matrix rows and then merges the partial results of each
row with a hash table. CUSP also computes matrix rows
in parallel and then sorts and merges different rows. Power
consumption is measured with nvidia-smi. For mobile
CPU, we use Armadillo library [37], [38] on a 4-core
ARM A53 CPU and measure its dynamic power using a
powermeter.
In order to provide fair comparison baselines, the data
type of all baselines and our architecture are double.
For MKL, cuSPARSE, CUSP and Armadillo, we dis-
card memory allocation and transportation time and only
measure the core execution time of SpGEMM func-
tions, i.e., mkl_sparse_spmm, cusparseDcsrgemm,
cusp::gen eralized_spgemm and overloaded ”*” re-
spectively. For all power measurement, we first measure
the idle power of the system and then repeatedly run the
workloads and measure the total power. Then the dynamic
power is total power minus idle power.
B. Experimental Results
We obtain the area and power consumption under TSMC
40nm technology and compare it with state-of-the-art
SpGEMM accelerator OuterSPACE (Table II and Table III).
For a fair comparison, we use the same DRAM power
estimation as OuterSPACE, which is 42.6 GB/s/W.
Table 2
Over OuterSPACE Over MKL Over cuSPARSE Over CUSP Over Armadillo
2cubes_sphere 4.63723471200183 23.7090932716815 11.6037079232328 13.6559455612264 1414.71684180317
amazon0312 4.69028938243438 24.6857069068942 36.4501027011047 28.5646988009485 1705.79952528691
ca-CondMat 3.27543190819418 14.438318413541 33.0722032141698 17.2771054977574 1393.70108257684
cage12 5.03262488313806 28.785126352939 11.1842625120057 14.3175093605454 1251.97237487494
cit-Patents 3.92871650073752 105.990977227265 277.722353710464 62.4823090625836 3760.12940718149
cop20k_A 4.68452255052138 22.6621302623938 8.88079194138266 25.273407536053 1223.43917925459
email-Enron 3.04134901392929 9.4132465516484 15.1160484520773 24.4207665634565 1127.48006520061
facebook 3.92113263170544 25.5497382542342 7.67126601052896 12.8528906478115 771.275846038707
filter3D 4.18759373901766 9.72254309299399 8.96356967524707 9.14610515204381 1112.21944191207
m133-b3 5.52025414665423 10.8500130987032 8.25828924856117 12.2497727579963 904.288379889281
mario002 3.29427301074803 14.593447810248 5.97791338876889 6.53630328402332 1127.3919271693
offshore 4.57206428685174 19.5529152239133 19.4987302026345 11.1972005843762 1583.81390063974
p2p-
Gnutella31 3.71778234501883 10.8416346615855 21.132261156069 14.9948391709607 1044.98679798341
patents_main 4.96201680134316 52.4597422670546 28.5313399011772 12.661369631862 2570.1183471693
poisson3Da 4.41815264458411 12.1530376373988 13.6920873552052 14.5200616954377 1154.18673921998
roadNet-CA 3.77022741739571 10.2015152710377 3.08117687073174 3.63136050743148 711.679595542241
scircuit 5.37754734203454 13.8398186253061 7.15226282794476 9.32166243719404 1046.17953963557
web-Google 3.88360587837463 40.4724378554192 171.183954133139 123.360493673665 2499.88198224918
webbase-1M 3.35674758713642 15.6488736127697 19.9363356983682 18.6282753205719 828.017521433532
wiki-Vote 3.95656617463903 10.1606271441995 32.3200783959356 28.2769622727105 1039.82204787504
Geo Mean 4.15351619751816 18.6730047656866 17.5572696873829 16.5504801170404 1284.82668728664
Sp
ee
du
p
1
10
100
1000
2cubes_sphere
amazon0312
ca-CondMat
cage12
cit-Patents
cop20k_A
email-Enron
facebook
filter3D m133-b3
mario002
offshore
p2p-Gnutella31
patents_main
poisson3Da
roadNet-CA
scircuit
web-Google
webbase-1M
wiki-Vote
Geo Mean
1285104082825001046712
115425701045158411279041112771
1127122337601252139417061415
17
28
19
123
9
4
15131511
7
12
9
13
2425
62
1417
29
14
18
32
20
171
7
3
14
29
2119
6
898
15
9
278
11
3336
12
19
10
16
40
14
1012
52
11
20
15
1110
26
9
23
106
29
14
2524
4434
5
44
5
4
5
3
6
44
3
54
5
3
55
Over OuterSPACE Over MKL Over cuSPARSE Over CUSP Over Armadillo
Figure 11. Speedup of SpArch over OuterSPACE, MKL, cuSPARSE, CUSP and ARM Armadillo on real-world matrices. We get a consistent speedup
over state-of-the-art accelerator OuterSPACE.
Table 2-1
Over OuterSPACE Over MKL Over cuSPARSE Over CUSP Over Armadillo
2cubes_sphere 5.22598013701056 164.479706024014 248.98984201057 224.736280976096 48.8794103654672
amazon0312 6.88669612107833 207.712592126548 1036.29986243966 395.904969663462 96.9939551732474
ca-CondMat 4.55256711712377 127.275952124852 789.941034417262 296.424679535877 65.6403473966779
cage12 6.1337790198124 215.034425495027 258.44786024133 275.944202252062 51.7060977534429
cit-Patents 7.25734669627795 842.584079050631 6043.03394872051 1517.87502822201 240.969783980313
cop20k_A 5.3322464562582 164.579763511822 191.658727100428 408.493838243293 48.3125411831766
email-Enron 4.07628608796988 81.4823089737375 398.832608227801 422.514708264772 51.2059843560592
facebook 5.57950145066446 196.374485582292 139.146457204678 360.029976868761 32.7611756758377
filter3D 5.07278113707992 77.7377585494996 207.622821921975 178.777117462946 47.8287360113564
m133-b3 9.97750448012376 116.179849862068 275.773539080564 200.065901400538 59.3398554608102
mario002 5.71104930029322 148.454025747139 204.013979452471 127.077153403376 69.3822020739428
offshore 5.1165667685493 139.589816061172 427.80226775715 161.755821474672 61.4898006191512
p2p-
Gnutella31 7.48531293020062 143.21295310736 693.405848594896 292.289503746795 69.5960869719778
patents_main 8.78317485672016 581.285331905204 912.699149338374 216.97984436607 139.474296041095
poisson3Da 5.38058986883362 94.6915803293822 326.910800757299 251.092918105476 45.3617087903062
roadNet-CA 6.98590537909619 116.300290082148 105.480316582459 78.171990650603 42.5562957823006
scircuit 8.04931003153742 126.655359994534 165.837700849291 174.487228387844 48.0096337541645
web-Google 6.09481991954019 338.368987619223 3619.67363745205 2499.1134461239 123.44514190491
webbase-1M 5.63841986757046 144.973300299178 570.073275010066 403.920150291568 43.7629936304188
wiki-Vote 5.18479288948222 88.4092060143102 471.550958822391 490.290796345931 40.6759669931953
Geo Mean 6.07169317810153 163.893593882208 435.268243623545 306.705501508714 62.2038980004567
En
er
gy
 
Ef
fic
ie
nc
y
1
10
100
1000
2cubes_sphere
amazon0312
ca-CondMat
cage12
cit-Patents
cop20k_A
email-Enron
facebook
filter3D m133-b3
mario002
offshore
p2p-Gnutella31
patents_main
poisson3Da
roadNet-CA
scircuit
web-Google
webbase-1M
wiki-Vote
Geo Mean
62
4144
123
484345
139
70616959
48
33
5148
241
52
66
97
49
307
490
404
2499
174
78
251217
292
162
127
200179
360423408
1518
276296
396
225
435472
570
3620
166
105
327
913
693
428
204
276
208
139
399
192
6043
258
790
1036
249
164
88
145
338
12711695
581
143140148
116
78
196
81
165
843
215
127
208
164
6566
87
5
97
56
10
564
5
76
5
7
5
Over OuterSPACE Over MKL Over cuSPARSE Over CUSP Over Armadillo
Figure 12. Energy saving of SpArch over OuterSPACE, MKL, cuSPARSE, CUSP and ARM Armadillo.
Table -2
Column Fetcher Row Prefetcher Multiplier Array Merge Tree Partial Mat Writer
area 2.64 5.8 0.45 17.27 2.34
Partial Mat Writer
8.2%
Merge Tree
60.6%
Multiplier Array
1.6%
Row Prefetcher
20.4%
Column Fetcher
9.3%
Table 1-1-2
Column Fetcher Row Prefetcher Multiplier Array Merge Tree Partial Mat Writer HBM
power 101.39 1155.72 73.10 4738.47 243.04 2240.4
HBM
26.2%
Partial Mat Writer
2.8%
Merge Tree
55.4%
Multiplier Array
0.9%
Row Prefetcher
13.5%
Column Fetcher
1.2%
(a) (b)
Figure 13. Area (a) and Power (b) Breakdown. The merge tree as the
core of SpArch takes the most of area and power.
We also analyze the power and area per module. Figure 13
shows area and power breakdown of SpArch. Most of the
energy and area is consumed by the merge tree, which is
the core part of SpArch. We evaluate the performance of
SpArch on the SuiteSparse dataset [27] and SNAP dataset
[28], which are the same as [1].
Figure 11 compares the SpArch and baselines. On av-
erage, SpArch is 4×, 19×, 18×, 17× and 1285× faster
than OuterSPACE, MKL, cuSPARSE, CUSP and ARM Ar-
madillo. SpArch outperforms the baselines on every matrix.
This mainly comes from the reduction of redundant memory
access of partial matrices, with 2.8× less DRAM access
compared to OuterSPACE. We also achieve higher memory
bandwidth utilization compared to OuterSPACE, by virtue
of hiding the DRAM latency with row prefetcher and the
regular write pattern of streaming merge tree. Figure 12
shows the relative energy savings. SpArch achieves 6×,
164×, 435×, 307× and 62× energy saving comparing
to OuterSPACE, MKL, cuSPARSE, CUSP and ARM Ar-
madillo. This is because our specialized architecture de-
signed for SpGEMM tasks removes unnecessary compo-
nents like cache, instruction decoder in general processors,
and uses a carefully-designed computational unit, which is
more efficient on specific tasks.
Figure 14 shows the performance comparison on syn-
Table 1-1-1
MKL Ours
rmat-5k-x32 1358366704 13871334064.5
rmat-5k-x16 980693508.9 12460358251.2
rmat-10k-x32 1182123213 10094618599.5
rmat-5k-x8 760608128.4 9394076179.7
rmat-10k-x16 763053019 9539259802.5
rmat-20k-x32 939993571.8 7865802681.3
rmat-5k-x4 352499704.2 6576208342.5
rmat-10k-x8 684995374 8592527459.5
rmat-20k-x16 736914843.1 7371529812.6
rmat-40k-x32 779826545.9 6442724494.4
rmat-10k-x4 290633276.5 6716478022.4
rmat-20k-x8 600813361 7462114702.3
rmat-40k-x16 625954887.6 6083553 91.8
rmat-20k-x4 290874130.6 6845946204.74
rmat-40k-x8 517816495 637778 424.74
rmat-80k-x16 518906373.8 5217667204.83
rmat-40k-x4 229579497 6372357066.38
rmat-80k-x8 381004740.5 5551040046.90
rmat-80k-x4 243843142.5 5840754841.78
Geo Mean 568293928.976508 7544938524.49646
1E+08
1E+09
1E+10
rmat-5k-x32
rmat-5k-x16
rmat-10k-x32
rmat-5k-x8
rmat-10k-x16
rmat-20k-x32
rmat-5k-x4
rmat-10k-x8
rmat-20k-x16
rmat-40k-x32
rmat-10k-x4
rmat-20k-x8
rmat-40k-x16
rmat-20k-x4
rmat-40k-x8
rmat-80k-x16
rmat-40k-x4
rmat-80k-x8
rmat-80k-x4
Geo Mean
MKL Ours
Degradation: 2.7x
Degradation: 5.9x
Density: 6E-3 5E-5
FLOPS
Figure 14. Performance on rMAT benchmarks. Our performance is not
only higher than MKL, but also more stable when the matrices get sparser.
thesized rMAT [29] data with Intel MKL. The density of
matrices ranges from 6×10−3 to 5×10−5. SpArch achieves
over 10× speedup while keeping a relatively stable perfor-
mance: the performance only drops by 2.7× when matrices
are sparser, while MKL suffers from larger performance
degradation. The lower impact on matrix density partially
derives from the outer product algorithm, which is not
sensitive to the density. It also relates to our architecture
design which uses a single but large merge tree rather than
many small processing elements. The latter one will be
more likely to suffer from the problem of load balancing,
especially when matrices become sparser.
To understand how far our design is from the theoretically
optimal solution, we use the roofline model to analyze
our performance. Figure 15 shows the result of roofline
analysis. We calculate the theoretical operational intensity
of the outer product on our dataset, which equals the
number of operations divided by the size of two input
matrices plus the size of merged final results, calculated
to be 0.19Flops/Byte. The theoretical computation roof is
32GFlops in our design. We use 16 multipliers running in
1GHz. The peak multiplication performance is 16GFlops/s,
and the overall peak performance (multiplication+addition)
is 32GFlops/s. The roof in our condition is 23.9GFlops, 2.3×
Performance 
(GFLOPS/s)
0.05 0.10 0.2 0.4 Operation Intensity  
(Ops/Byte)
6.4
12.8
25.6
51.2
computation roof 32.0
0.19
Ours
23.9
10.4
2.5
OuterSPACE
ban
dw
idth
 ro
of
Figure 15. Roofline Model for SpArch and OuterSPACE. The theoretical
operational intensity is 0.19Flops/Byte. The theoretical computation roof
is 32GFlops/s. SpArch achieves 10.4GFlops/s while OuterSPACE acheives
2.5GFlops/s. Our performance is 4× higher than OuterSPACE by virtue of
less redundant DRAM traffic for partial matrices.
over our performance, and 9.6× over OuterSPACE. We are
much nearer to the roof comparing to OuterSPACE.
C. Interpreting the Performance Gain
To analyze where our speed up comes from and demon-
strate the effectiveness of several proposed ideas separately,
we conduct breakdown experiments. Figure 16 shows the
performance breakdown for each of our contributions. In
the breakdown experiments, we first remove our prefetcher,
scheduler, and change back to CSC/CSR matrix format but
preserve the pipeline of multiply and merge phase and use a
random order to select initial columns and partially merged
results to merge. The performance slows down by 5.7×
comparing to OuterSPACE.
The DRAM read-write in OuterSPACE is mainly the
intermediate results generated by the multiplier. Assuming
the dimension of the matrix is N × N , and there are M
multiplications, thus M intermediate results. We also have
an average of 0.5M final results, according to the statistics
of the datasets, so the memory access is roughly 2.5M . In
our multiply-merge pipeline, ideally, the partial matrices are
not written to DRAM. However, as N is much larger than
the size of the merge tree (64), we still need to store lots of
partially merged results to DRAM. Assume that the merging
does not change the length of arrays, we can estimate the
read times of one multiplied result α as follows:
E =
t−1∑
k=0
Pr[α is used in round k] (2)
=
t−1∑
k=0
w
N − k ∗ (w − 1) (3)
=
w
w − 1 ∗
t−1∑
k=0
1
t+ 1w−1 − k
(4)
=
w
w − 1 ∗
t∑
i=1
1
1
w−1 + i
(5)
where t = N−1w−1 is the number of rounds and w = 64 is the
Table 2-2
0 1 2 3 4
2.5 0.4352 3.8 5.9 10.4
0
1
2
3
4
GFLOPS
0.0 3.0 6.0 9.0 12.0
10.4
5.9
3.8
0.4
2.5
5.7x slowdown with Pipelined Multiply and Merge
8.8x speedup with Matrix Condensing
1.5x speedup with Huffman Tree Scheduler
1.8x speedup with Row Prefetcher
OuterSPACE baseline
Figure 16. Dissecting the performance gain. The performance first
drops due to the increased memory access of partially merged results,
but it enables next three optimizations that reduce memory access more
effectively: matrix condensing and Huffman tree scheduler for the output
matrix, together with row prefetcher for the input matrix. Overall it achieves
4.2× speed up over OuterSPACE.
size of the merge tree. As 1w−1 is very small compared to
i, we can ignore it:
E ≈ w
w − 1 ∗
t∑
i=1
1
i
(6)
≈ w
w − 1 ∗ ln t (7)
On average N ≈ 140, 000 in the datasets, we need
to read and write the partially merged results for roughly
ln 14000063 − 1 ≈ 6.7 times. (Minus 1 because we do
not need to load/store the multiplied results in the first
round). The DRAM access of partial results is roughly
6.7∗2M +0.5M = 13.9M . Comparing to the 2.5M access
of OuterSPACE, the 5.7× slow down is reasonable. We
then apply matrix condensing to the left matrix, which can
reduce the number of columns from 140,000 to 100, both on
average. That means we can finish the merging in 2 rounds
on average using a 64-way merger. However, the right matrix
will be read not only once. Instead, the amount of right
matrix access equals to the number of multiplications M .
Plus the partially merged results and final results, which
is roughly ((1 + 1/2) − 1) ∗ 2M + 0.5M = 1.5M (using
Formula 6), our memory access is 2.5M , 5.5× less than the
one without condensing. The speedup improvement this step
is 8.8×, higher than 5.5× because with matrix condensing,
each round takes much more clock cycles than before, thus
the startup overhead is reduced, leading to higher memory
utilization and higher performance.
Furthermore, we add back the Huffman Tree scheduler,
which makes the memory access of partially merged results
negligible. This is because the average number of non-zeroes
per row is less than 64. The long columns can be scheduled
near the root node in the Huffman Tree, so they will not
generate partially merged results. This reduces the memory
access by roughly 1.7× (from 2.5M to 1.5M ), near to the
actual performance we get.
Table 1-3-1
Prefetched Buffer 
Size
1024x24 1024x36 1024x48 1024x60 1024x72 1024x84 1024x96
GFLOPS 10.21 10.30 10.45 10.47 10.53 10.55 10.57
DRAM Access 216.5 211.5 208.2 206.0 204.6 203.8 203.4
D
R
A
M
 A
cc
es
s 
(M
B
)
200
205
210
215
220
G
FL
O
P
S
10.1
10.2
10.3
10.4
10.5
10.6
(a)Prefetcher Buffer Total Size
1024x24
1024x36
1024x60
1024x72
1024x84
1024x96
GFLOPS DRAM Access
203.4203.8
204.6
206
208.2
211.5
216.5
10.57
10.55
10.53
10.47
10.45
10.3
10.21
Table 1-3-1-2
Merger FIFO Size 2048x24 1024x48 512x96 256x192
GFLOPS 10.41 10.45 9.91 9.26
DRAM Access 202.1 208.2 226.6 245.7
D
R
A
M
 A
cc
es
s 
(M
B
)
190
202
214
226
238
250
G
FL
O
P
S
9
9.5
10
10.5
11
2048x24 512x96 256x192
GFLOPS DRAM Access
245.7
226.6
208.2
202.1
9.26
9.91
10.4510.41
Table 1-3-1-1-1
Lookahead FIFO 
Size
1024 2048 4096 8192 16384
GFLOPS 10.04 10.25 10.43 10.45 10.39
DRAM Access 228.0 220.1 212.3 208.2 206.5
D
R
A
M
 A
cc
es
s 
(M
B
)
200
206
212
218
224
230
G
FL
O
P
S
9.9
10.1
10.3
10.5
(d)Lookahead FIFO Size
1024 2048 4096 16384
GFLOPS DRAM Access
206.5
208.2
212.3
220.1
228.0
10.39
10.4510.43
10.25
10.04
Table 1-3-1-1-1-1
Comparator Array 
Size
1x1 2x2 4x4 8x8 16x16
GFLOPS 1.28 2.53 4.94 8.18 10.45
DRAM Access 216.1 216.1 215.8 212.3 208.2
D
R
A
M
 A
cc
es
s 
(M
B
)
160
179
198
217
G
FL
O
P
S
0
2.2
4.4
6.6
8.8
11
(c)Comparator Array Size
1x1 2x2 4x4 8x8
GFLOPS DRAM Access
208.2
212.3
215.8216.1216.1 10.45
8.18
4.94
2.53
1.28
1024x48
1024x48
16x16 8192
(b)Prefetcher Buffer Line Size
Figure 17. Design space exploration on various buffer sizes and array
sizes.
We finally add back the row prefetcher. It buffers the
data from the right matrix and achieves a hit rate of 62%.
Therefore, we reduce the total memory access from 1.5M
to 0.38M + 0.5M = 0.88M , reducing DRAM access by
1.7×, which matches the experiments.
D. Design Space Exploration
The high performance of SpArch comes from two factors:
higher memory bandwidth utilization and lower memory
access. To achieve higher bandwidth utilization, we need
to make sure we are not bounded by the performance of
computational units - mainly the comparator arrays in the
merge tree. To achieve lower memory access, we need to
make sure high reuse of the right matrix, which is mainly
affected by the prefetch buffer and the look-ahead FIFO;
and low partial merged results access, which is mainly
determined by the size of the merge tree. However, higher
performance usually comes with the cost of a larger area and
power. We need to balance the performance and power/area
carefully. We conduct design space exploration to find the
optimal parameters for our design.
Figure 17 (a)(b) shows the relationship between perfor-
mance and the size of prefetch buffer (m×n means there
are m lines and each line contains n elements). When
the number of lines is fixed as, the longer line results in
better performance, but the area will increase linearly. The
difference between 1024×60 and 1024×48 is much smaller
than the one between 1024×48 and 1024×36, so we choose
48 as our line size. As shown in Figure 17 (b), When the
overall buffer size is fixed as 1024 × 48 = 49152, more
lines leads to less memory access. However, when there are
more than 1024 lines, such as 2048 lines, the benefit from
less DRAM access is minor, but the increased latency of
replacement logic has a greater impact on the performance,
leading to smaller overall Flops/s. Thus we choose 1024 as
6
Table 1-3-1-1-2-1-1
Number of Merger 
Tree Layers
2 3 4 5 6 7
GFLOPS 4.13 6.39 8.80 10.00 10.45 10.45
DRAM Access 645.3 403.3 274.7 223.4 208.2 204.2
D
R
A
M
 A
cc
es
s 
(M
B
)
0
175
350
525
700
G
FL
O
P
S
0
2.2
4.4
6.6
8.8
11
Number of Merge Tree Layers
2 3 4 5 7
GFLOPS DRAM Access
204.2208.2223.4
274.7
403.3
645.3
10.4510.45
10
8.8
6.39
4.13
Figure 18. Design space exploration on the merge tree size. 6 layers
are enough to achieve good performance, above which gives diminishing
improvement.
the number of lines.
Figure 17 (c) shows the relationship between the size
of comparator arrays in our merge tree and the overall
performance. Wh n the array size is less than 8×8, the
performance increas linearly. That means we are com-
putationally bounded. When the size increases to 16×16,
the performance gain is lower, and we are bounded by
memory bandwidth. So we choose 16×16 as our comparator
size, which consumes the memory bandwidth as much as
possible.
Figure 17 (d) explores the size of look-ahead FIFO.
Larger FIFO makes the replacement of prefetch buffer more
accurate and improves the data reuse. However, we need
more time to fill the larger FIFO at the start of each round.
Excessive size can lead to performance degradation. We
choose 8192, which results in the highest performance, as
our configuration of the look-ahead FIFO.
Figure 18 helps us determine the size of the merge tree.
Larger merge tree can process more columns at the same
time but requires more FIFOs and comparator arrays. A
merge tree of 6 layers and 64 ports is good enough, and
the larger one does not contribute to the speedup. Therefore,
we choose 6 layers to achieve the best performance without
wasting area and power.
IV. RELATED WORK
Sparse Matrix-Matrix Multiplication Acceleration. Be-
sides the ASIC Sparse Matrix-Matrix Multiplication accel-
erator OuterSPACE [1] that we mainly discussed in this
paper, there are some previous explorations on SpMM on
FPGA, such as [39]. Contrary to our fully specialized merge
tree, both of them utilized general-purpose PEs and not fully
specialized for the sparse data.
There are also many works focusing on SpMM on
general-purpose platforms like CPUs and GPUs. The al-
gorithms are mostly based on the multiply-and-insert-like
Gustavson’s algorithm [40], which has many variants that
are different in the method of inserting an element to the
output matrix. We list several examples as below. cuS-
PARSE [34] utilizes a hash table. However, on-chip hash-
table limits the scalability, and off-chip hash-table suffers
from low memory utilization. CUSP [35] uses a sorting
algorithm which suffers from higher complexity (sorting
network) and excessive DRAM access if on-chip resources
are limited. HeapSpGEMM [41] uses a heap to insert. Since
the heap is hard to parallelize, the parallelism only comes
from processing multiple rows simultaneously, which would
suffer from the load-balance problem. BHSPARSE [42] and
proposed SpArch choose merge as the insertion method.
It is less complex than sorting and remains parallelizable.
It can also process the whole partial matrix rather than a
single row, eliminating the load-balance issue. Compared to
BHSPARSE, our design further reduces the DRAM access
thanks to the specialized multiply-and-merge pipeline, which
can hardly be implemented on general-purpose platforms.
Hardware Accelerator for Merge and Intersection. [43]
designed an efficient parallel GPU-based method to solve
the problem of intersecting two unsorted sets. [44] used
a special data structure called bloom filter to parallelize
the intersections on GPU. [45] implemented parallel lists
intersection and index compression on GPU to speed up
web search. [46] used SIMD techniques to enable a high
level of parallelism in the set intersection. Extensor [47]
also proposes to accelerate different sparse linear algebra by
hierarchically eliminating computations to the intersection
operation.
Sparse Linear Algebra. A number of designs are pre-
sented to accelerate sparse linear algebra on FPGA plat-
forms. [48] introduced a tree of binary operators to increase
the energy efficiency of SpMV on FPGAs. [49] proved
that separating indices comparison and computing operations
could increase the throughput. The index comparison and
computing in our SpArch system design are also separated.
[50] proposed a new sparse matrix storage method called
”BVCSR” to compress the indices of non-zero elements,
thus increasing the valid bandwidth of FPGA. [51] and [52]
proposed an architecture for large-scale SpMV in the FEM
problem. [51] co-designed an FPGA SpMV architecture with
a matrix stripping and partitioning algorithms that enable
the architecture to process arbitrarily large matrices without
changing the PE quantities.
For sparse operations in deep learning algorithms, [53]
proposed a specialized architecture that can leverage not
only the static sparse weights but also the dynamic sparse
activations. [54] further extended the efficient SpMV archi-
tecture to LSTM[55], a kind of recurrent neural networks.
Besides SpMV, SCNN [56] proposed an architecture for
sparse convolution operations, and SparTen [57] further
solved the overhead and load-balance issue in SCNN. In-
stead of designing a pure hardware architecture, [58] focused
on the algorithm and hardware co-design. Different from
those works, SpArch is based on the outer product algorithm,
trying to maximize both input and output data reuse.
Comparison between Different Hardware. The com-
parison of performance and energy efficiency of different
platforms has been investigated by many researchers. [59],
[22] and [60] conducted comparisons by running the same
benchmarks on different platforms and measuring the per-
formance. We use similar methodology in the evaluation of
SpArch.
Systolic Array. Systolic Array [61] is a homogeneous
architecture of coupled processing units. Although the pro-
posed comparator array shares a similar spatial architecture
with systolic arrays, there exists a fundamental difference
between them: the comparator array does not have long data
dependencies. Each comparator block only depends on the
comparison results from the top and left neighbor block,
which is computed based on the input array and not on
other comparator blocks. Therefore, all the outputs of the
comparator matrix can be calculated in one cycle.
V. CONCLUSION
In this paper, we propose SpArch optimizing the data
reuse of both the input matrix and the output matrix to
accelerate sparse matrix-matrix multiplication (SpGEMM).
We design a streaming-based merger to pipeline the multiply
and merge phase of the outer product on-chip. We then pro-
pose a condensed matrix representation to reduce the partial
matrices generated by the merge phase. We further develop a
Huffman tree scheduler to make our merger scalable to larger
or power-law matrices, also to reduce the memory access of
partially merged results. Finally, we use a row prefetcher
that achieves near-optimal buffer replacement to reduce the
read of the input right matrix. These improvements are
demonstrated to be valid and fruitful. When evaluated on 20
real-world benchmarks, the memory access reduces by 2.8×,
the performance increases by 4×, and the energy efficiency
increases by 6× compared to the previous state-of-the-art.
We also provide a detailed breakdown to interpret the saving
of each step, offering insights to the specialized accelerator
designs.
ACKNOWLEDGEMENT
Part of this work was supported under an NSF HDR
Award #1934700 and DARPA SDH program. We thank
MIT Data Science and AI Lab (DSAIL) for supporting
this research. We thank Joel Emer, Saman Amarasinghe,
Julian Shun, Stephen Keckler, and Christopher Fletcher for
inspiring discussions. This research was, in part, funded by
the U.S. Government. The views and conclusions contained
in this document are those of the authors and should not
be interpreted as representing the official policies, either
expressed or implied, of the U.S. Government.
REFERENCES
[1] S. Pal, J. Beaumont, D.-H. Park, A. Amarnath, S. Feng,
C. Chakrabarti, H.-S. Kim, D. Blaauw, T. Mudge, and
R. Dreslinski, “Outerspace: An outer product based sparse
matrix multiplication accelerator,” in 2018 IEEE International
Symposium on High Performance Computer Architecture
(HPCA), pp. 724–736, IEEE, 2018.
[2] S. Han, H. Mao, and W. J. Dally, “Deep compression: Com-
pressing deep neural networks with pruning, trained quantiza-
tion and huffman coding,” arXiv preprint arXiv:1510.00149,
2015.
[3] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights
and connections for efficient neural network,” in Advances in
neural information processing systems, pp. 1135–1143, 2015.
[4] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning
structured sparsity in deep neural networks,” in Advances in
neural information processing systems, pp. 2074–2082, 2016.
[5] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc:
Automl for model compression and acceleration on mobile
devices,” in ECCV, pp. 784–800, 2018.
[6] A. Azad, A. Buluc¸, and J. Gilbert, “Parallel triangle counting
and enumeration using matrix algebra,” in 2015 IEEE In-
ternational Parallel and Distributed Processing Symposium
Workshop (IPDPSW), pp. 804–811, IEEE, 2015.
[7] S. M. Van Dongen, Graph clustering by flow simulation. PhD
thesis, 2000.
[8] J. R. Gilbert, S. Reinhardt, and V. B. Shah, “A unified frame-
work for numerical and combinatorial computing,” Comput-
ing in Science & Engineering, vol. 10, no. 2, pp. 20–25, 2008.
[9] J. R. Gilbert, S. Reinhardt, and V. B. Shah, “High-
performance graph algorithms from parallel sparse matrices,”
in International Workshop on Applied Parallel Computing,
pp. 260–269, Springer, 2006.
[10] M. O. Rabin and V. V. Vazirani, “Maximum matchings in gen-
eral graphs through randomization,” Journal of Algorithms,
vol. 10, no. 4, pp. 557–567, 1989.
[11] G. Penn, “Efficient transitive closure of sparse matrices over
closed semirings,” Theoretical Computer Science, vol. 354,
no. 1, pp. 72–81, 2006.
[12] S. Itoh, P. Ordejo´n, and R. M. Martin, “Order-n tight-
binding molecular dynamics on parallel computers,” Com-
puter physics communications, pp. 173–185, 1995.
[13] T. M. Chan, “More algorithms for all-pairs shortest paths in
weighted graphs,” SIAM Journal on Computing, vol. 39, no. 5,
pp. 2075–2089, 2010.
[14] I. Yamazaki and X. S. Li, “On techniques to improve ro-
bustness and scalability of a parallel hybrid linear solver,” in
International Conference on High Performance Computing
for Computational Science, pp. 421–434, Springer, 2010.
[15] N. Bell, S. Dalton, and L. N. Olson, “Exposing fine-grained
parallelism in algebraic multigrid methods,” SIAM Journal on
Scientific Computing, vol. 34, no. 4, pp. C123–C152, 2012.
[16] G. Karypis, A. Gupta, and V. Kumar, “A parallel formulation
of interior point algorithms,” in Supercomputing’94: Proceed-
ings of the 1994 ACM/IEEE Conference on Supercomputing,
pp. 204–213, IEEE, 1994.
[17] S. Badia, A. F. Martı´n, and J. Principe, “A highly scalable par-
allel implementation of balancing domain decomposition by
constraints,” SIAM Journal on Scientific Computing, vol. 36,
no. 2, pp. C190–C218, 2014.
[18] H. Wang, J. Yang, H.-S. Lee, and S. Han, “Learning to
design circuits,” NeurIPS 2018 Machine Learning for Systems
Workshop, 2018.
[19] H. Wang, K. Wang, J. Yang, N. Sun, H.-S. Lee, and S. Han,
“Tts: Transferable transistor sizing with graph neural net-
works and reinforcement learning,” DAC 57, 2020.
[20] H. Mao, P. Negi, A. Narayan, H. Wang, J. Yang, H. Wang,
R. Marcus, r. addanki, M. Khani Shirkoohi, S. He, V. Nathan,
F. Cangialosi, S. Venkatakrishnan, W.-H. Weng, S. Han,
T. Kraska, and D. Alizadeh, “Park: An open platform for
learning-augmented computer systems,” in Advances in Neu-
ral Information Processing Systems 32, pp. 2490–2502, Cur-
ran Associates, Inc., 2019.
[21] G. Goumas, K. Kourtis, N. Anastopoulos, V. Karakasis,
and N. Koziris, “Understanding the performance of sparse
matrix-vector multiplication,” in 16th Euromicro Conference
on Parallel, Distributed and Network-Based Processing (PDP
2008), pp. 283–292, IEEE, 2008.
[22] J. Cong, Z. Fang, M. Lo, H. Wang, J. Xu, and S. Zhang,
“Understanding performance differences of fpgas and gpus,”
in 2018 IEEE 26th Annual International Symposium on
Field-Programmable Custom Computing Machines (FCCM),
pp. 93–96, IEEE, 2018.
[23] K. Matam, S. R. K. B. Indarapu, and K. Kothapalli, “Sparse
matrix-matrix multiplication on modern architectures,” in
2012 19th International Conference on High Performance
Computing, pp. 1–10, IEEE, 2012.
[24] Twitter, “Number of monthly active twitter users worldwide
from 1st quarter 2010 to 1st quarter 2019 (in millions),” 2019.
[25] G. E. Moore, “Cramming more components onto integrated
circuits.”
[26] J. L. Hennessy and D. A. Patterson, “A new golden age for
computer architecture: Domain-specific hardware/software
co-design, enhanced security, open instruction sets, and agile
chip development,” Turing Lecture, 2018.
[27] T. A. Davis and Y. Hu, “The university of florida sparse matrix
collection,” ACM Transactions on Mathematical Software
(TOMS), vol. 38, no. 1, p. 1, 2011.
[28] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford
large network dataset collection.” http://snap.stanford.edu/
data, June 2014.
[29] R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A.
Ang, “Introducing the graph 500,” Cray Users Group (CUG),
vol. 19, pp. 45–74, 2010.
[30] S. Galal and M. Horowitz, “Energy-efficient floating-point
unit design,” IEEE Transactions on computers, vol. 60, no. 7,
pp. 913–922, 2010.
[31] A. Shilov, “Jedec publishes hbm2 specification as samsung
begins mass production of chips,” 2016.
[32] M. Alfano, B. Black, J. Rearick, J. Siegel, M. Su, and J. Din,
“Unleashing fury: A new paradigm for 3-d design and test,”
IEEE Design & Test, vol. 34, no. 1, pp. 8–15, 2016.
[33] Intel, “Intel math kernel library.”
[34] M. Naumov, L. Chien, P. Vandermersch, and U. Kapasi,
“Cusparse library,”
[35] S. Dalton, N. Bell, L. Olson, and M. Garland, “Cusp: Generic
parallel algorithms for sparse matrix and graph computa-
tions,” 2014. Version 0.5.0.
[36] N. Bell, S. Dalton, and L. N. Olson, “Exposing fine-grained
parallelism in algebraic multigrid methods,” SIAM Journal on
Scientific Computing, vol. 34, no. 4, pp. C123–C152, 2012.
[37] C. Sanderson and R. Curtin, “Armadillo: a template-based
c++ library for linear algebra,” Journal of Open Source
Software, vol. 1, no. 2, p. 26, 2016.
[38] C. Sanderson and R. Curtin, “Practical sparse matrices in c++
with hybrid storage and template-based expression optimisa-
tion,” Mathematical and Computational Applications, vol. 24,
no. 3, 2019.
[39] C. Y. Lin, N. Wong, and H. K.-H. So, “Design space
exploration for sparse matrix-matrix multiplication on fpgas,”
International Journal of Circuit Theory and Applications,
vol. 41, no. 2, pp. 205–219, 2013.
[40] F. G. Gustavson, “Two fast algorithms for sparse matrices:
Multiplication and permuted transposition,” ACM Transac-
tions on Mathematical Software (TOMS), vol. 4, no. 3,
pp. 250–269, 1978.
[41] A. Azad, G. Ballard, A. Buluc, J. Demmel, L. Grigori,
O. Schwartz, S. Toledo, and S. Williams, “Exploiting multiple
levels of parallelism in sparse matrix-matrix multiplication,”
SIAM Journal on Scientific Computing, vol. 38, no. 6,
pp. C624–C651, 2016.
[42] W. Liu and B. Vinter, “An efficient gpu general sparse matrix-
matrix multiplication for irregular data,” in 2014 IEEE 28th
International Parallel and Distributed Processing Symposium,
pp. 370–381, IEEE, 2014.
[43] M. Fort, J. A. Sellare`s, and N. Valladares, “Intersecting
two families of sets on the gpu,” Journal of Parallel and
Distributed Computing, vol. 104, pp. 167–178, 2017.
[44] F. Zhang, D. Wu, N. Ao, G. Wang, X. Liu, and J. Liu, “Fast
lists intersection with bloom filter using graphics processing
units,” in Proceedings of the 2011 ACM Symposium on
Applied Computing, pp. 825–826, ACM, 2011.
[45] N. Ao, F. Zhang, D. Wu, D. S. Stones, G. Wang, X. Liu,
J. Liu, and S. Lin, “Efficient parallel lists intersection and in-
dex compression algorithms using graphics processing units,”
Proceedings of the VLDB Endowment, vol. 4, no. 8, pp. 470–
481, 2011.
[46] S. Han, L. Zou, and J. X. Yu, “Speeding up set intersections in
graph algorithms using simd instructions,” in Proceedings of
the 2018 International Conference on Management of Data,
SIGMOD ’18, (New York, NY, USA), pp. 1587–1602, ACM,
2018.
[47] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago,
A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, “Ex-
tensor: An accelerator for sparse tensor algebra,” in MICRO
52, p. 319333, ACM, 2019.
[48] L. Zhuo and V. K. Prasanna, “Sparse matrix-vector multipli-
cation on fpgas,” in Proceedings of the 2005 ACM/SIGDA
13th International Symposium on Field-programmable Gate
Arrays, FPGA ’05, (New York, NY, USA), pp. 63–74, ACM,
2005.
[49] E. Jamro, T. Pabis´, P. Russek, and K. Wiatr, “The algorithms
for fpga implementation of sparse matrices multiplication,”
Computing and Informatics, vol. 33, no. 3, pp. 667–684,
2015.
[50] D. Zou, Y. Dou, S. Guo, and S. Ni, “High performance sparse
matrix-vector multiplication on fpga,” IEICE Electronics Ex-
press, vol. 10, no. 17, pp. 20130529–20130529, 2013.
[51] Y. Elkurdi, D. Ferna´ndez, E. Souleimanov, D. Giannacopou-
los, and W. J. Gross, “Fpga architecture and implementation
of sparse matrix–vector multiplication for the finite element
method,” Computer Physics Communications, vol. 178, no. 8,
pp. 558–570, 2008.
[52] P. Grigoras¸, P. Burovskiy, W. Luk, and S. Sherwin, “Op-
timising sparse matrix vector multiplication for large scale
fem problems on fpga,” in Field Programmable Logic and
Applications (FPL), 2016 26th International Conference on,
pp. 1–9, IEEE, 2016.
[53] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz,
and W. J. Dally, “Eie: efficient inference engine on com-
pressed deep neural network,” in ISCA, pp. 243–254, 2016.
[54] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie,
H. Luo, S. Yao, Y. Wang, et al., “Ese: Efficient speech
recognition engine with sparse lstm on fpga,” in Proceedings
of the 2017 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, pp. 75–84, ACM, 2017.
[55] S. Hochreiter and J. Schmidhuber, “Long short-term mem-
ory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[56] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkate-
san, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally,
“Scnn: An accelerator for compressed-sparse convolutional
neural networks,” in ISCA, pp. 27–40, June 2017.
[57] A. Gondimalla, N. Chesnut, M. Thottethodi, and T. N. Vi-
jaykumar, “Sparten: A sparse tensor accelerator for convolu-
tional neural networks,” MICRO ’52, pp. 151–165, 2019.
[58] M. Zhu, T. Zhang, Z. Gu, and Y. Xie, “Sparse tensor core:
Algorithm and hardware co-design for vector-wise sparse
neural networks on modern gpus,” MICRO ’52, pp. 359–371,
ACM, 2019.
[59] S. Asano, T. Maruyama, and Y. Yamaguchi, “Performance
comparison of fpga, gpu and cpu in image processing,” in
Field programmable logic and applications, 2009. fpl 2009.
international conference on, pp. 126–131, IEEE, 2009.
[60] C. Cullinan, C. Wyant, T. Frattesi, and X. Huang, “Computing
performance benchmarks among cpu, gpu, and fpga,”
[61] H.-T. Kung, “Why systolic architectures?,”
