Expressing Sparse Matrix Computations for Productive Performance on
  Spatial Architectures by Rong, Hongbo
Expressing Sparse Matrix Computations for
Productive Performance on Spatial Architectures
Hongbo Rong
Parallel Computing Lab (PCL), Intel
hongbo.rong@intel.com
Abstract
This paper addresses spatial programming of sparse matrix
computations for productive performance. The challenge is
how to express an irregular computation and its optimiza-
tions in a regular way.
A sparse matrix has (non-zero) values and a structure. In
this paper, we propose to classify the implementations of
a computation on a sparse matrix into two categories: (1)
structure-driven, or top-down, approach, which traverses the
structure with given row and column indices and locates the
corresponding values, and (2) values-driven, or bottom-up,
approach, which loads and processes the values in paral-
lel streams, and decodes the structure for the values’ cor-
responding row and column indices.
On a spatial architecture like FPGAs, the values-driven
approach is the norm. We show how to express a sparse ma-
trix computation and its optimizations for a values-driven
implementation. A compiler automatically synthesizes a
code to decode the structure. In this way, programmers fo-
cus on optimizing the processing of the values, using famil-
iar optimizations for dense matrices, while leaving the com-
plex, irregular structure traversal to an automatic compiler.
We also attempt to regularize the optimizations of the reduc-
tion for a dynamic number of values, which is common in a
sparse matrix computation.
1. Introduction
This paper addresses high-performance high-productivity
programming of sparse matrix computations on spatial ar-
chitectures like FPGAs. Such computations are usually con-
sidered irregular in terms of memory access patterns and
parallelism, etc. For productive performance, however, we
should hide the irregularity and express the computations
and their optimizations in a regular way. How to do so re-
mains an open problem. There have been some efforts on
CPUs [2, 11], but no work has ever been done on spatial
architectures, as far as we know.
A sparse matrix is a 2-dimensional storage with a lot of
zeros in it. This storage is a unique dense representation,
where every datum, zero or not, is stored, and is located by
a unique row and column index.
However, a sparse matrix can have many sparse represen-
tations. A sparse representation has values and a structure,
where the values include all the non-zeros, and the struc-
ture is a mapping from the row and column indices to the
values. Popular sparse representations include CSR (Com-
pressed Sparse Row), CSC (Compressed Sparse Column),
and their blocked versions, etc [1, 2].
Fig. 1 illustrates a few representations of a sparse matrix.
Fig. 1(a) shows its dense representation. Fig. 1(b) is a sparse
representation in a tree, where every dimension corresponds
to a tree level, and the values are located at the leaf nodes.
This is the way proposed in The Tensor Algebra Compiler
(TACO) [11] 1. This tree representation is equivalent to a
CSR representation(Fig. 1(c)).
In this paper, we propose to classify the implementations
of a computation on a sparse matrix into two categories:
• Structure-driven, or top-down, approach, which tra-
verses the structure with given row and column indices,
and locates the corresponding values.
Take sparse-matrix dense-vector multiply (SpMV), an
important kernel in sparse linear algebra, for example:
y(i) =
∑
∀j,A(i, j) 6=0
A(i, j) · x(j), (1)
Here y and x are dense column vectors, A is a sparse
matrix, and A(i, j) refers to the value of matrix A with
row index i and column index j. In order to realize the
computation, the structure of matrix A is traversed, and
the corresponding values are located and loaded.
For the tree representation in Fig. 1(b), the tree is tra-
versed top-down until a value is reached at the bottom.
This structure-driven approach is commonly seen in CPU
implementations [11] and sometimes on GPUs [6].
• Values-driven, or bottom-up, approach, which loads
and processes the values in parallel streams, aiming to
1 In TACO, a tree is used to conceptually illustrate how a tensor can be
represented. The physical representation is in dense vectors. A tensor is a
generalization of a matrix to higher dimensions.
1
ar
X
iv
:1
81
0.
07
51
7v
1 
 [c
s.M
S]
  1
2 O
ct 
20
18
A   B    
    C  
      DR
o
w
s 
(i
)
Columns (j)
0  1  2   3
0   
1   
2
(a) Dense represen-
tation. Zeros not
shown.
Matrix
0     1      2  
0    2   2      3
A   B  C     DValues
Rows (i)
Columns (j)
Structure
(b) A sparse representation in a tree.
A B   
C 
D
Values
0 2   
2 
3
Structure
Row 
lengths
Column 
indices
2   
1 
1
(c) The equivalent CSR
representation.
Figure 1: A sparse matrix’s representations.
A B   
C 
D
Values
Schedule rows to 
2 machines ASAP
A B   
C D
Dense 
representation Structure
Columns (j')
0  1
R
o
w
s 
(i
')
0   
1
Row 
lengths
Column 
indices
A   B   
    C  
      DR
o
w
s 
(i
)
Columns (j)
0  1  2   3
0   
1   
2
Pack 
each row
0 2   
2 
3
Values
0 2   
2 3
2   
1 1
for machine 0
for machine 1
Structure
Row 
lengths
Column 
indices
2   
1 
1
Figure 2: Illustration: Evolving the representation of a sparse matrix.
maximize memory bandwidth, data locality and paral-
lelism. The structure is decoded to locate the indices of
the values [7, 19, 9, 3, 4].
The decoding of the structure is bottom-up: if we take the
tree representation in Fig. 1(b) again, for a value, we need
traverse the tree bottom-up so as to determine the indices
of a value.
This values-driven approach particularly suits a spatial
architecture: the key to get high performance there is
to stream values in and consume the values in massive
parallelism. It would be cumbersome if a structure has to
be decoded first in order to load values.
For our purpose of high-performance high-productivity
spatial programming, we need productively express values-
driven implementations. As we said, a sparse matrix compu-
tation like that in Equation 1 is suitable for a structure-driven
implementation. How could we re-express the computation
to make it amenable to a values-driven implementation, in-
stead?
We observe that a sparse matrix can be represented by a
series of invertible transformations, including but not limited
to, packing, blocking, and job scheduling. Packing puts the
non-zeros of the matrix together along the rows or columns
of the matrix. Blocking divides the non-zeros into blocks.
Job scheduling schedules the non-zeros, in rows, columns,
or blocks, as jobs to some “machines”, which are hardware
units with identical logic and will process the non-zeros in
parallel.
These transformations must be invertible: their resulting
structure must encode the original row and column indices
in some way so that the one-to-one correspondence between
a value and its indices is still maintained.
Fig. 2 illustrates the observation with the same sparse ma-
trix in Fig. 1. Starting from the dense representation, first
pack the non-zeros in each row together, and we get the CSR
sparse representation. Then we schedule the rows to 2 “ma-
chines” in ASAP (As Soon As Possible) policy, resulting
another representation called CISR (Condensed Interleaved
Sparse Representation). Because the job scheduling sched-
uled the jobs to 2 machines, in this final representation, there
are 2 rows of values. The first (second) row of the values,
along with related structure information, will be fed to the
first (second) machine. Each machine accepts its inputs as a
stream and processes the stream independently. This is es-
sentially the SpMV design proposed by Fowers, et al. [7],
where the machines processes the streams by multiply-add.
This is the first time that job scheduling is proposed as
a primitive transformation for representing a sparse matrix:
since we target a spatial architecture that allow many hard-
ware units (i.e. machines) with the exactly the same circuit
to perform computation in parallel, scheduling values to the
machines is a natural abstraction.
The transformations can be applied more than once, but
the last transformation should always be a job scheduling,
for the same reason stated above. This last job scheduling
gives a value two new indices: a machine identifer i′, and
the position j′ of a value within the stream scheduled to that
machine.
From now on, we will use i and j to represent the row
and column index of a value in the dense representation,
respectively, and i′ and j′ to represent the new indices in
the final sparse representation, respectively. The two pairs
of indices are 1-1 corresponding to each other. We call the
mapping from (i, j) to (i′, j′) the forward mapping, and the
mapping from (i′, j′) to (i, j) the inverse mapping.
2
With the inverse mapping, we can re-express a sparse
matrix computation, e.g. SpMV in Equation 1, as follows:
y((i′, j′) −→ i) =
∑
∀i′,j′
A.values(i′, j′)·x((i′, j′) −→ j).
(2)
Here A.values(i′, j′) represents the j′’th value to be pro-
cessed by machine i′, and (i′, j′) −→ i and (i′, j′) −→ j
are the inverse mapping. For each machine i′, we may incre-
ment j′ so that effectively, each machine is loading a stream
of data, without accessing the structure of the matrix. This is
exactly what a high-performance implementation on a spa-
tial hardware would behave. Therefore, the above equation
enables a computation to be driven by values.
Remember that A.values, x and y are all densely stored.
Thus one may apply the familiar dense-matrix/vector opti-
mizations, such as tiling, unrolling, etc., in optimizing the
computation in Equation 2.
The inverse mapping, (i′, j′) −→ i and (i′, j′) −→ j,
may be automatically generated by a compiler by inverting
the high-level transformations used for a sparse representa-
tion like packing, etc. Usually, a traversal of a structure for
either the forward mapping or the inverse mapping would
require code to be written in an imperative language with
complex array indirections and control flow, which is espe-
cially complicated when processing many values in parallel.
The diverse representations a sparse matrix might have fur-
ther increase the complexity. Now that the complex, irregu-
lar structure traversal is automatically handled by a compiler,
programmers can be more productive and focus on optimiz-
ing the processing of values instead.
In this paper, we further propose how to regularize and
optimize a reduction with dynamic number of values, which
is common in a sparse matrix computation. We show how
to partition a reduction into two levels of reductions, and
specify useful properties that help a compiler to generate
efficient reduction circuits.
Fig. 3 summarizes our approach how to, in general, ef-
ficiently implement a sparse matrix computation on a spa-
tial hardware. Starting from the dense representation of the
sparse matrix, a programmer specifies a sparse representa-
tion as a forward mapping, using a series of high-level trans-
formations, including packing, blocking, and job scheduling,
etc. A compiler automatically constructs an inverse map-
ping. The programmer specifies the optimizations of the ma-
trix so that the values and the inverse mapping are fed into
the machines that the last job scheduling targeted. Between
the machines, there might be interactions, for example, re-
duction. Finally, the program specifies the results of the com-
putation to be stored back to the external memory of the spa-
tial architecture.
Our language is named T2S (Temporal To Spatial), which
is an extension of the Halide language [14] to spatial archi-
tectures. Previously, we have proposed T2S in the context of
dense matrices [15], where a sparse matrix cannot be directly
represented unless it is pre-processed into a dense matrix on
the host, and for an inverse-mapping, an external function
written in an imperative language has to be introduced.
In this paper, we show how to further extend T2S to
express and process sparse matrices directly. To the best of
our knowledge, this is the first time a sparse computation can
be specified in a productivity programming language so that
its optimizations can also be specified in a way much like for
a dense matrix computation.
Below we introduce the language extension of T2S for
sparse matrices, and illustrate the approach with two high-
performance SpMV designs on FPGAs. We skip a detailed
description of the T2S language but introduce it briefly when
explaining the examples. Then we discuss related work,
point out future directions, and conclude the paper.
2. Language
We extend T2S in two aspects for sparse matrices. One
aspect is how to specify forward and inverse mapping, the
other is how to specify optimizations of reductions. The
other optimizations like tiling, unrolling, buffering, etc. are
the same as for dense matrices.
2.1 Forward and Inverse Mapping
We explicitly express a sparse matrix as the result of a series
of invertible transformations. For example, the following
specification expresses an input matrix A in CSR sparse
representation:
1 Representation CSR;
2 Parameter A(float, 2, CSR);
3 Var col_idx;
4 CSR = A.dense()
5 .pack(Columns, Index(col_idx));
Here Line 1 declares a representation named CSR, but
exactly what it is is not defined yet. Line 2 says that input
parameter A is a 2-dimensional floating-point matrix in CSR
representation. Line 3 declares a variable col idx, which will
be used to refer to the column indices in the CSR repre-
sentation. Then we define CSR: starting from the matrix’s
dense representation (Line 4), pack the non-zeros along the
column dimension (Line 5). The packed dimension is rep-
resented sparsely based on indices, referred to by the pre-
declared variable col idx, which gives us a handle to access
the indices later.
As illustrated in Fig. 1, the sparse representation of a
sparse matrix can continuously evolve. Continue with the
above example. There can be a subsequent statement
6 Representation CISR = A.schedule(Rows, 4,
ASAP, Zero);
This statement further schedules the rows of matrix A to
4 machines in the ASAP policy, and pads the machines with
zeros at the end if needed, so that every machine has the
same number of values to process. At this moment, all we
3
Dense 
representation
Values
Structure
Forward mapping
(Packing, blocking, 
Job scheduling)
  resultreduction
Inverse mapping
(generated by a compiler)
Row index
Column index
Machine
Machine
Figure 3: Our approach for high-performance sparse matrix computation.
need is only the total number of machines. Exactly how a
machine behaves can be defined later.
Below we define the language elements:
• Representation r
Declare a representation named as r.
• Parameter matrix(type, total dimensions, r)
Declare an input matrix with the given type, number of
dimensions, and representation r.
• matrix.dense()
The dense representation of the sparse matrix.
• matrix.pack(d, r)
Pack the matrix along dimension d, which is either Rows
or Columns. The resulting representation r can be index-
based, i.e. Index(idx), or with a length as well, i.e.
Index(idx), Length(len), where idx and len are
variables recording the indices and length of the values
along the packed dimension.
• matrix.block(d, f, padding policy, r)
Block dimension d with factor f, with the given padding
policy. The resulting representation r is Length(b),
where b is a variable for the number of blocks along the
dimension.
• matrix.schedule(d, m, scheduling policy,
padding policy)
Schedule dimension d to m number of machines, with
a given scheduling policy. At the end, balance the
machines’ workloads in a given padding policy.
• matrix.values(i′, j′)
After the above job scheduling, access one value of the
sparse matrix by the machine identifier i′ and the posi-
tion j′ of the value within the stream scheduled to ma-
chine i′.
• matrix.inverse map(i′, j′)
After the above job scheduling, for a value scheduled to
machine i′ at position j′, return an expression that rep-
resent the row and column index in the corresponding
dense representation of the sparse matrix. Let that ex-
pression be e, we can extract the row and column index
by e[0] and e[1], respectively.
• Func.load structure(var1, var2, ...)
Load the structure information specifically referred to by
var1, var2, ... These variables are those used in pack
and block above.
2.2 Reduction
Reduction is a very common computation for sparse matri-
ces. Unlike a reduction for a dense matrix or vector, one re-
duction for a sparse matrix usually has dynamic number of
input values.
In this section, we will use the following conceptual ex-
ample:
y((i′, j′) −→ i) = initial value;
y((i′, j′) −→ i) += f(A.values(i′, j′));
where f is an arbitrary function. The above two definitions
of y, respectively, are an initial definition, directly referred
to as y in a specification, and an update definition, referred
to as y.update() in a specification.
Usually, it is unknown to the compiler how many (i′,
j′)’s are mapped to the same i, and whether these (i′,
j′)’s appear next to each other to a machine. To help com-
piler generate efficient reduction circuit, we can specify if
the (i′, j′)’s that are mapped to the same i appear continu-
ously to a machine, and the maximum number of (i′, j′)’s
that are mapped to the same i to a machine. Further, if the
number of (i′, j′)’s that are mapped to the same i must be
a multiple of some integer number, we can block these (i′,
j′)’s: reduce the corresponding values in each block to a
local sum, then reduce the local sums of the blocks. This
two-level reduction provides more optimization opportuni-
ties: the two reduction circuits can be optimized differently.
If several machines with different i′ reduce to the same
y(i), we may combine their reduction results and write into
y(i) only once.
Below we describe the language features.
• f.is continuous reduction(l)
For a specific iteration of loop l in function f, the func-
tion reduces to the same result location continuously be-
fore the function reduces to another result location.
• f.is distinct reduction(l)
For two different iterations of loop l in function f, the
function reduces to different result locations.
• f.isolate reduction(F)
Isolate out of function f the reduction of values into a
separate function F. This creates a two-level reduction: F
reduces the values, and f reduces the results of F.
For the above example, we can isolate out a new reduc-
tion Y in the following specification:
Func Y;
4
y.update().isolate_reduction(Y);
This produces the following:
Y((i′, j′) −→ i) = 0;
Y((i′, j′) −→ i) += f(A.values(i′, j′));
y(i) = initial value;
y(i) += Y(i);
Basically, Y(i) is a local sum. For the same i, Y might
reduce several values and send the local sum to y, and
then do the same for the next several values, and so on.
Every time Y needs to send y a local sum, it also sends
the current i to tell exactly which y(i) to reduce this local
sum for. The sending of i can be added automatically by
a compiler.
• f.combine reduction(l, target info)
Combine all the reductions of all the iterations
of loop l into a single reduction with additional
target information, which can be Same Target or
Maybe different Targets.
For the example at the beginning of Section 2.2, we can
combine the reductions of the iterations of loop i′ into a
single reduction:
y.update().combine reduction(i′,
Same Targets);
This transforms the original loop nest of the update defi-
nition of function y:
for j′
for i′
y((i′, j′) −→ i) += f(A.values(i′, j′));
into the following:
for j′
sum = 0
for i′
sum += f(A.values(i′, j′));
y((0, j′) −→ i) = sum;
If, however, we specify
y.update().combine reduction(i′,
Maybe different Targets);
the original loop nest will be transformed into the follow-
ing instead:
for j′
sum = 0
(0, j′) −→ current i
for i′
(i′, j′) −→ i
if (current i == i)
sum += f(A.values(i′, j′));
else
y(current i) = sum;
sum = f(A.values(i′, j′));
current i = i
y(current i) = sum;
• f.reduction circuit(circuit structure, l)
The reduction function f is implemented in a circuit with
a given structure, including Tree and Linear Array.
Usually, the reduction structure reduces the inputs level
by level. The maximum number of levels in the circuit
structure is l.
3. Case Studies
In this section, we illustrate our approach with two high-
performance designs of SpMV on FPGAs.
3.1 SpMV on an FPGA with High Memory Bandwidth
In an implementation of SpMV on an FPGA [7] for the
above Equation 2, j′ is incremented by 1 every time such
that the values for the same i′ are loaded from memory
into an FPGA sequentially, forming a stream. And several
streams for different i′’s are formed in parallel. The final re-
sult, vector y, is stored back to the memory also as a stream.
This implementation is focusing on the values, and can be
very efficient for spatial architectures with high memory
bandwidth.
Fig. 4 illustrates the implementation. Starting from the
dense representation of a sparse matrix , pack together all
non-zeros in each row, we get the CSR sparse representa-
tion. Let us say the target FPGA has 4 memory channels. We
would like the matrix to be read from the 4 memory channels
in parallel for the maximal memory bandwidth. If we treat
the 4 memory channels as 4 “machines”, and the compressed
sparse rows as “jobs”, we can schedule the jobs one by one
to the machines: a job is scheduled to the earliest available
machine; the last job for a machine may be padded with ze-
ros so that all the machines have the same number of values
to process. The resulting sparse presentation is CISR [7]. A
similar but trivial example has been described in Section 1
before.
Now with the CISR sparse representation, every mem-
ory channel of the FPGA can read the rows scheduled to
it as a sequential stream, and all the memory channels read
their streams in parallel. More specifically, this is done by a
hardware unit, i.e. 4 A loaders, in Fig. 4. The loaded con-
tents include row lengths, column index j and values. A de-
coder figures out the row index i from the row lengths to
each A loader. This decoder is realizing an inverse map-
ping (i′, j′) −→ i based on the row lengths. Vector x
and y are buffered on the FPGA. Vector x is loaded into
its buffer with x loader in the figure. Using the row in-
dex i and column index j, x(j) and y(i) can be easily re-
trieved. Now all the necessary information are available for
SpMV defined in Equation 2. The computation is straight-
forward: 4 multipliers work in parallel to compute 4 differ-
ent A.values(i′, j′) * x(j). The products are reduced
5
A     B  
      C 
        D E 
  F     G   H
  I       J K
      L       M
  N       O  
            P 
Pack each row
Schedule the rows to 4 
machines ASAP
multiplier
irow len
A_loader
FPGA
buffer 
for x
*
*
*
*
col_idx
+
+
x
x
_
lo
a
d
e
r
y
_
u
n
lo
a
d
e
r
y
x(j)
A.values(i',j')
b
u
ffer
 fo
r
 y
y(i)
Fused 
accumulator
decoder
A B L M   
C I J K
D E N O
F G H P
Dense representation
Structure
Columns (j')
0   1  2   3
R
o
w
s 
(i
')
Row 
lengths
Column 
indices
R
o
w
s 
(i
)
Columns (j)
0  1  2   3   4  5   6  7
0   
1   
2
3
4
5
6
7
Values
0 3 3 7
3 1 5 6
4 5 1 5
1 4 6 6
2 2
1 3
2 2
3 1
for machine 0
for machine 1
Values
A B  
C 
D E 
F G H
I J K
L M
N O  
P 
0 3  
3 
4 5 
1 4 6
1 5 6
3 7
1 5  
6 
for machine 2
for machine 3
0   
1
2
3
Structure
Row 
lengths
Column 
indices
2  
1 
2 
3
3
2
2  
1 
x_feeder
Figure 4: Illustration: an SpMV design on an FPGA [7]. The example sparse matrix is also from [7].
by, say 2, adders. Each adder processes two streams alterna-
tively. Finally, the resulting vector y is stored back to mem-
ory by a hardware unit, y unloader, in the figure.
Note that the entire hardware in the figure is a regular
dataflow graph. All the hardware units are purely functional.
The only exception is the decoder, which is stateful and the
output of it depends on its previous output. Because it is
not really functional, usually, one might have to write it in
an imperative language, not a functional language. In our
approach, however, a compiler should automatically gener-
ate such a decoder, i.e. generate the hardware for the in-
verse mapping (i′, j′) −→ i, automatically connecting
the hardware with other hardware units in the figure based
on the producer-consumer relationship.
Fig. 5 shows the T2S specification. There are 3 parts:
sparse representation, temporal definition, and spatial map-
ping.
The sparse representation (Line 1-6) defines the sparse
representation of the input matrix A as a series of trans-
formations. Line 1 declares a sparse representation called
CISR. Line 2 says that A is a 2-dimensional floating-point
matrix with CISR sparse representation, and x is a floating-
point vector. Line 3 declares two variables to be used next.
Line 4-6 actually defines the sparse representation: starting
6
1   Representation CISR;
2   Parameter         A(float, 2, CISR), x(float, 1);
3   Var                   col_idx, row_len;
4   CISR = A.dense()
5                   .pack(Columns, Length(row_len), Index(col_idx))
6                   .schedule(Rows, 4, ASAP, Zero);
7   Var      i';
8   RDom j' (0, A.values.extent(1));
9   Expr    original_ij = A.inverse_map(i', j');
10 Expr    i = original_ij[0];
11 Expr    j = original_ij[1];
12 Func    y;
13 y(i)   = 0;
14 y(i) += A.values(i', j') * x(j);
15 y.bound(i, 0, x.extent(0)); 
16 y.update().bound(i', 0, 4).reorder(i', j');
                    
17 Func  multiplier, y_unloader;
18 y.update().isolate_producer(A.values(i', j') * x(j), multiplier)
19                 .isolate_consumer(y, y_unloader);
20 Func  A_loader, x_loader, x_feeder;
21 multiplier.isolate_producer(A.values, A_loader)
22                 .isolate_producer_chain(x, x_loader, x_feeder);
23 A_loader.load_structure(row_len, col_idx)
24                .unroll(i');
25 x_feeder.buffer(x);
26 multiplier.unroll(i'); 
27 Var i'';
28 y.update().is_continuous_reduction(i').is_distinct_reduction(i')
29                 .tile(i', i'', 2).unroll(i'');
30 y_unloader.buffer(y)
31                   .bound(i, 0, x.extent(0));
Temporal 
definition
Build a basic 
spatial layout
Specialize 
individual 
computations
Spatial 
mapping
Sparse 
representation
Figure 5: Specification for the SpMV design in Fig. 4.
from the dense representation of the sparse matrix, first, pack
the nonzeros along the column dimension for each row, and
record the total number and the column indices of the non-
zeros in each row (row len and col idx). Then schedule
the rows to 4 machines in ASAP policy with zero-padding at
the end if needed.
The temporal definition (Line 7-16) defines what to com-
pute. Line 10-11 defines the row and column index as two
expressions based on an inverse mapping (Line 9). The ma-
chine identifier i′ is declared as normal variable (Line 7),
while the position j′ of a value scheduled to a machine is a
reduction variable whose range is from 0 to the last column
of A.values.
Line 14 defines the SpMV computation, according to
Equation 2. The result vector y is initialized as 0 beforehand
in Line 13. We know that vector y is equally long as the input
vector x. Thus instead of calculating every i from an inverse
mapping, we can simply iterate i over the same range as
vector x to initialize each y(i) directly (Line 15).
Line 16 says that the machine identifier i′ is from 0 to
4 (not included), as there are 4 machines. Further, the loops
are ordered such that loop i′ is the inner loop, and loop j′ is
the outer loop.
Line 17-22 isolate the computation in Line 14 into multi-
ple pieces. Line 18-19 first isolate the expression in the right-
hand side of Line 14 into a function named multiplier,
and then isolate the saving of the result vector y into an-
other function y unloader. Further, Line 21 isolates from
the multiplier the loading of the values of matrix A to a func-
tion A loader. Then Line 22 isolates from the multiplier the
loading of vector x into a function x loader, which trans-
fers the loaded values to another function x feeder, which
transfers the values to function multiplier.
Line 23-31 specialize each function. In addition to load-
ing the values of A, A loader is also specified to load
7
the structure information row len and col idx (Line 23).
Further, we unroll loop i′ in A loader, which creates 4
instances of A loader (Line 24). Line 25 lets x feeder
buffers the x values in an on-chip scratchpad memory, whose
size will be automatically calculated by the compiler.
Line 26 fully unrolls loop i′ in multiplier to create 4
instances of it. Line 28 says that the reduction for y is contin-
uous for each iteration of loop i′, and is distinct for different
iterations of loop i′. Then Line 29 splits loop i′ by a fac-
tor of 2, and fully unrolls the newly created loop i′′ to cre-
ates two reduction circuits (i.e. the two fused accumulators
in Fig. 4). Line 30 buffers the entire result vector y before
sending any data of it out. Line 31 says that in sending out
the data of the buffered results of y, we can directly iterate
i with the same range as the input vector x, instead of com-
puting i indirectly from the inverse mapping (i′, j′) −→
i.
The inverse mapping is specified as A.inverse map()
in Line 9. The mapping should be realized by a compiler
automatically to generate the decoder in Fig. 4.
3.2 Illustration: Another SpMV design on FPGAs [19]
This design is very different from the previous one in how
the data are represented and processed. Fig. 6 illustrates this
design. After packing the columns along each row of the
dense representation, we get the CSR sparse representation.
Then with a constant k (e.g. 4), block each row by a factor
of k/2 with zero-padding for the last block if needed. After
blocking, a row is divided into blocks, and each block has
k/2 values. Then schedule the blocks as jobs to 2 machines
in the ASAP policy.
Each machine computes the sum of a block. For this pur-
pose, it uses k/2 multipliers, each for one of the values in
the block. Each multiplier gets the values and their column
indices from another hardware unit, A loader. With a col-
umn index, the multiplier reads a value of x from a buffer,
which was filled beforehand by a hardware unit, x loader.
Now with a value from the matrix A and the corresponding
value from the vector, the multiplier can compute their prod-
uct. For all the multipliers in the same machine, their results
are added together in a tree of adders (For the particular case
shown in Fig. 6, there is only 1 adder in the tree).
For the two machines, their corresponding blocks might
be from the same row, or from two adjacent rows. A
row blocks loader loads the structure information of the
total blocks in each row, and sends the information to a
decoder, which decides if the two blocks currently processed
in the two machines belong to the same row or not. If yes,
the decoder controls an adder to add together the two sums
of the two blocks. Otherwise, the first sum is added with 0.
The adder’s result and the row index is sent to a reduction
circuit. The second sum and the corresponding row index is
also sent to the reduction circuit but with some delay. The
reduction circuit is accumulating the sums for all the blocks
in the same row. It is composed of a linear arrays of small
buffers and adders. With the row indices from the input, the
reduction circuit knows when a row is completely reduced,
and sends the results to another hardware unit, y unloader,
to save to memory.
Fig. 7 shows the T2S specification for the design. Line
1-8 describe the sparse representation of the input matrix A.
Line 2 says that A is a 2-dimensional floating-point sparse
matrix in a representation called rep. How rep is defined?
Starting from the dense representation of the matrix (Line 5),
pack the non-zeros along the column dimension of each row
and record the column indices of the non-zeros (col idx)
(Line 6). Then block the columns of each row into blocks
with a length of k/2 (Line 7), where k is a symbolic constant
(Line 3), and pad the last block with zeros if needed. The
number of the blocks of a row is referred to by row blocks.
Finally, schedules the blocks to 2 machines in ASAP policy,
padding the last machine with a zeroed block if needed (Line
8).
Line 9-18 defines the temporal computation. Not surpris-
ingly, it is the same as the temporal definition in the previ-
ous example in Fig. 5, except the number of machines are
changed from 4 to 2, the inner loop is j′ and the out loop is
i′ (Line 18).
Line 19-27 builds a basic spatial layout. Line 20 isolates
a y unloader to store the final results of vector y to the ex-
ternal memory. Line 21 isolates the reduction of y so that the
reduction is done in two levels: there is an additional reduc-
tion circuit Y to compute local sums at the first level, which
are further reduced at the second level. Line 23 isolates out
the loading of vector x. Line 24 then unrolls loop i′ to cre-
ate 2 machines. Line 25-26 split loop j′ by a factor of k/2
and unroll the newly created loop j′′ so that each machine
has k/2 multipliers. Finally, Line 27 isolates the loading of
matrix A.
Line 28-35 specialize the individual computations. Line
28 buffers vector x for Y. Then we combine the reductions
in the same machine(Line 29). We also combine the results
of the 2 machines, but they may be reduce to two different
result locations (Line 30).
Line 31 says that the local sums sent from function Y
to function y must be continuous for a row. The reduction
circuit for function y is chosen to be implemented in a linear
array with 4 levels.
In addition to loading of A values, A loader also loads
the structure information, col idx (Line 33).
Line 34 declares a special function, row block loader,
which is neither defined nor isolated. This is because it is to
fetch a part of the sparse structure information, the number
of blocks per row. We simply specify that functionality in
Line 35.
The inverse mapping is specified as A.inverse map()
in Line 11. The mapping should be realized by a compiler
automatically to generate the decoder in Fig. 6.
8
4. Related Work
We first discuss related work from the perspectives of im-
plementing a sparse matrix computation, and then from the
perspectives of languages.
4.1 Structure-Driven vs. Values-Driven Approach
In this paper, we classify algorithms for sparse matrix (or
more generally, tensor) computations into two categories:
the first is structure-driven, or top-down; the second is
values-driven, or bottom-up.
Structure-driven algorithms are commonly seen for
latency-oriented CPUs and sometimes for throughput-
oriented GPUs. TACO [11] provides a unified and general
way to represent all possible sparse tensors, including sparse
matrices. Conceptually, every sparse tensor can be viewed as
stored in a tree, with every tree level being a dense or sparse
dimension of the tensor, and the leaves being the values of
the tensor. The compiler will also automatically optimize the
traversing of the structures of several related tensors. This
traversal of the sparse structures is an efficient way of for-
ward mapping, not the inverse mapping we propose in this
paper. Thus the code generated by TACO is top-down, and
is driven by structures.
In TACO, programmers control the representations of a
sparse tensor, but have no control of subsequent optimiza-
tions so far. In contrast, we would like programmers have
explicit control not only in the representations of sparse ma-
trices, but also how to optimize them. Also we target FPGAs,
while TACO targets CPUs so far.
Gustavson [10] proposed an algorithm for permuted
transposition, and an algorithm for the multiplication of
two sparse matrices (SpGEMM). Both algorithms follow the
structure of a sparse matrix to get the values. Thus the algo-
rithms are structure-driven. Based on Gustavson’s algorithm
for SpGEMM, Deveci et al. [6] developed different parallel
algorithms for portable performance accross different CPUs
and GPUs. They consider the trade-offs on parallelism (how
to partition and parallelize a reduction accross different com-
pute units), how to determine the size of the result matrix,
and efficient data structures of a reduction.
Values-driven approach seems to be the norm for sparse
matrix computation on FPGAs. This is not surprising: due
to the massive hardware parallelism of FPGAs, their slower
frequencies relative to CPUs/GPUs, and their lack of hard-
ware caches, a high-performance implementation on FPGAs
would have to stream values in (maybe in several parallel
streams) to avoid any random memory access, and process
them in pipelines. Meanwhile, since FPGAs have abundant,
programmable hardware resources, the implementation can
use additional hardware resources to decode the structure of
a sparse matrix for the indices of the values simultaneously.
Values-driven approach also appears in GPU implemen-
tations. Greathouse and Daga [9] map SpMV efficiently to a
GPU by streaming the values of a sparse matrix into the local
scratchpad memory of the GPU, and dynamically assigning
rows to each parallel GPU compute unit. Bell et al. [3] show
an SpGEMM algorithm based on COO (coordinate) repre-
sentation. Bell and Garland [4] compare several SpMV im-
plementations based on different representations including
DIA (diagonal), ELL, CSR and COO. While there are no
detailed algorithm descriptions, we believe their algorithms
are driven by values, because they target to distribute fine-
grain parallelism among thousands or tens of thousands of
threads, which might be difficult for a structure-driven ap-
proach.
4.2 Languages
Our language, T2S, is an extension of Halide [14], a domain-
specific language for image processing on CPUs/GPUs, to
spatial architectures. Halide-HLS [13] is another spatial ex-
tension of Halide, where a dataflow graph of functions is
specified to offload to an FPGA, with line buffers between
functions to optimize their communication. As far as we
know, sparse matrices are not expressible in the current
Halide and Halide-HLS.
Spatial[12] is another domain-specific language for spa-
tial architectures. Its syntax directly supports streaming,
channels (i.e. pipes), reduction, finite state machine, loop
parallelization, etc. There is no detail specific to sparse ma-
trices, but it was evaluated with PageRank, a sparse matrix
or graph algorithm. From its syntax, we believe that sparse
matrices are not directly supported; instead, a values-driven
approach might be implemented: the values of a sparse ma-
trix may be streamed into a spatial hardware, and the struc-
ture of the matrix may be accessed with loops to build the
inverse mapping. However, in this way, a programmer has to
write in detail how to build the inverse mapping, while in our
approach, we may have a compiler to automatically build it.
The Little Language (LL) [2] has also observed that a
sparse matrix can be represented by a series of transforma-
tions. However, LL does not point out that the transforma-
tions are invertible, a simple yet critical fact that makes it
possible for a compiler to automatically construct an inverse
mapping. The usage of the LL language was for verification
of sparse matrix codes, which is structure-driven. In addi-
tion, the transformations in LL have much lower abstraction
level. For example, blocking and job-scheduling here would
have to be expressed in other lower-level primitives in LL.
In our approach, we propose the transformations at a
higher abstraction level in order to make it easy for a com-
piler to invert the transformations and automatically gener-
ate the inverse mapping, which is the key to enable a values-
driven implementation.
5. Future Work
This paper is a first step to expressing and optimizing sparse
matrix computations in a values-driven style on spatial archi-
tectures. There are many interesting questions left for future:
9
• An algorithm for building an inverse mapping. We
reasonably assume that a compiler can automatically
build an inverse mapping for a sparse matrix, as long as
the matrix is represented in a series of invertible, high-
level, transformations. In the next step, we will develop
such a compiler algorithm.
• Extension to sparse tensors. While there are many
(more than 50 in 2010 [2]) sparse representations for a
sparse matrix, the possible representations for a sparse
tensor can be overwhelming, since a sparse tensor can
have many dimensions and any dimension can be dense
or sparse. If a sparse tensor could be expressed in a sim-
ilar way as a sparse matrix and for values-driven com-
putations, programmers would be able to productively
explore the design space, including not only the possi-
ble sparse tensor representations, but also their optimiza-
tions, for high performance.
• More transformations. While we have pointed out three
high-level transformations, including packing, blocking,
and job scheduling, it is possible to include more trans-
formations to succinctly and precisely express the repre-
sentations and optimizations of sparse matrices for vari-
ous computations. There are known data parallel primi-
tives that emerge in many computations [3, 17], includ-
ing reduction, scan (parallel prefix-sum), mapping, gath-
ering, scattering, stream compaction, segmented reduc-
tion, and sorting [3]. They might be used as additional
high-level transformations.
In this paper, the result of job scheduling is statically
decidable with a sparse matrix. However, for GPUs and
FPGAs, dynamic job scheduling is also common. It is
useful if we could standardize dynamic job scheduling as
a high-level transformation.
• Optimizing across computations. So far, we have fo-
cused on a single sparse matrix computation. In an itera-
tive solver, there are multiple computations involved. Re-
ordering [5, 8] may improve data locality. Inspection/ex-
ecution is a common strategy that examines the struc-
tures of sparse matrices, and transforms the computa-
tions at run-time for better performance. In our previ-
ous work [16], we have built a compiler and library to
transparently optimize across computations with reorder-
ing and inspection/execution. Now that we can explicitly
express a sparse representation, in future, we may lever-
age our previous work but further give programmers ex-
plicit control of reordering and inspection/execution.
• More workloads and designs for FPGAs. In this paper,
we have studied two SpMV implementations on FPGAs,
and from them extracted a general approach. We will
study more workloads and their designs on FPGAs to
validate and enrich this approach.
• Workloads and designs for GPUs and CGRAs. Similar
to FPGAs, GPUs have massive resources and parallelism.
CGRA (Coarse-Grain Reconfigurable Architecture) [18]
is structurally similar to FPGAs except its resources have
coarse granularity. It is interesting to see if our approach
is applicable to workloads and designs on these architec-
tures.
6. Conclusion
We have proposed a novel idea to productively expressing
a sparse matrix for high-performance computing on a spa-
tial architecture. We express the sparse representation of the
matrix as a series of invertible transformations, and express
the computation centering around values, aided by an inverse
mapping automatically generated by a compiler. Then a pro-
grammer can specify optimizations of the computation in a
way much like for a dense matrix computation.
This is a first step in addressing spatial programming
of sparse matrix computations for productive performance.
There are many interesting researches left for future.
Acknowledgments
The keen interest of Nitish Kumar Srivastava in sparse ten-
sors has motivated me to think about the problem in this pa-
per (productive, performant, spatial programming of sparse
matrix computations). Zhiru Zhang, Nitish Kumar Srivas-
tava, and Chris Hughes have provided valuable feedback on
the solution.
References
[1] Sparse matrix. https://en.wikipedia.org/wiki/sparse matrix,
2018.
[2] G. Arnold, J. Ho¨lzl, A. S. Ko¨ksal, R. Bodı´k, and M. Sagiv.
Specifying and verifying sparse matrix codes. In Proceedings
of the 15th ACM SIGPLAN International Conference on
Functional Programming, ICFP ’10, pages 249–260, New
York, NY, USA, 2010. ACM.
[3] N. Bell, S. Dalton, and L. Olson. Exposing fine-grained
parallelism in algebraic multigrid methods. SIAM Journal of
Scientific Computing, 34(4), 2012.
[4] N. Bell and M. Garland. Implementing sparse matrix-
vector multiplication on throughput-oriented processors.
In Proceedings of the Conference on High Performance
Computing Networking, Storage and Analysis, SC ’09, pages
18:1–18:11, New York, NY, USA, 2009. ACM.
[5] E. Cuthill and J. McKee. Reducing the bandwidth of
sparse symmetric matrices. In Proceedings of the 1969 24th
National Conference, ACM ’69, pages 157–172, New York,
NY, USA, 1969. ACM.
[6] M. Deveci, C. Trott, and S. Rajamanickam. Multi-threaded
sparse matrix-matrix multiplication for many-core and GPU
architectures. CoRR, abs/1801.03065, 2018.
[7] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and
G. Stitt. A high memory bandwidth fpga accelerator for
10
sparse matrix-vector multiplication. In 2014 IEEE 22nd
Annual International Symposium on Field-Programmable
Custom Computing Machines, pages 36–43, May 2014.
[8] A. George and J. W. Liu. Computer Solution of Large
Sparse Positive Definite. Prentice Hall Professional Technical
Reference, 1981.
[9] J. L. Greathouse and M. Daga. Efficient sparse matrix-vector
multiplication on gpus using the csr storage format. In SC
’14: Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis,
pages 769–780, Nov 2014.
[10] F. G. Gustavson. Two fast algorithms for sparse matrices:
Multiplication and permuted transposition. ACM Trans.
Math. Softw., 4(3):250–269, Sept. 1978.
[11] F. Kjolstad, S. Kamil, S. Chou, D. Lugato, and S. Amaras-
inghe. The tensor algebra compiler. Proc. ACM Program.
Lang., 1(OOPSLA):77:1–77:29, Oct. 2017.
[12] D. Koeplinger, M. Feldman, R. Prabhakar, Y. Zhang, S. Had-
jis, R. Fiszel, T. Zhao, L. Nardi, A. Pedram, C. Kozyrakis,
and K. Olukotun. Spatial: A language and compiler for
application accelerators. In Proceedings of the 39th ACM
SIGPLAN Conference on Programming Language Design
and Implementation, PLDI 2018, pages 296–311, New York,
NY, USA, 2018. ACM.
[13] J. Pu, S. Bell, X. Yang, J. Setter, S. Richardson, J. Ragan-
Kelley, and M. Horowitz. Programming heterogeneous
systems from an image processing dsl. ACM Trans. Archit.
Code Optim., 14(3):26:1–26:25, Aug. 2017.
[14] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand,
and S. Amarasinghe. Halide: A language and compiler
for optimizing parallelism, locality, and recomputation in
image processing pipelines. In Proceedings of the 34th ACM
SIGPLAN Conference on Programming Language Design
and Implementation, PLDI ’13, pages 519–530, New York,
NY, USA, 2013. ACM.
[15] H. Rong. Programmatic control of a compiler for generating
high-performance spatial hardware. CoRR, abs/1711.07606,
2017.
[16] H. Rong, J. Park, L. Xiang, T. A. Anderson, and M. Smelyan-
skiy. Sparso: Context-driven optimizations of sparse linear
algebra. In Proceedings of the 2016 International Conference
on Parallel Architectures and Compilation, PACT ’16, pages
247–259, New York, NY, USA, 2016. ACM.
[17] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan
primitives for gpu computing. In Proceedings of the
22Nd ACM SIGGRAPH/EUROGRAPHICS Symposium on
Graphics Hardware, GH ’07, pages 97–106, Aire-la-Ville,
Switzerland, Switzerland, 2007. Eurographics Association.
[18] M. Wijtvliet, L. Waeijen, and H. Corporaal. Coarse grained
reconfigurable architectures in the past 25 years: Overview
and classification. In 2016 International Conference on
Embedded Computer Systems: Architectures, Modeling and
Simulation (SAMOS), pages 235–244, July 2016.
[19] L. Zhuo and V. K. Prasanna. Sparse matrix-vector multipli-
cation on fpgas. In Proceedings of the 2005 ACM/SIGDA
13th International Symposium on Field-programmable Gate
Arrays, FPGA ’05, pages 63–74, New York, NY, USA, 2005.
ACM.
11
Schedule the blocks to 
2 machines ASAP.
FPGA
+
y
_
u
n
lo
a
d
e
r
y(i)
buffer 
for x
*
x(j)
buffer 
for x
*
buffer 
for x
*
buffer 
for x
*
+
                                              +
b
u
ffer
+
b
u
ffer
+
d
e
la
y
y
Combine reductions in 
the same machine
decoder
i
A.values(i', j')
col_idx
Machine 0
Machine 1
x
_
lo
a
d
e
r
x
A
_
lo
a
d
e
r
A
_
lo
a
d
e
r
A
_
lo
a
d
e
r
A
_
lo
a
d
e
r
row_blocks_loader
A     B  
      C 
        D E 
  F     G   H
  I       J K
      L       M
  N       O  
            P 
Pack each 
row
Dense representation
R
o
w
s 
(i
)
Columns (j)
0  1  2   3   4  5   6  7
0   
1   
2
3
4
5
6
7
Values
A B  
C 
D E 
F G H
I J K
L M
N O  
P 
0 3  
3 
4 5 
1 4 6
1 5 6
3 7
1 5  
6 
Structure
Row 
lengths
Column 
indices
2  
1 
2 
3
3
2
2  
1 
Block each row 
by factor of 2
Values
A B  
C 0
D E 
F G H 0
I J K 0
L M
N O  
P 0
0 3  
3 *
4 5 
1 4 6 *
1 5 6 *
3 7
1 5  
6 *
Structure
Column 
indices
Row 
blocks
1  
1 
1 
2
2
1
1  
1 
Values
0 3 4 5 6 * 6 * 1 5 
3 * 1 4 1 5 3 7 6 *
Structure
Column 
indices
Row 
blocks
1  
1 
1 
2
2
1
1  
1 
A B D E H 0 K 0 N O  
C 0 F G I J L M P 0
Columns (j')
0  1  2   3  4  5   6  7   8  9
R
o
w
s 
(i
')
0   
1
A D H K N
B E 0 0 O
0 4 6 6 1
3 5 * * 5
C F I L P
0 G J M 0
3 1 1 3 6
* 4 5 7 *
Conditionally combine 
reductions from two machines
Local sum 
and i
Figure 6: Another SpMV design on FPGAs [19].
12
1   Representation  rep;
2   Parameter          A(float, 2, rep), x(float, 1);
3   Assumption       symbolic_constants(k);
4   Var                    col_idx, row_blocks;
5   rep = A.dense()
6               .pack(Columns, Index(col_idx))
7               .block(Columns, k/2, Zero, Length(row_blocks))
8               .schedule(row_blocks, 2, ASAP, Zero); 
9    Var      i';
10  RDom j'(0, A.values.extent(1)); 
11  Expr    original_ij = A.inverse_map(i', j');
12  Expr    i = original_ij[0];
13  Expr    j = original_ij[1];
14  Func    y;
15  y(i)   = 0;
16  y(i) += A.values(i', j') * x(j); 
17  y.bound(i, 0, x.extent(0)); 
18  y.update().bound(i', 0, 2).reorder(j', i');
19  Func  y_unloader, Y, x_loader, A_loader;
20  y.update().isolate_consumer(y, y_unloader)
21                  .isolate_reduction(Y);
22  RDom j''(0, k/2);
23  Y.update().isolate_producer(x, x_loader)
24                   .unroll(i')
25                   .tile(j', j'', k/2)
26                   .unroll(j'')             
27                   .isolate_producer(A, A_loader);
28  Y.update().buffer(x)
29                   .combine_reduction(j'', Same_Target);
30                   .combine_reduction(i', Maybe_Different_Target);
31  y.update().is_continous_reduction(i)
32                  .reduction_circuit(Linear_Array, 4);
33  A_loader.load_structure(col_idx);
34  Func row_blocks_loader;
35  row_blocks_loader.load_structure(row_blocks);
Temporal 
definition
Build a basic 
spatial layout
Specialize 
individual 
computations
Spatial 
mapping
Sparse 
representation
Figure 7: Specification for the SpMV design in Fig. 6.
13
