Exploiting Fine-Grain Ordered Parallelism in Dense Matrix Algorithms by Weng, Jian et al.
Jian Weng, Vidushi Dadu, Tony Nowatzki
University of California, Los Angeles
{jian.weng, vidushi.dadu, tjn}@cs.ucla.edu
ABSTRACT
Dense linear algebra kernels are critical for wireless applica-
tions, and the oncoming proliferation of 5G only amplifies
their importance. Many such matrix algorithms are inductive,
and exhibit ample amounts of fine-grain ordered parallelism:
multiple computation flows are linked by fine-grain producer/consumer
dependences, and the iteration domain is not
easily tileable. Synchronization overheads make multi-core
parallelism ineffective, and the non-tileable iterations make
the vector-VLIW approach less effective, especially for the
typically modest-sized matrices.
Because CPUs and DSPs lose an order of magnitude in performance
and hardware utilization, costly and inflexible ASICs
are often employed in signal processing pipelines. A programmable
accelerator with similar performance/power/area
would be highly desirable. We find that fine-grain ordered
parallelism can be exploited by supporting: 1. fine-grain
stream-based communication/synchronization; 2. inductive
data-reuse and memory access patterns; 3. implicit vector-
masking for partial vectors; 4. hardware specialization of
dataflow criticality.
In this work, we propose REVEL as a next-generation
DSP architecture. It supports the above features in its ISA
and microarchitecture, and further uses a novel vector-stream
control paradigm to reduce control overheads. Across a
suite of linear algebra kernels, REVEL outperforms
equally-provisioned DSPs by 4.6×-37× in latency, and achieves
8.3× higher performance per mm². It requires only 2.2× higher power
to achieve the same performance as ideal ASICs, at about
55% of the combined area.
1. INTRODUCTION
Dense linear algebra kernels, like matrix factorization, mul-
tiplication, decomposition and FFT, have for decades been
the computational workhorses of signal processing across
standards, specifications, and device settings. The oncom-
ing proliferation of 5G wireless is only further pushing the
computational demands, both in performance and energy effi-
ciency. Driven by needs of higher capacity and applications
like augmented and virtual reality [1], new standards will
require signal processing at more than an order-of-magnitude
higher throughput and lower latency.
Despite their ubiquity, many important dense matrix operations
are far from trivial to parallelize and compute at
high hardware efficiency. As evidence, Figure 1 shows the
hardware utilization (based on max. vector issue width) of
a modern CPU and DSP running common DSP algorithms
from native application suites (eg. MKL and TI DSPLIB).
For algorithms without fine-grain dependences (GEMM, FIR,
and FFT), a reasonable utilization is achieved, usually between
30-80%. However, for factorization/decomposition
(SVD, QR, Cholesky, Solver), the utilization is exceptionally
poor, generally between 5%-20%. Even this measure is
generous, as we only consider the maximum throughput of a
single core, yet there is enough raw parallelism to multithread.
Empirically, however, MKL and TI libraries do not even invoke
multiple threads at the commonly-small matrix sizes
required, due to synchronization overheads. CPUs and DSPs
leave untapped factors of performance/hardware utilization.
Figure 1: Percent peak performance of CPU (Intel Xeon 4116) and DSP
(TI C6678) on DSP kernels. (Bar chart: FU utilization (%) of dsp and cpu
for svd, qr, cholesky, solver, fir, gemm and fft at several data sizes.)
The challenge and opportunity come from the dominant
form of parallelism in these workloads, which we
call fine-grain ordered parallelism (FGOP). FGOP consists
of fine-grain producer/consumer relationships between
otherwise parallel computations, where the rate of production-
to-consumption, the rate of data reuse, and the memory
access relation are affine functions of the induction variables.
This results from the iterative and inductive nature of these
algorithms, as they operate on relatively small matrix sizes.
(a) Solver Code
for (j=0; j<n; ++j)
  b[j] = b[j]/a[j,j]
  for (i=j+1; i<n; ++i)
    b[i] -= b[j]*a[j,i]
(b) Iteration Space and Dependences: many fine-grain dependences (forward and loop carried) restrict traditional multithreading, and vectorization is challenging due to the small and changing iteration length.
(c) Typical CPU Schedule: some iterations are vectorized, others are not profitably vectorizable.
(d) Ideal Schedule: exploiting dependences enables region overlap, and efficient vectorization speeds up the inner loop.
Figure 2: FGOP Example: Triangular Linear Solver
arXiv:1905.06238v1 [cs.DC] 9 May 2019
To substantiate, consider the triangular solver in Figure 2(a).
Its iteration space diagram, 2(b), reveals the many
fine-grain dependences that make profitable multithreading
between regions impossible. Furthermore, the inner-loop
trip count changes inductively, leading to many iterations
that are difficult to vectorize. Nevertheless, an architecture
can be designed to exploit FGOP; the potential is shown
in Figure 2(c,d). If dependences between regions can be
enforced at a fine grain with low overhead, then overlap
between regions becomes possible, increasing the parallelism.
If the inductive memory access pattern (and its relationship to
computation) can be expressed efficiently, then vectorization
can reduce the total time of the inner-loop region.
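To make the inductive structure concrete, the following is a minimal Python sketch of the solver in Figure 2(a). It is an illustration only: the function name `triangular_solve` and the returned list of inner-loop trip counts are our own additions, and we assume a is stored as a list of rows and indexed exactly as in the figure (a[j][i] with i > j), which corresponds to solving aᵀx = b for an upper-triangular a.

```python
def triangular_solve(a, b):
    """Solver loop nest of Figure 2(a), plus the trip count of each inner loop.

    a: n x n list of rows; only a[j][i] with i >= j is read.
    b: right-hand side; a copy is updated and returned as the solution.
    """
    n = len(b)
    x = list(b)
    trip_counts = []
    for j in range(n):
        # Point computation: one divide per outer iteration.
        x[j] = x[j] / a[j][j]
        # The inner loop shrinks by one each time (inductive trip count).
        trip_counts.append(n - j - 1)
        for i in range(j + 1, n):
            # Every inner iteration consumes the x[j] just produced
            # (forward dependence); x[j + 1] feeds the next divide
            # (loop-carried dependence).
            x[i] -= x[j] * a[j][i]
    return x, trip_counts
```

The trip counts returned here are n-1, n-2, ..., 0, which is why a fixed vector width rarely divides the inner loop evenly at small n.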
ASICs can of course be designed to exploit FGOP, which is
why they are so commonly employed for these tasks. Unfortunately,
they have significant drawbacks: design time and
verification effort, extra on-chip area, lack of flexibility, and
lengthened time-to-market; these are especially relevant for the
example domain of wireless, where standards are continually
changing and infrastructure costs are high. A general
and programmable architecture exploiting FGOP could prove
to be a worthy, if not essential, replacement for traditional
vector-VLIW DSP architectures.
Goals: Our goals are twofold: 1. developing abstractions and
execution semantics to enable efficient expression of FGOP;
and 2. applying these abstractions to create an efficient pro-
grammable accelerator instance for DSP algorithms, capable
of accelerating both FGOP and non-FGOP workloads in this
domain (eg. GEMM, filters).
Approach: Through an in-depth workload analysis, we find
four essential architecture abstractions to express FGOP efficiently
to hardware: 1. parallel dataflows with ordered
communication channels; 2. to reduce control overhead,
induction-variable dependent communication, memory access,
and data-reuse; 3. for efficient vectorization, the implicit
masking of non-vector-width-divisible iterations; 4. for high
hardware utilization, the specialization of compute hardware
for critical versus non-critical dataflows.
While in principle the above abstractions can be added to a
conventional ISA, we choose a stream-dataflow ISA [2], as its
dataflow-based computation and communication abstractions
are simple to modify, and the resulting accelerator can be
performance/power competitive with DSPs. For the hardware
implementation, we start with a simple design for one lane: a
scratchpad connected to a coarse grain reconfigurable fabric
(eg. similar to some previous designs [3, 4, 5, 6]). We use
multiple such “lanes” to scale up the design.
Our accelerator, REVEL: the Reconfigurable Vector Lane
architecture (Figure 3), is constructed by adding support for
each of the FGOP-exploiting abstractions: 1. We allow multi-
ple parallel dataflows (similar to threads) which can commu-
nicate within/across lanes through FIFOs to support synchro-
nization on fine-grain dependences. To simplify the ordering
of commands, we centralize control into one control-core
which coordinates all lanes. 2. We provide the ability to
express inductive memory access, data-reuse and communi-
cation patterns by adding suitable state machines to FIFO
communication structures. 3. We implement implicit vector
masking by exploiting the relationship between computation-
vector width and communication-stream length. 4. For high
computation utilization, we develop a novel heterogeneous
compute fabric, where different regions are specialized for
critical and non-critical dataflows.
Figure 3: REVEL Architecture Model. (Block diagram: vector lanes 1 to N,
each with a local SPAD, an XFER unit and a reconfigurable fabric, connected
by a lane control bus and a shared SPAD bus to a shared SPAD and a control
core. Legend: inductive stream control, configurable vector port, control
path, wide datapath, and a heterogeneous fabric with dedicated and temporal
regions.)
Our contributions are:
• Identification and characterization of fine-grain ordered
parallelism (FGOP) as the main challenge for accelerating
many dense linear algebra kernels.
• Architecture and execution model for expressing FGOP
naturally to hardware.
• Novel architecture features (vector-stream control, inductive
access/reuse, implicit vector-masking, and heterogeneous
fabric), enabling ASIC-like power/area/performance.
Results: A single 1.25GHz REVEL unit can outperform
a 2.1GHz OOO core running highly-optimized MKL code
on DSP workloads by a mean of 9.6×, with an area-normalized
speedup of 1308×. Compared to a DSP, REVEL achieves
between 4.6×-37× lower latency, with an area-normalized
speedup of 8.3×. Compared to a set of ideal ASICs with
equivalent performance, it requires about 2.2× higher power and
0.55× the area.
Paper Organization: We briefly motivate the kernels in Section 2,
and analyze their challenges/potential in Section 3.
The FGOP abstractions and ISA instance (REVEL) are in
Sections 4 and 5. Sections 6 and 8 describe the microarchitecture
and compiler. Methodology and results are in Sections 9
and 10. We finally cover related work and conclude.
2. WHY THESE WORKLOADS?
We examine these DSP workloads as they represent a co-
herent and important set, and because exploiting FGOP is
critical for their performance. To elaborate, Figure 4 shows
the stages of a typical 4G/5G transmitter/receiver.
Kernels we do not target: Channel coding and modulation
involve mostly bit-level arithmetic. RE mapping is a short
resource allocation phase which is not computationally intensive.
Kernels we target: The Beamforming stage involves mostly
GEMM, coming from spatial signal filtering [7]. Filters and
FFT of several varieties are also common [8, 9, 10].
Figure 4: Typical 4G/5G Transmitter/Receiver Pipeline. (Block diagram of the
uplink and downlink paths. Stages: FFT, SRS channel est., DMRS channel est.,
beam-forming weight update, MIMO eq. & noise var. est., beam-forming, RE
mapping, IFFT, demodulation & chan. decoding, channel coding & modulation.
Annotations: FFT/IFFT size depends on cell bandwidth and sub-carrier spacing;
general matrix mult., where matrix size depends on # of antennas & beams;
matrix factorization/inversion (QR decomp., Cholesky, SVD) and filtering;
bit-level ops and the control-dominated resource element mapping are not
targeted.)
Challenging FGOP workloads are mostly within MIMO
equalization and channel estimation. These include Singular
Value Decomposition (SVD), used for noise reduction [11], QR
decomposition, used for MIMO signal detection [12], and
Cholesky decomposition, used in channel estimation and
equalization [13, 14]. Solver is an instructive example.
Why are matrices small?: The parameters often depend on
the number of antennas and beams (a range of 12-32
is common) [15].
3. FINE-GRAIN ORDERED PARALLELISM
Here we first define FGOP properties with an example ker-
nel, and explain why each is important either as a challenge
or opportunity. Then we characterize their prevalence in our
workloads and beyond. Finally, we perform a case study to
answer why task-parallelism plus vectorization is insufficient.
3.1 Characteristic Properties of FGOP
We use Cholesky decomposition as a running example in
Figure 5. Cholesky contains a point, a vector, and a matrix
computation region. In general, regions correspond to com-
putations either across subsequent loops, or from within the
same imperfect loop but at different nesting levels.
Property 1: Parallel Flows with Fine-grain Dependences:
A central characteristic of FGOP is the presence of fine-grain
dependences between regions. In Cholesky, the vector and
matrix region are dependent on the point region (forward
dep.), and the point region is dependent on the first element in
the matrix region (loop carried dep.). For a small matrix, the
granularity of these dependences is a few hundred instructions
or less, and even lower as the algorithm progresses.
Why is this important: the presence of these fine-grain de-
pendences is the key limiter to performance of multithreading
the regions, due to high synchronization overhead.
Property 2: Ordered Dependences: Dependences are of-
ten strictly-ordered from the perspective of their producing
and consuming instructions. Figure 5(b) shows Cholesky’s it-
eration space and dependences. In Cholesky, across multiple
iterations of the outer k loop, the point region is producing
values (inva and sqrt) that are consumed by the matrix re-
gion. Similarly, the matrix region produces values consumed
by subsequent iterations of the point region. An example of a
non-ordered dependence is when an array is consumed in the
backwards order of how it was produced.
Why is this important: the structure of ordered de-
pendences makes fine-grain synchronization of these
dependences natural, and therefore creates an opportunity to
exploit efficiently in hardware.
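As a small illustration of Property 2, the sketch below classifies a dependence as ordered or not. The definition used is our own simplification, not the instrumentation used in this paper: a dependence is treated as ordered if its values are consumed in the order they were produced, with repeated consumption of the same value (reuse) allowed. The name `is_ordered` and the use of integer value ids are hypothetical.

```python
def is_ordered(produced, consumed):
    """Return True if `consumed` visits values in production order.

    produced: value ids in the order the producer emitted them.
    consumed: value ids in the order the consumer read them
              (a value may be read several times in a row).
    """
    position = {value_id: idx for idx, value_id in enumerate(produced)}
    indices = [position[value_id] for value_id in consumed]
    # Ordered means the production index never moves backwards.
    return all(prev <= nxt for prev, nxt in zip(indices, indices[1:]))
```

Under this definition, consuming an array backwards relative to how it was produced is reported as non-ordered, matching the counter-example above.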
Property 3: Inductive Access/Data-Reuse: An inductive
algorithm iteratively builds on previous computations. In
array codes, this often manifests as induction-variable depen-
dent trip-counts. This is the case for both of Cholesky’s loops
(but would not, for example, be true of a matrix multiply).
This has implications for dependences, in that their reuse
pattern (the rate of production to consumption) would be
induction-variable dependent. For example, how many times
inva is consumed in the matrix region varies with k and j.
Another example is that the matrix region only produces a
value for the next point region on the first iteration of the
inner loop (so it depends on k).
(a) Cholesky Code
for (k=0; k<n; ++k)
  ia = 1.0/a[k,k]              // point region
  is = 1.0/sqrt(a[k,k])
  for (j=k; j<n; ++j)          // vector region
    l[j,k] = a[k,j]*is
  for (j=k+1; j<n; ++j)        // matrix region (critical)
    for (i=j; i<n; ++i)
      a[j,i] -= a[k,i]*a[k,j]*ia
(b) Iteration Space and Dependences (some dependences omitted for visual clarity): the point, vector and matrix regions of iterations k and k+1, linked by forward and loop-carried dependences. Inner loops are “inductive”: their trip counts depend on an induction variable, which creates the triangular pattern. Some iterations are vectorized, others are not profitably vectorizable.
(c) Ideal Schedule: the point, vector and matrix regions of iterations k and k+1 overlap in time.
Figure 5: FGOP Example: Cholesky
Also, inductive loops cause patterns of computation/mem-
ory that are not easy to tile with vectorized loop iterations.
Figure 5(b) also shows a vector-tiling pattern for Cholesky,
with many leftover iterations. In a traditional vector architec-
ture, these would require scalar iterations or masking/merging
overhead.
Why is this important: inductive patterns cause overheads
for coordination of vector computation, as well as the wide
interface to memory.
Property 4: Region Imbalance: Finally, regions often ex-
press imbalanced amounts of work. In Cholesky, the matrix
region performs much more computation than the others,
making it critical for performance. In Figure 5(a), we
highlight the critical region in red, and the sub-critical re-
gions in purple. In DSP workloads, sub-critical regions often
contain high-latency operations like sqrt and div.
Why is this important: for a high hardware utilization,
execution resources should be allocated appropriately to
regions. Furthermore, we will show how it is possible to
specialize the computation substrate for criticality.
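The imbalance between regions can be checked with a minimal Python sketch of the Cholesky loop nest that counts loop iterations per region. This is an illustration under stated assumptions: the function name `cholesky_with_region_counts` and the `work` dictionary are our own, the input is assumed symmetric positive definite with its upper triangle stored in a, and the vector region is read as l[j][k] = a[k][j] * is.

```python
import math


def cholesky_with_region_counts(a_in):
    """Cholesky loop nest with per-region iteration counts.

    Returns (l, work), where l is lower-triangular with l * l^T == a_in and
    work counts iterations of the point, vector and matrix regions.
    """
    n = len(a_in)
    a = [row[:] for row in a_in]          # keep the caller's matrix intact
    l = [[0.0] * n for _ in range(n)]
    work = {"point": 0, "vector": 0, "matrix": 0}
    for k in range(n):
        # Point region: high-latency divide and square root.
        ia = 1.0 / a[k][k]
        inv_sqrt = 1.0 / math.sqrt(a[k][k])
        work["point"] += 1
        # Vector region: scale row k into column k of l.
        for j in range(k, n):
            l[j][k] = a[k][j] * inv_sqrt
            work["vector"] += 1
        # Matrix region: triangular update of the trailing submatrix.
        for j in range(k + 1, n):
            for i in range(j, n):
                a[j][i] -= a[k][i] * a[k][j] * ia
                work["matrix"] += 1
    return l, work
```

For n = 16 the counts are 16 point, 136 vector and 680 matrix iterations, so the matrix region dominates even at the small sizes of interest.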
Similar properties in QR, SVD: Figure 6 shows that both
have fine-grain ordered deps. between scalar/matrix regions
(eg. tau) and between inner loops (eg. w[j]). Inner loops are
inductive and imbalanced compared to the householder region.
Prevalence of FGOP: We characterize the prevalence of
each FGOP property in our 7 DSP workloads (Cholesky,
QR, SVD, Solver, FFT, GEMM, Filter), as well as a more
general dense matrix benchmark suite, PolyBench [16]. We
use LLVM [17] to instrument programs to track dynamic
memory dependences. Figure 7 shows a cumulative distribution
function (CDF) for each property across three different matrix
sizes (16, 32, 128)¹. In all plots, lines closer to the upper left
indicate more FGOP.
• Fine Granularity: Figure 7(a) characterizes the distance
between inter-region dependences in terms of arithmetic
instructions. Most dependences (where the steepest part
of the CDF curve is) are between about 75 to 1000 in-
structions, where larger data sizes are on the higher end.
Intuitively, this is a range where an out-of-order (OOO)
core’s instruction window begins to be insufficient, but
still where shared-memory based synchronization also
significantly hurts performance (especially considering
pipeline serialization during synchronization).
• Ordered: Figure 7(b) shows the prevalence of ordered
dependences as a fraction of total dependences. All work-
loads contain at least 50% ordered dependences, and more
than 80% of DSP workloads are completely ordered; this
is quite high and promising for later exploitation.
• Inductive: DSP workloads have significant amounts of
inductive access. 4/7 of the DSP workloads have more
than 85% inductive accesses. PolyBench in general has
much less inductive access, but still about 1/5 of those
workloads are 60% inductive. Nevertheless, the inductive
property is critical for our DSP workloads.
• Imbalanced: 4/7 DSP workloads have imbalanced re-
gions, while 50% of PolyBench have imbalanced regions.
Overall, FGOP properties are common across dense ma-
trix workloads, especially for the relevant DSP workloads.
3.2 Why not task parallelism + vectorization?
We know from the data in Figure 1 that exploiting
FGOP is non-trivial for vector and VLIW cores. But why is
this so, given that DSPs are designed for these workloads?
The traditional method of parallelizing workloads with
FGOP is through task parallelism on a block of computations
(eg. a set of iterations). Each dependence, or set of depen-
dences, is simply a condition under which to start a new task
or synchronize existing tasks. Intuitively, this works well
when dependences are coarse grain (less overhead to start or
synchronize), and when blocks perfectly tile the iteration
space. As we discussed earlier, this is not true for the DSP
workloads we study.
¹ FFT/Filter do not process matrices, so we pick data sizes comparable
to the matrices.
QR Factorization (R only)
for (k=0; k<n; ++k)
  for (j=k,norm=0; j<n; ++j)
    v[j] = a[j,k]
    norm += v[j]*v[j]
  norm = sqrt(norm)
  s = -phase(v[0])
  a[k,k] = s * norm
  u1 = 1.0 / (v[0] - phase)
  tau = conj(s) / u1 / norm
  for (j=k; j<n; ++j)
    v[j] *= u1
  for (j=k; j<n; ++j)
    w[j] = 0
    for (i=k; i<n; ++i)
      w[j] += a[i,j]*v[i]
    w[j] *= tau
  for (j=k; j<n; ++j)
    for (i=k; i<n; ++i)
      a[i,j] -= w[j]*v[i]
SVD (Bidiagonalize only)
for (k=0; k<n; ++k)
  v,tau=householder(a[k:n,k])
  for (j=k; j<n; ++j)
    w[j] = 0
    for (i=k; i<n; ++i)
      w[j] += a[i,j]*v[i]
    w[j] *= tau
  for (j=k; j<n; ++j)
    for (i=k; i<n; ++i)
      a[i,j] -= w[j]*v[i]
  v,tau=householder(a[k,k+1:n])
  for (j=k; j<n; ++j)
    w[j] = 0
    for (i=k; i<n; ++i)
      w[j] += a[j,i]*v[i]
    w[j] *= tau
  for (j=k; j<n; ++j)
    for (i=k; i<n; ++i)
      a[j,i] -= w[j]*v[i]
Ideal Schedules: for both kernels, the householder region and the inner loops of iterations k and k+1 overlap in time (dependences annotated “when j=k+1”).
Figure 6: FGOP Examples: QR and SVD
Figure 7: Prevalence of FGOP Properties. (CDF plots; y-axis: percentage of
workloads with the corresponding property (%). (a) Fine Grain Depend.,
x-axis: # arithmetic inst. between depend., from 1 to 10^6. (b) Ordered
Depend., x-axis: percentage of depend. consumed in order. (c) Induc. Data
Stream, x-axis: percentage of induc. mem. streams. Series: DSP (16), DSP (32),
DSP (128), Poly (16), Poly (32), Poly (128).)
Figure 8: Case Study: Speedup of task-parallel Cholesky and MKL. (Two plots of
relative exec. time for matrix sizes 16 to 2048. Left: mkl with 1, 2, 4 and 8
threads. Right: mkl 1 versus task par. with 1, 2, 4 and 8 threads.)
Case Study: Task-parallel Cholesky: For practical analy-
sis, we analyzed an established Cholesky kernel [18] which
uses blocked task-parallel execution. Figure 8 shows the task-
parallel speedup over the single-threaded industry-standard
MKL for different matrix sizes (see Sec 9 for CPU and
methodology). First, notice that its performance is similar
to MKL for large matrices, which is possible because it
calls underlying BLAS routines (dpotf2,dtrsm,dsyrk) at a
block-level. We suspect MKL’s implementation uses a simi-
lar approach, but does not use task parallelism at small matrix
sizes for performance reasons.
Considering the task-parallel code, the results indicate that
exploiting FGOP is only profitable at all for larger matrices.
Using more threads simply results in higher overhead of syn-
chronization, far outweighing the benefits of parallelization.
For the task-parallel version, speedups higher than 2× are
only possible with matrices of 1024k size or larger, higher
than we can make use of in our domain. Therefore, our goal
is to create an architecture which can exploit FGOP better at
all matrix sizes and enable speedup at small sizes.
4. ABSTRACTIONS TO EXPRESS FGOP
In this section we propose a set of architecture features
which can express fine-grain ordered parallelism efficiently
to hardware. The description here is architecture-neutral, and
we later develop an architecture instance (an ISA).
Feature 1: Concurrent Dataflows with Ordered Depen-
dences: The essential abstraction is that of concurrent
dataflows (threads) with the ability to express ordered
dependences between regions. Ordered dependences are
distinguished from typical instruction dependences in that
they have a non-uniform rate of production to consumption.
A consumption rate higher than one indicates reuse of a
value along a dependence. This may occur because data is
reused multiple times within a subsequent inner loop. A
production rate higher than one means that several iterations
occur before producing a value. For example, this could
be because data is being reduced (accumulated) for several
iterations. Figure 9 shows the solver kernel’s dataflows,
annotated with the memory access it performs within each
iteration of the outer j loop. Edges are labeled with their
production:consumption rate, unless they are uniform (1:1).
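The two non-uniform edge types can be modeled with a short Python sketch. It is a sketch under assumptions, not the hardware mechanism: the generator names `reuse` and `discard` are ours, and for the p:1 edge we assume that the first value of each group is the one delivered (as in the solver, where only the first inner-loop result feeds the next divide); a reduction would instead deliver an accumulated value.

```python
def reuse(values, counts):
    """1:c edge: the k-th produced value is delivered counts[k] times,
    in production order (multi-reuse)."""
    for value, count in zip(values, counts):
        for _ in range(count):
            yield value


def discard(values, counts):
    """p:1 edge: of each group of counts[k] produced values, only the
    first one is delivered (multi-discard). Assumes `values` holds at
    least sum(counts) items."""
    stream = iter(values)
    for count in counts:
        group = [next(stream) for _ in range(count)]
        if group:
            yield group[0]
```

For the solver with n = 4, the divide output is reused with counts [3, 2, 1, 0], and the inner-loop outputs arrive in groups of sizes [3, 2, 1], of which the first in each group goes to the next divide.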
Importance of Control Overhead: One important consider-
ation is the control-to-computation ratio. Short-vector SIMD
is one way to reduce control overhead; one SIMD instruction
expresses multiple operations over a fixed number of data
items. A generalization (used in a variety of prior architec-
tures [19, 20, 2, 21, 22, 23, 24]) is the concept of streams,
where a single control command describes an entire pattern
of operations. The following features (2-4) are related to the
use of streams to reduce control overhead.
Feature 2: Inductive Production:Consumption Rate:
Data-reuse patterns may depend on induction variables, as
seen in Figure 9. Here, the output of division is used multiple
times within the inner loop, but the number of times is
reduced by one each iteration. In general we find that the
pattern changes only by small constant numbers. We specify
these as two “stretch” parameters: sp and sc, the rate of
change of production and consumption. Figure 11 contains
an example encoding as a stream. Including these parameters
is not necessary for correct enforcement of dependences,
because multiple lower-dimension streams can be generated.
However, the number of instructions increases by an order of
magnitude (as shown in Figure 11).
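The effect of a stretch parameter can be sketched in Python as follows. The helper names are hypothetical, and clamping the count at zero is our assumption. The first function models one stream command carrying an initial consumption count nc and a stretch sc; the second models the alternative of issuing one fixed-count command per outer iteration.

```python
def inductive_reuse_stream(values, nc, sc):
    """One command: value t is delivered max(nc + t*sc, 0) times."""
    delivered = []
    count = nc
    for value in values:
        delivered.extend([value] * max(count, 0))
        count += sc                      # stretch applied per produced value
    return delivered


def rectangular_reuse_streams(values):
    """Baseline: one fixed-count command per value (n commands in total).

    Returns the delivered sequence and the number of commands issued.
    """
    n = len(values)
    commands = [(value, n - j - 1) for j, value in enumerate(values)]
    delivered = []
    for value, count in commands:
        delivered.extend([value] * count)
    return delivered, len(commands)
```

With nc = n-1 and sc = -1, the single inductive command delivers the same sequence as the n fixed-count commands, which is where the control-instruction savings come from.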
Feature 3: Inductive Memory Streams: Similarly, all prior
architectures that we are aware of use rectangular memory
access streams: ie. their iteration domains (without loss
of generality) begin at $\vec{0}$ and end at a constant $n$ in each
dimension (ie. a trip count), and their address functions are
linear functions of the iterator vector $\vec{I}$. If we let $c_i$, $c_j$ etc. be the multipliers
of $\vec{I}$ in the address function, rectangular streams can then be
depicted intuitively as a loop nest (see Figure 10(a)).
Inductive streams are more general; their iteration domains
may be bounded polyhedra instead of strictly rectangular.
Trip counts become a linear function of lexicographically previous
iterators. We encode using stretch multipliers $s_{ji}$, representing
the multipliers of iterator $j$ in the trip count for dimension
$i$. Figure 10(b) shows a 2D inductive stream pattern
as a loop nest. Figure 11 shows how to specify the accesses in
solver using either rectangular or inductive streams. Again,
inductive access streams require O(n) fewer control insts.
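The two loop nests of Figure 10 can be written as small address generators. The following Python sketch is illustrative: the function names and the `base` parameter are our own, addresses are element indices into a row-major array, and the stretch is restricted to integers here.

```python
def rectangular_stream(base, nj, ni, cj, ci):
    """2D rectangular (RR) stream: constant trip counts nj and ni."""
    return [base + j * cj + i * ci for j in range(nj) for i in range(ni)]


def inductive_stream(base, nj, ni, cj, ci, sji):
    """2D inductive (RI) stream: the inner trip count is ni + j*sji."""
    addresses = []
    for j in range(nj):
        trip_count = max(ni + j * sji, 0)   # stretch by the outer iterator
        for i in range(trip_count):
            addresses.append(base + j * cj + i * ci)
    return addresses
```

For example, the solver's accesses to a[j,j+1:n] in a row-major n×n array correspond to `inductive_stream(base=1, nj=n, ni=n-1, cj=n+1, ci=1, sji=-1)`, a single stream description in place of one rectangular stream per outer iteration.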
Later evaluation uses a simple notation to describe capabilities:
Letter “R” denotes a rectangular dimension, and “I”
denotes an inductive dimension, so “RI” would be a 2D capability
with induction in the second dimension.
(a) Solver Code
for (j=0; j<n; ++j)
  b[j] /= a[j,j]
  for (i=j+1; i<n; ++i)
    b[i] -= b[j]*a[j,i]
(b) Iteration Space and Dependences: dataflow graph of the divide (/) and multiply-subtract (×, -) dataflows over a[j,j], a[j,j+1:n], b[j], b[0] and b[i], with edges labeled 1:(n-j-1) (multi-reuse) and (n-j-1):1 (multi-discard).
Figure 9: Solver’s Dataflows and Ordered Dependences
(a) 2D Rectangular (RR)
for j=0 to nj
  for i=0 to ni
    array[j*cj + i*ci]
(b) 2D Inductive (RI)
for j=0 to nj
  for i=0 to ni + j*sji
    array[j*cj + i*ci]
Figure 10: Memory Address Stream Type Comparison
(a) Using 2D Rectangular Streams: a(ci=n+1), a(ci=1,ni=n-j), b(ci=1), b(ci=1,ni=n-j), nc=n-j-1, np=n-j-1; n instances of these “instructions” are required. Total Control Instructions = 3 + 5n
(b) Using 2D Inductive Streams: a(ci=n+1), a(cj=n+1,ci=1,sij=-1), b(ci=1), b(cj=1,sij=-1), nc=n-1 with sc=-1, np=n-1 with sp=-1. Total Control Instructions = 8
Figure 11: Stream Specification using Different Types
(a) Vectorized Critical-Region: the multiply-subtract dataflow replicated four times, with edge rates 1:(n-j-1)/4 and (n-j-1)/4:1
(b) Masked Lanes when i+4>ni
Figure 12: Implicit Vector Masking
Feature 4: Stream-based Implicit Vector Masking: There
are two issues with vectorization of FGOP. The first is that
the reuse rate may become fractional, as it may be divided by
the vector width (see example in Figure 12(a)). Therefore, we
need $s_{ij}$ to be able to represent fractional numbers. Second
is the problem of non-vector-width divisible iterations. To
address this, we make it implicit that the datapath for the
remaining iterations becomes masked or predicated. This
can be enforced by dynamically checking the stream iterator
for the case when the inner-loop iterator i is greater than the
current length ni (see Figure 12(b)).
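The masking rule can be sketched in Python as below. This is a software model of the idea only: the function names are hypothetical, the vector width of 4 follows Figure 12, and the masked update uses the solver's inner-loop operation as its example.

```python
def masked_vectors(length, width=4):
    """Split a stream of `length` elements into vectors of `width` lanes.

    Returns (start, mask) pairs; mask[lane] is False for lanes that fall
    past the current stream length.
    """
    vectors = []
    for start in range(0, length, width):
        mask = [start + lane < length for lane in range(width)]
        vectors.append((start, mask))
    return vectors


def masked_update(b, row, bj, width=4):
    """Apply b[i] -= bj * row[i] vector by vector, skipping masked lanes."""
    out = list(b)
    for start, mask in masked_vectors(len(b), width):
        for lane, active in enumerate(mask):
            if active:
                out[start + lane] -= bj * row[start + lane]
    return out
```

Because the mask is derived from the stream length alone, the control program issues the same command whether or not the inner trip count is divisible by the vector width.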
Feature 5: Specification of Dataflow Criticality: Certain
regions may be more or less computationally critical than
others, as they perform more or less work. In our example
in Figure 9, the “divide” dataflow executes about n/2 times less often than the inner-loop dataflow.
In practice, non-critical dataflows should be allocated shared
resources, while critical dataflows should be vectorized. We
will later demonstrate the effectiveness of hardware special-
ization for criticality.
5. REVEL: AN FGOP-ENABLED ISA
Using the principles from the previous section, we con-
struct an efficient and scalable ISA and execution model
(REVEL ISA) to exploit FGOP and traditional parallelism in
dense matrix algorithms. REVEL is an instance of a stream-
dataflow ISA [2], which we chose because it is straightfor-
ward to enhance for FGOP, and it enables an efficient pro-
grammable accelerator. Section 7 discusses enhancing other
architectures (OOO cores and Plasticine [4]). In this section,
we discuss the control model, how the architecture incorpo-
rates FGOP features, and then its specific ISA instantiation.
(a) Original Source Code
for (k=0; k<n; ++k)
  ia = 1.0/a[k,k]
  is = 1/sqrt(a[k,k])
  for (j=k; j<n; ++j)
    l[j,k] = a[k,j]*is
  for (j=k+1; j<n; ++j)
    for (i=j; i<n; ++i)
      a[j,i] -= a[k,i]*a[k,j]*ia
(b) Comp. Slice & Streams: the point, vector and matrix dataflows with numbered streams over the data slices a[k,k], a[k,k+1:n] and a[k+1:n,k+1:n]. Communication streams are mapped to the control core; dependences and operations are mapped to the compute fabric (dedicated/temporal regions).
(c) Compute Fabric and SPAD mapping: REVEL Lanes 1 and 2, each with in/out ports, a compute fabric and a local SPAD, plus the shared SPAD and the control core. For clarity, I/O port ids correspond to the stream # in (b).
(d) Vector-Stream Code Snippet (unrolled by 2; if not specifically specified, nc=ci=cj=1, sc=sp=sji=0)
Config(chol_dataflow,255)
Context(1)
//Feed a[0,0] to in=1.❶
//Stream a to 1.❽ to init
Local_St_Bar()
Local_ld(cj=n+1,nj=n-1,ni=n-1,
         s=-1,local=0,in=1.❽)
for (int i=0, lane=0, m=n; i<n-1; ++i,m=n-i) {
  XFER(n=1,out=lane.❷,in=lane.❷,nc=m-1)
  Local_St(ni=1,out=lane.⓿,addr=&L[0,0])
  Local_Ld(ni=m-1,in=lane.❹,addr=&a[i,i+1:n])
  Local_St(ni=m-1,out=lane.❺,Local=&L[i+1:n,0])
  XFER(np=1,out=1.❸,in=1.❸,nc=m*m/2)
  Local_ld(ni=m-1,ci=1,sji=-1,in=lane.❻)
  Local_Ld(ni=m-1,addr=&a[i:n],
           in=lane.❼,nc=m-1,sc=-1)
  XFER(np=1,out=acc.❾,in=(lane+1&7).❶)
  Local_St(ni=m-1,out=lane.❾,local=&a[i,i+1:n])
  XFER(np=(m-1)²/2,out=lane.❾,in=(lane+1&7).❽)
  //put a[i,i+1:n] on next lanes SPAD
  acc=acc+1&7
} Wait_all_done()
Compiler flow: a (hypothetical) compiler generates the IR in (b); communication streams will be converted to stream intrinsics for the control core, and the dataflow compiler lowers dataflows to fabric configuration bits (see Section 7). The resulting REVEL program is the dataflow configuration plus the vector-stream code.
Figure 13: Explaining REVEL abstractions using Cholesky as an example.
Background on stream-dataflow: Stream-dataflow ISAs
express computation as a dataflow graph, where its inputs
and outputs are named ports. Communication is performed
using streams, where the endpoints of streams are either the
dataflow-graph ports or memory. A von Neumann program
embeds all stream commands, and streams with the same
port number are guaranteed to be executed in program order.
Memory requests can be ordered by explicit barriers.
ISA Enhancements to support FGOP:
• Ordered Dependences between Dataflows: Computa-
tion is expressed as multiple independently-triggered
dataflow graphs, where streams describe their communi-
cation and re-use pattern.
• Inductive Dependence/Access: Stretch parameters
(sp,sc) added to relevant streams.
• Vector Masking: Non-divisible iteration lengths cause
predication of the corresponding dataflow.
• Execution Rate: Implementation is closely related to
hardware, so we describe separately (Section 6.3).
Example: Cholesky: Figure 13 demonstrates REVEL’s ab-
stractions by showing the transformation from source (a) to
the abstract dataflow IR (b), and finally to dataflow configu-
ration and stream-code running on the control core (c,d).
Enabling Scalability with Lanes: To enable scalability at
low overhead, we chose to add multiple lanes of execution.
Each lane is independent, in that it can concurrently execute
multiple dataflows, each potentially communicating using
inductive streams or through a global memory. Also, since
each lane can be programmed separately, the architecture is
flexible in terms of what computations are being parallelized.
Vector-stream control: There are two challenges with using
multiple lanes: 1. Each lane needs coordination (control
overhead), and 2. The dataflow-dependence streams between
lanes must somehow be ordered.
Our solution is the vector-stream control model. Here, a
single VonNeumann control program coordinates the execu-
tion of all lanes. Control commands are sent to all relevant
lanes, specified by a bitmask (eg. load array from address
0 of local memory to dataflow 1). In addition, a lane’s in-
dex can be used to offset the address of a command, so a
single command can allow each lane to read a different por-
tion of an array. This is unique and more powerful than
the control amortization offered by either vectorization or
streaming alone, as it amortizes both in “space” across lanes,
and in time through streaming commands. It is inspired by
Vector-threading [25, 26, 27] but with a stream-based ISA.
In the example, Figure 13, we map all three dataflows
(scalar, vector, matrix) to one lane to share its resources, and
parallelize the outer k loop across lanes.

Table 1: REVEL’s Vector-Stream Control Commands (every command also takes a lane bitmask)

Command               Pattern Params        Source Params   Dest. Params
Shared_ld             ci, cj, nj, ni        shared_addr     local_addr
Shared_st             ci, cj, nj, ni        local_addr      shared_addr
Local_st              ci, cj, nj, ni, sji   out_port        local_addr
Local_ld              ci, cj, nj, ni, sji   local_addr      in_port, nc, sc
Const                 nj, ni, sji           val1, val2      in_port, nc, sc
XFER                  np, sp                out_port        in_port, nc, sc
Configure             -                     local_addr      -
Barrier Ld/St & Wait  -                     -               -
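The amortization "in space" described above can be illustrated with a minimal sketch. It assumes a hypothetical helper `expand_command` that models how one broadcast command, carrying a lane bitmask and a per-lane address offset, turns into a different address range for each selected lane. The parameter names (`lane_mask`, `lane_stride`) are assumptions made for illustration and are not REVEL's actual encoding.

```python
def expand_command(base_addr, length, lane_mask, lane_stride, num_lanes=8):
    """Expand one broadcast command into per-lane (lane, start, end) ranges.

    lane_mask   -- bit i set means lane i receives the command
    lane_stride -- hypothetical address offset, scaled by the lane index
    """
    ranges = []
    for lane in range(num_lanes):
        if (lane_mask >> lane) & 1:
            start = base_addr + lane * lane_stride
            ranges.append((lane, start, start + length))
    return ranges

# One command, four lanes, each reading its own 16-word slice of an array.
slices = expand_command(base_addr=0, length=16, lane_mask=0b00001111, lane_stride=16)
```

In this model the control core issues a single command regardless of how many lanes are selected, so the control cost per lane falls as more lanes participate.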
REVEL Commands: Table 1 contains the set of commands
within the VonNeumann control program for stream coor-
dination, including their pattern parameters, source, and
destination. Shared_Ld/St are for transfers between lo-
cal and shared memory. Local_Ld/St transfer between
the local memory and the dataflow graph. XFER specifies
inter-dataflow communication streams to support fine-grain
dependences. Const can stream a pattern of val1,val2, eg.
(0,0,0,1,0,0,1,0,1), which is useful for inductive control-flow
within the dataflow graph. The Barrier_Ld/St command
prevents concurrent scratchpad memory access, and Wait
delays until a lane is no longer active. These are used for
flexible double buffering. All commands take a lane bitmask
as a parameter, to implement vector-stream control.
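To make the Const example concrete, the sketch below generates the pattern (0,0,0,1,0,0,1,0,1) from a starting group length, a group count, and a stretch. It assumes one plausible reading of the command: each group emits val1 for all but its last element and val2 for the last, and the group length changes by the stretch after every group. The function `const_stream` and its argument order are hypothetical.

```python
def const_stream(val1, val2, n_i, n_j, stretch):
    """Emit n_j groups; each group is val1 repeated, then a single val2.

    The group length starts at n_i and changes by `stretch` after each group.
    """
    out = []
    length = n_i
    for _ in range(n_j):
        if length > 0:
            out.extend([val1] * (length - 1))
            out.append(val2)
        length += stretch
    return out

pattern = const_stream(val1=0, val2=1, n_i=4, n_j=3, stretch=-1)
```

Under these assumptions a single command marks the last element of every shrinking inner iteration, which a dataflow can use, for example, to reset an accumulator.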
6. REVEL MICROARCHITECTURE
We describe REVEL’s microarchitecture by first giving
a broad overview, then explaining the key innovations that
enable efficient exploitation of FGOP. We discuss the het-
erogeneous compute fabric in detail, as it is a key novel
component of the design, enabling low overhead execution
of unbalanced FGOP regions.
6.1 Hardware Overview
The REVEL processor (Figure 14) is composed of a num-
ber of lanes, a low power VonNeumann control core, and a
shared scratchpad. The control core can issue vector-stream
commands to each lane. Each lane manages its stream and
memory dependences, data access requests and computation
firing. Dataflows on the same or separate lanes can communicate
data through the XFER unit or shared scratchpad.
[Figure 14: REVEL Microarchitecture. Components shown: the VonNeumann control core with its command queue and command sync; vector lanes 1 and 2 (lanes 3-8 elided), each with a command queue, lane control, stream control, private SPAD, compute fabric with input and output ports, and XFER unit; the shared SPAD; the SPAD data bus; and the config buses (input and xfer).]
High-level Operation: REVEL’s high-level execution flow
is as follows: First, the core issues a config command,
and configuration data is broadcast to each relevant lane’s
compute fabric and its ports. Asynchronously, the control
core begins to compute the parameters of any stream commands.
When a command is ready, it is broadcast to the relevant
lanes’ command queues. Commands are then issued to
either the private or shared scratchpad, provided the resources
they depend on are free (eg. input or output compute-fabric
ports). Streams execute locally until completion, at which
point they notify the command queue that they are free. Independent
dataflows may be mapped onto the compute fabric,
where they are executed in pipelined fashion once enough
data has arrived for an instance of the computation. Once the
control core has completed issuing vector-stream commands,
it issues a Wait command. This blocks the control program
until all relevant lanes are no longer active, which is
determined by the completion of all streams.
The responsibilities of each component are:
• Command Queue is the lane’s resource manager, and is
responsible for maintaining data ordering. It maintains
a queue of commands from the control core, and issues
them to the scratchpad or XFER unit if no barrier com-
mands or port dependences prevent that. A scoreboard
tracks ports in-use.
• Stream Control maintains the set of concurrent streams,
where each stream tracks the state of its iterators (i, j) and
length ni (to support inductive access). It can generate
addresses for one stream per cycle, along with a mask
for any unused words of the scratchpad line. Streams are
prioritized by minimum “cycles-to-stall,” which is the
number of cycles before the corresponding port will run
out of data (data-in-fifo / port-width).
• Input/Output Ports contain a set of FIFOs for holding
intermediate results while waiting for (or produced from)
the compute fabric. Input ports can receive data either
from the scratchpad bus, or from the XFER unit if receiv-
ing data from a neighboring dataflow. Each port attaches
to a unique location within the grid, so it is the compiler’s
responsibility to choose optimal ports.
• Compute Fabric monitors the data ready in each input
port FIFO to determine which dataflows can begin. Multiple
dataflows can fire in a single cycle. This heterogeneous
fabric is divided into regions which specialize for either
critical or non-critical computations (Section 6.3).
• XFER Unit is responsible for arbitrating the bus from
output ports back to the local or remote input ports, which
enables fine-grain dependences between dataflows, both
within and across lanes.
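The stream-prioritization rule in the Stream Control item can be sketched as follows. The code assumes each active stream is described by the amount of data buffered at its destination port and that port's width, and it picks the stream with the fewest cycles-to-stall (data-in-fifo / port-width). The dictionary field names are hypothetical.

```python
def pick_stream(streams):
    """Pick the stream whose destination port will run out of data soonest.

    streams: list of dicts with hypothetical fields
      'name', 'data_in_fifo' (words buffered), 'port_width' (words per cycle).
    """
    def cycles_to_stall(s):
        return s["data_in_fifo"] / s["port_width"]
    return min(streams, key=cycles_to_stall)

active = [
    {"name": "matrix", "data_in_fifo": 8, "port_width": 8},   # 1.0 cycles left
    {"name": "scalar", "data_in_fifo": 4, "port_width": 1},   # 4.0 cycles left
    {"name": "vector", "data_in_fifo": 2, "port_width": 4},   # 0.5 cycles left
]
chosen = pick_stream(active)
```

A wide port with little buffered data is served first even if a narrow port holds fewer words in absolute terms, since the wide port drains faster.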
6.2 Supporting FGOP Abstractions
Here we describe the essential hardware mechanisms for
supporting FGOP within REVEL. Details on the heterogeneous
compute fabric are discussed subsequently (Section 6.3).
Concurrent Dataflows with Ordered Dependences: To
support multiple dataflows with different firing conditions,
the data present in each dataflow’s ports are tracked sep-
arately by the data-firing logic, which can manage up to
four independently-firing dataflows. The association between
ports and dataflows is determined at configuration time.
One other challenge is maintaining data-ordering when
there are fine-grain dependences between lanes – ie. a source
lane should not transmit until all prior data items (in pro-
gram order) for the destination lane’s port have arrived. This
is accomplished by sending the destination lane a place-
holder stream. The destination’s command queue informs the
source’s when the placeholder is issued for the destination
port, and the source’s command queue informs the destina-
tion’s when the placeholder can be removed.
Inductive Memory Access: To support inductive re-use
streams, the scratchpad controller maintains the current iterator
values and the current stream length. When ni addresses
are complete, the length is adjusted by the stretch sij. Note
that sij is a fixed-point number to support vectorization with
induction patterns.
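A minimal software model of inductive address generation is sketched below. It assumes a two-level pattern with inner stride `c_i`, outer stride `c_j`, an initial inner length `n_i`, an outer count `n_j`, and an integer stretch applied to the inner length after each outer iteration. The fixed-point stretch mentioned above is not modeled, and the function name `inductive_addresses` is hypothetical.

```python
def inductive_addresses(base, c_i, c_j, n_i, n_j, stretch):
    """Generate addresses for a 2D stream whose inner length is inductive.

    Address = base + j*c_j + i*c_i for j in [0, n_j) and i in [0, len_j),
    where len_0 = n_i and len_(j+1) = len_j + stretch.
    """
    addrs = []
    length = n_i
    for j in range(n_j):
        for i in range(max(length, 0)):
            addrs.append(base + j * c_j + i * c_i)
        length += stretch
    return addrs

# Upper triangle of a row-major 4x4 matrix: the outer stride of 5 steps
# one row down and one column right, and each row is one element shorter.
upper = inductive_addresses(base=0, c_i=1, c_j=5, n_i=4, n_j=4, stretch=-1)
```

With a stretch of zero the same generator produces an ordinary rectangular pattern, so the inductive case adds only one parameter to the stream description.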
Inductive Data Reuse: While a stream without reuse would
perform the usual destructive FIFO read on every cycle, a
stream with reuse will only pop the data from the port at a
longer interval. When the stream is issued from the command
queue to the stream control unit, the reuse configuration (nr
and sr) is sent to the port (maintained similarly to params for
memory access). Besides enabling fine-grain dependences
with inductive changes in re-use length, another benefit of
the reuse within the configurable port is a large reduction in
scratchpad bandwidth.
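The reuse behaviour at a port can be modeled in a few lines. The sketch assumes that nr is the number of times the value at the head of the FIFO is presented to the fabric before it is popped, and that sr is added to nr after each pop; this reading of the two parameters, and the function name `reuse_stream`, are assumptions made for illustration.

```python
def reuse_stream(values, n_r, s_r):
    """Present each FIFO value n_r times before popping it; then n_r += s_r."""
    out = []
    reuse = n_r
    for v in values:
        out.extend([v] * max(reuse, 0))
        reuse += s_r
    return out

delivered = reuse_stream(["a", "b", "c"], n_r=3, s_r=-1)
```

In this example three scratchpad reads supply six operand deliveries to the fabric, which is the bandwidth reduction the text refers to.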
Implicit Vector Masking: As a stream is executing, the
stream control unit compares the remaining iterations with
the vector length of the destination port. If the number of
iterations left is non-zero and less than the vector length, the
stream control unit sends the data padded with zeroes for the
unused lanes, along with additional meta-information which
indicates that those lanes should be predicated off. This information
is buffered in a predication FIFO which tracks the data FIFO.
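The masking step can be sketched in software as below. The function splits a stream into vectors of a given width and, for a final partial vector, pads the unused lanes with zeroes and emits a predicate per lane. The name `vectorize_with_mask` and the list-of-booleans mask representation are illustrative assumptions.

```python
def vectorize_with_mask(data, vec_len):
    """Split data into vec_len-wide vectors; zero-pad and mask the tail."""
    vectors = []
    for start in range(0, len(data), vec_len):
        chunk = list(data[start:start + vec_len])
        valid = len(chunk)
        chunk += [0] * (vec_len - valid)                      # pad unused lanes
        mask = [True] * valid + [False] * (vec_len - valid)   # predicate bits
        vectors.append((chunk, mask))
    return vectors

vecs = vectorize_with_mask([1, 2, 3, 4, 5, 6], vec_len=4)
```

Because the mask is derived from the remaining iteration count, the program never needs a separate scalar clean-up loop for lengths that are not a multiple of the vector width.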
6.3 Heterogeneous Compute Fabric
Attaining high utilization in FGOP workloads requires bal-
ancing execution resources between critical and non-critical
dataflows. This is especially challenging given that they pre-
fer quite-different fabric microarchitectures.
[Figure 15: Heterogeneous Computation Fabric and Tiles. Diagram labels: Temporal Tile, Dedicated Tile, RegFile, FuncUnit, Accum, Conf. reg, Inst. Sched, Inst Buffer, Temporal Switch Alloc., Temporal Region, Dedicated Region. Critical dataflows are mapped to the dedicated region and non-critical dataflows to the temporal region.]
There are two key types of fabrics with different tradeoffs.
Dedicated fabrics are those that restrict each execution
resource (tile) to execute only a single instruction, but
pipelined at full throughput (eg. FPCA [24], Q100 [21],
Tartan [28], PipeRench [29], and DySER [6]). In contrast,
temporal fabrics are those that may time-multiplex multiple
different static instructions over the same resources
(eg. TRIPS [30], WaveScalar [31], Triggered Insts [32] and
RAW [33]) [footnote 2]. While dedicated fabrics only need to wait for
the arrival of data for dataflow execution, temporal fabrics
require token matching, meaning more complex structures
(and power/area overhead).
Because critical dataflows are often easily vectorizable by
the compiler, they can be scaled to the size of the fabric, and
be executed more efficiently on the more power/area efficient
dedicated fabric. However, non-critical dataflows often have
many instructions, and running them on a dedicated fabric
would be a waste of resources (eg. FP units) as they would at-
tain poor utilization. Therefore, non-critical dataflows would
be best run on a temporal fabric. It would also be inefficient
to run critical dataflows on a (smaller) temporal fabric, as
contention for resources would degrade the throughput. An
over-provisioned temporal fabric can alleviate this, at the cost
of significant power/area overhead.
Criticality-Specialization: Given the above, our approach
is to make the fabric heterogeneous: provision most of the
fabric’s resources to be a dedicated fabric to enable fast execu-
tion of the critical dataflows, and allocate a smaller portion to
be a temporal fabric, which can execute non-critical regions
efficiently. Figure 15 shows the lower corner of REVEL’s
heterogeneous compute fabric, which embeds the temporal
fabric’s network and compute into the dedicated fabric.
[Footnote 2: A VonNeumann core is also temporal in this context, but does not yield enough performance for this use in REVEL.]
The physical network for both fabrics is a circuit-switched
mesh with no flow control. The dedicated tiles simply select
inputs, perform computations, and forward outputs in fully
pipelined fashion according to the dataflow configured by the
mesh. The dataflow compiler must equalize delays for each
operand to ensure correct execution.
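The delay-equalization requirement can be illustrated with a small sketch. It assumes the dataflow is given as a graph in topological order with a fixed latency per node, ignores routing latency, and computes how many cycles of padding each edge needs so that all operands of a node arrive in the same cycle. The function `equalize_delays` and the example latencies are hypothetical, and this is not the algorithm used by the REVEL compiler.

```python
def equalize_delays(nodes, edges, latency):
    """Return extra delay per edge so all operands of a node arrive together.

    nodes   -- node names in topological order
    edges   -- list of (src, dst) pairs
    latency -- dict: node -> cycles the node takes (inputs use 0)
    """
    ready = {}   # cycle at which each node's output becomes valid
    extra = {}   # padding cycles required on each edge
    for node in nodes:
        preds = [s for (s, d) in edges if d == node]
        start = max((ready[p] for p in preds), default=0)
        for p in preds:
            extra[(p, node)] = start - ready[p]   # pad the early operands
        ready[node] = start + latency[node]
    return extra, ready

# s = (x * y) + x: the direct x operand must wait for the multiplier.
extra, ready = equalize_delays(
    nodes=["x", "y", "m", "s"],
    edges=[("x", "m"), ("y", "m"), ("m", "s"), ("x", "s")],
    latency={"x": 0, "y": 0, "m": 3, "s": 1},
)
```

Here the edge from x to s needs three cycles of padding. Without it, a fully pipelined tile with no flow control would combine operands from different iterations.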
The temporal fabric is embedded within the circuit-
switched mesh, using a pattern shown by blue arrows
in Figure 15. This allows temporal units to communi-
cate without interfering with the dedicated region (ie. no
horizontal/vertical links consumed). The temporal tile
microarchitecture is based on Triggered Instructions [32],
which performs operations based on the arrival of data to a
queue at the input or output. A register file holds live-state of
waiting instructions.
Note that dataflows communicate through ports (exiting
and re-entering the fabric). The benefit of integration into the
same network is that when there are no non-critical dataflows,
the temporal region may be reclaimed for use by critical
instructions. Section 8 details compilation concerns.
7. GENERALITY OF OUR APPROACH
Finally, we argue our approach is applicable to other archi-
tectures. Table 2 explains how to add FGOP capabilities to
out-of-order (OOO) cores and Plasticine [4], a reconfigurable
dataflow fabric, programmed using parallel patterns. This
table describes changes necessary in the software, ISA and
hardware.
8. SOFTWARE STACK
A program is decomposed into two components: 1.
C+intrinsics specifying the Von Neumann control program,
and 2. Dataflow specification which is mapped onto the
compute fabric. A dataflow compiler (described in the
next paragraph) is responsible for producing the hardware
configuration bits for the temporal and dedicated portion of
the fabric. These are finally compiled together to create the
final RISCV binary.
Dataflow Compiler: We implemented a spatial architecture
compiler (eg. [34, 35, 36, 37, 38]) which maps computation
and communication of all dataflows together on the com-
pute fabric. For the dedicated dataflows, all operand timing
paths must be equalized, and there is no sharing of resources.
For the temporal dataflows, the goal is to minimize resource
contention. Usually instructions from a temporal/dedicated
dataflow map to the corresponding region of the compute fabric.
However, temporal instructions may map to the dedicated
fabric to minimize utilization, and dedicated instructions may
be mapped to the temporal fabric to minimize latency or network
congestion, provided that there are enough resources
either way. To balance these objectives, we take a simulated
annealing approach similar to the Pathfinder algorithm [39]
and prior stochastic schedulers [40], which allows resource
over-provisioning to determine and then constrain heavily
needed network and execution resources.

Table 2: Adding FGOP Abstractions to Existing Architectures (S/W: Software, H/W: Hardware)

OOO Core
  Ordered Dep.          S/W: Thread-communication-aware OS sched. ISA: Stream-based producer/consumer instrs. H/W: Commun.-FIFOs b/t neighbor cores.
  Inductive Dep.        S/W: (see below). ISA: Add induction parameters to stream instrs. H/W: Add inductive control to FIFOs.
  Inductive Mem.        S/W: Streaming memory command interface. ISA: Add induction parameters to stream instrs. H/W: Add streaming memory request engine.
  Implicit Vector Mask  S/W: Add FIFO interf. b/t streams & vector instrs. ISA: Implicit mask register indicating predicated lanes. H/W: Vector store instruction ignores masked lanes.
  Crit. Specialization  Not applicable, no explicit-dataflow substrate.
Plasticine [4]
  Ordered Dep.          Already supported.
  Inductive Dep.        S/W: Add inductive param for map&fold patterns. ISA: Update stream-control and addr. gen. interf. H/W: Add induction to stream controller.
  Inductive Mem.        S/W: Add inductive param for map&fold patterns. ISA: Update stream-control and addr. gen. interf. H/W: Add induction to addr. gen.
  Implicit Vector Mask  S/W: None. ISA: Update stream instr. semantics. H/W: Implement predication within SIMD Lanes.
  Crit. Specialization  S/W: None. ISA: Temporal fabric ISA. H/W: Make some PCU’s temporally shared.

Table 3: REVEL Params

REVEL Lane (×8)
  CGRA
    PEs:          14 add, 3 sqrt/div, 9 mult
    Div/Sqrter:   Lat.: 12 Cyc., Thr.: 5 Cyc.
    SubwrdSIMD:   4-way Fixed-point, 2-way FP
    Data Firing:  4 Independent Dataflows
    Temporal PE:  2x1 (32 Insts/FU)
  Vector Ports
    Width:        2×512, 2×256, 1×128, 1×64 bit
    Capability:   4-entry FIFO + Config. Reuse
  Stream Control
    SPAD Ctrl.:   Induct. Addr. Gen, 8-Ent. Table
    XFER Ctrl.:   8-Entry Stream-table
    Cmd Queue:    8-Entry Cmd Queue
  SPAD
    Structure:    8Kb, Single-bank
    Bandwidth:    512 Bits (1R/1W Port)
  Net.
    SPAD-Ports:   512 Bit Dedicated Bus
    XFER-Ports:   512 Bit Dedicated Bus
    Ports-CGRA:   Point-to-Point 64-bit links
Ctrl Core
    RISCV ISA [41], 5-stage, single-issue, 16kb d$, insts. added for stream-commands
Shr. SPD
    Structure:    128Kb, Single-bank
    Bandwidth:    512 Bits (1R/1W Port)
Net.
    Inter-lane:   512 Bit Data Bus (8-bit Cmd Sync)
    Shared scratchpad Bus: 512 Bit Shared Bus
9. EVALUATION METHODOLOGY
REVEL Modeling: REVEL hardware parameters are in
Table 3. For performance, all blocks are modeled at a cycle
level within a modified gem5 [42][43]. We synthesized a
single lane of REVEL (heterogeneous fabric, stream control,
ports, command queue, XFER unit) using Synopsys DC,
28nm tech library. The design meets timing at 1.25GHz. An
open source triggered instructions implementation was our
reference for the temporal fabric [44]. Results from synthesis,
with Cacti 7.0 [45] for SRAMs, are used to create an event-
based power model and area model.
ASIC Analytical Models: These highly-optimistic models
(Table 4) are based on the optimized algorithms, and are
only limited by the algorithmic critical path and throughput
constraints, with equivalent FUs to REVEL. ASIC area and
power models only count FU and scratchpad power.
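Table 4: Ideal ASIC Models (assumes FU lat. from Table 3)

SVD:         $48m + 2\,\mathrm{QR}(n) + \lceil n^3/4 \rceil$
QR:          $40n + n^2 + \sum_{i=1}^{n} (i + i \cdot n)$
MM:          $\lceil nmp/8 \rceil$
Solver:      $2 \sum_{i=0}^{n-1} \max(\lceil i/4 \rceil, 14)$
FFT:         $(n/8) \log n$
Cholesky:    $\sum_{i=1}^{n-1} \max(\lceil i^2/4 \rceil, 24)$
Centro-FIR:  $\lceil (n-m+1)/4 \rceil$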
Comparison Methodology: For fairness we compare designs
with similar max. per-cycle throughput:
• TI 6678 DSP (@1.25GHz) 8-core DSP, each core has
16-FP adders/multipliers, using DSPLIB_C66x_3.4.0.0.
• OOO Core: Intel Xeon 4116 (@2.1GHz) Conventional
OOO processor using highly-optimized Intel MKL library.
(8 cores used)
• REVEL-No-FGOP: REVEL without FGOP support (8
Lanes). To evaluate, we therefore also implement highly-
optimized non-FGOP workload versions.

Table 5: Workload Params. and FGOP Features. Lane: #Lanes in latency ver., Acc: Access pat., Dep: Fine-grain deps, Reuse: stream-reuse, Het: Heterog. fabric, Vec: implicit vect. masking

Workload   Data Size        Lane  Acc  Dep  Reuse  Het  Vec
SVD        12,16,24,32      1     RI   Y    Y      Y    Y
QR         12,16,24,32      8     RI   Y    Y      Y    Y
Cholesky   12,16,24,32      8     RI   Y    Y      Y    Y
Solver     12,16,24,32      1     RI   Y    Y      Y    Y
FFT        64,128...1024    1     RR   N    Y      N    N
GEMM       12,24,48x16x64   8     RR   N    Y      N    N
FIR        12,16,24,32      8     I    N    Y      N    N
Workload Versions: We make comparisons in two different
settings, high-throughput and low-latency. The throughput
setting assumes there exist multiple data items to parallelize
over, while the latency setting assumes only one. We imple-
ment both throughput and latency optimized REVEL work-
loads, where latency-optimized spreads work across multiple
lanes. Throughput versions use each lane in data-parallel
fashion. Note that we could not profitably parallelize any
FGOP kernel across multiple DSP/OOO cores, even using
native libraries, so their latency-optimized versions only use a
single core. Table 5 describes data-sizes, and also how FGOP
features were used by each workload.
10. EVALUATION
We broadly answer the question of whether fine-grain or-
dered parallelism is exploitable in DSP workloads, and if
REVEL’s execution model, architecture, and microarchitecture
are effective. What we find overall is that REVEL’s ability
to exploit FGOP leads to order-of-magnitude speedup and
area-normalized performance over traditional DSPs.
We first discuss the applicability of FGOP features and
overall latency and throughput potential. We then explain
how performance improvements were achieved by analyz-
ing cycle-level bottleneck breakdowns, and incremental per-
formance improvements. We also answer the question of
sensitivity to temporal region size and address-generation
capability. Finally, we analyze the area and power break-
downs, comparison of normalized performance, and compare
to optimistic ASIC models.
Q1. Can workloads use REVEL’s FGOP features?: Ta-
ble 5 shows the applicability of FGOP features. Matrix fac-
torization/decomposition workloads (QR, SVD, Cholesky,
Solver) use all FGOP features. Even non-FGOP workloads
took advantage of streaming-reuse to reduce scratchpad band-
width, and FIR had a short inductive access phase.
Q2. How much speedup do REVEL’s execution model
and FGOP features provide?: The speedups over DSP for
latency optimized codes are shown in Figure 16 for both small
and large matrices. The DSP and CPU have similar mean
performance. REVEL attains up to 37× speedup, with geomean
of 10× and 17× for small and large data sizes. Considering
just workloads which exhibit FGOP, the speedup from FGOP
specialization is 6.1× (large size). REVEL’s
dataflow/vector-stream model without FGOP provides 2.8×
speedup over the DSP. The DSP is only competitive on the small
FFT, as REVEL here requires multiple configurations.
[Figure 16: Latency-optimized kernel performance. Speedup (log scale) for svd, qr, cho, sol, fir, mm, fft, and geomean (GM), in Small and Large panels; series: TI C6678 @1.25GHz, Intel OoO @2.10GHz, REVEL (no fgop), REVEL.]
[Figure 17: Throughput-optimized kernel performance. Relative throughput (log scale) for svd, qr, cho, sol, fir, mm, fft, and geomean (GM), in Small and Large panels; series: TI C6678 @1.25GHz, Intel OoO @2.10GHz, REVEL (no fgop), REVEL.]
Performance for throughput-optimized kernels (data par-
allelism across lanes), is shown in Figure 17. For small and
large sizes, REVEL gets a speedup of 6.3× and 8.1× over
the DSP and CPU. Again, considering just workloads which
exhibit FGOP, the speedup from FGOP specialization is 4.4×
(large size). REVEL’s dataflow/vector-stream model provides
the other 2.6× speedup over the DSP. The performance trade-
offs here are similar, except the advantage of parallelizing
across lanes is diminished due to data-parallel execution.
The vector-stream control and FGOP-exploitation enable
combined order-of-magnitude speedups.
Q3. Why does exploiting FGOP help REVEL?: Figure 18
overviews REVEL’s cycle-level behavior, normalized to non-
FGOP hardware. Latency-optimized workloads are labeled
as “multi”. To explain the categories, issue and multi-issue
mean that one or multiple dedicated dataflows fired, and
temporal means only a temporal dataflow fired during that
cycle. All other categories represent overhead, including the
drain of the dedicated fabric, scr-b/w and scr-barrier for
bandwidth and synchronization, stream-dpd for waiting on
dependences, and ctrl-ovhd for waiting on the control core.
The clearest trend is that exploiting FGOP reduces the con-
trol overhead for both small and large matrix sizes. Also, ex-
ploiting FGOP enables parallelism between dataflows, which
can be seen in the multi-issued category, especially prevalent
for the larger matrix sizes of FGOP kernels.
Exploiting FGOP increases parallelism and reduces con-
trol overhead, enabling higher hardware utilization.
Q4. What is the impact of each mechanism?: Figure 19
shows the incremental speedup of each hardware/software
feature (so 5 versions of each kernel). Latency-optimized
versions have “lat” as a suffix.
[Figure 18: REVEL’s Cycle-level bottlenecks. Relative execution time (0 to 1) for svd, qr, cho, sol, fir, mm, and fft, in two panels, comparing the nofgp, revel, and revel-multi versions; cycle categories: issued, multi-issued, temporal, stream-dpd, scr-b/w, scr-bar, ctrl-ovhd, drain, config.]
[Figure 19: Performance Impact of Each Mechanism. Relative speedup for svd, qr, qr-lat., cho., cho.-lat., and sol. (sizes 12, 16, 24, 32) and fft (sizes 64, 128, 512, 1024), stacked by mechanism: no fgop, induc. acc./reuse, fine-grain dep., hetro. fab., vec. mask.]
The inductive benefit is small alone, as it reduces control
but does not increase parallelism. Only FFT benefits greatly
by using inductive reuse to reduce SPAD bandwidth. Most
workloads were accelerated dramatically from exploiting fine-
grain dependences. However, QR and SVD have complex
sub-critical regions, so they only see the benefit after adding
the heterogeneous fabric. Throughput-optimized QR suffers
from local memory access because of the shrinking matrix
sizes, but latency-optimized QR converts these to inter-lane
data streams. Solver is also accelerated by the heterogeneous
fabric because it is latency sensitive, and collapsing sub-
critical instructions can reduce latency. Cholesky’s triangular
access implies large gains from implicit vector masking.
REVEL’s mechanisms together enable high performance.
Q5. What is the biggest remaining overhead for
REVEL?: As shown in Figure 18, this is the drain time on
smaller workloads, often caused by reconfiguration. This
is more of an issue for the smaller matrices and especially
FFT, where the datapath should be reconfigured for each
algorithm phase. REVEL’s reliance on deep pipelines causes
config/serialization penalty on extremely short phases.
Q6. What are the sources of area and power?: Table 6
shows the breakdown; the largest source (especially power)
comes mostly from the floating point units. At 28nm, REVEL
is 1.79mm2. Note that the control core is now only about one
50th of the overall area.
Q7. Does REVEL have better perf/mm2?: REVEL’s
high speedup with only small area overhead (over the DSP)
for the computation fabric’s networks results in a large
performance/mm2 advantage: 1308× over the OOO core and
8.3× over the DSP.
Table 6: Area and Power Breakdown (28nm)

Component                                 area(mm2)   power(mw)
Compute Fabric: Dedi. Net. (23)           0.05        71.40
Compute Fabric: Temp. Net. (2)            0.01        14.81
Compute Fabric: Func. Units               0.07        74.04
Compute Fabric: Total Fabric              0.13        160.25
Control (ports/XFER/str. ctrl)            0.03        62.92
SPAD-8KB                                  0.06        4.64
1 Vector Lane                             0.22        207.90
Control Core                              0.04        19.91
REVEL                                     1.79        1663.3

Q8. How sensitive is REVEL’s performance to the size
of the temporal region?: Because temporal tiles cost
more than 5× the area of dedicated tiles (dedicated tile:
2265µm2, temporal tile: 12062µm2), it is important to
choose the correct temporal region size. Figure 20 shows
REVEL’s performance sensitivity to this size, as well as the
area tradeoff. SVD and QR have the largest regions, so are
affected the most, but a 1×1 temporal region only has 13%
overhead. We choose this size to minimize area penalty.
[Figure 20: Temporal region sensitivity (width×height). Relative performance of svd, qr, cholesky, and solver, together with relative area overhead, for temporal region sizes 1x1, 1x2, 2x2, 3x2, and 3x3.]
Q9. Do we require a heterogeneous fabric?: To create a
purely dedicated fabric, we would have had to support 52
additional dedicated tiles for our largest temporal region in
SVD, costing about 2.75× fabric area. Similarly, for the
entire design to be temporal, it would have cost around 2.5×
fabric area. A heterogeneous fabric provides the best perfor-
mance/area ratio.
Q10. Would even more complex inductive streams have
reduced control overhead?: Our analysis so far has shown
that using 2D inductive streams can reduce control overhead
and improve performance significantly. An interesting ques-
tion is whether supporting higher dimension stream-access
could have helped further.
To analyze stream capabilities analytically, we implement
a static compiler analysis in LLVM [17], using scalar evo-
lution analysis [46] for the closed-form representation of
address patterns and loop termination with respect to induc-
tion variables. This analysis can determine the length of a
given stream for each pattern. Figure 21 shows the average
length of a stream: the number of loop iterations the pattern
describes. We also calculate the number of effective mem-
ory instructions per inner-loop iteration, “Mem. Insts/Iter”,
which is a measure of the control overhead (Figure 22). We
consider vector (V), 1D streams (R), 2D streams (RR and
inductive RI) and 3D streams (RRR and inductive RII).
Regular workloads like GEMM require only a low dimension
rectangular access pattern for a long length. However,
FGOP workloads show much higher lengths only with inductive
access capability (RI or RII capability). This benefit
translates to fewer memory instructions per-iteration. A value
of less than 1 in Figure 22 means that fewer than one control
instruction would need to be issued per cycle. This helps
to explain why vector instructions alone are insufficient for
parallelism: they require too much specification of
work. Fortunately, the RI capability always either achieves a
control overhead below 1 inst/iter or matches the least overhead
capability.
[Figure 21: Stream-type Access Length Comparison. Avg. stream length (log scale) for svd, qr, chol, solver, fir, gemm, and fft under capabilities V8, R, RR, RI, RRR, and RII.]
[Figure 22: Control overhead of various Capabilities measured in control instructions per iteration (Mem. Insts./Iter., log scale), for svd, qr, chol, solver, fir, gemm, and fft under capabilities V8, R, RR, RI, RRR, and RII. The stacked bar indicates additional control overhead if stream-reuse technique is disabled.]
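The effect of inductive capability on stream length can be seen with a simplified counting model. The sketch below is not the LLVM-based analysis. It considers a single triangular loop nest with n rows, where row j has n - j elements, and assumes that a 1D capability (R) needs one stream command per row while a 2D inductive capability (RI) covers the whole triangle with one command. The function `commands_needed` is hypothetical.

```python
def commands_needed(n, capability):
    """Count stream commands needed to cover a triangular domain of n rows.

    Row j (0-based) has n - j elements. 'R' describes one row per command;
    'RI' describes all rows at once because the inner length is inductive.
    Returns (num_commands, average_stream_length).
    """
    total = n * (n + 1) // 2
    if capability == "R":
        cmds = n
    elif capability == "RI":
        cmds = 1
    else:
        raise ValueError("unknown capability")
    return cmds, total / cmds

r_cmds, r_len = commands_needed(32, "R")      # 32 commands, 16.5 elements each
ri_cmds, ri_len = commands_needed(32, "RI")   # 1 command, 528 elements
```

Under this model the average stream length grows from roughly n/2 to n(n+1)/2 when induction is available, which mirrors the trend described for the FGOP workloads.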
The ability to reuse stream values as part of the stream
definition can also reduce control overhead. The control
overhead if this feature is disabled is shown by the stacked
bar in Figure 22. This benefit is modest; the more important
reason for stream-reuse is to reduce memory bandwidth.
2D Induction streams (RI) are necessary to reduce control
overhead, and RII may provide only a small energy advantage,
but is also more complex.
Q11. How competitive is REVEL with custom ASICs?:
Table 7 shows the performance-normalized power and area
overhead, as compared to ASIC analytical models. REVEL’s
mean power is 2.2× that of the ASICs. This is mostly due to the control logic
(vector ports, bus, etc.) and reconfigurable networks. It
is 0.55× the area of the combined ASIC. Note this is highly
optimistic for ASICs, as the performance model assumes
perfect pipelining, and the power model assumes no control.

Table 7: Power/Area overheads to ideal ASIC (iso-perf)

Workloads     SVD   QR    Cho.  Sol.  FIR   MM    FFT   Mean
Power Ovhd.   3.5   2.1   2.2   2.0   2.0   1.9   1.9   2.2
Area Ovhd.    3.8   2.4   2.5   2.7   2.2   2.1   2.6   2.6/0.55

REVEL is competitive with ASICs, and could replace fixed-
function accelerators or conventional DSPs in some designs.
11. RELATED WORK
DSP Accelerators: Many application/domain-specific recon-
figurable designs have targeted DSP algorithms. Fasthuber
et al. [47] outline the basic approaches. One representative
example includes LAC [48], targeted at matrix factorization.
Ordered Parallelism and Synchronization: A conceptu-
ally similar work to ours from the GPGPU space is dMT-
CGRA [49], which proposes inter-thread communication
between SIMT threads [50, 51]. Prabhakar et al. [52] develop
“nested-parallelism,” which enables coupling of datapaths
with different nesting of parallel patterns. Swarm [53]
also targets a form of ordered parallelism by building ab-
stractions on top of a task-parallel model, targeting irregular
data-dependent parallelism [54].
Task-parallel+Acceleration: An alternative model to ours
is task-based parallelism plus some form of acceleration to
reduce the synchronization overhead (eg. TAPAS [55]). Task
parallelism has the benefit of dynamic load balancing, but
this does not appear to be necessary for our DSP workloads.
Flexible Vectorization: Our vector-stream control paradigm
is inspired by prior techniques which marshal independent
execution lanes to create a vector-like execution when use-
ful. This includes Vector-Threading [26, 25], Vector-Lane
Threading [27], and Loop-Task Accelerators [56]. REVEL
also marshals lanes to reduce control and increase parallelism,
but its lanes are autonomous once programmed with streams.
Some techniques apply vectorization with reconfigurability,
e.g., Libra [57] and DySER [6], which can create/reconfigure
vector lanes. REVEL also amortizes control through time.
Stream-based ISAs: Several prior architectures have stream
primitives. We list their address capability as compared to
REVEL in the table below. We believe REVEL is the only one
to support inductive patterns.
Name              Address Gen Capability Type
Imagine [19]      R
Q100 [21]         R
Accel DMA [58]    R
Softbrain [2]     RR
RSVP [20]         RR
CoRAM++ [22]      RR
APMC [23]         RR
REVEL             RI
FPCA [24]         RRR
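To make the distinction concrete, the following sketch contrasts a two-dimensional rectangular address stream with an inductive one. It assumes that "R" in the comparison denotes a rectangular dimension and "I" an inductive dimension whose trip count depends on the outer loop index; the function names and the `stretch` parameter are hypothetical and are not REVEL's actual stream commands.

```python
def rectangular_2d(base, inner_len, inner_stride, outer_len, outer_stride):
    """'RR'-style stream: every outer iteration has the same inner trip count."""
    return [base + j * outer_stride + i * inner_stride
            for j in range(outer_len)
            for i in range(inner_len)]


def inductive_2d(base, inner_len, inner_stride, outer_len, outer_stride, stretch):
    """'RI'-style stream: the inner trip count changes by `stretch` per outer iteration."""
    addrs = []
    for j in range(outer_len):
        length = max(inner_len + j * stretch, 0)  # trip count depends on j
        row_start = base + j * outer_stride
        addrs.extend(row_start + i * inner_stride for i in range(length))
    return addrs


N = 4
# All elements of a row-major N x N matrix.
full = rectangular_2d(0, N, 1, N, N)
# Upper triangle: row j starts at j*N + j and shrinks by one element per row.
tri = inductive_2d(0, N, 1, N, N + 1, -1)

print(full)
print(tri)
```

A rectangular-only generator would need one stream per row (or masking of unused elements) to cover the triangular region, whereas the inductive form expresses it as a single pattern. This is the kind of access that arises in factorization and triangular-solve kernels.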
12. DISCUSSION AND CONCLUSION
This paper identified fine-grain ordered parallelism as a
common property across a variety of linear-algebra and DSP
algorithms. This parallelism is extremely difficult to exploit using
existing Von Neumann vector and multi-threading architec-
tures. This work identified a set of abstractions and developed
an execution model and hardware implementation (REVEL)
which could exploit this form of parallelism.
Our REVEL implementation achieved more than an order of
magnitude lower latency (10×-17×), and its performance per
mm2 was 6.7× that of a DSP (up to 16×). Overall, REVEL’s
design offers large advantages over existing architectures for
important signal processing workloads, and is a promising
alternative to existing DSPs and beyond.
13. REFERENCES
[1] Y. C. Hu, M. Patel, D. Sabella, N. Sprecher, and V. Young, “Mobile
edge computing—a key technology towards 5g,” ETSI white paper,
vol. 11, no. 11, pp. 1–16, 2015.
[2] T. Nowatzki, V. Gangadhar, N. Ardalani, and K. Sankaralingam,
“Stream-dataflow acceleration,” in Proceedings of the 44th Annual
International Symposium on Computer Architecture, ISCA ’17, (New
York, NY, USA), pp. 416–429, ACM, 2017.
[3] T. Nowatzki, V. Gangadhar, K. Sankaralingam, and G. Wright,
“Pushing the limits of accelerator efficiency while retaining
programmability,” in 2016 IEEE International Symposium on High
Performance Computer Architecture (HPCA), pp. 27–39, March 2016.
[4] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao,
S. Hadjis, A. Pedram, C. Kozyrakis, and K. Olukotun, “Plasticine: A
reconfigurable architecture for parallel patterns,” in Proceedings of the
44th Annual International Symposium on Computer Architecture,
ISCA ’17, (New York, NY, USA), pp. 389–402, ACM, 2017.
[5] H. Singh, M.-H. Lee, G. Lu, N. Bagherzadeh, F. J. Kurdahi, and
E. M. C. Filho, “Morphosys: An integrated reconfigurable system for
data-parallel and computation-intensive applications,” IEEE Trans.
Comput., vol. 49, pp. 465–481, May 2000.
[6] V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish,
K. Sankaralingam, and C. Kim, “Dyser: Unifying functionality and
parallelism specialization for energy-efficient computing,” IEEE
Micro, vol. 32, pp. 38–51, Sept. 2012.
[7] R. Mudumbai, G. Barriac, and U. Madhow, “On the feasibility of
distributed beamforming in wireless networks,” IEEE Transactions on
Wireless communications, vol. 6, no. 5, 2007.
[8] H. Johansson et al., “Polyphase decomposition of digital
fractional-delay filters,” IEEE signal processing letters, vol. 22, no. 8,
pp. 1021–1025, 2015.
[9] R. Zhao, “Wls design of centro-symmetric 2-d fir filters using matrix
iterative algorithm,” in 2015 IEEE International Conference on Digital
Signal Processing (DSP), pp. 34–38, July 2015.
[10] F. Mintzer, “On half-band, third-band, and nth-band fir filters and their
design,” IEEE Transactions on Acoustics, Speech, and Signal
Processing, vol. 30, pp. 734–738, Oct 1982.
[11] M. Dendrinos, S. Bakamidis, and G. Carayannis, “Speech
enhancement from noise: A regenerative approach,” Speech
Communication, vol. 10, no. 1, pp. 45–57, 1991.
[12] D. Patel, M. Shabany, and P. G. Gulak, “A low-complexity high-speed
qr decomposition implementation for mimo receivers,” in 2009 IEEE
International Symposium on Circuits and Systems, pp. 33–36, May
2009.
[13] P. Salmela, A. Happonen, T. Jarvinen, A. Burian, and J. Takala, “Dsp
implementation of cholesky decomposition,” in Joint IST Workshop on
Mobile Future, 2006 and the Symposium on Trends in
Communications. SympoTIC ’06., pp. 6–9, June 2006.
[14] P. Darwood, P. Alexander, and I. Oppermann, “Lmmse chip
equalisation for 3gpp wcdma downlink receivers with channel coding,”
in ICC 2001. IEEE International Conference on Communications.
Conference Record (Cat. No.01CH37240), vol. 5, pp. 1421–1425
vol.5, 2001.
[15] D. Tse and P. Viswanath, Fundamentals of Wireless Communication.
New York, NY, USA: Cambridge University Press, 2005.
[16] L.-N. Pouchet, “Polybench: The polyhedral benchmark suite,” URL:
http://www.cs.ucla.edu/pouchet/software/polybench, 2012.
[17] C. Lattner and V. Adve, “LLVM: A compilation framework for
lifelong program analysis & transformation,” in CGO ’04, pp. 75–88.
[18] A. Buttari, “Multicore and multicore programming with openmp,”
URL: http://buttari.perso.enseeiht.fr/stuff/crgc_mcore.pdf, 2012.
[19] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. López-Lagunas,
P. R. Mattson, and J. D. Owens, “A bandwidth-efficient architecture
for media processing,” in Proceedings of the 31st Annual ACM/IEEE
International Symposium on Microarchitecture, MICRO 31, (Los
Alamitos, CA, USA), pp. 3–13, IEEE Computer Society Press, 1998.
[20] S. Ciricescu, R. Essick, B. Lucas, P. May, K. Moat, J. Norris,
M. Schuette, and A. Saidi, “The reconfigurable streaming vector
processor (rsvp),” in Proceedings of the 36th Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO 36,
(Washington, DC, USA), pp. 141–, IEEE Computer Society, 2003.
[21] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross, “Q100:
The architecture and design of a database processing unit,” in
Proceedings of the 19th International Conference on Architectural
Support for Programming Languages and Operating Systems,
ASPLOS ’14, (New York, NY, USA), pp. 255–268, ACM, 2014.
[22] G. Weisz and J. C. Hoe, “Coram++: Supporting data-structure-specific
memory interfaces for fpga computing,” in 25th International
Conference on Field Programmable Logic and Applications (FPL),
pp. 1–8, Sept 2015.
[23] T. Hussain, O. Palomar, O. Unsal, A. Cristal, E. Ayguadé, and
M. Valero, “Advanced pattern based memory controller for fpga based
hpc applications,” in 2014 International Conference on High
Performance Computing Simulation (HPCS), pp. 287–294, July 2014.
[24] J. Cong, H. Huang, C. Ma, B. Xiao, and P. Zhou, “A fully pipelined
and dynamically composable architecture of cgra,” in
Field-Programmable Custom Computing Machines (FCCM), 2014
IEEE 22nd Annual International Symposium on, pp. 9–16, IEEE,
2014.
[25] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris,
J. Casper, and K. Asanović, “The vector-thread architecture,” in
Proceedings of the 31st Annual International Symposium on Computer
Architecture, ISCA ’04, (Washington, DC, USA), pp. 52–, IEEE
Computer Society, 2004.
[26] Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, and
K. Asanović, “Exploring the tradeoffs between programmability and
efficiency in data-parallel accelerators,” in Proceedings of the 38th
Annual International Symposium on Computer Architecture, ISCA ’11,
(New York, NY, USA), pp. 129–140, ACM, 2011.
[27] S. Rivoire, R. Schultz, T. Okuda, and C. Kozyrakis, “Vector lane
threading,” in 2006 International Conference on Parallel Processing
(ICPP’06), pp. 55–64, Aug 2006.
[28] M. Mishra, T. J. Callahan, T. Chelcea, G. Venkataramani, S. C.
Goldstein, and M. Budiu, “Tartan: Evaluating spatial computation for
whole program execution,” in Proceedings of the 12th International
Conference on Architectural Support for Programming Languages and
Operating Systems, ASPLOS XII, (New York, NY, USA),
pp. 163–174, ACM, 2006.
[29] S. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. Taylor,
and R. Laufer, “Piperench: a coprocessor for streaming multimedia
acceleration,” in Computer Architecture, 1999. Proceedings of the 26th
International Symposium on, 1999.
[30] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John,
C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, W. Yoder, and the TRIPS
Team, “Scaling to the end of silicon with edge architectures,”
Computer, vol. 37, pp. 44–55, July 2004.
[31] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin, “Wavescalar,”
in Proceedings of the 36th Annual IEEE/ACM International
Symposium on Microarchitecture, MICRO 36, (Washington, DC,
USA), pp. 291–, IEEE Computer Society, 2003.
[32] A. Parashar, M. Pellauer, M. Adler, B. Ahsan, N. Crago, D. Lustig,
V. Pavlov, A. Zhai, M. Gambhir, A. Jaleel, R. Allmon, R. Rayess,
S. Maresh, and J. Emer, “Triggered instructions: A control paradigm
for spatially-programmed architectures,” in Proceedings of the 40th
Annual International Symposium on Computer Architecture, ISCA ’13,
(New York, NY, USA), pp. 142–153, ACM, 2013.
[33] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat,
B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma,
A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank,
S. Amarasinghe, and A. Agarwal, “The raw microprocessor: A
computational fabric for software circuits and general-purpose
programs,” IEEE Micro, vol. 22, pp. 25–35, Mar. 2002.
[34] M. Mercaldi, S. Swanson, A. Petersen, A. Putnam, A. Schwerin,
M. Oskin, and S. J. Eggers, “Instruction scheduling for a tiled dataflow
architecture,” in Proceedings of the 12th international conference on
Architectural support for programming languages and operating
systems, ASPLOS XII, pp. 141–150, 2006.
[35] W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and
S. Amarasinghe, “Space-time scheduling of instruction-level
parallelism on a raw machine,” in Proceedings of the Eighth
International Conference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS VIII, (New York, NY,
USA), pp. 46–57, ACM, 1998.
[36] H. Park, K. Fan, S. A. Mahlke, T. Oh, H. Kim, and H.-s. Kim,
“Edge-centric modulo scheduling for coarse-grained reconfigurable
architectures,” in Proceedings of the 17th international conference on
Parallel architectures and compilation techniques, PACT ’08,
pp. 166–176, 2008.
[37] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins,
“Exploiting loop-level parallelism on coarse-grained reconfigurable
architectures using modulo scheduling,” IEE Proceedings - Computers
and Digital Techniques, vol. 150, pp. 255–61, Sept 2003.
[38] T. Nowatzki, M. Sartin-Tarm, L. De Carli, K. Sankaralingam, C. Estan,
and B. Robatmili, “A general constraint-centric scheduling framework
for spatial architectures,” in Proceedings of the 34th ACM SIGPLAN
Conference on Programming Language Design and Implementation,
PLDI ’13, (New York, NY, USA), pp. 495–506, ACM, 2013.
[39] L. McMurchie and C. Ebeling, “Pathfinder: A negotiation-based
performance-driven router for fpgas,” in Third International ACM
Symposium on Field-Programmable Gate Arrays, pp. 111–117, Feb
1995.
[40] T. Nowatzki, N. Ardalani, K. Sankaralingam, and J. Weng, “Hybrid
optimization/heuristic instruction scheduling for programmable
accelerator codesign,” in Proceedings of the 27th International
Conference on Parallel Architectures and Compilation Techniques,
PACT ’18, (New York, NY, USA), pp. 36:1–36:15, ACM, 2018.
[41] K. Asanović and D. A. Patterson, “Instruction sets should be free: The
case for risc-v,” EECS Department, University of California, Berkeley,
Tech. Rep. UCB/EECS-2014-146, 2014.
[42] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi,
A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen,
K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The
gem5 simulator,” SIGARCH Comput. Archit. News, 2011.
[43] A. Roelke and M. R. Stan, “RISC5: Implementing the RISC-V ISA in
gem5,” in Workshop on Computer Architecture Research with RISC-V
(CARRV), 2017.
[44] T. J. Repetti, J. P. Cerqueira, M. A. Kim, and M. Seok, “Pipelining a
triggered processing element,” in Proceedings of the 50th Annual
IEEE/ACM International Symposium on Microarchitecture,
MICRO-50 ’17, (New York, NY, USA), pp. 96–108, ACM, 2017.
[45] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and
V. Srinivas, “Cacti 7: New tools for interconnect exploration in
innovative off-chip memories,” ACM Transactions on Architecture and
Code Optimization (TACO), vol. 14, no. 2, p. 14, 2017.
[46] R. A. Van Engelen, “Efficient symbolic analysis for optimizing
compilers,” in International Conference on Compiler Construction,
pp. 118–132, Springer, 2001.
[47] R. Fasthuber, F. Catthoor, P. Raghavan, and F. Naessens,
Energy-Efficient Communication Processors: Design and
Implementation for Emerging Wireless Systems. Springer Publishing
Company, Incorporated, 2013.
[48] A. Pedram, A. Gerstlauer, and R. van de Geijn, “Algorithm,
architecture, and floating-point unit codesign of a matrix factorization
accelerator,” IEEE Transactions on Computers, no. 1, pp. 1–1, 2014.
[49] D. Voitsechov and Y. Etsion, “Inter-thread communication in
multithreaded, reconfigurable coarse-grain arrays,” arXiv preprint
arXiv:1801.05178, 2018.
[50] D. Voitsechov and Y. Etsion, “Single-graph multiple flows: Energy
efficient design alternative for gpgpus,” in Proceeding of the 41st
Annual International Symposium on Computer Architecture, ISCA
’14, (Piscataway, NJ, USA), pp. 205–216, IEEE Press, 2014.
[51] D. Voitsechov and Y. Etsion, “Control flow coalescing on a hybrid
dataflow/von neumann gpgpu,” in 2015 48th Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO),
pp. 216–227, Dec 2015.
[52] R. Prabhakar, D. Koeplinger, K. J. Brown, H. Lee, C. De Sa,
C. Kozyrakis, and K. Olukotun, “Generating configurable hardware
from parallel patterns,” in Proceedings of the Twenty-First
International Conference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS ’16, (New York, NY,
USA), pp. 651–665, ACM, 2016.
[53] M. C. Jeffrey, S. Subramanian, C. Yan, J. Emer, and D. Sanchez, “A
scalable architecture for ordered parallelism,” in 2015 48th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO),
pp. 228–241, Dec 2015.
[54] K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan,
R. Kaleem, T.-H. Lee, A. Lenharth, R. Manevich, M. Méndez-Lojo,
D. Prountzos, and X. Sui, “The tao of parallelism in algorithms,” in
Proceedings of the 32nd ACM SIGPLAN Conference on Programming
Language Design and Implementation, PLDI ’11, (New York, NY,
USA), pp. 12–25, ACM, 2011.
[55] S. Margerm, A. Sharifian, A. Guha, A. Shriraman, and G. Pokam,
“Tapas: Generating parallel accelerators from parallel programs,”
[56] J. Kim, S. Jiang, C. Torng, M. Wang, S. Srinath, B. Ilbeyi,
K. Al-Hawaj, and C. Batten, “Using intra-core loop-task accelerators
to improve the productivity and performance of task-based parallel
programs,” in Proceedings of the 50th Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO-50 ’17, (New
York, NY, USA), pp. 759–773, ACM, 2017.
[57] Y. Park, J. J. K. Park, H. Park, and S. Mahlke, “Libra: Tailoring simd
execution using heterogeneous hardware and dynamic configurability,”
in Proceedings of the 2012 45th Annual IEEE/ACM International
Symposium on Microarchitecture, MICRO-45, (Washington, DC,
USA), pp. 84–95, IEEE Computer Society, 2012.
[58] Y. S. Shao, S. L. Xi, V. Srinivasan, G. Y. Wei, and D. Brooks,
“Co-designing accelerators and soc interfaces using gem5-aladdin,” in
2016 49th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO), pp. 1–12, Oct 2016.
