Predictable Accelerator Design with Time-Sensitive Affine Types by Nigam, Rachit et al.
Predictable Accelerator Design
with Time-Sensitive Affine Types
Rachit Nigam Sachille Atapattu Samuel Thomas Zhijing Li
Theodore Bauer Yuwei Ye Apurva Koti Adrian Sampson Zhiru Zhang
Cornell University
USA
Abstract
Field-programmable gate arrays (FPGAs) provide an oppor-
tunity to co-design applications with hardware accelerators,
yet they remain difficult to program. High-level synthesis
(HLS) tools promise to raise the level of abstraction by com-
piling C or C++ to accelerator designs. Repurposing legacy
software languages, however, requires complex heuristics
to map imperative code onto hardware structures. We find
that the black-box heuristics in HLS can be unpredictable:
changing parameters in the program that should improve
performance can counterintuitively yield slower and larger
designs. This paper proposes a type system that restricts
HLS to programs that can predictably compile to hardware
accelerators. The key idea is to model consumable hardware
resources with a time-sensitive affine type system that pre-
vents simultaneous uses of the same hardware structure. We
implement the type system in Dahlia, a language that com-
piles to HLS C++, and show that it can reduce the size of HLS
parameter spaces while accepting Pareto-optimal designs.
CCS Concepts: • Software and its engineering → Con-
straints.
Keywords: Affine Type Systems, High-Level Synthesis
ACM Reference Format:
Rachit Nigam, Sachille Atapattu, Samuel Thomas, Zhijing Li, Theodore
Bauer, Yuwei Ye, Apurva Koti, Adrian Sampson, and Zhiru Zhang.
2020. Predictable Accelerator Design with Time-Sensitive Affine
Types. In Proceedings of the 41st ACM SIGPLAN International Con-
ference on Programming Language Design and Implementation (PLDI
’20), June 15–20, 2020, London, UK. ACM, New York, NY, USA,
23 pages. https://doi.org/10.1145/3385412.3385974
© 2020 Copyright held by the owner/author(s). Publication rights licensed
to ACM.
ACM ISBN 978-1-4503-7613-6/20/06. . . $15.00
1 Introduction
While Moore’s law may not be dead yet, its stalled returns for
traditional CPUs have sparked renewed interest in special-
ized hardware accelerators [28], for domains from machine
learning [31] to genomics [56]. Reconfigurable hardware,
namely field-programmable gate arrays (FPGAs), offers some
of the benefits of specialization without the cost of cus-
tom silicon. FPGAs can accelerate code in domains from
databases [11] to networking [2] and have driven vast effi-
ciency improvements in Microsoft’s data centers [46, 19].
However, FPGAs are hard to program. The gold-standard
programming model for FPGAs is register transfer level (RTL)
design in hardware description languages such as Verilog,
VHDL, Bluespec, and Chisel [40, 5]. RTL requires digital
design expertise: akin to assembly languages for CPUs, RTL
is irreplaceable for manual performance tuning, but it is too
explicit and verbose for rapid iteration [53].
FPGA vendors offer high-level synthesis (HLS) or “C-to-
gates” tools [58, 16, 42, 10] that translate annotated subsets of
C and C++ to RTL. Repurposing legacy software languages,
however, has drawbacks: the resulting language subset is
small and difficult to specify, and minor code edits can cause
large swings in hardware efficiency. We find empirically
that smoothly changing source-level hints can cause wild
variations in accelerator performance. Semantically, there is
no HLS programming language: there is only the subset of C++
that a particular version of a particular compiler supports.
This paper describes a type system that restricts HLS to
programs whose hardware implementation is clear. The goal
is predictable architecture generation: the hardware impli-
cations are observable in the source code, and costly imple-
mentation decisions require explicit permission from the
programmer. Instead of silently generating bad hardware for
difficult input programs, the type system yields errors that
help guide the programmer toward a better design. The result
is a language that can express a subset of the architectures
that HLS can—but it does so predictably.
The central insight is that an affine type system [54] can
model the restrictions of hardware implementation. Com-
ponents in a hardware design are finite and expendable: a
subcircuit or a memory can only do one thing at a time, so a
program needs to avoid conflicting uses. Previous research
has shown how to apply substructural type systems to model
classic computational resources such as memory allocations
and file handles [24, 7, 36, 54] and to enforce exclusion for
safe shared-memory parallelism [23, 6, 13]. Unlike those
classic resources, however, the availability of hardware
components changes with time. We extend affine types with
time sensitivity to express that repeated uses of the same
hardware are safe as long as they are temporally separated.
[Figure 1: block diagram. Traditional HLS toolchain: C/C++ frontend
with #pragmas, transformation heuristics, RTL generation backend,
Verilog, with "Transformation Failure" as an error outcome. This paper:
Dahlia, type checking, #pragma insertion, and erasure to a plain C/C++
toolchain that yields an executable, with "Type Error" as an error outcome.]
Figure 1. Overview of a traditional high-level synthesis
toolchain and how Dahlia layers type safety on top.
We describe Dahlia, a programming language for pre-
dictable accelerator design. Dahlia differs from traditional
HLS in two ways: (1) Dahlia makes the hardware implementation
for each language construct manifest in the source code
instead of leaving this decision up to the HLS middle-end, and
(2) Dahlia uses its time-sensitive affine types to reason about
the hardware constraints and reject programs that would re-
quire complex transformation to implement in hardware. We
implement a compiler for Dahlia that emits annotated C++
for a commercial HLS toolchain. We show that predictability
pitfalls exist in both industrial and recent academic tools and
that Dahlia’s reasoning can help alleviate these issues.
The contributions of this paper are:
• We identify predictability pitfalls in HLS and measure
their effects in an industrial tool in Section 2.
• We design Dahlia (Section 3), a language that restricts
HLS to predictable design spaces by modeling hard-
ware constraints using time-sensitive affine types.
• We formalize a time-sensitive affine type system and
prove syntactic type soundness in Section 4.
• We empirically demonstrate Dahlia’s effectiveness in
rejecting unpredictable design points and its ability to
make area–performance trade-offs in common accel-
erator designs in Section 5.
2 Predictability Pitfalls in Traditional HLS
Figure 1 depicts the design of a traditional high-level syn-
thesis (HLS) compiler. A typical HLS tool adopts an existing
open-source C/C++ frontend and adds a set of transforma-
tion heuristics that attempt to map software constructs onto
hardware elements along with a backend that generates RTL
code [15, 10]. The transformation step typically relies on
a constraint solver, such as an LP or SAT solver, to satisfy
resource, layout, and timing requirements [25, 17]. Program-
mers can add #pragma hints to guide the transformation—for
example, to duplicate loop bodies or to share functional units.
1 int m1[512][512], m2[512][512], prod[512][512];
2 int sum;
3 for (int i = 0; i < 512; i++) {
4   for (int j = 0; j < 512; j++) {
5     sum = 0;
6     for (int k = 0; k < 512; k++) {
7       sum += m1[i][k] * m2[k][j];
8     }
9     prod[i][j] = sum; } }
Figure 2. Dense matrix multiplication in HLS-friendly C.
HLS tools are best-effort compilers: they make a heuris-
tic effort to translate any valid C/C++ program to RTL, re-
gardless of the consequences for the generated accelerator
architecture. Sometimes, the mapping constraints are unsat-
isfiable, so the compiler selectively ignores some #pragma
hints or issues an error. The generated accelerator’s effi-
ciency depends on the interaction between the code, the
hints, and the transformation heuristics that use them.
The standard approach prioritizes automation over pre-
dictability. Small code changes can yield large shifts in the
generated architecture. When performance is poor, the com-
piler provides little guidance about how to improve it. Prun-
ing such unpredictable points from the design space would let
programmers explore smaller, smoother parameter spaces.
2.1 An Example in HLS
Programming with HLS centers on arrays and loops, which
correspond to memory banks and logic blocks. Figure 2
shows the C code for a matrix multiplication kernel. This
section imagines the journey of a programmer attempting to
use HLS to generate a fast FPGA-based accelerator from this
code. We use Xilinx’s SDAccel [57] compiler (v2018.3.op) and
target an UltraScale+ VU9P FPGA on an AWS F1 instance [1]
to perform the experiments in this section.
Initial accelerator. Our imaginary programmer might first
try compiling the code verbatim. The HLS tool maps the
arrays m1, m2, and prod onto on-chip memories. FPGAs have
SRAM arrays, called block RAMs (BRAMs), that the compiler
allocates for this purpose. The loop body becomes combi-
national logic consisting of a multiplier, an adder, and an
accumulator register. Figure 3a depicts this configuration.
This design, while functional, does not harness any par-
allelism that an FPGA can offer. The two key metrics for
evaluating an accelerator design are performance and area,
i.e., the amount of physical chip resources that the accelera-
tor occupies. This initial configuration computes the matrix
product in 841.1 ms and occupies 2,355 of the device’s lookup
tables (LUTs). However, the target FPGA device has over 1
million LUTs, so the programmer’s next job is to expend
more of the FPGA area to improve performance.
[Figure 3: block diagrams built from block RAMs (m1, m2, prod),
combinational logic (multipliers and an adder), and a sum register.
(a) The original code. (b) With unrolling. (c) With unrolling and
banking, where m1 and m2 are split into banks m1[0]..m1[7] and m2[0]..m2[7].]
Figure 3. Three accelerator implementations of the matrix multiplication in Figure 2.
[Figure 4: plots of LUTs used and runtime (ms) against the unrolling
factor, with points marked as predictable, unpredictable, or incorrect
hardware. (a) Unrolling without partitioning. (b) Unrolling with 8-way
partitioning. (c) Unrolling and banking in lockstep, where the x-axis
is the shared partitioning and unrolling factor.]
Figure 4. Look-up table count (top) and execution latency (bottom) for the kernel in Figure 2 with varying parameters.
Loop unrolling. The standard tool that HLS offers for expressing
parallelism is an UNROLL annotation, which duplicates
the logic for a loop body. A programmer might attempt
to obtain a better accelerator design by adding this annotation
to the innermost loop on lines 6–8 in Figure 2:
#pragma HLS UNROLL FACTOR=8
This unrolling directive instructs the HLS tool to create 8
copies of the multiplier and adder, called processing elements
(PEs), and attempt to run them in parallel. Loop unrolling
represents an area–performance trade-off: programmers can
reasonably expect greater unrolling factors to consume more
of the FPGA chip but yield lower-latency execution.
The UNROLL directive alone, however, fails to achieve this
objective. Figure 4a shows the effect of various unrolling
factors on this code in area (LUT count) and performance
(latency). There is no clear trend: greater unrolling yields
unpredictably better and worse designs. The problem is that
the accelerator’s memories now bottleneck the parallelism
provided by the PEs. The BRAMs in an FPGA have a fixed,
small number of ports, so they can only service one or two
reads or writes at a time. So while the HLS tool obeys the
programmer’s UNROLL request to duplicate PEs, its schedul-
ing must serialize their execution. Figure 3b shows how the
HLS tool must insert additional multiplexing hardware to
connect the multipliers to the single-ported memories. The
additional hardware and the lack of parallelism yield the
unpredictable performance and area for different PE counts.
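The port bottleneck can be illustrated with a back-of-the-envelope bound. This is a simplified model and not the HLS scheduler's actual algorithm: it assumes each PE issues one access per iteration to a single shared memory, and `ports` is a hypothetical parameter for how many accesses that memory serves per cycle.

```python
import math

def min_cycles_per_group(unroll, ports):
    """Lower bound on the cycles needed for `unroll` PEs to each make
    one access to a memory that serves `ports` accesses per cycle."""
    return math.ceil(unroll / ports)

def speedup_bound(unroll, ports):
    """Best-case speedup over the non-unrolled loop under this model."""
    return unroll / min_cycles_per_group(unroll, ports)

for u in (1, 2, 4, 8):
    print(u, min_cycles_per_group(u, ports=1), speedup_bound(u, ports=1))
```

Under these assumptions a single-ported memory caps the speedup at 1 regardless of the unrolling factor, which is consistent with the flat, noisy trend described above.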
Memory banking to match parallelism. To achieve ex-
pected speedups from parallelism, accelerators need to use
multiple memories. HLS tools provide annotations to parti-
tion arrays, allocating multiple BRAMs and increasing the
access throughput. The programmer can insert these parti-
tioning annotations to allocate 8 BRAMs per input memory:
#pragma HLS ARRAY_PARTITION VARIABLE=m1 FACTOR=8
#pragma HLS ARRAY_PARTITION VARIABLE=m2 FACTOR=8
Banking uses several physical memories, each of which
stores a subset of the array’s data. The compiler partitions
the array using a “round-robin” policy to enable parallel ac-
cess. In this example, elements 0 and 8 go in bank 0, elements
1 and 9 go in bank 1, etc.:
Element: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Bank:    0 1 2 3 4 5 6 7 0 1  2  3  4  5  6  7
Figure 3c shows the resulting architecture, which requires no
multiplexing and allows parallel memory access.
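The round-robin policy amounts to a modulo and a division. The sketch below uses a hypothetical helper name, `bank_and_offset`, and assumes a one-dimensional array; it reproduces the placement described for 16 elements and 8 banks.

```python
def bank_and_offset(index, banks):
    """Round-robin placement: consecutive elements go to consecutive banks."""
    return index % banks, index // banks

# Group the 16 example elements by the bank that stores them.
layout = {}
for i in range(16):
    bank, _ = bank_and_offset(i, banks=8)
    layout.setdefault(bank, []).append(i)

for bank in sorted(layout):
    print("bank", bank, "holds elements", layout[bank])
```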
Combining banking and unrolling, however, unearths an-
other source of unpredictable performance. While the HLS
tool produces a good result when both the banking factors
and the loop unrolling factor are 8, other design choices
perform worse. Figure 4b shows the effect of varying the unrolling
factor while keeping the arrays partitioned with factor 8.
Again, the area and performance vary unpredictably
with the unrolling factor. Reducing the unrolling factor from
9 to 8 can counter-intuitively improve both performance and
area. In our experiments, some unrolling factors yield hard-
ware that produces incorrect results. (We show the area but
omit the running time for these configurations.)
The problem is that some partitioning/unrolling combina-
tions yield much simpler hardware than others. When both
the unrolling and the banking factors are 8, each parallel PE
need only access a single bank, as in Figure 3c. The first PE
needs to access elements 0, 8, 16, and so on—and because the
array elements are “striped” across the banks, all of these
values live in the first bank. With unrolling factor 9, however,
the first PE needs to access values from every bank, which
requires complicated memory indirection hardware. With
unrolling factor 4, the indirection cost is smaller—the first
PE needs to access only bank 0 and bank 4.
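The bank sets quoted above can be recomputed directly. The sketch assumes round-robin banking and that unrolled copy `pe` handles iterations `pe`, `pe + unroll`, `pe + 2*unroll`, and so on; `banks_touched` is a hypothetical helper, not part of any HLS tool.

```python
def banks_touched(pe, unroll, banks, size):
    """Banks that unrolled copy `pe` reads from a round-robin banked array."""
    return {i % banks for i in range(pe, size, unroll)}

for unroll in (8, 9, 4):
    print(unroll, sorted(banks_touched(0, unroll, banks=8, size=512)))
```

With 8 banks, the first PE touches one bank at unrolling factor 8, two banks at factor 4, and all eight banks at factor 9.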
From the programmer’s perspective, the HLS compiler
silently enforces an unwritten rule: When the unrolling factor
divides the banking factor, the area is good and parallelism
predictably improves performance. Otherwise, all bets are off.
Figure 4b labels the points where the unrolling factor divides
the banking factor as predictable points. The HLS compiler
emits no errors or warnings for any parameter setting.
Banking vs. array size. Even if we imagine that a program-
mer carefully ensures that banking factors exactly match
unrolling factors, another pitfall awaits them when choos-
ing the amount of parallelism. Figure 4c shows the effects
of varying the banking and unrolling factor in our kernel
together. The LUT count again varies wildly.
The problem is that, when the banking and unrolling fac-
tors do not evenly divide the sizes of the arrays involved, the
accelerator needs extra hardware to cope with the “leftover”
elements. The memory banks are unevenly sized, and the
PEs need extra hardware to selectively disable themselves on
the final iteration to avoid out-of-bounds memory accesses.
Again, there is a predictable subset of design points when
the programmer obeys the unwritten rule: An array’s bank-
ing factor should divide the array size. Figure 4c highlights the
predictable points that follow this rule. Among this subset,
the performance reliably improves with increasing paral-
lelism and the area cost scales proportionally.
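The "leftover" elements are easy to see numerically. This sketch assumes round-robin partitioning of a 512-element dimension; both helper names are illustrative.

```python
def bank_sizes(size, banks):
    """Number of elements each bank holds under round-robin partitioning."""
    return [len(range(b, size, banks)) for b in range(banks)]

def active_pes_on_last_iteration(size, unroll):
    """PEs that still have an in-bounds element on the final iteration."""
    leftover = size % unroll
    return unroll if leftover == 0 else leftover

print(bank_sizes(512, 8), active_pes_on_last_iteration(512, 8))
print(bank_sizes(512, 6), active_pes_on_last_iteration(512, 6))
```

A factor of 8 gives eight equal banks and no idle PEs, while a factor of 6 gives banks of two different sizes and leaves four of the six PEs without work on the last iteration.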
2.2 Enforcing the Unwritten Rules
The underlying problem in each of these sources of unpre-
dictability is that the traditional HLS tool prioritizes automa-
tion over programmer control. While automation can seem
convenient, mapping heuristics give rise to implicit rules
that, when violated, silently produce bad hardware instead
of reporting a useful error.
This paper instead prioritizes the predictability of hard-
ware generation and making architectural decisions obvious
in the source code. HLS tools already contain such a pre-
dictable subset hidden within their unrestricted input lan-
guage. By modeling resource constraints, we can separate
out this well-behaved fragment. Figure 1 shows how our
checker augments a traditional HLS toolchain by lifting hid-
den compiler reasoning into the source code and rejecting
potentially unpredictable programs.
The challenge, however, is that the “unwritten rules” of
HLS are never explicitly encoded anywhere—they arise im-
plicitly from non-local interactions between program struc-
ture, hints, and heuristics. A naïve syntactic enforcement
strategy would be too conservative—it would struggle to
allow flexible, fine-grained sharing of hardware resources.
We design a type system that models the constraints of
hardware implementation to enforce these constraints in a
composable, formal way. Our type system addresses target-
independent issues—it prevents problems that would occur
even on an arbitrarily large FPGA. We do not attempt to
rule out resource exhaustion problems because they would
tie programs to specific target devices. We see that kind of
quantitative resource reasoning as important future work.
3 The Dahlia Language
Dahlia’s type system enforces a safety property: that the
number of simultaneous reads and writes to a given memory
bank may not exceed the number of ports. While traditional
HLS tools enforce this requirement with scheduling heuris-
tics, Dahlia enforces it at the source level using types.
The key ideas in Dahlia are (1) using substructural typing
to reason about consumable hardware resources and (2) ex-
pressing time ordering in the language to reason about when
resources are available. This section describes these two core
features (Sections 3.1 and 3.2) and then shows how Dahlia
builds on them to yield a language that is flexible enough to
express real programs (Sections 3.3–3.6).
3.1 Affine Memory Types
The foundation of Dahlia’s type system is its reasoning about
memories. The problem in Section 2.1’s example is conflict-
ing simultaneous accesses to the design’s memories. The
number of reads and writes supported by a memory per cy-
cle is limited by the number of ports in the memory. HLS
tools automatically detect potential read/write conflicts and
schedule accesses across clock cycles to avoid errors. Dahlia
instead makes this reasoning about conflicts explicit by en-
forcing an affine restriction on memories.
Memories are defined by giving their type and size:
let A: float[10];
The type of A is mem float[10], denoting a single-ported
memory that holds 10 floating-point values. Each Dahlia
memory corresponds to an on-chip BRAM in the FPGA.
Memories resemble C or Java arrays: programs read and
mutate the contents via subscripting, as in A[5] := 4.2.
Because they represent static physical resources in the gen-
erated hardware, memory types differ from plain value types
like float by preventing duplication and aliasing:
let x = A[0]; // OK: x is a float.
let B = A; // Error: cannot copy memories.
The affine restriction on memories disallows reads and writes
to a memory that might occur at the same time:
let x = A[0]; // OK
A[1] := 1; // Error: Previous read consumed A.
While type-checking A, the Dahlia compiler removes A from
the typing context. Subsequent uses of A are errors, with one
exception: identical reads to the same memory location are
allowed. This program is valid, for example:
let x = A[0];
let y = A[0]; // OK: Reading the same address.
The type system uses access capabilities to check reads and
writes [18, 22]. A read expression such as A[0] acquires a
non-affine read capability for index 0 in the current scope,
which permits unlimited reads to the same location but pre-
vents the acquisition of other capabilities for A. The gener-
ated hardware reads once from A and distributes the result
to both variables x and y, as in this equivalent code:
let tmp = A[0]; let x = tmp; let y = tmp;
However, memory writes use affine write capabilities, which
are use-once resources: multiple simultaneous writes to the
same memory location remain illegal.
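The capability rules for a single logical time step can be modeled in a few lines. This is a minimal sketch, not Dahlia's implementation: it assumes single-ported, unbanked memories, and it conservatively rejects a read that follows a write in the same step, a case the text does not spell out. The class and function names are hypothetical.

```python
class CapabilityError(Exception):
    pass

class StepChecker:
    """Tracks capabilities within one logical time step."""

    def __init__(self):
        self.reads = {}       # memory name -> index covered by the read capability
        self.written = set()  # memories whose affine write capability is used

    def read(self, mem, index):
        if mem in self.written:
            raise CapabilityError(f"{mem} already consumed by a write")
        if mem in self.reads and self.reads[mem] != index:
            raise CapabilityError(f"{mem} already read at another index")
        self.reads[mem] = index  # repeated reads of the same index are fine

    def write(self, mem, index):
        if mem in self.reads or mem in self.written:
            raise CapabilityError(f"{mem} already consumed")
        self.written.add(mem)

def accepts(ops):
    """ops is a list of ("read" | "write", memory, index) triples."""
    checker = StepChecker()
    try:
        for kind, mem, index in ops:
            getattr(checker, kind)(mem, index)
    except CapabilityError:
        return False
    return True

print(accepts([("read", "A", 0), ("read", "A", 0)]))   # identical reads
print(accepts([("read", "A", 0), ("write", "A", 1)]))  # read, then write
```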
3.2 Ordered and Unordered Composition
A key HLS optimization is parallelizing execution of inde-
pendent code. This optimization lets HLS compilers paral-
lelize and reorder dependency-free statements connected by
; when the hardware constraints allow it—critically, when
they do not need to access the same memory banks.
Dahlia makes these parallelism opportunities explicit by
distinguishing between ordered and unordered composition.
The C-style ; connector is unordered: the compiler is free to
reorder and parallelize the statements on either side while
respecting their data dependencies. A second connector, ---,
is ordered: in A --- B, statement A must execute before B.
Dahlia prevents resource conflicts in unordered composi-
tion but allows two statements in ordered composition to use
the same resources. For example, Dahlia accepts this program
that would be illegal when joined by the ; connector:
let x = A[0]
---
A[1] := 1
In the type checker, ordered composition restores the affine
resources that were consumed in the first command before
checking the second command. The capabilities for all mem-
ories are discarded, and the program can acquire fresh capa-
bilities to read and write any memory.
Together, ordered and unordered composition can express
complex concurrent designs:
let A: float[10]; let B: float[10];
{
let x = A[0] + 1
---
B[1] := A[1] + x // OK
};
let y = B[0]; // Error: B already consumed.
The statements composed with --- are ordered with each
other but unordered with the last line. The read therefore
must not conflict with either of the first two statements.
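One way to picture the interaction of the two connectors is a recursive check over a small command tree. The sketch below is a deliberate simplification: it treats each memory as a single resource, ignores read sharing, and uses a made-up tuple encoding (`"use"`, `"par"` for `;`, `"seq"` for `---`).

```python
class Conflict(Exception):
    pass

def consumed(node):
    """Return the memories a command consumes; raise Conflict on a clash.

    node is ("use", mem), ("par", [nodes]) or ("seq", [nodes]).
    """
    kind = node[0]
    if kind == "use":
        return {node[1]}
    parts = [consumed(child) for child in node[1]]
    if kind == "par":
        # Unordered composition: the parts must use disjoint resources.
        total = set()
        for part in parts:
            clash = total & part
            if clash:
                raise Conflict(sorted(clash))
            total |= part
        return total
    # Ordered composition: each part starts with restored resources, but
    # the whole chain still consumes everything any of its steps used.
    return set().union(*parts)

# { let x = A[0] + 1 --- B[1] := A[1] + x }; let y = B[0]
example = ("par", [
    ("seq", [("use", "A"), ("par", [("use", "A"), ("use", "B")])]),
    ("use", "B"),
])
try:
    consumed(example)
except Conflict as err:
    print("conflict on", err.args[0])
```

The ordered block on its own is accepted; the conflict arises only when it is composed with the final read of `B` using `;`.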
Logical time. From the programmer’s perspective, a chain
of ordered computations executes over a series of logical
time steps. Logical time in Dahlia does not directly reflect
physical time (i.e., clock cycles). Instead, the HLS backend is
responsible for allocating cycles to logical time steps in a way
that preserves the ordering of memory accesses. For example,
a long logical time step containing an integer division might
require multiple clock cycles to complete, and the compiler
may optimize away unneeded time steps that do not separate
memory accesses. Regardless of optimizations, however, a
well-typed Dahlia program requires at least enough ordered
composition to ensure that memory accesses do not conflict.
Local variables as wires & registers. Local variables, de-
fined using the let construct, do not share the affine restric-
tions of memories. Programs can freely read and write to
local variables without restriction, and unordered composi-
tion respects the dependencies induced by local variables:
let x = 0; x := x + 1; let y = x; // All OK
In hardware, local variables manifest as wires or registers.
The choice depends on the allocation of physical clock cy-
cles: values that persist across clock cycles require registers.
Consider this example consisting of two logical time steps:
let x = A[0] + 1 --- B[0] := A[1] + x
The compiler must implement the two logical time steps in
different clock cycles, so it must use a register to hold x. In
the absence of optimizations, registers appear whenever a
variable’s live range crosses a logical time step boundary.
Therefore, programmers can minimize the use of registers
by reducing the live ranges of variables or by reducing the
amount of sequential composition.
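The unoptimized rule for registers can be sketched as a live-range check. The encoding is hypothetical: each logical time step is given as a pair of sets, the variables it defines and the variables it uses.

```python
def needs_register(steps):
    """Variables defined in one logical time step and used in a later one.

    steps is an ordered list of (defs, uses) pairs of sets.
    """
    defined_at = {}
    registers = set()
    for t, (defs, uses) in enumerate(steps):
        for var in uses:
            if var in defined_at and defined_at[var] < t:
                registers.add(var)
        for var in defs:
            defined_at.setdefault(var, t)
    return registers

# let x = A[0] + 1 --- B[0] := A[1] + x
print(needs_register([({"x"}, set()), (set(), {"x"})]))
# let x = 0; x := x + 1; let y = x   (a single time step)
print(needs_register([({"x", "y"}, {"x"})]))
```

In the first program `x` crosses a step boundary and needs a register; in the second everything stays within one step, so wires suffice under this model.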
3.3 Memory Banking
As Section 2.1 details, HLS tools can bank memories into
disjoint components to allow parallel access. Dahlia memory
declarations support bank annotations:
let A: float[8 bank 4];
In a memory type mem t[n bank m], the banking factor m
must evenly divide the size n to yield equally-sized banks.
HLS tools, in contrast, allow uneven banking and silently
insert additional hardware to account for it (see Section 2.1).
Affine restrictions for banks. Dahlia tracks an affine re-
source for each memory bank. To physically address a bank,
the syntax M{b}[i] denotes the ith element of M’s bth bank.
This program is legal, for example:
let A: float[10 bank 2];
A{0}[0] := 1;
A{1}[0] := 2; // OK: Accessing a different bank.
Dahlia also supports logical indexing into banked arrays
using the syntax M[n] for literals n. For example, A[1] is
equivalent to A{1}[0] above. Because the index is static, the
type checker can automatically deduce the bank and offset.
Multi-ported memories. Dahlia also supports reasoning
about multi-ported memories. This syntax declares a mem-
ory where each bank has two read/write ports:
let A: float{2}[10];
A memory provides k affine resources per bank where k is
the number of ports in a memory. This rule lets multi-ported
memories provide multiple read/write capabilities in each
logical time step. For example, Dahlia accepts this program:
let A: float{2}[10];
let x = A[0];
A[1] := x + 1;
Dahlia does not guarantee data-race freedom in the presence
of multi-ported memories. Programs are free to write to and
read from the same memory location in the same logical
time step and should expect the semantics of the underlying
memory technology. Extensions to rule out data races would
resemble race detection for parallel software [44, 38].
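The per-bank port budget can be sketched as a simple count. This model assumes static logical indices and round-robin banking, and it ignores the sharing of identical reads described in Section 3.1; `check_ports` is a hypothetical helper.

```python
from collections import Counter

def check_ports(accesses, banks, ports):
    """True if no bank is accessed more than `ports` times in one time step.

    accesses is a list of static logical indices.
    """
    used = Counter(index % banks for index in accesses)
    return all(count <= ports for count in used.values())

# let A: float{2}[10]; let x = A[0]; A[1] := x + 1;
print(check_ports([0, 1], banks=1, ports=2))
# The same accesses against a single-ported memory.
print(check_ports([0, 1], banks=1, ports=1))
```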
Multi-dimensional banking. Banking generalizes to multi-dimensional
arrays. Every dimension can have an independent
banking factor. This two-dimensional memory has two
banks in each dimension, a total of 2 × 2 = 4 banks:
let M: float[4 bank 2][4 bank 2];
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
The physical and logical memory access syntax similarly
generalizes to multiple dimensions. For example, M{3}[0]
represents the element logically located at M[1][1].
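A possible mapping from logical to physical indices for the two-dimensional case is sketched below. The row-major numbering of banks and offsets is an assumption: it agrees with the one example given (M{3}[0] for M[1][1]), but the text does not state the flattening order.

```python
def physical_2d(i, j, dims, banks):
    """Map logical M[i][j] to (bank, offset), assuming row-major numbering."""
    (n0, n1), (b0, b1) = dims, banks
    if n0 % b0 or n1 % b1:
        raise ValueError("banking factor must divide the dimension size")
    bank = (i % b0) * b1 + (j % b1)
    offset = (i // b0) * (n1 // b1) + (j // b1)
    return bank, offset

# let M: float[4 bank 2][4 bank 2];
print(physical_2d(1, 1, dims=(4, 4), banks=(2, 2)))
```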
3.4 Loops and Unrolling
Fine-grained parallelism is an essential optimization in hard-
ware accelerator design. Accelerator designers duplicate a
block of logic to trade off area for performance: n copies of
the same logic consume n times as much area while offering
a theoretical n-way speedup. Dahlia syntactically separates
parallelizable doall for loops, which must not have any cross-iteration
dependencies, from sequential while loops, which
may have dependencies but are not parallelizable. Program-
mers can mark for loops with an unroll factor to duplicate
the loop body logic and run it in parallel:
for (let i = 0..10) unroll 2 { f(i) }
This loop is equivalent to a sequential one that iterates half as
many times and composes two copies of the body in parallel:
for (let i = 0..5) { f(2*i + 0); f(2*i + 1) }
The doall restriction is important because it allows the com-
piler to run the two copies of the loop body in parallel using
unordered composition. In traditional HLS tools, a loop un-
rolling annotation such as #pragma HLS unroll is always
allowed—even when the loop body makes parallelization
difficult or impossible. The toolchain will replicate the loop
body and rely on complex analysis and resource scheduling
to optimize the unrolled loop body as well as it can.
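The equivalence between the unrolled loop and its sequential form can be checked by enumerating iterator values. The sketch assumes the unroll factor divides the trip count, which is a restriction of this example and is not taken from the text.

```python
def unrolled_groups(start, stop, factor):
    """Yield the groups of iterator values that run in parallel."""
    count = stop - start
    if count % factor != 0:
        raise ValueError("this sketch requires factor to divide the trip count")
    for base in range(count // factor):
        yield [start + factor * base + k for k in range(factor)]

# for (let i = 0..10) unroll 2 { f(i) }
for group in unrolled_groups(0, 10, 2):
    print(group)
```

Each printed group corresponds to one iteration of the half-length loop, with the values passed to the two parallel copies of `f`.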
Resource conflicts in unrolled loops are errors. For exam-
ple, this loop accesses an unbanked array in parallel:
let A: float[10];
for (let i = 0..10) unroll 2 {
A[i] := compute(i) // Error: Insufficient banks.
}
Unrolled memory accesses. Dahlia uses special index types
for loop iterators to type-check memory accesses within
unrolled loops. Index types generalize integers to encode
information about loop unrolling. In this example:
for (let i = 0..8) unroll 4 { A[i] }
The iterator i gets the type idx{0..4}, indicating that ac-
cessing an array at i will consume banks 0, 1, 2, and 3. Type-
checking a memory access with i consumes all banks indi-
cated by its index type.
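The bank consumption implied by an index type can be sketched as a set comparison. This models only the rule stated here, that an iterator of type idx{0..u} consumes banks 0 through u-1; the real checker enforces further conditions that are not captured, and both helper names are hypothetical.

```python
def banks_consumed(unroll):
    """Banks consumed by an access indexed with an idx{0..unroll} iterator."""
    return set(range(unroll))

def access_ok(mem_banks, unroll):
    """True if the memory has every bank that the access consumes."""
    return banks_consumed(unroll) <= set(range(mem_banks))

print(sorted(banks_consumed(4)))
print(access_ok(mem_banks=4, unroll=4))
print(access_ok(mem_banks=1, unroll=2))  # unbanked array, unroll 2
```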
Unrolling and ordered composition. Loop unrolling has a
subtle interaction with ordered composition. In a loop body
containing ---, like this:
let A: float[10 bank 2];
for (let i = 0..10) unroll 2 {
let x = A[i]
---
f(x, A[0]) }
A naive interpretation would use parallel composition to join
the loop bodies at the top level:
for (let i = 0..5) {
{ let x0 = A[2*i] --- f(x0, A[0]) };
{ let x1 = A[2*i + 1] --- f(x1, A[0]) } }
However, this interpretation is too restrictive. It requires all
time steps in each loop body to avoid conflicts with all other
time steps. This example would be illegal because the access
to A[i] in the first time step may conflict with the access to
A[0] in the second time step. Instead, Dahlia reasons about
unrolled loops in lockstep by parallelizing within each logical
time step. The loop above is equivalent to:
for (let i = 0..5) {
{ let x0 = A[2*i]; let x1 = A[2*i + 1] }
---
{ f(x0, A[0]); f(x1, A[0]) } }
The lockstep semantics permits this unrolling because con-
flicts need only be avoided between unrolled copies of the
same logical time step. HLS tools must enforce a similar
restriction but leave the choice to black-box heuristics.
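The lockstep rewrite is essentially a transpose: instead of one chain of time steps per unrolled copy, there is one group of copies per time step. The sketch below represents statements as plain strings and is only an illustration of the regrouping.

```python
def lockstep(copies):
    """Regroup per-copy time steps into per-step groups of copies.

    copies holds one list of time steps for each unrolled copy.
    """
    if len({len(copy) for copy in copies}) > 1:
        raise ValueError("all copies must have the same number of time steps")
    return [list(step) for step in zip(*copies)]

copies = [
    ["let x0 = A[2*i]", "f(x0, A[0])"],
    ["let x1 = A[2*i + 1]", "f(x1, A[0])"],
]
for step in lockstep(copies):
    print("; ".join(step))
    print("---")
```

Conflict checking then applies within each printed group, not across the boundaries marked by `---`.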
Nested unrolling. In nested loops, unrolled iterators can
separately access dimensions of a multi-dimensional array.
Nested loops also interact with Dahlia’s read and write ca-
pabilities. In this program:
let A: float[8 bank 4][10 bank 5];
for (let i = 0..8) {
for (let j = 0..10) unroll 5 {
let x = A[i][0]
---
A[i][0] := j; // Error: Insufficient write
} } // capabilities.
The read to array A[i][0] can be proved to be safe because
after desugaring, the reads turn into:
let x0 = A[i][0]; let x1 = A[i][0] ...
The access is safe because the first access acquires a read
capability for indices i and 0, so the subsequent copies are
safe. Architecturally, the code entails a single read fanned
out to each parallel PE. However, the write desugars to:
A[i][0] := j; A[i][0] := j + 1 ...
which causes a write conflict in the hardware.
3.5 Combine Blocks for Reduction
In traditional HLS, loops can freely include dependent oper-
ations, as in this dot product:
for (let i = 0..10) unroll 2 { dot += A[i] * B[i] }
However, the += update silently introduces a dependency
between iterations, which is disallowed by Dahlia’s
doall for-loops. HLS tools heuristically analyze loops to
extract and serialize dependent portions. In Dahlia, program-
mers explicitly distinguish the non-parallelizable reduction
components of for loops. Each for can have an optional
combine block that contains sequential code to run after
each unrolled iteration group of the main loop body. For
example, this loop is legal:
for (let i = 0..10)
unroll 2 {
let v = A[i] * B[i];
} combine {
dot += v;
}
[Diagram: PE 0 and PE 1 multiply A{0} with B{0} and A{1} with B{1} in parallel; the combine block adds both products into dot.]
There are two copies of the loop body that run in parallel
and feed into a single reduction tree for the combine block.
The type checker gives special treatment to variables like
v that are defined in for bodies and used in combine blocks.
In the context of the combine block, v is a combine register,
which is a tuple containing all values produced for v in the
unrolled loop bodies. Dahlia defines a class of functions
called reducers that take a combine register and return a
single value (similar to a functional fold). Dahlia defines +=,
-=, *=, /= as built-in reducers with infix syntax.
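The split between a parallel body and a sequential combine block can be mimicked in software. The following Python sketch is an assumption-laden model rather than generated code: `unrolled_dot` is a hypothetical name, the tuple `v` stands in for the combine register, and `reduce` with addition plays the role of the built-in += reducer.

```python
from functools import reduce
import operator


def unrolled_dot(A, B, unroll=2):
    """Dot product structured like a Dahlia for/combine loop."""
    assert len(A) == len(B) and len(A) % unroll == 0
    dot = 0
    for base in range(0, len(A), unroll):
        # Loop body: the `unroll` copies are independent of each other.
        v = tuple(A[base + pe] * B[base + pe] for pe in range(unroll))
        # Combine block: reduce the combine register, then update dot.
        dot += reduce(operator.add, v)
    return dot
```

Only the final `dot +=` line carries a dependency between iteration groups, which is the part the combine block isolates.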
3.6 Memory Views for Flexible Iteration
In order to predictably generate hardware for parallel ac-
cesses, Dahlia statically calculates banks accessed by each
PE and guarantees that they are distinct. Figure 5a shows
the kind of hardware generated by this restriction—each PE
is directly connected to a bank.
To enforce this hardware generation, Dahlia only allows
simple indexing expressions like A[i] and A[4] and rejects
arbitrary index calculations like A[2*i]. General indexing
expressions can require complex indirection hardware to
allow any PE to access any memory bank. An access like
A[i*i], for example, makes it difficult to deduce which bank
it would read on which iteration. For simple expressions
like A[j+8], however, the bank stride pattern is clear. Tradi-
tional HLS tools make a best-effort attempt to deduce access
patterns, but subtle changes in the code can unpredictably
prevent the analysis and generate bad hardware.
Dahlia uses memory views to define access patterns that
HLS compilers can compile efficiently and to convince the
Dahlia type checker that a parallel access will be predictable.
The key idea is to offer different logical arrangements of the
same underlying physical memory. By logically re-organizing
the memory, views can simply reuse Dahlia’s type-checking
to ensure that complex access patterns are predictable. Fur-
thermore, this allows views to capture the hardware cost
of an access pattern in the source code instead of relying
on black-box analysis in HLS tools. For Dahlia’s HLS C++
backend, views are compiled to direct memory accesses.
The rest of this section describes Dahlia’s memory views
and their cost in terms of hardware required to transform
bank and index values to support the iteration pattern.
Shrink. To directly connect PEs to memory banks, Dahlia
requires the unrolling factor to match the banking factor. To
allow lower unrolling factors, Dahlia provides shrink views,
which reduce the banking factors of an underlying memory
by an integer factor. For example:
let A: float[8 bank 4];
view sh = shrink A[by 2]; // sh: float[8 bank 2]
for (let i = 0..8) unroll 2
sh[i]; // OK: sh has 2 banks. Compiled to: A[i].
The example first defines a view sh with the underlying
memory A and divides its banking factor by 2. Dahlia allows
sh[i] here because each PE will access a distinct set of banks.
The first PE accesses banks 0 and 2; the second accesses banks
1 and 3. The hardware cost of a shrink view, as Figure 5b
illustrates, consists of multiplexing to select the right bank
on every iteration. The access sh[i] compiles to A[i].
Figure 5. Hardware schematics for each kind of memory view: (a) no view; (b) shrink view; (c) suffix view; (d) shift view; (e) split view. Highlighted outlines indicate added hardware cost.
Suffix. A second kind of view lets programs create small
slices of a larger memory. Dahlia distinguishes between suf-
fixes that it can implement efficiently and costlier ones. An
efficient aligned suffix view uses this syntax:
view v = suffix M[by k * e];
where view v starts at element k × e of the memory M. Crit-
ically, k must be the banking factor of M. This restriction
allows Dahlia to prove that each logical bank in the view
maps to the same physical bank while the indices are offset
by the indexing expression. The hardware cost of a suffix
view is the address adapter for each bank. A view access
v{b}[i] is compiled to M{b}[e + i].
For example, generating suffixes in a loop results in this
pattern, where the digits in each cell are the indices, the
shades represent the banks, and the highlighted outline indi-
cates the view:
let A: float[8 bank 2];
for (let i = 0..4) {
view s = suffix A[by 2*i];
s[1]; // reads A[2*i + 1]
}
[Diagram: the cells 0–7 of A, drawn once for each iteration i = 0..3, with the view s outlined.]
A suffix view defined using view v = suffix M[by k*e]
and accessed using v[i] is compiled to M[k*e + i].
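The claim that an aligned suffix keeps every logical bank on the same physical bank can be checked numerically. The sketch below assumes cyclic banking with k banks (element x lives in bank x mod k at index x div k). The helpers `physical` and `suffix_access` are hypothetical names introduced for this example.

```python
def physical(elem, banks):
    """(bank, index) of an element under cyclic banking (an assumption)."""
    return elem % banks, elem // banks


def suffix_access(e, b, i):
    """Translate the view access v{b}[i] of `suffix M[by k*e]` to M{b}[e + i]."""
    return b, e + i


def aligned(k, e, max_index=4):
    """Check that both routes agree for every bank b and in-bank index i."""
    return all(
        physical(k * e + (i * k + b), k) == suffix_access(e, b, i)
        for b in range(k)
        for i in range(max_index)
    )
```

For the loop above (k = 2), `aligned(2, e)` holds for every offset e: only the index within each bank shifts by e, which is why an address adapter per bank suffices.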
Shift. Shifted suffixes are like standard suffixes but allow
unrestricted offset expressions:
view v = shift M[by e];
Since e is unrestricted, Dahlia assumes that both the bank
and the indices need to be adapted and that each PE accesses
every bank. Figure 5d shows the hardware cost of a shift view:
each PE is connected to every bank and the index expression
is transformed using an address adapter. The distinction
between suffix and shift views allows Dahlia to capture the
cost of different accessing schemes.
Even in this worst-case scenario, Dahlia can reason about
the disjointness of bank accesses. This loop is legal:
let A: float[12 bank 4];
for (let i = 0..3) {
view r = shift A[by i*i]; // r: float[12 bank 4]
for (let j = 0..4) unroll 4
let x = r[j]; // accesses A[i*i + j]
}
The view r has a memory type, so Dahlia can guarantee that
the inner access r[j] uses disjoint banks and is therefore
safe to parallelize. An access r[i] to a view declared with
shift M[by e] compiles to M[e + i].
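A small Python sketch illustrates why the unrolled accesses stay disjoint even though the offset is arbitrary. It assumes cyclic banking (element x in bank x mod 4). `shift_banks` is a hypothetical helper for this example.

```python
def shift_banks(offset, banks, unroll):
    """Banks read by the unrolled copies of r[j], j = 0..unroll-1."""
    return [(offset + j) % banks for j in range(unroll)]


# Offsets i*i for i = 0..2, as in the loop above.
per_iteration = [shift_banks(i * i, banks=4, unroll=4) for i in range(3)]
```

Each entry of `per_iteration` is a permutation of the four banks, so the copies never collide. The permutation changes with the offset, though (for i = 1, PE 0 reads bank 1), which is why each PE must be wired to every bank.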
Split. Some nested iteration patterns can be parallelized at
two levels: globally, over an entire array, and locally, over a
smaller window. This pattern arises in blocked computations,
such as this dot product loop in C++:
float A[12], B[12], sum = 0.0;
for (int i = 0; i < 6; i++)
for (int j = 0; j < 2; j++)
sum += A[2*i + j] * B[2*i + j];
Both the inner loop and the outer loop represent opportuni-
ties for parallelization. However, Dahlia cannot prove this
parallelization to be safe:
let A, B: float[12 bank 4];
view shA, shB = shrink A[by 2], B[by 2];
for (let i = 0..6) unroll 2 {
view vA, vB = suffix shA[by 2*i], shB[by 2*i];
for (let j = 0..2) unroll 2 {
let v = vA[j] * vB[j];
} combine {
sum += v; }}
While Dahlia can prove that the inner accesses into the views
can be predictably parallelized, it cannot establish the dis-
jointness of the parallel copies of the views vA and vB created
by the outer unrolled loop.
Split views allow for this reasoning. The key idea is to
create a logical memory with more dimensions than the physical
memory and to reuse Dahlia’s reasoning for multidimensional
memories to prove safety for such parallel accesses. A split
view transforms a one-dimensional memory (left) into a
two-dimensional memory (right):
[Diagram: a one-dimensional memory with elements 0 through 11 (left) and the two-dimensional split view of the same elements (right).]
x ∈ variables   a ∈ memories   n ∈ numbers
b ::= true | false   v ::= n | b
e ::= v | bop e1 e2 | x | a[e]
c ::= e | let x = e | c1 c2 | c1 ; c2 | if x c1 c2 |
      while x c | x := e | a[e1] := e2 | skip
τ ::= bit⟨n⟩ | float | bool | mem τ[n1]
Figure 6. Abstract syntax for the Filament core language.
Using these split-view declarations:
view split_A = split A[by 2];
view split_B = split B[by 2];
Each view has type mem float[2 bank 2][6 bank 2]. A
row in the logical view represents a “window” for compu-
tation. The above example can now unroll both loops, by
changing the inner access to:
let v = split_A[j][i] * split_B[j][i];
As Figure 5e illustrates, split views have similar cost to
aligned suffix views: they require no bank indirection hard-
ware because the bank index is always known statically. They
require an address adapter to compute the address within the
bank from the separate coordinates. A split view declared
view sp = split M[by k] on a memory M with k banks
translates the access sp[i][j] to M{bank}[idx] where:
$$\mathit{bank} = i \cdot k + (j \bmod b) \qquad \mathit{idx} = \left\lfloor \frac{j}{b} \right\rfloor$$
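The translation can be exercised on the running example (12 elements, 4 physical banks, split by 2). The text does not define b, so the Python sketch below assumes b is the number of banks in each row of the view, which is 2 here and happens to equal k. The function name and this reading of b are assumptions of the sketch.

```python
def split_access(i, j, k=2, b=2):
    """Translate sp[i][j] to (bank, idx) using the formula in the text."""
    return i * k + (j % b), j // b


# All 2 x 6 view coordinates of the example.
targets = {(i, j): split_access(i, j) for i in range(2) for j in range(6)}

# Banks used by the four parallel copies when both loops are unrolled by 2.
parallel_banks = sorted(split_access(i, j)[0] for i in range(2) for j in range(2))
```

Under this reading the 12 view coordinates map to 12 distinct (bank, idx) pairs, and the four parallel copies use banks 0 through 3 exactly once, so no bank indirection is needed.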
4 Formalism
This section formalizes the time-sensitive affine type system
that underlies Dahlia in a core language, Filament. We give
both a large-step semantics, which is more intelligible, and
a small-step semantics, which enables a soundness proof.
4.1 Syntax
Figure 6 lists the grammar for Filament. Filament statements
c resemble a typical imperative language: there are expres-
sions, variable declarations, conditions, and simple sequen-
tial iteration via while. Filament has ordered composition
c1 c2 and unordered composition c1 ; c2. It separates mem-
ories a and variables x into separate syntactic categories.
Filament programs can only declare the latter: a program
runs with a fixed set of available memories.
4.2 Large-Step Semantics
Filament’s large-step operational semantics is a checked se-
mantics that enforces Dahlia’s safety condition by explicitly
tracking and getting stuck when it would otherwise require
two conflicting accesses. Our type system (Section 4.3) aims
to rule out these conflicts.
The semantics uses an environment σ mapping variable
and memory names to values, which may be primitive values
or memories, which in turn map indices to primitive values. A
second context, ρ, is the set of the memories that the program
has accessed. ρ starts empty and accumulates memories as
the program reads and writes them.
The operational semantics consists of an expression judg-
ment σ1, ρ1, e ⇓ σ2, ρ2,v and a command judgment σ1, ρ1, c ⇓
σ2, ρ2. We describe some relevant rules here, and the supple-
mentary material lists the full semantics and proof [39].
Memory accesses. Memories in Filament are mutable stores
of values. Banked memories in Dahlia can be built up using
these simpler memories. The rule for a memory read expres-
sion a[n] requires that a not already be present in ρ, which
would indicate that the memory was previously consumed:
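$$\frac{a \notin \rho_1 \qquad \sigma_1, \rho_1, e \Downarrow \sigma_2, \rho_2, n \qquad \sigma_2(a)(n) = v}{\sigma_1, \rho_1, a[e] \Downarrow \sigma_2, \rho_2 \cup \{a\}, v}$$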
Composition. Unordered composition accumulates the re-
source demands of two commands by threading ρ through:
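$$\frac{\sigma_1, \rho_1, c_1 \Downarrow \sigma_2, \rho_2 \qquad \sigma_2, \rho_2, c_2 \Downarrow \sigma_3, \rho_3}{\sigma_1, \rho_1, c_1 \,;\, c_2 \Downarrow \sigma_3, \rho_3}$$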
If both commands read or write the same memory, they will
conflict in ρ. Ordered composition runs each command in
the same initial ρ environment and merges the resulting ρ:
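$$\frac{\sigma_1, \rho_1, c_1 \Downarrow \sigma_2, \rho_2 \qquad \sigma_2, \rho_1, c_2 \Downarrow \sigma_3, \rho_3}{\sigma_1, \rho_1, c_1 \; c_2 \Downarrow \sigma_3, \rho_2 \cup \rho_3}$$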
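The way ρ flows through the two composition forms can be prototyped directly. The Python sketch below is a simplification and not the paper's semantics in full: it drops the environment σ and the index expression, models every read or write as `("access", name)`, and uses the hypothetical tags `"unordered"` and `"ordered"` for the two compositions.

```python
class Stuck(Exception):
    """Raised when a command needs a memory that is already in rho."""


def run(cmd, rho=frozenset()):
    """Return the set of consumed memories after running cmd from rho."""
    tag = cmd[0]
    if tag == "access":      # a read or write of memory cmd[1]
        if cmd[1] in rho:
            raise Stuck(cmd[1])
        return rho | {cmd[1]}
    if tag == "unordered":   # c1 ; c2 threads rho from c1 into c2
        return run(cmd[2], run(cmd[1], rho))
    if tag == "ordered":     # c2 restarts from the initial rho; merge results
        return run(cmd[1], rho) | run(cmd[2], rho)
    raise ValueError(f"unknown command: {tag}")


a = ("access", "A")
ordered_result = run(("ordered", a, a))  # frozenset({"A"})
```

Running `run(("unordered", a, a))` raises `Stuck`, because the second access finds A already consumed, whereas the ordered form succeeds.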
4.3 Type System
The typing judgments have the form Γ1,∆1 ⊢ c ⊣ Γ2,∆2 and
Γ,∆1 ⊢ e : τ ⊣ ∆2. Γ is a standard typing context for variables
and ∆ is the affine context for memories.
Affine memory accesses. Memories are affine resources.
The rules for reads and writes check the type of the index in
Γ and remove the memory from ∆:
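$$\frac{\Gamma, \Delta_1 \vdash e : \mathsf{bit}\langle n \rangle \dashv \Delta_2 \qquad \Delta_2 = \Delta_3 \cup \{a \mapsto \mathsf{mem}\ \tau[n_1]\}}{\Gamma, \Delta_1 \vdash a[e] : \tau \dashv \Delta_3}$$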
Composition. The unordered composition rule checks the
first statement in the initial contexts and uses the resulting
contexts to check the second statement:
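$$\frac{\Gamma_1, \Delta_1 \vdash c_1 \dashv \Gamma_2, \Delta_2 \qquad \Gamma_2, \Delta_2 \vdash c_2 \dashv \Gamma_3, \Delta_3}{\Gamma_1, \Delta_1 \vdash c_1 \,;\, c_2 \dashv \Gamma_3, \Delta_3}$$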
Ordered composition checks both commands under the same
resource set, ∆1, but threads the non-affine context through:
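$$\frac{\Gamma_1, \Delta_1 \vdash c_1 \dashv \Gamma_2, \Delta_2 \qquad \Gamma_2, \Delta_1 \vdash c_2 \dashv \Gamma_3, \Delta_3}{\Gamma_1, \Delta_1 \vdash c_1 \; c_2 \dashv \Gamma_3, \Delta_2 \cap \Delta_3}$$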
The rule merges the resulting ∆ contexts with set intersection
to yield the resources not consumed by either statement.
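The affine bookkeeping in these rules can be sketched as a tiny checker. This Python model is an assumption-heavy simplification: it ignores Γ and value types, represents ∆ as a set of memory names, encodes every read or write as `("access", name)`, and uses the hypothetical tags `"unordered"` and `"ordered"`.

```python
class AffineError(Exception):
    """Raised when a command uses a memory that is no longer in delta."""


def check(cmd, delta):
    """Return the affine context that remains after checking cmd."""
    tag = cmd[0]
    if tag == "access":      # consume the memory cmd[1]
        if cmd[1] not in delta:
            raise AffineError(cmd[1])
        return delta - {cmd[1]}
    if tag == "unordered":   # c2 is checked in what c1 leaves behind
        return check(cmd[2], check(cmd[1], delta))
    if tag == "ordered":     # both start from delta; keep what neither used
        return check(cmd[1], delta) & check(cmd[2], delta)
    raise ValueError(f"unknown command: {tag}")


a = ("access", "A")
remaining = check(("ordered", a, a), frozenset({"A", "B"}))  # frozenset({"B"})
```

Checking `("unordered", a, a)` under the same context raises `AffineError`, matching the conflict that the checked semantics would otherwise get stuck on.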
4.4 Small-Step Semantics
We also define a small-step operational semantics for Fila-
ment upon which we build a proof of soundness. We claim
that the small-step semantics, when iterated to a value, is
equivalent to the large-step semantics. The semantics consists
of judgments σ1, ρ1, e → σ2, ρ2, e′ and σ1, ρ1, c → σ2, ρ2, c′
where σ and ρ are the environment and the memory context
respectively. The main challenge is sequential composition,
which uses an intermediate command form $c_1 \overset{\rho}{\sim} c_2$ to thread
ρ to c1 and c2. The supplementary material has full details.
4.5 Desugaring Surface Constructs
Filament desugars surface language features present in Dahlia.
Memory banking. A banked memory declaration like this:
let A: float[m bank n];
desugars into several unbanked memories:
let A_0: float[m/n]; let A_1: float[m/n]; ...
Desugaring transforms reads and writes of banked memories
to conditional statements that use the indexing expression
to decide which bank to access.
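A functional model of this desugaring fits in a few lines of Python. It assumes cyclic banking (element i goes to bank i mod n at position i div n), which the text above does not spell out. `split_into_banks` and `banked_read` are hypothetical names for this sketch.

```python
def split_into_banks(values, n):
    """Model A_0 .. A_{n-1}, each of length m/n, under cyclic banking."""
    assert len(values) % n == 0
    return [values[b::n] for b in range(n)]


def banked_read(banks, index):
    """Read A[index] through a chain of conditionals over the banks."""
    n = len(banks)
    for b in range(n):  # stands in for `if index mod n == b ...`
        if index % n == b:
            return banks[b][index // n]
    raise IndexError(index)


memory = list(range(100, 108))        # m = 8 elements
banks = split_into_banks(memory, 4)   # n = 4 unbanked memories
```

Every `banked_read(banks, i)` returns `memory[i]`, so the banked and unbanked programs observe the same values.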
Loop unrolling. Desugaring of for loops uses the tech-
nique described in Section 3.4, translating from:
for (let i = 0 .. m) unroll k { c1 --- c2 ... }
into a while loop that duplicates the body:
let i = 0;
while (i < m/k) {
{ c1[i ↦ k*i+0]; c1[i ↦ k*i+1] ... }
---
{ c2[i ↦ k*i+0]; c2[i ↦ k*i+1] ... }
...
i++; }
where c[x ↦ e] denotes substitution.
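The order in which the desugared loop visits the original iterations can be listed with a short Python sketch. `unroll_schedule` is a hypothetical helper, and the `(step, iterations)` pairs are an encoding chosen for this example.

```python
def unroll_schedule(m, k, steps):
    """For each trip of the while loop and each logical time step, list the
    original iterator values whose copies run in parallel."""
    assert m % k == 0
    schedule = []
    for i in range(m // k):
        for step in range(steps):
            schedule.append((step, [k * i + pe for pe in range(k)]))
    return schedule


# for (let i = 0..4) unroll 2 { c1 --- c2 }
schedule = unroll_schedule(m=4, k=2, steps=2)
```

The result is `[(0, [0, 1]), (1, [0, 1]), (0, [2, 3]), (1, [2, 3])]`: copies of the same logical time step run together, and the time steps of one trip finish before the next trip starts.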
Memory views. For views’ operational semantics, a desug-
aring based on the mathematical descriptions in Section 3.6
suffices. To type-check them, however, would require track-
ing the underlying memory for each view (transitively, to
cope with views of views) and type-level reasoning about the
bank requirements of an access pattern. Formal treatment of
these types would require an extension to Filament.
Multi-ported memories. Reasoning about memory ports
requires quantitative resource tracking, as in bounded linear
logic [21]. We leave such an extension of Filament’s affine
type system as future work.
4.6 Soundness Theorem
We state a soundness theorem for Filament’s type system
with respect to its checked small-step operational semantics.
Theorem. If ∅, ∆* ⊢ c ⊣ Γ2, ∆2 and ∅, ∅, c →* σ, ρ, c′ and
σ, ρ, c′ ↛, then c′ = skip.
where ∆∗ is the initial affine context of memories available
to a program. The theorem states that a well-typed program
never gets stuck due to memory conflicts in ρ. We prove this
theorem using progress and preservation lemmas:
Lemma 1 (Progress). If Γ, ∆ ⊢ c ⊣ Γ2, ∆2 and Γ, ∆ ∼ σ, ρ,
then σ, ρ, c → σ′, ρ′, c′ or c = skip.
Lemma 2 (Preservation). If Γ, ∆ ⊢ c ⊣ Γ2, ∆2 and Γ, ∆ ∼ σ, ρ,
and σ, ρ, c → σ′, ρ′, c′, then Γ′, ∆′ ⊢ c′ ⊣ Γ′2, ∆′2 and Γ′, ∆′ ∼
σ′, ρ′.
In these lemmas, Γ, ∆ ∼ σ, ρ is a well-formedness judgment
stating that all variables in Γ are in σ and all memories in ∆
are not in ρ. Using an extension of the syntax in Figure 6,
we prove the lemmas by induction on the small-step rela-
tion [39].
5 Evaluation
Our evaluation measures whether Dahlia’s restrictions can
improve predictability without sacrificing too much sheer
performance. We conduct two experiments: (1) We perform
an exhaustive design space exploration for one kernel to
determine how well the restricted design points compare to
the much larger unrestricted parameter space. (2) We port
the MachSuite benchmarks [49] and, where Dahlia yields a
meaningful design space, perform a parameter sweep.
5.1 Implementation and Experimental Setup
We implemented a Dahlia compiler in 5200 LoC of Scala. The
compiler checks Dahlia programs and generates C++ code us-
ing Xilinx Vivado HLS’s #pragma directives [58]. We execute
benchmarks on AWS F1 instances [1] with 8 vCPUs, 122 GB
of main memory, and a Xilinx UltraScale+ VU9P. We use the
SDAccel development environment [57] and synthesize the
benchmarks with a target clock frequency of 250 MHz.
5.2 Case Study: Unrestricted DSE vs. Dahlia
In this section, we conduct an exhaustive design-space explo-
ration (DSE) of a single benchmark as a case study. Without
Dahlia, the HLS design space is extremely large—we study
how the smaller Dahlia-restricted design space compares. We
select a blocked matrix multiplication kernel (gemm-blocked
from MachSuite) for its large but tractable design space. The
kernel has 3 two-dimensional arrays (two operands and the
output product) and 5 nested loops, of which the inner 3
are parallelizable. We define parameters for the 6 banking
factors (two dimensions for each memory) and 3 unrolling
factors. (A full code listing appears in the supplementary ma-
terial [39].) We explore a design space with banking factors
of 1–4 and unrolling factors of 1, 2, 4, 6, and 8. This design
space consists of 32,000 distinct configurations.
We exhaustively evaluated the entire design space using
Vivado HLS’s estimation mode, which required a total of
2,666 compute hours. We identify Pareto-optimal configurations
according to their estimated cycle latency and number
of lookup tables (LUTs), flip flops (FFs), block RAMs (BRAMs),
and arithmetic units (DSPs).
Figure 7. Results from exhaustive design space exploration for gemm-blocked: (a) Pareto-optimal points; (b) points accepted by Dahlia; (c) cluster of Pareto points.
Dahlia accepts 354 configurations, or about 1.1% of the
unrestricted design space. But the smaller space is only valu-
able if it consists of useful design points—a broad range of
Pareto-optimal configurations. Figures 7a and 7b show the
Pareto-optimal points and the subset of points that Dahlia
accepts, respectively. (Pareto optimality is determined using
all objectives, but the plot shows only two: LUTs and la-
tency.) Figure 7c shows a zoomed-in view of the tight cluster
of Pareto points in the bottom-left of the first two graphs.
Dahlia-accepted points lie primarily on the Pareto frontier
and allow area-latency trade-offs. The Pareto-optimal points that
Dahlia rejects expend a large number of LUTs to reduce
BRAM consumption; although Pareto-optimal, they do not seem
to be of practical use.
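Pareto optimality over several minimized objectives can be computed with a simple dominance filter. The Python sketch below is a generic illustration, not the script used for the evaluation, and the sample (latency, LUTs) pairs are made-up numbers.

```python
def pareto_front(points):
    """Return the points that no other point dominates (all objectives minimized)."""
    def dominates(p, q):
        # p is no worse than q everywhere and strictly better somewhere.
        return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (latency, LUTs) estimates for five configurations.
designs = [(100, 5), (80, 9), (120, 4), (110, 6), (80, 10)]
front = pareto_front(designs)
```

Here `front` is `[(100, 5), (80, 9), (120, 4)]`: the other two designs are each beaten on one objective without gaining on the other. The same function works unchanged for tuples with more objectives, such as latency, LUTs, FFs, BRAMs, and DSPs.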
5.3 Dahlia-Directed DSE & Programmability
We port benchmarks from an HLS benchmark suite, Mach-
Suite [49], to study Dahlia’s flexibility. Of the 19 MachSuite
benchmarks, one (backprop) contains a correctness bug and
two fail to synthesize correctly in Vivado, indicating a bug
in the tools. We successfully ported all 16 of the remaining
benchmarks without substantial restructuring.
From these, we select 3 benchmarks that exhibit the kind
of fine-grained, loop-level parallelism that Dahlia targets as
case studies: stencil2d, md-knn, and md-grid. As the previous
section illustrates, an unrestricted DSE is intractable for
even modestly sized benchmarks, so we instead measure the
breadth and performance of the much smaller space of con-
figurations that Dahlia accepts. For each benchmark, we find
all optimization parameters available in the Dahlia port and
define a search space. The type checker rejects some design
points, and we measure the remaining space. We use Vivado
HLS’s estimation mode to measure the resource counts and
estimated latency for each accepted point. Figure 8 depicts
the Pareto-optimal points in each space. In each plot, we also
highlight the effect a single parameter has on the results.
The rest of this section reports quantitatively on each
benchmark’s design space and reports qualitatively on the
programming experience during the port from C to Dahlia.
stencil2d. MachSuite’s stencil2d is a filter operation with
four nested loops. The outer loops scan over the input matrix
and the inner loops apply a 3×3 filter. Our Dahlia port unrolls
the inner two loops and banks both input memories. We use
unrolling factors from 1 to 3 and bank each dimension of
the input array by factors 1 to 6. The resulting design space
has 2,916 points. Dahlia accepts 18 of these points (0.6%), of
which 8 are Pareto-optimal within the set.
Figure 8a shows the Pareto frontier among the Dahlia-
accepted points. The figure uses color to show the unrolling
factor for the innermost loop. This unrolling factor has a
large effect on the design’s performance, while banking fac-
tors and the other loop explain the rest of the variation.
The original C code uses single-dimensional arrays and
uses index arithmetic to treat them as matrices:
for (r=0; r<row_size-2; r++)
for (c=0; c<col_size-2; c++)
for (k1=0; k1<3; k1++)
for (k2=0; k2<3; k2++)
mul = filter[k1*3 + k2] *
orig[(r+k1)*col_size + c+k2];
In the Dahlia port, we must use proper two-dimensional ar-
rays because the compiler rejects arbitrary indexing expres-
sions. Using views, programmers can decouple the storage
format from the iteration pattern. To express the accesses
to the input matrix orig, we create a shifted suffix view
(Section 3.6) for the current window:
for (let row = 0..126) {
for (let col = 0..62) {
view window = shift orig[by row][by col];
for (let k1 = 0..3) unroll 3 {
for (let k2 = 0..3) unroll 3 {
let mul = filter[k1][k2] * window[k1][k2];
The view makes the code’s logic more obvious while allowing
the Dahlia type checker to accept unrolling of the inner two
loops. It also clarifies why parallelizing the outer loops would
be undesirable: the parallel views would require overlapping
regions of the input array, introducing a bank conflict.
Figure 8. The design spaces for three MachSuite benchmarks, plotted as LUTs used against latency in cycles: (a) stencil2d with inner unroll; (b) md-knn with outer unroll; (c) md-grid with middle unroll. Each uses a color to highlight one design parameter.
md-knn. The md-knn benchmark implements an n-body
molecular dynamics simulation with a k-nearest neighbors
kernel. The MachSuite implementation uses data-dependent
loads in its main loop, which naïvely seems to prevent paral-
lelization. In our Dahlia port, however, we hoist this serial
section into a separate loop that runs before the main, par-
allelizable computation. Dahlia’s type system helped guide
the programmer toward a version of the benchmark where
the benefits from parallelization are clear.
For each of the program’s four memories, we used banking
factors from 1 to 4. We unrolled each of the two nested loops
with factors from 1 to 8. The full space has 16,384 points,
of which Dahlia accepts 525 (3%). 37 of the Dahlia-accepted
points are Pareto-optimal.
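The size of this search space follows from counting the parameter combinations, which the Python sketch below reproduces. The variable names are illustrative, and the acceptance count of 525 is taken from the text rather than computed.

```python
from itertools import product

# Four memories, each with a banking factor from 1 to 4.
banking = list(product(range(1, 5), repeat=4))
# Two nested loops, each with an unrolling factor from 1 to 8.
unrolling = list(product(range(1, 9), repeat=2))

space = len(banking) * len(unrolling)  # 256 * 64 configurations
accepted_fraction = 525 / space        # share of points Dahlia accepts
```

This gives 16,384 configurations and an accepted share of roughly 3%, in line with the figures reported above.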
Figure 8b shows two Pareto frontiers that Dahlia accepts
at different scales. The color shows the unrolling factor of the
outer loop. The frontier on the right uses an order of magni-
tude fewer resources but is an order of magnitude slower. In
this kernel, the dominant effect is the memory banking (not
shown in the figure), which determines which frontier the
designs fall into. The outer unrolling factor (shown in color)
affects the two regimes differently: on the right, it allows
area–latency trade-offs within the frontier; on the left, it acts
as a second-order effect that expends LUTs to achieve a small
increase in performance.
md-grid. Another algorithm for the same molecular dynam-
ics problem, md-grid, uses a different strategy based on a
3D grid implemented with several 4-dimensional arrays. It
calculates forces between neighboring grid cells. Of its 6
nested loops, the outer three are parallelizable. We use bank-
ing factors of 1 to 4 for each dimension of each array, and
we try unrolling factors from 1 to 8 for both loops. The full
space has 21,952 points, of which Dahlia accepts 81 (0.4%).
13 of the Dahlia-accepted points are Pareto-optimal.
Figure 8c again shows the Pareto-optimal design points.
The innermost loop unrolling factor (not shown in the figure)
determines which of three coarse regimes the design falls
into. The color shows the second loop unrolling factor, which
determines a second-order area–latency trade-off within
each regime. Unrolling enables latency-area trade-offs in
both cases.
6 Future Work
Dahlia represents a first step toward high-level semantics for
accelerator design languages. It leaves several avenues for
future work on scaling up from kernels to full applications
and expressing more hardware implementation techniques.
Modularity. Dahlia’s type system relies on a closed-world
assumption. A compositional type system would enable
reuse of abstract hardware modules without “inlining” them,
like functions in a software language. The primary challenge
in modular accelerator design is the balance between abstrac-
tion and efficiency: a more general module is likely to be
less efficient. An abstraction mechanism must also cope with
the timing of inter-module interactions: some interfaces are
latency-insensitive while others rely on cycle-level timing.
Polymorphism. Dahlia’s memory types are monomorphic.
Polymorphism would enable abstraction over memories’
banking strategies and sizes. A polymorphic Dahlia-like
language could rule out invalid combinations of abstract
implementation parameters before the designer picks con-
crete values, which would help constrain the search space
for design space exploration.
Pipelining. Pipelined logic is a critical implementation tech-
nique for high-level synthesis. Dahlia does not reason about
the timing of pipeline stages or their resource conflicts. Ex-
tensions to its type system will need to reason about the
cycle-level latency of these stages and track the fine-grained
sharing of logic resources.
Direct RTL generation. The current Dahlia compiler relies
on a commercial C++-based HLS compiler as its backend. It
generates directives that instruct the HLS tool to generate
hardware according to the program’s Dahlia types, but the
unpredictability of traditional HLS means that results can
still vary. Future compilers for Dahlia-like languages might
generate RTL directly and rely on the simpler input language
to avoid the complexity of unrestricted HLS.
Figure 9. Resource utilization (DSPs, BRAMs, and LUTs used) for gemm-ncubed in Spatial as a function of the unrolling factor, normalized to the design without unrolling.
7 Related Work
Dahlia builds on a long history of work on safe systems pro-
gramming. Substructural type systems are known to be a
good fit for controlling system resources [7, 24, 54, 13, 36].
Dahlia’s enforcement of exclusive memory access resembles
work on race-free parallel programming using type and ef-
fect systems [8] or concurrent separation logic [41]. Safe
parallelism on CPUs focuses on data races where concurrent
reads and writes to a memory are unsynchronized. Conflicts
in Dahlia are different: any simultaneous pair of accesses to
the same bank is illegal. The distinction influences Dahlia’s
capability system and its memory views, which cope with
the arrangement of arrays into parallel memory banks.
Dahlia takes inspiration from other approaches to improv-
ing the accelerator design process, including HDLs, HLS,
DSLs, and other recent accelerator design languages.
Spatial. Spatial [32] is a language for designing accelera-
tors that builds on parallel patterns [43], which are flexible
hardware templates. Spatial adds some automation beyond
traditional HLS: it infers a banking strategy given some par-
allel accesses. Like HLS, Spatial designs can be unpredictable.
Figure 9 shows resource usage for the matrix multiplication
kernel from Section 2 written in Spatial. (A full experimen-
tal setup appears in the supplementary material [39].) For
unrolling factors that do not evenly divide the memory size,
Spatial will sometimes infer a banking factor that is not
equal to the unrolling factor. In these cases, the resource
usage abruptly increases. A type system like Dahlia’s could
help address these predictability pitfalls in Spatial.
Better HDLs. Modern hardware description languages [5, 35,
14, 4, 55, 30, 40] aim to address the shortcomings of Verilog
and VHDL. These languages target register transfer level
(RTL) design. Dahlia targets a different level of abstraction
and a different use case: it uses an imperative programming
model and focuses exclusively on computational accelerators.
Dahlia is not a good language for implementing a CPU, for
example. Its focus on acceleration requires the language and
semantics to more closely resemble software languages.
Traditional HLS. Existing commercial [58, 29, 37, 9] and
academic [47, 10, 42, 59] high-level synthesis (HLS) tools com-
pile subsets of C, C++, OpenCL, or SystemC to RTL. While
their powerful heuristics can be effective, when they fail, pro-
grammers have little insight into what went wrong or how
to fix it [34]. Dahlia represents an alternative approach that
prioritizes programmer control over black-box optimization.
Targeting hardware from DSLs. Compilers to FPGAs and
ASICs exist for DSLs for image processing [26, 27, 45, 51] and
machine learning [20, 52]. Dahlia is not a DSL: it is a general
language for implementing accelerators. While DSLs offer
advantages in productivity and compilation for individual
application domains, they do not obviate the need for general
languages to fill in the gaps between popular domains, to
offer greater programmer control when appropriate, and to
serve as a compilation target for multiple DSLs.
Accelerator design languages. Some recent languages also
focus on general accelerator design. HeteroCL [33] uses a
Halide-like [48] scheduling language to describe how to
map algorithms onto HLS-like hardware optimizations, and
T2S [50] similarly lets programs describe how to generate a
spatial implementation. Lime [3] extends Java to express
target-independent streaming accelerators. CoRAM [12] is not
just a language; it extends FPGAs with a programmable memory
interface that adapts memory accesses, akin to Dahlia’s
memory views. Dahlia’s focus on predictability and type-
driven design makes it unique, as far as we are aware.
8 Conclusion
Dahlia exposes predictability as a new design goal for HLS
tools. Predictability comes at a cost—it can rule out design
points that perform surprisingly well because of a subtle con-
vergence of heuristics. We see these outliers as a worthy sac-
rifice in exchange for an intelligible programming model and
robust reasoning tools. We hope to extend Dahlia’s philoso-
phy to bring predictability to the rest of the reconfigurable
hardware system stack, from the language to the LUTs.
Acknowledgments
We thank Drew Zagieboylo and Yi-Hsiang Lai for insight-
ful discussions and Alexa VanHattum, Dietrich Geisler, and
Pedro Henrique Azevedo de Amorim for invaluable com-
ments on early drafts and solidarity in the final hours. Many
thanks to the anonymous PLDI reviewers and our shepherd,
Zachary Tatlock, for suggesting important additions.
This work was supported in part by the Center for Ap-
plications Driving Architectures (ADA), one of six centers
of JUMP, a Semiconductor Research Corporation program
co-sponsored by DARPA. It was also supported by the Intel
and NSF joint research center for Computer Assisted Pro-
gramming for Heterogeneous Architectures (CAPA). Support
included NSF awards #1723715 and #1845952.
References
[1] Amazon Web Services. [n.d.]. Amazon EC2 F1 Instances. https://aws.
amazon.com/ec2/instance-types/f1/.
[2] Mina Tahmasbi Arashloo, Alexey Lavrov, Manya Ghobadi, Jennifer
Rexford, David Walker, and David Wentzlaff. 2020. Enabling Pro-
grammable Transport Protocols in High-Speed NICs. In USENIX Sym-
posium on Networked System Design and Implementation (NSDI).
[3] Joshua Auerbach, David F. Bacon, Perry Cheng, and Rodric Rabbah.
2010. Lime: A Java-compatible and Synthesizable Language for
Heterogeneous Architectures. In ACM SIGPLAN Conference on Object Oriented
Programming, Systems, Languages and Applications (OOPSLA).
[4] C. Baaij, M. Kooijman, J. Kuper, A. Boeijink, and M. Gerards. 2010.
CλaSH: Structural Descriptions of Synchronous Hardware Using
Haskell. In Euromicro Conference on Digital System Design: Architec-
tures, Methods and Tools.
[5] Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew
Waterman, Rimas Avižienis, John Wawrzynek, and Krste Asanović.
2012. Chisel: constructing hardware in a Scala embedded language. In
Design Automation Conference (DAC).
[6] Henry G. Baker. 1995. “Use-once” Variables and Linear Objects: Storage
Management, Reflection and Multi-threading. SIGPLAN Notices 30, 1
(Jan. 1995), 45–52.
[7] J Bernardy, Mathieu Boespflug, Ryan Newton, Simon L. Peyton Jones,
and Arnaud Spiwack. 2017. Linear Haskell: practical linearity in a
higher-order polymorphic language. In ACM SIGPLAN-SIGACT Sym-
posium on Principles of Programming Languages (POPL).
[8] Robert L. Bocchino, Jr., Vikram S. Adve, Danny Dig, Sarita V. Adve,
Stephen Heumann, Rakesh Komuravelli, Jeffrey Overbey, Patrick Sim-
mons, Hyojin Sung, and Mohsen Vakilian. 2009. A Type and Effect
System for Deterministic Parallel Java. In ACM SIGPLAN Conference
on Object Oriented Programming, Systems, Languages and Applications
(OOPSLA).
[9] Cadence. [n.d.]. Stratus High-Level Synthesis. https://www.cadence.
com/content/cadence-www/global/en_US/home/tools/digital-
design-and-signoff/synthesis/stratus-high-level-synthesis.html.
[10] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed
Kammoona, Jason H Anderson, Stephen Brown, and Tomasz Cza-
jkowski. 2011. LegUp: high-level synthesis for FPGA-based pro-
cessor/accelerator systems. In International Symposium on Field-
Programmable Gate Arrays (FPGA).
[11] Eric S. Chung, John D. Davis, and Jaewon Lee. 2013. LINQits: big data
on little clients. In International Symposium on Computer Architecture
(ISCA).
[12] Eric S Chung, James C Hoe, and Ken Mai. 2011. CoRAM: an in-
fabric memory architecture for FPGA-based computing. In Field pro-
grammable gate arrays (FPGA).
[13] Sylvan Clebsch, Sophia Drossopoulou, Sebastian Blessing, and Andy
McNeil. 2015. Deny Capabilities for Safe, Fast Actors. In International
Workshop on Programming Based on Actors, Agents, and Decentralized
Control (AGERE!).
[14] J. Clow, G. Tzimpragos, D. Dangwal, S. Guo, J. McMahan, and T. Sher-
wood. 2017. A Pythonic approach for rapid hardware prototyping and
instrumentation. In International Conference on Field-Programmable
Logic and Applications (FPL).
[15] J. Cong, Y. Fan, G. Han, W. Jiang, and Z. Zhang. 2006. Platform-
Based Behavior-Level and System-Level Synthesis. In International
SoC Conference.
[16] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang.
2011. High-Level Synthesis for FPGAs: From Prototyping to Deploy-
ment. IEEE Transactions on Computer-Aided Design of Integrated Cir-
cuits and Systems (TCAD) 30, 4 (April 2011), 473–491.
[17] J. Cong and Zhiru Zhang. 2006. An efficient and versatile scheduling
algorithm based on SDC formulation. In Design Automation Conference
(DAC).
[18] Matthew Fluet, Greg Morrisett, and Amal Ahmed. 2006. Linear regions
are all you need. In European Symposium on Programming (ESOP).
[19] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Mas-
sengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman,
Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel, Adam
Sapek, Gabriel Weisz, Lisa Woods, Sitaram Lanka, Steven K. Reinhardt,
Adrian M. Caulfield, Eric S. Chung, and Doug Burger. 2018. A Config-
urable Cloud-scale DNN Processor for Real-time AI. In International
Symposium on Computer Architecture (ISCA).
[20] N. George, H. Lee, D. Novo, T. Rompf, K. J. Brown, A. K. Sujeeth, M.
Odersky, K. Olukotun, and P. Ienne. 2014. Hardware system synthesis
from Domain-Specific Languages. In International Conference on Field-
Programmable Logic and Applications (FPL).
[21] Jean-Yves Girard, Andre Scedrov, and Philip J. Scott. 1992. Bounded
Linear Logic: A Modular Approach to Polynomial-Time Computability.
Theoretical Computer Science 97, 1 (April 1992), 1–66.
[22] Colin S Gordon, Michael D Ernst, and Dan Grossman. 2013. Rely-
guarantee references for refinement types over aliased mutable data.
In ACM SIGPLAN Conference on Programming Language Design and
Implementation (PLDI).
[23] Colin S. Gordon, Matthew J. Parkinson, Jared Parsons, Aleks Brom-
field, and Joe Duffy. 2012. Uniqueness and Reference Immutability
for Safe Parallelism. In ACM SIGPLAN Conference on Object Oriented
Programming, Systems, Languages and Applications (OOPSLA).
[24] Dan Grossman, Greg Morrisett, Trevor Jim, Michael Hicks, Yanling
Wang, and James Cheney. 2002. Region-based Memory Management
in Cyclone. In ACM SIGPLAN Conference on Programming Language
Design and Implementation (PLDI).
[25] S Gupta, Renu Gupta, Nikil Dutt, and Alex Nicolau. 2004. SPARK: A
Parallelizing Approach to the High-Level Synthesis of Digital Circuits.
Springer.
[26] James Hegarty, John Brunhaver, Zachary DeVito, Jonathan Ragan-
Kelley, Noy Cohen, Steven Bell, Artem Vasilyev, Mark Horowitz, and
Pat Hanrahan. 2014. Darkroom: compiling high-level image processing
code into hardware pipelines. ACM Transactions on Graphics 33, 4
(2014).
[27] James Hegarty, Ross Daly, Zachary DeVito, Jonathan Ragan-Kelley,
Mark Horowitz, and Pat Hanrahan. 2016. Rigel: Flexible multi-rate
image processing hardware. ACM Transactions on Graphics 35, 4
(2016).
[28] John L. Hennessy and David A. Patterson. 2019. A New Golden Age
for Computer Architecture. Communications of the ACM (CACM) 62,
2 (Jan. 2019), 48–60.
[29] Intel. [n.d.]. Intel High Level Synthesis Compiler. https:
//www.altera.com/products/design-software/high-level-
design/intel-hls-compiler/overview.html
[30] Jane Street. [n.d.]. HardCaml: Register Transfer Level Hardware Design
in OCaml. https://github.com/janestreet/hardcaml.
[31] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gau-
rav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Bo-
den, Al Borchers, Rick Boyle, Pierre luc Cantin, Clifford Chao, Chris
Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb,
Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland,
Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert
Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexan-
der Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy,
James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu,
Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire
Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray
Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda,
Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani,
Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan
Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Ho-
ria Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang,
Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance
Analysis of a Tensor Processing Unit. In International Symposium on
Computer Architecture (ISCA).
[32] David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang,
Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram,
Christos Kozyrakis, and Kunle Olukotun. 2018. Spatial: a language and
compiler for application accelerators. In ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI).
[33] Yi-Hsiang Lai, Yuze Chi, Yuwei Hu, Jie Wang, Cody Hao Yu, Yuan Zhou,
Jason Cong, and Zhiru Zhang. 2019. HeteroCL: A Multi-Paradigm Pro-
gramming Infrastructure for Software-Defined Reconfigurable Com-
puting. In International Symposium on Field-Programmable Gate Arrays
(FPGA).
[34] Yun Liang, Kyle Rupnow, Yinan Li, Dongbo Min, Minh N Do, and Dem-
ing Chen. 2012. High-level synthesis: productivity, performance, and
software constraints. Journal of Electrical and Computer Engineering
(2012).
[35] Derek Lockhart, Gary Zibrat, and Christopher Batten. 2014. PyMTL:
A Unified Framework for Vertically Integrated Computer Architecture
Research. In IEEE/ACM International Symposium on Microarchitecture
(MICRO).
[36] Nicholas D. Matsakis and Felix S. Klock, II. 2014. The Rust Language.
In High Integrity Language Technology (HILT).
[37] Mentor Graphics. [n.d.]. Catapult High-Level Synthesis. https://www.
mentor.com/hls-lp/catapult-high-level-synthesis/.
[38] Mayur Naik, Alexander Aiken, and John Whaley. 2006. Effective static
race detection for Java. In ACM SIGPLAN Conference on Programming
Language Design and Implementation (PLDI).
[39] Rachit Nigam, Sachille Atapattu, Samuel Thomas, Zhijing Li, Theodore
Bauer, Yuwei Ye, Apurva Koti, Adrian Sampson, and Zhiru Zhang.
[n.d.]. Predictable Accelerator Design with Time-Sensitive Affine
Types: Supplemental Material. 2020.
[40] Rishiyur Nikhil. 2004. Bluespec System Verilog: efficient, correct RTL
from high level specifications. In Conference on Formal Methods and
Models for Co-Design (MEMOCODE).
[41] Peter W. O’Hearn. 2007. Resources, Concurrency, and Local Reasoning.
Theoretical Computer Science 375 (April 2007), 271–307.
[42] Christian Pilato and Fabrizio Ferrandi. 2013. Bambu: A modular frame-
work for the high level synthesis of memory-intensive applications. In
International Conference on Field-Programmable Logic and Applications
(FPL).
[43] Raghu Prabhakar, David Koeplinger, Kevin J Brown, HyoukJoong Lee,
Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun. 2016. Gen-
erating configurable hardware from parallel patterns. ACM SIGARCH
Computer Architecture News 44, 2 (2016), 651–665.
[44] Polyvios Pratikakis, Jeffrey S. Foster, and Michael W. Hicks. 2006.
LOCKSMITH: context-sensitive correlation analysis for race detection.
In ACM SIGPLAN Conference on Programming Language Design and
Implementation (PLDI).
[45] Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson,
Jonathan Ragan-Kelley, and Mark Horowitz. 2017. Programming het-
erogeneous systems from an image processing DSL. ACM Transactions
on Architecture and Code Optimization (TACO) (2017).
[46] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou,
Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers,
Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck,
Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James
Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Y.
Xiao, and Doug Burger. 2014. A Reconfigurable Fabric for Accelerat-
ing Large-scale Datacenter Services. In International Symposium on
Computer Architecture (ISCA).
[47] Andrew R Putnam, Dave Bennett, Eric Dellinger, Jeff Mason, and
Prasanna Sundararajan. 2008. CHiMPS: A high-level compilation flow
for hybrid CPU-FPGA architectures. In International Symposium on
Field-Programmable Gate Arrays (FPGA).
[48] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain
Paris, Frédo Durand, and Saman P. Amarasinghe. 2013. Halide: a
language and compiler for optimizing parallelism, locality, and recom-
putation in image processing pipelines. In ACM SIGPLAN Conference
on Programming Language Design and Implementation (PLDI).
[49] Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and
David Brooks. 2014. MachSuite: Benchmarks for Accelerator Design
and Customized Architectures. In IEEE International Symposium on
Workload Characterization (IISWC).
[50] Hongbo Rong. 2017. Programmatic Control of a Compiler for Gener-
ating High-performance Spatial Hardware. arXiv preprint 1711.07606.
https://arxiv.org/abs/1711.07606.
[51] Jeff Setter. [n.d.]. Halide-to-Hardware. https://github.com/jeffsetter/
Halide-to-Hardware.
[52] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro,
Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh.
2016. From high-level deep neural models to FPGAs. In IEEE/ACM
International Symposium on Microarchitecture (MICRO).
[53] Stuart Sutherland, Don Mills, and Chris Spear. 2007. Gotcha Again:
More Subtleties in the Verilog and SystemVerilog Standards That Ev-
ery Engineer Should Know. In Synopsys Users Group (SNUG) San
Jose. https://lcdm-eng.com/papers/snug07_Verilog%20Gotchas%
20Part2.pdf
[54] Jesse A. Tov and Riccardo Pucella. 2011. Practical Affine Types. In ACM
SIGPLAN-SIGACT Symposium on Principles of Programming Languages
(POPL).
[55] Lenny Truong and Pat Hanrahan. 2019. A Golden Age of Hardware De-
scription Languages: Applying Programming Language Techniques to
Improve Design Productivity. In Summit oN Advances in Programming
Languages (SNAPL).
[56] Yatish Turakhia, Gill Bejerano, and William J. Dally. 2018. Darwin:
A Genomics Co-processor Provides Up to 15,000X Acceleration on
Long Read Assembly. In ACM International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS).
[57] Xilinx Inc. [n.d.]. SDAccel: Enabling Hardware-Accelerated Soft-
ware. https://www.xilinx.com/products/design-tools/software-
zone/sdaccel.html.
[58] Xilinx Inc. [n.d.]. Vivado Design Suite User Guide: High-Level
Synthesis. UG902 (v2017.2) June 7, 2017. https://www.xilinx.com/
support/documentation/sw_manuals/xilinx2017_2/ug902-vivado-
high-level-synthesis.pdf.
[59] Zhiru Zhang, Yiping Fan, Wei Jiang, Guoling Han, Changqi Yang, and
Jason Cong. 2008. AutoPilot: A platform-based ESL synthesis system.
In High-Level Synthesis: From Algorithm to Digital Circuit, Philippe
Coussy and Adam Morawiec (Eds.). 99–112.
A Semantics
The following lists the grammar for Filament, the core lan-
guage of Dahlia.
x ∈ variables a ∈ memories n ∈ numbers
b ::= true | false v ::= n | b | v1 bop v2
e ::= v | bop e1 e2 | x | a[e]
c ::= e | let x = e | c1 c2 | c1 ρ∼ c2 | c1 ; c2 | if x c1 c2 |
while x c | x := e | a[e1] := e2 | skip
τ ::= bit⟨n⟩ | float | bool | mem τ [n1]
The large-step operational semantics listed below capture
the complete evaluation of an expression or command. The
environment σ maps variables and memory names to values,
and the context ρ is the set of memories the program has
accessed.
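\[ \boxed{\sigma_1, \rho_1, e \Downarrow \sigma_2, \rho_2, v} \]
\[ \frac{\sigma_1, \rho_1, e_1 \Downarrow \sigma_2, \rho_2, v_1 \qquad \sigma_2, \rho_2, e_2 \Downarrow \sigma_3, \rho_3, v_2 \qquad v_3 = v_1 \mathbin{\mathit{bop}} v_2}{\sigma_1, \rho_1, \mathit{bop}\ e_1\ e_2 \Downarrow \sigma_3, \rho_3, v_3} \]
\[ \frac{\sigma(x) = v}{\sigma, \rho, x \Downarrow \sigma, \rho, v} \]
\[ \frac{a \notin \rho_1 \qquad \sigma_1, \rho_1, e \Downarrow \sigma_2, \rho_2, n \qquad \sigma_2(a)(n) = v}{\sigma_1, \rho_1, a[e] \Downarrow \sigma_2, \rho_2 \cup \{a\}, v} \]
\[ \boxed{\sigma_1, \rho_1, c \Downarrow \sigma_2, \rho_2} \]
\[ \frac{\sigma_1, \rho_1, e \Downarrow \sigma_2, \rho_2, v}{\sigma_1, \rho_1, \mathsf{let}\ x = e \Downarrow \sigma_2[x \mapsto v], \rho_2} \]
\[ \frac{\sigma_1, \rho_1, c_1 \Downarrow \sigma_2, \rho_2 \qquad \sigma_2, \rho_1, c_2 \Downarrow \sigma_3, \rho_3}{\sigma_1, \rho_1, c_1 \;\; c_2 \Downarrow \sigma_3, \rho_2 \cup \rho_3} \]
\[ \frac{\sigma_1, \rho_1, c_1 \Downarrow \sigma_2, \rho_2 \qquad \sigma_2, \rho, c_2 \Downarrow \sigma_3, \rho_3}{\sigma_1, \rho_1, c_1 \overset{\rho}{\sim} c_2 \Downarrow \sigma_3, \rho_2 \cup \rho_3} \]
\[ \frac{\sigma_1, \rho_1, c_1 \Downarrow \sigma_2, \rho_2 \qquad \sigma_2, \rho_2, c_2 \Downarrow \sigma_3, \rho_3}{\sigma_1, \rho_1, c_1 ; c_2 \Downarrow \sigma_3, \rho_3} \]
\[ \frac{\sigma_1, \rho_1, x \Downarrow \sigma_2, \rho_2, \mathsf{true} \qquad \sigma_2, \rho_2, c_1 \Downarrow \sigma_3, \rho_3}{\sigma_1, \rho_1, \mathsf{if}\ x\ c_1\ c_2 \Downarrow \sigma_3, \rho_3} \]
\[ \frac{\sigma_1, \rho_1, x \Downarrow \sigma_2, \rho_2, \mathsf{false} \qquad \sigma_2, \rho_2, c_2 \Downarrow \sigma_3, \rho_3}{\sigma_1, \rho_1, \mathsf{if}\ x\ c_1\ c_2 \Downarrow \sigma_3, \rho_3} \]
\[ \frac{\sigma_1, \rho_1, x \Downarrow \sigma_2, \rho_2, \mathsf{true} \qquad \sigma_2, \rho_2, c \;\; \mathsf{while}\ x\ c \Downarrow \sigma_3, \rho_3}{\sigma_1, \rho_1, \mathsf{while}\ x\ c \Downarrow \sigma_3, \rho_3} \]
\[ \frac{\sigma_1, \rho_1, x \Downarrow \sigma_2, \rho_2, \mathsf{false}}{\sigma_1, \rho_1, \mathsf{while}\ x\ c \Downarrow \sigma_2, \rho_2} \]
\[ \frac{\sigma_1, \rho_1, e \Downarrow \sigma_2, \rho_2, v}{\sigma_1, \rho_1, x := e \Downarrow \sigma_2[x \mapsto v], \rho_2} \]
\[ \frac{\sigma_1, \rho_1, e_1 \Downarrow \sigma_2, \rho_2, n \qquad \sigma_2, \rho_2, e_2 \Downarrow \sigma_3, \rho_3, v \qquad a \notin \rho_3}{\sigma_1, \rho_1, a[e_1] := e_2 \Downarrow \sigma_3[a[n] \mapsto v], \rho_3 \cup \{a\}} \]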
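The large-step rules above can be read as a recursive evaluator. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the tuple encoding of the syntax, the tag names `seq` (for the composition written c1 c2) and `par` (for c1 ; c2), the `Stuck` exception, and the treatment of `skip` as a no-op are all choices made here for illustration. Conditions are looked up directly in σ.

```python
class Stuck(Exception):
    """Raised when no rule applies, e.g. a memory is accessed twice in one step."""


def eval_expr(sigma, rho, e):
    """Evaluate expression e and return (sigma, rho, value)."""
    tag = e[0]
    if tag == "val":                       # values evaluate to themselves
        return sigma, rho, e[1]
    if tag == "var":                       # sigma(x) = v
        return sigma, rho, sigma[e[1]]
    if tag == "bop":                       # evaluate e1, then e2, then apply bop
        _, op, e1, e2 = e
        sigma, rho, v1 = eval_expr(sigma, rho, e1)
        sigma, rho, v2 = eval_expr(sigma, rho, e2)
        return sigma, rho, op(v1, v2)
    if tag == "read":                      # a[e]: premise a not in rho1
        _, a, index = e
        if a in rho:
            raise Stuck(f"memory {a} was already accessed")
        sigma, rho, n = eval_expr(sigma, rho, index)
        return sigma, rho | {a}, sigma[a][n]
    raise ValueError(f"unknown expression {e!r}")


def eval_cmd(sigma, rho, c):
    """Evaluate command c and return (sigma, rho)."""
    tag = c[0]
    if tag == "skip":                      # assumption: skip changes nothing
        return sigma, rho
    if tag in ("let", "assign"):           # both bind x to the value of e
        _, x, e = c
        sigma, rho, v = eval_expr(sigma, rho, e)
        return {**sigma, x: v}, rho
    if tag == "write":                     # a[e1] := e2: premise a not in rho3
        _, a, e1, e2 = c
        sigma, rho, n = eval_expr(sigma, rho, e1)
        sigma, rho, v = eval_expr(sigma, rho, e2)
        if a in rho:
            raise Stuck(f"memory {a} was already accessed")
        cells = list(sigma[a])
        cells[n] = v
        return {**sigma, a: cells}, rho | {a}
    if tag == "seq":                       # c2 starts again from rho1
        _, c1, c2 = c
        sigma2, rho2 = eval_cmd(sigma, rho, c1)
        sigma3, rho3 = eval_cmd(sigma2, rho, c2)
        return sigma3, rho2 | rho3
    if tag == "par":                       # c2 continues from rho2
        _, c1, c2 = c
        sigma2, rho2 = eval_cmd(sigma, rho, c1)
        return eval_cmd(sigma2, rho2, c2)
    if tag == "if":
        _, x, c1, c2 = c
        return eval_cmd(sigma, rho, c1 if sigma[x] else c2)
    if tag == "while":                     # unroll once: body, then the loop again
        _, x, body = c
        if not sigma[x]:
            return sigma, rho
        return eval_cmd(sigma, rho, ("seq", body, c))
    raise ValueError(f"unknown command {c!r}")


# Usage: two reads of the same memory A.
sigma0 = {"A": [10, 20]}
read0 = ("read", "A", ("val", 0))
read1 = ("read", "A", ("val", 1))
two_steps = ("seq", ("let", "x", read0), ("let", "y", read1))
one_step = ("par", ("let", "x", read0), ("let", "y", read1))
final_sigma, final_rho = eval_cmd(sigma0, frozenset(), two_steps)
```

In this sketch `two_steps` evaluates because the second command restarts from the original ρ, whereas evaluating `one_step` raises `Stuck`: the second read finds A already in ρ.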
The small-step operational semantics capture incremental
evaluation of an expression or command and form the basis
of the proof of soundness in Appendix B.
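\[ \boxed{\sigma, \rho, e \rightarrow \sigma', \rho', e'} \]
\[ \frac{\sigma, \rho, e \rightarrow \sigma', \rho', e'}{\sigma, \rho, a[e] \rightarrow \sigma', \rho', a[e']} \]
\[ \frac{a \notin \rho}{\sigma, \rho, a[n] \rightarrow \sigma, \rho \cup \{a\}, v} \]
\[ \frac{\sigma, \rho, e_1 \rightarrow \sigma', \rho', e_1'}{\sigma, \rho, \mathit{bop}\ e_1\ e_2 \rightarrow \sigma', \rho', \mathit{bop}\ e_1'\ e_2} \]
\[ \frac{\sigma, \rho, e_2 \rightarrow \sigma', \rho', e_2'}{\sigma, \rho, \mathit{bop}\ v_1\ e_2 \rightarrow \sigma', \rho', \mathit{bop}\ v_1\ e_2'} \]
\[ \frac{v_3 = v_1 \mathbin{\mathit{bop}} v_2}{\sigma, \rho, \mathit{bop}\ v_1\ v_2 \rightarrow \sigma, \rho, v_3} \]
\[ \frac{\sigma(x) = v}{\sigma, \rho, x \rightarrow \sigma, \rho, v} \]
\[ \boxed{\sigma, \rho, c \rightarrow \sigma', \rho', c'} \]
\[ \frac{\sigma, \rho, e_1 \rightarrow \sigma', \rho', e_1'}{\sigma, \rho, a[e_1] := e_2 \rightarrow \sigma', \rho', a[e_1'] := e_2} \]
\[ \frac{\sigma, \rho, e \rightarrow \sigma', \rho', e'}{\sigma, \rho, a[n] := e \rightarrow \sigma', \rho', a[n] := e'} \]
\[ \frac{a \notin \rho}{\sigma, \rho, a[n] := v \rightarrow \sigma[a[n] \mapsto v], \rho \cup \{a\}, \mathsf{skip}} \]
\[ \frac{\sigma, \rho, e \rightarrow \sigma', \rho', e'}{\sigma, \rho, \mathsf{let}\ x = e \rightarrow \sigma', \rho', \mathsf{let}\ x = e'} \]
\[ \sigma, \rho, \mathsf{let}\ x = v \rightarrow \sigma[x \mapsto v], \rho, \mathsf{skip} \]
\[ \frac{\sigma, \rho, c_1 \rightarrow \sigma', \rho', c_1'}{\sigma, \rho, c_1 ; c_2 \rightarrow \sigma', \rho', c_1' ; c_2} \qquad \sigma, \rho, \mathsf{skip} ; c_2 \rightarrow \sigma, \rho, c_2 \]
\[ \sigma, \rho, c_1 \;\; c_2 \rightarrow \sigma, \rho, c_1 \overset{\rho}{\sim} c_2 \]
\[ \frac{\sigma, \rho, c_1 \rightarrow \sigma', \rho', c_1'}{\sigma, \rho, c_1 \overset{\rho''}{\sim} c_2 \rightarrow \sigma', \rho', c_1' \overset{\rho''}{\sim} c_2} \]
\[ \frac{\sigma, \rho'', c_2 \rightarrow \sigma', \rho''', c_2'}{\sigma, \rho, \mathsf{skip} \overset{\rho''}{\sim} c_2 \rightarrow \sigma', \rho, \mathsf{skip} \overset{\rho'''}{\sim} c_2'} \]
\[ \sigma, \rho, \mathsf{skip} \overset{\rho''}{\sim} \mathsf{skip} \rightarrow \sigma, \rho \cup \rho'', \mathsf{skip} \]
\[ \frac{\sigma(x) = \mathsf{true}}{\sigma, \rho, \mathsf{if}\ x\ c_1\ c_2 \rightarrow \sigma, \rho, c_1} \]
\[ \frac{\sigma(x) = \mathsf{false}}{\sigma, \rho, \mathsf{if}\ x\ c_1\ c_2 \rightarrow \sigma, \rho, c_2} \]
\[ \sigma, \rho, \mathsf{while}\ x\ c \rightarrow \sigma, \rho, \mathsf{if}\ x\ (c \;\; \mathsf{while}\ x\ c)\ \mathsf{skip} \]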
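The small-step treatment of composition is easiest to see in isolation. The sketch below is an illustrative simplification: it drops σ entirely and replaces reads and writes with a hypothetical `touch` command that only records an access to a memory, so that the bookkeeping for ρ and for the annotation on the intermediate form is all that remains. The tags `seq`, `par`, and `inter`, and the helper `run`, are names assumed here.

```python
class Stuck(Exception):
    """Raised when no step rule applies."""


SKIP = ("skip",)


def step(rho, c):
    """Take one small step and return (rho', c'). sigma is omitted in this sketch."""
    tag = c[0]
    if tag == "touch":                     # stand-in for a[n] := v: needs a not in rho
        a = c[1]
        if a in rho:
            raise Stuck(f"memory {a} was already accessed")
        return rho | {a}, SKIP
    if tag == "par":                       # c1 ; c2
        _, c1, c2 = c
        if c1 == SKIP:
            return rho, c2
        rho1, c1_next = step(rho, c1)
        return rho1, ("par", c1_next, c2)
    if tag == "seq":                       # c1 c2 steps to the annotated form, saving rho
        _, c1, c2 = c
        return rho, ("inter", rho, c1, c2)
    if tag == "inter":
        _, saved, c1, c2 = c
        if c1 != SKIP:                     # c1 steps under the live rho
            rho1, c1_next = step(rho, c1)
            return rho1, ("inter", saved, c1_next, c2)
        if c2 != SKIP:                     # c2 steps under the saved rho''
            saved_next, c2_next = step(saved, c2)
            return rho, ("inter", saved_next, SKIP, c2_next)
        return rho | saved, SKIP           # both finished: merge the two access sets
    raise Stuck(f"no rule for {c!r}")


def run(rho, c):
    """Step c until it becomes skip and return the final rho."""
    while c != SKIP:
        rho, c = step(rho, c)
    return rho


# Usage: the same memory touched in two ordered steps, and in a single step.
ordered = ("seq", ("touch", "A"), ("touch", "A"))
unordered = ("par", ("touch", "A"), ("touch", "A"))
rho_ordered = run(frozenset(), ordered)
```

Here `ordered` runs to completion because the second `touch` steps under the saved annotation rather than the live ρ, while `run(frozenset(), unordered)` raises `Stuck`.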
To enforce Dahlia’s safety condition, the typing judgments
use the typing context Γ for variables and the affine context
∆ for memories.
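\[ \boxed{\Gamma, \Delta_1 \vdash e : \tau \dashv \Delta_2} \]
\[ \Gamma, \Delta \vdash v : \tau \dashv \Delta \]
\[ \frac{\Gamma, \Delta_1 \vdash e_1 : \tau \dashv \Delta_2 \qquad \Gamma, \Delta_2 \vdash e_2 : \tau \dashv \Delta_3 \qquad \mathit{bop} : \tau \rightarrow \tau \rightarrow \tau}{\Gamma, \Delta_1 \vdash \mathit{bop}\ e_1\ e_2 : \tau \dashv \Delta_3} \]
\[ \frac{\Gamma(x) = \tau}{\Gamma, \Delta_1 \vdash x : \tau \dashv \Delta_1} \]
\[ \frac{\Gamma, \Delta_1 \vdash e : \mathsf{bit}\langle n \rangle \dashv \Delta_2 \qquad \Delta_2 = \Delta_3 \cup \{a \mapsto \mathsf{mem}\ \tau[n_1]\}}{\Gamma, \Delta_1 \vdash a[e] : \tau \dashv \Delta_3} \]
\[ \boxed{\Gamma_1, \Delta_1 \vdash c \dashv \Gamma_2, \Delta_2} \]
\[ \Gamma, \Delta \vdash \mathsf{skip} \dashv \Gamma, \Delta \]
\[ \frac{\Gamma, \Delta_1 \vdash e_1 : \mathsf{bit}\langle n \rangle \dashv \Delta_2 \qquad \Gamma, \Delta_2 \vdash e_2 : \tau \dashv \Delta_3 \qquad \Delta_3 = \Delta_4 \cup \{a \mapsto \mathsf{mem}\ \tau[n_1]\}}{\Gamma, \Delta_1 \vdash a[e_1] := e_2 \dashv \Gamma, \Delta_4} \]
\[ \frac{\Gamma, \Delta_1 \vdash e : \tau \dashv \Delta_2 \qquad (x \mapsto \tau) \notin \Gamma}{\Gamma, \Delta_1 \vdash \mathsf{let}\ x = e \dashv \Gamma[x \mapsto \tau], \Delta_2} \]
\[ \frac{\Gamma_1, \Delta_1 \vdash c_1 \dashv \Gamma_2, \Delta_2 \qquad \Gamma_2, \Delta_2 \vdash c_2 \dashv \Gamma_3, \Delta_3}{\Gamma_1, \Delta_1 \vdash c_1 ; c_2 \dashv \Gamma_3, \Delta_3} \]
\[ \frac{\Gamma_1, \Delta_1 \vdash c_1 \dashv \Gamma_2, \Delta_2 \qquad \Gamma_2, \Delta_1 \vdash c_2 \dashv \Gamma_3, \Delta_3}{\Gamma_1, \Delta_1 \vdash c_1 \;\; c_2 \dashv \Gamma_3, \Delta_2 \cap \Delta_3} \]
\[ \frac{\Gamma_1, \Delta_1 \vdash c_1 \dashv \Gamma_2, \Delta_2 \qquad \Gamma_2, \bar{\rho} \vdash c_2 \dashv \Gamma_3, \Delta_3}{\Gamma_1, \Delta_1 \vdash c_1 \overset{\rho}{\sim} c_2 \dashv \Gamma_3, \Delta_2 \cap \Delta_3} \]
\[ \frac{\Gamma, \Delta_1 \vdash x : \mathsf{bool} \dashv \Delta_2 \qquad \Gamma, \Delta_2 \vdash c_1 \dashv \Gamma_2, \Delta_3 \qquad \Gamma, \Delta_2 \vdash c_2 \dashv \Gamma_3, \Delta_4}{\Gamma, \Delta_1 \vdash \mathsf{if}\ x\ c_1\ c_2 \dashv \Gamma, \Delta_2 \cap \Delta_3 \cap \Delta_4} \]
\[ \frac{\Gamma, \Delta_1 \vdash x : \mathsf{bool} \dashv \Delta_2 \qquad \Gamma, \Delta_2 \vdash c_1 \dashv \Gamma_2, \Delta_3 \qquad \Gamma, \Delta_3 \vdash c_2 \dashv \Gamma_3, \Delta_4}{\Gamma, \Delta_1 \vdash \mathsf{if}\ x\ c_1\ c_2 \dashv \Gamma, \Delta_4} \]
\[ \frac{\Gamma, \Delta_1 \vdash e : \tau \dashv \Delta_2 \qquad \Gamma(x) = \tau}{\Gamma, \Delta_1 \vdash x := e \dashv \Gamma, \Delta_2} \]
\[ \frac{\Gamma, \Delta_1 \vdash x : \mathsf{bool} \dashv \Delta_2 \qquad \Gamma, \Delta_2 \vdash c \dashv \Gamma_3, \Delta_3}{\Gamma, \Delta_1 \vdash \mathsf{while}\ x\ c \dashv \Gamma, \Delta_3 \cap \Delta_2} \]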
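The typing rules above thread the affine context ∆ through a program and remove a memory once it is used. A minimal Python sketch of such a checker follows, under several assumptions made only for illustration: types are plain strings, ∆ maps each memory name to its element type (sizes are ignored), the syntax uses the tuple encoding and tag names `seq` and `par` chosen here, only the first of the two `if` rules is implemented, and `AffineError`, `meet`, and `consume` are hypothetical helper names.

```python
class AffineError(Exception):
    """Raised when a program fails to type-check."""


def type_of_value(v):
    # bool is tested first because Python's bool is a subclass of int
    return "bool" if isinstance(v, bool) else "bit"


def meet(d1, d2):
    """Intersection of affine contexts: memories still available in both."""
    return {a: t for a, t in d1.items() if a in d2}


def consume(delta, a):
    """Return (element type of a, delta without a); fail if a is unavailable."""
    if a not in delta:
        raise AffineError(f"memory {a} is not available")
    rest = dict(delta)
    return rest.pop(a), rest


def check_expr(gamma, delta, e):
    """Return (type, delta') for expression e."""
    tag = e[0]
    if tag == "val":
        return type_of_value(e[1]), delta
    if tag == "var":
        if e[1] not in gamma:
            raise AffineError(f"unbound variable {e[1]}")
        return gamma[e[1]], delta
    if tag == "bop":                       # bop : tau -> tau -> tau
        t1, delta = check_expr(gamma, delta, e[2])
        t2, delta = check_expr(gamma, delta, e[3])
        if t1 != t2:
            raise AffineError("operand types differ")
        return t1, delta
    if tag == "read":                      # index must be bit, then a is consumed
        t, delta = check_expr(gamma, delta, e[2])
        if t != "bit":
            raise AffineError("index must have type bit")
        return consume(delta, e[1])
    raise AffineError(f"unknown expression {e!r}")


def check_cmd(gamma, delta, c):
    """Return (gamma', delta') for command c."""
    tag = c[0]
    if tag == "skip":
        return gamma, delta
    if tag in ("let", "assign"):
        _, x, e = c
        t, delta = check_expr(gamma, delta, e)
        if tag == "let" and x in gamma:
            raise AffineError(f"{x} is already bound")
        if tag == "assign" and gamma.get(x) != t:
            raise AffineError(f"{x} does not have type {t}")
        return {**gamma, x: t}, delta
    if tag == "write":
        _, a, e1, e2 = c
        t1, delta = check_expr(gamma, delta, e1)
        t2, delta = check_expr(gamma, delta, e2)
        elem, delta = consume(delta, a)
        if t1 != "bit" or t2 != elem:
            raise AffineError("ill-typed write")
        return gamma, delta
    if tag == "par":                       # c1 ; c2: c2 sees what c1 left over
        gamma2, delta2 = check_cmd(gamma, delta, c[1])
        return check_cmd(gamma2, delta2, c[2])
    if tag == "seq":                       # c1 c2: c2 is checked under the original delta
        gamma2, delta2 = check_cmd(gamma, delta, c[1])
        gamma3, delta3 = check_cmd(gamma2, delta, c[2])
        return gamma3, meet(delta2, delta3)
    if tag in ("if", "while"):             # branches and bodies start from the same delta
        if gamma.get(c[1]) != "bool":
            raise AffineError("condition must have type bool")
        out = delta
        for body in c[2:]:
            out = meet(out, check_cmd(gamma, delta, body)[1])
        return gamma, out
    raise AffineError(f"unknown command {c!r}")


# Usage: two reads of memory A, composed in each of the two ways.
delta0 = {"A": "bit"}
read0 = ("read", "A", ("val", 0))
read1 = ("read", "A", ("val", 1))
gamma_out, delta_out = check_cmd({}, delta0, ("seq", ("let", "x", read0), ("let", "y", read1)))
```

With these definitions the `seq` program is accepted, while replacing `seq` with `par` raises `AffineError` because the second read finds A already consumed.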
B Proof of soundness
If there exists a typing context Γ and affine memory con-
text ∆ under which a command c type-checks, and Γ,∆ is
equivalent to an environment σ and context ρ, then either
σ , ρ, c →∗ σ ′, ρ ′, skip or c diverges.
To prove this theorem, we will prove the supporting progress
and preservation lemmas (stated below), which together im-
ply soundness.
Supporting definitions
• Defined: a is defined in ∆ if ∃ τ, n with (a ↦ mem τ[n]) ∈
∆. x is defined in Γ if ∃ τ with (x ↦ τ) ∈ Γ.
• Type-check: If Γ,∆ ⊢ e : τ ⊣ ∆′ then e type-checks to
τ under Γ,∆ producing ∆′. If Γ,∆ ⊢ c ⊣ Γ′,∆′ then c
type-checks under Γ,∆ producing Γ′,∆′.
• ∼ (equivalence): Γ,∆ ∼ σ, ρ if
1. ∀ x with (x ↦ τ) ∈ Γ, ∃ v with (x ↦ v) ∈ σ and v
type-checks to τ under Γ,∆
2. ∀ l with (l ↦ τ) ∈ ∆, l ∉ ρ.
• ρ¯: ρ¯ = {a ↦ mem τ[n] ∈ ∆∗ ∧ a ∉ ρ} where ∆∗ is
the affine context of memories initially available to a
program.
• Construction: Γ,∆ can be constructed from σ, ρ if
1. ∀ (x ↦ v) ∈ σ, (x ↦ τ) ∈ Γ
2. ∀ l ∈ ρ, (l ↦ mem τ[n]) ∉ ∆
3. v type-checks to τ under Γ,∆
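The definitions of ρ¯ and of the equivalence ∼ can be stated as two small predicates. The following Python sketch assumes, for illustration only, that contexts are dictionaries, that memory types are reduced to an element-type string, and that values are typed by the hypothetical helper `type_of_value`.

```python
def type_of_value(v):
    # bool is tested first because Python's bool is a subclass of int
    return "bool" if isinstance(v, bool) else "bit"


def rho_bar(delta_star, rho):
    """Memories of the initial affine context delta_star that are not in rho."""
    return {a: t for a, t in delta_star.items() if a not in rho}


def equivalent(gamma, delta, sigma, rho):
    """Check Gamma, Delta ~ sigma, rho following the two conditions above."""
    # 1. every typed variable has a value of that type in sigma
    vars_ok = all(
        x in sigma and type_of_value(sigma[x]) == t for x, t in gamma.items()
    )
    # 2. no memory still available in delta has been accessed
    mems_ok = all(a not in rho for a in delta)
    return vars_ok and mems_ok


# Usage: after A has been accessed, only B remains available.
delta_star = {"A": "bit", "B": "bit"}
rho = frozenset({"A"})
remaining = rho_bar(delta_star, rho)
```

By construction, `rho_bar(delta_star, rho)` always satisfies the second condition of ∼ for the same ρ, which is why the intermediate composition form is checked under ρ¯.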
Supporting lemmas
• L1: If Γ,∆ ⊢ c ⊣ Γ′,∆′, then Γ ⊆ Γ′. Proof: The only
typing rule that modifies Γ is check_let. Under this
rule, Γ′ = Γ extended to add a mapping for a variable
x. There is no rule that removes a mapping from Γ. So
∀ m ∈ Γ, m ∈ Γ′.
• L2: If c type-checks under Γ,∆, then c type-checks
under Γ′,∆ where Γ ⊆ Γ′. Proof: The only typing rule
that reads Γ is check_update, which checks in its
premises that x is defined in Γ. By L1, if x is defined in
Γ it is defined in Γ′. There is also no rule that changes
the type τ of x in Γ, so x will have the same type τ in
Γ′.
• L3: If σ , ρ, e → σ ′, ρ ′, e ′, then σ = σ ′. Proof: There
is no step rule for expressions that extends σ , which
by the grammar is the only modification possible to
memory stores.
• L4: If σ, ρ, e → σ′, ρ′, e′ and ρ′ ≠ ρ, then e is a read
a[n] and ρ′ = ρ ∪ {a}. Proof: the only step rule for
expressions that adds elements to ρ is read2, by which
ρ′ = ρ ∪ {a}. There is no step rule that removes
elements from ρ.
B.1 Progress
If ∃ Γ,∆,σ , ρ such that Γ,∆ ∼ σ , ρ and command c type-
checks under Γ,∆, then either c is a value or
1. ∃ σ ′, ρ ′, c ′ with σ , ρ, c → σ ′, ρ ′, c ′ or
2. c = skip ρ∼ c2 with c2 ≠ skip and ∃ c′2, ρ′′ with
σ, ρ′′, skip ρ∼ c2 → σ′, ρ′′, skip ρ′∼ c′2.
B.1.1 Proof. Inductive hypothesis: Progress holds for sub-
forms of any inductive form. Assumptions: Γ,∆ ∼ σ , ρ and c
type-checks under Γ,∆.
Case: c is an expression.
• Case: c is a value. Progress holds by assumption.
• Case: c = bop e1 e2. For simplicity, we ignore the cases
in which bop is incompatible with the types of e1 and
e2. We have three possibilities:
1. Neither e1 nor e2 is a value. By assumption c type-
checks under Γ,∆, so by check_bop e1 type-checks
under Γ,∆. By the inductive hypothesis, σ , ρ, e1 →
σ ′, ρ ′, e ′1 so we have σ , ρ, bop e1 e2 → bop e ′1 e2 as
needed.
2. Only e1 is a value v1. By assumption c type-checks
under Γ,∆ and Γ,∆ ⊢ v1 ⊣ ∆, so e2 type-checks
under Γ,∆. By the inductive hypothesis, σ, ρ, e2 →
σ′, ρ′, e′2 so σ, ρ, bop v1 e2 → σ′, ρ′, bop v1 e′2.
3. Both e1 and e2 are values v1 and v2. σ, ρ, bop v1 v2 →
σ, ρ, v1 bop v2 by bop3.
• Case: c = x. By assumption x type-checks under Γ,∆,
so x is defined in Γ. Then ∃ v with (x ↦ v) ∈ σ, so
σ, ρ, x → σ, ρ, v by var.
• Case: c = a[e]. By assumption a[e] type-checks under
Γ,∆, so by check_read, e type-checks under Γ,∆ producing
∆2, and a is defined in ∆2. By the inductive
hypothesis, progress holds for e, so σ, ρ, e → σ′, ρ′, e′
or e is a value n. a is defined in ∆2, so it must be defined
in ∆ since there is no type-checking rule for expressions
by which Γ,∆ ⊢ e ⊣ ∆2 and ∃ l ∉ ∆, l ∈ ∆2. So a ∉ ρ
by the definition of ∼.
So if e is a value, then σ, ρ, a[e] → σ, ρ ∪ {a}, σ(a)(n).
If e is not a value, then σ, ρ, a[e] → σ′, ρ′, a[e′].
Case: c = let x = e. By assumption this form type-checks under
Γ,∆. By check_let, e type-checks under Γ,∆. By the inductive
hypothesis, e is either a value v or σ, ρ, e → σ′, ρ′, e′.
In the first case we have σ, ρ, let x = v → σ[x ↦ v], ρ, skip.
In the second case we have σ, ρ, let x = e → σ′, ρ′, let x = e′.
Case: c = c1 c2. ∀ σ , ρ, σ , ρ, c1 c2 → σ , ρ, c1 ρ∼ c2.
Case: c = c1 ρ′′∼ c2. We have three possibilities:
• c1 ≠ skip. By assumption c type-checks under Γ,∆. By
check_inter_seq_comp, c1 type-checks under Γ,∆.
By the inductive hypothesis, σ, ρ, c1 → σ′, ρ′, c′1 and
so c1 ρ′′∼ c2 → c′1 ρ′′∼ c2.
• c = skip ρ∼ c2. By assumption c type-checks under
Γ,∆. By check_inter_seq_comp, c2 type-checks under
Γ, ρ¯. By the inductive hypothesis (and the definition
of ∼, under which we have ρ), σ, ρ, c2 → σ′, ρ′, c′2,
so we have σ, ρ′′, skip ρ∼ c2 → σ′, ρ′′, skip ρ′∼ c′2.
• If c1 = c2 = skip, ∀ σ, ρ, we have σ, ρ, c1 ρ′′∼ c2 →
σ, ρ ∪ ρ′′, skip.
Case: c = c1; c2. We have two possibilities:
• c1 ≠ skip. By assumption, c type-checks under Γ,∆,
so c1 type-checks under Γ,∆ by check_par_comp. By
the inductive hypothesis, σ, ρ, c1 → σ′, ρ′, c′1, so
σ, ρ, c1; c2 → σ′, ρ′, c′1; c2.
• c1 = skip. ∀ σ , ρ, we have σ , ρ, skip; c2 → σ , ρ, c2.
Case: c = if x c1 c2. By assumption, c type-checks under Γ,∆.
Then x type-checks to bool, so x is either true or false. ∀ σ, ρ,
we have σ, ρ, if true c1 c2 → σ, ρ, c1 and σ, ρ, if false c1 c2 →
σ, ρ, c2.
Case: c = while x c1. ∀ σ , ρ we have
σ , ρ,while x c1 → σ , ρ, if x (c1 while x c1) skip.
Case: c = x := e. If e is a value v, then ∀ σ, ρ we have σ, ρ, x :=
e → σ[x ↦ v], ρ, skip. If e is not a value, then by the
assumption that c type-checks under Γ,∆, e type-checks under
Γ,∆ by check_update. Then by the inductive hypothesis,
σ, ρ, e → σ′, ρ′, e′, so σ, ρ, x := e → σ′, ρ′, x := e′.
Case: c = a[e1] := e2. We have three possibilities:
• e1 and e2 are values n and v. By assumption c type-checks
under Γ,∆, so by check_write, a is defined in
∆. By definition of ∼, a ∉ ρ, so the premise of write3
is satisfied. Then σ, ρ, c → σ[a[n] ↦ v], ρ ∪ {a}, skip.
• e1 is a value n. By assumption c type-checks under Γ,∆,
so by check_write n type-checks under Γ,∆ producing
∆ and e2 type-checks under Γ,∆. By the inductive
hypothesis, σ, ρ, e2 → σ′, ρ′, e′2, so σ, ρ, a[n] :=
e2 → σ′, ρ′, a[n] := e′2.
• Neither e1 nor e2 is a value. By assumption c type-checks
under Γ,∆, and so does e1. By the inductive
hypothesis, σ, ρ, e1 → σ′, ρ′, e′1, so σ, ρ, a[e1] := e2 →
σ′, ρ′, a[e′1] := e2.
B.2 Preservation
If:
1. ∃ Γ,∆ such that command c type-checks under Γ,∆
2. ∃ σ , ρ with Γ,∆ ∼ σ , ρ
3. ∃ σ′, ρ′, c′ with σ, ρ, c → σ′, ρ′, c′ or ∃ σ′, ρ′, ρ′′, c′2 with
c = skip ρ∼ c2 and σ, ρ′′, skip ρ∼ c2 → σ′, ρ′′, skip ρ′∼ c′2
then Γ′,∆′ can be constructed from σ ′, ρ ′ such that c ′
type-checks under Γ′,∆′.
B.2.1 Proof. Inductive hypothesis: Preservation holds for
sub-forms of any inductive form. Assumptions: 1., 2., 3.
Case: c is an expression.
• Case: c is a value. c does not step, so preservation vac-
uously holds.
• Case: c = bop e1 e2. For simplicity, we ignore the cases
in which bop is incompatible with the types of e1 and
e2. We have three possibilities:
1. e1 is not a value. By assumption, c type-checks under
Γ,∆ and σ, ρ, c → σ′, ρ′, bop e′1 e2. So σ, ρ, e1 →
σ′, ρ′, e′1. By check_bop Γ,∆ ⊢ e1 ⊣ ∆2. By the inductive
hypothesis, Γ′,∆′ ⊢ e′1 ⊣ ∆′2. From L3, σ′ = σ, so
Γ′ = Γ. If ∆′ = ∆, then e2 type-checks under Γ′,∆′2
and we are done. If ∆′ ≠ ∆, then ρ′ ≠ ρ. So by L4
e1 was a read a[n]. By assumption and check_bop
Γ,∆ ⊢ e1 ⊣ ∆2 and e2 type-checks under Γ,∆2. Since
e1 = a[n], a could not have been defined in ∆2. e′1
must be a value, so Γ′,∆′ ⊢ v ⊣ ∆′, so ∆′ = ∆2. So e2
must type-check under Γ′,∆′.
2. e1 is a value v1. Then by assumption, σ, ρ, bop v1 e2 →
σ′, ρ′, bop v1 e′2. By assumption, c type-checked under
Γ,∆, so e2 type-checks under Γ,∆ by check_bop.
By the inductive hypothesis, e′2 type-checks under
Γ′,∆′, and values always type-check, so we are done.
3. Both e1 and e2 are values v1 and v2. By assumption,
σ, ρ, c → σ, ρ, v1 bop v2. Values always type-check,
so we are done.
• Case: c = x. By assumption x type-checks under Γ,∆,
so (x ↦ τ) ∈ Γ and (x ↦ v) ∈ σ, and by assumption
σ, ρ, x → σ, ρ, v. Values always type-check, so we are
done.
• Case: c = a[e]. The first possibility is that e is not a
value. Then by assumption σ, ρ, a[e] → σ′, ρ′, a[e′],
and so σ, ρ, e → σ′, ρ′, e′. We need to show that a[e′]
type-checks under Γ′,∆′. By the inductive hypothesis,
e′ type-checks under Γ′,∆′. To satisfy the second
premise, it should be that a is defined in ∆′. By the
assumption that a[e] type-checked, we know from
check_read that a is defined in ∆ and so a ∉ ρ (by
definition of ∼). If a was not defined in ∆′, that would
mean a ∈ ρ′, but by L4 this would mean that e was
a read a[n], meaning a is not defined in ∆2 where
Γ,∆ ⊢ e ⊣ ∆2 and c could not type-check under
Γ,∆, which is a contradiction. So a must be defined
in ∆′ and so a[e′] must type-check under Γ′,∆′.
The second possibility is that e is a value n. Then
σ, ρ, a[n] → σ, ρ ∪ {a}, v and since values always
type-check we are done.
Case: c = let x = e. The first possibility is that e is not a value.
By assumption c type-checks under Γ,∆. By check_let, so
does e. By assumption σ, ρ, let x = e → σ′, ρ′, let x = e′,
so σ, ρ, e → σ′, ρ′, e′. By the inductive hypothesis, e′ type-checks
under Γ′,∆′. Then we have that let x = e′ type-checks
under Γ′,∆′, so we are done. The second possibility
is that e is a value v. In this case σ, ρ, let x = v → σ[x ↦
v], ρ, skip. skip always type-checks, so we are done.
Case: c = c1 c2. By assumption, σ, ρ, c1 c2 → σ, ρ, c1 ρ∼
c2 and c type-checks under Γ,∆. σ′, ρ′ = σ, ρ, so Γ′ = Γ and
∆′ = ∆. By assumption Γ,∆ ⊢ c1 ⊣ Γ2,∆2 and c2 type-checks
under Γ2,∆. We need to show that c2 type-checks under Γ2, ρ¯.
Since c2 type-checks under Γ2,∆, it does not use any mem-
ories in ρ by definition of ∼. So it must type-check under
Γ2, ρ¯.
Case: c = c1 ρ′′∼ c2. We have three possibilities.
• Neither c1 nor c2 is skip. In this case, we have σ, ρ, c1 ρ′′∼
c2 → σ′, ρ′, c′1 ρ′′∼ c2 and σ, ρ, c1 → σ′, ρ′, c′1. From
assumption, c type-checks under Γ,∆, so 1) c2 type-checks
under ρ¯′′ and 2) by the inductive hypothesis,
c′1 type-checks under the constructed Γ′,∆′. We need
to show c2 type-checks under Γ′, ρ¯′′. By L2, if c2 type-checks
under Γ, ρ¯′′, it will type-check under Γ′, ρ¯′′, so
we are done.
• c1 = skip ≠ c2. We have that σ, ρ′′, skip ρ∼ c2 →
σ′, ρ′′, skip ρ′∼ c′2, so σ, ρ, c2 → σ′, ρ′, c′2. By the inductive
hypothesis, c′2 type-checks under the constructed
Γ′,∆′ (so it will type-check under Γ′, ρ¯′) and skip always
type-checks, so we are done.
• c1 = c2 = skip. This form steps to skip, which always
type-checks, so we are done.
Case: c = c1; c2. We have two possibilities:
• c1 ≠ skip. By assumption σ, ρ, c → σ′, ρ′, c′, so σ, ρ, c1 →
σ′, ρ′, c′1. We have by assumption that c1 type-checks
under Γ,∆ to produce Γ2,∆2, and c2 type-checks under
Γ2,∆2. By the inductive hypothesis, c′1 type-checks
under Γ′,∆′ to produce Γ′2,∆′2. We need to show that
c2 type-checks under Γ′2,∆′2. We have two possibilities:
ρ′ = ρ and ρ′ ≠ ρ. Consider the first possibility. We
would have ∆ = ∆′, so ∆2 = ∆′2. By L2, c2 type-checks
under Γ′,∆′2 as needed. With the second possibility, it
can only be that ρ ⊂ ρ′. There are then only two cases
to consider:
1. c1 contained a read or write involving a[n] and c′1
is a value v. Then ρ′ = ρ ∪ {a} and a is not defined
in ∆′. By check_write and check_read, a could
not have been defined in ∆2. Since Γ′,∆′ ⊢ v ⊣ ∆′,
∆2 = ∆′. So c2 must type-check under Γ′,∆′.
2. c1 = skip ρ′′∼ skip. By inter_seq3 c′1 = skip and
σ′ = σ, so Γ′ = Γ. By assumption c2 type-checks under
Γ2,∆2 where Γ,∆ ⊢ c1 ⊣ Γ2,∆2. By check_inter_seq_comp
Γ2 = Γ.
We need to show c2 type-checks under Γ′2 ,∆′2 =
Γ′,∆′ = Γ,∆′ (since Γ,∆ ⊢ skip ⊣ Γ,∆). For this
to be the case, c2 cannot use any memories in ρ or
ρ ′′ (by definition of construction).
1) Because Γ,∆ ∼ σ , ρ and c1 type-checks under
Γ,∆, c1 does not use any memories in ρ. Then by
assumption that c2 type-checks under Γ2,∆2 and by
check_par_comp, c2 also cannot use any memories
in ρ.
2) By assumption c1 type-checks under Γ,∆ to pro-
duce Γ2,∆2. By check_inter_seq_comp and small_seq
∆2 ⊆ ρ¯ ′′, and by assumption c2 type-checks under
Γ2,∆2, so c2 does not use any memories in ρ ′′. So c2
type-checks under Γ′,∆′ as needed.
• c1 = skip. By assumption skip; c2 type-checks under
Γ,∆, so c2 type-checks under Γ,∆. σ , ρ, skip; c2 →
σ , ρ, c2 so Γ′ = Γ and ∆′ = ∆. Then we need to show
c2 type-checks under Γ′,∆′ = Γ,∆, which we have
from assumption, so we are done.
Case: c = if x c1 c2. By assumption c type-checks under Γ,∆,
so c1 and c2 both type-check under Γ,∆, and x is either
true or false by check_if. If true, σ , ρ, c → σ , ρ, c1. If false,
σ , ρ, c → σ , ρ, c2. We need to show that c1 and c2 type-check
under Γ,∆ (σ ′, ρ ′ = σ , ρ, so Γ′,∆′ = Γ,∆). This is given by
assumption so we are done.
Case: c = while x c1. By assumption c type-checks under
Γ,∆, so by check_while x type-checks to bool and c1 type-
checks under Γ,∆ to produce Γ2,∆2. We need to show that
if x (c1 while x c1) skip type-checks under Γ,∆ (σ ′, ρ ′ =
σ , ρ, so Γ′,∆′ = Γ,∆). For this, it should be the case that x
type-checks to bool. This is already given. It should also
be the case that (c1 while x c1) type-checks under Γ,∆.
This requires that c1 type-checks under Γ,∆, which is given
by assumption, and that while x c1 type-checks under Γ2,∆.
By L2 if while x c1 type-checks under Γ,∆ (which it does by
assumption), it type-checks under Γ2,∆, so we are done.
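The obligations in the while case can be laid out step by step. The block below only restates the argument above in LaTeX notation and adds no new rule; output contexts are omitted wherever the prose omits them.

```latex
\begin{align*}
\text{Step:}\quad
  & \sigma, \rho, \mathtt{while}\ x\ c_1 \;\to\;
    \sigma, \rho, \mathtt{if}\ x\ (c_1\ \mathtt{while}\ x\ c_1)\ \mathtt{skip} \\
\text{Given (check\_while):}\quad
  & \Gamma, \Delta \vdash x : \mathtt{bool}, \qquad
    \Gamma, \Delta \vdash c_1 \dashv \Gamma_2, \Delta_2 \\
\text{By L2:}\quad
  & \Gamma, \Delta \vdash \mathtt{while}\ x\ c_1
    \;\Longrightarrow\;
    \Gamma_2, \Delta \vdash \mathtt{while}\ x\ c_1 \\
\text{Hence:}\quad
  & \Gamma, \Delta \vdash \mathtt{if}\ x\ (c_1\ \mathtt{while}\ x\ c_1)\ \mathtt{skip}
\end{align*}
```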
Case: c = x := e. By assumption c type-checks under Γ,∆, so
(x ↦ τ) ∈ Γ and e type-checks under Γ,∆ producing ∆2. We
have two possibilities:
• e is not a value. By assumption and check_update,
σ, ρ, x := e → σ′, ρ′, x := e′ and σ, ρ, e → σ′, ρ′, e′.
By the inductive hypothesis, e′ type-checks under the
constructed Γ′,∆′. Then by L1 and L2, if Γ,∆ ⊢ e : τ ⊣
∆2 then Γ′,∆′ ⊢ e′ : τ ⊣ ∆′2. So (x ↦ τ) ∈ Γ′ and c′
type-checks under Γ′,∆′ as needed.
• e is a value v. By assumption x := v type-checks under
Γ,∆, so Γ,∆ ⊢ v : τ ⊣ ∆ and (x ↦ τ) ∈ Γ. σ, ρ, x :=
v → σ[x ↦ v], ρ, skip, and skip always type-checks,
so we are done.
Case: c = a[e1] := e2. By assumption c type-checks under
Γ,∆, so e1 type-checks under Γ,∆ producing ∆2 and e2 type-
checks under Γ,∆2 by check_write. Additionally, a is de-
fined in ∆ and ∆2 so neither e1 nor e2 use memory a. We
then have three possibilities:
• Neither e1 nor e2 is a value. Then σ , ρ,a[e1] := e2 →
σ ′, ρ ′,a[e ′1] := e2 and σ , ρ, e1 → σ ′, ρ ′, e ′1. By the
inductive hypothesis, e ′1 type-checks under the con-
structed Γ′,∆′ to produce ∆′2. Either ρ ′ = ρ or not. If
ρ ′ = ρ, then by L2, e2 will type-check under Γ′,∆′2
since Γ ⊆ Γ′ and ∆′ = ∆. If not, then by L4 e1 is a read
a1[n]. By assumption and check_write Γ,∆ ⊢ e1 ⊣ ∆2
and e2 type-checks under Γ,∆2. Since e1 = a1[n],a1
could not have been defined in ∆2. e ′1 then must be
a value, so Γ′,∆′ ⊢ v ⊣ ∆′, so ∆2 = ∆′. So e2 must
type-check under Γ′,∆′.
• e1 is a value n and e2 is not a value. Then σ, ρ, a[e1] :=
e2 → σ′, ρ′, a[e1] := e′2 and σ, ρ, e2 → σ′, ρ′, e′2. By
the inductive hypothesis, e′2 type-checks under Γ′,∆′.
e1 is a value and always type-checks, so we are done.
• e1 is a value n and e2 is a value v . Assuming this type-
checks, we step to skip, which always type-checks, so
we are done.
C gemm-blocked Design Space Exploration
Figure 10 lists the parameterized code for the gemm-blocked
exhaustive design space exploration discussed in the evalua-
tion. The code is adapted from MachSuite [49].
D MachSuite Ports
We ported 16 MachSuite benchmarks to Dahlia and compared
their resource usage against the baseline implementations.
Both the rewritten and the baseline implementations went
through a full synthesis flow targeting Xilinx’s UltraScale+
VU9P. We present the comparison in Figures 11a–11f. The
benchmarks highlighted in red completed synthesis but
failed their correctness checks because of a miscompilation
from the Vivado toolchain.
The graphs show that most of the benchmarks perform
identically when rewritten in Dahlia. This is because Dahlia
generates C++, which goes through the same synthesis flow
as the baseline implementations.
E Spatial Experiments
We perform a simple design sweep over a general matrix
multiply kernel written in Spatial and vary the unrolling
factor from 1 to 16.
We run the generated designs through a full synthesis
flow by targeting Xilinx’s Zynq-7000 SoC [? ]. We extended
ap_int<32> m1[128][128];
#pragma HLS resource variable=m1 core=RAM_1P_BRAM
#pragma HLS ARRAY_PARTITION variable=m1 cyclic factor=::BANK11:: dim=1
#pragma HLS ARRAY_PARTITION variable=m1 cyclic factor=::BANK12:: dim=2
ap_int<32> m2[128][128];
#pragma HLS resource variable=m2 core=RAM_1P_BRAM
#pragma HLS ARRAY_PARTITION variable=m2 cyclic factor=::BANK11:: dim=1
#pragma HLS ARRAY_PARTITION variable=m2 cyclic factor=::BANK12:: dim=2
ap_int<32> prod[128][128];
#pragma HLS resource variable=prod core=RAM_1P_BRAM
#pragma HLS ARRAY_PARTITION variable=prod cyclic factor=::BANK21:: dim=1
#pragma HLS ARRAY_PARTITION variable=prod cyclic factor=::BANK22:: dim=2
for (int jj = 0; jj < 16; jj++) {
  for (int kk = 0; kk < 16; kk++) {
    for (int i = 0; i < 128; i++) {
      #pragma HLS UNROLL factor=::UNROLL1:: skip_exit_check
      for (int j = 0; j < 8; j++) {
        #pragma HLS UNROLL factor=::UNROLL2:: skip_exit_check
        for (int k = 0; k < 8; k++) {
          #pragma HLS UNROLL factor=::UNROLL3:: skip_exit_check
          ap_int<32> mul = m1[i][8 * kk + k] * m2[8 * kk + k][8 * jj + j];
          prod[i][8 * jj + j] += mul;
        }
      }
    }
  }
}
Figure 10. gemm-blocked kernel used for exhaustive DSE. The highlighted tokens indicate parameters for exploration.
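The `::NAME::` tokens in Figure 10 are placeholders that a design space exploration driver fills in before invoking HLS. The following is a minimal Python sketch of such a driver, not the tooling used for the evaluation: the helper names `instantiate` and `design_space` and the value ranges in `CHOICES` are assumptions made for illustration.

```python
import itertools
import re

# Hypothetical value ranges, for illustration only; the ranges explored in
# the evaluation are not restated in this appendix.
CHOICES = {
    "BANK11": [1, 2, 4],
    "UNROLL1": [1, 2, 4, 8],
}

# Matches placeholders of the form ::NAME:: as they appear in Figure 10.
TOKEN = re.compile(r"::([A-Z0-9]+)::")


def instantiate(template, binding):
    """Replace every ::NAME:: token in the template with binding[NAME]."""
    return TOKEN.sub(lambda m: str(binding[m.group(1)]), template)


def design_space(choices):
    """Yield one {name: value} binding per point in the cross product."""
    names = sorted(choices)
    for values in itertools.product(*(choices[name] for name in names)):
        yield dict(zip(names, values))


if __name__ == "__main__":
    line = "#pragma HLS ARRAY_PARTITION variable=m1 cyclic factor=::BANK11:: dim=1"
    for binding in design_space(CHOICES):
        print(binding, "->", instantiate(line, binding))
```

Because the space is a plain cross product, its size is the product of the number of choices per parameter, which is why exhaustive exploration becomes expensive as parameters are added.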
[Figure 11 panels: bar charts comparing the rewrite and baseline implementations for each kernel (aes, bfs-bulk, bfs-queue, fft-strided, gemm-blocked, gemm-ncubed, kmp, md-grid, md-knn, nw, sort-merge, sort-radix, spmv-crs, spmv-ellpack, stencil-stencil2d, stencil-stencil3d).]
(a) Comparison of BRAMs used.
(b) Comparison of DSPs used.
(c) Comparison of LUT Mems used.
(d) Comparison of LUTs used.
(e) Comparison of Registers used.
(f) Comparison of runtime averages (ms).
Figure 11. Resource usage comparison between baseline MachSuite implementations in Vivado HLS and rewrite implementations
in Dahlia. The kernel names are on the X-axis and the resource usage is on the Y-axis. Excludes backprop (functionally
incorrect), fft-transpose, and viterbi (mis-synthesized by Vivado HLS).
@spatial object GEMM_NCubed_16 extends SpatialApp {
  type T = FixPt[TRUE,_16,_16]
  def main(args: Array[String]): Unit = {
    // Matrix dimension; matches the 128x128 DRAM allocations below.
    val dim = 128
    val (a_dram, b_dram, c_dram) = (DRAM[T](128,128), DRAM[T](128,128), DRAM[T](128,128))
    // Generate random data for the input matrices.
    val (a_data, b_data) = ((0::dim,0::dim){(i,j) => random[T](5)}, (0::dim,0::dim){(i,j) => random[T](5)})
    // Set data in the input matrices.
    setMem(a_dram, a_data)
    setMem(b_dram, b_data)
    // Accelerator kernel.
    Accel {
      val (a_sram, b_sram, c_sram) = (SRAM[T](dim,dim), SRAM[T](dim,dim), SRAM[T](dim,dim))
      // Load data into memories.
      a_sram.load(a_dram)
      b_sram.load(b_dram)
      // Computation loop.
      Foreach(dim by 1) { i => Foreach(dim by 1) { j =>
        // DSE parameter: Innermost loop parallelism. Try values from [1, 16].
        val sum = Reduce(Reg[T](0))(dim by 1 par ::UNROLL::) { k => a_sram(i,k) * b_sram(k,j) }{_+_}
        c_sram(i,j) = sum
      }}
      c_dram store c_sram
    }
    // Compute the expected value of the computation.
    val c_gold = (0::dim,0::dim){(i,j) => Array.tabulate(dim){k => a_data(i,k) * b_data(k,j)}.reduce{_+_}}
    // Check that the computed values are within some range of the expected value.
    val c_result = getMatrix(c_dram)
    val cksum = c_gold.zip(c_result){(a,b) => abs(a-b) < 0.5.to[T]}.reduce{_&&_}
    // Fail if the computed value is different.
    assert(cksum)
  }
}
Figure 12. Kernel used to collect resource utilization numbers for the Spatial evaluation.
the Spatial quick-start template to generate designs for the
DSE comparison [? ]. We were unable to get Spatial designs
to pass through the AWS F1-based synthesis flow due to
numerous issues [? ? ? ].
Figure 13a shows the banking factors inferred by Spatial
for a given unrolling factor. Figure 13b plots the resource
usage normalized against the Spatial design with no unrolling.
When Spatial’s inferred banking decisions are not aligned
with the unrolling factor, resource usage increases
abruptly.
We also plot the absolute LUT, DSP, REG, and BRAM usage
against the unrolling factor in Figures 13c–13f. The designs use
significantly fewer LUTs when the unrolling factor evenly
divides the size of the memory. Furthermore, Spatial designs use
up to 10× more LUTs and 2× more DSPs than the equivalent
designs generated by Dahlia’s Vivado HLS backend.
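The divisibility condition mentioned above is easy to enumerate for this sweep. The sketch below assumes a memory dimension of 128 (the matrix size in Figure 12) and the unrolling range 1 to 16 used in the sweep; the function name is hypothetical, and the check is only the divisibility test, not Spatial's banking inference.

```python
def unroll_factors_dividing(memory_size, max_unroll):
    """Return the unrolling factors in [1, max_unroll] that evenly divide memory_size."""
    return [u for u in range(1, max_unroll + 1) if memory_size % u == 0]


if __name__ == "__main__":
    # 128 = 2**7, so only powers of two divide it evenly.
    print(unroll_factors_dividing(128, 16))  # [1, 2, 4, 8, 16]
```

Only 5 of the 16 swept unrolling factors satisfy the condition for a 128-element dimension.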
Spatial uses automated design space exploration tools to
find optimal parameters for the design. Dahlia’s type system
can be used to eliminate unpredictable design points and
drastically reduce the search space that such automated
tools must explore.
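To illustrate how a divisibility-style filter shrinks a search space, the sketch below prunes a grid of (banking factor, unrolling factor) pairs. The acceptance rule used here (the banking factor divides the memory size, and the unrolling factor divides the banking factor) is a simplified assumption standing in for Dahlia's type checker, which this appendix does not restate; the function names are hypothetical.

```python
from itertools import product


def is_accepted(size, bank, unroll):
    """Simplified stand-in rule (an assumption, not Dahlia's full type system):
    the banking factor must divide the memory size, and the unrolling factor
    must divide the banking factor."""
    return size % bank == 0 and bank % unroll == 0


def prune(size, banks, unrolls):
    """Keep only the (bank, unroll) pairs that the stand-in rule accepts."""
    return [(b, u) for b, u in product(banks, unrolls) if is_accepted(size, b, u)]


if __name__ == "__main__":
    banks = range(1, 17)
    unrolls = range(1, 17)
    kept = prune(128, banks, unrolls)
    print(len(kept), "of", len(banks) * len(unrolls), "points kept")
```

Under this assumed rule, a 16 by 16 grid for a 128-element dimension shrinks from 256 points to 15, which gives an automated search tool far fewer designs to synthesize.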
[Figure 13 panels: each plot has the unrolling factor on the X-axis.]
(a) Banking decision Spatial made given the unrolling factors. Y-axis: banking decisions; series: input matrix a, input matrix b.
(b) Resource usage normalized against no unrolling. Series: DSP used, BRAM used, LUT used.
(c) Absolute DSP usage. Predictable points highlighted.
(d) Absolute REG usage. Predictable points highlighted.
(e) Absolute LUT usage. Predictable points highlighted. Legend: unpredictable points, predictable points.
(f) Absolute BRAM usage. Predictable points highlighted.
Figure 13. Resource utilization for gemm-ncubed design in Spatial on a Zynq-7000. Absolute resource utilization shows
extreme variation between adjacent design points.
