Extending High-Level Synthesis for Task-Parallel Programs
Yuze Chi, Licheng Guo, Young-kyu Choi, Jie Wang, Jason Cong
{chiyuze,lcguo,ykchoi,jiewang,cong}@cs.ucla.edu
University of California, Los Angeles
ABSTRACT
C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly
popular for field-programmable gate array (FPGA) accelerators in many
application domains in recent years, thanks to its competitive quality of
result (QoR) and short development cycle compared with the traditional
register-transfer level (RTL) design approach. Yet, limited by the
sequential C semantics, it remains challenging to adopt the same highly
productive high-level programming approach in many other application
domains, where coarse-grained tasks run in parallel and communicate with
each other at a fine-grained level. While current HLS tools support
task-parallel programs, productivity is greatly limited in the code
development, correctness verification, and QoR tuning cycles, due to poor
programmability, restricted software simulation, and slow code generation,
respectively. Such limited productivity often defeats the purpose of HLS
and hinders programmers from adopting HLS for task-parallel FPGA
accelerators.
In this paper, we extend the HLS C++ language and present a fully
automated framework with programmer-friendly interfaces, universal
software simulation, and fast code generation to overcome these
limitations. Experimental results based on a wide range of real-world
task-parallel programs show that, on average, the lines of kernel and host
code are reduced by 22% and 51%, respectively, which considerably improves
the programmability. The correctness verification and the iterative QoR
tuning cycles are accelerated by 3.2× and 6.8×, respectively.
1 INTRODUCTION
C/C++/OpenCL-based high-level synthesis (HLS) [15] has been adopted
rapidly by both academia and industry for programming field-programmable
gate array (FPGA) accelerators in many application domains, e.g., machine
learning [16, 62], scientific computing [40, 68], and image processing
[10, 52]. Compared with the traditional register-transfer level (RTL)
paradigm (Figure 1), where programmers often spend tens of minutes just to
verify the correctness of a code modification, HLS lets programmers follow
a rapid development cycle (Figure 2). Programmers can write code in C and
leverage fast software simulation to verify the functional correctness.
Such a correctness verification cycle can take as little as 1 second,
allowing functionalities to be iterated at a fast pace. Once the HLS code
is functionally correct, programmers can then generate RTL code, evaluate
the quality of result (QoR) based on the generated performance and
resource reports, and modify the HLS code accordingly. Such a QoR tuning
cycle typically takes only a few minutes. Thanks to the advances in HLS
scheduling algorithms [6, 7, 19, 28, 30] and timing optimizations
[5, 27, 35], HLS can not only shorten the development cycle, but also
generate programs that are often competitive in cycle count [17], and more
recently in clock frequency as well [27]. Moreover, FPGA vendors provide
host drivers and communication interfaces for kernels designed in HLS
[31, 63], further reducing programmers’ burden to integrate and offload
workload to FPGA accelerators.
[Figure 1 diagram, boxes: RTL code; correctness verification and QoR
tuning via RTL simulation (~tens of minutes); logic synthesis and
implementation (~hours); hardware bitstream; on-board execution.]
Figure 1: FPGA accelerator development flow without HLS. Programmers
often spend tens of minutes after code modification to evaluate the
correctness and quality of result.
[Figure 2 diagram, boxes: HLS C++ code (w/ pragmas); correctness
verification via software simulation (~seconds); high-level synthesis
(~minutes); RTL code & HLS report; QoR tuning based on HLS report; logic
synthesis and implementation (~hours); hardware bitstream; on-board
execution.]
Figure 2: FPGA accelerator development flow with HLS. Programmers spend
seconds after code modification to verify the correctness. Quality of
result can usually be obtained in less than 10 minutes from the HLS
report.
arXiv:2009.11389v1 [cs.AR] 23 Sep 2020
However, not all programs are created equal for HLS. Data-parallel
programs can be easily programmed following the sequential C semantics
with HLS-specific compiler directives (i.e., “pragma”). The HLS compiler
can then leverage the directives to extract the parallelism automatically
via static dependency analysis. This enables such applications to be
quickly designed and iterated in the fast correctness verification cycle
and QoR tuning cycle, as shown in Figure 2. However, task-parallel
programs are not supported by the native C semantics, and the
productivity provided by current HLS tools is greatly limited for the
following reasons:
• Poor programmability. Due to the lack of convenient application
programming interfaces (API), programmers are often forced to write more
code than they have to. For example, a network switch needs to forward
packets based on their content and the availability of output ports.
Without an API to read packets from the ports without consuming them
(a.k.a. “peek”), programmers have to manually and carefully create a
buffer and maintain a small state machine to keep track of incoming
packets. This not only lengthens the development cycle, but also makes
the code error-prone.
• Restricted software simulation. As the key to fast correctness
verification, software simulation is not always available to
task-parallel programs. For example, Vivado HLS does not support
debugging Cannon’s algorithm [44] via software simulation because of the
existence of feedback loops in data paths, while Intel OpenCL does not
support more than 256 concurrent kernels [31] in software simulation.
Lack of fast software simulation forces programmers to resort to RTL
simulation for correctness verification, significantly lengthening the
development cycle.
• Slow code generation. We found that current HLS compilers view
task-parallel code as a monolithic design and process each instance of
the same task as if the instances were different. For designs that
instantiate the same task multiple times (e.g., in a systolic array),
this leads to repetitive compilation of each task and unnecessarily slows
down code generation. One may argue that programmers can manually
synthesize tasks separately and instantiate them in RTL, but doing so
requires debugging RTL code, which is time-consuming and error-prone. We
think such processes should be automated by the compiler.
Limited productivity for task-parallel programs significantly lengthens
the development cycles and undermines the benefits brought by HLS. One
may argue that programmers should always go for data-parallel
implementations when designing FPGA accelerators using HLS, but data
parallelism may be inherently limited, for example, in applications
involving graphs. Moreover, research shows that even for data-parallel
applications like neural networks [16] and stencil computation [10],
task-parallel implementations show better scalability and higher
frequency than their data-parallel counterparts due to the localized
communication pattern [18]. In fact, at least 6 papers
[24, 34, 46, 55, 59, 64] among the 28 research papers published in the
ACM FPGA 2020 conference use task-parallel implementation with HLS, and
another 3 papers [4, 50, 65] use RTL implementation that would have
required task-parallel implementation if written in HLS.
In this paper, we extend the HLS C++ language and present our framework,
TAPA (task-parallel), as a solution to the aforementioned limitations of
HLS productivity. Our contributions include:
• Convenient programming interfaces: We show that, with peeking and
transactions added to the programming interfaces, TAPA can be used to
program task-parallel kernels with a 22% reduction in lines of code (LoC)
on average. By unifying the interface used for the kernel and host, TAPA
further reduces the LoC on the host side by 51% on average.
• Universal software simulation: We demonstrate that our proposed
simulator can correctly simulate task-parallel programs that existing
simulators fail to simulate. Moreover, the correctness verification cycle
can be accelerated by a factor of 3.2× on average.
• Hierarchical code generation: We show that by modularizing a
task-parallel program and using a hierarchical approach, RTL code
generation can be accelerated by a factor of 6.8× on our server with 32
hyper-threads.
Table 1: Summary of related work.
                          |------- Programmability -------|  Software      RTL Code
Related Work              Peeking  Transaction  Host Iface.  Simulation    Generation
Fleet [60]                No       No           N/A          Sequential    N/A
Intel HLS (ihc::pipe)     No       No           N/A          Multi-thread  Monolithic
Intel HLS (ihc::stream)   No       Yes          N/A          Multi-thread  Monolithic
Intel OpenCL              No       No           OpenCL       Multi-thread  Monolithic
LegUp [3]                 No       No           N/A          Multi-thread  Monolithic
Merlin [14]               No       No           C++          Sequential    Monolithic
ST-Accel [54]             No       No           VFS          Sequential    Hierarchical
Vivado HLS (ap_fifo)      No       No           OpenCL       Sequential    Monolithic
Vivado HLS (axis)         No       Yes          OpenCL       Multi-thread  Manual
Xilinx OpenCL             No       No           OpenCL       Multi-thread  Monolithic
TAPA                      Yes      Yes          C++          Coroutine     Hierarchical
Other related HLS tools [3, 14, 32, 63], streaming works [54, 60], and
alternative APIs will be discussed in Section 5.1. Table 1 shows the
summary of related work. To the best of our knowledge, TAPA is the only
work that provides convenient programming interfaces, universal software
simulation, and hierarchical code generation for general task-parallel
programs on FPGAs using HLS. TAPA is open-source at
https://github.com/ucla-vast/tapa.
2 BACKGROUND
2.1 Task-Level Parallelism
Task-level parallelism is a form of parallelization of computer programs
across multiple processors. In contrast to data parallelism, where the
workload is partitioned on data and each processor executes the same
program (e.g., OpenMP [21]), different processors in a task-parallel
program often behave differently, while data are passed between
processors. Examples of task-parallel programs include image processing
pipelines [10, 52], graph processing [61, 67], and network switching
[50]. Software programs usually implement tasks as threads and/or
processes and rely on the operating system to schedule execution and
handle communication. This often leads to poor performance caused by
inefficient inter-task communication and frequent context switches [2].
Hardware programs, on the other hand, can be much more efficient due to
the massive amount of inherently parallel logic units. In this paper, we
focus on the problem of statically mapping tasks to hardware. That is,
instances of tasks are synthesized to different areas in an FPGA
accelerator. We plan to address dynamic scheduling in our future work.
2.2 Task-Parallel Programming Models
Task-parallel programs are oen described as communicating se-
quential processes [29] or using dataow models [36, 43, 51]. Kahn
process network (KPN) [36] is one of the most popular models used.
Under the KPN model, tasks are called processes. Processes com-
municate only through unidirectional channels. Data exchanged
between channels are called tokens. KPN requires that ¬ each pro-
cess is deterministic, i.e., the same input sequence must produce
the same output sequence; ­ channels are unbound, read blocks if
and only if the channel is empty, and write always succeed imme-
diately; ® a process cannot test an input channel for existence of
tokens without consuming them. While KPN and models derived
from KPN (e.g., synchronous dataow [43]) have been successful
in scheduling tasks on parallel processors, we show in the next
section that, when applied to model task-parallel HLS programs,
such models lack good programmability support. In this paper,
we borrow the terms process, channel, and token used in the KPN
formulation, but are not limited to KPN or any dataow model. In
fact, we will describe our programming model as a hierarchical
nite state machine in Section 3.1.1.
2.3 Motivating Example
Graphs are an important data structure, critical in many data mining and
machine-learning algorithms [26, 38, 45, 47, 49]. While there are many
existing FPGA accelerators designed for graph algorithms
[22, 23, 37, 48, 65–67], none of them are programmed in HLS. HLS’s lack
of good productivity for task-parallel programs is one of the reasons it
is not adopted for graph algorithms. In this section, we use a real-world
design to illustrate the productivity issues in implementing graph
accelerators in HLS, which serves as a motivating example for our work.
Our example accelerator implements PageRank [49] on the Alveo U280 board
and leverages the high-bandwidth memories (HBM). The input graph is
pre-processed and loaded into the HBM on the FPGA. The accelerator adopts
an edge-centric graph programming model [53] and decouples the
computation into two phases, i.e., the scatter phase and the gather phase
[11, 67]. In the scatter phase, edges are streamed from the HBM to the
processing elements (PE) on the FPGA. For each edge, an update message is
generated to propagate the weighted ranking of the source vertex to the
destination vertex. The updates are collected and stored off-chip in the
HBM. In the gather phase, the updates are loaded from the HBM and the
rankings are accumulated over each vertex. Our PageRank accelerator
instantiates multiple PEs. The PEs are connected to a vertex handler and
a control module. The control module coordinates accesses to the vertex
attributes and iterative execution between the two phases. Figure 3 shows
the block diagram of the example accelerator.
We measured 4.4 GTEPS (giga traversed edges per second) on-board
execution throughput using the accelerator with 19 HBM channels in use.
As a comparison, multi-thread CPU performance is around 0.7 GTEPS with 4
DDR4 memory channels [11]. Even if we assume a similar memory bandwidth
as the FPGA accelerator and project the CPU performance to 3.5 GTEPS, it
would still be more than 20% slower, due to the lack of fine-grained
control over communication. While developing this accelerator, we found
that the following are missing or hard to use in the HLS tools and
significantly impact the productivity.
[Figure 3 diagram, blocks: PageRank Accelerator containing Ctrl, Vertex
Handler, and Processing Elements 1 through 8; HBM holding vertices,
edges, and updates; zoomed-in Processing Element containing a Compute
Unit and an Update Handler.]
Figure 3: Example PageRank accelerator design.
1) Peeking. Peeking is defined as reading a token from a channel without
consuming it. As mentioned in Section 2.2, KPN explicitly prohibits such
behavior. Yet, such a pattern is common in many applications. For
example, in the PageRank accelerator, the UpdateHandler module needs to
keep track of the number of updates destined for each vertex partition.
Due to the large number of partitions, block RAMs (BRAM) are used for
storing the update counts. However, incrementing a value in BRAM cannot
be done in a single clock cycle on FPGAs due to the addressing latency,
which prevents the loop from being fully pipelined. A workaround is to
accumulate the update count in a register for updates with the same
partition id (pid) and only write changes to BRAM when the pid changes.
This requires us to detect conflicts on the addresses and stop reading
the input channel when a conflict occurs, as shown in the green lines
marked with “+” in Listing 1. Without a peek API, one has to write it as
the red lines marked with “-” in Listing 1 to manually maintain a buffer
for the incoming values. This not only increases the programming burden,
but also makes the design prone to errors in state transitions of the
buffer.
2) Transactions. A sequence of tokens may constitute a single logical
communication transaction. Using the same PageRank accelerator example,
in the gather phase when the updates are read from HBM, the updates
transmitted from UpdateHandler to ComputeUnit for each vertex partition
can be considered a single transaction. Since only UpdateHandler knows
the number of updates transmitted in each transaction, ComputeUnit needs
to test for a special token to detect the end of transaction (green lines
marked with “+” in Listing 2). Without an eot API, one has to manually
add a special bit to the data structure representing the tokens (red
lines marked with “-” in Listing 2). Note that the Update struct is used
elsewhere and it is infeasible to add the eot bit directly to the Update
struct. An alternative solution, i.e., sending the length of the
transaction to the token consumer beforehand, is not only more
complicated, but also impractical in cases where the tokens are generated
dynamically and the length of the transaction cannot be determined
beforehand.
3) System integration. To offload a computation kernel from the host CPU
to FPGA accelerators, programmers need to write host-side code to
interface the accelerator kernel with the host. FPGA vendors adopt the
OpenCL standard to provide such functionality. While the standard OpenCL
host-kernel interface infrastructure relieves programmers from writing
their own operating system drivers and low-level libraries, it is still
inconvenient and hard to use. Programmers often have to write and debug
tens of lines of code just to set up the host-kernel interface.
Task-parallel accelerators often make the situation worse because the
parallel tasks are often described as distinct OpenCL kernels [31], which
significantly increases the programmers’ burden of managing these kernels
in the host-kernel interface. For our PageRank accelerator, more than 60
lines of host code are created just for the host-kernel integration,
which constitute more than 20 percent of the whole source code. Yet, what
we actually need is just a single function invocation of the synthesized
FPGA bitstream given proper arguments.
Listing 1: Code snippets with (green lines marked with “+”) and without
(red lines marked with “-”) a peek API. Without the peek API, the code
snippet is 33% longer and error-prone.
Listing 2: Code snippets with (green lines marked with “+”) and without
(red lines marked with “-”) an end-of-transaction (eot) API. Without the
eot API, the code snippet is 2× longer.
4) Soware simulation. C does not have explicit parallel seman-
tics by itself. Vivado HLS uses the dataow model and allow pro-
grammers to instantiate tasks by invoking each of them sequen-
tially [63]. While this is very concise to write (red lines marked with
“-” in Listing 3), it will lead to incorrect simulation results because
the communication between ComputeUnit and UpdateHandler are
bidirectional, yet sequential execution can only send tokens from
ComputeUnit to UpdateHandler because of their invocation order.
is problem was also pointed out in [8]. In order to run so-
ware simulation correctly, the programmer can change the source
code to run tasks in multiple threads for soware simulation, but
doing so requires the same piece of task instantiation code to be
wrien twice for synthesis and simulation, reducing productivity.
While other tools that run tasks in parallel threads do not have the
same correctness problem, we will show in Section 4.4 that such
simulators do not scale well when the number of tasks increase.
Listing 3: Code snippets that instantiate tasks in Vivado HLS (red lines
marked with “-”) and TAPA (green lines marked with “+”). The
instantiation interface in Vivado HLS is not verbose, but software
simulation does not work correctly.
5) RTL code generation. In our PageRank design, the same processing
element is instantiated 8 times. This makes the HLS compiler synthesize
the same PE module 8 times, taking 7 minutes per compilation. We can
reduce the code generation time to less than 1 minute by manually
synthesizing each module separately and connecting the generated RTL
code, but doing so forces us to debug RTL code and spend tens of minutes
to verify the correctness of each code modification, which defeats the
purpose of adopting HLS.
In this paper, we present the TAPA framework that addresses these
challenges by providing convenient programming interfaces, universal
software simulation, and hierarchical code generation.
3 TAPA FRAMEWORK
3.1 Programming Model and Interface
3.1.1 Hierarchical Finite-State Machine Model. Similar to KPN described
in Section 2.2, tasks in TAPA communicate via channels. Unlike KPN, tasks
are modeled as hierarchical finite-state machines (FSM). Each task is
either a leaf that does not instantiate any channels or tasks, or a
collection of tasks and channels with which the tasks communicate. A task
that instantiates a set of tasks and channels is called the parent task
for that set. Each channel must be connected to exactly two tasks that
are instantiated in the same parent task. One of the tasks must act as a
producer and the other must act as a consumer. The producer streams
tokens to the consumer via the channel in first-in-first-out (FIFO)
order. Each task is an FSM, where the tokens streamed to and from the
task are inputs and outputs to the FSM. In the case of a parent task, the
states of all instantiated channels and tasks constitute its state. The
producer of a channel can test the fullness of the channel and append
tokens to the channel (write) if the channel is not full. The consumer of
a channel can test the emptiness of the channel and remove tokens from
the channel (read), or duplicate the head token without removing it
(peek), if the channel is not empty. Read, peek, and write operations can
be blocking or non-blocking. A blocking operation on an input (output)
channel keeps the task FSM in its current state until the channel becomes
non-empty (non-full). A non-blocking operation tries to perform the
operation and returns whether it is successful as one of the inputs to
the task FSM. Each task is implemented as a C++ function, and tasks
communicate with each other via the communication interface. A parent
task instantiates channels and tasks using the instantiation interface.
One of the tasks is designated as the top-level task, which defines the
communication interfaces external to the FPGA accelerator.
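The channel semantics described above can be sketched in plain C++. This is a minimal, single-threaded illustration, assuming a hypothetical BoundedChannel class backed by std::deque; it is not the TAPA implementation, and only the non-blocking operations are modeled, each returning whether it succeeded.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for a channel in the model above; not TAPA code.
template <typename T, std::size_t Capacity>
class BoundedChannel {
 public:
  // Producer side: fullness test and non-blocking write.
  bool full() const { return tokens_.size() >= Capacity; }
  bool try_write(const T& token) {
    if (full()) return false;  // caller's FSM stays in its current state
    tokens_.push_back(token);
    return true;
  }

  // Consumer side: emptiness test, non-blocking peek and read.
  bool empty() const { return tokens_.empty(); }
  bool try_peek(T& token) const {
    if (empty()) return false;
    token = tokens_.front();  // copy the head; channel state is unchanged
    return true;
  }
  bool try_read(T& token) {
    if (empty()) return false;
    token = tokens_.front();
    tokens_.pop_front();  // remove the head token
    return true;
  }

 private:
  std::deque<T> tokens_;  // tokens are kept in FIFO order
};
```

With a capacity of 2, a third try_write fails until the consumer reads a token, and repeated try_peek calls keep returning the same head token.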
3.1.2 Communication Interface. Tasks communicate with each other through
the communication interface. TAPA provides separate communication APIs
for the producer side and the consumer side. The producer and consumer
tasks of a channel use ostream and istream as the interfaces,
respectively. The interfaces are templated and can be used for any
copyable class. On the consumer side, istream provides peek, which allows
the programmer to read a token without removing it from the channel,
i.e., the state of the channel is not changed. A special token denoting
end-of-transaction (EoT) is available to all channels. A process can
“close” a channel by writing an EoT to it, and a process can “open” a
channel by reading an EoT from it. An EoT token does not contain any
useful data. This is designed deliberately to make it possible to break
from a pipelined loop when an EoT is present (Listing 2). Table 2
summarizes the communication interfaces provided by TAPA. Listing 4 shows
an example of how the communication interfaces are used in TAPA.
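To make the peek and EoT semantics concrete, the following is a minimal single-threaded sketch of the update-counting pattern from Section 2.3. MockStream, Update, and CountUpdates are hypothetical names introduced only for illustration; the method names mirror Table 2, but this is not the TAPA implementation, and a std::map stands in for the BRAM table.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>

// Hypothetical single-threaded mock of a stream with EoT tokens.
template <typename T>
class MockStream {
 public:
  void write(const T& value) { fifo_.push_back({value, false}); }
  void close() { fifo_.push_back({T{}, true}); }  // EoT carries no data
  bool empty() const { return fifo_.empty(); }
  bool eot() const { assert(!empty()); return fifo_.front().is_eot; }
  T peek() const { assert(!eot()); return fifo_.front().value; }
  T read() {
    T value = peek();
    fifo_.pop_front();
    return value;
  }
  void open() { assert(eot()); fifo_.pop_front(); }  // consume the EoT

 private:
  struct Token { T value; bool is_eot; };
  std::deque<Token> fifo_;
};

struct Update { std::uint32_t pid; };  // partition id only, for brevity

// Counts updates per partition. The count is accumulated in a local
// "register" and written to the table only when the pid changes.
std::map<std::uint32_t, std::uint32_t> CountUpdates(
    MockStream<Update>& updates) {
  std::map<std::uint32_t, std::uint32_t> table;  // stands in for BRAM
  bool have_pid = false;
  std::uint32_t current_pid = 0;
  std::uint32_t count = 0;
  while (!updates.eot()) {  // the transaction ends at the EoT token
    const std::uint32_t pid = updates.peek().pid;  // inspect, do not consume
    if (have_pid && pid != current_pid) {
      table[current_pid] += count;  // conflict: flush, leave token queued
      have_pid = false;
      count = 0;
      continue;
    }
    updates.read();  // same partition: now consume the token
    current_pid = pid;
    have_pid = true;
    ++count;
  }
  if (have_pid) table[current_pid] += count;  // flush the last partition
  updates.open();
  return table;
}
```

Because the conflicting token is only peeked, it stays at the head of the channel while the register is flushed, so no separate buffer or extra state is needed to hold it.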
3.1.3 Instantiation Interface. A parent task can instantiate channels and
tasks using the instantiation interface. Channels are instantiated using
tapa::channel<type, capacity>. For example, tapa::channel<VertexReq, 2>
instantiates a channel with capacity 2, meaning up to 2 tokens can be
written to this channel without reading them out or blocking the
producer. Data tokens transmitted using this channel have type VertexReq.
Tasks are instantiated using tapa::task::invoke. By default, a parent
task does not finish until all its children tasks finish. A child task
can optionally be invoked with tapa::detach, meaning the child task is
launched and detached immediately, and the parent does not wait for it to
finish. The tapa::detach invocation type is particularly useful when a
task never terminates, e.g., VertexHandler with an infinite loop (Listing
4). Listing 5 shows an example of how channels and tasks are instantiated
in TAPA.
Table 2: TAPA communication interface.
tapa::ostream<T>& API    Producer-side functionality
bool full();             fullness test
void write(T);           blocking write a data token
bool try_write(T);       non-blocking write a data token
void close();            blocking write an EoT token
bool try_close();        non-blocking write an EoT token
tapa::istream<T>& API    Consumer-side functionality
bool empty();            emptiness test
T peek();                blocking peek a data token
bool try_peek(T&);       non-blocking peek a data token
T read();                blocking read a data token
bool try_read(T&);       non-blocking read a data token
bool eot();              return if next token is EoT
bool try_eot(bool&);     return if next token exists and if it is EoT
void open();             blocking read an EoT token
bool try_open();         non-blocking read an EoT token
void VertexHandler(tapa::istream<VertexReq>& req_s, ...) {
  for (;;) {
    VertexReq req;
    if (req_s.try_read(req)) {
      ... // handle requests
    }
  }
}

void Ctrl(tapa::ostream<VertexReq>& vertex_req, ...) {
  ... // initial setup
  while (...) { // iterative execution
    VertexReq req(...); // request vertices
    vertex_req.write(req);
    ... // finish scatter & do gather
  }
}
Listing 4: TAPA communication interface example.
void PageRank(...) {
  tapa::channel<VertexReq, 2> vertex_req;
  ...
  tapa::task()
    .invoke<tapa::detach>(VertexHandler, vertex_req, ...)
    .invoke(Ctrl, vertex_req, ...)
    ...
  ;
}
Listing 5: TAPA instantiation interface example.
3.1.4 System Integration Interface. To offload a kernel to an FPGA
accelerator, programmers need to integrate the FPGA into the host CPU
system. Thanks to the vendor-provided system drivers and the standard
OpenCL accelerator APIs, most programmers only need to follow the OpenCL
host-kernel communication specification and invoke proper APIs. However,
those OpenCL APIs are still verbose and take a long time to learn and
develop with. For example, programmers need to learn the concepts of
“platform”, “context”, “queue”, and “kernel” in OpenCL and manage them
for each accelerator, yet the only thing necessary is usually just to
find a proper FPGA accelerator or simulation environment and use it to
run the program. This overhead for programmers is exacerbated by
task-parallel accelerators, where parallel tasks are often synthesized as
concurrent OpenCL kernels that need to be managed separately by the host.
TAPA uses a unied system integration interface to further re-
duce programmers’ burden. To ooad a kernel to an FPGA ac-
celerator, programmers only need to call the top-level task as a
C++ function in the host code. Since TAPA can extract metadata
information, e.g., argument type, from the kernel code, TAPA will
automatically synthesize proper OpenCL host API calls and emit
an implementation of the top-level task C++ function that can set
up the runtime environment properly. As a user of TAPA, the pro-
grammer can use a single function invocation in the same source
code to run soware simulation, hardware simulation, and on-board
execution, with the only dierence of specifying proper bitstreams.
3.2 Soware Simulation
State-of-the-Art Approach. ere are mainly two state-of-the-art
approaches that run fast soware simulation for task-parallel ap-
plications: the sequential approach and the multi-thread approach.
A sequential simulator invokes tasks sequentially in the invocation
order [63]. Sequential simulators are fast, but cannot correctly sim-
ulate the capacity of channels and applications with tasks commu-
nicating bidirectionally, as discussed in Section 2.3. A multi-thread
simulator invokes tasks in parallel by launching a thread for each
task. is enables the capacity of channels and bidirectional com-
munication to be simulated correctly. However, they may perform
poorly due to the ineciency of inter-thread communication and
context switch handled by the operating system. e FLASH simu-
lator [8, 12] proposed an alternative to the above, which relies on
the HLS scheduling information to mimic the RTL FSM. While this
simulation approach itself is faster than multi-thread simulators,
generating simulation executable becomes slower due to the need
of the HLS scheduler output for cycle-accuracy, which is not needed
for correctness verication.
In this section, we present an alternative approach to run software
simulation on task-parallel applications. Given that the inefficiency of
multi-thread execution is mainly caused by the preemptive nature of
operating system threads, and inspired by the widespread adoption of
coroutines in modern software languages [25, 41], we propose an approach
that uses collaborative coroutines instead of preemptive threads. Note
that fast and/or cycle-accurate debugging in general [33] is out of the
scope of this paper; we focus on the correctness and scalability issues
for task-parallel programs.
Coroutine-Based Approach. Routines in programming languages are the units
of execution contexts, e.g., functions in C [39]. Coroutines [20] are
routines that execute collaboratively; more specifically, coroutines can
be explicitly suspended and resumed. Coroutines can even maintain their
own stacks. As a result, each coroutine can invoke subroutines itself and
suspend from and resume to any subroutine [41]. Coroutines that have
their own stacks are called stackful coroutines. A context switch between
coroutines takes only 26 ns on modern CPUs [41]. As a comparison, an
operating system thread context switch takes 1.2 to 2.2 µs [2], which is
roughly 46× to 85× slower, i.e., close to two orders of magnitude.
TAPA leverages stackful coroutines to perform software simulation. When
channels are instantiated in the simulator, enough memory space is
reserved to ensure the channel capacity can be simulated correctly. When
tasks are instantiated, a coroutine is launched but suspended immediately
for each task. Once all tasks are instantiated, the simulator starts to
resume the suspended coroutines. A resumed task will be suspended again
if any input channel is accessed when empty or any output channel is
accessed when full, which means that no progress can be made from this
task. A different task will then be selected and resumed by the
simulator.
For example, in the task instantiation code shown in Listing 5, both
VertexHandler and Ctrl are launched as coroutines and suspended
immediately by the invoke function calls. Once all tasks are
instantiated, the simulator starts to pick tasks for execution. Ctrl is
picked first, which will write vertex requests to vertex_req. Once
vertex_req becomes full, the simulator determines that no progress can be
made from Ctrl, and thus will suspend it and pick another task for
execution. VertexHandler is then resumed and tokens will be read from
vertex_req. Once vertex_req becomes empty, the simulator determines that
no progress can be made from VertexHandler, and thus will suspend it and
pick the next task for execution.
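The suspend-and-resume schedule walked through above can be approximated without a coroutine library. The sketch below is an assumption-laden simplification, not TAPA's simulator: instead of stackful coroutines, each task is a hypothetical step function that runs until it would block and then returns to a round-robin scheduler; the scheduler is single-threaded; and, unlike the detached VertexHandler in Listing 4, the handler here terminates after a fixed number of tokens so that the simulation ends.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical bounded FIFO standing in for tapa::channel<int, 2>.
struct Channel {
  std::size_t capacity;
  std::deque<int> fifo;
  bool full() const { return fifo.size() >= capacity; }
  bool empty() const { return fifo.empty(); }
};

enum class Status { kSuspended, kFinished };

// A task runs until it would block, then returns control to the scheduler.
using Task = std::function<Status()>;

struct SimResult {
  int tokens_received = 0;  // tokens consumed by VertexHandler
  int resumes = 0;          // number of times any task was resumed
};

SimResult Simulate(int num_requests) {
  Channel vertex_req{2, {}};
  SimResult result;
  int next_request = 0;

  // Ctrl: writes requests; suspends when the channel is full.
  Task ctrl = [&]() {
    while (next_request < num_requests) {
      if (vertex_req.full()) return Status::kSuspended;
      vertex_req.fifo.push_back(next_request++);
    }
    return Status::kFinished;
  };

  // VertexHandler: reads requests; suspends when the channel is empty.
  Task vertex_handler = [&]() {
    while (result.tokens_received < num_requests) {
      if (vertex_req.empty()) return Status::kSuspended;
      vertex_req.fifo.pop_front();
      ++result.tokens_received;
    }
    return Status::kFinished;
  };

  // Round-robin scheduler (an assumed policy); Ctrl is picked first.
  std::vector<Task*> tasks = {&ctrl, &vertex_handler};
  std::vector<bool> finished(tasks.size(), false);
  std::size_t remaining = tasks.size();
  while (remaining > 0) {
    for (std::size_t i = 0; i < tasks.size(); ++i) {
      if (finished[i]) continue;
      ++result.resumes;
      if ((*tasks[i])() == Status::kFinished) {
        finished[i] = true;
        --remaining;
      }
    }
  }
  return result;
}
```

For 5 requests and a capacity of 2, Ctrl and VertexHandler alternate three times each: Ctrl fills the channel, VertexHandler drains it, and so on until all tokens are delivered.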
To beer utilize the available CPU cores, we use a thread pool
to execute the coroutines. We will show in Section 4.4 that the
coroutine-based simulator outperforms the existing simulators by
3.2× on average (Section 4.4).
3.3 RTL Code Generation
State-of-the-art Approach. Current HLS tools treat the whole
task-parallel program as a monolithic design, treat channels as global
variables, and compile different instances of tasks as if they are
completely unrelated. While this enables instance-specific optimizations,
e.g., different constant arguments can be propagated to different
instances, it can also lead to a significant amount of repeated work. For
example, the dataflow architecture generated by the SODA compiler [9, 10]
is highly modularized and many modules are functionally identical.
However, both the Vivado HLS backend and Intel FPGA OpenCL backend of
SODA generate RTL code for each SODA module separately. When the design
scales out to hundreds of modules, RTL code generation can easily run for
hours, taking even longer than logic synthesis and implementation. While
we recognize that a programmer can manually generate RTL code for each
task and glue them together at the RTL level to speed up RTL code
generation, doing so defeats the purpose of using HLS for high
productivity, because the glued RTL code can be error-prone yet cannot be
verified using fast software simulation. We also recognize that fast RTL
code generation in general is an interesting problem, but we focus on the
inefficiency exacerbated by task-parallel programs in this paper.
Modularized Approach. Thanks to the hierarchical programming
model, TAPA can keep the program hierarchy, recognize different
instances of the same task, and compile each task only once. As
such, the total amount of time spent on RTL code generation is
reduced. Moreover, modularized compilation makes it possible to
compile tasks in parallel, further reducing RTL code generation
time on multi-core machines. TAPA implements this by doing a
source-to-source transformation to generate the vendor HLS code
for each task and invoking the vendor tools in parallel for each
task. On average, TAPA reduces HLS compilation time by 6.8×
(Section 4.5).
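To make the saving concrete, the following sketch counts how many HLS runs each approach needs for a given set of task instances. It is an illustration under stated assumptions, not TAPA's code generator: the example design is loosely modeled on the cannon benchmark (64 PEs plus 9 helper instances per matrix), the task and instance names are made up, and ParallelWaves assumes every HLS run takes equally long, which real runs do not.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One instantiation of a task: (instance name, name of the task it instantiates).
using Instance = std::pair<std::string, std::string>;

struct HlsRunCount {
  std::size_t per_instance;  // monolithic flow: one HLS run per instance
  std::size_t per_task;      // modularized flow: one HLS run per distinct task
};

// Groups instances by task definition and counts the HLS runs each flow needs.
HlsRunCount CountHlsRuns(const std::vector<Instance>& instances) {
  std::map<std::string, std::size_t> instances_per_task;
  for (const Instance& instance : instances) {
    ++instances_per_task[instance.second];
  }
  return {instances.size(), instances_per_task.size()};
}

// Number of sequential "waves" when `workers` HLS runs execute in parallel,
// under the simplifying assumption that every run takes equally long.
std::size_t ParallelWaves(std::size_t runs, std::size_t workers) {
  assert(workers > 0);
  return (runs + workers - 1) / workers;
}

// Hypothetical design: a grid of identical PEs plus per-matrix helper tasks.
std::vector<Instance> MakeExampleDesign() {
  std::vector<Instance> instances;
  for (int i = 0; i < 64; ++i) {
    instances.emplace_back("pe_" + std::to_string(i), "ProcElem");
  }
  for (int i = 0; i < 9; ++i) {
    instances.emplace_back("dist_a_" + std::to_string(i), "DistributeA");
    instances.emplace_back("dist_b_" + std::to_string(i), "DistributeB");
    instances.emplace_back("coll_c_" + std::to_string(i), "CollectC");
  }
  return instances;
}
```

For this hypothetical design, CountHlsRuns reports 91 runs for the per-instance flow and 4 for the per-task flow. With 8 parallel workers, the 4 per-task runs fit in a single wave.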
3.4 TAPA Automation Overview
The TAPA automation flow is shown in Figure 4. The TAPA C++
source code can be compiled directly for software simulation and
correctness verification. Starting from the same TAPA C++ source
code, TAPA extracts the HLS code for each task and the metadata
information of the whole design, including the communication
topology among tasks, token types exchanged between tasks, and
channels' capacity. The vendor HLS tool is then leveraged to
generate RTL code and a performance/resource report for each task.
The extracted metadata is used to instantiate the task instances and
connect them together systematically, producing the overall HLS
report and kernel RTL code, which can be used for QoR tuning
and logic synthesis and implementation, respectively. The same
metadata information is also used to create the host-kernel
communication interface, which can be used for on-board execution or
optionally RTL simulation.
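One way to picture the extracted metadata is as a small set of records for task instances and channels. The sketch below is an assumption about what such records could look like, not TAPA's actual data format. All type and function names (TaskInstance, ChannelInfo, DesignMetadata, FindDanglingChannels, DescribeChannel) are hypothetical, and the instance names Ctrl_0 and VertexHandler_0 are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One instance of a task in the design (hypothetical record layout).
struct TaskInstance {
  std::string name;  // unique instance name
  std::string task;  // task definition that this instance instantiates
};

// One channel: token type, capacity, and the two endpoints it connects.
struct ChannelInfo {
  std::string name;
  std::string token_type;
  std::size_t capacity;
  std::string producer;  // instance that writes to the channel
  std::string consumer;  // instance that reads from the channel
};

struct DesignMetadata {
  std::vector<TaskInstance> instances;
  std::vector<ChannelInfo> channels;
};

// Returns the names of channels whose producer or consumer is not a known
// instance; a consistent design yields an empty list.
std::vector<std::string> FindDanglingChannels(const DesignMetadata& design) {
  std::set<std::string> known;
  for (const TaskInstance& instance : design.instances) {
    known.insert(instance.name);
  }
  std::vector<std::string> dangling;
  for (const ChannelInfo& channel : design.channels) {
    if (known.count(channel.producer) == 0 ||
        known.count(channel.consumer) == 0) {
      dangling.push_back(channel.name);
    }
  }
  return dangling;
}

// Renders one wiring line per channel, as a netlist generator might log it.
std::string DescribeChannel(const ChannelInfo& channel) {
  return channel.name + ": " + channel.producer + " -> " + channel.consumer +
         " (" + channel.token_type + ", depth " +
         std::to_string(channel.capacity) + ")";
}

// Two instances connected by one channel, echoing the PageRank example.
DesignMetadata MakeExampleMetadata() {
  DesignMetadata design;
  design.instances.push_back({"Ctrl_0", "Ctrl"});
  design.instances.push_back({"VertexHandler_0", "VertexHandler"});
  design.channels.push_back(
      {"vertex_req", "VertexReq", 2, "Ctrl_0", "VertexHandler_0"});
  return design;
}
```

Keeping topology, token types, and capacities in one place is what lets the same information drive both the RTL-level instantiation and the host-kernel interface.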
[Figure 4 diagram. Blocks: TAPA C++ code; extract metadata (TAPA);
task info and channel info; source-to-source transformation (TAPA);
HLS code (per task); HLS compiler; RTL code & HLS report (per task);
task & channel instantiation (TAPA); kernel RTL code & HLS report;
host-kernel interface code. The intermediate steps are handled
automatically by TAPA.]
Figure 4: TAPA automation flow overview.
4 EVALUATION
We prototype TAPA on Xilinx devices using Vivado HLS as the
backend; support for Intel devices will be added later. The Clang
compiler infrastructure is modified to extract information about
tasks and perform source-to-source transformation to generate Vivado
HLS kernel code and OpenCL host code. GCC is used to compile
the host executables and the software simulators. We compare
the productivity of TAPA with two vendor tools that provide an
end-to-end high-level programming experience (including host-kernel
communication): the Xilinx Vitis/Vivado HLS 2019.2 suite and the
Intel FPGA SDK for OpenCL Pro Edition 19.4. The experimental results
are obtained on an Ubuntu 18.04 server with 2 Xeon Gold 6244
processors.
4.1 Benchmarks
We used the following benchmarks for comparison. All
implementations (Vivado HLS, Intel OpenCL, and TAPA) of each
benchmark are written so that tasks in each implementation have
one-to-one correspondence, corresponding loops are scheduled
with the same initiation interval (II), and each task performs the
same computation. This guarantees that all tools generate consistent
quality of results. Note that we aim to compare the productivity
of each of the HLS tools, not the quality of result. In particular,
we were unable to guarantee that the generated RTL codes have
exactly the same behavior without having access to the HLS
compiler's scheduling algorithm. For example, the network switch
implemented in TAPA has a total latency of 3 cycles while the
Vivado HLS implementation has a total latency of 6. This is
inevitable because, using Vivado HLS, one has to manually buffer the
incoming packets, forcing an additional latency of 1 cycle at each
of the 3 network stages. Table 3 summarizes the number of tasks and
channels used in each benchmark.
Cannon's Algorithm. Cannon's algorithm [44] is a distributed
algorithm for matrix multiplication that runs on a 2D mesh of
processing elements (PEs). This benchmark contains 8×8 PEs. Each PE
is internally vectorized to perform 8 multiply-accumulate operations
per cycle for two 128×128 matrices. Besides the 64 PEs, the
accelerator also contains 9 data distributors/collectors for each
matrix. The inputs to the whole accelerator are 1024×1024×1024.
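For readers unfamiliar with the algorithm, the following is a minimal software reference of Cannon's data movement on a p×p mesh, checked against a naive triple loop. It is a sketch for illustration only: each PE here holds a single scalar rather than a 128×128 block, the shifts are modeled by copying arrays rather than by channels, and the names Matrix, NaiveMultiply, and CannonMultiply are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long>>;

// Reference result: C = A * B with the usual triple loop.
Matrix NaiveMultiply(const Matrix& a, const Matrix& b) {
  const std::size_t n = a.size();
  Matrix c(n, std::vector<long>(n, 0));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t k = 0; k < n; ++k) c[i][j] += a[i][k] * b[k][j];
  return c;
}

// Cannon's algorithm on a p x p mesh where PE(i, j) holds one element of
// A and one element of B at a time and accumulates one element of C.
Matrix CannonMultiply(const Matrix& a_in, const Matrix& b_in) {
  const std::size_t p = a_in.size();
  assert(b_in.size() == p);
  for (std::size_t i = 0; i < p; ++i) {
    assert(a_in[i].size() == p && b_in[i].size() == p);
  }
  Matrix a(p, std::vector<long>(p));
  Matrix b(p, std::vector<long>(p));
  Matrix c(p, std::vector<long>(p, 0));

  // Initial skew: row i of A is rotated left by i positions and
  // column j of B is rotated up by j positions.
  for (std::size_t i = 0; i < p; ++i) {
    for (std::size_t j = 0; j < p; ++j) {
      a[i][j] = a_in[i][(j + i) % p];
      b[i][j] = b_in[(i + j) % p][j];
    }
  }

  for (std::size_t step = 0; step < p; ++step) {
    // Every PE multiplies the pair it currently holds.
    for (std::size_t i = 0; i < p; ++i)
      for (std::size_t j = 0; j < p; ++j) c[i][j] += a[i][j] * b[i][j];

    // A moves one PE to the left and B moves one PE up, with wrap-around.
    // The wrap-around links are the feedback data paths of the mesh.
    Matrix a_next(p, std::vector<long>(p));
    Matrix b_next(p, std::vector<long>(p));
    for (std::size_t i = 0; i < p; ++i) {
      for (std::size_t j = 0; j < p; ++j) {
        a_next[i][j] = a[i][(j + 1) % p];
        b_next[i][j] = b[(i + 1) % p][j];
      }
    }
    a.swap(a_next);
    b.swap(b_next);
  }
  return c;
}
```

At step s, PE(i, j) holds A[i][k] and B[k][j] with k = (i + j + s) mod p, so after p steps every k has been visited exactly once. In the benchmark, the same schedule applies with 128×128 blocks in place of scalars, which is how 8×8 PEs cover 1024×1024 matrices.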
Convolutional Neural Network. Convolutional neural networks
are very popular for many machine learning applications, e.g.,
image classification [58]. This benchmark implements the third layer
of VGG [58] 2 based on a systolic array implementation generated
from PolySA [16]. PolySA is a polyhedral-based systolic array
auto-compilation framework that can generate optimal designs within
one hour with performance comparable to state-of-the-art manual
designs.
Gaussian Filter. The Gaussian filter is often employed for
low-pass filtering on input signals or images, or used iteratively
for solving linear systems of equations. This benchmark is based on a
dataflow microarchitecture generated from SODA [10]. SODA is a
stencil compiler that can generate optimal communication-reuse
buffers with temporal and spatial parallelism. This benchmark
performs 8 iterations of Gaussian filtering, each of which is
capable of processing 16 input elements in parallel. The input size
is 32768×32768.
Graph Convolutional Network. Graph convolutional network [38]
is an emerging type of neural network that processes sparse and
irregular data as opposed to dense and regular ones like images.
This benchmark implements a forward layer of GCN for the Cora
dataset [57], which contains 2708 vertices and 10556 edges. The
input and output features have 1433 and 16 dimensions, respectively.
2 Parameters {i, o, h, w, p, q} = {512, 512, 56, 56, 3, 3}.
Table 3: Benchmarks used in this paper. Each task may be
instantiated multiple times, so the number of task instances
is greater than the number of tasks.
Benchmark   #Tasks   #Task Instances   #Channels
cannon           5                91         344
cnn             14               209         366
gaussian        15               564        1602
gcn              5                12          25
gemm            14               207         364
network          3                14          32
page_rank        4                18          89
General Matrix Multiplication. This benchmark is based on a
systolic array implementation generated from PolySA [16]. Compared
with Cannon's algorithm, PolySA avoids feedback data paths in the
systolic array, and can support non-square matrices. The inputs to
the accelerator are 1040×1024×1024.
Network Switching. This benchmark implements an 8×8 Omega
network switch [42] that can route packets from any input port
to any output port. The packets are 64-bit wide with the first 3
bits being the header and are generated randomly with an even
distribution among the 8 destination ports.
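The routing rule of an Omega network can be sketched as follows. This is a software illustration of standard destination-tag routing, not the benchmark's kernel code. It assumes that "the first 3 bits" means the 3 most significant bits of the 64-bit packet and that they hold the destination port; the names HeaderOf, Shuffle, and Route are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr unsigned kStages = 3;             // log2(8) stages of 2x2 switches
constexpr unsigned kPorts = 1u << kStages;  // 8 ports

// Assumption: the destination port is stored in the top 3 bits of the packet.
inline unsigned HeaderOf(std::uint64_t packet) {
  return static_cast<unsigned>(packet >> 61);
}

// Perfect shuffle between stages: rotate the 3-bit port index left by one.
inline unsigned Shuffle(unsigned port) {
  return ((port << 1) | (port >> (kStages - 1))) & (kPorts - 1);
}

// Destination-tag routing. Returns the port the packet occupies after each
// stage; the last entry is the output port.
std::vector<unsigned> Route(unsigned src, std::uint64_t packet) {
  assert(src < kPorts);
  const unsigned dst = HeaderOf(packet);
  unsigned port = src;
  std::vector<unsigned> path;
  for (unsigned stage = 0; stage < kStages; ++stage) {
    port = Shuffle(port);
    // The 2x2 switch picks its upper (0) or lower (1) output according to
    // one header bit, most significant bit first.
    const unsigned bit = (dst >> (kStages - 1 - stage)) & 1u;
    port = (port & ~1u) | bit;
    path.push_back(port);
  }
  return path;
}
```

Each stage shifts one bit of the source index out and one bit of the destination in, so after 3 stages the port index equals the destination regardless of the source. The 3 stages also correspond to the 3-cycle latency quoted for the TAPA implementation, if each stage takes one cycle.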
PageRank. is benchmark implements the PageRank [49] ci-
tation ranking algorithm for general large graphs as described in
Section 2.3. We use the Slashdot community graph [45] as the
dataset for debugging, which contains 77360 vertices and 905468
edges. e accelerator design itself can scale up to 226 vertices and
228 edges.
4.2 Lines of Kernel Code
TAPA simplies the kernel code in two aspects. First, the TAPA
communication interfaces simplify the code with the built-in sup-
port for peeking and transactions. is not only simplies the body
of each task denition, but also removes the necessity for many
struct denitions. Second, the TAPA instantiation interfaces sim-
plify the code by allowing tasks to be launched and detached con-
cisely. Without this functionality, each task in Vivado HLS must
be carefully given a termination condition, whereas Intel OpenCL
requires verbose kernel instantiation aributes for each instance
of task. Figure 5 shows the lines of kernel code comparison of
each benchmark. On average, TAPA reduces the lines of kernel
code by 22%. Note that only synthesizable kernel code is counted;
code added for multi-thread soware simulation is not counted for
Vivado HLS.
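As an illustration of why peeking shortens task bodies, consider merging two sorted input streams. With a peek operation the task can compare the heads of both streams before deciding which one to consume; without it, the programmer has to keep a manually buffered copy of each head token plus a valid flag. The sketch below uses a hypothetical PeekableStream class backed by a finite queue. It is not the TAPA interface, and "empty" here stands in for the end of a transaction.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Finite token stream with a non-destructive peek (hypothetical helper).
template <typename T>
class PeekableStream {
 public:
  explicit PeekableStream(std::deque<T> tokens) : tokens_(std::move(tokens)) {}
  bool empty() const { return tokens_.empty(); }
  // Returns the head token without consuming it.
  const T& peek() const {
    assert(!empty());
    return tokens_.front();
  }
  // Consumes and returns the head token.
  T read() {
    assert(!empty());
    T token = tokens_.front();
    tokens_.pop_front();
    return token;
  }

 private:
  std::deque<T> tokens_;
};

// Merges two sorted streams. peek() lets the task inspect both heads and
// consume only the smaller one, with no extra head buffers or valid flags.
std::vector<int> MergeSorted(PeekableStream<int>& a, PeekableStream<int>& b) {
  std::vector<int> out;
  while (!a.empty() && !b.empty()) {
    out.push_back(a.peek() <= b.peek() ? a.read() : b.read());
  }
  while (!a.empty()) out.push_back(a.read());
  while (!b.empty()) out.push_back(b.read());
  return out;
}
```

In hardware, an empty FIFO may simply mean that the producer has not caught up yet, so a real task also needs an explicit end-of-transaction signal. That signal is the role transactions play in the interfaces discussed above.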
4.3 Lines of Host Code
The host code used in the benchmarks contains a minimal test
bench to verify the correctness of the kernel code. The TAPA
system-integration API automatically interfaces with the OpenCL host
APIs and relieves the programmer from writing repetitive code just
to connect the kernel to a host program. Figure 6 shows the lines
of host code comparison. On average, the length of host code is
reduced by 51%.
[Figure 5 plot: lines of kernel code, normalized to TAPA, for
Vivado HLS, Intel OpenCL, and TAPA on cannon, cnn, gaussian, gcn,
gemm, network, and page_rank.]
Figure 5: LoC comparison for kernel code. Lower is better.
[Figure 6 plot: lines of host code, normalized to TAPA, for
Vivado HLS, Intel OpenCL, and TAPA on cannon, cnn, gaussian, gcn,
gemm, network, and page_rank.]
Figure 6: LoC comparison for host code. Lower is better.
4.4 Soware Simulation Time
Figure 7 shows four simulators, that is, the sequential Vivado
HLS simulator, the multi-thread Vivado HLS simulator, the multi-
thread Intel OpenCL simulator, and the coroutine-based TAPA
simulator. Among the three simulators, the sequential simulator
fails to correctly simulate benchmarks that require feedback data
paths (cannon and page rank). Due to the larger memory foot-
print required for storing the tokens transmied between tasks
and lack of parallelism, the sequential simulator is outperformed
by the coroutine-based simulator in all but one of the benchmarks
(network). e two multi-thread simulators correctly simulate all
benchmarks, except that Intel OpenCL cannot handle gaussian
because its large number of task instances (564) exceeds the max-
imum allowed (256) by the simulator. However, the multi-thread
8
cannon cnn gaussia
n gcn gemm networkpage_ra
nk
1 sec
10 sec
1 min
10 min
1 hour
10 hour
Ela
ps
ed
 Ti
me
Simulation Time
Vivado HLS (Seq)
Vivado HLS (MT)
Intel OpenCL (MT)
TAPA (Coroutine)
Figure 7: Simulation time comparison. Lower is better. e
sequential simulator fails to simulate cannon and pagerank
correctly. e Intel OpenCL multi-thread simulator cannot
simulate gaussian due to its large number of task instances.
simulators perform poorly on benchmarks that are communication-
intensive (e.g., network) or have more tasks than the number of
available threads (e.g., gaussian). e coroutine-based TAPA simu-
lator can correctly simulate all benchmarks without signicant per-
formance loss for both communication-intensive and computation-
intensive tasks with 3.2× average speedup.
4.5 RTL Code Generation Time
Figure 8 shows the RTL code generation time comparison. Thanks
to the hierarchical programming model and modularized code
generator, TAPA reduces the HLS compilation time by 6.8× on
average. This is because (1) TAPA runs HLS for each task only once
even if it is instantiated many times, while Vivado HLS and Intel
OpenCL run HLS for each task instance; and (2) TAPA runs HLS in
parallel on multi-core machines.
[Figure 8 plot: elapsed RTL code generation time, from 1 sec to
10 hours on a logarithmic axis, for Vivado HLS, Intel OpenCL, and
TAPA on cannon, cnn, gaussian, gcn, gemm, network, and page_rank.]
Figure 8: RTL code generation time. Lower is better.
5 RELATED WORK
5.1 HLS Support for Task-Parallel Programs
Intel HLS compiler supports two different inter-task communication
interfaces, ihc::pipe and ihc::stream. ihc::pipe implements
a light-weight hardware FIFO with data, valid, and ready signals,
while ihc::stream implements an Avalon-ST interface that
supports transactions. Tasks are instantiated using ihc::launch and
ihc::collect. Software simulation is done by launching multiple
threads. Instances of the same task are synthesized separately.
Intel OpenCL compiler supports light-weight FIFOs via two sets
of APIs, i.e., standard OpenCL pipe and Intel-specific channel.
Tasks are instantiated by defining OpenCL kernels, which forces
instances of the same task to be synthesized separately as different
OpenCL kernels. The OpenCL runtime handles the software simulation
by launching multiple threads.
Vivado HLS provides two different streaming interfaces: ap_fifo
and axis. The ap_fifo interface generates a light-weight FIFO
interface. Tasks are instantiated by invoking the corresponding
functions in a dataflow region, and instances of the same task are
synthesized separately. Software simulation is done by
sequentially executing the tasks. The axis interface generates an
AXI-Stream interface with transaction support. It requires the
programmers to instantiate channels and tasks in a separate
configuration file when running logic synthesis and implementation.
This allows different instances of the same task to be synthesized
only once, but takes longer to learn and implement compared with
ap_fifo. The OpenCL runtime handles the software simulation for
tasks instantiated with the axis interface by launching multiple
threads.
Xilinx OpenCL compiler supports standard OpenCL pipe, which
generates AXI-Stream interfaces similar to Vivado HLS axis, but
pipe does not provide APIs to support transactions. Like Vivado
HLS axis, software simulation of pipe is handled by the OpenCL
runtime by launching multiple threads.
LegUp compiler provides legup::FIFO, which implements
light-weight FIFOs. Tasks are instantiated using the pthread API
(Section 5.3). Software simulation is accomplished by launching
multiple threads. Instances of the same task are synthesized
separately.
Merlin compiler [14] allows programmers to call the FPGA
kernel as a C/C++ function and provides OpenMP-like simple pragmas
with automated design-space exploration based on machine
learning. To support task-parallel programs, Merlin leverages its
backend vendor tools' programming interfaces. Software simulation is
done by sequentially executing the tasks.
In summary, as pointed out in Table 1 (on page 2), none of the
state-of-the-art HLS tools provide peeking support. Only Intel HLS
ihc::stream and Vivado HLS axis support transactions. Only
Merlin allows the accelerator kernel to be called as if it were a
C/C++ function. Vivado HLS and Merlin execute tasks sequentially for
simulation while others launch multiple threads. All HLS tools
treat a task-parallel program as a monolithic design and generate
RTL code for each task instance separately, except that Vivado
HLS axis allows programmers to manually instantiate tasks using
a configuration file when running logic synthesis and
implementation.
5.2 Streaming Framework
Streaming applications are a special type of task-parallel
applications that do not require complex control over inter-task
communication and often expose massive data parallelism in addition
to task parallelism. There are previous works that focus specifically
on such applications.
ST-Accel [54] is a high-level programming platform for streaming
applications that features a highly efficient host-kernel
communication interface exposed as a virtual file system (VFS). It
uses Vivado HLS as its backend for hardware generation and its
software simulation is done by sequential execution.
Fleet [60] is a massively parallel streaming framework for FPGAs
that features highly efficient memory interfaces for massive
instances of parallel processing elements. Programmers write Fleet
programs in a domain-specific RTL language based on Chisel [1].
The programs can be simulated in Scala 3.
In summary, while these frameworks are specialized for streaming
patterns, neither of them provides peeking and transaction
interfaces in the kernel. Both run software simulation sequentially,
which does not cause correctness problems for streaming applications
but will be restrictive for general task-parallel programs.
5.3 Alternative APIs
SystemC is a set of C++ classes and macros that provide detailed
hardware modeling and event-driven simulation. It supports both
cycle-accurate and untimed simulation and many simulator
implementations are available [13, 56]. Some HLS tools support a
subset of untimed SystemC as the input [63]. SystemC supports
task-parallel programs natively via the sc_module constructs and
tlm_fifo interfaces. Listing 6 shows an example using the
accelerator discussed in Section 2.3. Compared with other C-like HLS
languages, SystemC can model more hardware details but is more
verbose and less productive due to its special language constructs:
for the code snippets shown in Listing 4 and Listing 5, equivalent
SystemC code would be 37% longer.
Pthread API is a set of widely used standard APIs that can be
used to implement task-parallel programs using threads. Pthread
requires programmers to explicitly create and join threads, and
arguments need to be manually packed and passed. Listing 7 shows
an example using the accelerator discussed in Section 2.3.
Compared with the tapa::invoke API used by TAPA, the pthread APIs
require more effort to program: for the code snippets shown in
Listing 4 and Listing 5, equivalent pthread-based code would be
78% longer.
In summary, while the existing API alternatives are widely used
in some domains, they are more verbose and thus less productive
compared with TAPA.
6 CONCLUSION AND FUTURE WORK
In this paper, we present TAPA as an HLS C++ language extension
to enhance the programming productivity of task-parallel programs
on FPGAs. TAPA has multiple advantages over state-of-the-art HLS
tools: 1) its enhanced programming interface helps to reduce the
lines of kernel code by 22% on average, 2) its unified system
integration interface reduces the lines of host code by 51% on average,
3 Scala is the language in which Chisel is embedded.
SC_MODULE(Ctrl) {
  sc_core::sc_port<tlm::tlm_fifo_put_if<VertexReq>>
      vertex_req;  // declare communication interface
  ...
  SC_CTOR(Ctrl) { SC_THREAD(thread); }
  void thread() { ... }  // task description
};

SC_MODULE(PageRank) {
  // instantiate channels
  tlm::tlm_fifo<VertexReq> vertex_req{/*depth=*/2};
  ...
  Ctrl ctrl;  // instantiate tasks
  ...
  SC_CTOR(PageRank) {
    // bind channels to communication interfaces
    ctrl.vertex_req(vertex_req);
    ...
  }
};
Listing 6: SystemC TLM API example.
#include <pthread.h>

struct Ctrl_Arg {  // task communication interface
  channel<VertexReq>* vertex_req;
  ...
};

// pthread entry points must have the signature void* (*)(void*)
void* Ctrl(void* arg) {  // task description
  Ctrl_Arg* ctrl_arg = (Ctrl_Arg*)arg;  // unpack arguments
  channel<VertexReq>* vertex_req = ctrl_arg->vertex_req;
  ...
  pthread_exit(NULL);
}

void PageRank(...) {
  channel<VertexReq> vertex_req;  // instantiate channels
  ...
  Ctrl_Arg Ctrl_arg;
  Ctrl_arg.vertex_req = &vertex_req;  // pack arguments
  ...
  pthread_t Ctrl_pid, ...;  // launch threads
  pthread_create(&Ctrl_pid, NULL, Ctrl, (void*)&Ctrl_arg);
  ...
  pthread_join(Ctrl_pid, NULL);  // join threads (takes the id by value)
  ...
}
Listing 7: Pthread API example.
3) its coroutine-based software simulator reduces the length of the
correctness verification development cycle by 3.2× on average, and
4) its modularized code generation approach accelerates the QoR
tuning development cycle by 6.8× on average. As a fully automated and
open-source framework, TAPA aims to provide a highly productive
development experience for task-parallel programs using HLS. For
future work, we plan to extend our work to support dynamically
generating and executing tasks on FPGAs.
ACKNOWLEDGMENT
This work is partially supported by a Google Faculty Award, the
NSF RTML program, the Xilinx Adaptive Compute Cluster (XACC)
Program, and the CDSC industrial partners.
REFERENCES
[1] Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman,
Rimas Avižienis, John Wawrzynek, and Krste Asanović. 2012. Chisel: Constructing
Hardware in a Scala Embedded Language. In DAC.
[2] Eli Bendersky. 2018. Measuring context switching and memory
overheads for Linux threads. (2018).
https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/
[3] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona,
Jason Anderson, Stephen Brown, and Tomasz Czajkowski. 2011. LegUp: High-
Level Synthesis for FPGA-Based Processor/Accelerator Systems. In FPGA.
[4] Han Chen, Sergey Madaminov, Michael Ferdman, and Peter Milder. 2020. FPGA-
Accelerated Samplesort for Large Data Sets. In FPGA.
[5] Yu Ting Chen, Jin Hee Kim, Kexin Li, Graham Hoyes, and Jason H. Anderson.
2019. High-Level Synthesis Techniques to Generate Deeply Pipelined Circuits
for FPGAs with Registered Routing. In FPT.
[6] Jianyi Cheng, Shane T. Fleming, Yu Ting Chen, Jason H. Anderson, and George A.
Constantinides. 2019. EASY: Efficient Arbiter SYnthesis from Multi-threaded
Code. In FPGA.
[7] Jianyi Cheng, Lana Josipović, George A. Constantinides, Paolo Ienne, and John
Wickerson. 2020. Combining Dynamic & Static Scheduling in High-level
Synthesis. In FPGA.
[8] Yuze Chi, Young-kyu Choi, Jason Cong, and Jie Wang. 2019. Rapid Cycle-Accurate
Simulator for High-Level Synthesis. In FPGA.
[9] Yuze Chi and Jason Cong. 2020. Exploiting Computation Reuse for Stencil
Accelerators. In DAC.
[10] Yuze Chi, Jason Cong, Peng Wei, and Peipei Zhou. 2018. SODA: Stencil with
Optimized Dataflow Architecture. In ICCAD.
[11] Yuze Chi, Guohao Dai, Yu Wang, Guangyu Sun, Guoliang Li, and Huazhong
Yang. 2016. NXgraph: An Efficient Graph Processing System on a Single Machine.
In ICDE.
[12] Young-kyu Choi, Yuze Chi, Jie Wang, and Jason Cong. 2020. FLASH: Fast,
ParalleL, and Accurate Simulator for HLS. TCAD (2020).
[13] Moo Kyoung Chung, Jun Kyoung Kim, and Soojung Ryu. 2014. SimParallel:
A High Performance Parallel SystemC Simulator Using Hierarchical Multi-
threading. (2014).
[14] Jason Cong, Muhuan Huang, Peichen Pan, Di Wu, and Peng Zhang. 2016.
Software Infrastructure for Enabling FPGA-Based Accelerations in Data Centers. In
ISLPED.
[15] Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and
Zhiru Zhang. 2011. High-Level Synthesis for FPGAs: From Prototyping to
Deployment. TCAD (2011).
[16] Jason Cong and Jie Wang. 2018. PolySA: Polyhedral-Based Systolic Array Auto-
Compilation. In ICCAD.
[17] Jason Cong, Peng Wei, Cody Hao Yu, and Peng Zhang. 2018. Automated Ac-
celerator Generation and Optimization with Composable, Parallel and Pipeline
Architecture. In DAC.
[18] Jason Cong, Peng Wei, Cody Hao Yu, and Peipei Zhou. 2018. Latte: Locality
Aware Transformation for High-Level Synthesis. In FCCM.
[19] Jason Cong and Zhiru Zhang. 2006. An Efficient and Versatile Scheduling
Algorithm Based On SDC Formulation. In DAC.
[20] Melvin E. Conway. 1963. Design of a Separable Transition-Diagram Compiler.
Commun. ACM 6, 7 (1963), 396–408.
[21] Leonardo Dagum and Ramesh Menon. 1998. OpenMP: An Industry Standard API
for Shared-Memory Programming. IEEE Computational Science and Engineering
5, 1 (1998), 46–55.
[22] Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. 2016. FPGP: Graph
Processing Framework on FPGA A Case Study of Breadth-First Search. In FPGA.
[23] Guohao Dai, Tianhao Huang, Yuze Chi, Ningyi Xu, Yu Wang, and Huazhong
Yang. 2017. ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA
Architecture. In FPGA.
[24] Johannes De Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler. 2020.
Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level
Synthesis. In FPGA.
[25] Ana Lúcia de Moura and Roberto Ierusalimschy. 2009. Revisiting Coroutines.
TOPLAS 31, 2 (2009).
[26] Chenhui Deng, Zhiqiang Zhao, Yongyu Wang, Zhiru Zhang, and Zhuo Feng.
2020. GraphZoom: A Multi-level Spectral Approach for Accurate and Scalable
Graph Embedding. In ICLR.
[27] Licheng Guo, Jason Lau, Yuze Chi, Jie Wang, Cody Hao Yu, Zhe Chen, Zhiru
Zhang, and Jason Cong. 2020. Analysis and Optimization of the Implicit Broad-
casts in FPGA HLS to Improve Maximum Frequency. In DAC.
[28] Ameer Haj-Ali, Qijing Huang, William Moses, John Xiang, Krste Asanovic, John
Wawrzynek, and Ion Stoica. 2020. AutoPhase: Juggling HLS Phase Orderings in
Random Forests with Deep Reinforcement Learning. In MLSys.
[29] C. A. R. Hoare. 1978. Communicating Sequential Processes. Commun. ACM 21,
8 (1978).
[30] Hsuan Hsiao and Jason Anderson. 2019. Thread Weaving: Static Resource
Scheduling for Multithreaded High-Level Synthesis. In DAC.
[31] Intel. 2020. Intel FPGA SDK for OpenCL Pro Edition: Programming Guide.
(2020).
[32] Intel. 2020. Intel High Level Synthesis Compiler Pro Edition: User Guide. (2020).
[33] Al Shahna Jamal, Eli Cahill, Jeffrey Goeders, and Steven J. E. Wilton. 2020. Fast
Turnaround HLS Debugging using Dependency Analysis and Debug Overlays.
TRETS 13, 1 (2020).
[34] Jiantong Jiang, Zeke Wang, Xue Liu, Juan Gómez-Luna, Nan Guan, Qingxu
Deng, Wei Zhang, and Onur Mutlu. 2020. Boyi: A Systematic Framework for
Automatically Deciding the Right Execution Model of OpenCL Applications on
FPGAs. In FPGA.
[35] Lana Josipović, Shabnam Sheikhha, Andrea Guerrieri, Paolo Ienne, and Jordi
Cortadella. 2020. Buffer Placement and Sizing for High-Performance Dataflow
Circuits. In FPGA.
[36] Gilles Kahn. 1974. The Semantics of a Simple Language for Parallel Programming.
In IFIP.
[37] Soroosh Khoram, Jialiang Zhang, Maxwell Strange, and Jing Li. 2018. Accelerat-
ing Graph Analytics by Co-Optimizing Storage and Access on an FPGA-HMC
Platform. In FPGA.
[38] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with
Graph Convolutional Networks. In ICLR.
[39] Donald Ervin Knuth. 1997. Fundamental Algorithms. The Art of Computer
Programming 1 (3rd ed.).
[40] Mostafa Koraei, Omid Fatemi, and Magnus Jahre. 2019. DCMI: A Scalable Strategy
for Accelerating Iterative Stencil Loops on FPGAs. TACO 16, 4 (2019).
[41] Oliver Kowalke. 2014. Boost Library Documentation, Coroutine2. (2014).
https://boost.org/doc/libs/1_65_0/libs/coroutine2/doc/html/coroutine2/intro.html
[42] Duncan H. Lawrie. 1975. Access and Alignment of Data in an Array Processor.
ToC C-24, 12 (1975).
[43] Edward A. Lee and David G. Messerschmitt. 1987. Synchronous Data Flow. IEEE
75, 9 (1987).
[44] Hyuk-Jae Lee, James P. Robertson, and José A.B. Fortes. 1997. Generalized
Cannon's Algorithm for Parallel Matrix Multiplication. In ICS.
[45] Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. 2009.
Community Structure in Large Networks: Natural Cluster Sizes and the Absence
of Large Well-Defined Clusters. Internet Mathematics 6, 1 (2009), 29–123.
[46] Jiajie Li, Yuze Chi, and Jason Cong. 2020. HeteroHalide: From Image Processing
DSL to Ecient FPGA Acceleration. In FPGA.
[47] Julian Mcauley. 2012. Learning to Discover Social Circles in Ego Networks. In
NIPS.
[48] Eriko Nurvitadhi, Gabriel Weisz, Yu Wang, Skand Hurkat, Marie Nguyen, James C.
Hoe, José F Martínez, and Carlos Guestrin. 2014. GraphGen: An FPGA Framework
for Vertex-Centric Graph Computation. In FCCM.
[49] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1998. The
PageRank Citation Ranking: Bringing Order to the Web. Technical Report.
[50] Philippos Papaphilippou, Jiuxi Meng, and Wayne Luk. 2020. High-Performance
FPGA Network Switch Architecture. In FPGA.
[51] James L Peterson. 1977. Petri Nets. Comput. Surveys 9, 3 (1977).
[52] Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan
Ragan-Kelley, and Mark Horowitz. 2017. Programming Heterogeneous Systems from
an Image Processing DSL. TACO 14, 3 (2017).
[53] Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. 2013. X-Stream: Edge-
centric Graph Processing using Streaming Partitions. In SOSP.
[54] Zhenyuan Ruan, Tong He, Bojie Li, Peipei Zhou, and Jason Cong. 2018. ST-Accel:
A High-Level Programming Platform for Streaming Applications on FPGA. In
FCCM.
[55] Vladimir Rybalkin and Norbert Wehn. 2020. When Massive GPU Parallelism
Ain’t Enough: A Novel Hardware Architecture of 2D-LSTM Neural Network. In
FPGA.
[56] Tim Schmidt, Guantao Liu, and Rainer Dömer. 2017. Exploiting Thread and Data
Level Parallelism for Ultimate Parallel SystemC Simulation. In DAC.
[57] Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian
Gallagher, and Tina Eliassi-Rad. 2008. Collective Classification in Network Data. AI
Magazine 29, 3 (2008), 93–106.
[58] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Net-
works for Large-Scale Image Recognition. In ICLR.
[59] Atefeh Sohrabizadeh, Jie Wang, and Jason Cong. 2020. End-to-End Optimization
of Deep Learning Applications. In FPGA.
[60] James Thomas, Pat Hanrahan, and Matei Zaharia. 2020. Fleet: A Framework for
Massively Parallel Streaming on FPGAs. In ASPLOS.
[61] Yu Wang, James C. Hoe, and Eriko Nurvitadhi. 2019. Processor Assisted
Worklist Scheduling for FPGA Accelerated Graph Processing on a Shared-Memory
Platform. In FCCM.
[62] Xuechao Wei, Yun Liang, and Jason Cong. 2019. Overcoming Data Transfer
Bottlenecks in FPGA-based DNN Accelerators via Layer Conscious Memory
Management. In DAC.
[63] Xilinx. 2020. Vivado Design Suite User Guide: High-Level Synthesis (UG902).
(2020).
[64] Tanner Young-Schultz, Lothar Lilge, Stephen Brown, and Vaughn Betz. 2020.
Using OpenCL to Enable Software-like Development of an FPGA-Accelerated
Biophotonic Cancer Treatment Simulator. In FPGA.
[65] Hanqing Zeng and Viktor Prasanna. 2020. GraphACT: Accelerating GCN training
on CPU-FPGA heterogeneous platforms. In FPGA.
[66] Jialiang Zhang and Jing Li. 2018. Degree-aware Hybrid Graph Traversal on
FPGA-HMC Platform. In FPGA.
[67] Shijie Zhou, Rajgopal Kannan, Viktor K Prasanna, Guna Seetharaman, and Qing
Wu. 2019. HitGraph: High-throughput Graph Processing Framework on FPGA.
TPDS (2019).
[68] Hamid Reza Zohouri, Artur Podobas, and Satoshi Matsuoka. 2018. Combined
Spatial and Temporal Blocking for High-Performance Stencil Computation on
FPGAs Using OpenCL. In FPGA.
