AutoDSE: Enabling Software Programmers Design Efficient FPGA
  Accelerators by Sohrabizadeh, Atefeh et al.
Atefeh Sohrabizadeh¹*, Cody Hao Yu¹*, Min Gao², and Jason Cong¹,²
* indicates co-first authors for this work
¹ Computer Science Department, University of California, Los Angeles, USA
² Falcon-computing Inc., USA
{atefehsz, hyu, cong}@cs.ucla.edu, mingao@falcon-computing.com
ABSTRACT
Adopting FPGA as an accelerator in datacenters is becoming main-
stream for customized computing, but the fact that FPGAs are hard
to program creates a steep learning curve for software program-
mers. Even with the help of high-level synthesis (HLS), accelerator
designers still have to manually perform code reconstruction and
cumbersome parameter tuning to achieve the optimal performance.
While many learning models have been leveraged by existing work
to automate the design of efficient accelerators, the unpredictability
of modern HLS tools becomes a major obstacle for them to maintain
high accuracy. In this paper, we address this problem with an
automated DSE framework, AutoDSE, that leverages a bottleneck-guided
gradient optimizer to systematically find a better design point.
AutoDSE identifies the bottleneck of the design in each step and
focuses on the high-impact parameters that overcome it, which is
similar to the approach an expert would take. The experimental
results show that AutoDSE finds design points that achieve, on
geometric mean, a 19.9× speedup over one CPU core for the Machsuite
and Rodinia benchmarks and a 1.04× speedup over the manually designed
HLS-accelerated vision kernels in the Xilinx Vitis libraries, with a
26× reduction in their optimization pragmas. With less than one
optimization pragma per design on average, we are making progress
towards democratizing customizable computing by enabling software
programmers to design efficient FPGA accelerators.
1 INTRODUCTION
Due to the rapid growth of datasets in recent years, the demand
for scalable, high-performance computing continues to increase.
However, the breakdown of Dennard’s scaling [13] has led to en-
ergy efficiency becoming an important concern in datacenters,
and has spawned exploration into using accelerators such as field-
programmable gate arrays (FPGAs) to alleviate power consumption.
For example, Microsoft has adopted CPU-FPGA systems in its data-
center to help accelerate the Bing search engine [26]; Amazon
introduced the F1 instance [1], a compute instance equipped with
FPGA boards, in its commercial Elastic Compute Cloud (EC2).
On the other hand, FPGAs are more difficult to program than CPUs and
GPUs because the traditional register-transfer level (RTL)
programming model resembles circuit design rather than software
implementation. To improve programmability, high-level synthesis
(HLS) [10, 43] has attracted a large amount of attention over the
past decades. Currently, both major FPGA vendors offer commercial HLS
products: Xilinx SDx [36] and Intel FPGA SDK for OpenCL [18]. In this
paper, we target Xilinx FPGAs as an example, but our approach is
extendable to Intel FPGAs, as they are also supported by the Merlin
Compiler [8, 9, 14]. Code 1 shows an intuitive HLS C implementation
of one forward path of a Convolutional Neural Network (CNN) on Xilinx
FPGAs. Xilinx SDx can generate about 5800 lines of RTL kernel with
the same functionality from the ∼70 lines of code in Code 1. As a
result, it is much more efficient for designers to evaluate and
improve their architectures in HLS C/C++.
Code 1: CNN HLS C Code Snippet
// Markers (1)-(5) refer to the numbered rows of Table 2
// Skip const variable initialization due to page limit
#define input(j, h, w) \
  input[((j) * InImSize * InImSize + (h) * InImSize + (w))]
// Skip the rest due to page limit
void CnnKernel(const float* input /* (1) */, const float* weight /* (1) */,
               const float* bias /* (1) */, float* output /* (1) */) {
  float C[ParallelOut][ImSize][ImSize];
  for (int i = 0; i < NumOut / ParallelOut; ++i) {  // (4)
    // Initialization
    for (int h = 0; h < ImSize; ++h) {
      for (int w = 0; w < ImSize; ++w) {
        for (int po = 0; po < ParallelOut; po++)
          C[po][h][w] = bias[(i << shift) + po];
      }
    }
    // Convolution
    for (int j = 0; j < NumIn; ++j) {                  // (5)
      for (int h = 0; h < ImSize; ++h) {               // (5)
        for (int w = 0; w < ImSize; ++w) {             // (5)
          for (int po = 0; po < ParallelOut; po++) {   // (5)
            for (int p = 0; p < kKernel; ++p) {        // (5)
              for (int q = 0; q < kKernel; ++q)        // (5)
                C[po][h][w] +=
                    weight(i, po, j, p, q) * input(j, h + p, w + q);  // (2)
            }
          }
        }
      }
    }  // (3)
    // ReLU + Max pooling
    for (int h = 0; h < OutImSize; ++h) {              // (5)
      for (int w = 0; w < OutImSize; ++w) {            // (5)
        for (int po = 0; po < ParallelOut; po++) {     // (5)
          output(i, h, w) = max(0.f, max(
              max(C[po][h * 2][w * 2], C[po][h * 2 + 1][w * 2]),
              max(C[po][h * 2][w * 2 + 1], C[po][h * 2 + 1][w * 2 + 1])));
        }
      }
    }
  }
}
Although HLS is suitable for hardware experts to quickly implement a
design, it is not friendly to software designers who have limited
FPGA domain knowledge. Since the hardware architecture inferred from
a syntactic C implementation could be ambiguous, current commercial
HLS tools usually generate architecture structures according to
specific HLS C/C++ code patterns. As a result, even though Cong et
al. [10] illustrated that an HLS tool is capable of generating FPGA
designs with performance competitive with that of RTL designs, not
every C program performs well, and designers must manually
reconstruct the HLS C/C++ kernel with specific code patterns to
achieve high performance. In fact, the FPGA accelerator generated
from Code 1 is 80× slower than a single-thread CPU, whereas the
optimized code shown in Code 2 achieves around 7,041× speedup with 28
pragmas after we analyze and resolve the performance bottlenecks
listed in Table 2.
The authors used Code 1 as a lab assignment for an upper-division
undergraduate course at their institution. The students were asked to
accelerate this application as much as they could using OpenCL,
targeting different platforms: CPU, GPU, and FPGA. Note that OpenCL
is easier for beginner FPGA developers to use than HLS C/C++, since
it applies some optimizations, such as memory coalescing, by default.
Table 1 summarizes the number of student submissions in six
performance ranges. The performance numbers are normalized with
respect to 75% of the expert design's performance, which was required
for the students to get the full grade. As can be seen, although the
students performed well when targeting CPU and GPU, programming the
FPGA was challenging for them. The results suggest that the required
code transformation, pragma insertion, and pragma tuning present a
significant barrier to a software programmer targeting FPGAs.
arXiv:2009.14381v1 [cs.AR] 30 Sep 2020
Conference’20, Sep 2020, Los Angeles, CA, USA Sohrabizadeh and Yu, et al.
Code 2: Optimized CNN HLS C Code Snippet
// Skip const variable initialization due to page limit
void CnnKernel(const ap_uint<128>* input, float weight,
               const ap_uint<512>* bias, ap_uint<512>* output) {
#pragma HLS INTERFACE m_axi port=input bundle=gmem1 depth=3326977
#pragma HLS INTERFACE s_axilite port=input bundle=control
  // Skip the rest due to page limit
  float bias_buf[ParallelOut][ParallelOut];
#pragma HLS array_partition variable=bias_buf complete dim=2
  float C[ParallelOut][ImSize][ImSize];
#pragma HLS array_partition variable=C cyclic factor=8 dim=3
#pragma HLS array_partition variable=C cyclic factor=2 dim=2
#pragma HLS array_partition variable=C complete dim=1
  LoadBurst(bias, bias_buf);
  for (int i = 0; i < NumOut / ParallelOut; i++) {
    float weight_buf[NumOut / ParallelOut][NumIn][kKernel][kKernel];
#pragma HLS array_partition variable=weight_buf complete dim=4
#pragma HLS array_partition variable=weight_buf complete dim=3
#pragma HLS array_partition variable=weight_buf complete dim=1
    float output_buf[NumOut / ParallelOut][OutImSize][OutImSize];
#pragma HLS array_partition variable=output_buf cyclic factor=16 dim=3
#pragma HLS array_partition variable=output_buf complete dim=1
    LoadBurst(weight, weight_buf);
    // Initialization
    for (int h = 0; h < ImSize; ++h) {
      for (int w = 0; w < ImSize / 4; ++w) {
#pragma HLS dependence variable=C array inter false
#pragma HLS pipeline
        for (int w_sub = 0; w_sub < 4; ++w_sub) {
#pragma HLS unroll
          for (int po = 0; po < ParallelOut; po++) {
#pragma HLS unroll
            C[po][h][w * 4 + w_sub] = 0.f;
          }
        }
      }
    }
    // Convolution
    for (int j = 0; j < NumIn; ++j) {
      float input_buf[InImSize][InImSize];
#pragma HLS array_partition variable=input_buf cyclic factor=8 dim=2
#pragma HLS array_partition variable=input_buf cyclic factor=5 dim=1
      LoadBurst(input, input_buf);
      for (int h = 0; h < ImSize; ++h) {
        for (int w = 0; w < ImSize / 4; ++w) {
#pragma HLS dependence variable=C array inter false
#pragma HLS pipeline
          for (int w_sub = 0; w_sub < 4; ++w_sub) {
#pragma HLS unroll
            for (int po = 0; po < ParallelOut; po++) {
#pragma HLS unroll
              float tmp = 0.f;
              for (int p = 0; p < kKernel; ++p) {
#pragma HLS unroll
                for (int q = 0; q < kKernel; ++q) {
#pragma HLS unroll
                  tmp += ...;
                }
              }
              C[po][h][w * 4 + w_sub] += tmp;
            }
          }
        }
      }
    }
    // ReLU + Max pooling
    for (int h = 0; h < OutImSize; ++h) {
      for (int w = 0; w < OutImSize; ++w) {
#pragma HLS dependence variable=output_buf array inter false
#pragma HLS pipeline
        for (int po = 0; po < ParallelOut; po++) {
#pragma HLS unroll
          output_buf(h, w, po) = ...
        }
      }
    }
    StoreBurst(output, output_buf);
  }
}
It turns out that the bottlenecks presented in Table 2 occur in most
C/C++ programs developed by software programmers, and similar
optimizations have to be repeated for each new application, which
makes HLS C/C++ design hard to scale. A possible solution is to apply
automated micro-architecture optimization, so that everyone with
decent programming knowledge can try customized computing with
minimum effort. To free accelerator designers from the iterations of
HLS design improvement, automated design space exploration (DSE) for
HLS is attracting more and more attention. However, the recent
advances in HLS tools have brought new challenges in designing DSE
methods.
Table 1: Number of Students’ Submissions for Acceleration of Code 1
in Each Range. The Performances Are Normalized with Respect to 75% of
Expert Design’s Performance (The Required Performance).
Platform  [0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1]  (1, ∞)
CPU       15        7           9           6           14        23
GPU       9         25          14          10          4         11
FPGA      69        3           0           0           0         0
Table 2: Analysis of Poor Performance in Code 1
No.  Reason                               Required Code Changes for Higher Performance
1    Low bandwidth utilization            Manually apply memory coalescing using the HLS built-in type ap_int.
2    Low bandwidth utilization            Manually allocate a local buffer and use memcpy to enable memory burst.
3    Does not hide communication latency  Manually create load/compute/store functions and double buffering.
4    Lack of parallelism                  Manually create a function to wrap the loop and set proper array partition factors.
5    Sequential execution                 Apply #pragma HLS pipeline and #pragma HLS unroll with proper array partition factors, or fuse loops.
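To make the second row of Table 2 concrete, the following is a minimal sketch of the "local buffer plus memcpy" pattern. The names `load_burst`, `store_burst`, `scale_tiles`, and the buffer size `TILE` are hypothetical and do not come from the paper; whether a particular HLS tool actually infers a burst transfer from `memcpy` is tool- and version-dependent, so this should be read as an illustration of the code shape rather than a guaranteed optimization.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 64 /* hypothetical on-chip buffer size, in elements */

/* Copy one contiguous tile from off-chip memory into a local buffer.
 * A single memcpy over a contiguous range is the pattern an HLS tool
 * can typically map to a burst transfer (assumption, tool-dependent). */
static void load_burst(const float *dram, size_t offset, float buf[TILE]) {
  memcpy(buf, dram + offset, TILE * sizeof(float));
}

/* Write one contiguous tile from the local buffer back to off-chip memory. */
static void store_burst(float *dram, size_t offset, const float buf[TILE]) {
  memcpy(dram + offset, buf, TILE * sizeof(float));
}

/* Scale n elements tile by tile; n must be a multiple of TILE.
 * All arithmetic touches only the local buffer, never DRAM directly. */
void scale_tiles(const float *in, float *out, size_t n, float k) {
  float buf[TILE];
  assert(n % TILE == 0);
  for (size_t base = 0; base < n; base += TILE) {
    load_burst(in, base, buf);
    for (size_t i = 0; i < TILE; ++i) buf[i] *= k;
    store_burst(out, base, buf);
  }
}
```

Separating the load, compute, and store steps in this way is also the starting point for the double buffering mentioned in row 3 of Table 2.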
Challenge 1: The large solution space: The various combinations of
pragmas that can be applied to a program make exploring the whole
design space impractical. In the simplest case, one can either insert
a pipeline pragma, insert an unroll pragma, or insert no pragma at
all. If unrolling is applied to a loop, the partition factors of the
buffers used inside that loop determine the parallelization factor of
the loop; thus, different types of array partitioning (complete,
cyclic, block) and their factors need to be explored. These three
pragmas alone generate a huge design space: they produce 10^20 design
points for Code 1.
Challenge 2: Non-monotonic effect of design parameters on
performance/area: With the latest HLS tools as well as the larger
design spaces enumerated in this paper (see Section 3 for details),
we cannot assume that an individual design parameter affects the
performance/area in a smooth and/or monotonic way. For instance,
Fig. 1 depicts the execution cycles of the N-W algorithm [24] with
different parallel factors for its 5 loops, synthesized by Xilinx
SDx [36]. Although the performance trends of 3 loops are ideal, those
of the other 2 loops (CG-loop-2 and FG-loop-1) are not. (CG and FG
mean coarse-grained and fine-grained, respectively.)
Figure 1: HLS Cycles of N-W with Different Factors on Loops. [Plot
omitted: x-axis is the parallel factor (1 to 9); y-axis is the
execution cycle count (0 to 4.0, ×10^7); series are CG-loop-1,
CG-loop-2, FG-loop-1, FG-loop-2, and FG-loop-3.]
Challenge 3: Correlation of different characteristics of a design:
When different pragmas are employed together in a design, they do not
affect only one characteristic of the design. One has to take the
interaction between them into account when estimating the latency and
area consumption of the design. Take the convolution part of Code 1
as an example. Applying fine-grained pipelining to the w loop and
parallelizing it with a factor of 2 results in a loop with an
initiation interval (II) of 2 when synthesized by Vivado HLS [34].
However, when the parallel factor is changed to 4, the HLS tool
increases the II to 3 instead of doubling the resource utilization,
reusing some of the logic units to reduce resource consumption. Note
that these behaviors may differ from version to version; therefore,
it is impractical to maintain an analytical model for DSE.
Furthermore, pipelining the j loop is part of the best design
configuration. However, it does not improve the performance until
after fine-grained pipelining is applied to the w loop. This suggests
that the order of applying the pragmas is crucial in designing the
exploration technique.
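The interaction described in Challenge 3 between the parallel factor and the initiation interval can be illustrated with the textbook latency estimate for a pipelined loop, (iterations - 1) × II + depth. The sketch below is not the paper's model; the function name `pipelined_cycles`, the trip count, and the pipeline depth are hypothetical, and only the (factor, II) pairs (2, 2) and (4, 3) come from the text.

```c
#include <assert.h>

/* Textbook cycle estimate for a pipelined loop:
 *   (iterations - 1) * II + depth,
 * where iterations = trip_count / factor after parallelizing by `factor`.
 * This is an illustrative approximation, not what an HLS tool reports. */
long pipelined_cycles(long trip_count, long factor, long ii, long depth) {
  assert(factor > 0 && trip_count % factor == 0);
  long iterations = trip_count / factor;
  return (iterations - 1) * ii + depth;
}
```

With a hypothetical trip count of 224 and a pipeline depth of 10, `pipelined_cycles(224, 2, 2, 10)` gives 232 cycles and `pipelined_cycles(224, 4, 3, 10)` gives 175 cycles. Doubling the parallel factor therefore yields roughly a 1.33× gain rather than 2×, which is the kind of coupled behavior that makes a per-parameter analytical model unreliable.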
Challenge 4: Implementation disparity of HLS tools: The
implementation of HLS tools is not fixed across different vendors.
Furthermore, the HLS tool implementation of the same vendor changes
from version to version. For example, past Xilinx SDx versions
consistently utilize registers to implement array partitions with
small sizes to save BRAMs, but if an array partition is required to
support two reads in one cycle to achieve full pipelining (II = 1),
the latest Xilinx SDx marks the partition as true-dual-port and uses
dual-port BRAMs to implement it, even if the array size is small.
Such implementation details are hard to capture and maintain in
analytical models. This makes it difficult to port an analytical
model built on a specific HLS tool to another.
Challenge 5: Long synthesis time of HLS tools: A major challenge of
using vendor HLS tools directly for DSE is the long evaluation time:
vendor HLS tools usually take 5-30 minutes to generate RTL and
estimate the performance, and even longer for high-performance
designs. This emphasizes the need for a DSE approach that can find
the Pareto-optimal design points in fewer iterations.
To solve the challenges mentioned above, in this paper we treat the
HLS tool as a black box and first apply gradient descent with the
finite difference method to guide our explorer. Then, we discuss the
deficiency of problem-independent heuristics and the gradient-based
approach for the HLS DSE problem and present the AutoDSE framework,
which adapts a bottleneck-guided gradient optimizer to systematically
search for better configurations. (The code will be open-sourced when
the paper is accepted.) We show that our bottleneck-based optimizer
can outperform the general hyper-heuristics used in the literature.
Furthermore, it outperforms the naive gradient-based approach we
adapted. It also shortens the exploration time, as it follows the
behavior of an expert and focuses on high-impact design parameters
first. To represent a grid design space with all invalid points
marked, we incorporate a flexible list-comprehension syntax to
represent a design space as well as its checking rules. In addition,
we partition the design space systematically to address the
local-optimum problem caused by non-smooth/non-monotonic
performance/area trends. In summary, this paper makes the following
contributions:
• We propose two strategies to guide DSE. One adapts naive gradient
descent with the finite difference method, and the other exploits a
bottleneck-guided gradient optimizer.
• We incorporate list comprehension to represent a smooth, grid
design space with all invalid points marked.
• We develop the AutoDSE framework on top of the Merlin Compiler to
automatically perform DSE using the bottleneck optimizer, which
mimics an expert optimizing the code, to systematically close in on
design points with a high quality of result (QoR).
• Evaluation results indicate that AutoDSE achieves a geometric-mean
speedup of 1.04× in 0.3 hours with respect to 33 kernels from the
Xilinx optimized vision library [37], with a 26× reduction in their
optimization pragmas, resulting in less than one required
optimization pragma per kernel.
• We evaluate AutoDSE on 11 computational kernels from the
Machsuite [27] and Rodinia [5] benchmarks and one convolution layer
of AlexNet [20] on the Amazon EC2 F1 instance [1], showing that we
achieve, on geometric mean, a 19.9× speedup over a single-thread CPU,
only a 7% performance gap compared to manual designs.
2 RELATED WORK
There are a number of previous works that propose an automated
framework to explore the HLS design space, and they can be sum-
marized in two categories, model-based and model-free techniques.
2.1 Model-based Techniques
The studies in this category build an analytical model for evalu-
ating the quality of each explored design point by estimating its
performance and resource utilization. The authors in [35, 44, 46]
build the dependence graph of the target application and utilize
graph analysis techniques along with predictive models to search
for the best design. Although this approach can quickly search
through the whole design space, it is inaccurate, and it is difficult
to maintain the model and port it to other HLS vendors or versions,
as explained in Challenge 4 of Section 1. Zhong et al. [48] develop
a simple analytical model for performance and area estimation.
However, their model is based on the assumption that the
performance/area changes monotonically when an individual design
parameter is modified, which is not a valid assumption, as we
explained in Challenge 2 of Section 1. To increase the accuracy of
the estimation model, a number of other studies restrict the target
application to those that have a well-defined accelerator
micro-architecture template [6, 11, 12, 28, 32, 42], a specific
application [39, 45], or a particular computation pattern
[7, 19, 25]; hence, they lose generality.
To the same end, there are other studies that build the predictive
model by synthesizing a set of sample designs and iteratively up-
dating it until the model gets to the desired accuracy. Later on, they
use the trained model for estimating the quality of design instead
of invocations of the HLS tool. To learn the behavior of the HLS
tool, these works adapt supervised learning algorithms to better
capture uncertainty of HLS tools [19, 21, 22, 29, 40, 47]. While this
technique increases the accuracy of the model, it is still hard to port
the model to another HLS tool in a different vendor or version. As
a result, for each of them, a new model should be trained.
2.2 Model-free Approaches
To avoid dealing with the uncertainty of HLS tools, the studies in
this category treat the HLS tool as a black box. Instead of learning
a predictive model, they invoke the HLS tool every time to evaluate
the quality of the design. To guide the search, they either exploit
general problem-independent heuristics (e.g., simulated annealing
[23] and genetic algorithm [30]) or develop their own heuristics
[15, 16, 31]. S2FA [41] uses a hyper-heuristic approach with several
optimization strategies to reduce the DSE iterations. The authors
employ a multi-armed bandit [17] to combine a set of heuristic
algorithms including uniform greedy mutation, differential evolution
genetic algorithm, particle swarm optimization, and simulated
annealing. However, as we will present in Section 5.1.1, general
hyper-heuristic approaches are unstable in finding design
configurations with a high quality of result (QoR). Moreover, the
authors in [15, 16] claim that Pareto-optimal design points cluster
together. They exploit an initial sampling to build the first
approximation of the Pareto frontier and require local searches to
explore other candidates. However, the cost of the initial sampling
is not scalable when the design space is tremendously large (e.g., on
the scale of 10^10 to 10^30), as the ones we have enumerated in this
paper are. Even if it samples only 1% of the design space (the lowest
sampling rate they use), it means 10^8 to 10^28 design points.
3 PROBLEM FORMULATION
Our goal is to expedite hardware design by automating its
exploration process. Even though high-level synthesis (HLS) is now
widely used to facilitate the FPGA accelerator development cycle, as
illustrated in Section 1, specific code patterns are still necessary
to let the HLS tools apply certain architecture structures. For
instance, although the optimized HLS code in Code 2 can achieve a
speedup of several thousand times, it has about 3× more lines of code
than the original code it is modified from. This implies not only
time-consuming code reconstruction effort but also an impediment to
automated design space exploration.
In general, there are two types of pragmas (using Vivado HLS as an
example) that are applied to a program. One type is the
non-optimization pragmas, such as the INTERFACE pragmas at the top of
Code 2, which specify the sizes of the interface variables and the
values of loop bounds (not shown here). These pragmas are relatively
easy for software programmers to learn and apply. The other type is
optimization pragmas, including the PIPELINE and UNROLL pragmas.
These pragmas require knowledge of FPGA devices and
micro-architecture optimization experience, which are usually much
more challenging for a software programmer to learn and master. The
experiment at the authors' institution supports this claim. As
explained in Section 1, in this experiment, students were asked to
optimize the performance of Code 1 targeting different platforms. The
results are summarized in Table 1. Because non-expert FPGA
programmers find it difficult to choose the best locations for the
pragmas, the types of the pragmas, and their tuning, the best FPGA
submission achieved only 0.23× of the performance of the best design.
The goal of this research is to minimize or eliminate the need for
manually written optimization pragmas and let AutoDSE insert them
automatically. More formally, we formulate the HLS DSE problem as
follows:
Problem 1: Identify Design Space. Given a C program $P$ as the FPGA
accelerator kernel, construct a design space $R^K_P$ with $K$
parameters that contains possible combinations of HLS pragmas for $P$
as design configurations.
Problem 2: Find the Optimal Configuration. Given a C program $P$, one
would like to insert a minimal number of optimization pragmas to get
a new program $P'$ as the FPGA accelerator kernel along with its
design space set $R^K_{P'}$, which is identified in Problem 1, and a
vendor HLS tool $H$ that estimates the execution cycle
$\mathrm{Cycle}(H, P')$ and the resource utilization
$\mathrm{Util}(H, P')$ of the given $P'$ as a black-box evaluation
function. Find a configuration $\theta \in R^K_{P'}$ in a given
search time limit so that the generated design $P'(\theta)$ with
$\theta$ can fit in the FPGA and the execution cycle is minimized.
Formally, we define the problem as:
$$\min_{\theta} \; \mathrm{Cycle}(H, P'(\theta)) \tag{1}$$
subject to
$$\theta \in R^K_{P'} \tag{2}$$
$$\forall u \in \mathrm{Util}(H, P'(\theta)), \quad u < T_u \tag{3}$$
where $u$ is the utilization of one of the FPGA on-chip resources and
$T_u$ is a user-available resource threshold on FPGAs. We set all
$T_u$ to be 0.8, an empirical threshold, in our experiments. Beyond
0.8, the design will suffer from high clock frequency degradation due
to the difficulty in placement and routing. In addition, the rest of
the resources are left for the interface logic of the vendor HLS
tool.
Code 3: CNN Code Snippet in Merlin C
void CnnKernel(
    const float input[NumIn][InImSize][InImSize],
    const float weight[NumOut][NumIn][kKernel][kKernel],
    const float bias[NumOut],
    float output[NumOut][OutImSize][OutImSize]) {
  float C[ParallelOut][ImSize][ImSize];
  for (int i = 0; i < NumOut / ParallelOut; i++) {
    // Initialization
    for (int h = 0; h < ImSize; ++h) {
#pragma ACCEL parallel factor=4
      for (int w = 0; w < ImSize; ++w) {
        for (int po = 0; po < ParallelOut; po++)
          C[po][h][w] = 0.f;
      }
    }
    // Convolution
#pragma ACCEL pipeline
    for (int j = 0; j < NumIn; ++j) {
      for (int h = 0; h < ImSize; ++h) {
#pragma ACCEL parallel factor=4
#pragma ACCEL pipeline FLATTEN
        for (int w = 0; w < ImSize; ++w) {
          for (int po = 0; po < ParallelOut; po++) {
            float tmp = 0;
            for (int p = 0; p < kKernel; ++p) {
              for (int q = 0; q < kKernel; ++q) {
                tmp += ...
              }
            }
            C[po][h][w] += tmp;
          }
        }
      }
    }
    // ReLU + Max Pooling
    for (int h = 0; h < OutImSize; ++h) {
      for (int w = 0; w < OutImSize; ++w) {
        for (int po = 0; po < ParallelOut; po++) {
          output[(i << shift) + po][h][w] = ...
        }
      }
    }
  }
}
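The objective in Eq. (1) and the resource constraint in Eq. (3) of Problem 2 can be sketched as a selection over already-evaluated design points. This is only an illustration of the formulation, not AutoDSE's explorer: the type `HlsResult`, the functions `is_feasible` and `best_config`, and the choice of four tracked resources are assumptions made for the example; only the 0.8 threshold comes from the text.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_RESOURCES 4    /* assumed set, e.g. LUT, FF, BRAM, DSP */
#define UTIL_THRESHOLD 0.8 /* T_u used in the paper's experiments */

/* One evaluated design point, as a black-box HLS run might report it. */
typedef struct {
  long cycles;                /* Cycle(H, P'(theta)) */
  double util[NUM_RESOURCES]; /* Util(H, P'(theta)), each in [0, 1] */
} HlsResult;

/* Constraint (3): every resource must stay strictly below T_u. */
int is_feasible(const HlsResult *r) {
  for (size_t i = 0; i < NUM_RESOURCES; ++i)
    if (r->util[i] >= UTIL_THRESHOLD) return 0;
  return 1;
}

/* Objective (1): index of the feasible result with the fewest cycles,
 * or -1 when no evaluated point satisfies the constraint. */
int best_config(const HlsResult *results, size_t n) {
  int best = -1;
  assert(results != NULL || n == 0);
  for (size_t i = 0; i < n; ++i) {
    if (!is_feasible(&results[i])) continue;
    if (best < 0 || results[i].cycles < results[best].cycles) best = (int)i;
  }
  return best;
}
```

Note that a design with fewer cycles is rejected outright if any single resource reaches the threshold, so the fastest evaluated point is not necessarily the one selected.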
Note that we introduce two optimization objectives: one is to
minimize the optimization pragmas inserted to obtain P′, and the
other is to maximize the performance of P′ using AutoDSE. Obviously,
there is a trade-off between the two. An expert designer can always
get an optimized micro-architecture to achieve the best performance
by inserting enough HLS optimization pragmas. However, it is
time-consuming and not feasible for software programmers with little
or no FPGA design experience. In our evaluation, our goal is to match
the performance of well-designed HLS library code (typically written
by experts) while using far fewer optimization pragmas. Indeed, our
experimental results in Section 6 show that we can achieve our goal
with a 26× pragma reduction on geometric mean, requiring less than 1
pragma per kernel.
4 THE AUTODSE FRAMEWORK
To reduce the size of the design space, we build our DSE on top of
the Merlin Compiler [8, 9]. Section 4.1 reviews Merlin Compiler
and justifies our choice. Then, we present an overview of AutoDSE
in Section 4.2.
4.1 Merlin Compiler and Design Space
Definition
In order to reduce the design space, we chose to utilize the Merlin
Compiler [8, 9] developed by Falcon Computing Solutions [14] as
the backend of our tool as it provides a small set of pragmas to rep-
resent optimization strategies from the perspective of architecture
design. Table 3 lists the Merlin pragmas with architecture struc-
tures. Note that the fg option in the fine-grained loop pipeline mode
refers to the code transformation that tries to apply fine-grained
pipelining to a loop nest by fully unrolling all its sub-loops. Based
on these user-specified pragmas, the Merlin Compiler performs
source-to-source code transformation to apply the corresponding
architecture optimization by automatically generating the related
HLS pragmas such as PIPELINE, UNROLL, and ARRAY_PARTITION
and applying them to the program. Since the number of pragmas
required by the Merlin Compiler is much smaller (as it performs
source-level code reconstruction and generates most of the required
HLS pragmas, such as array partition), it defines a more compact
design space, so we use it as the compilation tool for DSE [12, 41].
For instance, Code 3 shows the CNN kernel with Merlin pragmas. With
only a few lines of pragmas, the Merlin Compiler is able to transform
Code 3 into a high-performance HLS kernel with the same performance
as Code 2. There are two kinds of code transformations that one needs
to employ to reach a high-performance design. The first kind
increases data reuse through loop transformations, which are common
in CPU performance optimization as well (e.g., for cache locality);
therefore, software programmers are familiar with them, and we expect
them to apply these transformations manually without any problems.
The second kind is required to enable architectural optimizations
such as memory burst, memory coalescing, and double buffering, as
mentioned by reasons 1-3 in Table 2. These transformations are much
more difficult for software programmers to learn and apply
effectively. Fortunately, the Merlin Compiler takes care of this kind
of code transformation. For example, instead of rewriting Code 1 to
test whether double buffering would help the performance, as denoted
by reason 3 in Table 2, we just need to use the PIPELINE pragma with
the cg option. Furthermore, instead of manually applying code
transformations for memory coalescing and memory burst, as denoted by
reasons 1 and 2 in Table 2, we can tune the interface and tiling
pragmas, and the Merlin Compiler will rewrite the code to satisfy
these constraints. As a result, our focus in this work is on finding
the best location for each of the pragmas and tuning them to enable
those architectural optimizations, along with the best pipelining and
parallelization attributes to address reasons 4-5 in Table 2 as well.
Table 3: Merlin Pragmas with Architecture Structures
Keyword    Available Options   Architecture Structure
parallel   factor=<int>        CG & FG parallelism
pipeline   mode=cg             CG pipeline
pipeline   mode=fg             FG pipeline
tiling     factor=<int>        Loop Tiling
CG: Coarse-grained; FG: Fine-grained
Our solution to Problem 1 is shown in Table 4. We identify the
design space for each kernel by analyzing the kernel AST to extract
loop trip-counts, available bit-widths, and so on. In addition, since
vendor HLS tools usually schedule fine-grained loops well, we only
explore the parallel factor of a fine-grained loop when its
trip-count is larger than 16; otherwise, we simply fully unroll and
pipeline the small fine-grained loop to reduce the design space.
Moreover, the parallelization factors and tile sizes considered are
integer divisors of their respective loop trip-counts. We do not
include the interface pragma in our search space, since the best
bitwidth can be determined from the size of the input and its data
type.
Table 4: Design Space Building on Merlin Pragmas
Factor             Design Space (Values)
CG-loop parallel   {u | 1 < u ≤ TC(L), u·c = TC(L), c ∈ Z}
FG-loop parallel   {u | 1 < u < TC(L), u·c = TC(L), c ∈ Z} if TC(L) > 16; {u | u = TC(L)} otherwise
CG-loop pipeline   {p | p ∈ {off, cg, fg}}
FG-loop pipeline   {p | p = fg}
loop tiling        {t | 1 < t < TC(L), t·c = TC(L), c ∈ Z}
CG: Coarse-grained; FG: Fine-grained; TC: Loop trip-count
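The value sets in Table 4 can be enumerated mechanically from a loop's trip-count. The sketch below follows the table as printed (divisors greater than 1, with the trip-count cutoff of 16 for fine-grained loops); the function names `parallel_factors` and `tiling_factors` and the output-array interface are hypothetical and are not part of AutoDSE or the Merlin Compiler.

```c
#include <assert.h>
#include <stddef.h>

#define FG_THRESHOLD 16 /* trip-count cutoff for fine-grained loops */

/* Write the parallel factors Table 4 allows for a loop with trip count
 * tc into out (capacity cap); return how many values were written.
 * A factor of 1 (no parallelization) is not listed, as in the table. */
size_t parallel_factors(int tc, int fine_grained, int *out, size_t cap) {
  size_t n = 0;
  assert(tc >= 1);
  if (fine_grained && tc <= FG_THRESHOLD) {
    if (cap > 0) out[n++] = tc; /* small FG loop: fully unroll only */
    return n;
  }
  /* CG loops allow u == tc; large FG loops require u < tc. */
  int upper = fine_grained ? tc - 1 : tc;
  for (int u = 2; u <= upper && n < cap; ++u)
    if (tc % u == 0) out[n++] = u; /* keep integer divisors only */
  return n;
}

/* Tile sizes: divisors t of tc with 1 < t < tc. */
size_t tiling_factors(int tc, int *out, size_t cap) {
  size_t n = 0;
  assert(tc >= 1);
  for (int t = 2; t < tc && n < cap; ++t)
    if (tc % t == 0) out[n++] = t;
  return n;
}
```

For example, a coarse-grained loop with a trip-count of 12 yields the parallel factors 2, 3, 4, 6, and 12, while a fine-grained loop with the same trip-count yields only 12 because it falls under the cutoff. Restricting factors to divisors keeps every candidate free of remainder iterations, which is one reason the resulting space is much smaller than the set of all integers up to the trip-count.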
Now that we have defined the design space in Table 4 for Problem 1,
we focus on Problem 2 in the remainder of this paper. Although Merlin
pragmas alleviate the manual code reconstruction overhead to some
extent, a designer still has to manually search for the best option
for each pragma, including its position, type, and factors. In fact,
the CNN design in Code 3 contains four DRAM buffers and thirteen
loops, which result in ∼10^16 design configurations. The large design
space motivates us to develop an efficient approach to find the best
configuration.
4.2 Framework Overview
We develop and implement AutoDSE, a push-button framework shown in
Fig. 2, based on the strategies explained in Section 5. The framework
first automatically builds a design space according to Table 4 by
analyzing the kernel AST, using the syntax described in Section 5.2.
Then, it profiles and selects representative partitions using K-Means
as mentioned in Section 5.3. For each partition, the AutoDSE explorer
performs DSE using the proposed bottleneck-based gradient strategy in
Section 5.1.3. The explorer can be tuned to evaluate the quality of
design points based on different targets such as performance,
resource, or the finite difference introduced in Section 5.1.2. When
the explorer finishes exploring a partition, it stores the best
configuration found in that partition and reallocates the working
threads to other partitions to keep the resource utilization high.
Finally, when all partitions are finished, AutoDSE outputs the design
configuration with the best QoR among all partitions.
[Figure 2: block diagram. Labeled blocks: C Kernel; Design Space Generator + Partitioner; Profiler and Seed Generation; Design Space Partitions; Representative Design Space; Explorer; Bottleneck Optimization Algorithm; Design Config. Waiting Queue; Evaluator; Cache Hit Checking; Code Transformation; HLS w. Vendor Tools; Bottleneck Analysis; Result Committing; Result Database; C Kernel w. Optimized Design Config. Legend: Execution Flow, Result Query.]
Figure 2: The AutoDSE Framework Overview
5 METHODOLOGY OF AUTODSE
General search techniques perform poorly on the HLS DSE problem
due to the non-monotonic effect of the pragmas and their correlation
with each other, as explained in challenges 2 and 3 of Section 1.
S2FA [41] uses a set of these techniques such as uniform greedy
mutation, differential evolution genetic algorithm, particle swarm
optimization, and simulated annealing. To show the deficiency of
the common search techniques, we further test the performance of
the gradient descent. The experimental results in Section 6 demon-
strate that our proposed bottleneck-guided gradient optimizer out-
performs all of these techniques.
The organization of this section is as follows: we elaborate on
our searching algorithm in Section 5.1. In Section 5.2, we present
an efficient way to represent the design space that helps us rule
out the infeasible design points. To prevent the search engine from
being trapped in local optimal points, we partition the design space
as explained in Section 5.3.
5.1 Parameter Prioritization
In this section, we analyze different approaches for exploring the
design space and gradually come up with an effective algorithm
to solve the problems mentioned in Section 3. We first examine
the efficiency of problem-independent heuristics in Section 5.1.1.
Then, we introduce a new search technique based on gradient de-
scent, a common iterative optimization algorithm, with a finite
difference method in Section 5.1.2 that systematically finds a better
design point in the design space. As we will explain, the problem-
independent heuristics and naive gradient-based approach fail to
Conference’20, Sep 2020, Los Angeles, CA, USA Sohrabizadeh and Yu, et al.
identify the killer parameters in few iterations. As a result, in Sec-
tion 5.1.3, we present a bottleneck-guided gradient optimizer that
can mimic an expert’s optimization method and can outperform
the aforementioned approaches.
5.1.1 Weakness of Problem-independent Heuristics. We illustrate a
representative prior work on DSE, which utilized a popular search
engine called OpenTuner [2]. OpenTuner leverages the multi-armed
bandit (MAB) approach [17] to assemble multiple meta-heuristic
algorithms for high generalization. At each iteration, the MAB selects
the meta-heuristic with the highest credit and updates the credit
of the selected meta-heuristic based on the QoR, which means the
meta-heuristic that can efficiently find high-quality design points
will be rewarded and activated more frequently by the MAB, and
vice versa. Due to its extensibility, OpenTuner has been adapted
to perform DSE for design optimization. DATuner [38] introduces
entropy-based partition to search for the best parameters for phys-
ical design tools with multiple threads. S2FA [41] further applies
more strategies to improve the OpenTuner efficiency when per-
forming DSE for HLS. Since the S2FA backend also employs the
Merlin Compiler for code transformation, we use its DSE engine to
justify the advantage of developing a bottleneck-guided gradient
optimizer.
[Figure 3: line plot. x-axis: Time (hours), 0 to 25; y-axis: Normalized Speedup, 0.0 to 1.0; series: fft-strided, bfs-queue, stencil, kmp, spmv, aes, bfs-bulk, nw, fft-trans, gemm.]
Figure 3: Speedup Over the Manual Design Using S2FA [41]
We use S2FA to perform the DSE for 24 hours by turning off
its early stopping criteria and depict the speedup of our bench-
mark cases over the corresponding manual design hourly in Fig. 3.
The black dot indicates the time that the S2FA finds the overall
best design point. We can see that S2FA requires on average 16.8
hours to find the best solution. We further analyze the exploration
process and find that most designs have an obvious performance
bottleneck (e.g., effective external memory bandwidth, insufficient
parallel factors, etc.), which usually dominates more than half of the
overall execution cycle and is controlled by only one or two design
parameters. In this situation, the performance gain from tuning other
parameters is often very limited and hard for the general learning
algorithms to attribute. The learning algorithm needs many iterations
to identify the key parameter and tune it to resolve the performance
bottleneck. After that, it has to spend a large number of iterations
again to find the next key parameter. This phenomenon motivates
us to develop a new search algorithm that is guaranteed to optimize
the killer parameter prior to others.
5.1.2 Gradient Descent with Finite Difference. Gradient descent is
a well-known iterative optimization algorithm for finding a local
minimum point in a differentiable objective function. It has also
been successfully applied to solve large scale non-linear physical
design problems with a smooth analytical approximation such as
multi-level circuit placement [4]. Formally, gradient descent is
employed to find a configuration $\theta$ with the minimal objective
value $J(\theta)$ in a solution space $\mathbb{R}^{K}_{P}$:
\[ \arg\min_{\theta_i \in \mathbb{R}^{K}_{P}} J(\theta_i) \tag{4} \]
To achieve the goal, we start from an initial configuration $\theta_0$ and
iteratively update the configuration by following the steepest descent,
the negative gradient $-\nabla J$:
\[ \theta_{i+1} = \theta_i - \alpha \nabla J(\theta_i) \tag{5} \]
where α is the step size.
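As a reminder of how the update rule in Eq. 5 behaves on a differentiable objective, here is a minimal sketch on a one-dimensional quadratic. The objective, the step size, and the iteration count are illustrative assumptions and have nothing to do with an HLS design space.

```python
def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    """Apply theta <- theta - alpha * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)  # Eq. 5
    return theta


# Illustrative objective J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
print(round(theta_star, 6))  # 3.0
```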
The gradient descent approach requires the objective function to
be differentiable in order to find the next steepest descent. This lim-
itation makes it impractical in many real-world applications, as the
system may be too complicated to be modeled as partially observ-
able Markov decision problems. To avoid the potential problems
of modeling HLS tools, we leverage the finite difference method
to approximate the gradient value by treating the HLS tool as a
black-box. That is, given a candidate configuration $\theta_j$ deviated from
the current configuration $\theta_i$, we use the finite difference method to
approximate the gradient as follows:
\[ g(\theta_j, \theta_i) \sim \frac{Cycle(H, P(\theta_j)) - Cycle(H, P(\theta_i))}{Util(H, P(\theta_j)) - Util(H, P(\theta_i))} \tag{6} \]
Note that Eq. 6 considers not only the performance gain but also
resource efficiency, so it can reduce the possibility of being trapped
in a local optimum. For example, we may reduce the execution cycle
by 10% by spending 30% more area if we increase the parallel factor of
a loop (configuration $\theta_1$); we can also reduce the execution
cycle by 5% by spending 10% more area if we enlarge the bit-width of
a certain buffer (configuration $\theta_2$). Although $\theta_1$ seems better in
terms of the execution cycle, it may be more easily trapped by
a local optimal point because it leaves relatively limited resources
for further improvement. On the other hand, the finite difference
values for the two configurations are $g(\theta_1, \theta_0) = -10\%/30\% \approx -0.3$
and $g(\theta_2, \theta_0) = -5\%/10\% = -0.5$, so the system prioritizes the second
configuration for better long-term performance.
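The arithmetic of this example can be checked with a few lines of code. This is a sketch only: it assumes cycle count and utilization are both normalized so that the current configuration sits at 100, which makes the percentage changes in the text plain differences.

```python
def finite_difference(cycle_new, cycle_old, util_new, util_old):
    """Eq. 6: change in cycle count divided by change in resource utilization."""
    return (cycle_new - cycle_old) / (util_new - util_old)


# Assumed normalization: the current configuration has cycle = 100 and util = 100.
g1 = finite_difference(90, 100, 130, 100)   # theta_1: 10% fewer cycles, 30% more area
g2 = finite_difference(95, 100, 110, 100)   # theta_2: 5% fewer cycles, 10% more area

# The more negative value wins, so theta_2 is preferred.
best = min({"theta1": g1, "theta2": g2}.items(), key=lambda kv: kv[1])[0]
print(round(g1, 2), g2, best)  # -0.33 -0.5 theta2
```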
Since the finite difference method selects the best candidate
as the next configuration, we need to generate a set of candidates,
$\Theta_{cand}$, at each iteration. Specifically, we generate candidates by
advancing the value of each parameter in the current configuration
by one step. Formally, the c-th candidate generated from $\theta_i$ is:
\[ \theta_c = [p_0, p_1, \ldots, p_c + 1, \ldots, p_k] \tag{7} \]
where $p_c$ is the value of the c-th parameter in $\theta_i$. Accordingly, we
generate K candidates at each iteration, which means we run HLS
K times to determine the next configuration:
\[ \theta_{i+1} = \arg\min_{\theta_j \in \Theta_{cand}} g(\theta_j, \theta_i) \tag{8} \]
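One iteration of Eqs. 6 to 8 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a configuration is a list of indices into each parameter's option list, `evaluate` is a hypothetical stand-in for an HLS run that returns `(cycle, util)`, and the sketch only accepts candidates with a negative finite difference (it stops when none exists).

```python
def generate_candidates(theta, num_options):
    """Eq. 7: advance each parameter by one step, skipping exhausted parameters."""
    candidates = []
    for c in range(len(theta)):
        if theta[c] + 1 < num_options[c]:
            nxt = list(theta)
            nxt[c] += 1
            candidates.append(nxt)
    return candidates


def next_config(theta, num_options, evaluate):
    """Eq. 8: pick the candidate with the smallest finite difference (Eq. 6)."""
    cycle0, util0 = evaluate(theta)
    best, best_g = None, 0.0
    for cand in generate_candidates(theta, num_options):
        cycle, util = evaluate(cand)          # one black-box evaluation per candidate
        if util == util0:
            continue                          # avoid division by zero
        g = (cycle - cycle0) / (util - util0)
        if g < best_g:
            best, best_g = cand, g
    return best


def toy_evaluate(theta):
    # Hypothetical cost model: parameter 0 helps a lot and is cheap, parameter 1 is the opposite.
    cycle = 1000 / (1 + theta[0]) + 200 / (1 + theta[1])
    util = 10 + 5 * theta[0] + 20 * theta[1]
    return cycle, util


print(next_config([0, 0], [4, 4], toy_evaluate))  # [1, 0]
```

Each call to `next_config` costs one evaluation per candidate, which mirrors the K HLS runs per iteration noted in the text.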
By leveraging the gradient descent with a finite difference
method, we expect to find a better design point every K HLS runs.
Unfortunately, as we have illustrated in Fig. 1, the performance
trend is not always smooth, so the gradient process can easily be
trapped by a low-quality local optimal design point. Taking Fig. 1
again as an example, the gradient approach will stop at factor 2 for
FG-loop-1 because factor 3 has a worse performance but consumes
more resources. Actually, the gradient approach proposed in this
section only achieves 2.8× speedup on the geometric mean of our
MachSuite [27] and Rodinia [5] benchmarks, which is even worse
than the results from the problem-independent heuristics reviewed
in Section 5.1.1.
Moreover, the efficiency of using the gradient-based approach
for DSE is limited by the process of approximating gradient value.
In order to approximate the gradient value, at each iteration, we
need to evaluate K design points, where K is the total number of
AutoDSE: Enabling Software Programmers Design Efficient FPGA Accelerators Conference’20, Sep 2020, Los Angeles, CA, USA
tuning parameters, to determine the next step. On the other hand,
in most cases, only a few of the K tuning parameters have a high
impact on the performance, so we should evaluate only the K′
impactful parameters at each iteration, where K′ < K. For instance,
the design space generator instruments Code 1 with 27 pragmas
based on the rules explained in Section 5.2, and the gradient-based
approach proposed in this section needs to assess the quality of 27
new designs in each iteration. However, in the early iterations the
convolution part takes more than 90% of the total cycle count of
the kernel. As a result, changing the pragmas outside of this part
has an insignificant effect on the performance, and it is wasteful
to explore them at this stage.
Identifying the K′ parameters is not straightforward. Although
the HLS report may provide the cycle breakdown for the loop and
function statements, it is hard to map them to tuning parameters
because several code transformations have been applied. Fortunately,
the Merlin Compiler [8] includes a feature that performs back
propagation. This feature transmits the performance breakdown reported
by the HLS tool to the user input code, allowing us to identify
the performance bottleneck by traversing the Merlin report and
mapping the bottleneck statement to one or a few tuning parameters,
which will be presented next.
5.1.3 Bottleneck Optimization Strategy. From our experience with
the naive gradient-based DSE approach proposed in Section 5.1.2
and the general learning algorithm or heuristics discussed in Sec-
tion 5.1.1, we see the following inefficiencies when comparing their
behavior with human design experts:
(1) Those approaches have to evaluate many design points to iden-
tify the performance bottleneck. An expert could directly acquire
this information by analyzing the cycle breakdown.
(2) Those approaches have no knowledge of parameters, so they
have no way to prioritize important ones. An expert, on the other
hand, may know which parameter has a high potential of being
the killer parameter.
(3) Those approaches may stop exploring the options of a parameter
because of a local optimum. An expert may know whether other options
are worthwhile to explore or not.
The first two inefficiencies can be resolved by leveraging the
bottleneck analysis. We first build a map from the loop or function
statements in the user input code to design parameters so that we
know which parameters should be focused on for a particular state-
ment. To identify the critical path and type, we start with the kernel
top function statement. We first check to see if the current state-
ment has child loop statements. For the function call statements,
we dive into the function implementation to further check its child
statements. Then, we traverse each of them and create hierarchy
paths. Note that since we sort all loop statements according to their
latency by checking the Merlin report, the hierarchy paths we cre-
ated will also be sorted by their latency. Subsequently, we check
the Merlin report again to determine whether the performance
bottleneck of the current statement is memory transfer or compu-
tation. The Merlin Compiler obtains this information by analyzing
the transformed kernel code along with the HLS report. A cycle
is considered to be a memory transfer cycle if it is consumed by
communicating to off-chip memory. Finally, we append the current
statement to the end of each path and return a list of paths in order.
As a result, we can not only figure out the performance bottleneck
for each design point, but also identify a small set of effective de-
sign parameters to focus on. Therefore, we are able to significantly
improve the efficiency of our searching algorithm.
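The traversal described above can be sketched as a recursive walk over a statement tree. This is a simplified illustration: the dictionary layout, the field names, and the latency numbers are assumptions made for the example, and a real implementation would read them from the Merlin report rather than from a hand-written tree.

```python
def critical_paths(stmt):
    """Return hierarchy paths (innermost statement first), most critical path first."""
    # Visit children from the highest latency to the lowest.
    children = sorted(stmt["children"], key=lambda s: s["latency"], reverse=True)
    paths = []
    for child in children:
        for path in critical_paths(child):
            # Append the current statement to the end of each child path.
            paths.append(path + [(stmt["name"], stmt["bottleneck"])])
    if not paths:
        paths.append([(stmt["name"], stmt["bottleneck"])])
    return paths


# Hypothetical kernel: the convolution loop nest dominates the cycle count.
kernel = {"name": "top", "latency": 1000, "bottleneck": "compute", "children": [
    {"name": "relu_loop", "latency": 20, "bottleneck": "memory", "children": []},
    {"name": "conv_loop", "latency": 980, "bottleneck": "compute", "children": [
        {"name": "loop_k", "latency": 50, "bottleneck": "compute", "children": []},
        {"name": "loop_j", "latency": 900, "bottleneck": "compute", "children": []},
    ]},
]}

for path in critical_paths(kernel):
    print([name for name, _ in path])
```

The first path returned starts at the innermost statement of the slowest loop nest, which is where the optimizer would look for parameters first.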
When we obtain an ordered list of critical hierarchy paths from
the bottleneck analysis, we start from the most critical innermost
loop statement and identify its corresponding parameters. For in-
stance, the convolution part of Code 1 takes 98% of the cycle counts;
hence, the parameters applied to this section of code make the top
of the list from the innermost loop to the outermost one. Note that
since the bottleneck analysis also provides the bottleneck type in-
formation (i.e., memory transfer or computation), we may identify
a subset of the parameters mapped to that statement. For example,
we may have design parameters of PARALLEL and TILING at the
same loop level. When the bottleneck type of the loop is memory
transfer, we focus on the TILING parameter for the loop; otherwise,
we focus on PARALLEL parameter. In other words, we reduce the
number of candidate design parameters not only by the bottleneck
statement but also by the bottleneck type.
Table 5: Performance and Resource Utilization Compared to The Base Design When Parameters of Line 18 in Code 1 Change
Optimization    Status          Perf   BRAM   LUT    DSP    FF
Pi-fg           PASS (24 min)   175×   +7%    +23%   +24%   +15%
PF=4            TIMEOUT         -      -      -      -      -
Pi-fg + PF=4    PASS (28 min)   218×   +17%   +44%   +33%   +25%
Pi: Pipeline, PF: Parallel Factor, fg: fine-grained
It often happens that more than one design parameter can be
applied for a given bottleneck type. When the bottleneck of a loop
statement is determined to be its computation, one can in general
apply fine-grained or coarse-grained pipelining/parallelization. In
this case, we utilize a predefined priority for testing the parameters.
We choose the order of applying the pragmas for compute-bound loops
to be PIPELINE mode fg, PARALLEL, and PIPELINE mode cg, which is a
greedy approach to improving the performance by utilizing more
parallelization units. Measuring the quality of design points with the
finite difference (gradient) value helps AutoDSE avoid over-utilizing
the FPGA: when the speedup gained by a configuration is not comparable
to the resources it consumes, the quality of the design decreases, so
AutoDSE turns that pragma off. The resources are instead left for a
design parameter with a higher impact. Moreover, as mentioned in
Challenge 3 of Section 1, the order of applying the pragmas is crucial
for reaching the best design. Since HLS tools schedule fine-grained
pipelining/parallelization better than the coarse-grained ones, our
experiments show that evaluating the fine-grained options first helps
AutoDSE reach the best design point in fewer iterations. Table 5 shows
how the performance and resource utilization change, compared to the
base design in which all the pragmas are off, when the PIPELINE mode
fg and PARALLEL pragmas are applied on line 18 in Code 1. The
time limit for running the HLS tool is set to 60 minutes. The results
suggest that to reach the optimal configuration for this loop, we must
first apply fine-grained pipelining: with PF=4 alone, synthesis times
out, whereas once fine-grained pipelining is applied, the HLS tool can
better schedule the loop when parallelization is added, and synthesis
finishes in 28 minutes.
Note that we do not prune the other design parameters. We only
change the order in which the parameters are explored, because these
rules cannot be generalized to all cases due to the unpredictability of
the HLS tools. If the bottleneck of a design point is memory transfer,
AutoDSE prioritizes the PIPELINE mode cg pragma over TILING. The
Merlin Compiler caches the data by default; the former further
overlaps the communication time with computation by applying double
buffering, whereas the latter can be used to change
the size of the cached data.
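The priority rules of this section amount to reordering, not pruning, the parameters attached to a bottleneck statement. A minimal sketch follows, assuming each pipeline mode is modeled as a separate string label (a simplification made here for illustration; the names are hypothetical).

```python
# Priority per bottleneck type, as described in the text.
PRIORITY = {
    "compute": ["PIPELINE:fg", "PARALLEL", "PIPELINE:cg"],
    "memory": ["PIPELINE:cg", "TILING"],
}


def order_parameters(available, bottleneck_type):
    """Put the prioritized parameters first and keep the rest, so nothing is pruned."""
    preferred = [p for p in PRIORITY[bottleneck_type] if p in available]
    rest = [p for p in available if p not in preferred]
    return preferred + rest


params = ["TILING", "PARALLEL", "PIPELINE:cg", "PIPELINE:fg"]
print(order_parameters(params, "compute"))
print(order_parameters(params, "memory"))
```

Keeping the non-prioritized parameters at the tail reflects the point that the rules cannot be generalized to every case.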
We define level n as a design where we have fixed the value of n
parameters, so the maximum level in our algorithm is equal to the
total number of parameters. Each design point is represented using
a data structure that includes the quality of design measured by
finite difference value introduced in Eq. 6, the focused parameters,
the fixed parameters and the configurations of the parameters along
with a stack containing its unexplored children. Each level has a
heap of the pending design points that can be further explored. Since
new design points are sorted by their quality values when they were
pushed into the heap, the design point with a better quality value
will be explored prior to other points. Moreover, when a new design
point is passed through the bottleneck analyzer, it will generate
new focused parameters in order of importance stored in a stack;
hence, by popping the stack, we get to work with the design point
with the most promising impact. At each iteration of the algorithm,
AutoDSE gets the heap with the highest level. Then, it peeks the
first node of the heap and pulls its stack of unexplored children.
The new candidate is picked by popping the stack and passed to the
bottleneck analyzer to generate a new set of focused parameters as
the children. Then, it is pushed to the heap of the next level. If
a design point does not generate any focused parameters, or when
its stack of unexplored children is empty, it is popped out of the
heap. The algorithm continues until all the heaps are empty
or the DSE has reached a runtime threshold.
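The level-by-level search with one heap per level and one stack of unexplored children per design point can be sketched as below. This is a toy illustration under several assumptions: `evaluate` and `analyze` are hypothetical callbacks standing in for the HLS evaluation and the bottleneck analyzer, quality is "lower is better", and an evaluation budget replaces the runtime threshold.

```python
import heapq
import itertools
import math


def explore(root, evaluate, analyze, max_evals=100):
    """Search level by level; returns (best_config, best_quality)."""
    tie = itertools.count()  # tie-breaker so the heap never compares configs

    def make_node(config):
        children = analyze(config)  # ordered most promising first
        # Store children reversed so that stack.pop() yields the most promising one.
        return (evaluate(config), next(tie), config, children[::-1])

    heaps = {0: [make_node(root)]}  # level -> heap of pending design points
    best = heaps[0][0]
    evals = 1
    while heaps and evals < max_evals:
        level = max(heaps)              # heap with the highest level
        stack = heaps[level][0][3]      # peek the best node, take its stack
        if not stack:                   # nothing left to explore for this node
            heapq.heappop(heaps[level])
            if not heaps[level]:
                del heaps[level]
            continue
        child = make_node(stack.pop())
        evals += 1
        heapq.heappush(heaps.setdefault(level + 1, []), child)
        if child[0] < best[0]:
            best = child
    return best[2], best[0]


# Toy problem: three parameters, each fixed to 1, 2, or 4; larger products are better.
def toy_evaluate(config):
    return -math.prod(config)


def toy_analyze(config):
    if len(config) == 3:
        return []
    return [config + (v,) for v in (4, 2, 1)]


print(explore((), toy_evaluate, toy_analyze))  # ((4, 4, 4), -64)
```

In this toy run the best point is reached after four evaluations, because the most promising child is always tried first.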
For the third inefficiency mentioned in the beginning of this
section, we cannot identify whether the current option of a parameter
is a local or global optimum, so the most promising solution is
breaking the dependency between options and searching a set of
them in parallel as explained in Section 5.3. In this way, although
we still need to evaluate multiple design points at every iteration,
we guarantee that each design point can provide the maximum
information for improving the performance because we always
evaluate the options of the parameter that has the largest impact
on the performance bottleneck.
5.2 Efficient Design Space Representation
One solution to facilitating the bottleneck-based optimizer is to
reduce ineffective parameters. Intuitively, we can build a grid design
space from Merlin pragmas by treating each Merlin pragma as a
tuning parameter and search for the best combination. However,
many points in this grid space may be infeasible. For example,
if we have determined to perform coarse-grained pipelining at
the outermost loop of a loop nest, the Merlin Compiler will apply
double-buffering on the loop. In this case, the physical meaning of
double-buffering at the outermost loop is to transfer a batch set of
data from DRAM to BRAM, which cannot be further parallelized.
As a result, pipeline and parallel pragmas are mutually exclusive
in a loop nest. In this section, we propose an efficient approach
to create a design space that preserves the grid design space but
invalidates infeasible combinations.
[Figure 4: grid diagram. P1 options: off, '' (no option), flatten; P2 options: 1, 2, 4, 8, 16, 32, 64.]
Figure 4: Proposed Design Space Representation and Its Impact on DSE
Fig. 4 illustrates the goal of an efficient design space
representation. In this example, we attempt to explore the best parameter
for loop j of Code 1 and the best option for it, with pragmas P1 and
P2 denoting the PIPELINE and PARALLEL pragmas, respectively.
Pragmas P1 and P2 are mutually exclusive when P1 is used without any
option, which in this case results in coarse-grained pipelining;
therefore, only one of them should be inserted at a time. A good design
space representation must preserve the grid design space but invalidate
infeasible points. An example of such a representation is presented
in Fig. 4. Assuming that we are at the configuration (P1, P2) = (cg, 1),
we only have two candidates to explore in the next step because
the configuration (P1, P2) = (cg, 2) is invalid. This representation
is exploration friendly and makes it easy to enforce rules on the
infeasible points.
To represent a grid design space with invalid points, we intro-
duce a Python list comprehension syntax to AutoDSE. The Python
list comprehension is a concise approach for creating lists with
conditions. It has the following syntax:
list_name = [expression for item in list if condition]
Formally, we define the design space representation for Merlin
pragmas with list comprehensions as follows:
#pragma ACCEL <pragma-type> <attribute-key>=auto{
options: parameter_name=list-comprehension-expression;
default: default-value }
For our example, the design space can be represented using list
comprehensions as follows:
// Skip the rest due to page limit
#pragma ACCEL PIPELINE auto{
options: P1 = [x for x in [off, cg, fg]];
default: 'off' }
#pragma ACCEL PARALLEL factor=auto{
options: P2 = [x for x in [1, 2, 4, 8, 16, 32, 64] if P1!=cg];
default: 1 }
for (int j = 0; j < NumIn; ++j) {
// Skip the rest due to page limit
where the condition P1!=cg in the options of P2 indicates that the two pragmas are exclusive. In other
words, when we set P1 = cg, the available option for P2 is only the
default value, which is 1 in this case. Note that the default value of
each pragma turns it off.
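Because the option lists are valid Python, they can be evaluated by the Python interpreter with the other parameters bound to their current values. The sketch below illustrates the idea under some assumptions: the pragma values are written as quoted strings, the `SPACE` dictionary and `valid_options` function are hypothetical names, and `eval` is only acceptable here because the expressions come from a trusted generator.

```python
# Hypothetical in-memory form of the two pragmas from the example above.
SPACE = {
    "P1": {"options": "[x for x in ['off', 'cg', 'fg']]", "default": "off"},
    "P2": {"options": "[x for x in [1, 2, 4, 8, 16, 32, 64] if P1 != 'cg']",
           "default": 1},
}


def valid_options(name, config):
    """Evaluate the list comprehension of `name` with other parameters bound to `config`."""
    env = {"__builtins__": {}}   # no builtins are needed for these expressions
    env.update(config)           # e.g. makes P1 visible to the condition of P2
    options = eval(SPACE[name]["options"], env)
    # If every option is invalidated, only the default (pragma off) remains.
    return options or [SPACE[name]["default"]]


print(valid_options("P2", {"P1": "fg"}))  # [1, 2, 4, 8, 16, 32, 64]
print(valid_options("P2", {"P1": "cg"}))  # [1]
```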
There are three main advantages to adopting list comprehension-
based design space representations. First, we are able to represent a
design space with exclusive rules to greatly reduce its size. Second,
the Python list comprehension is general. It provides a friendly and
comprehensive interface for higher layers such as polyhedral anal-
ysis [49] and domain-specific languages to generate an effective
design space in the future. Third, the syntax of this representation
is Python compatible. This means we can leverage the Python inter-
preter to evaluate the design space and improve overall stability of
the DSE framework. The Design Space Generator, depicted in Fig. 2,
analyzes the kernel AST and extracts the required information for
starting the DSE such as the loops in the design, their trip-count,
and available bit-width. Artisan [33] adopts a similar approach for
analyzing the code. However, it only considers the unroll pragma in
code instrumentation. Our approach, on the other hand, considers
a wider set of pragmas as mentioned in Table 3 and employs the
following rules to prune the design space:
• Ignore the fine-grained loops with trip count (TC) of less than
or equal to 16 as the HLS tool can schedule these loops well.
• The allowed parallel factors for a loop are all divisors of the
loop TC up to min(128, TC), plus the TC itself. A parallel factor
larger than 128 causes the HLS tool to run for a long time and
usually does not result in good performance.
• For each loop, we should have TF ∗ PF < TC , where TF and PF
are tiling factor and parallel factor respectively.
• When pipeline mode fg is applied on a loop (PIPELINE
FLATTEN), no other pragma is allowed for the inner loops.
• A parallel pragma is invalid for a loop nest when pipeline mode
cg is applied on that loop.
• A tiling pragma is added only to the loops with an inner loop.
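The rules above can be expressed as a small validity check on a per-loop configuration. This is a sketch under stated assumptions: the dictionary fields are hypothetical, a factor of 1 means the pragma is off, and the TF * PF < TC rule is applied only when tiling is actually enabled (an interpretation made here so that full unrolling with PF = TC stays legal).

```python
def allowed_parallel_factors(tc, cap=128):
    """Divisors of the trip count up to min(cap, tc), plus the trip count itself."""
    limit = min(cap, tc)
    factors = [d for d in range(1, limit + 1) if tc % d == 0]
    if tc not in factors:
        factors.append(tc)
    return factors


def is_valid(loop, inner_pragmas_on=False):
    """Check one loop configuration against the pruning rules."""
    tc, pf, tf = loop["tc"], loop["parallel"], loop["tile"]
    if pf not in allowed_parallel_factors(tc):
        return False
    if loop["pipeline"] == "cg" and pf > 1:
        return False                  # cg pipelining excludes parallelization
    if loop["pipeline"] == "fg" and inner_pragmas_on:
        return False                  # fg (flatten) forbids pragmas on inner loops
    if tf > 1 and not loop["has_inner_loop"]:
        return False                  # tiling only on loops with an inner loop
    if tf > 1 and tf * pf >= tc:
        return False                  # assumed to apply only when tiling is on
    return True


print(allowed_parallel_factors(256))  # [1, 2, 4, 8, 16, 32, 64, 128, 256]
```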
According to the evaluation results, our pruning rules reduce the
design space by 24.65× on average while still achieving a 1.3×
speedup on the geometric mean for our MachSuite and Rodinia
benchmarks.
5.3 Design Space Partitioning
One solution to solving the local optimal issue caused by the non-
smooth performance gain is partitioning the design space based
on the likely distribution of local optimal points and exploring
each partition independently. Intuitively, we could partition the
design space according to a range of values of every parameter in
a design, but it may generate thousands of partitions and result
in a long exploration time. Thus, we only partition the design
space based on the pipeline mode, as pipeline mode fg unrolls
all sub-loops to achieve fine-grained while the mode cg exploits
double buffers to implement coarse-grained pipeline. These two
modes apparently have the most significant different influence on
the generated architecture and are expected to have non-related
performance and resource utilization. According to the pipeline
modes in each loop, we use the tree partition and generate 2N
partitions from a design space with N non-innermost loops.
Supposing we use t working threads to perform, at most, h hours of
DSE per partition for $2^N$ design space partitions, we need
$\frac{2^N}{t} \times h$ hours to finish the entire process. On the
other hand, some partitions that are based on an insignificant
pipeline pragma may have similar performance, so it is more efficient
to explore only one of them. As a result, we profile each partition by
running HLS with minimized parameter values to obtain the minimum area
and performance and use K-means clustering with performance and area
as features to identify t representative partitions among all $2^N$
partitions. With this strategy, we are able to achieve a further 2×
speedup on the
geometric mean by exploring, at most, t partitions in h hours.
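The partition count and the resulting exploration time are easy to reproduce. The sketch below enumerates the pipeline-mode assignments with `itertools.product` and evaluates the cost formula from the text; the loop names and the numbers are illustrative assumptions, and the K-means selection step is not covered.

```python
import itertools


def partitions(loops):
    """One partition per assignment of pipeline mode (cg or fg) to each non-innermost loop."""
    return [dict(zip(loops, modes))
            for modes in itertools.product(("cg", "fg"), repeat=len(loops))]


def total_hours(n_loops, threads, hours_per_partition):
    """(2^N / t) * h: time to explore every partition with t working threads."""
    return 2 ** n_loops / threads * hours_per_partition


parts = partitions(["L0", "L1", "L2"])
print(len(parts))             # 8
print(parts[0])               # {'L0': 'cg', 'L1': 'cg', 'L2': 'cg'}
print(total_hours(4, 8, 4))   # 8.0
```

With t representative partitions and t threads, the same formula gives h hours, which is the saving the clustering step aims for.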
6 EVALUATION
6.1 Experimental Setup
Our evaluation is performed on Amazon Elastic Compute Cloud
(EC2) [1]. We use an r4.4xlarge instance with 16 cores and 122 GiB
of memory to perform the DSE and generate accelerator bit-streams.
The generated FPGA accelerators are evaluated on an F1 instance
(f1.2xlarge) with a Xilinx Virtex UltraScale+ VU9P FPGA. In
addition, our benchmarks are selected from the MachSuite [27]
benchmark suite, the FPGA-friendly Rodinia [5] benchmark, and one
convolution layer of Alexnet [20]. For several common kernels,
MachSuite provides a C implementation that is programmed without
the consideration of FPGA acceleration, which makes it a natural
fit for demonstrating our approach. Furthermore, we evaluate
the performance of AutoDSE on vision kernels of the Xilinx Vitis
libraries [37] that are optimized for Xilinx FPGAs, based on the
OpenCV library [3].
6.2 Comparative Studies
6.2.1 MachSuite [27] and Rodinia [5] Benchmark. We first
evaluate the gradient descent with a finite difference method and
the proposed optimization strategies. The 1st to 3rd bars of each
case in Fig. 5 show the speedup gained by them with respect to the CPU.
Note that the chart is in logarithmic scale. The list-based design
space representation keeps the search space smooth by invalidat-
ing infeasible combinations. As a result, we can investigate more
design points in a fixed amount of time. This helps AES, NW, KMP,
PATHFINDER, KMEANS, and KNN. Design space partition benefits the
designs with many loop nests in which the gradient process is easily
trapped by the local optimal when changing pipeline modes—such
as AES, GEMM, NW, STENCIL-2D, and STENCIL-3D.
The 4th bar shows the speedup of AutoDSE when the bottleneck-guided
gradient optimizer described in Section 5.1.3 is adopted along
with the design space representation introduced in Section 5.2 and
design space partitioning explained in Section 5.3. With this setup,
AutoDSE further improves the result by 5.5× on the geometric
mean. As a result, AutoDSE is able to achieve a speedup of 19.9×
over CPU and get to 0.93× performance of the manual designs with
only 1.1 hours on the geometric mean.
Table 6: Speedup of Our Approach, S2FA [41], Lattice-traversing DSE [16], and Manual Design Over an Intel Xeon CPU Core
Approach       AES      NW       GEMM   KMP   SPMV   STENCIL-3D
Lattice [16]   2319.9   536.4    2.2    -     -      -
S2FA [41]      7.4      3387.4   10.7   2.9   1.7    2.1
Ours           3774.7   3387.5   16.3   5.1   1.7    2.7
Manual         3774.7   3468.1   16.3   9.6   1.7    2.7
We further evaluate the overall performance of generated acceler-
ator designs by our bottleneck-guided gradient optimizer, S2FA [41],
lattice-traversing DSE [16], and manual design over CPU in Table 6.
Note that the performance of S2FA and lattice-traversing DSE are
not reported by the authors for all of the kernels we are testing. The
manual designs are optimized with the Merlin Compiler pragmas
without changing the source programs to illustrate the optimality
of our DSE process. According to Table 6, using the bottleneck ap-
proach, we can outperform S2FA and lattice-traversing DSE by 3.6×
and 4.3× respectively, on the geometric mean. As we discussed in
Section 5.1.1, the reason behind deficiency of S2FA is that it is hard
for the problem-independent learning algorithm to find the killer
parameters. Lattice-traversing DSE needs an initial sampling step to
learn the design space which takes a long time for our benchmark
due to the size of the design space even though the authors only
consider unrolling the loops and function inlining. This constraint
makes it hard for the tool to start the exploration process before
the time limit for DSE is met. However, AutoDSE is able to find a
high-performance design in a few iterations. Fig. 7 depicts the AutoDSE
process for four cases in which it significantly outperforms the other
methods. This shows that the bottleneck-guided gradient optimizer
can rapidly achieve high performance. The reason that AutoDSE
does not exactly match the performance of manual designs for all of
the cases is the fact that when the kernels contain many unbounded
loops or while-loops, the HLS report may not reflect the accurate
computation cycles. This affects the bottleneck type analysis of
the Merlin report. Therefore, our search algorithm will focus on
unimportant design points. In the future, we will study the Merlin
report analysis to avoid situations where HLS may produce an
inaccurate report. Another advantage of AutoDSE compared to other
DSE tools such as lattice-traversing DSE is that its backend is based
on the Merlin Compiler. This way, the tool can exploit the auto-
matic code transformations for applying the common optimization
techniques such as memory burst, memory coalescing, and double
buffering; and focus only on high-level hardware changes.
6.2.2 Xilinx Vision Library [37]. To further evaluate the perfor-
mance of AutoDSE, we use 33 vision kernels from Xilinx Vitis
Library. These kernels utilize 14 optimization pragmas, on the geo-
metric mean, which include UNROLL, PIPELINE, ARRAY_PARTITION,
DEPENDENCE, LOOP_FLATTEN, INLINE, DATAFLOW, and STREAM. For
each kernel, we remove the pragmas we search for along with
the one that the Merlin Compiler can infer (INLINE) and pass
it to AutoDSE. The pragmas that we remove for these ker-
nels include UNROLL, PIPELINE, ARRAY_PARTITION, DEPENDENCE,
LOOP_FLATTEN, and INLINE which are used 13.5 times, on the
geometric mean. The pragmas we keep include LOOP_TRIPCOUNT,
Figure 5: Speedup of the Proposed Approach Over an Intel Xeon CPU Core
Figure 6: Speedup and Number of Reduced Pragmas using the AutoDSE Compared to Vision Kernels of Xilinx Vitis li-
braries [37]
Figure 7: Performance of Generated Designs by AutoDSE
Over Time
INTERFACE, DATAFLOW, and STREAM. Note that the Merlin Compiler can
work directly with the pragmas from the HLS vendor tool as well.
However, it does not apply any code transformation when those
pragmas are utilized. The INTERFACE and LOOP_TRIPCOUNT prag-
mas are there to specify the connection to AXI bus and the range
of the trip count of the loop respectively; therefore, they cannot be
removed. In addition, since our search space is built on top of the
Merlin Compiler, we do not search for DATAFLOW and STREAM prag-
mas as these pragmas are not among the Merlin-specified pragmas.
Nevertheless, we require less than one optimization pragma per
kernel, on the geometric mean. In the future, we will expand our
search engine to HLS pragmas that are not included in Merlin.
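The pragma-removal step described above can be sketched as a small preprocessing pass. This is a minimal illustration under stated assumptions, not the authors' tooling: the function name strip_searched_pragmas, the regular expression, and the sample kernel are all hypothetical, and the sketch assumes each pragma occupies its own source line.

```python
import re
from typing import Tuple

# Pragmas removed before the kernel is handed to the DSE, as listed in the text.
# INTERFACE, LOOP_TRIPCOUNT, DATAFLOW, and STREAM are deliberately not in this set.
REMOVED = {"UNROLL", "PIPELINE", "ARRAY_PARTITION",
           "DEPENDENCE", "LOOP_FLATTEN", "INLINE"}

# Matches a line of the form "#pragma HLS <DIRECTIVE> ..." and captures the directive.
_PRAGMA = re.compile(r"^\s*#\s*pragma\s+HLS\s+(\w+)", re.IGNORECASE)


def strip_searched_pragmas(source: str) -> Tuple[str, int]:
    """Return the source without the removed pragmas and the number dropped."""
    kept, dropped = [], 0
    for line in source.splitlines():
        match = _PRAGMA.match(line)
        if match and match.group(1).upper() in REMOVED:
            dropped += 1  # this pragma is left for the search to re-derive
            continue
        kept.append(line)
    return "\n".join(kept), dropped


# Hypothetical kernel fragment written only for this illustration.
KERNEL = """\
void scale(int *in, int *out) {
#pragma HLS INTERFACE m_axi port=in
#pragma HLS INLINE off
  for (int i = 0; i < 1024; i++) {
#pragma HLS LOOP_TRIPCOUNT min=1024 max=1024
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=4
    out[i] = in[i] * 2;
  }
}"""

stripped, n_removed = strip_searched_pragmas(KERNEL)
print(n_removed)
print(stripped)
```

In this fragment the INTERFACE and LOOP_TRIPCOUNT lines survive, while the INLINE, PIPELINE, and UNROLL lines are dropped.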
Fig. 6 compares the performance of the design point that
AutoDSE generates against the Xilinx results, along with the
number of pragmas that we removed. It shows that AutoDSE
achieves a speedup of 1.04× over the Xilinx-optimized kernels
with a 26× reduction of their optimization pragmas in 0.3 hours,
on the geometric mean. This supports the effectiveness of our
bottleneck-based approach and suggests that it can mimic
the method an expert would take. For the cases in which AutoDSE
does not exactly match the performance of Vitis, AutoDSE still finds
the best combination of the pragmas; the gap comes from the
different II that Merlin achieves. For example, the histEqualize,
histogram, and otsuthreshold kernels each have a loop for which
Vivado HLS achieves II=3 when used with #pragma HLS PIPELINE.
If II=2 is added to the PIPELINE pragma, Vivado HLS
can achieve II=2, but it is not possible to change the II using the
Merlin Compiler. On the other hand, AutoDSE significantly
outperforms the customConv and reduce kernels
by better detecting the choices and locations for pipelining and
parallelization.
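To see why the II difference discussed above matters, the commonly used estimate for a pipelined loop, cycles ≈ depth + II × (trip count − 1), can be evaluated for II=3 and II=2. The sketch below is only an approximation under assumed values: the trip count and pipeline depth are hypothetical and are not taken from the evaluated kernels.

```python
def pipelined_loop_cycles(trip_count: int, ii: int, depth: int) -> int:
    """Estimate the cycles of a pipelined loop.

    The first iteration takes `depth` cycles to complete; every later
    iteration starts `ii` cycles after the previous one.
    """
    if trip_count <= 0:
        return 0
    return depth + ii * (trip_count - 1)


# Hypothetical loop: 1024 iterations, pipeline depth of 5 cycles.
TRIP_COUNT, DEPTH = 1024, 5

cycles_ii3 = pipelined_loop_cycles(TRIP_COUNT, ii=3, depth=DEPTH)
cycles_ii2 = pipelined_loop_cycles(TRIP_COUNT, ii=2, depth=DEPTH)

print(cycles_ii3)  # 3074
print(cycles_ii2)  # 2051
print(cycles_ii3 / cycles_ii2)
```

For long loops the ratio approaches 3/2, so under this estimate a loop left at II=3 takes roughly 1.5× as many cycles as the same loop at II=2, regardless of how well the remaining pragmas are chosen.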
7 CONCLUSION AND FUTURE WORK
In this paper, we analyze the difficulty of exploring the HLS design
space and demonstrate the inefficiency of the hyper-heuristics
approach. Based on our analysis and observations, we propose a
bottleneck-guided gradient optimizer to systematically approach
a better solution. To eliminate meaningless design points, we
propose a list comprehension-based design space representation and
prune 24.65× ineffective configurations on average, while keeping
the design space smooth. We further employ a partitioning
strategy to address the local optimum problem. Finally, we implement
a push-button framework, AutoDSE, based on the bottleneck
optimizer. The evaluation results show that the performance of the
designs generated by the AutoDSE framework matches the
corresponding manual designs and achieves, on the geometric mean,
a 19.9× speedup over one CPU core for the Machsuite and Rodinia
benchmarks and 1.04× over the accelerated vision kernels of the Xilinx
Vitis libraries with a 26× reduction of their optimization pragmas.
The experimental results suggest that AutoDSE lets anyone with a
decent knowledge of programming try customized computing with
minimum effort, which is our goal in democratizing customizable
computing. In the future, we plan to include more transformations
(design space parameters) for optimizing data access and reuse
patterns.
ACKNOWLEDGEMENTS
The authors would like to thank Dr. Peichen Pan for his invaluable
support with the Merlin Compiler and Dr. Lorenzo Ferretti for
helping with the comparison to his work. This work is supported
by the ICN-WEN award jointly funded by the NSF (CNS-1719403)
and Intel (34627365), and CDSC industrial partners (https://cdsc.ucla.edu/partners/).
REFERENCES
[1] Amazon EC2 F1 Instance. https://aws.amazon.com/ec2/instance-types/f1/, 2020.
[2] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jef-
frey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. Opentuner: An
extensible framework for program autotuning. In Proceedings of the 23rd in-
ternational conference on Parallel architectures and compilation, pages 303–316,
2014.
[3] Gary Bradski. The opencv library. Dr Dobb’s J. Software Tools, 25:120–125, 2000.
[4] Tony Chan, Jason Cong, and Kenton Sze. Multilevel generalized force-directed
method for circuit placement. In Proceedings of the 2005 international symposium
on Physical design, pages 185–192, 2005.
[5] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-
Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous
computing. In 2009 IEEE international symposium on workload characterization
(IISWC), pages 44–54. IEEE, 2009.
[6] Yuze Chi, Jason Cong, Peng Wei, and Peipei Zhou. Soda: stencil with optimized
dataflow architecture. In 2018 IEEE/ACM International Conference on Computer-
Aided Design (ICCAD), pages 1–8. IEEE, 2018.
[7] Young-kyu Choi and Jason Cong. Hls-based optimization and design space explo-
ration for applications with variable loop bounds. In 2018 IEEE/ACM International
Conference on Computer-Aided Design (ICCAD), pages 1–8. IEEE, 2018.
[8] Jason Cong, Muhuan Huang, Peichen Pan, Yuxin Wang, and Peng Zhang. Source-
to-source optimization for hls. In FPGAs for Software Programmers, pages 137–163.
Springer, 2016.
[9] Jason Cong, Muhuan Huang, Peichen Pan, Di Wu, and Peng Zhang. Software
infrastructure for enabling fpga-based accelerations in data centers. In Proceedings
of the 2016 International Symposium on Low Power Electronics and Design, pages
154–155, 2016.
[10] Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and
Zhiru Zhang. High-level synthesis for fpgas: From prototyping to deployment.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 30(4):473–491, 2011.
[11] Jason Cong and Jie Wang. Polysa: polyhedral-based systolic array auto-
compilation. In 2018 IEEE/ACM International Conference on Computer-Aided
Design (ICCAD), pages 1–8. IEEE, 2018.
[12] Jason Cong, Peng Wei, Cody Hao Yu, and Peng Zhang. Automated accelerator
generation and optimization with composable, parallel and pipeline architecture.
In DAC, 2018.
[13] Robert H Dennard, Fritz H Gaensslen, Hwa-Nien Yu, V Leo Rideout, Ernest
Bassous, and Andre R LeBlanc. Design of ion-implanted mosfet’s with very small
physical dimensions. IEEE Journal of Solid-State Circuits, 9(5):256–268, 1974.
[14] Falcon Computing Solutions, Inc. http://www.falcon-computing.com, 2018.
[15] Lorenzo Ferretti, Giovanni Ansaloni, and Laura Pozzi. Cluster-based heuristic
for high level synthesis design space exploration. IEEE Transactions on Emerging
Topics in Computing, 2018.
[16] Lorenzo Ferretti, Giovanni Ansaloni, and Laura Pozzi. Lattice-traversing de-
sign space exploration for high level synthesis. In 2018 IEEE 36th International
Conference on Computer Design (ICCD), pages 210–217. IEEE, 2018.
[17] Álvaro Fialho, Luis Da Costa, Marc Schoenauer, and Michèle Sebag. Analyzing
bandit-based adaptive operator selection mechanisms. Annals of Mathematics
and Artificial Intelligence, 60:25–64, 2010.
[18] Intel SDK for OpenCL Applications. https://software.intel.com/en-us/intel-
opencl, 2020.
[19] David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos
Kozyrakis, and Kunle Olukotun. Automatic generation of efficient accelerators for
reconfigurable hardware. In 2016 ACM/IEEE 43rd Annual International Symposium
on Computer Architecture (ISCA), pages 115–127. IEEE, 2016.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In Advances in neural information
processing systems, pages 1097–1105, 2012.
[21] Hung-Yi Liu and Luca P Carloni. On learning-based methods for design-space
exploration with high-level synthesis. In Proceedings of the 50th annual design
automation conference, pages 1–7, 2013.
[22] Shuangnan Liu, Francis CM Lau, and Benjamin Carrion Schafer. Accelerating
fpga prototyping through predictive model-based hls design space exploration.
In 2019 56th ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE,
2019.
[23] Anushree Mahapatra and Benjamin Carrion Schafer. Machine-learning based
simulated annealer method for high level synthesis design space exploration.
In Proceedings of the 2014 Electronic System Level Synthesis Conference (ESLsyn),
pages 1–6. IEEE, 2014.
[24] Saul B Needleman and Christian D Wunsch. A general method applicable to
the search for similarities in the amino acid sequence of two proteins. Journal of
Molecular Biology, 48(3):443–453, 1970.
[25] Raghu Prabhakar, David Koeplinger, Kevin J Brown, HyoukJoong Lee, Christo-
pher De Sa, Christos Kozyrakis, and Kunle Olukotun. Generating configurable
hardware from parallel patterns. ACM Sigplan Notices (ASPLOS), 51(4):651–665,
2016.
[26] Andrew Putnam, Adrian M Caulfield, Eric S Chung, Derek Chiou, Kypros Con-
stantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth
Gopal, Jan Gray, et al. A reconfigurable fabric for accelerating large-scale data-
center services. In 2014 ACM/IEEE 41st International Symposium on Computer
Architecture (ISCA), pages 13–24. IEEE, 2014.
[27] Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David
Brooks. Machsuite: Benchmarks for accelerator design and customized archi-
tectures. In 2014 IEEE International Symposium on Workload Characterization
(IISWC), pages 110–119. IEEE, 2014.
[28] Enrico Reggiani, Marco Rabozzi, Anna Maria Nestorov, Alberto Scolari, Luca
Stornaiuolo, and Marco Santambrogio. Pareto optimal design space exploration
for accelerated cnn on fpga. In 2019 IEEE International Parallel and Distributed
Processing Symposium Workshops (IPDPSW), pages 107–114. IEEE, 2019.
[29] B Carrion Schafer and Kazutoshi Wakabayashi. Machine learning predictive
modelling high-level synthesis design space exploration. IET computers & digital
techniques, 6(3):153–159, 2012.
[30] Benjamin Carrion Schafer. Parallel high-level synthesis design space exploration
for behavioral ips of exact latencies. ACM Transactions on Design Automation of
Electronic Systems (TODAES), 22(4):1–20, 2017.
[31] Benjamin Carrion Schafer and Kazutoshi Wakabayashi. Divide and conquer high-
level synthesis design space exploration. ACM Transactions on Design Automation
of Electronic Systems (TODAES), 17(3):1–19, 2012.
[32] Atefeh Sohrabizadeh, Jie Wang, and Jason Cong. End-to-end optimization of
deep learning applications. In The 2020 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, pages 133–139, 2020.
[33] Jessica Vandebon, Jose GF Coutinho, Wayne Luk, Eriko Nurvitadhi, and Tim Tod-
man. Artisan: a meta-programming approach for codifying optimisation strate-
gies. In 2020 IEEE 28th Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM), pages 177–185. IEEE, 2020.
[34] Vivado HLS. www.xilinx.com/products/design-tools/vivado/integration/esl-
design.html, 2020.
[35] Shuo Wang, Yun Liang, and Wei Zhang. Flexcl: An analytical performance model
for opencl workloads on flexible fpgas. In 2017 54th ACM/EDAC/IEEE Design
Automation Conference (DAC), pages 1–6. IEEE, 2017.
[36] Xilinx SDx. www.xilinx.com/products/design-tools/software-zone/sdaccel.html,
2020.
[37] Xilinx Vitis Libraries. www.github.com/Xilinx/Vitis_Libraries, 2020.
[38] Chang Xu, Gai Liu, Ritchie Zhao, Stephen Yang, Guojie Luo, and Zhiru Zhang. A
parallel bandit-based approach for autotuning fpga compilation. In Proceedings
of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays, pages 157–166, 2017.
[39] Pengfei Xu, Xiaofan Zhang, Cong Hao, Yang Zhao, Yongan Zhang, Yue Wang,
Chaojian Li, Zetong Guan, Deming Chen, and Yingyan Lin. Autodnnchip: An
automated dnn chip predictor and builder for both fpgas and asics. In The 2020
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages
40–50, 2020.
[40] Sotirios Xydis, Gianluca Palermo, Vittorio Zaccaria, and Cristina Silvano. Spirit:
Spectral-aware pareto iterative refinement optimization for supervised high-level
synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 34(1):155–159, 2014.
[41] Cody Hao Yu, Peng Wei, Max Grossman, Peng Zhang, Vivek Sarkar, and Jason
Cong. S2fa: an accelerator automation framework for heterogeneous computing
in datacenters. In 2018 55th ACM/ESDA/IEEE Design Automation Conference
(DAC), pages 1–6. IEEE, 2018.
[42] Georgios Zacharopoulos, Lorenzo Ferretti, Giovanni Ansaloni, Giuseppe
Di Guglielmo, Luca Carloni, and Laura Pozzi. Compiler-assisted selection of
hardware acceleration candidates from application source code. In 2019 IEEE
37th International Conference on Computer Design (ICCD), pages 129–137. IEEE,
2019.
[43] Zhiru Zhang, Yiping Fan, Wei Jiang, Guoling Han, Changqi Yang, and Jason Cong.
Autopilot: A platform-based esl synthesis system. In High-Level Synthesis, pages
99–112. Springer, 2008.
[44] Jieru Zhao, Liang Feng, Sharad Sinha, Wei Zhang, Yun Liang, and Bingsheng
He. Comba: A comprehensive model-based analysis framework for high level
synthesis of real applications. In 2017 IEEE/ACM International Conference on
Computer-Aided Design (ICCAD), pages 430–437. IEEE, 2017.
[45] Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. Flexten-
sor: An automatic schedule exploration and optimization framework for tensor
computation on heterogeneous system. In Proceedings of the Twenty-Fifth In-
ternational Conference on Architectural Support for Programming Languages and
Operating Systems, pages 859–873, 2020.
[46] Guanwen Zhong, Alok Prakash, Yun Liang, Tulika Mitra, and Smail Niar. Lin-
analyzer: a high-level performance analysis tool for fpga-based accelerators. In
2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–6.
IEEE, 2016.
[47] Guanwen Zhong, Alok Prakash, Siqi Wang, Yun Liang, Tulika Mitra, and Smail
Niar. Design space exploration of fpga-based accelerators with multi-level paral-
lelism. In Design, Automation & Test in Europe Conference & Exhibition (DATE),
2017, pages 1141–1146. IEEE, 2017.
[48] Guanwen Zhong, Vanchinathan Venkataramani, Yun Liang, Tulika Mitra, and
Smail Niar. Design space exploration of multiple loops on fpgas using high level
synthesis. In 2014 IEEE 32nd international conference on computer design (ICCD),
pages 456–463. IEEE, 2014.
[49] Wei Zuo, Peng Li, Deming Chen, Louis-Noël Pouchet, Shunan Zhong, and Jason
Cong. Improving polyhedral code generation for high-level synthesis. In 2013
International Conference on Hardware/Software Codesign and System Synthesis
(CODES+ISSS), pages 1–10. IEEE, 2013.
