Accuracy vs. Efficiency: Achieving Both through FPGA-Implementation
  Aware Neural Architecture Search by Jiang, Weiwen et al.
arXiv:1901.11211v1 [cs.DC] 31 Jan 2019
Accuracy vs. Efficiency: Achieving Both through
FPGA-Implementation Aware Neural Architecture Search
Weiwen Jiang1,2,3 Xinyi Zhang2 Edwin H.-M. Sha1 Lei Yang3,4 Qingfeng Zhuge1
Yiyu Shi5 Jingtong Hu2
1 East China Normal University 2 University of Pittsburgh 3 Chongqing University
4 University of California, Irvine 5 University of Notre Dame
jiang.wwen@pitt.edu
ABSTRACT
A fundamental question lies in almost every application of deep
neural networks: what is the optimal neural architecture given
a specific dataset? Recently, several Neural Architecture Search
(NAS) frameworks have been developed that use reinforcement
learning and evolutionary algorithms to search for the solution.
However, most of them take a long time to find the optimal
architecture due to the huge search space and the lengthy training
process needed to evaluate each candidate. In addition, most of them
aim at accuracy only and do not take into consideration the hardware
that will be used to implement the architecture. This can lead to
latencies well beyond the specification, rendering the resulting
architectures useless. To address both issues, in this paper we use
Field Programmable Gate Arrays (FPGAs) as a vehicle to present a
novel hardware-aware NAS framework, namely FNAS, which provides an
optimal neural architecture with latency guaranteed to meet the
specification. In addition, with a performance abstraction model that
analyzes the latency of neural architectures without training, our
framework can quickly prune architectures that do not satisfy the
specification, leading to higher efficiency. Experimental results on
common data sets such as ImageNet show that in the cases where the
state-of-the-art generates architectures with latencies 7.81× longer
than the specification, those from FNAS can meet the specs with less
than 1% accuracy loss. Moreover, FNAS also achieves up to 11.13×
speedup for the search process. To the best of the authors'
knowledge, this is the very first hardware-aware NAS.
1 INTRODUCTION
The performance of a Deep Neural Network (DNN) is mostly decided by
its architecture. Yet the design of DNN architectures relied heavily
on human expertise and labor until the recent development of Neural
Architecture Search (NAS), which can automatically explore the
optimal architecture for a particular application. Existing research
efforts [6, 16] have demonstrated that NAS can generate DNNs with
accuracy competitive with, or even better than, human-invented ones
(e.g., AlexNet, VGGNet, GoogleNet and ResNet). However, the adoption
of NAS is hindered by its low efficiency. As reported in [16], the
search process can take several days even with hundreds of GPUs. The
issue mainly comes from the fact that the search space can be huge
and, for each candidate architecture, a lengthy training process is
needed to evaluate it.
In addition, a common trait of existing NAS frameworks is that
accuracy is the sole objective guiding the search [16]. If the
resulting architecture is to be deployed in the cloud, or latency is
not a critical factor, they will still work. However, if the
architecture is to be implemented on hardware with a latency
specification, then there is no guarantee that the specification will
be met. In these scenarios, the optimal architecture found by NAS is
simply useless.
In this paper, we propose a novel hardware-aware NAS framework to
address the above issues. To illustrate our framework, we choose the
Field Programmable Gate Array (FPGA) as a vehicle, as it has
gradually become one of the most popular platforms to implement DNNs
due to its high performance and energy efficiency, in particular for
low-batch real-time applications [2]. To introduce hardware
awareness, it seems straightforward to simply include an additional
metric in existing NAS frameworks that describes the latency of a
neural architecture on an FPGA. However, the evaluation of the metric
can be challenging. First, unlike the regular path-based structures
in most human-invented DNNs, the architecture obtained by NAS can be
irregular. Many existing design flows dedicated to human-invented
DNNs are not suitable for such complicated structures
[2–4, 8, 9, 13–15]. Second, the architectures from NAS commonly have
large sizes, which may require multiple FPGAs to collaborate for
implementation. Consequently, the scheduling of tasks on multiple
FPGAs should be taken into consideration. As such, a more elegant way
to evaluate the metric is warranted.
Towards this, we put forward an abstraction model that builds
a bridge between the software (neural architecture) and hardware
(FPGA designs) for efficient latency estimation. Specifically, a
tile-based graph model is presented to describe a given DNN under
an FPGA design. In the model, we determine the granularity of
tasks, the task dependencies, and the data accesses according to
the DNN architecture and the tiling parameters in the design. Then,
the complicated dependencies among DNN layers can be captured
by adding extra edges among tasks.
As for the scheduling on multiple FPGAs, a limited number of
works exist in the literature [4, 14], all of which still follow the
scheduler design for a single FPGA [13] (see Figure 5(a)). Such a
schedule paradigm, however, cannot fully exploit the parallelism
among FPGAs. In this work, we propose a more flexible schedule
mechanism (see Figure 5(b)). We first study the design principles for
schedulers, based on which we present the mechanism to schedule
tasks in the abstraction model. Furthermore, we theoretically analyze
the latency of executions and pipeline stalls to estimate the
overall latency. Note that the proposed schedule paradigm is also
applicable to the design of multi-FPGA systems beyond the scope of
this work.
Figure 1: NAS framework [16] with its implementation.
(a) NAS framework: the controller (RNN) emits the NN hyperparameters
(number of filters, filter size, stride) for each layer, and the
training accuracy A is used as the reward (R = A). (b) Implementation
of the generated NN on target FPGAs, which either satisfies or fails
the required performance.
The main contributions of this paper are as follows.
• Framework. We build an FPGA-implementation aware neural
architecture search framework, namely FNAS, which can
generate optimal DNN architectures with guaranteed latency
on target FPGAs.
• Abstraction Model. We propose a graph model to describe
neural architectures based on FPGA implementations, which
provides the fundamental support for latency analysis. In
addition, it can model different kinds of architectures.
• Schedule Paradigm. We present a novel schedule paradigm,
which can fully exploit the parallelism among multiple FPGAs.
Experimental results on common data sets such as ImageNet show
that in the cases where the state-of-the-art [16] generates
architectures with latencies 7.81× longer than the specifications,
those from FNAS can meet them with less than 1% accuracy loss;
meanwhile FNAS also achieves 11.13× speedup for the search process.
The remainder of the paper is organized as follows. Section 2
reviews related background and Section 3 demonstrates our moti-
vation and problem formulation. The detailed FNAS algorithm is
presented in Section 4. Experimental results are shown in Section
5 and concluding remarks are given in Section 6.
2 BACKGROUND
In this section, we will present the background on the neural ar-
chitecture search and the FPGA-based DNN implementations.
Searching Neural Network Architecture. Although research on
automatically predicting neural network architectures can be traced
back to the 1980s [7], the great success of deep neural networks in
AI domains has recently led to growing interest in generating good
neural architectures. As architectures grow deeper, the search space
grows exponentially, which makes the search process difficult. In
existing works, there are two main directions in searching for an
architecture: (1) employing reinforcement learning [1, 16, 17], and
(2) applying evolutionary algorithms [6, 10]. Figure 1 shows the
NAS framework presented in [16]. NAS iteratively generates a new
child network and obtains its accuracy A by training it on a
held-out data set. Then, accuracy A is used as the reward signal for
the next iteration. The search process stops if the controller has
converged to the maximum accuracy, or the accuracy of the child
network satisfies the required accuracy rA. The generated final
design will then be implemented on FPGAs. Existing work has
demonstrated that the automatically searched network architectures
can achieve accuracy close to the best human-invented architectures
[16, 17]. However, there are two important challenges that need to
be addressed. First, the search process is inefficient: [16] reported
that 20,000 networks were trained across 500 P100 GPUs over 4 days
to find the desired network. Second, the generated neural
architectures achieve high accuracy at the expense of inference
speed. The resultant networks are usually complex and slow, and
frequently fail to satisfy the required timing specification with the
available computing resources for real-time AI applications.
Table 1: FNAS uses less time to generate architectures having
lower latency on PYNQ with small accuracy degradations.
Methods    TC (ms)   Elapsed (m,s)   Imp.     Lat. (ms)   Imp.     Acc. (%)   Deg.
NAS [17]   -         190m33s         -        19.70       -        99.42%     -
FNAS       10        74m29s          2.55×    8.67        2.27×    99.34%     -0.08%
FNAS       5         59m19s          3.21×    4.77        4.13×    99.18%     -0.24%
FNAS       2         17m07s          11.13×   1.80        10.94×   98.61%     -0.81%
Table 1 reports our results of NAS [16] for image classification
using the MNIST data set targeting the PYNQ board [5]. We can observe
that NAS takes 190m 33s to complete the search, and the generated
network has a latency of 19.70 ms and an accuracy of 99.42%.
FPGA Implementation. FPGAs have demonstrated an excellent ability
to achieve high performance and energy efficiency for low-batch
real-time inference. With this in view, a series of works on
implementing neural networks on FPGAs have been carried out: (1)
a single accelerator is implemented on a single FPGA [11, 13]; (2)
multiple accelerators are integrated into a single FPGA [8, 9, 15];
(3) multiple accelerators are deployed on multiple FPGAs [2–4, 14].
Existing NAS did not take implementation into consideration at all.
In order to make the NAS process implementation-aware, the inference
latency on the FPGA has to be obtained for each child network.
Existing research efforts commonly analyze the latency of DNNs on
FPGAs by generating HLS- or RTL-level code [8, 13, 15], which may
involve human intervention and a vast amount of time. Therefore, if
we simply obtain the inference latency using existing techniques and
naively integrate it into the reward for architecture selection, the
NAS search process will take even longer.
In this work, one of the main contributions is to accurately and
quickly estimate FPGA inference latency so that both the search
process and the generated network are efficient. Table 1 also
reports results of the proposed FNAS. From this table, we can see
that for the same dataset, by setting the inference timing
specification (TS) to 2 ms, FNAS can reduce the search time from 190
minutes to 17 minutes, achieving an 11.13× reduction. What is more,
the latency of the resultant architecture on PYNQ is 10.94× shorter
than that of the one explored by NAS. Meanwhile, the accuracy
degradation is within 1%. For the other two cases, FNAS achieves more
than 2× speedup with only 0.08% and 0.24% penalty on accuracy. These
results verify that the implementation-aware FNAS can significantly
reduce the search time and guarantee that the generated architecture
on target FPGAs satisfies the timing specification in the inference
phase while maintaining the accuracy.
Figure 2: Overview of the proposed FNAS framework.
(Components shown: the controller (RNN) emits the NN hyperparameters
(number of filters, filter size, stride) for each layer; the FNAS
tool, consisting of ➀ FNAS-Design "Design on Program Logic",
➁ FNAS-GG "Tile-based Task Graph Generator", ➂ FNAS-Sched "Scheduler
on Processing System" and ➃ FNAS-Analyzer, returns the performance L
for the target FPGAs; training returns the accuracy A; the reward is
R = f(A, L).)
3 FNAS FRAMEWORK
3.1 Problem Definition and FNAS Overview
In this paper, we aim to develop an FPGA-implementation aware
Neural Architecture Search. The problem is formally defined as
follows: Given a specific data set, a target FPGA platform and a
required inference latency rL, our objective is to automatically
generate a neural network, such that its inference latency on the
given FPGA platform is less than rL, while achieving the maximum
accuracy for the machine learning task on the given data set.
Figure 2 shows the overview of the FPGA-implementation aware
Neural Architecture Search (FNAS) framework. FNAS takes the
FPGA-based inference performance into consideration during the
search for child networks. Specifically, instead of directly applying
accuracy A as the reward, FNAS employs a reward function f to
calculate the reward in terms of accuracy A and performance/latency
L. In order to efficiently and accurately estimate the inference
latency L for a given NN architecture on target FPGAs, we develop
the "FNAS tool". There are 4 components in the FNAS tool:
➀ FNAS-Design, ➁ FNAS-GG, ➂ FNAS-Sched, and ➃ FNAS-Analyzer. In the
following, we will first present the reward function, then introduce
these components one by one.
3.2 Reward function
Reward function takes the accuracy A, latency L, and the re-
quired latency rL to calculate the reward signal. The function to
calculate the reward R is defined as follows.
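$$
R =
\begin{cases}
\dfrac{\mathit{rL} - L}{\mathit{rL}} - 1, & L > \mathit{rL} \\[2ex]
(A - b) + \dfrac{L}{\mathit{rL}}, & L \le \mathit{rL}
\end{cases}
\tag{1}
$$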
In the above function, there are two cases. First, if L > rL, the
performance of the resultant system cannot satisfy the timing
specification. In this case, we do not train the child network and
directly return a negative reward to the controller. In the second
case, we sum up the rewards of performance and accuracy, where the
performance reward is set as L/rL, which indicates that a solution
has a higher performance reward if its latency approaches the
required level. Here, b is a baseline function, which is an
exponential moving average of the previous architecture accuracies
[16].
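The two cases of Formula (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the decay factor of the moving-average baseline are assumptions made here for the example.

```python
def fnas_reward(accuracy, baseline, latency, required_latency):
    """Reward R following Formula (1).

    When latency > required_latency the child network is not trained,
    so `accuracy` is unused and may be None.
    """
    if latency > required_latency:
        # Timing specification violated: always a negative reward.
        return (required_latency - latency) / required_latency - 1.0
    # Specification met: accuracy term plus performance term L / rL.
    return (accuracy - baseline) + latency / required_latency


def update_baseline(baseline, accuracy, decay=0.95):
    """Exponential moving average of past accuracies.

    The decay value 0.95 is an assumed placeholder, not a value from the paper.
    """
    return decay * baseline + (1.0 - decay) * accuracy


violating = fnas_reward(None, 0.95, latency=4.0, required_latency=2.0)
valid = fnas_reward(0.99, 0.95, latency=1.0, required_latency=2.0)
```

Under these inputs the violating architecture gets a reward of -2.0 without any training, while the valid one gets the accuracy gain over the baseline plus a performance term of 0.5.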
3.3 Tiling Parameters: ➀ FNAS-Design
Figure 3: FNAS-Design and FNAS-GG: (a) channel tiles (N/Tn=2); (b)
row/col tiles (R/Tr=2, C/Tc=2); (c) tile-based convolution (M/Tm=3,
N/Tn=2, R/Tr=1, C/Tc=2); (d) encoding of tiles (Tm != Tn);
(e) tile-based task graph with convolutional tasks derived from (d),
showing inter-layer and intra-layer dependencies.
Due to the limited resources on an FPGA, it may be difficult to place
a whole convolutional layer on the FPGA. Consequently, it is common
to apply tiling to split convolutional operations into multiple small
tasks [8, 12–15]. FNAS-Design determines the tiling parameters for a
given NN architecture on target FPGAs.
Take one convolutional operation as an example. It involves
four parameters 〈Tm, Tn, Tr, Tc〉, related to the input/output
feature maps (IFM/OFM). Here, the number of IFM channels is N and the
size of the corresponding tiles is Tn (channels). The IFM is then
partitioned into ⌈N/Tn⌉ tiles, as shown in Figure 3(a). Similarly,
the OFM with M channels is partitioned into ⌈M/Tm⌉ tiles. In
addition, the numbers of rows and columns of the OFM are R and C,
respectively. They are tiled according to Tr and Tc, as shown in
Figure 3(b).
After tiling the IFM/OFM/row/col, one convolutional operation
is divided into smaller tasks, as shown in Figure 3(c). Each task
corresponds to a pair of IFM/OFM tiles. Tasks in one layer will be
continuously loaded to a Processing Element (PE) on the FPGA for
execution (the load sequence is determined by ➂ FNAS-Sched). A task
involves Kh × Kw × Tr × Tc × Tm × Tn Multiply-Accumulate (MAC)
operations, where Kh and Kw are the height and width of the filter
determined by the controller. A PE composed of Tm × Tn DSPs can
execute Tm × Tn MAC operations (16-bit fixed-point) in parallel [13].
The latency of a task is therefore Kh × Kw × Tr × Tc cycles.
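The tile counts and the per-task cost described above can be checked with a short sketch. The helper names and the concrete layer sizes are hypothetical; the sizes were chosen only so that the tile counts match the ratios of Figure 3(c), and the cycle count assumes each of the Tm × Tn DSPs completes one MAC per cycle.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def tile_counts(M, N, R, C, Tm, Tn, Tr, Tc):
    """Number of IFM channel tiles, OFM channel tiles and row/col tiles."""
    return {
        "ifm_channel_tiles": ceil_div(N, Tn),
        "ofm_channel_tiles": ceil_div(M, Tm),
        "row_col_tiles": ceil_div(R, Tr) * ceil_div(C, Tc),
    }


def task_macs(Kh, Kw, Tr, Tc, Tm, Tn):
    """MAC operations in one task (one IFM/OFM tile pair)."""
    return Kh * Kw * Tr * Tc * Tm * Tn


def task_latency_cycles(Kh, Kw, Tr, Tc):
    """Task latency when Tm * Tn MACs run in parallel (assumed one per cycle)."""
    return Kh * Kw * Tr * Tc


# Hypothetical layer: M/Tm = 3, N/Tn = 2, R/Tr = 1, C/Tc = 2 as in Figure 3(c).
counts = tile_counts(M=18, N=8, R=14, C=14, Tm=6, Tn=4, Tr=14, Tc=7)
macs = task_macs(Kh=5, Kw=5, Tr=14, Tc=7, Tm=6, Tn=4)
cycles = task_latency_cycles(Kh=5, Kw=5, Tr=14, Tc=7)
```

Dividing `macs` by the PE width Tm × Tn gives back `cycles`, which is the step that turns the MAC count into the task latency.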
In FNAS, each layer is allocated to a dedicated PE, and the PEs
operate in a pipelined fashion. Such an architecture can be
implemented on one FPGA as in [8, 15] or on multiple FPGAs as in
[4, 14]. The resources (e.g., DSPs and memory bandwidth) for each
layer can be obtained by considering the load balance, and the best
parameters 〈Tm, Tn, Tr, Tc〉 can then be obtained according to
[8, 13].
3.4 Tile-based task graph generator: ➁ FNAS-GG
FNAS-GG is a graph generator that takes the design parameters
and the NN architecture to generate the dependency graph between
data tiles and tasks, called the tile-based task graph. To generate
the graph, the generator first needs to define each tile for a given
design. Then, it can generate the tile-based task graph.
Figure 4: FNAS-Sched: (a) re-ordered graph from 3(e) (ordered
tile-based task graph); (b) the generated schedule graph for PE1 and
PE2 over time, marking the start-time, OFM reuse and IFM reuse.
According to FNAS-Design, there are two kinds of tiles: channel
tile and row/col tile. For channel tile, let
$CH^{ifm}_i = \{1, 2, \cdots, \lceil CH_i / T_n \rceil\}$
be a set of indices in the $i$th layer under tiling parameter $T_n$
(considering the $i$th layer's IFM); similarly
$CH^{ofm}_i = \{1, 2, \cdots, \lceil CH_i / T_m \rceil\}$
is under tiling parameter $T_m$ (considering the $i$th layer's OFM).
For row/col tile, let
$RC_i = \{1, 2, \cdots, \lceil R_i / T_r \rceil \cdot \lceil C_i / T_c \rceil\}$
be a set of indices in layer $i$ under tiling parameters $T_r$ and
$T_c$.
Based on these parameters, we can define the tiles as follows.
We define a tile in IFM as $T^{ifm}_{i,j,m}$, indicating the tile is
in the $i$th layer, the $j$th channel tile in $CH^{ifm}_i$ and the
$m$th row/col tile in $RC_i$. Similarly, the tile in OFM is defined
as $T^{ofm}_{i,k,m}$, where $k$ is the index of the channel tile in
$CH^{ofm}_i$.
Then, we can build the dependencies among data tiles and tasks.
There are two kinds of dependencies. First, inter-layer dependencies
describe the dependency between tasks and data tiles in two
consecutive layers. We define a task node to be $v_{i,j,k,m}$, which
processes the tile $T^{ifm}_{i,j,m}$ and generates the tile
$T^{ofm}_{i+1,k,m}$.
Next, the intra-layer dependencies describe the dependency between
two data tiles in one layer. If the tiling parameters $T_m$ and $T_n$
for a layer are different, as shown in Layer 2 in Figure 3(d), the
dependency is not a simple one-to-one mapping. Instead, it can
be represented as follows. For the tile $T^{ifm}_{i,j,m}$, its data
depends on tile $T^{ofm}_{i,k,m}$ if
$(j - 1) \cdot \frac{T_n}{T_m} + 1 \le k \le j \cdot \frac{T_n}{T_m}$,
where $k \in \mathbb{N}$.
By following the above rules, the graph can be generated. For
the tiles of three layers in Figure 3(d), its corresponding tile-based
task graph is shown in Figure 3(e).
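The two dependency rules above can be sketched as a small graph generator. This is an illustrative sketch under stated assumptions: the tuple encoding of tiles and tasks and the function names are invented here, and the intra-layer rule simply applies the inequality as written, using exact fractions for Tn/Tm.

```python
from fractions import Fraction
from itertools import product
from math import ceil, floor


def inter_layer_edges(layer, n_ifm, n_ofm_next, n_rc):
    """Edges IFM tile -> task -> OFM tile for every task v[i, j, k, m].

    n_ifm = |CH^ifm_i|, n_ofm_next = |CH^ofm_{i+1}|, n_rc = |RC_i|.
    """
    edges = []
    for j, k, m in product(range(1, n_ifm + 1),
                           range(1, n_ofm_next + 1),
                           range(1, n_rc + 1)):
        task = ("v", layer, j, k, m)
        edges.append((("ifm", layer, j, m), task))      # task reads T^ifm_{i,j,m}
        edges.append((task, ("ofm", layer + 1, k, m)))  # task writes T^ofm_{i+1,k,m}
    return edges


def intra_layer_sources(j, Tn, Tm):
    """OFM channel-tile indices k that IFM channel tile j depends on.

    Implements (j - 1) * Tn/Tm + 1 <= k <= j * Tn/Tm for integer k.
    """
    ratio = Fraction(Tn, Tm)
    low = (j - 1) * ratio + 1
    high = j * ratio
    return list(range(ceil(low), floor(high) + 1))


# Layer 1 of Figure 3(e): 2 IFM channel tiles, 3 OFM channel tiles, 2 row/col tiles.
edges = inter_layer_edges(layer=1, n_ifm=2, n_ofm_next=3, n_rc=2)
tasks = {e[1] for e in edges if e[1][0] == "v"}
```

With these tile counts the sketch yields 12 task nodes for layer 1, the same number of v nodes that Figure 3(e) lists for that layer.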
3.5 Scheduler design: ➂ FNAS-Sched
FNAS-Sched is a scheduler that determines the sequence of tasks
to be executed on multiple PEs, such that the schedule length
(latency) is minimized. FNAS-Sched tries to maximally exploit
the parallelism among different convolutional operations based on
the tile-based task graph. The design follows three principles:
• P1. Minimizing the start time of each PE to execute tasks
as early as possible.
• P2. Maximizing the reuse of data on FPGA to reduce the
communication bandwidth requirement.
• P3. Minimizing the pipeline stall caused by the lack of input
data in the execution on one PE.
Figure 5: FPGA-based CNN implementation.
(a) Fixed scheduler on PS part:
    for (row = 0; row < R; row += Tr)
      for (col = 0; col < C; col += Tc)
        for (to = 0; to < M; to += Tm)
          for (ti = 0; ti < N; ti += Tn)
            // load OFM, WEI, IFM
(b) The proposed scheduler in FNAS (replaces (a)):
    ID = PE1
    SCH = {PE1: [v1,1,1,1, v1,2,1,1, v1,1,2,1, ...], ...}
    for (v in SCH[ID])
      // load IFM tile, OFM tile, WEI
      // extract trr, tcc, too, tii from task v
(c) Accelerator implementation on PL part (PE1):
    for (i = 0; i < K; i++)
      for (j = 0; j < K; j++)
        for (trr = row; trr < min(row + Tr, R); trr++)
          for (tcc = col; tcc < min(col + Tc, C); tcc++)
            // Unrolling following MACs to process in parallel
            for (too = to; too < min(to + Tm, M); too++)
              for (tii = ti; tii < min(ti + Tn, N); tii++)
                OFM[too][trr][tcc] += WEI[too][tii][i][j] *
                                      IFM[tii][trr + i][tcc + j]
    // store OFM
FNAS-Sched is carried out in three steps.
Step 1: determine the sequence of IFM tiles in each layer - P1. The
order of executing IFM will affect the start time of the next layer.
There are two possible strategies: i) increase the indices of channel
tiles first; or ii) increase the indices of row/col tiles first. Because an
OFM tile is related to all IFM channels, strategy i) is more favored
than ii) to make the next layer start earlier.
Step 2: determine the sequence of OFM tiles in each layer - P1.
The sequence of OFM tiles will determine the ready time of IFM
tiles in the same layer. After Step 1, the sequence of IFM tiles has
already been determined. Hence, we sequentially visit IFM tile in
a layer and arrange the OFM tiles that are dependent on it.
Step 3: determine the task sequence - P2,P3. The task sequence
determines the data reuse rate. There are two data reuse strategies
that can be exploited: i) OFM reuse, and ii) IFM reuse. OFM (or
IFM) reuse means that consecutively executed tasks have
the same OFM (or IFM) tile and the same row/col tile. We observe
that a uniform reuse strategy for all layers leads to pipeline
stalls due to the lack of input data. In FNAS, we therefore alternate
between the two strategies for consecutive layers.
Following the above three steps, we can obtain a schedule
of tasks. For the tile-based task graph in Figure 3(e), Figure 4(a)
gives the graph with the reordered IFM/OFM tiles. We also give
the schedule for this graph in Figure 4(b). As shown in Figure 4(b),
tasks in layer 1 (PE1) achieve OFM reuse, while IFM reuse is
achieved in layer 2 (PE2). In addition, the start time is only 4
time units, and there is no stall in the execution of either layer.
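The effect of the two reuse strategies on the task order can be sketched as follows. This is a simplified illustration, not the full FNAS-Sched: it keeps the row/col tile as the outermost index (channel tiles first, as Step 1 favors), alternates OFM and IFM reuse across layers, and ignores the dependency-driven OFM ordering of Step 2. The function names and the tuple layout (i, j, k, m) are assumptions of this sketch.

```python
def task_sequence(layer, n_ifm, n_ofm_next, n_rc, reuse):
    """Order the tasks v[i, j, k, m] of one layer for a given reuse strategy."""
    sequence = []
    for m in range(1, n_rc + 1):              # row/col tile changes last
        if reuse == "ofm":
            # Consecutive tasks share the OFM tile k: iterate j innermost.
            for k in range(1, n_ofm_next + 1):
                for j in range(1, n_ifm + 1):
                    sequence.append((layer, j, k, m))
        elif reuse == "ifm":
            # Consecutive tasks share the IFM tile j: iterate k innermost.
            for j in range(1, n_ifm + 1):
                for k in range(1, n_ofm_next + 1):
                    sequence.append((layer, j, k, m))
        else:
            raise ValueError("reuse must be 'ofm' or 'ifm'")
    return sequence


def build_schedule(layer_shapes):
    """Alternate OFM reuse (odd layers) and IFM reuse (even layers)."""
    schedule = {}
    for layer, (n_ifm, n_ofm_next, n_rc) in enumerate(layer_shapes, start=1):
        reuse = "ofm" if layer % 2 == 1 else "ifm"
        schedule[f"PE{layer}"] = task_sequence(layer, n_ifm, n_ofm_next, n_rc, reuse)
    return schedule


# Two layers shaped like Figure 3(e): 2 IFM tiles, 3 OFM tiles, 2 row/col tiles.
schedule = build_schedule([(2, 3, 2), (2, 3, 2)])
```

For PE1 the sequence starts with (1,1,1,1), (1,2,1,1), (1,1,2,1), which is the same prefix as the SCH list shown in Figure 5(b).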
In an FPGA implementation, the scheduler is usually implemented
on the processing system (PS) side and the accelerator is
implemented on the programmable logic (PL) side. Figures 5(a),(c)
illustrate the commonly used PS/PL design proposed by [13]. In
that design, different data tiles are sent to the PL part in a fixed
order, called "fixed scheduling". To implement the proposed
scheduler, we modify the implementation of the PS part, as shown in
Figure 5(b), so that we can assign tasks to PEs and launch them
in an order determined by our scheduler.
3.6 Latency analysis: ➃ FNAS-Analyzer
FNAS-Analyzer aims to efficiently and accurately compute the
latency L of a neural architecture on target FPGAs with a determined
schedule. In the schedule, the latency of a PE is the sum of three
parts: (1) the processing time, (2) the start time, and (3) the stall
time. We analyze each of them in the following.
Processing Time. We first determine the execution time of tasks
in the tile-based graph. Since all tasks in layer i utilize the same
accelerator for execution, they have the same execution time,
denoted as $ET_i$, which equals
$K_{h,i} \times K_{w,i} \times T_{r,i} \times T_{c,i}$ (see
➀ FNAS-Design). The processing time $PT_i$ of a PE is the summation
of the execution time of all task nodes in the corresponding layer i,
which can be directly calculated as follows.
$$
PT_i = ET_i \times |CH^{ifm}_i| \times |CH^{ofm}_{i+1}| \tag{2}
$$
where $|CH^{ifm}_i| \times |CH^{ofm}_{i+1}|$ is the task number in
layer i (see ➁).
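Formula (2) is a direct product, which the following sketch spells out. The function names and the sample tile sizes are hypothetical and serve only to show the arithmetic.

```python
def execution_time(Kh, Kw, Tr, Tc):
    """ET_i: execution time of any single task in layer i."""
    return Kh * Kw * Tr * Tc


def processing_time(et, n_ifm_tiles, n_ofm_tiles_next):
    """PT_i = ET_i * |CH^ifm_i| * |CH^ofm_{i+1}| as in Formula (2)."""
    return et * n_ifm_tiles * n_ofm_tiles_next


# Hypothetical layer: 5x5 filter, 14x7 row/col tile, 2 IFM and 3 OFM channel tiles.
et = execution_time(Kh=5, Kw=5, Tr=14, Tc=7)
pt = processing_time(et, n_ifm_tiles=2, n_ofm_tiles_next=3)
```

Here each of the 2 × 3 tasks takes 2450 time units, so the PE is busy for 14700 time units in total.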
Start Time. The start time of a layer depends on the start time
and the data reuse strategy of its previous layer. Let
$\Delta t_{i,ofm}$ be the difference between the start times of
layers $i - 1$ and $i$. We define $\Delta t_{j,ifm}$ similarly.
First, let us consider that layer $i - 1$ applies OFM reuse,
indicating one tile in OFM is ready after computing
$\lceil CH_{i-1} / T_{n,i-1} \rceil$ tasks. One IFM tile in layer $i$
requires $\lceil T_{n,i} / T_{m,i} \rceil$ OFM tiles. Consequently,
$\Delta t_{i,ofm}$ can be calculated as follows.
$$
\Delta t_{i,ofm} =
\left\lceil \frac{CH_{i-1}}{T_{n,i-1}} \right\rceil \times
\left\lceil \frac{T_{n,i}}{T_{m,i}} \right\rceil \times ET_{i-1}
\tag{3}
$$
Next, consider that layer $i - 1$ applies IFM reuse to compute
$\Delta t_{j,ifm}$. In this case, each IFM in layer $i - 1$ will be
reused to compute the partial sum of all related OFMs. After all IFMs
except the last one in the same row/col tile have completed, it will
continuously generate OFMs in layer $i$. Thus, $\Delta t_{j,ifm}$ can
be calculated as follows.
$$
\Delta t_{j,ifm} =
\left[
\left( \left\lceil \frac{CH_{i-1}}{T_{n,i-1}} \right\rceil - 1 \right)
\times \left\lceil \frac{CH_j}{T_{m,j}} \right\rceil
+ \left\lceil \frac{T_{n,i}}{T_{m,i}} \right\rceil
\right] \times ET_{j-1}
\tag{4}
$$
where the first term of the multiplication indicates the number of
tasks completed before the first OFM in layer $i$ is produced, and
$\lceil T_{n,i} / T_{m,i} \rceil$ is the number of OFMs required by
one IFM in layer $i$.
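The two start-time gaps can be evaluated with a short sketch of Formulas (3) and (4). The argument names are invented for this example, and the sketch reads the indices i and j in Formula (4) as referring to the same layer, which is an assumption made here rather than something the text states.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def delta_t_ofm(ch_prev, tn_prev, tn_cur, tm_cur, et_prev):
    """Formula (3): gap when the previous layer applies OFM reuse."""
    tasks_per_ofm_tile = ceil_div(ch_prev, tn_prev)
    ofm_tiles_per_ifm_tile = ceil_div(tn_cur, tm_cur)
    return tasks_per_ofm_tile * ofm_tiles_per_ifm_tile * et_prev


def delta_t_ifm(ch_prev, tn_prev, ch_cur, tm_cur, tn_cur, et_prev):
    """Formula (4): gap when the previous layer applies IFM reuse.

    Assumes i and j in Formula (4) denote the same (current) layer.
    """
    tasks_before_first_ofm = (ceil_div(ch_prev, tn_prev) - 1) * ceil_div(ch_cur, tm_cur)
    ofm_tiles_per_ifm_tile = ceil_div(tn_cur, tm_cur)
    return (tasks_before_first_ofm + ofm_tiles_per_ifm_tile) * et_prev


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
gap_ofm = delta_t_ofm(ch_prev=8, tn_prev=4, tn_cur=4, tm_cur=2, et_prev=100)
gap_ifm = delta_t_ifm(ch_prev=8, tn_prev=4, ch_cur=12, tm_cur=4, tn_cur=8, et_prev=100)
```

With these numbers, OFM reuse gives 2 × 2 × 100 = 400 time units, and IFM reuse gives ((2 − 1) × 3 + 2) × 100 = 500 time units.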
Stall Time. After a PE has been launched, it may be stalled be-
cause the data for the next task is not ready. However, there may
exist another task that has already been ready to run. In our sched-
uler, we can maintain a ready-to-run queue. If a stall occurs, we
will pick one task in the ready-to-run queue to avoid pipeline stall.
Latency. We can then derive a tight lower bound on the latency
$Lat_{sys}$ by summing up the processing time and the start time.
For a total of N processing elements (PEs), assuming the first and
the last PEs apply OFM reuse, we can calculate $Lat_{sys}$ as follows.
$$
Lat_{sys} = \sum_{i=2,4,\cdots,N-1} \Delta t_{i,ofm}
          + \sum_{j=3,5,\cdots,N} \Delta t_{j,ifm} + PT_N
\tag{5}
$$
After obtaining $Lat_{sys}$, we have the latency $L = Lat_{sys}$.
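Formula (5) adds the start-time gaps along the pipeline to the processing time of the last PE. The sketch below shows the summation only; the dictionary-based interface and the numbers are hypothetical, and the gaps themselves are assumed to have been computed beforehand.

```python
def system_latency(n_pes, delta_ofm, delta_ifm, pt_last):
    """Lat_sys from Formula (5).

    delta_ofm maps even layer indices i (2, 4, ..., N-1) to Delta t_{i,ofm};
    delta_ifm maps odd layer indices j (3, 5, ..., N) to Delta t_{j,ifm};
    pt_last is PT_N. The formula assumes the first and last PEs apply OFM
    reuse, which with alternating strategies means N is odd.
    """
    if n_pes % 2 == 0:
        raise ValueError("Formula (5) assumes an odd number of PEs")
    ofm_part = sum(delta_ofm[i] for i in range(2, n_pes, 2))
    ifm_part = sum(delta_ifm[j] for j in range(3, n_pes + 1, 2))
    return ofm_part + ifm_part + pt_last


# Hypothetical three-PE pipeline.
latency = system_latency(3, delta_ofm={2: 400}, delta_ifm={3: 500}, pt_last=14700)
```

For the three-PE example the estimate is 400 + 500 + 14700 = 15600 time units.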
Table 2: Data sets and parameter settings in FNAS.
            Training Para.         Controller Parameters               Para.
Data sets   Train    Val.     E    L    FS          FN               T    [TS4,TS3,TS2,TS1]
MNIST       60,000   10,000   25   4    [5,7,14]    [9,18,36]        60   TS-High: [2,5,10,20]; TS-Low: [1,4,10,20]
CIFAR-10    45,000   5,000    25   10   [1,3,5,7]   [24,36,48,64]    60   [1.5,2,2.5,10]
ImageNet    4,500    500      25   15   [1,3,5,7]   [16,32,64,128]   60   [2.5,5,7.5,10]
• L: number of layers; FS: filter size; FN: number of filters; T: trials; E: epochs
Summary. The FNAS framework considers the performance of child
networks on target FPGAs in the neural architecture search process.
As shown in Formula 1, if the latency cannot satisfy the timing
specification, there is no need to train the generated child network.
In addition, the controller is guided to avoid searching
architectures that have insufficient performance. Consequently, the
search process can be dramatically accelerated, and the performance
of the resultant child network on target FPGAs can be guaranteed.
The evaluation of the efficiency of FNAS is presented in the
next section.
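The pruning behavior summarized above can be sketched as a search loop in which training is skipped whenever the estimated latency violates the specification. This is a schematic sketch, not the FNAS code: the four callbacks, the decay factor and the stub data are all hypothetical stand-ins for the controller, the FNAS tool and the training procedure.

```python
def fnas_search(sample_arch, estimate_latency, train_and_eval,
                update_controller, required_latency, trials, decay=0.95):
    """Search loop with early pruning; all four callbacks are hypothetical."""
    baseline = 0.0   # moving-average baseline b; decay is an assumed value
    best = None
    trained = 0
    for _ in range(trials):
        arch = sample_arch()
        latency = estimate_latency(arch)       # no training needed for this step
        if latency > required_latency:
            # Pruned: negative reward from Formula (1), training is skipped.
            reward = (required_latency - latency) / required_latency - 1.0
        else:
            accuracy = train_and_eval(arch)
            trained += 1
            reward = (accuracy - baseline) + latency / required_latency
            baseline = decay * baseline + (1.0 - decay) * accuracy
            if best is None or accuracy > best[1]:
                best = (arch, accuracy, latency)
        update_controller(arch, reward)
    return best, trained


# Stub callbacks standing in for the controller, FNAS tool and trainer.
candidates = iter([{"lat": 5.0, "acc": 0.99},
                   {"lat": 1.5, "acc": 0.98},
                   {"lat": 1.0, "acc": 0.97}])
rewards = []
best, trained = fnas_search(lambda: next(candidates),
                            lambda a: a["lat"],
                            lambda a: a["acc"],
                            lambda a, r: rewards.append(r),
                            required_latency=2.0, trials=3)
```

In the stub run the first candidate is the most accurate but violates the 2.0 latency bound, so it is never trained and only the two valid candidates contribute to the result.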
4 EXPERIMENTAL RESULTS
This section reports the evaluation results of the proposed FNAS
framework. Results on the MNIST, CIFAR-10, and ImageNet data sets
show that FNAS can achieve up to 11.13×, 10.89× and 10.38× speedup
in the search process respectively. Moreover, for the case where
NAS [16] generates architectures with latencies 9.85× longer than
the specification, FNAS can meet the specs with less than 1%
accuracy loss.
4.1 Experimental setup
Data sets. FNAS will search convolutional neural network structures
for three data sets: MNIST, CIFAR-10, and ImageNet. Each data set is
composed of a training set and a validation set, as shown in Table 2.
For instance, for MNIST there are 60,000 and 10,000 examples in the
training and validation sets, respectively. Note that for ImageNet,
we use a smaller data set to reduce the computation time. In training
of the child networks, the number of epochs is set to 25, and the
maximum validation accuracy in the last 5 epochs is used to compute
the reward for updating the controller.
Controller. We implement the reinforcement learning based
RNN controller based on [16] to generate child networks. For differ-
ent data sets, the controller has different configurations as shown
in Table 2. For instance, we explain the configuration of MNIST:
(1) its child network has 4 layers, (2) the possible filter size (height
and width) is 5, 7 or 14, (3) the possible channel number is 9, 18, or
36, and (4) it will find 60 child networks.
FNAS Tool. We implement all the components described in Section 3
and integrate them into the controller to realize the FNAS
framework. To compare FNAS with NAS, we employ both low-end
and high-end FPGAs to implement the resultant architectures using
the MNIST data set. The low-end and high-end FPGAs selected
are Xilinx 7A50T and 7Z020, respectively. To see how general our
conclusions are, we have explored CIFAR-10 and ImageNet as
additional data sets, where Xilinx ZU9ED is used.
4.2 Efficiency and accuracy evaluations of FNAS
Figures 6(a),(b),(c) report comparison results on search time,
latencies of the resultant DNNs, and accuracies of the resultant DNNs
respectively on the MNIST data set. In these figures, the x-axis
represents different FPGAs. FNAS-loose (TS2), FNAS-med (TS3),
and FNAS-tight (TS4) correspond to three timing specifications
(see Table 2).
Figure 6: Comparison of search time, latency and accuracy
between NAS and FNAS. Panels: (a) search time (min); (b) latency
(ms); (c) accuracy (%). Series: NAS, FNAS-loose, FNAS-med,
FNAS-tight. x-axis: 7Z020, 7A50T.
Figure 7: (a) Accuracy loss and (b) search time reduction vs.
timing specifications on three data sets. The architectures
from NAS are used as the baseline cases. x-axis: timing
specifications TS1 to TS4 for MNIST, CIFAR-10 and ImageNet; labeled
values in (b): 10.38, 10.89, 11.13.
From Figure 6(a), we can see that FNAS dramatically reduces the
search time. For the 7Z020 target FPGA, under the loose, medium,
and tight timing specifications (actual values in Table 2), the
search time is reduced from 190 minutes to 74, 59, and 17 minutes,
a reduction of 2.56×, 3.22×, and 11.13×, respectively, compared
with NAS. There are two reasons for the improvement: (1) with
early-stage pruning, we do not train the architectures whose
inference latencies violate the specification; (2) the structures
of the valid DNNs are usually simpler than those with violations,
so most complex architectures are naturally pruned without the
need for training. From the figure we can also see that for FNAS
the search time decreases as the timing specification gets tighter,
which is expected because tighter specifications prune more
candidate architectures.
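
The early-stage pruning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the FNAS implementation: `estimate_latency_ms`, `train_and_evaluate`, the tuple encoding of an architecture, and the toy latency model are all hypothetical stand-ins for the performance abstraction model and the child-network training step, and the controller update is omitted.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical encoding: number of filters in each convolution layer.
Architecture = Tuple[int, ...]


def search_with_pruning(
    candidates: Iterable[Architecture],
    estimate_latency_ms: Callable[[Architecture], float],
    train_and_evaluate: Callable[[Architecture], float],
    spec_ms: float,
) -> Tuple[List[Tuple[Architecture, float]], int]:
    """Train only the candidates whose estimated latency meets the spec."""
    trained: List[Tuple[Architecture, float]] = []
    pruned = 0
    for arch in candidates:
        # Latency is estimated analytically, so this check needs no training.
        if estimate_latency_ms(arch) > spec_ms:
            pruned += 1
            continue
        trained.append((arch, train_and_evaluate(arch)))
    return trained, pruned


# Toy stand-ins, assumptions for illustration only.
def toy_latency_ms(arch: Architecture) -> float:
    return 0.01 * sum(arch)  # pretend latency grows with total filter count


def toy_accuracy(arch: Architecture) -> float:
    return 0.0  # placeholder; real training would return validation accuracy


candidates = [(64, 64), (64, 128), (128, 128)]
trained, pruned = search_with_pruning(
    candidates, toy_latency_ms, toy_accuracy, spec_ms=2.0
)
print(f"trained {len(trained)} candidates, pruned {pruned}")
```

Tightening `spec_ms` in this sketch moves more candidates into the pruned set, which mirrors why the search time falls as the specification gets tighter.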
Next, as shown in Figure 6(b), for different timing specifications,
FNAS can generate a specific architecture to satisfy the
specification, i.e., the latency of the architecture decreases as
the specification gets tighter. In contrast, NAS can only generate
a single architecture, whose latency is 2.54×, 4.19×, and 7.81×
longer than the loose, medium, and tight specifications,
respectively. The flexibility of FNAS provides more choices for
designers and also helps to prune useless designs at a very early
design stage.
Figure 6(c) reports the accuracy of the generated architectures.
We can observe that, even compared with the architectures
generated by NAS that have timing violations, those generated by
FNAS have accuracy degradation of only 0.08%, 0.24%, and 0.81% for
the loose, medium, and tight timing specifications, respectively.
The above results clearly show that FNAS can achieve both high
efficiency and high accuracy in exploring neural architectures.
[Figure 8 plot: x-axis explored architectures 1 to 16; y-axis clock
cycles (×10^5, 0 to 60); series FNAS-Sched and Fixed sched; data
labels recovered from the plot, in extraction order: 12.20%, 14.15%,
14.51%, 14.50%, 13.27%, 15.02%, 14.53%, 14.90%, 13.42%, 13.37%,
13.23%, 12.75%, 12.76%, 9.72%, 8.59%, 15.63%.]
Figure 8: Number of clock cycles on a set of architectures
using FNAS-Sched and fixed scheduling [13] on four accelerators.

Finally, we explore how the accuracy of the architectures generated
by FNAS, as well as the corresponding search time, scales with the
timing specifications. The results are depicted in Figure 7, where
three data sets (MNIST, CIFAR-10, and ImageNet) are used. For
MNIST, the high-end FPGA is used. The x-axis represents different
timing specifications, where TS1 is the loosest and TS4 is the
tightest; the corresponding values are summarized in Table 2. When
reporting both accuracy and search time, we use the architecture
from NAS as the reference and show the accuracy loss as well as
the search time reduction. From Figure 7(a) we can observe that,
in general, the architectures generated by FNAS without timing
violations have less than 1% accuracy loss compared with those
generated by NAS that have violations. The accuracy loss increases
as the timing constraint becomes tighter. Meanwhile, from Figure
7(b), we can see that the search time of FNAS is significantly
reduced, achieving an 11.18× reduction for CIFAR-10 compared with
NAS.
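
The search time reduction reported relative to the NAS baseline is the ratio of the NAS search time to the FNAS search time. The short sketch below applies that ratio to the MNIST search times quoted earlier for the 7Z020 (190 minutes for NAS; 74, 59, and 17 minutes for FNAS). The function and variable names are illustrative only. Because the quoted minutes are rounded, the computed factors come out close to, but not exactly equal to, the reported 2.56×, 3.22×, and 11.13×.

```python
def reduction_factor(baseline_minutes: float, fnas_minutes: float) -> float:
    """Search time reduction of FNAS relative to the NAS baseline."""
    return baseline_minutes / fnas_minutes


# Rounded search times quoted in the text for MNIST on the 7Z020.
nas_minutes = 190
fnas_minutes = {"loose": 74, "medium": 59, "tight": 17}

factors = {
    spec: reduction_factor(nas_minutes, minutes)
    for spec, minutes in fnas_minutes.items()
}
for spec, factor in factors.items():
    print(f"{spec}: {factor:.2f}x")
```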
4.3 Improvements from the scheduler
We also evaluate the efficiency of the proposed scheduler against
one that uses the fixed scheduling technique [13] on four
accelerators. We test a set of architectures with 4 convolution
layers on a PYNQ board [5]. For each convolution layer, the filter
size is 3×3 and the number of filters is 64 or 128. Figure 8
reports the number of clock cycles of each possible architecture.
As shown in this figure, our proposed scheduler consistently
achieves better performance. This demonstrates the effectiveness
of the proposed scheduler for FPGA designs.
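
With 4 convolution layers and two filter-count choices per layer, the test set contains 2^4 = 16 architectures, which matches the 16 architectures on the x-axis of Figure 8. The sketch below enumerates them. The multiply-accumulate (MAC) count is only a rough workload proxy added for illustration: the 3 input channels are an assumption, the count is not a clock-cycle model, and the enumeration order is not claimed to match the numbering used in Figure 8.

```python
from itertools import product
from typing import Sequence

NUM_LAYERS = 4
FILTER_CHOICES = (64, 128)
KERNEL_SIZE = 3  # 3x3 filters, as stated in the text


def macs_per_output_pixel(
    filters: Sequence[int], in_channels: int = 3, kernel: int = KERNEL_SIZE
) -> int:
    """MACs needed per output position, summed over all conv layers.

    in_channels=3 is an assumed input depth, not a value from the paper.
    """
    total = 0
    c_in = in_channels
    for c_out in filters:
        total += kernel * kernel * c_in * c_out
        c_in = c_out  # output channels of this layer feed the next one
    return total


# Every combination of 64 or 128 filters across the 4 layers.
architectures = list(product(FILTER_CHOICES, repeat=NUM_LAYERS))

for index, filters in enumerate(architectures, start=1):
    print(index, filters, macs_per_output_pixel(filters))
```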
5 CONCLUSION
In this work, we use FPGAs as a vehicle to explore hardware-aware
neural architecture search (NAS). Our objective is to automatically
search for the neural architecture with the optimal accuracy while
satisfying the timing specification on target FPGAs. To achieve
this goal, a novel FPGA-implementation aware NAS framework is
proposed. In the framework, we build a performance abstraction
model and a new scheduling paradigm to fully exploit the
parallelism among multiple FPGAs. Evaluation results show that for
the cases where the state-of-the-art NAS generates architectures
with inference latency 7.81× longer than the specification, the
proposed framework can meet the specification with less than 1%
accuracy loss, as well as up to 11.13× speedup in the search
process.
ACKNOWLEDGEMENTS
This work was supported in part by the National Natural Science
Foundation of China under Grant 61472052, in part by the National
Science Foundation under Grant CCF-1820537, and in part by the
China Scholarship Council under Grant 201706050116 and Grant
201706050117.
REFERENCES
[1] Bowen Baker et al. 2016. Designing neural network architectures using rein-
forcement learning. arXiv preprint arXiv:1611.02167 (2016).
[2] Eric Chung et al. 2018. Serving DNNs in Real Time at Datacenter Scale with
Project Brainwave. IEEE Micro 38, 2 (2018), 8–20.
[3] Jeremy Fowers et al. 2018. A configurable cloud-scale DNN processor for real-
time AI. In Proc. of ISCA. IEEE Press, 1–14.
[4] Weiwen Jiang et al. 2018. Heterogeneous FPGA-based Cost-Optimal Design for
Timing-Constrained CNNs. IEEE TCAD (2018).
[5] PYNQ. 2018. PYNQ: Python productivity for ZYNQ. http://www.pynq.io/. Ac-
cessed November (2018).
[6] Esteban Real et al. 2017. Large-scale evolution of image classifiers. arXiv preprint
arXiv:1703.01041 (2017).
[7] J David Schaffer et al. 1992. Combinations of genetic algorithms and neural
networks: A survey of the state of the art. In Proc. of COGANN-92. IEEE, 1–37.
[8] Yongming Shen et al. 2017. Maximizing CNN Accelerator Efficiency Through
Resource Partitioning. In Proc. of ISCA. 535–547.
[9] Xuechao Wei et al. 2018. TGPA: tile-grained pipeline architecture for low latency
CNN inference. In Proc. of ICCAD. ACM, 58.
[10] Lingxi Xie et al. 2017. Genetic CNN. In Proc. of ICCV. 1388–1397.
[11] Xiaowei Xu et al. 2018. Resource constrained cellular neural networks for real-
time obstacle detection using FPGAs. In Proc. of ISQED. IEEE, 437–440.
[12] Lei Yang et al. 2018. Optimal Application Mapping and Scheduling for Network-
on-Chips with Computation in STT-RAM based Router. IEEE TC (2018).
[13] Chen Zhang et al. 2015. Optimizing FPGA-based accelerator design for deep
convolutional neural networks. In Proc. of FPGA. ACM, 161–170.
[14] Chen Zhang et al. 2016. Energy-Efficient CNN Implementation on a Deeply
Pipelined FPGA Cluster. In Proc. of ISLPED. 326–331.
[15] Xiaofan Zhang et al. 2018. DNNBuilder: an automated tool for building
high-performance DNN hardware accelerators for FPGAs. In Proc. of ICCAD. ACM, 56.
[16] Barret Zoph et al. 2016. Neural architecture search with reinforcement learning.
arXiv preprint arXiv:1611.01578 (2016).
[17] Barret Zoph et al. 2017. Learning transferable architectures for scalable image
recognition. arXiv preprint arXiv:1707.07012 (2017).
