Hardware/Software Co-Exploration of Neural Architectures by Jiang, Weiwen et al.
arXiv:1907.04650v1 [cs.LG] 6 Jul 2019
Hardware/Software Co-Exploration of Neural
Architectures
Weiwen Jiang, Student Member, IEEE, Lei Yang, Edwin Sha, Senior Member, IEEE, Qingfeng Zhuge, Shouzhen
Gu, Yiyu Shi, Senior Member, IEEE, and Jingtong Hu, Member, IEEE
Abstract—We propose a novel hardware and software co-
exploration framework for efficient neural architecture search
(NAS). Different from existing hardware-aware NAS which as-
sumes a fixed hardware design and explores the neural architec-
ture search space only, our framework simultaneously explores
both the architecture search space and the hardware design space
to identify the best neural architecture and hardware pairs that
maximize both test accuracy and hardware efficiency. Such a
practice greatly opens up the design freedom and pushes forward
the Pareto frontier between hardware efficiency and test accuracy
for better design tradeoffs. The framework iteratively performs
a two-level (fast and slow) exploration. Without lengthy training,
the fast exploration can effectively fine-tune hyperparameters and
prune inferior architectures in terms of hardware specifications,
which significantly accelerates the NAS process. Then, the slow
exploration trains candidates on a validation set and updates
a controller using the reinforcement learning to maximize the
expected accuracy together with the hardware efficiency. Exper-
iments on ImageNet show that our co-exploration NAS can find
the neural architectures and associated hardware design with
the same accuracy, 35.24% higher throughput, 54.05% higher
energy efficiency and 136× reduced search time, compared with
the state-of-the-art hardware-aware NAS.
Index Terms—Hardware-Software Co-Exploration, Neural Ar-
chitecture Search, FPGA, Multi-Criteria Optimization
I. INTRODUCTION
Neural architecture search (NAS) has achieved great
success in liberating human labor in the design of neural
architectures for various tasks including image classification
and language modeling [1]–[5]. Most recently, targeting a
fixed hardware platform, the hardware-aware NAS [6]–[8] has
been proposed to take into consideration the estimated timing
performance (such as latency or throughput) in addition to
accuracy (see Figure 1(a)).
All of the existing NAS frameworks explore the architecture
search space only, without considering the hardware design
freedom available in many cloud and edge computing applica-
tions. For instance, the cloud platforms (e.g. Amazon AWS [9]
and Microsoft Azure [10]) employ Field Programmable Gate
Array (FPGA) for neural network acceleration, while the edge
computing platforms typically take the programmable FPGAs
[11], [12] or Application-Specific Integrated Circuit (ASIC)
[13], [14]. In addition to neural architecture design, those
W. Jiang and Y. Shi are with the Department of Computer Science and
Engineering, University of Notre Dame, Notre Dame, IN 46556 (e-mail:
wjiang2@nd.edu).
L. Yang and J. Hu are with the Department of Electrical and Computer
Engineering, University of Pittsburgh, Pittsburgh, PA 15261.
E. H.-M. Sha, Q. Zhuge, and S. Gu are with the School of Computer Science
and Software Engineering, East China Normal University, 200062 China
[Figure 1 diagram. (a) Hardware-Aware NAS: predict arch (NN1, NN2, ...) from the Arch Search Space; Hardware-Awareness Module (fixed target platform; meet time? Y/N); train child network; accuracy and time update the controller. (b) Co-explore “Architecture Search Space” and “Hardware Design Space”: predict arch (NN1, NN2, ...); fast-level: select hardware (Design 1, Design 2, ...; meet time? Y/N); slow-level: train child network; accuracy, time, monetary cost, utilization, etc. update the controller.]
Figure 1. Comparison between (a) hardware-aware NAS; (b) the proposed
hardware/software co-exploration NAS. The red rectangles convey the metrics
that can be optimized in the exploration.
hardware platforms can also be programmed or even fully
customized for the best performance, expanding a hardware
design space.
Interestingly, the hardware design space is tightly coupled
with the architecture search space, i.e., the best neural ar-
chitecture depends on the hardware (hardware-aware NAS),
and the best hardware depends on the neural architecture.
It is therefore best to jointly explore both spaces to push
forward the Pareto frontier between hardware efficiency and
test accuracy for better design tradeoffs. This can be clearly
seen from the example in Table I, where three designs on
CIFAR-10 and Xilinx XC7Z015 FPGAs are presented: an op-
timized neural architecture for a fixed FPGA implementation
through hardware-aware NAS (design A), the hardware of
which is then further optimized through FPGA optimization
(design B) [15], and a jointly optimized neural architecture
and hardware through our co-exploration (design C). From
the table, we can see that further optimizing the hardware for
the architecture from hardware-aware NAS can lead to 45.45%
higher throughput, 38.24% higher energy efficiency with the
same accuracy. On the other hand, compared with such a
sequential optimization strategy, our co-exploration approach
can identify an architecture with higher accuracy and its tailor-
made hardware with 16.33% and 28.80% improvements in
Table I
ON CIFAR-10 AND XILINX XC7Z015 FPGA: COMPARISONS OF THREE NEURAL ARCHITECTURE AND HARDWARE DESIGN PAIRS IN ACCURACY, THROUGHPUT, AND ENERGY EFFICIENCY (E.-E): A) OPTIMAL ARCHITECTURE ON A FIXED HARDWARE IMPLEMENTATION THROUGH HARDWARE-AWARE NAS; B) THE SAME ARCHITECTURE BUT WITH FURTHER FPGA OPTIMIZATION, AND C) A JOINTLY OPTIMIZED NEURAL ARCHITECTURE AND FPGA IMPLEMENTATION THROUGH OUR CO-EXPLORATION.

ID  Approach                 Accuracy  Throughput (FPS)  E.-E (GOPS/W)
A   Hardware-Aware NAS       84.53%    16.2              0.84
B   Sequential Optimization  84.53%    29.7              1.36
C   Co-Exploration           85.19%    35.5              1.91

throughput and energy efficiency, respectively.
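
The percentages quoted in this paragraph can be checked against Table I. The sketch below assumes, since the text does not state it, that each gain is normalised by the improved design's value. Under that assumption the quoted figures are reproduced to within rounding. Normalising by the baseline value instead gives larger numbers. The helper names are illustrative only.

```python
# Table I values: (throughput in FPS, energy efficiency in GOPS/W)
designs = {"A": (16.2, 0.84), "B": (29.7, 1.36), "C": (35.5, 1.91)}


def gain_vs_improved(old, new):
    """Gain as a percentage of the improved value (assumed normalisation)."""
    return 100.0 * (new - old) / new


def gain_vs_baseline(old, new):
    """Gain as a percentage of the baseline value."""
    return 100.0 * (new - old) / old


for base, better in [("A", "B"), ("B", "C")]:
    for idx, metric in enumerate(["throughput", "energy efficiency"]):
        old, new = designs[base][idx], designs[better][idx]
        print(f"{base}->{better} {metric}: "
              f"{gain_vs_improved(old, new):.2f}% of improved value, "
              f"{gain_vs_baseline(old, new):.2f}% of baseline value")
```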
Specifically, our architecture search space and hardware de-
sign space co-exploration framework is shown in Figure 1(b).
The proposed co-exploration can be built on any existing NAS
framework [2], [8], [16], [17] by expanding it to delve into
the hardware design space, where a two-level (fast and slow)
exploration is iteratively conducted. In the fast exploration,
the best hardware design is identified for the sampled neural
architectures without lengthy training. The architectures with
inferior hardware efficiency will be quickly pruned, which
significantly accelerates the search process. Thereafter, the
superior candidates are trained in the slow exploration for
controller update using policy gradient reinforcement learning
to explore the coupled architecture search space. The optimiza-
tion objectives in the hardware design space can be varied
according to the design specifications, such as area, monetary
cost, energy efficiency, reliability, resource utilization, etc.
In order to illustrate our framework, we choose to use FPGA
as a vehicle in this paper, as it has gradually become one of
the most popular platforms to implement deep neural networks
(DNNs) due to its programmability, high performance and
energy efficiency, in particular for low-batch inferences [18],
[19]. Our co-exploration concept and the general framework,
however, can also be easily extended to other hardware plat-
forms such as ASICs. Since timing performance on a single
FPGA is limited by its restricted resource, it is prevalent to
organize multiple FPGAs in a pipelined fashion [20]–[23] to
provide high throughput (frames per second, FPS). In such a
system, the pipeline efficiency is one of the most important
metrics needing to be maximized, since it determines the
hardware utilization as well as energy efficiency. As such, we
use accuracy and pipeline efficiency to guide the exploration
of the neural architecture space and hardware design space
respectively, while satisfying a given throughput specification
(e.g., ≥30FPS for the ordinary camera). Experimental results
show that the co-exploration approach can significantly push
forward the Pareto frontier. On ImageNet, the proposed co-
exploration framework can identify architecture and hardware
pairs to achieve the same accuracy, 35.42% higher throughput,
54.05% higher energy efficiency and 136× reduced search
time, compared with the hardware-aware NAS.
II. BACKGROUND AND PROBLEM DEFINITION
A. Neural Architecture Search
Although research on the automatic prediction of neural
network architectures can be traced back to the 1980s [24], the
great success of deep neural networks in AI domains has recently
led to growing interest in generating good neural architectures
for a dataset of interest. As architectures grow deeper, the search space
expands exponentially, making it more difficult to explore.
Existing work follows two mainstream approaches to
architecture search: (1) employing reinforcement learning
[2], [3], [25], and (2) applying evolutionary algorithms [4], [26],
[27]. The basic idea is to iteratively update hyperparameters
to generate better “child networks” in terms of accuracy.
Figure 1(a), without the hardware-awareness module, illustrates
a typical reinforcement-learning-based neural architecture
search (NAS) framework [2]. As shown in this figure,
the RNN controller in NAS iteratively predicts child networks
from the architecture search space. These child networks are
trained on a held-out dataset to obtain their accuracy. The
accuracy is then used to update the RNN controller.
Existing work has demonstrated that automatically generated
architectures can achieve accuracy close to or even higher
than that of the best human-invented architectures [2], [3]. However,
there are two important problems in searching architectures.
First, the search process is inefficient. [2] reported that 20,000
networks were trained across 500 P100 GPUs over 4 days
to find the desired network. Second, since the search process
is hardware oblivious, neither the time performance nor the
hardware efficiency can be guaranteed.
Recently, hardware-aware NAS [6]–[8] has been proposed
to search architectures for a target hardware platform, as
shown in Figure 1(a). These approaches assume a fixed hardware
design (e.g., mobile chips) and only explore the architecture search
space for that fixed hardware. However, hardware design freedom
is commonly available in many cloud and edge computing
applications, such as FPGAs in cloud platforms [9], [10] and
ASICs in edge computing platforms [13], [14]. Ignoring the
hardware design space leads to designs with inferior
hardware efficiency, because the hardware design
space and the architecture search space are tightly coupled.
Compared with existing work, the main contribution of this
work is to propose a framework to co-explore the architecture
search space and the hardware design space, as shown in
Figure 1(b). More specifically, this framework determines the
best hardware during the search process, which is tailor-made
for the candidate architectures. In this way, the framework can
obtain a set of superior architecture and hardware design pairs
on the Pareto frontier of accuracy and hardware efficiency
tradeoffs. In addition, the search time can be significantly
reduced, since we prune more inferior architectures according
to multiple design specifications compared with the hardware-
aware NAS.
B. Implementing DNNs on FPGAs
This paper will employ FPGA as a vehicle to study
how to co-explore neural architectures and hardware designs.
[Figure 2 diagram: ➀ child network with layers l1 to l5, where para1 = ⟨num of filters, filter size, precision, ...⟩; ➁ partition (P) of the layers into pipeline stages; ➂ assignment (α) of stages to FPGAs f1, ..., fk, fk+1, ..., fn in the FPGA pool; ➃ pipelined FPGAs with utilizations U1, U2, U3.]
Figure 2. An overview of implementing a child network onto multiple FPGAs
to be organized in the pipelined fashion.
FPGAs have demonstrated an excellent ability to achieve high
performance and energy efficiency for low-batch real-time
inference [18], [19]. Hence, a large body of work addresses
implementing neural networks on FPGAs, in which tools
are developed to automatically design accelerators on FPGAs
for a given network architecture. In the early stage, research
efforts mainly focused on designing accelerators on a
single FPGA [28]–[31]. Most recently, implementations on
multiple FPGAs have become the mainstream [15], [18]–[20],
[22], [23], since the limited resources of a single FPGA become
the performance bottleneck.
To fully utilize the computation power provided by multiple
FPGAs, a typical technique is to implement the neural network
on multiple FPGAs in a pipelined fashion [15], [20], [22], [23].
Figure 2 demonstrates one such example, in which a 5-layer
network is partitioned into 3 pipeline stages, and each pipeline
stage is mapped to a certain FPGA in an available pool. Finally,
those FPGAs are connected as a linear array to function in the
pipelined fashion.
C. Definitions and Problem Statement
The goal of the proposed framework is to find both the
neural architectures with the highest test accuracy and hard-
ware design with the guaranteed performance (e.g. timing
requirement and utilization of FPGAs). In this subsection, we
will first present the relevant system definitions. Then, we will
formally define the problem based on FPGA.
➀ Child Network. A child network is defined as C =
〈L, para, acc〉. It consists of a set of layers L. The number of
layers in the child network is the size of set L, i.e., |L|. For the
ith layer li ∈ L, set parai contains the predictable parameters,
such as the number of filters, filter size, etc. The accuracy of
the child network is acc, which can be obtained by training
C on a held-out dataset. For illustration purposes, we use a
linear chain of layers as an example in Figure 2 ➀. However,
the proposed technique is not limited to such structure and is
applicable to more complicated structures, such as Directed
Acyclic Graph (DAG).
The child network is the bridge between the architecture
search space and the hardware design space. Specifically, in
each iteration, the controller RNN will predict child networks
from the architecture search space, and then determine their
implementations in the hardware design space. We will intro-
duce the hardware design space as follows.
➁ Partition Child Network to Pipeline Stages. Let P (C)
be a set of partitions for the child network C. P (C) =
{P1, P2, · · · , PM}, where Pi is a nonempty subset of set L.
We have the following two properties: (1) $\bigcup_{P_i \in P(C)} P_i = L$;
and (2) $\forall P_i, P_j \in P(C)$, if $i \neq j$, then $P_i \cap P_j = \emptyset$. After
the partitioning, each set in P (C) corresponds to a pipeline
stage. For example, in Figure 2➁, we partition the given child
network into 3 pipeline stages, P1 = {l1}, P2 = {l2, l3}, and
P3 = {l4, l5}.
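
As a minimal sketch, the two partition properties can be checked directly with set operations. The function name and the string layer labels are illustrative assumptions, not part of the framework.

```python
def is_valid_partition(layers, stages):
    """Check that stages are nonempty, cover the layer set L, and are pairwise disjoint."""
    if any(len(stage) == 0 for stage in stages):
        return False  # every P_i must be a nonempty subset of L
    covered = set().union(*stages)  # property (1): union of all P_i
    # property (2): no layer is counted twice across stages
    disjoint = sum(len(stage) for stage in stages) == len(covered)
    return covered == set(layers) and disjoint


layers = ["l1", "l2", "l3", "l4", "l5"]
stages = [{"l1"}, {"l2", "l3"}, {"l4", "l5"}]  # the three-stage example from Figure 2
print(is_valid_partition(layers, stages))
```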
➂ Assign Pipeline Stages to FPGAs. Then, we can assign
each pipeline stage to a specific FPGA in an available FPGA
pool, as shown in Figure 2 ➂. An FPGA pool with n FPGAs
can be represented by a set F = {f1, f2, · · · , fn}. Each FPGA,
fi, has a set of attributes, including memory memi, DSP slices
dspi, etc. These attributes will be utilized to model the timing
performance for a child network.
We define the assignment function α from the partition set
P (C) to the FPGA pool F . We write α(Pi) = fj to indicate that
the ith pipeline stage Pi is assigned to the jth FPGA fj to
be implemented. After the pipeline stages are assigned to the FPGA
pool according to α, each FPGA will process one or multiple
layers, and all FPGAs work together in a pipelined fashion.
➃ Pipelined FPGAs. The pipelined executions of multiple
FPGAs are illustrated in Figure 2 ➃. The system will contin-
uously obtain inputs from the dataset with a fixed rate (frame
per second), and generate output data from the last pipeline
stage. The input rate of the system reflects the throughput
specification TS, which implies that the latency of each
pipeline stage should be no more than 1/TS.
The latency of a pipeline stage under an assignment function
can be easily captured with a performance model [28]. For
FPGA fi, its latency is denoted as Lati. After obtaining the
latency of each FPGA, we introduce pipeline efficiency, which
is composed of the hardware utilization in each pipeline stage
(corresponding to an FPGA). The utilization of FPGA fi is
equal to Lati × TS. Higher utilization of an FPGA indicates
less idle time in processing and higher energy efficiency.
Therefore, a high average utilization across all FPGAs is always
desired.
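
The utilization and throughput check described above can be sketched as follows. The latency values are hypothetical, and the function name is an assumption. In the framework itself the latencies come from a performance model [28].

```python
def pipeline_efficiency(latencies, ts):
    """Return per-FPGA utilizations U_i = Lat_i * TS, their average, and a TS check.

    latencies: latency of each pipeline stage in seconds (hypothetical inputs)
    ts: throughput specification TS in frames per second
    """
    utilizations = [lat * ts for lat in latencies]
    # Each stage must finish within 1/TS seconds, i.e. U_i <= 1.
    meets_ts = all(u <= 1.0 for u in utilizations)
    average = sum(utilizations) / len(utilizations)
    return utilizations, average, meets_ts


# Hypothetical three-stage pipeline at TS = 35 FPS
utils, avg, ok = pipeline_efficiency([0.020, 0.028, 0.025], ts=35)
print(utils, avg, ok)
```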
Problem Statement. Based on the above definitions, we for-
mally define the problem of “hardware/software co-exploration
of neural architectures” as: Given a dataset, a pool of FPGAs
F , and a throughput specification TS, we are going to co-
explore architecture search space and hardware design space
to find a child network C:
• para: parameters of all layers in the child network;
• P : the partition of layer set L in the child network;
• α: the assignment of pipeline stages to set F ;
such that the accuracy of child network C is maximized, the
pipeline FPGA system can meet the required throughput TS,
and the average utilization of all FPGAs is maximized.
III. HW/SW CO-EXPLORATION FRAMEWORK
In this section, we will present the proposed framework. We
will use the NAS discussed in [2] as the backbone framework
[Figure 3 diagram: an RNN controller with one NAS cell (RNN cell) per layer, where the cell for layer i takes parameters ⟨fi, ki, si, ...⟩ and predicts ⟨fi’, ki’, si’, ...⟩ for layers 1 to N. Level 1: Fast Exploration (FE): (1) generate pipelined FPGA configuration to satisfy the throughput; (2) iteratively train the controller to maximize utilization of each FPGA, with rewards ⟨R1, R2, R3, ..., RM⟩ on the hyperparameters of the child network. Level 2: Slow Exploration (SE): (1) train the child network from Level 1 to obtain its accuracy; (2) generate reward in terms of accuracy and utilization, Reward(A,U). Child networks with better hardware utilization pass from Level 1 to Level 2.]
Figure 3. An overview of HW/SW co-exploration framework: The controller
contains multiple reconfigurable RNN cells and predicts the hyperparameters
in a child network; the fast exploration level prunes child networks with
inferior hardware utilization; the slow exploration level updates controller
using hardware utilization and accuracy obtained by training child networks.
and FPGA as the hardware platform to demonstrate our
concept. It can be integrated with any existing NAS techniques
[2], [8], [16], [17] or extended to incorporate other hardware
platforms.
A. Framework Overview
Figure 3 shows the HW/SW co-exploration framework. The
framework contains an RNN-based controller and two levels of
exploration. Unlike the controller in [2], ours has multiple
RNN cells instead of one. More specifically, each layer in
a child network has a corresponding RNN cell. During the
exploration, cells will be reorganized to support different
optimization goals.
In the first level, a fast exploration is carried out in four
steps: (1) it first predicts an architecture with probability p, (2)
then, it explores the design space to generate a pipelined FPGA
system to meet the throughput requirement, (3) according to
the pipeline structure, it then reorganizes RNN cells in the
controller, and (4) it updates the controller using reinforce-
ment learning to maximize the pipeline efficiency. This level
explores the hardware design space without training child
networks, therefore it performs efficiently.
In the second level, we train the child network obtained
from the first level on the held-out validation set. After that,
we generate a reward based on both the yielded accuracy and
pipeline efficiency, which is used to update the RNN controller.
If no child network can meet the required throughput
specification in the first level, we generate a negative reward
to update the controller. After this level, the controller will
predict a new child network from architecture search space
for the fast exploration level.
[Figure 4 diagram: after partition and assignment, the RNN cells are grouped into RNN 1, RNN 2, ..., RNN M, one per pipeline stage; cells in the same RNN share weights and states. Pipeline Stage 1: P1 = {L1}, α(P1) = f3, U1 = BLAST(P1, α, PAR), R1 = Formula-1(U1), predicting ⟨PAR1’⟩. Pipeline Stage 2: P2 = {L2, L3}, α(P2) = f1, U2 = BLAST(P2, α, PAR), R2 = Formula-1(U2), predicting ⟨PAR2’, PAR3’⟩. Pipeline Stage M: reward RM, predicting ⟨..., PARM’⟩.]
Figure 4. Fast Exploration (FE): organize RNN cells in the controller
according to the partition for pipeline stages; independently update multiple
RNNs in the controller to predict parameters of layers assigned to each
pipeline stage.
B. Fast Exploration for High Resource Utilization
In the first level, namely Fast Exploration (FE), the objec-
tive is to maximize pipeline efficiency under the throughput
specification TS. FE takes three types of inputs: (1) a set of
available FPGAs F , (2) hyperparameters of a child network
H , (3) a throughput specification TS. It will generate a new
child network, whose throughput at inference phase can meet
TS using a subset of FPGAs in F . In addition, the average
hardware utilization of FPGAs can be maximized. In FE, there
are two challenges needing to be addressed: first, how to
partition a given child network and assign each partition to
a specific FPGA (Partition and Assignment); second, how to
reorganize the RNN cells in the controller and then update
them to generate child networks with higher pipeline efficiency
(Reorganize and Update Controller).
Partition and Assignment. In the search process, a number
of candidate child networks need to go through the partition
and assignment process. Consequently, an efficient automatic
tool should be employed to avoid slowing down
the search process. In this paper, we employ the BLAST
algorithm in [20]. BLAST takes child network H , FPGAs F ,
the throughput specification TS, and the attributes of each
FPGA as inputs. It outputs a series of FPGAs, each of which
will implement one or multiple layers in a pipeline stage. The
resultant system will satisfy TS with the maximum pipeline
efficiency. As shown in Figure 4, layers in a child network
are divided into M partitions, and each partition is assigned
to one specific type of FPGA under function α.
Reorganize and Update Controller. According to the
generated pipeline structure, we then reorganize the controller
and iteratively update it to generate child networks
with higher hardware utilization. Our goal is to maximize the
average hardware utilization, which is equivalent to maximizing
the utilization of each hardware unit. However, the design space for
maximizing the average hardware utilization is exponentially
larger than that for maximizing the utilization of each hardware
unit separately. To efficiently explore the design space, we choose to
maximize the hardware utilization of each pipeline stage
independently. Therefore, we reorganize RNN cells in the
controller according to the determined pipeline structure. More
[Figure 5 diagram: all RNN cells (PAR1 to PARN) form a single RNN that shares weights and states. SE receives the child network “C”, partition “P”, and assignment “α” from FE, then: 1. train C on the held-out dataset to obtain accuracy A; 2. obtain the average utilization U using BLAST(C,P,α); 3. compute the reward Reward(A,U) based on A and U.]
Figure 5. Slow Exploration (SE): configure RNN cells in the controller to
be one RNN; generate reward based on accuracy and pipeline efficiency to
update the controller RNN.
specifically, for multiple layers in one pipeline stage, their
corresponding RNN cells will be configured to form one RNN
and their weights and states are shared (e.g., RNN 2 in Figure
4). Consequently, there will be M RNNs for M pipeline
stages. In this way, each RNN can be trained to maximize the
hardware utilization for each FPGA pipeline stage.
After we form the RNNs, we apply reinforcement learning
to update the parameters in those M RNNs, and use these
RNNs to predict the hyperparameters of child networks. In
each iteration, we will predict T child networks, which can
be viewed as a list of actions $a_{1:T}$. Correspondingly, the notation
$a^i_{1:T}$ represents the hyperparameters of the ith pipeline stage
in these child networks. For each child network predicted by
the controller, we can obtain the utilization of the ith pipeline
stage (corresponding to one FPGA) using BLAST, denoted as
Ui. Then, for RNN i, we utilize Ui to generate a reward Ri
to update its parameters θi. The reward Ri can be calculated
using the following formula.
$$R_i = \begin{cases} U_i & U_i \le 1 \\ 1 - U_i & 1 < U_i \le 2 \\ -1 & U_i > 2 \end{cases} \tag{1}$$
where Ui > 1 indicates that the required throughput cannot be
satisfied, in which case we give a negative reward. For each RNN, our
objective is to maximize the expected reward for actions from
time 1 to T , represented by $J(\theta_i) = E_{P(a^i_{1:T};\theta_i)}[R_i]$. Since the
reward is non-differentiable, we apply the policy gradient
method to update θi. Specifically, the REINFORCE
rule [32] is employed, as in [2], [8].
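
The piecewise reward of Formula (1) translates into a short function. This is a minimal sketch. The name `fe_reward` is an assumption, and the utilization passed in would come from BLAST in the actual framework.

```python
def fe_reward(u):
    """Reward R_i of Formula (1) for the utilization u of one pipeline stage."""
    if u <= 1:
        return u       # throughput met: reward grows with utilization
    if u <= 2:
        return 1 - u   # throughput missed: negative reward in [-1, 0)
    return -1          # far beyond the latency budget: reward fixed at -1


for u in (0.9, 1.5, 3.0):
    print(u, fe_reward(u))
```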
C. Slow Exploration for High Accuracy
After obtaining a child network that meets the timing
specification through the fast exploration level, we move to
the second level. In this level, we aim to update the controller
RNN to generate new child networks with higher accuracy and
pipeline efficiency. We train the child network on the held-out
validation set, and therefore the exploration speed is much
slower than that of the first level. We call it Slow Exploration
(SE).
As shown in Figure 5, SE takes the generated child network,
the partition and the assignment from FE as the inputs. The
child network is first trained to obtain accuracy A. Then, the
average pipeline efficiency U of the child network under the
partition and assignment will be calculated. Finally, we com-
pute the reward to update the controller using the following
formula.
$$\mathrm{Reward}(A, U) = \beta \times A + (1 - \beta) \times U \tag{2}$$
where β is an adjustment parameter, which reflects the bias on
test accuracy and hardware utilization. The value of β ranges
from 0 to 1. We will discuss how to scale β in Section V. After
that, we update the controller using the reward by applying the
policy gradient reinforcement learning, which is the same as
that in FE level. As shown in Figure 5, all RNN cells share
the same weights and states in this level, since we have only
one reward.
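
The slow-level reward of Formula (2) is a weighted sum of accuracy and average utilization. A minimal sketch follows, where the function name and the input values are illustrative assumptions.

```python
def se_reward(accuracy, utilization, beta=0.5):
    """Reward(A, U) = beta * A + (1 - beta) * U, with beta in [0, 1]."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * accuracy + (1.0 - beta) * utilization


# Illustrative inputs: accuracy 0.85, average utilization 0.92
print(se_reward(0.85, 0.92))            # equal weight on both terms
print(se_reward(0.85, 0.92, beta=1.0))  # accuracy only
```

Setting β closer to 1 biases the search toward accuracy, and setting it closer to 0 biases it toward hardware utilization.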
D. Interface between Fast-Slow Explorations
Before updating the RNN cells in the controller in the
fast exploration level, we take a snapshot Snap of all RNN
cells. During the fast exploration level, we obtain the hardware
design (i.e., pipeline configuration) for the input child network.
Based on the determined pipeline structure, RNN cells are
reorganized as introduced in Section III-B. The reorganized
cells are then trained to generate better child networks for
the previously obtained hardware design (i.e., pipeline
configuration). Finally, the child network with the maximum hardware
efficiency on the determined pipeline is sent to the slow
exploration level.
After entering the slow exploration level, the RNN cells
in the controller will be recovered using the previously saved
snapshot Snap. Then, SE will train the child network to obtain
the accuracy, which will be used to calculate the reward. Using
this reward, we will update the recovered RNN. Then, the
updated RNN will be used to generate new child networks
for the next iteration. In this way, the SE process will always
keep improving the RNN accuracy while the FE process will
always generate the best hardware design for each iteration.
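
The snapshot-and-restore interface between the two levels can be sketched as a control loop. Everything here is a simplified assumption: `fast_explore`, `train_child`, and `update` are hypothetical hooks standing in for FE, child-network training, and the policy-gradient update. The value -1.0 for the negative reward is also assumed, since the text does not give one.

```python
import copy


def co_explore(controller, fast_explore, train_child, update, iterations, beta=0.5):
    """Alternate fast and slow exploration, restoring the controller in between."""
    history = []
    for _ in range(iterations):
        snap = copy.deepcopy(controller)  # snapshot Snap taken before FE
        # FE may reorganize and update the controller it is given.
        child, utilization = fast_explore(controller)
        controller = snap                 # SE recovers the cells from the snapshot
        if child is None:
            reward = -1.0                 # assumed value: no child network met TS
        else:
            accuracy = train_child(child)
            reward = beta * accuracy + (1.0 - beta) * utilization  # Formula (2)
        controller = update(controller, reward)
        history.append(reward)
    return controller, history


# Toy usage: the controller is a dict and FE mutates it in place.
def toy_fe(ctrl):
    ctrl["w"] += 100  # FE-only changes, discarded when Snap is restored
    return "child", 0.9


final, rewards = co_explore({"w": 0}, toy_fe, lambda child: 0.8,
                            lambda ctrl, r: {"w": ctrl["w"] + 1}, iterations=3)
print(final, rewards)
```

In the toy run the in-place changes made during FE never reach the controller that SE updates, which mirrors the role of the snapshot.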
IV. EXPERIMENTS
Datasets: We use CIFAR-10 and ImageNet datasets to study
the efficacy of our approach and compare it with the state-of-
the-art. During the exploration of child networks, we only use
the training images in these datasets, while the test images
are used to test the accuracy of the resultant architectures. To
evaluate the accuracy in the search process, we randomly select
10% of the samples from the training set as a validation set. All
the images undergo the data preprocessing and augmentation
procedure, including whitening, upsampling, random cropping,
and random horizontal flip, which are common among the
related work.
Architecture Search Space: For CIFAR-10, we use convolutional
architectures as the backbone. For every convolutional
layer, we first determine the number of filters in [24,36,48,64], the
kernel size in [1,3,5,7], and the strides. Two sets of experiments
are carried out to determine the strides: (1) by exploring the
child networks with a fixed stride of 1; (2) by allowing the
controller to predict the strides in [1,2]. After each layer,
rectified linear units [33] and batch normalization [34] are
appended.
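
To give a sense of scale, the sketch below counts the layer-parameter combinations in the CIFAR-10 search space for a linear chain of layers. The dictionary keys and the function name are illustrative, and the count ignores any other choices the controller might make.

```python
from math import prod

# Per-layer choices listed for the CIFAR-10 search space
cifar10_choices = {
    "filters": [24, 36, 48, 64],
    "kernel_size": [1, 3, 5, 7],
    "stride": [1, 2],
}


def num_architectures(choices, num_layers, fixed_stride=False):
    """Count parameter combinations for a chain of num_layers layers."""
    per_layer = prod(
        len(values)
        for name, values in choices.items()
        if not (fixed_stride and name == "stride")  # a fixed stride adds no choice
    )
    return per_layer ** num_layers


for depth in (4, 14):
    print(depth,
          num_architectures(cifar10_choices, depth, fixed_stride=True),
          num_architectures(cifar10_choices, depth))
```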
For ImageNet, the architecture repeats mobile inverted bot-
tleneck convolution layers instead of ordinary convolutional
ones, same as that in [8]. The controller explores the archi-
tectures with various kernel sizes [3,5,7], strides [1,2] and
expansion ratios [3,6].
Hardware Design Space: The hardware design space is
composed of up to three Xilinx FPGAs (XC7Z015), each
of which contains 74K logic cells, 4.9Mb on-chip memory,
and 150 DSP Slices. One reason for our selection is that
such an FPGA provides high speed serial communication (up
to 16.8Gbps of bandwidth), so that a high speed hardware
pipeline can be formed by multiple FPGAs. In the implemen-
tation, the child network is partitioned into pipeline stages,
and each stage is mapped to one FPGA. Note that our
hardware exploration may not end up using all three FPGAs;
it is possible to use fewer for higher hardware efficiency.
In the experiments, we use pipeline efficiency as the metric
to measure the hardware efficiency. As stated in Section I,
the pipeline efficiency is one of the most important metrics,
since it is related to the hardware utilization, energy efficiency,
and timing performance. Then, the timing specifications are
set according to the desired processing speed of the data
at the inference phase, which are commonly decided by
the data collector (e.g., camera). For CIFAR-10, we set the
throughput specification to 35FPS, which can satisfy most
cameras; whereas for ImageNet, due to the more complicated
architectures and the limited resource, we set the specification
to 10FPS.
Exploration Frameworks: Our proposed framework is
denoted as “Co-Exploration” in the results. The framework
can obtain a set of superior architecture and hardware design
pairs on the Pareto frontier of accuracy and pipeline efficiency
tradeoffs. In particular, among the pairs on the frontier, we
denote the one with the maximum accuracy as “OptSW” and
the one with the maximum pipeline efficiency as “OptHW”.
For comparison, we also implement two other frameworks
based on the state-of-the-art. Because none of the existing
Hardware-Aware NAS [6]–[8] target FPGAs and they use var-
ious settings, for fair evaluation, we use the NAS discussed in
[2] as the backbone to implement a Hardware-Aware NAS for
FPGA under the same setting discussed above. This framework
is denoted as Hardware-Aware NAS in the results. In addition,
for the architectures obtained by the Hardware-Aware NAS,
we further optimize their hardware implementation by consid-
ering the hardware design space. Such a heuristic approach is
denoted as “Sequential Optimization” in the results.
Training Details: For CIFAR-10, the training settings
for both the RNN controller and the child networks are the
same as [2]. In both slow and fast explorations, the controller
RNN is trained on the calculated rewards using the Adam
optimizer [35] with a learning rate of 0.0006. Parameter β in
Formula 2 is set to 0.5 to weight test accuracy and pipeline
efficiency equally. For the child networks, we apply the
Momentum Optimizer with a learning rate of 0.1, a weight decay
of 10^-4, and a momentum of 0.9. Each child network is
trained for 50 epochs.
 
[Figure 6 plot: two panels; x-axis: number of layers (4 to 14); y-axis: percentage of valid architectures (0.0 to 1.0); curves for timing specifications of 20FPS, 35FPS, and 100FPS]
Figure 6. Percentages of valid architectures for different timing specifications:
(a) fixed stride of 1; (b) predictable strides.
Table II
CO-EXPLORATION WITH PREDICTABLE STRIDE PERFORMS BETTER THAN THAT WITH FIXED STRIDE UNDER 35FPS TIMING SPECIFICATION.

| Models | Depth | Accuracy | Pipeline Eff. |
|---|---|---|---|
| Co-Exploration fixed stride (OptSW) | 13 | 81.50% | 91.92% |
| Co-Exploration fixed stride (OptHW) | 10 | 78.57% | 98.56% |
| Co-Exploration pred. stride (OptSW) | 14 | 85.19% | 92.15% |
| Co-Exploration pred. stride (OptHW) | 6 | 80.18% | 99.69% |

For ImageNet, we build the distributed GPU training
environment on top of Uber Horovod [36]. Training settings are
similar to those for CIFAR-10, with the exceptions that we
set the initial learning rate to 0.0125 and decay it by 10× at
selected epochs, and that for the Momentum Optimizer the weight
decay is 5 × 10^-5 and the momentum is 0.9.
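The step schedule described for ImageNet can be sketched as follows. The text does not state which epochs are selected, so the `milestones` values below are hypothetical placeholders; only the initial rate of 0.0125 and the 10× decay come from the text.

```python
def step_decay_lr(epoch: int,
                  base_lr: float = 0.0125,
                  milestones=(30, 60),  # hypothetical epochs, not from the paper
                  factor: float = 0.1) -> float:
    """Learning rate at `epoch`, divided by 10 at every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * (factor ** passed)


# Learning rate before, between, and after the assumed milestones.
lr_start = step_decay_lr(0)
lr_mid = step_decay_lr(30)
lr_late = step_decay_lr(60)
```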
V. RESULTS
Impact of Timing Specifications: Figure 6 reports the
impact of timing specifications on the Co-Exploration framework.
We randomly sample 10,000 architectures with layer counts
ranging from 4 to 14, and obtain the percentage of valid
architectures that can meet the timing specification on the
CIFAR-10 dataset. Figure 6 shows that if the constraint is
tight (e.g., FPS=100), only a few architectures can satisfy
the specification, so fewer high-accuracy architectures are
available than without timing constraints. In this case, we can
scale up the parameter β in Formula 2 to pursue higher accuracy.
On the other hand, if the constraint is loose (e.g., FPS=20),
there are a large number of valid architectures. Correspondingly,
we can scale down β to find more hardware-efficient designs with
high accuracy.
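The effect of scaling β can be illustrated with a small sketch. Formula 2 is defined earlier in the paper and is not reproduced in this section; the code below assumes a plain β-weighted sum of accuracy and pipeline efficiency, which may differ from the exact formula (for example, it ignores any penalty for violating the timing specification). The function name `weighted_reward` is hypothetical. The two designs are the predictable-stride OptSW and OptHW entries of Table II.

```python
def weighted_reward(accuracy: float, pipeline_eff: float, beta: float = 0.5) -> float:
    """Assumed reward: beta weights accuracy, (1 - beta) weights pipeline efficiency."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * accuracy + (1.0 - beta) * pipeline_eff


opt_sw = (0.8519, 0.9215)  # (accuracy, pipeline efficiency) from Table II
opt_hw = (0.8018, 0.9969)

# A larger beta favors the more accurate design, a smaller beta the more efficient one.
prefers_sw_at_high_beta = weighted_reward(*opt_sw, beta=0.9) > weighted_reward(*opt_hw, beta=0.9)
prefers_hw_at_low_beta = weighted_reward(*opt_hw, beta=0.1) > weighted_reward(*opt_sw, beta=0.1)
```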
Comparison between Fixed Stride and Predictable
Stride: Table II reports the comparison between the exploration
with the fixed stride and that with the predictable stride
on CIFAR-10. In the table, column “Depth” indicates the
number of layers in the resulting architecture. As shown in
this table, for the exploration with the fixed stride, OptSW
achieves 2.93% higher accuracy but 6.64% lower pipeline
efficiency than OptHW. These figures are 5.01% and 7.54%
for the exploration with the predictable stride. In addition,
compared with the fixed stride, stride prediction helps the
controller find better results in both accuracy
and pipeline efficiency. As such, in the following experiments
Table III
COMPARISON AMONG CO-EXPLORATION, HARDWARE-AWARE NAS AND SEQUENTIAL OPTIMIZATION ON CIFAR-10 AND IMAGENET DATASETS.

| Dataset | Models | Depth | Parameters | Accuracy (Top1) | Accuracy (Top5) | Pipeline Eff. | FPS | Energy Eff. (GOPS/W) |
|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | Hardware-Aware NAS | 13 | 0.53M | 84.53% | - | 73.27% | 16.2 | 0.84 |
| CIFAR-10 | Sequential Optimization | 13 | 0.53M | 84.53% | - | 92.20% | 29.7 | 1.36 |
| CIFAR-10 | Co-Exploration (OptHW) | 10 | 0.48M | 78.57% | - | 98.56% | 35.5 | 2.55 |
| CIFAR-10 | Co-Exploration (OptSW) | 14 | 0.36M | 85.19% | - | 92.15% | 35.5 | 1.91 |
| ImageNet | Hardware-Aware NAS | 15 | 0.44M | 68.40% | 89.84% | 81.07% | 6.8 | 0.34 |
| ImageNet | Sequential Optimization | 15 | 0.44M | 68.40% | 89.84% | 86.75% | 10.4 | 0.46 |
| ImageNet | Co-Exploration (OptHW) | 17 | 0.54M | 68.00% | 89.60% | 96.15% | 12.1 | 1.01 |
| ImageNet | Co-Exploration (OptSW) | 15 | 0.48M | 70.24% | 90.53% | 93.89% | 10.5 | 0.74 |

[Figure 7 plot: two panels, (a) and (b); x-axis: accuracy (SW); y-axis: pipeline efficiency (HW); legend: inferior designs, Pareto frontier (Co-Exploration), Pareto frontier (Hardware-Aware); marked points: OptHW, OptSW]
Figure 7. Pareto frontiers between accuracy and pipeline efficiency for
Hardware-Aware NAS and Co-Exploration, both of which are designed under
the timing specification of 35FPS: (a) designs with 2 FPGAs; (b) designs with
3 FPGAs.
we will use predictable stride as the default setting for Co-
Exploration.
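The tradeoff figures quoted above follow directly from Table II. A short check, using only the table values (the dictionary layout and the helper name `tradeoff` are illustrative choices):

```python
# (accuracy %, pipeline efficiency %) taken from Table II.
table_ii = {
    "fixed": {"OptSW": (81.50, 91.92), "OptHW": (78.57, 98.56)},
    "predictable": {"OptSW": (85.19, 92.15), "OptHW": (80.18, 99.69)},
}


def tradeoff(stride: str):
    """Return (accuracy gain of OptSW, pipeline-efficiency loss of OptSW) versus OptHW."""
    sw_acc, sw_eff = table_ii[stride]["OptSW"]
    hw_acc, hw_eff = table_ii[stride]["OptHW"]
    return round(sw_acc - hw_acc, 2), round(hw_eff - sw_eff, 2)


fixed_gap = tradeoff("fixed")
predictable_gap = tradeoff("predictable")
```

Both predictable-stride designs also exceed their fixed-stride counterparts in accuracy and in pipeline efficiency, which is the basis for making predictable stride the default.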
Impact of Different Exploration Frameworks on Pareto
Frontier: Figure 7 reports the design space exploration assum-
ing the hardware design space contains up to (a) two FPGAs or
(b) three FPGAs. The x-axis and y-axis represent the accuracy
and pipeline efficiency, respectively. For clear demonstration,
we only include the architectures whose pipeline efficiency is
no less than 85% for two FPGAs in Figure 7(a) and no less
than 75% for three FPGAs in Figure 7(b). In these figures,
the circled design points correspond to those in Table II.
The red lines represent the Pareto frontiers explored by Co-
Exploration. The green lines, on the other hand, represent the
frontier obtained by Hardware-Aware NAS (by examining the
top architectures identified). These figures show that
by exploring the hardware design space, our Co-Exploration can
significantly push forward the Pareto frontiers in the accuracy
and efficiency tradeoffs. It effectively identifies better designs
that are not available through the architecture search space
alone, i.e., those between the two frontiers.
Comparing the two exploration results in Figure 7(a) and
(b), we can also see that the solution with the highest pipeline
efficiency is located in Figure 7(a), while the one with the
highest accuracy is located in Figure 7(b). In general, the
average accuracy on three FPGAs is higher than that on two
FPGAs, yet the pipeline efficiency is lower. This is because
more FPGAs can accommodate deeper architectures with more
layers for higher accuracy. On the other hand,
Table IV
CO-EXPLORATION USES FAR FEWER GPU HOURS THAN HARDWARE-AWARE NAS, BENEFITING FROM THE EARLY-STAGE PRUNING.

| Dataset | Approach | Arch for Training | GPU Hours | Impr. |
|---|---|---|---|---|
| CIFAR-10 | Hardware-Aware NAS | 108,000 | 16,586 | 1 |
| CIFAR-10 | Co-Exploration | 308 | 102+1.9=103.9 | 159× |
| ImageNet | Hardware-Aware NAS | 7,263 | 36,315 | 1 |
| ImageNet | Co-Exploration | 53 | 256+1.8=266.8 | 136× |

more layers will easily result in unbalanced pipeline stages,
which in turn reduces the pipeline efficiency.
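Why unbalanced stages lower pipeline efficiency can be seen in a minimal sketch. It assumes a utilization-style definition (average stage busy time divided by the slowest stage's time), which is an illustrative assumption and not necessarily the exact metric defined earlier in the paper. The stage latencies are hypothetical.

```python
from typing import Sequence


def pipeline_efficiency(stage_latencies: Sequence[float]) -> float:
    """Assumed metric: mean stage latency divided by the bottleneck stage latency."""
    if not stage_latencies or min(stage_latencies) <= 0:
        raise ValueError("need at least one stage with positive latency")
    bottleneck = max(stage_latencies)  # the slowest stage sets the pipeline rate
    return sum(stage_latencies) / (len(stage_latencies) * bottleneck)


# Hypothetical per-stage latencies in arbitrary time units.
balanced = pipeline_efficiency([10.0, 10.0, 10.0])
unbalanced = pipeline_efficiency([10.0, 10.0, 10.0, 4.0])  # an extra, lightly loaded stage
```

Under this assumption, adding a short fourth stage leaves throughput unchanged but leaves that stage idle most of the time, so the efficiency drops from 1.0 to 0.85.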
Comparison between Co-Exploration and Existing
Frameworks: Table III reports the comparison results on accu-
racy, pipeline efficiency, throughput and energy efficiency on
CIFAR-10 and ImageNet. All the architectures identified have
fewer than 1M parameters mainly due to the hardware capacity.
This inevitably leads to accuracy loss; however, as we can see,
the architecture explored by OptSW can still achieve 85.19%
test accuracy on CIFAR-10, and 70.24% top-1 accuracy on
ImageNet. These results demonstrate the effectiveness of the
Co-Exploration approach in resource-limited scenarios. In
addition, OptSW outperforms Hardware-Aware NAS by achieving
54.37% and 35.24% higher throughput, and 56.02% and
54.05% higher energy efficiency on CIFAR-10 and ImageNet,
respectively. These percentages are normalized by the OptSW
values in Table III, e.g., (35.5 − 16.2)/35.5 for CIFAR-10
throughput. Compared with Sequential Optimization, OptSW
achieves 16.34% and 28.79% improvements on CIFAR-10
in throughput and energy efficiency, respectively; on
ImageNet, it slightly improves throughput (10.5 vs. 10.4 FPS)
and achieves a 37.84% improvement in energy efficiency.
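The improvement percentages can be reproduced from Table III. The sketch below shows that the reported figures match a normalization by the OptSW value; for comparison, it also computes the gain relative to the baseline, which is larger. The helper names are illustrative.

```python
def gain_vs_new(new: float, baseline: float) -> float:
    """Percentage gain normalized by the new (OptSW) value."""
    return (new - baseline) / new * 100.0


def gain_vs_baseline(new: float, baseline: float) -> float:
    """Percentage gain normalized by the baseline value."""
    return (new - baseline) / baseline * 100.0


# OptSW versus Hardware-Aware NAS, values from Table III.
fps_cifar = round(gain_vs_new(35.5, 16.2), 2)
fps_imagenet = round(gain_vs_new(10.5, 6.8), 2)
energy_cifar = round(gain_vs_new(1.91, 0.84), 2)
energy_imagenet = round(gain_vs_new(0.74, 0.34), 2)

# Same CIFAR-10 throughput comparison, normalized by the baseline instead.
fps_cifar_vs_baseline = round(gain_vs_baseline(35.5, 16.2), 2)
```

With baseline normalization, the CIFAR-10 throughput gain would read as roughly 119% rather than 54.37%, so the convention matters when comparing against other work.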
Finally, Table IV reports the comparison results on nor-
malized search time between the Hardware-Aware NAS and
the Co-Exploration. Results in this table show that the Co-
Exploration can significantly accelerate the search process,
achieving 159× and 136× fewer GPU hours on CIFAR-10
and ImageNet, respectively. The speedup is achieved from
the efficient early-stage pruning in the fast exploration level.
First, many inferior architectures can be pruned, such that the
number of architectures needing to be trained is significantly
reduced, as shown in column “Arch for Training”. Second, the
exploration of the hardware design space is efficient,
occupying less than 2% of the GPU hours in the whole search
process (1.9 GPU hours for CIFAR-10 and 1.8 GPU hours for
ImageNet).
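The search-time figures can be checked against the GPU-hour totals reported in Table IV. The sketch uses the totals as printed (103.9 and 266.8 GPU hours for Co-Exploration); the dictionary keys are illustrative names.

```python
# GPU hours from Table IV; "hw_exploration" is the hardware design space share.
gpu_hours = {
    "CIFAR-10": {"hardware_aware": 16586.0, "co_exploration": 103.9, "hw_exploration": 1.9},
    "ImageNet": {"hardware_aware": 36315.0, "co_exploration": 266.8, "hw_exploration": 1.8},
}

speedup = {
    name: row["hardware_aware"] / row["co_exploration"]
    for name, row in gpu_hours.items()
}
hw_share_percent = {
    name: row["hw_exploration"] / row["co_exploration"] * 100.0
    for name, row in gpu_hours.items()
}
```

The ratios come out at about 159.6 and 136.1, consistent with the reported 159× and 136×, and the hardware exploration accounts for roughly 1.8% (CIFAR-10) and 0.7% (ImageNet) of the Co-Exploration GPU hours.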
VI. CONCLUSION
We proposed the co-exploration framework to open up the
hardware design freedom in neural architecture search. This
is driven by the trend that the hardware platform can be
programmed or even fully customized for the best performance
in cloud and edge computing applications. This paper took
the FPGA as a vehicle to show that through jointly exploring
architecture search space and hardware design space, the
design Pareto frontier on accuracy and hardware efficiency
tradeoffs can be significantly pushed forward. For future work,
we would like to conduct experiments on other tasks, such as
object detection and tracking, and speech recognition. In
addition, we will identify and study the challenges in integrating
ASIC designs into the proposed co-exploration framework for
NAS.
REFERENCES
[1] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang, “Efficient architecture
search by network transformation,” in AAAI, 2018.
[2] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement
learning,” in International Conference on Learning Representations
(ICLR), 2017.
[3] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable
architectures for scalable image recognition,” in IEEE conference on
Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8697–8710.
[4] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. Le, and
A. Kurakin, “Large-scale evolution of image classifiers,” arXiv preprint
arXiv:1703.01041, 2017.
[5] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu,
“Hierarchical representations for efficient architecture search,” arXiv
preprint arXiv:1711.00436, 2017.
[6] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda,
Y. Jia, and K. Keutzer, “Fbnet: Hardware-aware efficient convnet
design via differentiable neural architecture search,” arXiv preprint
arXiv:1812.03443, 2018.
[7] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, “Mnasnet:
Platform-aware neural architecture search for mobile,” arXiv preprint
arXiv:1807.11626, 2018.
[8] H. Cai, L. Zhu, and S. Han, “Proxylessnas: Direct neural architecture
search on target task and hardware,” arXiv preprint arXiv:1812.00332,
2018.
[9] Amazon, “Ec2 f1 instances,” https://aws.amazon.com/ec2/instance-types/f1,
2017, accessed: 2019-01-20.
[10] Microsoft, “Real-time ai: Microsoft announces preview of project
brainwave,” https://blogs.microsoft.com/ai/build-2018-project-brainwave/,
2018, accessed: 2019-01-20.
[11] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, and D. Chen, “Design flow of
accelerating hybrid extremely low bit-width neural network in embedded
fpga,” in 2018 28th International Conference on Field Programmable
Logic and Applications (FPL). IEEE, 2018, pp. 163–1636.
[12] F. Shafiq, T. Yamada, A. T. Vilchez, and S. Dasgupta, “Automated
flow for compressing convolution neural networks for efficient edge-
computation with fpga,” arXiv preprint arXiv:1712.06272, 2017.
[13] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha,
A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey et al., “Scaledeep:
A scalable compute architecture for learning and evaluating deep
networks,” in ACM SIGARCH Computer Architecture News, vol. 45, no. 2.
ACM, 2017, pp. 13–26.
[14] P. Whatmough, S. Lee, N. Mulholland, P. Hansen, S. Kodali, D. Brooks,
and G. Wei, “Dnn engine: A 16nm sub-uj deep neural network inference
accelerator for the embedded masses,” in 2017 IEEE Hot Chips 29
Symposium, 2017.
[15] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong, “Energy-efficient
cnn implementation on a deeply pipelined fpga cluster,” in International
Symposium on Low Power Electronics and Design (ISLPED). ACM,
2016, pp. 326–331.
[16] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture
search,” arXiv preprint arXiv:1806.09055, 2018.
[17] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le, “Under-
standing and simplifying one-shot architecture search,” in International
Conference on Machine Learning, 2018, pp. 549–558.
[18] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield,
T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman et al., “Serving
dnns in real time at datacenter scale with project brainwave,” IEEE
Micro, vol. 38, no. 2, pp. 8–20, 2018.
[19] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo,
S. Alkalay, M. Haselman, L. Adams, M. Ghandi et al., “A configurable
cloud-scale dnn processor for real-time ai,” in International Symposium
on Computer Architecture (ISCA). IEEE, 2018, pp. 1–14.
[20] W. Jiang, E. H.-M. Sha, Q. Zhuge, L. Yang, X. Chen, and J. Hu,
“Heterogeneous fpga-based cost-optimal design for timing-constrained
cnns,” IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, vol. 37, no. 11, pp. 2542–2554, 2018.
[21] W. Zhang, J. Zhang, M. Shen, G. Luo, and N. Xiao, “An efficient
mapping approach to large-scale dnns on multi-fpga architectures,” in
Design, Automation & Test in Europe Conference & Exhibition (DATE),
2019. IEEE, 2019, pp. 1–4.
[22] T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Patel, and M. Herbordt,
“A framework for acceleration of cnn training on deeply-pipelined
fpga clusters with work and weight load balancing,” in International
Conference on Field Programmable Logic and Applications (FPL).
IEEE, 2018, pp. 394–3944.
[23] T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Xu, R. Patel, and
M. Herbordt, “Fpdeep: Acceleration and load balancing of cnn training
on fpga clusters,” in International Symposium on Field-Programmable
Custom Computing Machines (FCCM). IEEE, 2018, pp. 81–84.
[24] J. D. Schaffer, D. Whitley, and L. J. Eshelman, “Combinations of genetic
algorithms and neural networks: A survey of the state of the art,” in
International Workshop on Combinations of Genetic Algorithms and
Neural Networks (COGANN). IEEE, 1992, pp. 1–37.
[25] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural
network architectures using reinforcement learning,” arXiv preprint
arXiv:1611.02167, 2016.
[26] L. Xie and A. Yuille, “Genetic cnn,” in International Conference on
Computer Vision (ICCV). IEEE, 2017, pp. 1388–1397.
[27] Y.-H. Kim, B. Reddy, S. Yun, and C. Seo, “Nemo: Neuro-evolution
with multiobjective optimization of deep neural network for speed and
accuracy,” in ICML 2017 AutoML Workshop, 2017.
[28] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Opti-
mizing fpga-based accelerator design for deep convolutional neural
networks,” in International Symposium on Field-Programmable Gate
Arrays (FPGA). ACM, 2015, pp. 161–170.
[29] Y. Shen, M. Ferdman, and P. Milder, “Maximizing cnn accelerator
efficiency through resource partitioning,” in International Symposium
on Computer Architecture (ISCA). IEEE, 2017, pp. 535–547.
[30] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and
D. Chen, “Dnnbuilder: An automated tool for building high-performance
dnn hardware accelerators for fpgas,” in International Conference on
Computer-Aided Design (ICCAD). ACM, 2018, p. 56.
[31] X. Wei, Y. Liang, X. Li, C. H. Yu, P. Zhang, and J. Cong, “Tgpa:
tile-grained pipeline architecture for low latency cnn inference,” in
International Conference on Computer-Aided Design (ICCAD). IEEE,
2018, pp. 1–8.
[32] R. J. Williams, “Simple statistical gradient-following algorithms for
connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4,
pp. 229–256, 1992.
[33] V. Nair and G. E. Hinton, “Rectified linear units improve restricted
boltzmann machines,” in International Conference on Machine Learning
(ICML), 2010, pp. 807–814.
[34] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” arXiv preprint
arXiv:1502.03167, 2015.
[35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.
[36] A. Sergeev and M. Del Balso, “Horovod: fast and easy distributed deep
learning in tensorflow,” arXiv preprint arXiv:1802.05799, 2018.
