EDD: Efficient Differentiable DNN Architecture and Implementation
  Co-search for Embedded AI Solutions by Li, Yuhong et al.
Yuhong Li1∗, Cong Hao1∗, Xiaofan Zhang1, Xinheng Liu1, Yao Chen2, Jinjun Xiong3, Wen-mei Hwu1, Deming Chen1
1Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
2Advanced Digital Sciences Center, Singapore, 3IBM T. J. Watson Research Center
{leeyh, congh, xiaofan3, xliu79, w-hwu, dchen}@illinois.edu, yao.chen@adsc-create.edu.sg, jinjun@us.ibm.com
ABSTRACT
High quality AI solutions require joint optimization of AI algo-
rithms and their hardware implementations. In this work, we are
the first to propose a fully simultaneous, Efficient Differentiable
DNN (deep neural network) architecture and implementation co-
search (EDD) methodology. We formulate the co-search problem
by fusing DNN search variables and hardware implementation
variables into one solution space, and maximize both algorithm
accuracy and hardware implementation quality. The formulation is
differentiable with respect to the fused variables, so that gradient
descent algorithm can be applied to greatly reduce the search time.
The formulation is also applicable for various devices with different
objectives. In the experiments, we demonstrate the effectiveness of
our EDD methodology by searching for three representative DNNs,
targeting low-latency GPU implementation and FPGA implementa-
tions with both recursive and pipelined architectures. Each model
produced by EDD achieves similar accuracy as the best existing
DNN models searched by neural architecture search (NAS) meth-
ods on ImageNet, but with superior performance obtained within
12 GPU-hour searches. Our DNN targeting GPU is 1.40× faster
than the state-of-the-art solution reported in Proxyless [1], and our
DNN targeting FPGA delivers 1.45× higher throughput than the
state-of-the-art solution reported in DNNBuilder [2].
1 INTRODUCTION
AI algorithms have gained ever-increasing research interests, and
remarkable achievements have been demonstrated especially for
deep neural networks (DNNs). To improve algorithm quality, usu-
ally accuracy, a recent approach called neural architecture search
(NAS) [3–5] has shown great success in automatically developing
DNNs that outperform human-crafted designs. Meanwhile, the op-
timization techniques of implementing AI algorithms on hardware
are also being intensively studied. The goal of implementation is
to improve hardware performance, such as latency, throughput
and energy efficiency, etc. Typical implementation techniques in-
clude kernel and DNN optimizations on GPUs and customized
accelerator designs on FPGAs and ASICs [2, 6–10]. On top of these
achievements, to further improve AI solution quality, AI algorithm
designers and hardware developers begin to explore joint optimiza-
tion opportunities. For example, hardware-aware NAS has been
drawing a lot of attention by considering hardware features for
DNN designs [1, 11–15]. Meanwhile, hardware/software co-design
approaches [9, 16] focus on FPGA implementation characteristics
and study their influences on DNN software design.
∗These authors made equal contributions.
Despite these achievements of hardware-aware NAS and hardware/software
co-design, there is still a large optimization opportunity missing: the
hardware implementation should be simultaneously searched during NAS.
Here, for general-purpose computing devices, such as CPUs and GPUs,
implementation search means optimizing DNN implementations through
techniques such as kernel fusion and memory access
optimization. For reconfigurable devices, such as FPGAs, imple-
mentation search means optimizing a customized DNN accelerator
through techniques such as quantization, loop tiling and paralleliza-
tion. Hardware implementation search not only provides more
accurate performance evaluations on latency and throughput, but
more importantly, provides instant guidance to hardware-aware
DNN design during NAS. Currently, all existing works are missing
the large design space of implementation search in their NAS flows,
using estimated hardware performance from a fixed implementa-
tion [1, 11–14]. We provided an initial discussion of the potential
of simultaneous neural architecture and implementation co-search
in [17], called NAIS. Inspired by [17], in this work, we propose
a simultaneous and efficient DNN and implementation co-search
methodology. We summarize our contributions as follows:
• This is the first work that proposes a mathematical formulation
to solve the simultaneous DNN architecture and hardware imple-
mentation co-search problem. The formulation fuses the search
variables for DNN architecture and its hardware implementa-
tion into one search space, to simultaneously maximize the DNN
accuracy and implementation performance.
• The proposed formulation is differentiable with respect to the
fused search space, which allows a gradient descent algorithm
to be applied and enables an efficient differentiable DNN and
implementation co-search methodology (EDD).
• The formulation is unified and comprehensive. It can be applied
on various hardware platforms such as GPUs, FPGAs and dedi-
cated accelerators; it can target various performance objectives
such as latency, throughput or energy; it also formulates resource
usage considering resource sharing.
• We demonstrate our EDD methodology targeting three different
architectures: GPU, recursive FPGA accelerator, and pipelined
FPGA accelerator. Each model produced by EDD achieves similar
accuracy as the best existing DNNs on ImageNet but with supe-
rior performance: our GPU-targeted DNN is 1.40× faster than the
state-of-the-art Proxyless solution [1], and our FPGA-targeted
DNN delivers 1.45× higher throughput than the state-of-the-art
DNNBuilder solution [2].
2 RELATED WORKS
Neural architecture search (NAS) is a technique for automating the
design of DNNs which can outperform hand-crafted ones [3–5].
Among all the NAS approaches, differentiable NAS is becoming pre-
vailing because of its high search efficiency in terms of GPU hours
[18]. One typical way is to construct a DNN supernet composed of
multiple branch candidates associated with architecture weights.
The architecture weights are differentiable with respect to the loss
function of the supernet, and are updated through stochastic gradi-
ent descent. The final searched DNNs will keep the branches with
larger architecture weights and eliminate others. For example, [1]
uses binarized parameters, and [13] uses Gumbel-Softmax to choose
between different DNN branches. Previously published literature
also introduces hardware-aware NAS [1, 11–14]. Some works incorporate
hardware latency into the objective of NAS [1, 12, 13], while
some treat latency as a hard constraint [11]. In [15], a bottom-up
DNN design approach is proposed for hardware-efficient models.
arXiv:2005.02563v1 [cs.LG] 6 May 2020
Meanwhile, various embedded FPGA-based accelerators have
been studied to support more efficient DNN inference [2, 6]. Other
approaches proposed in [9, 16] involve hardware/software co-design.
Specifically, researchers in [16] proposed a reinforcement learning
based architecture search with FPGA implementation performance
integrated into the reward function. The work in [9] proposed a
bundle-based co-design methodology, where a bundle is the basic
building block for both FPGA accelerator and DNN model. However,
none of the previous works is able to explore the DNN architecture
and implementation co-search space simultaneously and compre-
hensively.
3 EDD PROBLEM FORMULATION
Our simultaneous DNN and implementation co-search method
fuses the design space of DNN architecture search and hardware
implementation search, as shown in Fig. 1. We collectively denote
the variables used in DNN search and implementation search as $A$
and $I$, respectively, and the fused space of co-search is $\{A, I\}$. The
objective of DNN search is to quickly find a DNN architecture while
minimizing accuracy loss, denoted as $Acc_{loss}$. For implementation
search, we define a performance loss, denoted as $Perf_{loss}$, which
can be specified by users, such as end-to-end inference latency,
throughput, energy, or DNN model complexity. We denote the
resource utilization as $RES$, and the resource upper bound of the
target hardware as $RES_{ub}$. The DNN and implementation co-search
problem is to minimize accuracy loss $Acc_{loss}$ and performance loss
$Perf_{loss}$ simultaneously by effectively searching $\{A, I\}$:
$$\min: L = Acc_{loss}(A, I) \cdot Perf_{loss}(I) + \beta \cdot C^{RES(I) - RES_{ub}} \tag{1}$$
In Eq. 1, $Acc_{loss}$ is a function of $A$ and $I$; $Perf_{loss}$ and $RES$ are
functions of $I$. The resource upper bound $RES_{ub}$ appears in an
exponent term so that violating it incurs a large penalty.
It is worth noting that existing hardware-aware NAS approaches
search only $A$ while $I$ is fixed during NAS, whereas in our co-search
formulation, $I$ is also variable.
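To make the shape of Eq. 1 concrete, the following is a minimal sketch of the objective as a plain function. The values chosen for `beta` and the penalty base `base` (the constant $C$) are illustrative assumptions; the text does not state them.

```python
def edd_objective(acc_loss, perf_loss, res, res_ub, beta=1.0, base=2.0):
    """Eq. 1: L = Acc_loss * Perf_loss + beta * C^(RES - RES_ub).

    beta and base (the constant C) are hypothetical values for illustration.
    """
    penalty = beta * base ** (res - res_ub)
    return acc_loss * perf_loss + penalty


# Below the resource bound the penalty is negligible;
# above it the exponential term dominates the objective.
within = edd_objective(acc_loss=0.8, perf_loss=1.5, res=90.0, res_ub=100.0)
over = edd_objective(acc_loss=0.8, perf_loss=1.5, res=110.0, res_ub=100.0)
print(within, over)
```

With these assumed numbers the penalty is $2^{-10}$ when usage is 10 units under the bound and $2^{10}$ when it is 10 units over, which is the soft-constraint behavior the exponent term is meant to provide.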
As introduced in Sec. 2, motivated by the high search efficiency
and appealing model accuracy of differentiable NAS, in this work,
we propose a differentiable formulation for both $A$ and $I$: in Eq. 1,
$Acc_{loss}$ is differentiable with respect to $A$ and $I$, and $Perf_{loss}$ and
$RES(I)$ are differentiable with respect to $I$. By descending $L$ with
respect to the variables in $\{A, I\}$ on the validation set, i.e., along
$\nabla_{\{A, I\}} L_{val}$, $\{A, I\}$ will be searched simultaneously.
Fig. 1 shows our proposed overall differentiable design space.
The blue blocks represent the DNN search space, while the red
blocks represent the hardware implementation search space. We
first introduce DNN search space in Sec. 3.1, and then introduce
the merged design space with implementation in Sec. 3.2.
3.1 NAS Design Space
The differentiable NAS space is shown as the blue blocks in Fig. 1.
First, the DNN is composed of $N$ basic building blocks, $block_i$, where
$1 \le i \le N$. In this work, in order to design hardware-friendly DNNs
and to reduce search time, we adopt the single-path DNN structure
without branches [12]. Inside the $i$-th block, there are $M$ candidate
operations, denoted as $op_i^m$ ($1 \le m \le M$). We let the operations
be the most commonly used DNN blocks in NAS approaches,
called MBConv [11]. It is composed of sequential layers of conv-1×1,
dwconv-k×k (depth-wise convolution with kernel size $k$) and
conv-1×1. Between conv-1×1 and dwconv-k×k, the number of
channels expands/shrinks by a ratio of $ch_i^m$. The output of a block
is calculated based on the outputs of its $M$ candidate operations. For
example, in [18], the output is the weighted sum of the $M$ operations,
where the weights are determined by a Softmax function. Instead
of Softmax, in this work, we use the Gumbel-Softmax function in
[13] in order to sample only one operation out of $M$ during feedforward
propagation, since the Gumbel-Softmax function can convert the
discrete non-differentiable sampling to continuous differentiable
sampling. This greatly reduces the memory requirement and speeds
up the feedforward propagation. The sampling parameters $\theta_{i,m}$
form a two-dimensional $N \times M$ array, denoted as $\Theta \in A$, which
is the primary DNN search variable.
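As an illustration of the sampling step described above, here is a small sketch of Gumbel-Softmax written with the Python standard library only. It treats the sampling parameters of one block as unnormalized logits and uses a temperature `tau`; both choices, and the example values, are assumptions made for this sketch rather than details given in the text.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for x in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((x + (-math.log(-math.log(u)))) / tau)
    top = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
theta_i = [0.2, 1.5, -0.3]  # hypothetical parameters for M = 3 candidates
probs = gumbel_softmax(theta_i, tau=0.5, rng=rng)
chosen = max(range(len(probs)), key=probs.__getitem__)  # operation to execute
print(probs, chosen)
```

In a real search the relaxed sample would be produced by an automatic-differentiation framework so that gradients reach the sampling parameters; this sketch only shows the forward computation.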
3.2 Implementation Formulation
As shown in the red block in Fig. 1, each candidate operation $op_i^m$
has its own implementation variables, forming an implementation
search space $I_i^m$. The primary implementation variable is quantization
$q$, i.e., data precision, since it has a large impact on DNN
accuracy, implementation performance and hardware resource.
Rather than a train-and-quantize manner, the quantization shall be
searched together with DNN structure to provide implementation
performance feedback. Besides quantization, other implementation
variables may be device oriented. For example, FPGA implementation
design space includes parallelism, loop tiling factors, etc.
To formulate the final $Perf_{loss}$ and $RES$ of a DNN in Eq. 1, we
need to capture the intermediate performance and resource of each
operation and DNN block. As shown in the bottom four blocks in
Fig. 1, there are four stages to derive Eq. 1:
• Stage-1: we first formulate the quantization in a differentiable
way; then, for each operation candidate $op_i^m$ under $q$-bit, we formulate
the performance as $Perf^q(op_i^m)$ and resource as $Res^q(op_i^m)$.
• Stage-2: the performance and resource of $op_i^m$ regardless of quantization,
$Perf(op_i^m)$ and $Res(op_i^m)$, can be derived from $Perf^q(op_i^m)$
and $Res^q(op_i^m)$ in Stage-1.
• Stage-3: the performance and resource of the $i$-th DNN block, $Perf_i$
and $Res_i$, are derived from $Perf(op_i^m)$ and $Res(op_i^m)$ in Stage-2.
• Stage-4: the overall DNN performance loss and resource usage,
$Perf_{loss}$ and $RES$, are derived from $Perf_i$ and $Res_i$ in Stage-3.
• Finally, $Perf_{loss}$ and $RES$ will be plugged into Eq. 1 as the objective
function during our EDD co-search.
In the following, we introduce the differentiable quantization
and performance and resource formulations stage by stage.
3.2.1 Stage-1: Differentiable Quantization
As shown in the red block in Fig. 1, to enable differentiable quantization
formulation, we create $Q$ quantization paths for each operation
$op_i^m$, indicating each operation has $Q$ quantization choices,
from $q_1$-bit to $q_Q$-bit. Similar to differentiable NAS formulation,
each quantization scheme is also sampled by the Gumbel-Softmax
function with a sampling parameter $\phi_{i,m,q}$, generating a probability
for $op_i^m$ to be quantized to $q$-bit. The $\phi_{i,m,q}$ form a three-dimensional
array of size $N \times M \times Q$, denoted as $\Phi$. In this formulation,
we have the flexibility to choose different quantizations for different
layers of a DNN; such a mixed precision computation can be well
supported by reconfigurable hardware and dedicated accelerators.
Figure 1: Our proposed differentiable DNN and implementation co-design (EDD) space.
(Panels: Neural Architecture Search (NAS), with the candidate operations $op_i^m$ of $block_i$ and their sampling parameters $\theta_{i,m} \in \Theta$; Implementation Search, with quantization paths $q_1$ to $q_Q$, their sampling parameters $\phi_{i,m,q} \in \Phi$, and other implementation variables in $I_i^m$ such as FPGA parallel factors and loop tiling factors or GPU batch size; and the Stage-1 to Stage-4 performance and resource formulations.)
Under $q$-bit quantization, the performance and resource of operation
$op_i^m$, $Perf^q(op_i^m)$ and $Res^q(op_i^m)$, should be functions of
implementation variables in $I_i^m$ (including quantization $q$), expressed
as $Perf^q(op_i^m) = f(I_i^m)$ and $Res^q(op_i^m) = g(I_i^m)$. Since one operation
contains multiple layers as shown in Fig. 1, for simplicity, we
treat them as a whole: the $q$-bit quantization applies to all layers
within $op_i^m$, and the latency and resource are the summation of all
layers. $Perf^q(op_i^m)$ and $Res^q(op_i^m)$ largely vary with devices and
will be further discussed in Section 4.
Given differentiable DNN search variables $\Theta$, differentiable quantization
variables $\Phi$ and formulations under each quantization
scheme, the DNN and implementation search spaces are fused as
shown in Fig. 2.
3.2.2 From Stage-1 to Stage-2
Given array $\Phi$, following the Gumbel-Softmax sampling rule, denoted
as $GS(\cdot)$, the performance $Perf(op_i^m)$ and resource $Res(op_i^m)$
can be computed as follows:
$$Perf(op_i^m) = \sum_{1 \le q \le Q} GS(\phi_{i,m,q} \mid \phi_{i,m}) \cdot Perf^q(op_i^m) \tag{2}$$
$$Res(op_i^m) = \sum_{1 \le q \le Q} GS(\phi_{i,m,q} \mid \phi_{i,m}) \cdot Res^q(op_i^m) \tag{3}$$
where both $Perf(op_i^m)$ and $Res(op_i^m)$ are differentiable with respect
to $\phi_{i,m,q}$. This computes the expected performance and resource
over the different quantizations, which follow the Gumbel-Softmax
distribution with parameter $\phi_{i,m,q}$.
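The weighted sums in Eq. 2 and Eq. 3 can be sketched as below. As a simplification, a plain softmax over the quantization parameters stands in for the Gumbel-Softmax weights (that is, the noise term is dropped), and the per-precision latency and resource values are made-up numbers for illustration.

```python
import math


def softmax(logits):
    """Noise-free stand-in for the Gumbel-Softmax weights GS(.)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_value(weights, values):
    """Weighted sum over the Q quantization paths (Eq. 2 / Eq. 3)."""
    return sum(w * v for w, v in zip(weights, values))


phi_im = [0.0, 0.0, 0.0]   # hypothetical parameters for 4-, 8-, 16-bit paths
perf_q = [4.0, 8.0, 16.0]  # hypothetical latency of the operation per precision
res_q = [0.0, 0.5, 1.0]    # hypothetical resource of the operation per precision

weights = softmax(phi_im)
perf = expected_value(weights, perf_q)
res = expected_value(weights, res_q)
print(perf, res)
```

With equal parameters every path has weight 1/3, so the results are simply the averages of the per-precision values; as one parameter grows, the expectation moves toward that precision's value.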
3.2.3 From Stage-2 to Stage-3
Similar to Eq. 2 and Eq. 3, given array $\Theta$, the performance and
resource of the $i$-th DNN block can be expressed as:
$$Perf_i = \sum_{1 \le m \le M} GS(\theta_{i,m} \mid \theta_i) \cdot Perf(op_i^m) \tag{4}$$
$$Res_i = \sum_{1 \le m \le M} GS(\theta_{i,m} \mid \theta_i) \cdot Res(op_i^m) \tag{5}$$
Figure 2: Fused design space including Θ (for DNN), Φ (for
quantization) and other implementation variables in I.
(The figure shows blocks $i$ and $j$, each with candidate operations $op^1$ to $op^M$, their sampling parameters $\theta$ and $\phi$, and implementation variables $I^1$ to $I^M$; with resource sharing, e.g. on FPGA, $I_j^1 = I_i^1$.)
3.2.4 From Stage-3 to Stage-4 — Performance
Given the performance and resource of the $i$-th DNN block, we can
compute the overall DNN performance loss and resource usage,
which need to be tailored towards specific search objectives and
different devices.
First, if the overall objective is end-to-end latency, total energy
or model size, the performance loss can be expressed using the
summation of all DNN blocks as:
$$Perf_{loss} = \alpha \cdot \sum_{1 \le i \le N} Perf_i \tag{6}$$
where $\alpha$ scales $Perf_{loss}$ to the same magnitude as $Acc_{loss}$ in Eq. 1.
If the objective is throughput, the performance loss is the maximum
latency of all blocks. Since getting the maximum value is a
non-differentiable operation, we use a smooth maximum, the Log-Sum-Exp
(LSE) function [19], for differentiable approximation as:
$$Perf_{loss} = \alpha \cdot \max\{Perf_i\} \approx \alpha \cdot \log \sum_{1 \le i \le N} e^{Perf_i} \tag{7}$$
If there are multiple objectives such as minimizing both latency
and energy, as long as the objectives are not conflicting, we can
simply let $Perf_{loss}$ be the product of the different objectives.
Figure 3: Demonstration of resource sharing.
(Each row lists the candidate operations of one DNN block; the highlighted entry has the highest chance to be selected within its block. $op_1^1$ and $op_i^1$ share the same resource $Res(op^1)$, which should be counted only once; the resource of an operation that is not selected in any block should not be counted.)
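The smooth maximum in Eq. 7 can be checked numerically with a short sketch. The block latencies below are illustrative values, and the max-subtraction trick is a standard numerical-stability detail rather than something stated in the text.

```python
import math


def log_sum_exp(values):
    """Smooth, differentiable approximation of max(values)."""
    top = max(values)  # factor out the max so exp() cannot overflow
    return top + math.log(sum(math.exp(v - top) for v in values))


block_latency = [1.0, 2.0, 10.0]  # hypothetical per-block latencies Perf_i
smooth = log_sum_exp(block_latency)
hard = max(block_latency)
print(hard, smooth)
```

The LSE value always lies between the true maximum and the maximum plus $\log N$, so the approximation is tight when one block clearly dominates and looser when several blocks have similar latency.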
3.2.5 From Stage-3 to Stage-4 — Resource
The formulation of overall resource usage has two situations:
without and with resource sharing. Without sharing, the total resource
can be computed as the summation of all blocks:
$$RES = \sum_{1 \le i \le N} Res_i \tag{8}$$
However, resource sharing is very common, especially in IP-based
FPGA or ASIC accelerators. Fig. 2 demonstrates a resource
sharing scenario. In this example, we assume the operation $op_i^1$ of
the $i$-th block and the operation $op_j^1$ of the $j$-th block share the same piece of
computing resource. For example, in FPGA or ASIC, it is a reusable
IP. To allow sharing, the quantization and other implementation
variables of $op_i^1$ and $op_j^1$ shall be the same¹: for all $i, j \in [1, N]$, we
have $I_i^m = I_j^m = I^m$.
Second, we discuss the resource estimation with resource sharing.
As shown in Fig. 3, the $i$-th row is the $i$-th DNN block; the blue entries
are the operations with the largest probability of being selected. In
this example, $op_1^1$ and $op_i^1$ are most likely to be chosen. Since they
share the same computing resource $Res(op^1)$, it shall be counted
only once. If the $m$-th operation is not selected in any of the blocks,
the resource $Res(op^m)$ shall not be counted.
To describe resource sharing, we propose the following differentiable
approximation for the resource usage of operation $op^m$,
$Res(op^m)$, which is shared across blocks:
$$Res(op^m) \leftarrow \tanh\Big(\sum_{1 \le i \le N} GS(\theta_{i,m} \mid \theta_i) \cdot 1\Big) \cdot Res(op^m) \tag{9}$$
In the above formula, $GS(\theta_{i,m} \mid \theta_i) \cdot 1$ is the unit resource
expectation of $op_i^m$ in block $i$. To avoid the operation resource being
redundantly counted across blocks, we use $\tanh(\sum)$ to cap the
expected count of the $m$-th operation at 1 before it is multiplied
by $Res(op^m)$.
Thus, the overall DNN resource is computed as:
$$RES = \sum_{1 \le m \le M} Res(op^m) \tag{10}$$
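A small numeric sketch of Eq. 9 and Eq. 10 follows. The selection probabilities are given directly as a hypothetical $N \times M$ table (hard one-hot choices here, standing in for the Gumbel-Softmax weights), and the per-operation resource numbers are invented for illustration.

```python
import math


def shared_resource(select_probs, res_per_op):
    """Eq. 9 and Eq. 10: total resource when blocks share one IP per operation.

    select_probs[i][m] is the weight with which block i selects operation m.
    """
    total = 0.0
    for m, res in enumerate(res_per_op):
        expected_uses = sum(row[m] for row in select_probs)
        # tanh saturates toward 1, so an operation used by many blocks
        # is counted at most once; an unused operation contributes 0.
        total += math.tanh(expected_uses) * res
    return total


# Three blocks, three candidate operations: blocks 0 and 1 pick op 0,
# block 2 picks op 1, and op 2 is never selected.
probs_table = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
res_per_op = [100.0, 60.0, 40.0]

total_shared = shared_resource(probs_table, res_per_op)
total_unshared = sum(  # Eq. 8 with Eq. 5: every block pays for its own copy
    sum(p * r for p, r in zip(row, res_per_op)) for row in probs_table
)
print(total_shared, total_unshared)
```

Note that $\tanh(1) \approx 0.76$, so an operation selected by exactly one block is counted at roughly three quarters of its resource; the cap is a smooth approximation rather than an exact count.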
¹ Some accelerators allow different bit-width operations to share resource. In these
cases this constraint is not needed.
4 DEVICE-SPECIFIC FORMULATION
In this section we discuss the device-specific formulations, which
describe the performance and resource of operation $op_i^m$ under
$q$-bit quantization, $Perf^q(op_i^m)$ and $Res^q(op_i^m)$.
4.1 FPGA
For FPGA implementation, we follow an IP-based accelerator architecture:
for each operation $op_i^m$, there is a customizable IP instance
to conduct its computation. Consider the following two FPGA implementation
architectures, recursive and pipelined:
• In the recursive architecture such as [8, 9], every DNN layer of
the same type shares the same IP. The implementation objective
is usually end-to-end latency. Thus, the $Perf_{loss}$ computation
follows Eq. 6, and the $RES$ computation follows Eq. 9 and Eq. 10.
• In the pipelined architecture such as [2], every DNN layer has its
own accelerator without resource sharing. The implementation
target is usually throughput. The $Perf_{loss}$ computation follows
Eq. 7, and the $RES$ computation follows Eq. 8.
Either way, the operation performance $Perf^q(op_i^m)$ will be the
operation latency. We let the operation resource $Res^q(op_i^m)$ be the
number of DSPs, which are usually the most critical resource on
FPGA. To formulate $Perf^q(op_i^m)$ and $Res^q(op_i^m)$, we introduce additional
implementation variables for FPGA, the parallel factors
of the IPs, denoted as $pf_i^m$. Parallel factors describe the parallelism,
indicating how many multiplications can be done concurrently. In
FPGA design, since the parallelism is usually a power of two
such as 64, 128, 256, etc., we use the exponential form $2^{pf}$ to
describe parallelism.
4.1.1 Latency: $Perf^q(op_i^m)$
As defined in Section 3.1, each operation $op_i^m$ is composed of a set
of sequential DNN layers such as convolution, batch normalization
and activation, and the latency and resource of $op_i^m$ should be the
summation of all layers. For an operation $op_i^m$ with parallel factor
of $pf_i^m$ in $q$-bit, its latency can be approximated as:
$$Perf^q(op_i^m) = Lat^q(op_i^m) = \sum_{l \in op} lat_l \tag{11}$$
$$lat_l = \Phi(q) \times \begin{cases} 2^{-pf_i^m} \cdot k^2 \cdot h \cdot w \cdot c_{in} \cdot c_{out} & \text{if } l \text{ is conv.} \\ 2^{-pf_i^m} \cdot k^2 \cdot h \cdot w \cdot c_{in} & \text{if } l \text{ is dw-conv.} \\ 2^{-pf_i^m} \cdot h \cdot w \cdot c_{in} & \text{otherwise} \end{cases} \tag{12}$$
In the above equation, $k$ is the convolution kernel size; $h$, $w$, $c_{in}$
and $c_{out}$ represent the data dimensions of operation $op_i^m$; $\Phi(q)$ is
the latency calibration under bit-width $q$ (a scalar function, not the
sampling-parameter array $\Phi$ of Sec. 3.2.1). Intuitively, smaller
bit-width leads to shorter latency because of less off-chip data
movement and less computation. For simplification, we let $\Phi(q) = q$
to simulate such a phenomenon.
4.1.2 Resource: $Res^q(op_i^m)$
The resource (number of DSPs) of the IP with parallel factor of
$pf_i^m$ in $q$-bit can be approximated as:
$$Res^q(op_i^m) = \Psi(q) \times 2^{pf_i^m} \tag{13}$$
where $\Psi(q)$ is the calibration for resource under bit-width of $q$. On
FPGA, the number of DSPs is non-linear in bit-width. For example,
if the data precision is 8-bit or lower, then two multiplications
can be calculated on one Xilinx DSP48, reducing the DSP usage
by half; if the data precision is 4-bit or lower, we assume that
multiplications are computed using lookup tables (LUTs). Therefore,
we use a piece-wise function to describe $\Psi(q)$: $\Psi(q) = 1$ when
$9 \le q \le 16$; $\Psi(q) = 1/2$ when $5 \le q \le 8$; $\Psi(q) = 0$ when $q \le 4$.
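The FPGA latency and resource models of Eq. 11 to Eq. 13 can be written out as a short sketch. The layer shapes, the layer-kind labels (`"conv"`, `"dwconv"`) and the MBConv example below are hypothetical; the latency is a unitless estimate, and $\Phi(q) = q$ follows the simplification stated in the text.

```python
def psi(q):
    """DSP calibration Psi(q), following the piece-wise definition in the text."""
    if q <= 4:
        return 0.0   # multiplications mapped to LUTs
    if q <= 8:
        return 0.5   # two multiplications per DSP
    if q <= 16:
        return 1.0
    raise ValueError("bit-widths above 16 are not covered by the text")


def layer_latency(kind, q, pf, k, h, w, c_in, c_out):
    """Eq. 12 with Phi(q) = q and parallelism 2**pf."""
    scale = q * 2.0 ** (-pf)
    if kind == "conv":
        return scale * k * k * h * w * c_in * c_out
    if kind == "dwconv":
        return scale * k * k * h * w * c_in
    return scale * h * w * c_in


def op_resource(q, pf):
    """Eq. 13: number of DSPs used by one IP instance."""
    return psi(q) * 2.0 ** pf


# Hypothetical MBConv: 16 -> 64 -> 64 -> 16 channels on an 8x8 feature map,
# 8-bit data, parallel factor pf = 6 (64 multiplications in parallel).
layers = [
    ("conv", 1, 16, 64),    # conv-1x1, channel expansion
    ("dwconv", 3, 64, 64),  # dwconv-3x3
    ("conv", 1, 64, 16),    # conv-1x1, channel shrink
]
latency = sum(  # Eq. 11: sum over the layers of the operation
    layer_latency(kind, q=8, pf=6, k=k, h=8, w=8, c_in=ci, c_out=co)
    for kind, k, ci, co in layers
)
dsps = op_resource(q=8, pf=6)
print(latency, dsps)
```

Raising `pf` by one halves the latency estimate and doubles the DSP estimate, which is the latency/resource trade-off the co-search explores together with the bit-width.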
4.2 GPU
On GPUs, the most widely used performance metric is latency, so
we let $Perf_{loss}$ be the latency assuming the batch size is 1. We
assume the resource is fixed given a GPU. Since GPU latency is
relatively easy to measure, we use normalized latency from directly
measured values to represent inference latency under $q$-bit data
precision. Therefore, $Perf^q(op_i^m)$ is a constant under a specific
$q$. Currently, the GPU data precision is greatly restricted by the
framework support. Since the current TensorRT only supports 8-bit
fixed and 16-/32-bit floating data, we limit data precisions to
8/16/32-bit for now, but the formulation can be easily extended to
support more bit-widths. Meanwhile, since mixed precision inference
is not yet well supported by GPU development frameworks, we
constrain the overall DNN to use the same data precision. Therefore,
$\forall i, m$, we have $\phi_{i,m,q} = \phi_q$, which simplifies Eq. 2.
4.3 Dedicated Accelerators
Besides GPU and FPGA, there are also dedicated ASIC accelerators
for efficient DNN implementation, such as Stripes [20], Loom [21]
and Bit-Fusion [22], which are DNN accelerators that support dy-
namic data precisions efficiently. As an example, in the Loom [21]
work, the computation latency and energy of convolution layers
scale inversely and almost proportionally with the precisions of
weights and activations. Our proposed method can be directly applied
to such accelerators as well, by formulating the latency and
energy of an operation $op_i^m$ proportionally to data precision. We
will leave this for future work.
5 OVERALL ALGORITHM
First of all, the DNN variables and implementation variables are initialized,
including the number of blocks $N$, the number of operation
candidates $M$, and the sampling parameters $\Theta$ and $\Phi$. For the recursive
FPGA architecture, the algorithm initializes $M$ IP instances, each
with an initial parallel factor $pf_0 = \log(RES_{ub} / M)$. For the pipelined FPGA
architecture, the algorithm initializes $M \times N$ parallel factors for
all the operation candidates, each $pf_0 = \log(RES_{ub} / (M \times N))$. For different
devices, only the initializations are different, and the remaining
co-search follows the same procedure.
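The initialization of the parallel factors can be sketched as follows. The text writes the logarithm without a base; base 2 is assumed here because parallelism is expressed as $2^{pf}$. The budget of 900 DSPs and the values $M = 9$, $N = 20$ are taken from the experiments section purely as example inputs.

```python
import math


def initial_parallel_factor(res_ub, num_instances):
    """pf_0 = log2(RES_ub / number of IP instances); base 2 is an assumption."""
    return math.log2(res_ub / num_instances)


M, N = 9, 20   # operation candidates per block, number of blocks
res_ub = 900   # example DSP budget

pf_recursive = initial_parallel_factor(res_ub, M)      # M shared IP instances
pf_pipelined = initial_parallel_factor(res_ub, M * N)  # one IP per candidate per block
print(pf_recursive, pf_pipelined)
```

Under this reading, the initial instances together use exactly the resource budget when each unit of parallelism costs one DSP: $M \cdot 2^{pf_0} = RES_{ub}$ in the recursive case and $M \cdot N \cdot 2^{pf_0} = RES_{ub}$ in the pipelined case.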
After initialization, the algorithm optimizes Eq. 1 using stochastic
gradient descent on the fused search variables {A, I }, following a
bilevel approach [18]. In each iteration, it first fixes $\Theta$, $\Phi$ and $I$, and
updates DNN weights $\omega$ by minimizing the training loss on the training
dataset. Then, it fixes the DNN weights $\omega$ and updates $\Theta$, $\Phi$ and $I$
by descending Eq. 1 on the validation set. The algorithm iterates
until DNN training converges or reaches a fixed number of epochs.
Finally, the searched DNN needs to be trained from scratch on the
target dataset, e.g. ImageNet, and the implementation variables,
such as parallel factors, also need to be re-tuned.
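The alternating update order of the bilevel procedure can be illustrated with a toy example. Everything here is a stand-in: a single scalar `w` plays the role of the DNN weights, a single scalar `arch` plays the role of the fused search variables, and the two quadratic losses are invented so that the gradients can be written by hand.

```python
def co_search(w, arch, steps=200, lr=0.1):
    """Toy alternating (first-order bilevel) optimization loop.

    Training loss (toy):        (w - arch)^2, minimized over w.
    Validation objective (toy): (w - 2)^2 + (arch - 2)^2, minimized over arch.
    """
    for _ in range(steps):
        # Step 1: search variables fixed, update the weights on the training loss.
        w -= lr * 2.0 * (w - arch)
        # Step 2: weights fixed, update the search variables on the validation objective.
        arch -= lr * 2.0 * (arch - 2.0)
    return w, arch


w_final, arch_final = co_search(w=0.0, arch=5.0)
print(w_final, arch_final)
```

In the actual method, step 1 would be a training pass over the supernet weights and step 2 a gradient step on Eq. 1 with respect to the search variables; the sketch only mirrors the order in which the two groups of variables are frozen and updated.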
6 EXPERIMENTS
We apply our EDD co-search on a subset of the ImageNet dataset randomly
sampled from 100 classes. The searched DNNs are trained
from scratch on the entire ImageNet with 1000 classes. We run the EDD
search for a fixed 50 epochs. The initial DNN has 20
MBConv blocks ($N = 20$). Each MBConv has a filter size within
{3, 5, 7} and a channel expansion ratio within {4, 5, 6}. So there are
$M = 3 \times 3 = 9$ operations within an MBConv block with different filter
sizes and numbers of channels. During the search, for GPUs,
the DNN weights are 8-/16-/32-bit and activations are 32-bit; for
FPGAs, the DNN weights are 4-/8-/16-bit and activations are 16-bit
fixed point.
Table 1: Comparisons with existing NAS solutions
                          Test Error (%)      GPU Latency    FPGA Latency
Model                     Top-1     Top-5     Titan RTX      ZCU102 [23]
Baseline Models
GoogleNet                 30.22     10.47     27.75 ms       13.25 ms
MobileNet-V2 [24]         28.1      9.7       17.87 ms       10.85 ms
ShuffleNet-V2 [25]        30.6      11.7      21.91 ms       NA
ResNet18                  30.2      10.9      9.71 ms        10.15 ms
Hardware-aware NAS Models
MNasNet-A1 [11]           24.8      7.5       17.94 ms       8.78 ms
FBNet-C [13]              24.9      7.6       22.54 ms       12.21 ms
Proxyless-cpu [1]         24.7      7.6       21.34 ms       10.81 ms
Proxyless-Mobile [1]      25.4      7.8       21.23 ms       10.78 ms
Proxyless-gpu [1]         24.9      7.5       15.72 ms       10.79 ms
EDD-Net-1                 25.3      7.7       11.17 ms       11.15 ms
EDD-Net-2                 25.4      7.9       13.00 ms       7.96 ms
Table 2: EDD-Net-1 accuracy and latency on 1080 Ti
              32-bit Floating    16-bit Floating    8-bit Integer
Test Error    25.5%              25.3%              26.4%
Latency       2.83 ms            2.29 ms            1.74 ms
Table 3: Comparison of EDD-Net-3 with DNNBuilder [2]
              Top-1 Error (%)    Top-5 Error (%)    Throughput (ZC706)
VGG16         29.5               10.0               27.7 fps
EDD-Net-3     25.6               7.7                40.2 fps
We demonstrate our EDD methodology targeting three different
hardware platforms, each with a searched DNN model, called EDD-Net.
We then compare our DNNs with the ones searched by the
state-of-the-art hardware-aware NAS approaches. The three DNNs are
targeting: (1) low-latency oriented GPU (EDD-Net-1); (2) recursive
FPGA architecture (EDD-Net-2); (3) pipelined FPGA architecture
(EDD-Net-3). The three DNNs are shown in Fig. 4; each is produced
through EDD within a 12-hour search on a P100 GPU.
First, for GPU-targeted EDD-Net-1, the algorithm suggests
16-bit precision for weights under the combined objective function
including both accuracy and latency. We compare EDD-Net-1 with
the state-of-the-art hardware-aware NAS approaches as shown in
Table 1, where the GPU inference latency is tested on a Titan RTX
GPU. It shows that EDD-Net-1 reaches accuracy similar to the
state-of-the-art DNN models, while achieving the shortest inference
latency, 11.17 ms, which is 1.4× faster than Proxyless-GPU [1], the
previous best result reported through the NAS approach. Compared
to other mobile-oriented NAS results as a reference, it is 2.0× faster
than FBNet-C [13] and 1.6× faster than MNasNet [11]. Table 2
shows the accuracy and latency results of EDD-Net-1 on an Nvidia
1080 Ti GPU after re-training and fine-tuning using TensorRT under
different data precisions.
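The GPU speedup figures quoted above follow directly from the Titan RTX latencies in Table 1; the short check below recomputes them as latency ratios. The dictionary and variable names are just labels chosen for this sketch.

```python
# Titan RTX latencies in milliseconds, copied from Table 1.
latency_ms = {
    "EDD-Net-1": 11.17,
    "Proxyless-gpu": 15.72,
    "FBNet-C": 22.54,
    "MNasNet-A1": 17.94,
}

# Speedup of EDD-Net-1 over each reference model, rounded to one decimal.
speedup = {
    name: round(t / latency_ms["EDD-Net-1"], 1)
    for name, t in latency_ms.items()
    if name != "EDD-Net-1"
}
print(speedup)
```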
Figure 4: Architectures of three EDD-Net models. EDD-Net-1 targets GPU, EDD-Net-2 targets recursive FPGA accelerator, and
EDD-Net-3 targets pipelined FPGA accelerator.
(Each model is drawn as a chain from Input through Conv 3×3, Sep 3×3 and Conv 1×1 layers and a sequence of MB blocks, each labeled with a ratio of 4, 5 or 6 and a kernel size of 3×3 or 5×5, to a final Conv 1×1 and FC layer; the number of channels and the stride-2 positions are annotated.)
Second, we intended to compare FPGA-targeted EDD-Net-2 and
EDD-Net-3 with existing FPGA/DNN co-design works such as [16]
and [9], but neither of them provided accuracy results on ImageNet.
Therefore, to make a relatively fair comparison, for EDD-Net-2,
which targets a recursive FPGA accelerator, we adopt the well-recognized
CHaiDNN framework [23], which is also a recursive
FPGA accelerator. The FPGA latency is collected by running various
DNN models with CHaiDNN accelerators under the same data
precision on a Xilinx ZCU102 FPGA, as shown in Table 1. ShuffleNet [25]
is currently not supported by CHaiDNN. Table 1 shows that
EDD-Net-2 on FPGA delivers the shortest latency among the FPGA
implementations of all the DNNs, 7.96 ms. It is 1.37× faster than
the FPGA implementation of ProxylessNet [1], 1.53× faster than
FBNet [13] and 1.1× faster than MNasNet [11]. This result shows
that our methodology generalizes well to FPGA and can search for
FPGA-friendly DNNs effectively.
Third, EDD-Net-3 is searched targeting a pipelined FPGA
accelerator. In this case, we limit the total number of DNN blocks
because more blocks require more resources and more complicated
memory control logic. Fig. 4 shows that EDD-Net-3 is shallower
but has more channels and larger kernels. We compare the
throughput of EDD-Net-3 with a state-of-the-art pipelined FPGA
accelerator, DNNBuilder [2], on a ZC706 FPGA with 900 DSPs. As
shown in Table 3, under 16-bit fixed-point precision, EDD-Net-3
achieves 1.45× higher throughput with much higher accuracy.
7 CONCLUSION
In this work, we proposed EDD, a fully simultaneous, efficient
differentiable DNN architecture and implementation co-search
methodology that can target different hardware devices with
different performance objectives. We cast the co-search problem in
a differentiable form by fusing DNN architecture search variables
and hardware implementation variables into one solution space,
accounting for both accuracy loss and hardware performance loss.
In the experiments, we demonstrated three DNN models targeting a
low-latency GPU, a recursive FPGA accelerator, and a pipelined
FPGA accelerator, respectively. Our EDD models deliver accuracy
similar to the best existing DNN models searched by NAS on
ImageNet, while improving performance by 1.4× on GPU and 1.45× on
FPGA. Future work includes GPU power and resource formulation, and
EDD search for dedicated accelerators.
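To make the idea of a fused, differentiable search space concrete, the following toy sketch relaxes one architecture choice and one implementation choice with softmax weights and minimizes a weighted sum of an accuracy-loss proxy and an expected latency by gradient descent. It is an illustration only and not the formulation used in this paper: the candidate operators, the bit-width options standing in for implementation variables, every cost value, and the weight `LAMBDA` are hypothetical.

```python
import numpy as np

# Hypothetical candidates and costs, chosen only for illustration.
OPS = ["conv3x3", "conv5x5", "mbconv3x3"]        # architecture choices
BITS = [4, 8, 16]                                # implementation choices
ACC_LOSS = np.array([0.30, 0.25, 0.28])          # accuracy-loss proxy per op
QUANT_PENALTY = np.array([0.10, 0.02, 0.00])     # extra loss per bit width
BASE_LATENCY_MS = np.array([1.0, 2.5, 0.8])      # per-op latency at 16 bits
# Latency table indexed [op, bit width]; assumed to scale with bit width.
LATENCY_MS = np.outer(BASE_LATENCY_MS, np.array(BITS) / 16.0)
LAMBDA = 0.1                                     # weight of the latency term


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def loss_and_grads(theta, phi):
    """Expected loss under softmax(theta), softmax(phi), and its gradients."""
    p, r = softmax(theta), softmax(phi)
    loss = p @ ACC_LOSS + r @ QUANT_PENALTY + LAMBDA * p @ LATENCY_MS @ r
    # Partial derivatives of the loss with respect to the probabilities.
    cost_op = ACC_LOSS + LAMBDA * LATENCY_MS @ r
    cost_bit = QUANT_PENALTY + LAMBDA * p @ LATENCY_MS
    # Chain rule through softmax: dL/dtheta_k = p_k * (c_k - sum_i p_i c_i).
    grad_theta = p * (cost_op - p @ cost_op)
    grad_phi = r * (cost_bit - r @ cost_bit)
    return loss, grad_theta, grad_phi


def co_search(steps=2000, lr=1.0):
    """Update both sets of logits simultaneously with plain gradient descent."""
    theta = np.zeros(len(OPS))
    phi = np.zeros(len(BITS))
    history = []
    for _ in range(steps):
        loss, g_theta, g_phi = loss_and_grads(theta, phi)
        history.append(loss)
        theta -= lr * g_theta
        phi -= lr * g_phi
    return OPS[int(np.argmax(theta))], BITS[int(np.argmax(phi))], history


if __name__ == "__main__":
    op, bits, history = co_search()
    print(op, bits, round(history[0], 4), round(history[-1], 4))
```

Both sets of logits are updated from gradients evaluated at the same point, which mirrors the "simultaneous" aspect of the co-search; a real search would replace the fixed cost tables with a trained network's loss and a hardware performance model.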
ACKNOWLEDGEMENT
This work is supported in part by the IBM-Illinois Center for Cogni-
tive Computing System Research (C3SR), Semiconductor Research
Corporation (SRC) and Campus for Research Excellence and Tech-
nological Enterprise (CREATE) programme in Singapore.
REFERENCES
[1] Han Cai et al. ProxylessNAS: Direct neural architecture search on target task and
hardware. In ICLR, 2019.
[2] Xiaofan Zhang et al. DNNBuilder: An automated tool for building high-
performance DNN hardware accelerators for FPGAs. In ICCAD, 2018.
[3] Barret Zoph et al. Learning transferable architectures for scalable image recogni-
tion. In CVPR, 2018.
[4] Gao Huang et al. Densely connected convolutional networks. In CVPR, 2017.
[5] Esteban Real et al. Regularized evolution for image classifier architecture search.
In AAAI, 2019.
[6] Jiantao Qiu et al. Going deeper with embedded FPGA platform for convolutional
neural network. In FPGA, 2016.
[7] Xiaofan Zhang et al. High-performance video content recognition with long-term
recurrent convolutional network for FPGA. In FPL, 2017.
[8] Yao Chen et al. Cloud-DNN: An open framework for mapping DNN models to
cloud FPGAs. In FPGA, 2019.
[9] Cong Hao et al. FPGA/DNN co-design: An efficient design methodology for IoT
intelligence on the edge. In DAC, 2019.
[10] Pengfei Xu et al. AutoDNNchip: An automated DNN chip predictor and builder
for both FPGAs and ASICs. 2020.
[11] Mingxing Tan et al. MnasNet: Platform-aware neural architecture search for
mobile. In CVPR, 2019.
[12] Dimitrios Stamoulis et al. Single-path NAS: Designing hardware-efficient con-
vnets in less than 4 hours. arXiv:1904.02877, 2019.
[13] Bichen Wu et al. FBNet: Hardware-aware efficient convnet design via
differentiable neural architecture search. In CVPR, 2019.
[14] Chengyue Gong et al. Mixed precision neural architecture search for energy
efficient deep learning. In ICCAD, 2019.
[15] Xiaofan Zhang et al. SkyNet: a hardware-efficient method for object detection
and tracking on embedded systems. In MLSys, 2020.
[16] Weiwen Jiang et al. Accuracy vs. efficiency: Achieving both through FPGA-
implementation aware neural architecture search. In DAC, 2019.
[17] Cong Hao et al. NAIS: Neural architecture and implementation search and its
applications in autonomous driving. In ICCAD, 2019.
[18] Hanxiao Liu et al. DARTS: Differentiable architecture search. In ICLR, 2019.
[19] Elijah Polak. Optimization: algorithms and consistent approximations, volume 124.
Springer Science & Business Media, 2012.
[20] Patrick Judd et al. Stripes: Bit-serial deep neural network computing. In MICRO,
2016.
[21] Sayeh Sharify et al. Loom: Exploiting weight and activation precisions to accel-
erate convolutional neural networks. In DAC, 2018.
[22] Hardik Sharma et al. Bit fusion: Bit-level dynamically composable architecture
for accelerating deep neural networks. In ISCA, 2018.
[23] CHaiDNN. https://github.com/Xilinx/CHaiDNN. Accessed: 2019-11-27.
[24] Mark Sandler et al. MobileNetV2: Inverted residuals and linear bottlenecks. In
CVPR, 2018.
[25] Ningning Ma et al. ShuffleNet V2: Practical guidelines for efficient CNN
architecture design. In ECCV, 2018.
