NAIS: Neural Architecture and Implementation Search and its Applications
  in Autonomous Driving by Hao, Cong et al.
NAIS: Neural Architecture and Implementation Search and its
Applications in Autonomous Driving
Cong Hao1,4, Yao Chen3, Xinheng Liu1, Atif Sarwari2, Daryl Sew2, Ashutosh Dhar1,4, Bryan Wu2, Dongdong Fu2, Jinjun Xiong4,1,
Wen-mei Hwu1,4, Junli Gu2 and Deming Chen1,4
1University of Illinois at Urbana-Champaign, 2XMotors.ai
3Advanced Digital Sciences Center, Singapore, 4IBM-Illinois Center for Cognitive Computing Systems Research (C3SR)
ABSTRACT
The rapidly growing demands for powerful AI algorithms in many
application domains have motivated massive investment in both
high-quality deep neural network (DNN) models and high-efficiency
implementations. In this position paper, we argue that a simultane-
ous DNN/implementation co-design methodology, named Neural
Architecture and Implementation Search (NAIS), deserves more
research attention to boost the development productivity and effi-
ciency of both DNN models and implementation optimization. We
propose a stylized design methodology that can drastically cut down
the search cost while preserving the quality of the end solution.
As an illustration, we discuss this DNN/implementation methodol-
ogy in the context of both FPGAs and GPUs. We take autonomous
driving as a key use case as it is one of the most demanding areas
for high quality AI algorithms and accelerators. We discuss how
such a co-design methodology can impact the autonomous driving
industry significantly. We identify several research opportunities
in this exciting domain.
1 INTRODUCTION
The world has seen tremendous improvements in AI algorithms
as well as their high performance implementations in recent years.
Remarkable achievements have been demonstrated for AI algo-
rithms in many areas with expeditious improvements in algorithm
quality and robustness. Deep neural network (DNN) is one of the
most popular AI algorithms with impressive advancements, from
AlexNet [1] to modern models [2–4]. Meanwhile, the optimization
techniques for high performance implementations of AI algorithms
on hardware are also being intensively studied. Such implementa-
tion techniques include kernel and DNN optimizations on GPUs
and TPUs [5–8], accelerator designs on customizable hardware
such as FPGAs [9–13] and AI chips [14, 15].
Despite many of these accomplishments, there are still many
challenges, one of which is the gap between high quality DNN
models during design and their implementation performance during
deployment. One reason for such a gap is isolated design of DNNs
and optimization of their implementations, where the former does
not integrate sufficient hardware knowledge, and the latter does not
have enough freedom to accommodate pre-designed DNNs at such
a late stage. Instead, DNNs and their hardware implementations
need to be designed simultaneously, i.e., DNN/implementation co-
design, as illustrated in Fig. 1. We call it Neural Architecture and
Implementation Search (NAIS). The outputs of NAIS include both
DNNs that are of high quality of result (QoR), and implementations
that are of high quality of service (QoS). The NAIS methodology
brings immense optimization opportunities for:
• Proposing specific hardware-oriented DNN models. For DNN
deployment, there are many hardware candidates such as
GPUs, cloud and edge TPUs, cloud and embedded FPGAs,
each of which has largely different characteristics such as
computation capability, memory capacity and bandwidth.
The NAIS method will explore DNNs based on specific
hardware features and search for DNNs with the best match.
• Meeting resource and performance constraints. The NAIS
method will search for DNNs within available hardware
resources and performance constraints, which provides
predictable and guaranteed performance for DNN deployment.
• Shortening design cycles. While existing top-down design
methods require back-and-forth efforts to find satisfying
solutions, an automated NAIS flow can simultaneously find
an optimized DNN model and its deployment on hardware.
[Figure 1: NAIS produces Solution 1 to Solution n, each pairing a
DNN (DNN 1 ... DNN n) with an implementation (IMP 1 ... IMP n).
High QoR: model accuracy. High QoS: low latency, high throughput,
low resource.]
Figure 1: Neural Architecture and Implementation Search (NAIS)
generates DNNs and optimizes their implementations simultaneously,
achieving both high Quality of Result (QoR) and Quality of
Service (QoS).
In modern industry applications, as AI algorithms are increas-
ingly adopted, high performance computing platforms are in great
need, especially with reconfigurable devices for acceleration. Take
autonomous driving as an example, which is one of the most de-
manding areas for high QoR AI algorithms and high performance
computing implementation. Fig. 2 shows three types of computing
platforms for autonomous driving: commodity platform composed
of commercial CPUs, GPUs or DSPs, semi-customized platform
composed of GPUs and FPGAs, and fully-customized platform com-
posed of dedicated ASICs. As shown in the figure, though fully-
customized platforms are most favorable in terms of high perfor-
mance and low price-performance ratio (e.g. $/Gops), they suffer
from high non-recurring engineering (NRE) cost, long design cy-
cle and high risk in making mistakes, which hinders their wide
adoption. In contrast, with reconfigurable devices such as FPGAs,
semi-customized platforms become a competitive alternative with
a good trade-off in performance and cost. Moreover, once the AI
algorithms and their hardware implementations have been fully
validated on FPGAs, the design can be made into ASICs to take ad-
vantage of what a fully-customized platform can offer. Thus, finding
high quality AI algorithms with their optimized implementations
on reconfigurable devices not only provides good solutions for
semi-customized platforms, but also provides a good path to move
from semi to fully customized platforms. Because of this, there is a
pressing need for NAIS, an automatic co-design of AI algorithms
and their optimized implementations, on GPUs, FPGAs and ASICs,
given the widely varying device characteristics and the large design
space of both algorithmic and implementation optimization.
arXiv:1911.07446v1 [cs.LG] 18 Nov 2019
[Figure 2: comparison of three computing platforms.
Commodity Platform: commercial CPU, GPU, DSP, etc.; large-size devices.
  ✓ Low NRE
  ✓ Rich open-source libraries
  ✓ Low risk in mistakes
  × Hard to differentiate with competitors
  × Hard to achieve high performance
  × Price-performance ratio: High
Semi-Customized Platform: hybrid GPU + FPGA; medium-size GPU + large-size FPGA.
  ✓ Low NRE
  ✓ Short design cycle
  ✓ Low risk in mistakes (reconfigurability)
  × Medium performance
  × Price-performance ratio: Medium
Fully-Customized Platform: ASICs with dedicated accelerators; medium-size GPU + small-size ASIC.
  ✓ Strong differentiations with competitors
  ✓ High performance
  ✓ Price-performance ratio: Low
  × High NRE
  × Long design cycle
  × High risk in mistakes
Annotation on the step from semi- to fully-customized: "Perfect DNN model +
accelerator validated on FPGA: Solved!"]
Figure 2: Industrial position of reconfigurable architectures (e.g. FPGA) in autonomous driving: an important step from semi-customized
platform to fully-customized platform.
[Figure 3: two block diagrams.
(a) A model search strategy S draws a DNN model m ∈ M from the model search
space M; performance estimation returns the performance estimate of m to the
search strategy; hardware-related optimization is applied afterwards to obtain
the implementation i.
(b) A model & implementation search strategy S* operates on the co-design
space {M*, I*}, built from the model search space M and the implementation
search space I; each co-design solution <m, i> ∈ {M*, I*} goes through
performance estimation, and the performance estimate of <m, i> is fed back;
the output is a model and its implementation.]
Figure 3: (a) The traditional neural architecture search (NAS) [16].
(b) Our proposed NAIS co-design methodology.
Motivated by those opportunities, in this work, we propose NAIS
as a simultaneous DNN/implementation co-design approach to effectively
search for high quality DNN models and high performance
implementations for different hardware platforms. We demonstrate
how such a NAIS approach can be utilized to solve real-world ap-
plications, including autonomous driving.
2 NAIS DESIGN METHODOLOGY
A NAIS methodology has two tasks: to search for DNNs of high QoR
(e.g. accuracy), and for implementations of high QoS (e.g. latency,
throughput). Such an implementation can be an optimized software
stack on a given accelerator device such as GPUs, or a customized
hardware accelerator on FPGAs, CGRAs, and ASICs.
Neural Architecture Search (NAS). For DNN search, most existing
NAS engines can find high quality DNNs. As illustrated in
Fig. 3a, given a model search space M, a NAS engine applies a
search strategy S such as reinforcement learning or an evolutionary
algorithm. During the search, the performance (QoR) of each
candidate model m ∈ M is estimated and fed back to the NAS
engine. After NAS generates a satisfying DNN, it is implemented
and deployed on a GPU, FPGA or other device. During the
search, however, implementation optimization is not considered.
For example, a recent hardware-aware NAS approach [17] considers
directly measured inference latency on the GPU but does not
explore optimization techniques. This results in a large performance
gap between estimation and final implementation, especially
when there are multiple candidate devices, each requiring different
optimization techniques. When targeting FPGAs, it becomes even
more important that DNN search and implementation search be
tightly coupled during NAS, because different accelerator
implementation configurations can result in large performance variation.
Neural Architecture and Implementation Search (NAIS): Beyond NAS.
To fully explore implementation optimizations and
to consider the impacts of implementation on DNNs, we propose
a fully simultaneous DNN/implementation co-design approach:
it searches not only for neural architectures but also
for implementation optimizations, i.e., a Neural Architecture and
Implementation Search, NAIS. As illustrated in Fig. 3b, the NAIS
search space includes both the model search space M and the
implementation search space I. We combine the two spaces into a
co-design space {M∗, I∗}, and apply a joint search strategy S∗ on the
co-design space. During NAIS, each solution (m, i) is composed of
two parts: a DNN model solution m, and a corresponding
implementation solution i, where specific optimization techniques have
been applied to i. After searching, the NAIS engine outputs both
the DNN model and its optimized hardware implementation. The
design space of NAIS is the product of the design space of DNN
search and the design space of implementation optimization, which
can be huge. Such a combined design space makes the co-design
procedure time-consuming and hard to converge. Innovative research
is needed to address this new challenge.
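The product structure of the co-design space can be illustrated with a toy exhaustive search. The sketch below is only an illustration: the two small spaces, the `estimate_qor` and `estimate_latency_ms` proxies, and the latency budget are all hypothetical and are not taken from the paper; a practical NAIS engine would replace exhaustive enumeration with a joint search strategy S∗.

```python
from itertools import product

# Hypothetical toy spaces; real model and implementation spaces are far larger.
MODEL_SPACE = [{"depth": d, "channels": c} for d in (4, 8) for c in (32, 64)]
IMPL_SPACE = [{"parallel": p, "bits": b} for p in (8, 16) for b in (8, 16)]


def estimate_qor(m, i):
    """Toy accuracy proxy (assumption): larger models score higher,
    8-bit implementations lose a little accuracy."""
    penalty = 0.03 if i["bits"] == 8 else 0.0
    return 0.5 + 0.02 * m["depth"] + 0.001 * m["channels"] - penalty


def estimate_latency_ms(m, i):
    """Toy latency proxy (assumption): work divided by parallelism,
    scaled by the data bit width."""
    work = m["depth"] * m["channels"] ** 2
    return work * i["bits"] / (i["parallel"] * 4096.0)


def joint_search(latency_budget_ms):
    """Enumerate every (m, i) pair and keep the best QoR within the QoS budget."""
    best = None
    for m, i in product(MODEL_SPACE, IMPL_SPACE):
        if estimate_latency_ms(m, i) > latency_budget_ms:
            continue  # violates the QoS constraint
        score = estimate_qor(m, i)
        if best is None or score > best[0]:
            best = (score, m, i)
    return best


# The co-design space size is the product of the two space sizes.
co_design_size = len(MODEL_SPACE) * len(IMPL_SPACE)
```

Even in this toy setting, 4 models and 4 implementations already give 16 joint candidates, which is why exhaustive enumeration does not scale and the space has to be narrowed.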
In this position paper, we first prototype a NAIS methodology in
the context of DNN/FPGA co-design, and show how we effectively
narrow the co-design space to generate high quality DNNs and their
FPGA implementations within the resource constraints of a target
FPGA. We then discuss how such a NAIS design methodology can
be extended for GPU in a similar fashion.
[Figure 4: FPGA/DNN co-design flow.
Inputs: target ML task; FPGA device (resources); performance targets (QoS).
Outputs: software (DNN model) and hardware (FPGA accelerator).
Step 1. FPGA-oriented Bundle generation and modeling. An FPGA-oriented IP pool
(DW-CONV, CONV, Vector-Mul., Matrix-Mul., ...; examples: Intel FPGA @ 6-bit × 6-bit,
Xilinx FPGA @ 8-bit × 10-bit) is used to form FPGA-oriented Bundles 0 to n, with
example layers CONV 3x3, DW-CONV 3x3, CONV 1x1, Pooling and Activation. Each Bundle
is associated with a group of IPs with known hardware performance.
Step 2. Bundle selection according to <accuracy, performance> characteristics.
• Accuracy characteristic: potential contribution to accuracy
• Performance characteristic: resource (DSP, BRAM) usage, latency, etc.
Step 3. DNNs are built by replicating Bundles, and are updated by altering the # of
replications, channels and up/down-sampling spots. Possible up/down-sampling spots
lie between replicated Bundles; possible channel change spots are marked.]
Figure 4: NAIS applied to DNN/FPGA co-design.
3 NAIS FOR FPGA
3.1 DNN/Implementation Co-design Space
The FPGA accelerator optimization problem is very complicated
and requires comprehensive domain-specific knowledge. For example,
the designer must decide the overall accelerator architecture
(pipelined or folded), the number of IPs and the parallelism of each IP,
data quantization, buffer allocation, data reuse, etc., and each
decision has a significant impact on the final performance. In
addition, the underlying FPGA characteristics (DSP structure, block
RAM, bandwidth, etc.) and available resources can be very different
between FPGA devices or families.
To efficiently narrow down the combined design space of NAIS
for a target FPGA, we propose to co-design both the DNN structure and
its FPGA accelerator implementation using hardware-aware basic
building blocks, named Bundles [18]. A Bundle represents a set
of sequential DNN layers, and a DNN can be constructed by
replicating a Bundle n times with configurations (the ’A’ in NAIS).
Meanwhile, a Bundle is composed of a set of FPGA configurable
IPs, where each IP is well designed and highly optimized, and the
Bundle is used to construct the FPGA implementation (the ’I’ in
NAIS). For the DNN, each Bundle replication can be configured to have
a different number of channels in its layers; for the FPGA, a Bundle can
be configured to have a certain number of IP instances, each
with specific parallel factors, data precision, on-chip
buffers, etc. When a Bundle is selected and configured, both the
DNN model and its accelerator can be determined. That is, Bundles
provide a stylized approach to design both the DNNs and FPGA
implementations, thus narrowing the search space efficiently.
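One way to picture how a single Bundle determines both sides of the co-design is the minimal data-structure sketch below. The class names, the `dsp_per_lane` cost field, and the assumption that one instance of each IP is shared by all replications (a folded architecture) are illustrative choices, not the representation used in [18].

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class IP:
    """A configurable FPGA IP (hypothetical cost model: DSPs per parallel lane)."""
    name: str
    dsp_per_lane: int


@dataclass
class Bundle:
    """A sequence of layers backed by a matching set of IPs."""
    ips: List[IP]

    def build_dnn(self, channels: List[int]) -> List[Tuple[str, int]]:
        """DNN side: replicate the Bundle once per entry in `channels`,
        each replication using its own channel count."""
        return [(ip.name, ch) for ch in channels for ip in self.ips]

    def dsp_usage(self, parallel_factor: int) -> int:
        """Accelerator side: one instance per IP, reused by every
        replication (assumed folded architecture)."""
        return sum(ip.dsp_per_lane * parallel_factor for ip in self.ips)


bundle = Bundle([IP("dw_conv3x3", 1), IP("conv1x1", 2)])
dnn = bundle.build_dnn([48, 96, 192])        # three replications
dsps = bundle.dsp_usage(parallel_factor=16)  # accelerator resource estimate
```

Choosing the Bundle and its configuration fixes the layer list and the resource estimate at the same time, which is the sense in which the search space is narrowed.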
3.2 Overall Co-Design Flow
Given the co-design space, Fig. 4 shows our proposed NAIS
co-design flow targeting FPGA [18]. The inputs include a machine
learning task such as image classification or object detection, the
resource constraints of a specific FPGA device, and a performance
target such as frame rate. The outputs include both the DNN models
and the corresponding FPGA accelerators with their achieved performance.
Inside the co-design flow, there are three major steps.
Step 1: FPGA-oriented Bundle generation. First, we design
a pool of FPGA-oriented IPs considering specific FPGA characteristics
such as DSP and BRAM structures. The IPs may have the same
functionality but different designs. For example, to best utilize the
DSP resource, a Xilinx FPGA may best support 8-bit × 10-bit
multiplication IPs, while an Intel FPGA may best support 9-bit × 9-bit
multiplication IPs. Based on the IPs, we build FPGA-oriented Bundles,
where the data tiling, pipelining and data movement between
these IPs are considered.
Step 2: Bundle selection. Second, we apply Bundle evaluation
to reduce the co-design space by only selecting the most promising
Bundles for future exploration. Each Bundle will be evaluated re-
garding its resource utilization and potential contribution to DNN
accuracy. We build a Bundle-wise DNN template with fixed front-
end and back-end structures, and insert one Bundle (with replica-
tions) in the middle each time [18, 19]. Such Bundle-wise DNNs
will be quickly trained using a small number of epochs to evaluate
the accuracy. The Bundles on the resource-accuracy Pareto curve
will be selected.
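The Pareto selection in Step 2 can be sketched as a non-dominated filter over (resource, accuracy) pairs. The Bundle names and numbers below are made-up placeholders, and the single scalar resource figure is a simplification of the DSP/BRAM/latency characteristics mentioned above.

```python
def pareto_bundles(candidates):
    """Return the names of Bundles that are not dominated.

    `candidates` is a list of (name, resource, accuracy) tuples. A Bundle is
    dominated if another one uses no more resource and reaches at least the
    same accuracy, and is strictly better in one of the two.
    """
    selected = []
    for name, res, acc in candidates:
        dominated = any(
            (r <= res and a >= acc) and (r < res or a > acc)
            for _, r, a in candidates
        )
        if not dominated:
            selected.append(name)
    return selected


# Hypothetical evaluation results after short training runs.
evaluated = [
    ("B1", 100, 0.60),
    ("B2", 150, 0.58),
    ("B3", 200, 0.70),
    ("B4", 120, 0.66),
    ("B5", 220, 0.69),
]
front = pareto_bundles(evaluated)  # ["B1", "B3", "B4"]
```

Here B2 is dominated by B1 and B5 by B3, so only B1, B3 and B4 would be carried into the DNN search.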
Step 3: Hardware-aware DNN search and update. Third, we
perform hardware-aware DNN search. The inputs include the ini-
tial DNNs, performance objectives such as latency, and resource
constraints. We use stochastic coordinate descent (SCD) to update
three variables related to DNN structure: the number of Bundle
replications; down-sampling configuration between Bundles; and
the number of channels in each Bundle. During the iterations of
SCD, only DNNs within the resource constraints and performance
requirements are kept for downstream training. In such a way,
the final generated DNNs are more structured, resulting in more
efficient hardware implementations.
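A minimal sketch of the stochastic coordinate descent loop in Step 3 is given below. The analytical `latency_ms` and `bram_blocks` models, the unit step sizes, the bounds, and the acceptance rule (move only if the latency gets closer to the target) are all assumptions made for illustration; the actual flow relies on the Bundle performance models and keeps feasible DNNs for training.

```python
import random

# Hypothetical unit moves and bounds for the three coordinates.
STEPS = {"replications": 1, "downsamples": 1, "channels": 16}
BOUNDS = {"replications": (1, 16), "downsamples": (0, 4), "channels": (16, 512)}


def latency_ms(cfg):
    """Toy latency model (assumption): work shrinks 4x per down-sampling stage."""
    work = cfg["replications"] * cfg["channels"] ** 2 / (4 ** cfg["downsamples"])
    return work / 2000.0


def bram_blocks(cfg):
    """Toy resource model (assumption): buffers scale with the channel count."""
    return cfg["channels"] // 4


def scd_search(start, target_ms, max_bram, iters=200, seed=0):
    """Perturb one randomly chosen coordinate per iteration; keep the move only
    if it respects the resource limit and brings latency closer to the target."""
    rng = random.Random(seed)
    cfg = dict(start)
    for _ in range(iters):
        key = rng.choice(list(STEPS))
        step = rng.choice((-1, 1)) * STEPS[key]
        lo, hi = BOUNDS[key]
        cand = dict(cfg)
        cand[key] = min(hi, max(lo, cfg[key] + step))
        if bram_blocks(cand) > max_bram:
            continue  # outside the resource constraint
        if abs(latency_ms(cand) - target_ms) < abs(latency_ms(cfg) - target_ms):
            cfg = cand
    return cfg


start = {"replications": 4, "downsamples": 2, "channels": 64}
found = scd_search(start, target_ms=5.0, max_bram=64)
```

Because only one coordinate changes per step, every intermediate DNN remains a regular stack of Bundle replications, which matches the structured results described above.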
3.3 FPGA-oriented IP design
Since the FPGA’s characteristics vary with device vendors and
types, a well designed IP must fully consider such characteristics
to achieve the maximum performance while minimizing resource
utilization. We discuss the two most important factors as an illustration:
the structure of DSPs and embedded block memory.
3.3.1 DSP consideration. Table 2 shows different multiplication
and accumulation precision of DSPs in different FPGA devices,
where the variation can be large even within the same vendor.
The computational IPs should be carefully designed based on the
underlying DSP structure to take full advantage of its computation
capability, which, in turn, affects the DNN design.
One important factor that must be considered is the DNN’s data
precision. Take the Xilinx DSP48E1 and DSP48E2 as examples.
Assume a simple case of two multiplications, a × c and b × c, with a
common multiplier c, where a, b and c have b_a, b_b and b_c bits, respectively.
To increase multiplication parallelism, one possibility is to let the two
multiplicands (in this case a and b) occupy one DSP input I1, and
let the common multiplier (in this case c) occupy the other input
I2, so that the two multiplications can be conducted in one clock
cycle. To ensure correctness, there must be at least b_c empty bits
between a and b, so that the two products do not overlap with each
other in the output. When using DSP48E1, which supports 18-bit ×
25-bit multiplications and 48-bit accumulation, if a and b are both
8-bit and occupy the 25-bit operand, then c must not exceed 9 bits
(8+8+9 ≤ 25); when using DSP48E2, c can be 10-bit (8+8+10 ≤ 27).
In a scenario where a and b are activations and c is the weight, if
the target FPGA has DSP48E1, the DNN weights should be quantized
to 9 bits or fewer, while with DSP48E2, the weights can be 10-bit.
Similarly, if the target device is an Intel FPGA, the preferable
quantization changes accordingly. For example, on Stratix V, <9-bit, 9-bit>
is preferable to <10-bit, 9-bit> for weights and activations,
because Stratix V DSPs support 9 × 9-bit multiplications.
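The packing argument can be checked with plain integer arithmetic. The sketch below models only unsigned operands and ignores sign extension and the accumulator; the function name and the `wide_port_bits` parameter are illustrative, with 25 and 27 standing for the wide multiplier inputs of DSP48E1 and DSP48E2 quoted in the text.

```python
def packed_multiply(a, b, c, bits_a, bits_b, bits_c, wide_port_bits):
    """Compute a*c and b*c with a single wide multiplication.

    a and b share the wide input, separated by bits_c empty bits, so that
    the product b*c (at most bits_b + bits_c bits) cannot reach a*c.
    Unsigned operands only (simplifying assumption).
    """
    if bits_a + bits_b + bits_c > wide_port_bits:
        raise ValueError("operands do not fit into the wide DSP input")
    if not (0 <= a < 1 << bits_a and 0 <= b < 1 << bits_b and 0 <= c < 1 << bits_c):
        raise ValueError("operand out of range for its bit width")
    shift = bits_b + bits_c              # b's own bits plus the guard bits
    packed = (a << shift) | b            # layout: [ a | guard bits | b ]
    product = packed * c                 # one multiplication on the DSP
    return product >> shift, product & ((1 << shift) - 1)


# 8-bit a and b with a 9-bit c fit the 25-bit input (8 + 8 + 9 = 25).
ac, bc = packed_multiply(200, 117, 300, 8, 8, 9, wide_port_bits=25)
```

With these inputs the two fields of the product hold 200 × 300 and 117 × 300; asking for a 10-bit c on the 25-bit port raises an error, while the same request passes with `wide_port_bits=27`.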
Moreover, the DSP structure also affects the computation pattern
and parallelism, which determine the detailed IP design. Fig. 5
shows an example of a 1 × 1 convolution IP targeting Xilinx and
Intel Arria V devices, respectively. On Xilinx DSPs, two
multiplications are conducted in parallel by sharing a common
multiplier, so one kernel consumes two pieces of feature map data
at a time to fully utilize one DSP. On the Intel Arria V series, where
one DSP is capable of running three independent 9 × 9 multiplications,
three kernels consume three different pieces of feature map data at a time.
Such differences between DSPs result in disparate IP designs
and performance, and need to be considered in the NAIS engine.
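[Figure 5: mapping of input feature map data (F1 to F4) and kernels (K1 to K3)
onto DSP multipliers. Left panel: on a Xilinx device using DSP48E2, each DSP
(DSP1, DSP2) multiplies one kernel value with two feature map values. Right
panel: on an Intel Stratix V device, a single DSP (DSP1) multiplies three
kernel values (K1, K2, K3) with three feature map values (F1, F2, F3).]
Figure 5: Different DSP structures lead to different IP designs.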
Table 1: BRAM data width of different FPGA devices.
Device                   Data Width
Xilinx [20]  RAMB18E1    1, 2, 4, 9, 18
             RAMB36E1    1, 2, 4, 9, 18, 36
Intel [21]   MLAB        8, 9, 10, 16, 18, 20
             M9K         1, 2, 4, 8, 9, 16, 18, 32*, 36*
             M20K        8, 10, 16, 20, 32, 40
             M144K       8, 9, 16, 18, 32, 36, 64*, 72*
             eSRAM       72
* Only applicable for single-port RAM, simple dual-port RAM, and single-port ROM.
3.3.2 BRAM consideration. On-chip block memory (BRAM) is an-
other important design factor to consider. Effectively utilizing on-
chip memory for data buffering can greatly reduce the amount of
off-chip data movement, thus reducing both latency and energy
consumption. Table 1 shows the supported data width of different
FPGA devices. For Xilinx, the commonly used bit widths are 9, 18
and 36 in its RAMB18E1 and RAMB36E1. For Intel, the common
block memory is M20K, which has a capacity of 20Kb organized
into either 10- or 20-bit storage words and read/write operations.
The on-chip data buffers need to be carefully allocated to align with
the block memory depth and width. For example, if a continuous
buffer is allocated to be 21Kb, it will occupy two blocks of Intel
M20K, leaving most of the second block unused.
The differences in block memory structure can affect the desir-
able DNN designs as well. Take the buffer allocation for feature
maps using Xilinx RAMB18E1 as an example. If the input feature
map dimension of a layer is 96 × 96 × 1 (one channel) using 8-bit
data, the number of occupied RAM blocks is 4, and a slightly larger
feature map will consume an additional block. Usually, the inter-
mediate feature map dimensions are closely related to the original
input size and up/down sampling. Therefore, resizing the input
image to 384 × 384 (96 × 4) may be better than 388 × 388 (97 × 4) as
far as on-chip buffer allocation is concerned.
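The block counts quoted in this subsection follow from a capacity-only estimate, sketched below. This is a deliberate simplification: it divides the buffer size by the nominal block capacity (18 Kb for RAMB18E1, 20 Kb for M20K) and ignores the width/depth configurations of Table 1, so a real allocation may need more blocks. The helper names are illustrative.

```python
def feature_map_bits(height, width, channels, bits_per_value):
    """Storage needed for one feature map, in bits."""
    return height * width * channels * bits_per_value


def blocks_needed(total_bits, block_kbits):
    """Capacity-only estimate: ceiling of buffer size over block capacity."""
    capacity = block_kbits * 1024
    return -(-total_bits // capacity)  # integer ceiling division


fm_96 = blocks_needed(feature_map_bits(96, 96, 1, 8), block_kbits=18)
fm_97 = blocks_needed(feature_map_bits(97, 97, 1, 8), block_kbits=18)
m20k = blocks_needed(21 * 1024, block_kbits=20)
```

Under this estimate the 96 × 96 map fills exactly 4 RAMB18E1 blocks, the 97 × 97 map spills into a fifth, and a 21 Kb buffer needs two M20K blocks, matching the examples in the text.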
The discussions in this section show that the structures of DSPs
and BRAMs play an important role in guiding the DNN design in a
NAIS framework.
Table 2: Multiplication precision of different FPGA devices.
Device                              Precision Mode    Accumulator
Xilinx 7 series (DSP48E1) [22]      One 25 × 18       48-bit
Xilinx UltraScale (DSP48E2) [23]    One 27 × 18       48-bit
Intel Stratix V [24]                Three 9 × 9       64-bit
                                    Two 18 × 18
                                    One 18 × 36
                                    One 27 × 27
Intel Arria V [25]                  Three 9 × 9       64-bit
                                    Two 18 × 18
                                    One 27 × 27
Intel Stratix 10 [26]               Two 18 × 19       64-bit
Intel Arria 10 [27]                 One 27 × 27
4 NAIS FOR GPU
There are recent works discussing hardware-aware NAS targeting
GPUs [28, 29]. However, during NAS, GPU kernel configuration and
optimization were ignored, which is a non-trivial problem that has
attracted a lot of research interest [5, 30, 31]. Table 3 summarizes
a set of GPU architecture-specific and kernel-specific parameters,
which can affect the kernel configuration and performance on a
specific GPU [5]. These parameters vary greatly with different
GPU generations. Hence, the selection of the most adequate
configurations of the GPU kernels has proven to be a difficult design
optimization problem [5]. In [31], it is demonstrated that for just a
single AlexNet layer with 4 tunable parameters, the possible
configurations are 17 × 254, and the performance ranges from 44.7 to
5735.8 Gflop/s on an AMD Fury X GPU. Even with GPU optimization
tools, such as TensorRT [6] on top of cuDNN and cuBLAS, one
kernel can still have varied performance. Fig. 6 shows the variation
of GPU throughput when computing one convolution layer with
different filter configurations.
To apply NAIS in DNN and GPU implementation co-design, we
can generalize the aforementioned DNN/FPGA co-design method-
ology. For example, a GPU-oriented Bundle can be defined as well.
One GPU Bundle is composed of a set of GPU kernels, which shall
be configured and optimized targeting the specific GPU device (usu-
ally the GPUs used for training and for inference are different). The
parameters of the Bundle may include favorable matrix shapes for a
matrix-multiplication kernel, the number of threads, the batch size,
etc. This Bundle optimization problem is being intensively studied
with auto-tuning tools such as [31] and [5], where [5] especially
targets multi-kernel optimizations. With optimized kernel Bundles,
structured NAS [32] can be applied. Similar to the normal cells and
reduction cells used in [32], we can search for DNNs with different
configurations of normal Bundles and reduction Bundles, which
are optimized GPU kernels in NAIS. Leveraging both GPU Bundle
optimization and structured NAS search, we can develop a NAIS
engine that can be naturally applied to GPU and DNN co-design.
With more advanced profiling capabilities, such as the recent
MLModelScope [33] tool, we can easily evaluate and profile DNN
models across different datasets, frameworks and hardware at scale
and across the stack. With such detailed layer-wise and kernel-wise
profiling data, roofline models for all kernels can be built to understand
whether a kernel configuration is computation or memory bound.
All those performance models can be leveraged to further the
development of NAIS for GPU.
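The roofline check mentioned here reduces to comparing a kernel's arithmetic intensity with the device's ridge point. The sketch below uses the standard roofline formula; the peak throughput, bandwidth and kernel figures are hypothetical numbers chosen for illustration, not measurements of any GPU discussed in this paper.

```python
def roofline(flops, bytes_moved, peak_gflops, bandwidth_gbs):
    """Classify a kernel with the roofline model.

    Attainable throughput is min(peak, arithmetic intensity * bandwidth).
    Returns (attainable GFLOP/s, "compute" or "memory").
    """
    intensity = flops / bytes_moved               # FLOP per byte
    memory_roof = intensity * bandwidth_gbs       # GFLOP/s allowed by memory
    if memory_roof < peak_gflops:
        return memory_roof, "memory"
    return peak_gflops, "compute"


# Hypothetical device: 10,000 GFLOP/s peak, 500 GB/s memory bandwidth,
# so the ridge point sits at 20 FLOP/byte.
PEAK, BW = 10000.0, 500.0

small_kernel = roofline(flops=1e9, bytes_moved=1e8, peak_gflops=PEAK, bandwidth_gbs=BW)
large_kernel = roofline(flops=4e9, bytes_moved=1e8, peak_gflops=PEAK, bandwidth_gbs=BW)
```

In a NAIS setting, the per-kernel FLOP and byte counts would come from profiling data, and the resulting classification could steer which Bundle parameters are worth tuning.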
[Figure 6: line plot. x-axis: # of output channels (32, 64, 128, 256, 384, 512).
y-axis: throughput on Nvidia Titan XP (Gflops/s), 0 to 12K. Series:
3 × 3 filters on 64 (H) × 64 (W) feature map; 5 × 5 filters on 64 (H) × 64 (W)
feature map; 3 × 3 filters on 32 (H) × 32 (W) feature map; 5 × 5 filters on
32 (H) × 32 (W) feature map.]
Figure 6: Different kernel size and feature map size result in different
throughput on Nvidia Titan XP using cuBLAS 10.0.
Table 3: GPU architecture and kernel specific parameters [5].
Type                    Parameter
Architecture specific   Max. number of blocks per SM
                        Max. number of warps per SM
                        Shared memory per SM
                        Shared memory alloc. unit size
                        Max. number of registers per SM
                        Registers alloc. unit size
                        Warp size
                        Max. number of threads per SM
Kernel specific         Number of warps per thread block
                        Shared memory per block
                        Number of registers per thread
Architecture &          Max. number of thread blocks
kernel specific         Hardware utilization measure
5 NAIS FOR AUTONOMOUS DRIVING
An autonomous driving system collects a large amount of data
from the surrounding environment, and executes a complicated
software pipeline for localization, perception, prediction, planning and
control. To support a safe and robust software pipeline, a powerful
computing platform as well as high quality AI algorithms are
indispensable, and a NAIS approach is imperative to support both.
5.1 Computing Platforms
Currently, GPUs are the prevailing computing platforms for
autonomous driving, offering high programmability, flexibility and
performance. To meet this demand, Nvidia introduced Drive AGX [35], a
powerful autonomous driving hybrid platform built on Nvidia Xavier,
incorporating 8-core CPUs, deep learning accelerators (DLA), an
integrated GPU and programmable vision accelerators (PVA). Among
these components, the DLA is best suited to DNN-based inference,
and it can be replaced by ASICs or FPGAs. Therefore, for
competitive differentiation, some leading autonomous driving
companies have started to adopt specialized platforms. For example,
Mobileye [36] and Tesla [37] have developed their own chips
to achieve outstanding AI performance and low power. FPGAs, on
the other hand, have also been a popular computing platform for
autonomous driving cars because of their appealing advantages such
as industrial reliability, specialization, high performance and low
power. There are ongoing efforts from technology companies and
academic institutions on FPGA-based solutions [38, 39]. Xilinx, for
example, has developed its ADAS using Zynq-7000 SoC-based
FPGA devices [40]. In a recent collaborative work of UIUC and
XMotors [34], a hybrid GPU + FPGA computing system for
autonomous driving has been proposed. Fig. 7 illustrates the hybrid
system, where the GPU serves as a primary system, and the FPGA
serves as a secondary system for failure fallback and providing
auxiliary information for assistive driving.
[Figure 7: hybrid computing platform and FPGA workload.
Platform: cameras, radars, ultrasonic sensors, etc. feed a primary GPU system
and a secondary FPGA system (FPGA programmable logic, embedded ARM, DDR
memory); a central MCU performs a GPU check and an FPGA check and drives
control; the FPGA also provides auxiliary information.
Normal mode: high resolution traffic light/sign detection; high resolution
pedestrian detection; driver status monitor; ...
Error mode: keep driving without lane changing; safe stop with lane changing;
emergency stop; ...
Different inputs, tasks and performance requirements:
            Available Input Sources
            Front & Side Cameras   Front & Rear Cameras   Side & Rear Cameras   ...
Highway     7 images*, 30 FPS,     4 images, 30 FPS,      5 images, 30 FPS,     ...
            Acc. > 85%             Acc. > 85%             Acc. > 85%
Urban       7 images, 20 FPS,      4 images, 20 FPS,      5 images, 20 FPS,     ...
            Acc. > 90%             Acc. > 90%             Acc. > 90%
School      7 images, 15 FPS,      4 images, 15 FPS,      5 images, 15 FPS,     ...
            Acc. > 95%             Acc. > 95%             Acc. > 95%
...         ...                    ...                    ...
* 7 cameras, each providing one input image.]
Figure 7: A hybrid GPU + FPGA system in autonomous driving [34]. The functionality, input sources and performance requirements for the
FPGA system are complicated and largely vary with driving scenarios.
Given the emerging need for semi-customized platforms with
reconfigurable devices, and for fully-customized platforms as a direct next
step, DNN and implementation co-design is expected to
boost development productivity and accelerate platform evolution.
5.2 Autonomous Driving Algorithms
Self-driving is a comprehensive robotic capability including parking,
driving and in-cabin intelligence functions, and each contains a set
of varied sub-functions with different AI algorithms, input sources
and performance requirements.
Varied sub-functions. The self-driving pipeline requires different
functions and algorithms in different scenarios. For example, the
algorithms for parking and driving can be very different: the parking task
focuses on short-range parking lot detection, localization
and low-speed vehicle control, while the driving task focuses on
detecting moving objects, obstacles and lanes within hundreds of meters
and on high-speed vehicle control. In-cabin intelligence also has
multiple sub-functions such as DSM (Driver State Monitoring), voice
recognition, gesture-based interactions, and passenger detection.
Another example can be seen in the hybrid GPU and FPGA system
proposed in [34], shown in Fig. 7. In the system error mode, the
FPGA executes different tasks: when the car is on a highway, it
keeps driving while maintaining a minimum speed limit; when
in an urban area, it slows down the car and applies a safe pull-over.
Each scenario requires different DNNs to be mapped to the FPGA.
Varied input sources. The self-driving system will be provided
with varied input sources. For example, parking functions usually
use surrounding cameras and ultrasonic sensors, while highway
driving uses multiple front, side and rear cameras with the assistance of
radars. Another example is shown in Fig. 7, where the FPGA accepts
input images with different resolutions: in normal mode, it may
conduct traffic light detection using high resolution input images,
while in error mode, it runs a simplified autonomous driving pipeline
using low resolution input images for object and lane detection.
Varied performance requirements. Autonomous driving algorithms
need to cope with numerous and complicated driving scenarios
with different performance requirements. For example, when driving
on highways, the perception module requires at least 30 FPS but
the number of objects to be detected may be limited to cars, lanes
and traffic signs; in urban areas, it requires 20 FPS but with a larger
number of objects to detect; in school areas, it may require 15 FPS
but needs higher accuracy, especially for pedestrian detection.

Table 4: Peak performance and accuracy under different data precisions of SkyNet [19].

| Device | Mul. Precision | Max. GMACs | # of DSPs | Accuracy |
|---|---|---|---|---|
| Xilinx Ultra96 | <9-bit, 11-bit> | 90 | 360 | 72.7% |
| Xilinx Ultra96 | <9-bit, 10-bit> | 180 | 360 | 71.2% |
| Xilinx Ultra96 | <8-bit, 11-bit> | 180 | 360 | 68.8% |
| Xilinx Ultra96 | <8-bit, 10-bit> | 180 | 360 | 68.0% |
| Intel 5AGXA1 | <9-bit, 9-bit> | 180 | 240 | 70.7% |
| Intel 5AGXA1 | <12-bit, 12-bit> | 120 | 240 | 72.9% |

Table 5: Accelerator performance of SkyNet [19] under different input image sizes (after resizing) on Xilinx Ultra96.

| Input Size | FM Precision | Latency | Accuracy |
|---|---|---|---|
| 320 × 160 | 9-bit | 40 ms | 72.7% |
| 340 × 180 | 9-bit | 65 ms | 72.8% |
| 320 × 160 | 8-bit | 33 ms | 68.8% |
| 340 × 180 | 8-bit | 49 ms | 69.0% |
Given such variations in sub-functions, input sources and performance
requirements, the AI algorithms suited to each situation
will be very different. Accordingly, the overall pipeline, including
other traditional algorithms, will differ significantly, and all
variants have to run on the same centralized electronic control unit (ECU)
platform. Thus, an automatic NAIS co-design flow will enable us to
explore the optimal solution under each situation.
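As a rough illustration of how such scenario-dependent requirements could be handed to a co-design flow, the sketch below records the three example scenarios from the text and converts each frame-rate target into a per-frame latency budget. The names `ScenarioSpec`, `SCENARIOS` and `latency_budget_ms` are hypothetical and are not part of any NAIS tooling; only the FPS values and the qualitative notes come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioSpec:
    """Hypothetical record of one driving scenario's perception requirement."""
    name: str
    min_fps: float  # minimum frame rate quoted in the text
    note: str       # qualitative requirement quoted in the text


def latency_budget_ms(spec: ScenarioSpec) -> float:
    """Per-frame latency budget implied by a frame-rate target."""
    return 1000.0 / spec.min_fps


SCENARIOS = [
    ScenarioSpec("highway", 30, "few object classes: cars, lanes, traffic signs"),
    ScenarioSpec("urban", 20, "larger number of objects to detect"),
    ScenarioSpec("school", 15, "higher accuracy, especially for pedestrians"),
]

for spec in SCENARIOS:
    print(f"{spec.name:8s} >= {spec.min_fps:2.0f} FPS "
          f"-> {latency_budget_ms(spec):5.1f} ms per frame ({spec.note})")
```

Each record would then act as one constraint set for a separate co-design run.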
6 EXPERIMENT RESULTS
We first demonstrate that FPGA-oriented IP and DNN design will
have a large impact on accelerator performance. We use SkyNet [19],
a light-weight object detection network, as the baseline. Table 4
shows that different data precisions result in a 30% to 50% difference
in peak performance at 250 MHz on Xilinx and Intel FPGAs,
while the accuracy does not change dramatically. This implies that
data precision is sometimes a more sensitive design factor for the FPGA
accelerator than for the DNN model. Exploiting device-oriented NAIS
co-design can take advantage of such differences in sensitivity and
come up with DNNs that best match the hardware.
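The peak numbers in Table 4 are consistent with a simple model in which peak throughput equals the DSP count times the number of multiply-accumulates (MACs) each DSP completes per cycle times the clock frequency. The sketch below checks this arithmetic. The per-DSP packing factors are an assumption inferred by dividing the Table 4 values by DSP count and frequency; they are not stated in the text, and the function name `peak_gmacs` is hypothetical.

```python
FREQ_MHZ = 250  # clock frequency quoted in the text


def peak_gmacs(num_dsps: int, macs_per_dsp: int, freq_mhz: float = FREQ_MHZ) -> float:
    """Peak throughput in GMACs: DSP count x MACs per DSP per cycle x clock."""
    return num_dsps * macs_per_dsp * freq_mhz / 1000.0


# (device, DSP count, multiplier precision, ASSUMED MACs per DSP, Table 4 GMACs)
ROWS = [
    ("Xilinx Ultra96", 360, "<9-bit, 11-bit>", 1, 90),
    ("Xilinx Ultra96", 360, "<9-bit, 10-bit>", 2, 180),
    ("Xilinx Ultra96", 360, "<8-bit, 11-bit>", 2, 180),
    ("Xilinx Ultra96", 360, "<8-bit, 10-bit>", 2, 180),
    ("Intel 5AGXA1", 240, "<9-bit, 9-bit>", 3, 180),
    ("Intel 5AGXA1", 240, "<12-bit, 12-bit>", 2, 120),
]

for device, dsps, precision, packing, reported in ROWS:
    estimate = peak_gmacs(dsps, packing)
    print(f"{device:15s} {precision:17s} {estimate:6.1f} GMACs (Table 4: {reported})")
```

Under this reading, a one-bit change in operand width can halve or double how many multiplications fit into one DSP, which is why the peak moves in large steps while accuracy changes only gradually.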
Table 6: Different DNNs generated by our co-design framework with different performance constraints and input image resolutions on Xilinx UltraScale+ ZCU102.

| Input resolution | Target: 15 FPS | Target: 20 FPS | Target: 30 FPS |
|---|---|---|---|
| 400 × 400 | Bundle 5, 13 replications, max. 1264 ch, mAP 46.1 | Bundle 4, 14 replications, max. 1008 ch, mAP 42.4 | Bundle 4, 13 replications, max. 1024 ch, mAP 43.9 |
| 300 × 300 | Bundle 1, 15 replications, max. 1120 ch, mAP 45.4 | Bundle 1, 14 replications, max. 784 ch, mAP 44.3 | Bundle 5, 15 replications, max. 736 ch, mAP 39.7 |

Bundle 1: conv_3x3_stride1
Bundle 2: conv_5x5_stride1
Bundle 3: conv_3x3_stride1 + conv_5x5_stride1
Bundle 4: dw-conv_3x3_stride1 + conv_1x1
Bundle 5: dw-conv_5x5_stride1 + conv_1x1

Table 5 shows another example regarding BRAM considerations
on the Xilinx Ultra96 FPGA using RAMB18E1. It shows that when the
precisions of feature map data are the same, the model accuracy
shows negligible difference but the latency differs by 32% and 38%
between the two input image sizes. This is because when the
image is resized to 340 × 180, the total number of bits in one image tile
exceeds 18Kb and occupies two memory blocks, whereas when resized
to 320 × 160, one image tile (following the same tiling rule) only
consumes one memory block. Beyond the computation capacity,
the 340 × 180 input results in less efficient BRAM utilization and
more off-chip data movements, and thus longer latency.
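The block-RAM effect described above can be reproduced with a few lines of arithmetic. The sketch below counts how many 18Kb blocks one image tile occupies. The tiling rule is not given in the text, so the 5 × 5 grid and the single-channel tile used here are assumptions chosen purely for illustration; the helper name `bram_blocks_per_tile` is likewise hypothetical.

```python
import math

BRAM18_BITS = 18 * 1024  # capacity of one 18Kb block RAM
GRID = 5                 # ASSUMPTION: image is split into a 5 x 5 grid of tiles


def bram_blocks_per_tile(width: int, height: int, bits_per_pixel: int,
                         grid: int = GRID) -> int:
    """Number of 18Kb blocks needed to hold one single-channel image tile."""
    tile_w = math.ceil(width / grid)
    tile_h = math.ceil(height / grid)
    tile_bits = tile_w * tile_h * bits_per_pixel
    return math.ceil(tile_bits / BRAM18_BITS)


for width, height in [(320, 160), (340, 180)]:
    for bits in (9, 8):
        blocks = bram_blocks_per_tile(width, height, bits)
        print(f"{width} x {height}, {bits}-bit: {blocks} block(s) per tile")
```

With these assumed tiles, a 320 × 160 input fits in one block at both precisions, while a 340 × 180 input spills into a second block, mirroring the behaviour reported for Table 5.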
We then apply our NAIS methodology to an object detection
task on FPGA for autonomous driving under different input image
resolutions and latency constraints. The target device is the Xilinx
UltraScale+ ZCU102, a large-scale FPGA with 599,550 logic cells, 32.1Mb
of block RAM and 2,520 DSP slices. We set the performance requirements
to 15 FPS, 20 FPS and 30 FPS, corresponding
to different driving speeds in busy downtown, urban street and
highway scenarios, respectively. We also consider two input resolutions,
400 × 400 and 300 × 300. As shown in Table 6, under each constraint
and input resolution, our co-design engine proposes a DNN that
is built by replicating a pre-optimized Bundle, as described in Section 3.2.
In each scenario, we show the Bundle used for building the
DNN, as well as the number of replications and the maximum number
of channels. The DNNs are trained and tested on a subset of the VOC
2012 dataset, including bike, car, bus and person, which are most related
to autonomous driving. The results show that with different inputs and
target performance, the generated DNNs are different. For example,
when the input resolution is 400 × 400, more light-weight depth-wise
Bundles are selected, such as Bundles 4 and 5; when the input
resolution is 300 × 300, Bundle 1 appears preferable. This result
implies that such a co-design is helpful in searching for the best
DNNs within performance constraints under varied circumstances.
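Conceptually, choosing a Bundle and a replication count under a frame-rate target is a constrained search: enumerate candidates, discard those whose estimated latency exceeds the frame budget, and keep the most accurate survivor. The sketch below illustrates only that idea. It is not the search algorithm of our co-design engine, and the estimator callables and all toy numbers are arbitrary placeholders with no empirical meaning.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple

Candidate = Tuple[int, int]  # (bundle_id, num_replications)


def search_bundle_dnn(
    bundles: Iterable[int],
    replications: Iterable[int],
    est_latency_ms: Callable[[Candidate], float],
    est_accuracy: Callable[[Candidate], float],
    target_fps: float,
) -> Optional[Candidate]:
    """Return the feasible candidate with the highest estimated accuracy."""
    budget_ms = 1000.0 / target_fps
    feasible = [
        cand for cand in product(bundles, replications)
        if est_latency_ms(cand) <= budget_ms
    ]
    return max(feasible, key=est_accuracy, default=None)


# PLACEHOLDER estimators: arbitrary per-replication costs and scores.
TOY_MS_PER_REPLICATION = {1: 4.0, 4: 2.5, 5: 3.0}
TOY_SCORE_PER_REPLICATION = {1: 1.0, 4: 0.8, 5: 0.9}


def toy_latency_ms(cand: Candidate) -> float:
    return TOY_MS_PER_REPLICATION[cand[0]] * cand[1]


def toy_accuracy(cand: Candidate) -> float:
    return TOY_SCORE_PER_REPLICATION[cand[0]] * cand[1]


for fps in (15, 20, 30):
    best = search_bundle_dnn([1, 4, 5], range(10, 16),
                             toy_latency_ms, toy_accuracy, fps)
    print(f"{fps} FPS -> best (bundle, replications) = {best}")
```

In a real flow the two estimators would be replaced by hardware latency models and trained-accuracy measurements, and exhaustive enumeration would give way to a guided search.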
For the GPU platform, we first show a summary of popular DNN
models regarding their inference latency on various GPU platforms,
including Titan V, 1080 Ti and 2080 Ti. The summary is shown
in Table 7, where part of the data are obtained from an open-source
repository [41]. The input images are 224 × 224 × 3 with a batch
size of 16, in single and half precision. In addition to powerful
GPUs, we also compare the performance of YOLO
v3 [43] and SkyNet [19] on an embedded GPU, Nvidia Jetson TX2,
where SkyNet shows appealing real-time performance (Table 8).

Table 7: Inference latency (ms) of DNNs on different GPUs [41].

| DNN Model | Titan V Single | Titan V Half | 1080 Ti Single | 1080 Ti Half | 2080 Ti Single | 2080 Ti Half |
|---|---|---|---|---|---|---|
| Densenet121 | 17.49 | 11.87 | 23.53 | 18.62 | 16.70 | 13.47 |
| Densenet161 | 39.33 | 22.88 | 51.53 | 42.26 | 34.64 | 27.10 |
| Densenet169 | 23.63 | 16.04 | 31.82 | 25.27 | 21.94 | 17.54 |
| Densenet201 | 30.93 | 20.70 | 47.73 | 33.01 | 28.89 | 22.51 |
| Resnet18 | 4.82 | 3.09 | 6.43 | 5.65 | 4.89 | 3.42 |
| Resnet34 | 8.43 | 5.12 | 10.97 | 9.77 | 8.62 | 5.67 |
| Resnet50 | 14.27 | 7.61 | 20.17 | 16.26 | 14.65 | 9.04 |
| Resnet101 | 23.96 | 12.80 | 33.02 | 27.49 | 24.57 | 14.51 |
| Resnet152 | 34.22 | 18.11 | 47.02 | 38.88 | 35.15 | 20.52 |
| Vgg16 | 22.94 | 10.96 | 33.73 | 30.69 | 23.70 | 16.14 |
| Vgg19 | 27.55 | 12.72 | 39.95 | 36.71 | 28.03 | 17.89 |

Table 8: Inference performance on embedded GPU Jetson TX2.

| Model | 1280 × 720 | 320 × 160 | Accuracy | Dataset |
|---|---|---|---|---|
| YOLO v3 | 3.3 fps | 12.7 fps | 51.5 @ 320 × 320 (mAP50) | COCO |
| SkyNet | 20.5 fps | 67.3 fps | 73.1 @ 320 × 160 (IoU) | DAC-SDC [42] |

SkyNet is a light-weight object detection network we proposed that won
the 2019 DAC-SDC competition [44]. It is composed of basic Bundles,
where each Bundle has a depth-wise 3 × 3 convolution layer
followed by a point-wise 1 × 1 convolution layer. Instead of a traditional
top-down design method, which starts from a large DNN and
prunes it until the required performance is reached, SkyNet was designed
by utilizing our proposed NAIS idea discussed in Section 3.2.
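To see why a depth-wise plus point-wise Bundle is light-weight, the sketch below compares its multiply-accumulate (MAC) count with that of a standard convolution of the same kernel size. It assumes stride 1 and "same" padding so that the output feature map keeps the input's spatial size. The layer shape in the usage lines is hypothetical and is not taken from SkyNet.

```python
def standard_conv_macs(h: int, w: int, c_in: int, c_out: int, k: int = 3) -> int:
    """MACs of a standard k x k convolution (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k


def dw_pw_bundle_macs(h: int, w: int, c_in: int, c_out: int, k: int = 3) -> int:
    """MACs of a depth-wise k x k conv followed by a point-wise 1 x 1 conv."""
    depthwise = h * w * c_in * k * k  # one k x k filter per input channel
    pointwise = h * w * c_in * c_out  # 1 x 1 conv mixes channels
    return depthwise + pointwise


# Hypothetical layer shape, for illustration only.
H, W, C_IN, C_OUT = 40, 20, 96, 192
std = standard_conv_macs(H, W, C_IN, C_OUT)
bundle = dw_pw_bundle_macs(H, W, C_IN, C_OUT)
print(f"standard 3x3 conv : {std:,} MACs")
print(f"dw 3x3 + pw 1x1   : {bundle:,} MACs ({bundle / std:.1%} of standard)")
```

The ratio reduces to 1/c_out + 1/k², so for a 3 × 3 kernel and a wide layer the Bundle needs roughly one ninth of the MACs.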
Our SkyNet design on Jetson TX2 is an initial demonstration of
the potential of the NAIS approach. As a future research direction,
the GPU implementation should also be optimized during NAIS.
7 RELATED WORK
For FPGA-based DNN implementations, techniques such as quantization
[45, 46] and model compression [47] are used to reduce
DNN model size, while FPGA resource allocation [48] and fine-grained
pipeline architectures [11] are proposed to deliver low-latency
accelerators. Other works explore FPGA accelerator parameter
configuration [13, 49, 50] and optimizations such as loop unrolling
and pipelining, but they do not explore configurations on
the DNN side. Besides, there are works on DNN and FPGA co-design,
which explore both DNN model and accelerator designs.
The work in [51] discussed DNNs and accelerators for embedded
vision applications. It first designed a specific DNN accelerator targeting
SqueezeNet [52], and then proposed a tailored DNN model
called SqueezeNext according to the hardware utilization of different
layers of SqueezeNet. Another work [53] proposed a framework
named FNAS, a reinforcement learning based NAS that
incorporates the estimated FPGA inference latency into the reward
function. However, none of these works applied simultaneous DNN
and FPGA implementation search, as NAIS proposes.
On the other hand, for GPU-based DNN search and implementation,
NAS has seen great success in designing high-quality DNN
models that outperform manually designed ones [16]. Most early
NAS works focus purely on improving model accuracy, while recent
works have been conducting performance-aware searches by
incorporating estimated hardware performance, such as inference
latency on GPU or CPU, into the NAS engine. One representative
work [17] addressed the high memory consumption as well
as the high computational cost of differentiable NAS, and solved
the problem with a gradient-based approach to enable hardware-aware
neural architecture search. Another work [28] discussed
device-aware neural architecture search by extending NAS into
a multi-objective problem. Targeting different devices, their
framework came up with a Pareto frontier regarding DNN accuracy
and energy. Though these works are closely related to DNN and
GPU co-design, they missed the opportunities in device-oriented
implementation optimizations and their possible guidance to DNN
design, which is an essential goal of NAIS.
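For readers unfamiliar with the multi-objective view, a Pareto frontier over accuracy and energy is simply the set of designs that no other design beats on both objectives. The sketch below is a generic illustration of that definition, not the search procedure of [28]; the `Design` type and the candidate values are invented placeholders.

```python
from typing import List, NamedTuple


class Design(NamedTuple):
    name: str
    accuracy: float  # higher is better
    energy: float    # lower is better


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.accuracy >= b.accuracy and a.energy <= b.energy
    strictly_better = a.accuracy > b.accuracy or a.energy < b.energy
    return no_worse and strictly_better


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep every design that is not dominated by any other design."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs)]


# Placeholder candidates, for illustration only.
CANDIDATES = [
    Design("A", accuracy=0.70, energy=5.0),
    Design("B", accuracy=0.72, energy=8.0),
    Design("C", accuracy=0.68, energy=6.0),   # dominated by A
    Design("D", accuracy=0.75, energy=12.0),
]

for design in pareto_front(CANDIDATES):
    print(design)
```

NAIS would add implementation-level variables to each candidate, so the frontier would be traced over joint DNN and implementation choices rather than over DNN models alone.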
8 CONCLUSIONS AND FUTURE WORKS
In this paper, we proposed a DNN and implementation co-design
methodology, called Neural Architecture and Implementation Search
(NAIS), to explore the opportunities of boosting the development
productivity and efficiency of mapping AI algorithms to targeted
platforms. NAIS searches DNN models and the underlying
hardware implementations simultaneously in a pre-defined co-design
space, with the goal of converging to the best hardware-specific
solution efficiently. We first demonstrated how NAIS works
for DNN/FPGA co-design, and then discussed the NAIS approach
for DNN/GPU co-design. The NAIS approach can generate various
design solutions with different accuracy, latency and computational
complexity, which helps to find an optimized implementation
for application deployment. We believe that such a NAIS design
methodology can benefit the development productivity and the
algorithm/hardware system quality for general DNN algorithms.
We also provided a detailed application-level case study on how
autonomous driving can benefit from such a NAIS approach. Our
future work includes systematic design space definition and
application-specific full-stack optimizations.
ACKNOWLEDGEMENT
This work is supported in part by XMotors.ai, Semiconductor Research
Corporation (SRC), the IBM-Illinois Center for Cognitive
Computing Systems Research (C3SR) and Advanced Digital Sciences
Center (ADSC) in Singapore. The authors would also like to thank
Vibhakar Vemulapati for helpful discussions.
REFERENCES
[1] Alex Krizhevsky et al. Imagenet classification with deep convolutional neural
networks. In NeurIPS, 2012.
[2] Christian Szegedy et al. Going deeper with convolutions. In CVPR, 2015.
[3] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv:1409.1556, 2014.
[4] Kaiming He et al. Deep residual learning for image recognition. In CVPR, 2016.
[5] João Guerreiro et al. Multi-kernel auto-tuning on GPUs: Performance and energy-
aware optimization. In PDP, 2015.
[6] TensorRT. https://developer.nvidia.com/tensorrt.
[7] cuDNN. https://developer.nvidia.com/cuDNN.
[8] Edge TPU. https://cloud.google.com/edge-tpu/.
[9] Chen Zhang et al. Optimizing FPGA-based accelerator design for deep convolu-
tional neural networks. In FPGA, 2015.
[10] Hardik Sharma et al. From high-level deep neural models to FPGAs. In MICRO,
2016.
[11] Xiaofan Zhang et al. DNNBuilder: an automated tool for building high-
performance DNN hardware accelerators for FPGAs. In ICCAD, 2018.
[12] Chen Zhang et al. Caffeine: Towards uniformed representation and acceleration
for deep convolutional neural networks. IEEE TCAD, 2018.
[13] Yao Chen et al. Cloud-DNN: An open framework for mapping DNN models to
cloud FPGAs. In FPGA, 2019.
[14] Yu-Hsin Chen et al. Eyeriss: A spatial architecture for energy-efficient dataflow
for convolutional neural networks. In ACM SIGARCH Computer Architecture
News, volume 44, pages 367–379. IEEE Press, 2016.
[15] Shouyi Yin et al. A high energy efficient reconfigurable hybrid neural network
processor for deep learning applications. IEEE JSSC, 53(4):968–982, 2017.
[16] Thomas Elsken et al. Neural architecture search: A survey. Journal of Machine
Learning Research, 20(55):1–21, 2019.
[17] Han Cai et al. ProxylessNAS: Direct neural architecture search on target task and
hardware. arXiv:1812.00332, 2018.
[18] Cong Hao et al. FPGA/DNN co-design: An efficient design methodology for IoT
intelligence on the edge. In DAC, 2019.
[19] Xiaofan Zhang et al. SkyNet: A Champion Design for DAC-SDC on Low Power
Object Detection. arXiv:1906.10327, 2019.
[20] Xilinx Block RAM. https://www.xilinx.com/support/documentation/user_guides/
ug473_7Series_Memory_Resources.pdf.
[21] Intel Block RAM. https://perso-etis.ensea.fr/olivier.romain/Teaching_2A_IUT_
UCP_files/ug_ram_rom.pdf.
[22] Xilinx DSP48E1. https://www.xilinx.com/support/documentation/user_guides/
ug479_7Series_DSP48E1.pdf.
[23] Xilinx DSP48E2. https://www.xilinx.com/support/documentation/user_guides/
ug579-ultrascale-dsp.pdf.
[24] Intel Stratix V DSP. https://www.intel.com/content/dam/www/programmable/
us/en/pdfs/literature/wp/wp-01131-stxv-dsp-architecture.pdf.
[25] Intel Arria V DSP. https://www.intel.com/content/dam/www/programmable/us/
en/pdfs/literature/hb/arria-v/av_51001.pdf.
[26] Intel Stratix 10 DSP. https://www.intel.com/content/dam/www/programmable/
us/en/pdfs/literature/hb/stratix-10/ug-s10-dsp.pdf.
[27] Intel Arria 10 DSP. https://www.intel.com/content/dam/www/programmable/
us/en/pdfs/literature/hb/arria-10/a10_overview.pdf.
[28] An-Chieh Cheng et al. Searching toward pareto-optimal device-aware neural
architectures. In ICCAD, 2018.
[29] Diana Marculescu et al. Hardware-aware machine learning: modeling and opti-
mization. In ICCAD, 2018.
[30] Keren Zhou et al. A performance analysis framework for exploiting GPU mi-
croarchitectural capability. In ICS, 2017.
[31] Yaohung M Tsai et al. Performance-portable autotuning of OpenCL kernels for
convolutional layers of deep neural networks. In MLHPC. IEEE Press, 2016.
[32] Barret Zoph et al. Learning transferable architectures for scalable image recogni-
tion. arXiv:1707.07012, 2017.
[33] Abdul Dakkak et al. Frustrated with replicating claims of a shared model? a
solution. arXiv:1811.09737, 2019.
[34] Cong Hao et al. A hybrid GPU + FPGA system design for autonomous driving
cars. In SiPS, 2019.
[35] https://www.nvidia.com/en-us/self-driving-cars/drive-platform/hardware/.
[36] https://www.mobileye.com/our-technology/evolution-eyeq-chip/.
[37] https://www.teslarati.com/tesla-tsla-fsd-chip-4-years-ahead-analyst/.
[38] M.R Nithin et al. Advanced driver assistance system using FPGA. White Paper,
2014.
[39] Ryosuke Okuda et al. A survey of technical trend of ADAS and autonomous
driving. VLSI-DAT, 2014.
[40] https://www.xilinx.com/applications/megatrends/automotive-driver-assist.html.
[41] https://github.com/ryujaehun/pytorch-gpu-benchmark.
[42] DJI. DAC-SDC dataset. https://github.com/xyzxinyizhang/2018-DAC-System-Design-Contest, 2018.
[43] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement.
arXiv:1804.02767, 2018.
[44] DAC-SDC. http://www.cse.cuhk.edu.hk/~byu/2019-DAC-SDC/index.html.
[45] Jiantao Qiu et al. Going deeper with embedded FPGA platform for convolutional
neural network. In FPGA, 2016.
[46] Gong Cheng et al. µL2Q: An ultra-low loss quantization method for DNN. In
IJCNN, 2019.
[47] Song Han et al. ESE: Efficient speech recognition engine with sparse LSTM on
FPGA. In FPGA, 2017.
[48] Xiaofan Zhang et al. High-performance video content recognition with long-term
recurrent convolutional network for FPGA. In FPL, 2017.
[49] Mohammad Motamedi et al. Design space exploration of FPGA-based deep
convolutional neural networks. In ASP-DAC, 2016.
[50] Guanwen Zhong et al. Design space exploration of FPGA-based accelerators
with multi-level parallelism. In DATE, 2017.
[51] Kiseok Kwon et al. Co-design of deep neural nets and neural net accelerators for
embedded vision applications. In DAC, 2018.
[52] Forrest N Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer
parameters and <0.5MB model size. arXiv:1602.07360, 2016.
[53] Weiwen Jiang et al. Accuracy vs. efficiency: Achieving both through FPGA-implementation
aware neural architecture search. In DAC, 2019.
