AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs
  and ASICs by Xu, Pengfei et al.
Pengfei Xu1, Xiaofan Zhang2, Cong Hao2, Yang Zhao1, Yongan Zhang1, Yue Wang1, Chaojian Li1,
Zetong Guan1, Deming Chen2, Yingyan Lin1
1Rice University, TX, USA, 2University of Illinois at Urbana-Champaign, IL, USA
{eiclab, zy34, yz87, yw68, cl114, zg20, yingyan.lin}@rice.edu, {xiaofan3, congh, dchen}@illinois.edu
ABSTRACT
Recent breakthroughs in Deep Neural Networks (DNNs) have fu-
eled a growing demand for domain-specific hardware accelerators
(i.e., DNN chips). However, designing DNN chips is non-trivial
because: (1) mainstream DNNs have millions of parameters and
operations; (2) the design space is large due to the numerous de-
sign choices of dataflows, processing elements, memory hierarchy,
etc.; and (3) an algorithm/hardware co-design is needed to allow
the same DNN functionality to have a different decomposition,
which would require different hardware IPs that correspond to
dramatically different performance/energy/area tradeoffs. There-
fore, DNN chips often take months to years to design and require a
large team of cross-disciplinary experts. To enable fast and effective
DNN chip design, we propose AutoDNNchip, a DNN chip generator that can
automatically generate both FPGA- and ASIC-based DNN chip implementations
(i.e., synthesizable RTL code with an optimized algorithm-to-hardware
mapping, or dataflow) given DNNs from machine learning frameworks (e.g.,
PyTorch) for a designated application and dataset, without humans in the
loop. Specifically,
AutoDNNchip consists of two integrated enablers: (1) a Chip Predic-
tor, built on top of a graph-based accelerator representation, which
can accurately and efficiently predict a DNN accelerator’s energy,
throughput, latency, and area based on the DNN model parameters,
hardware configuration, technology-based IPs, and platform con-
straints; and (2) a Chip Builder, which can automatically explore
the design space of DNN chips (including IP selection, block config-
uration, resource balance, etc.), optimize chip design via the Chip
Predictor, and then generate synthesizable RTL code with optimized
dataflows to achieve the target design metrics. Experimental results
show that our Chip Predictor’s predicted performance differs from
real-measured ones by <10% when validated using 15 DNN models
and 4 platforms (edge-FPGA/TPU/GPU and ASIC). Furthermore,
both the FPGA- and ASIC-based DNN accelerators generated by our
AutoDNNchip can achieve better (up to 3.86× improvement) per-
formance than that of expert-crafted state-of-the-art accelerators,
showing the effectiveness of AutoDNNchip. Our open-source code
can be found at https://github.com/RICE-EIC/AutoDNNchip.git.
FPGA ’20, February 23–25, 2020, Seaside, CA, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7099-8/20/02. . . $15.00
https://doi.org/10.1145/3373087.3375306
ACM Reference Format:
Pengfei Xu, Xiaofan Zhang, Cong Hao, Yang Zhao, Yongan Zhang, Yue Wang,
Chaojian Li, Zetong Guan, Deming Chen, Yingyan Lin. 2020. AutoDNNchip:
An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs.
In 2020 ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (FPGA'20), February 23-25, 2020, Seaside, CA, USA. ACM, New York,
NY, USA, 11 pages. https://doi.org/10.1145/3373087.3375306
1 INTRODUCTION
We have seen the rapid adoption of Deep Neural Networks (DNNs)
for solving real-life problems, such as image classification [1, 2],
object detection [3], natural language processing [4], etc. Although
DNNs enable high-quality inferences, they also require a large
amount of computation and memory demand during deployment
due to their inherently immense complexity [5–9]. Moreover, DNN-
based applications often require not only high inference accuracy,
but also aggressive hardware performance, including high through-
put, low end-to-end latency, and limited energy consumption. Re-
cently, we have seen intensive studies on DNN accelerators in
hardware, which attempt to take advantage of different hardware
design styles, such as GPUs, FPGAs, and ASICs, to improve the
speed and efficiency of DNN inference and training [10? –21].
However, developing customized DNN accelerators presents
significant challenges as it asks for cross-disciplinary knowledge
in machine learning, micro-architecture, and physical chip design.
Specifically, building accelerators on FPGAs or ASICs inevitably involves
(1) customized architectures for running DNN workloads, (2) RTL
programming for implementing accelerator prototypes, and (3) iterative
verification for validating functional correctness. The whole task
requires designers to have a deep understanding of both DNN algorithms
and hardware design. In response
to the intense demands and challenges of designing DNN accel-
erators, we have seen rapid development of high-level synthesis
(HLS) design flow [22–25] and DNN design automation frame-
works [16, 26–30] that improve the hardware design efficiency by
allowing DNN accelerator design from high-level algorithmic de-
scriptions and using pre-defined high-quality hardware IPs. Still,
they either rely on hardware experts to trim down the large de-
sign space (e.g., use pre-defined/fixed architecture templates and
explore other factors [16, 29]) or conduct merely limited design ex-
ploration and optimization, hindering the development of optimal
DNN accelerators that can be deployed into various platforms.
To address the challenges above, we propose AutoDNNchip, an
end-to-end automation tool for generating optimized FPGA- and
ASIC-based accelerators from machine learning frameworks (e.g.,
PyTorch/TensorFlow) and providing fast and accurate performance
estimations of hardware accelerators implemented on various targeted
devices. The main contributions of this paper are as follows:
arXiv:2001.03535v4 [cs.DC] 10 Jun 2020
• One-for-all Design Space Description. We make use of a graph-based
representation that can unify design factors in all of the three design
abstraction levels (including IP, architecture, and hardware-mapping
levels) of DNN accelerator design, allowing highly flexible architecture
configuration, scalable architecture/IP/mapping co-optimization, and
algorithm-adaptive accelerator design.
• Chip Predictor. Built on top of the above design space description, we
propose a DNN Chip Predictor, a multi-grained performance
estimation/simulation tool, which includes a coarse-grained,
analytical-model-based mode and a fine-grained, run-time-simulation-based
mode. Experiments using 15 DNN models and 4 platforms
(edge-FPGA/TPU/GPU and ASIC) show that our Chip Predictor's predictions
are within 10% of the measured energy/latency/resource consumption.
• Chip Builder. We further propose a DNN Chip Builder, which features a
two-stage Design Space Exploration (DSE) methodology. Specifically, our
Chip Builder realizes: (1) an architecture/IP
design based on the Chip Predictor’s coarse-grained, analytical-
model based prediction for a 1st-stage fast exploration and op-
timization, and (2) an IP/pipeline design based on the Chip Pre-
dictor’s fine-grained, run-time-simulation based prediction as a
2nd-stage IP-pipeline co-optimization. Experiments show that
the Chip Builder’s 1st-stage DSE can efficiently rule out infeasible
choices, while its 2nd-stage co-optimization can effectively boost
the performance of remaining design candidates, e.g., 36.46%
throughput improvement and 2.4× idle cycles reduction.
• AutoDNNchip. Integrating the aforementioned two enablers (i.e., Chip
Predictor and Chip Builder), we develop AutoDNNchip, which can
automatically generate optimized DNN accelerator implementations (i.e.,
synthesizable RTL) given the user-defined DNN models from machine
learning frameworks (e.g., PyTorch), application-driven specifications
(e.g., energy and latency), and resource budget (e.g., size of the
processing array and memories). Experiments demonstrate that the
optimized FPGA- and ASIC-based DNN accelerators generated by AutoDNNchip
outperform the recent award-winning design [31] by 11% and a
state-of-the-art accelerator [32] by up to 3.86×.
As an automated DNN accelerator design tool, AutoDNNchip
is the first to highlight all of the following features: (1) efficient
and accurate performance prediction of DNN accelerators on 4
platforms, enabling fast optimal algorithm-to-accelerator mapping
design and algorithm/accelerator co-design/co-optimization; (2)
a design space description that unifies the descriptions of design
factors from all of the three design abstraction levels in DNN accel-
erators into one directed graph, supporting arbitrary accelerator
architectures (e.g., both homogeneous and heterogeneous IPs and
their inter-connections); and (3) automatic generation of both FPGA- and
ASIC-based DNN accelerator implementations that outperform expert-crafted
state-of-the-art designs for various applications.
2 BACKGROUND AND RELATED WORKS
Figure 1: Overview of the proposed AutoDNNchip framework, which accepts
user-defined DNN models/datasets and application-driven specifications to
automatically generate optimized FPGA- or ASIC-based DNN accelerator
designs. [Diagram: the inputs (DNN models; application specs and budget)
feed AutoDNNchip (Chip Predictor and Chip Builder), whose output is C++
code for FPGA or RTL code for ASIC.]
FPGA- and ASIC-based DNN Accelerators. There has been intensive study in
customized FPGA- and ASIC-based DNN accelerators. The accelerator in [10]
uses loop tiling for accelerating convolutional layers on FPGAs. The
DNNBuilder accelerator [16] applies
an optimal resource allocation strategy, fine-grained layer-based
pipeline, and column-based cache to deliver high-quality FPGA-
based DNN accelerators. The work in [13] proposes a throughput-
oriented accelerator with multiple levels (i.e., task, layer, loop, and
operator levels) of parallelisms. The recent designs in [31, 33] intro-
duce a hardware-efficient DNN and accelerator co-design strategy
by considering both algorithm and hardware optimizations, using
DNN building blocks (called Bundles) to capture hardware con-
straints. For ASIC-based DNN accelerators, efforts have been made
in both industry and academia, where representative ones include
TPU [17, 18], ShiDianNao [20], and Eyeriss [21], and different ac-
celerators exploit different optimizations for various applications.
DNN Accelerator Performance Prediction. For designing
FPGA-based DNN accelerators, current practice usually relies on
roofline models [10] or customized analytical tools [13, 16] to es-
timate the achievable performance. For ASIC-based accelerators,
recently published designs [21, 34, 35] introduce various perfor-
mance prediction methods. Eyeriss [21] proposes an energy model
for capturing the energy overhead of the customized memory and
computation units and a delay model that simplifies the latency cal-
culation. Similarly, MAESTRO [34] develops an energy estimation
model that considers hardware design configurations and memory
access behaviors, while Timeloop [35] adopts a loop-based descrip-
tion of targeted workloads and analyzes the data movement and
memory access for latency estimation.
DNN Accelerator Generation. The tremendous need for de-
veloping FPGA-/ASIC-based DNN accelerators motivates the de-
velopment of automated DNN accelerator generation. For example,
DeepBurning [26] is a design automation tool for building FPGA-
based DNN accelerators with customized design parameters using
a pre-constructed RTL module library. DNNBuilder [16] and FP-
DNN [28] propose end-to-end tools that can automatically generate
optimized FPGA-based accelerators from high-level DNN symbolic
descriptions in Caffe/TensorFlow frameworks. Caffeine [27] is another
automation tool that provides guidelines for choosing FPGA
hardware parameters, such as the number of processing elements
(PEs), bit precision of variables, and parallel data factors. By using
these automation tools, it is easier to bridge the gap between fast
DNN construction in popular machine learning frameworks and
slow implementation of targeted hardware accelerators.
Figure 2: AutoDNNchip's three-step design flow for the design space
exploration, optimization, and DNN-to-RTL generation.
[Diagram panels. User Specified Inputs: DNN definition
(PyTorch/TensorFlow); application spec and budget Lmax, Pmax, Rmax (for
latency, power, and resource); optimization objective Cost = obj(E, L);
target back-end (FPGA/ASIC). Hardware IP Pool: architecture template pool
and IP template pool (memory IPs: BRAM, DRAM, SRAM, etc.; data path IPs:
AXI-bus, sync FIFO, etc.; computation IPs: adder tree, MAC array, etc.),
with each design described as a directed graph. Step I, Early-Stage
Arch-IP Co-optimization: the DNN parser and the design space <N1> feed a
loop (for i = 1:N1) in which the Chip Predictor (coarse-grained mode)
predicts Ei, Li, Ri for the i-th graph Gi, and Gi is saved as a hardware
candidate <N2> if it meets Lmax, Pmax, Rmax. Step II, IP-Pipeline
Co-optimization: over the design space <N3> (for i = 1:N3), the Chip
Predictor (fine-grained mode) reports Ei, Li, Ri and the bottleneck IPj,
the pipeline/IP optimizer inserts inter-IP pipelines and updates Gi, and
Gi is saved once Costi converges, giving Nopt hardware designs with the
lowest Cost = obj(E, L). Step III, RTL Generation: an HLS generator
(Vivado, FPGA) or an RTL generator with a memory compiler (ASIC) outputs
the optimized RTL design.]
3 OVERVIEW OF AUTODNNCHIP
Fig. 1 shows an overview of the proposed AutoDNNchip, which
can automatically generate optimized FPGA- or ASIC-based DNN
accelerators as well as an optimal algorithm-to-hardware map-
ping (i.e., dataflow), according to three customized inputs: (1) the
high-level DNN descriptions trained in desired datasets, (2) the
application-driven specifications regarding DNN inference quality
and hardware performance, and (3) the available resources of tar-
geted platforms. The realization of AutoDNNchip is achieved by the
proposed One-for-all Design Space Description (see Section 4), Chip
Predictor (see Section 5), and Chip Builder (see Section 6).
One of the major challenges that AutoDNNchip needs to over-
come is the lack of effective representations of DNN accelera-
tors’ large design space given the numerous design choices (e.g.,
dataflows, the number of pipeline stages, parallelism factors, mem-
ory hierarchy, etc.), as a precise and concise representation is a
precondition of valid DNN accelerator design. To address this challenge,
we propose a One-for-all Design Space Description, which is an
object-oriented graph-based definition for DNN accelerator design that
can unify the description of design factors from all of the three design
abstraction levels into one directed graph. Furthermore, AutoDNNchip
features another two key enablers, the Chip Predictor
and the Chip Builder. Specifically, the proposed Chip Predictor can
accurately and efficiently estimate the energy, throughput, latency,
and area overhead of DNN accelerators based on the parameters
that can characterize the algorithms, hardware architectures, and
technology-based IPs. The proposed Chip Builder can automatically
(1) explore the design space of DNN accelerators (including IP se-
lection, block configuration, resource balance, etc.), (2) optimize
chip designs via the Chip Predictor, and (3) generate synthesizable
Verilog code to achieve target design metrics.
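
The early-stage feasibility check drawn in Fig. 2 (predict Ei, Li, Ri for
each candidate graph and keep it only if it meets Lmax, Pmax, Rmax) can be
sketched as follows. This is a minimal illustration rather than the
authors' code: the function name, the `predict` callback standing in for
the Chip Predictor, and the use of average power = energy / latency for
the Pmax check are all assumptions.

```python
def early_stage_filter(candidates, predict, l_max, p_max, r_max):
    """Keep the candidate graphs whose coarse predictions meet the budget."""
    feasible = []
    for graph in candidates:
        # `predict` is a hypothetical stand-in for the coarse-grained predictor.
        energy, latency, resource = predict(graph)
        power = energy / latency  # assumption: average power over the run
        if latency <= l_max and power <= p_max and resource <= r_max:
            feasible.append(graph)
    return feasible


# Toy candidates with made-up (energy, latency, resource) predictions.
toy = {
    "G1": (10.0, 2.0, 100),
    "G2": (10.0, 5.0, 100),  # violates the latency budget
    "G3": (40.0, 2.0, 100),  # violates the power budget
    "G4": (8.0, 4.0, 300),   # violates the resource budget
}
kept = early_stage_filter(toy, lambda g: toy[g], l_max=4.0, p_max=10.0, r_max=200)
print(kept)
```

Only the candidates that survive this filter would need the slower
fine-grained evaluation in the second stage.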
4 ONE-FOR-ALL DESIGN SPACE DESCRIPTION
Table 1: A summary of DNN accelerators' design factors.
Design factor        Description                       Back-end  Opt. level
-------------------  --------------------------------  --------  ------------------
B_W, B_A, B_Acc (a)  Bit precision                     F, A (b)  IP, Accuracy req.
Freq.                Clock frequency                   F, A      Arch., IP
Arch_mem             Memory tech/hierarchy/volume      A         Arch., IP, Mapping
Arch_pe              PE array architecture             F, A      Arch., IP, Mapping
Bw                   Port/Bus width for data transfer  A         Arch., IP
M_alloc              Memory allocation                 F, A      Arch., IP, Mapping
Data Schedule        DNN to accelerator mapping        F, A      Arch., IP, Mapping
(a) B_W, B_A, B_Acc: bit precision for weights, activations, accumulations
(b) A: ASIC design; F: FPGA design

Table 2: A summary of attributes for the nodes and edges in the
graph-based description
Compo. (a)  Hardware meaning                      Attributes
----------  ------------------------------------  ---------------------------------------------
Node        Memory IPs                            Impl., Freq., Vol., Prec., Dt., StM., E, L (b)
Node        Computation IPs                       Impl., Freq., Prec., StM., E, L
Node        Data Path IPs                         Impl., Freq., Bw. (c), Prec., Dt., StM., E, L
Edge        IP inter-connections (IP dependency)  Start, End (d)
(a) Compo.: graph components including nodes and directed edges;
(b) Impl.: implementation, e.g., 14nm DRAM, 28nm SRAM, DSP48E, AXI-bus,
    sync FIFO, etc.; Freq.: clock frequency (MHz); Vol.: volume/capacity
    (bits); Prec.: bit precision; Dt.: data type including weights, input
    activations, and partial sums; E/L: energy and latency overhead;
    StM.: the state machine storing all the states (including needed
    inputs and generated outputs) through the whole execution process;
(c) Bw.: port/bus width;
(d) Start & End: the start and ending node for the directed edge.

Overview. It is well known that the design space of DNN accelerators can
be very large. For effective and efficient design space exploration and
optimization, it is critical that the design space
can be precisely and concisely described, e.g., that the different
design abstraction levels of optimization in DNN accelerator de-
sign, including architecture level, IP level, and hardware-mapping
level, are considered. To this end, we adopt a One-for-all Design
Space Description that unifies the design factors of the three levels
into one directed graph. Table 1 lists the design factors which are
sufficient for most cases and the last column shows the levels of de-
sign/optimization that may influence the corresponding factors. We
can see that (1) most of the design factors are related to cross-level
optimization which also reflects the fact that DNN accelerators
have a large design space and (2) optimization at merely one level
(or one hardware component) does not guarantee overall system
performance. We thus adopt an object-oriented directed graph for
the DNN accelerator design space description, an illustrative exam-
ple of which is shown in Fig. 3. Specifically, a basic directed graph is
first constructed using the PE array architecture, memory architec-
ture and mapping/dataflow factors, where each node in the graph
denotes a computation/data-path/memory IP and each directed
edge denotes an inter-connection between nodes whose direction
is determined by the corresponding data movement’s direction.
Proper attributes (e.g., those in Table 2) are then assigned to the
nodes and edges of the directed graph in an object-oriented manner.
In the following subsections, we will briefly describe four graph-based
accelerator templates corresponding to four state-of-the-art DNN
accelerators which are stored in the Hardware IP Pool (see Fig. 2 under
the User Specified Inputs) of AutoDNNchip together with other templates
to provide a sufficient number of design candidates, and then discuss the
IP attributes for the nodes and edges.
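
To make the object-oriented directed graph concrete, the sketch below
shows one possible way to hold nodes (IPs with Table 2 style attributes)
and directed edges (data movement between IPs). The class names, field
names, and the example attribute values are illustrative assumptions and
are not taken from the AutoDNNchip code base.

```python
from dataclasses import dataclass, field


@dataclass
class IPNode:
    name: str
    kind: str         # "memory", "computation", or "data_path"
    impl: str         # Impl., e.g. "BRAM", "AXI-bus", "MAC array"
    freq_mhz: float   # Freq.
    prec_bits: int    # Prec.
    extra: dict = field(default_factory=dict)  # kind-specific: Vol., Bw., Dt., StM.


@dataclass
class AcceleratorGraph:
    nodes: dict = field(default_factory=dict)  # name -> IPNode
    edges: list = field(default_factory=list)  # (start, end) in data-flow direction

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_edge(self, start, end):
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((start, end))

    def successors(self, name):
        return [end for start, end in self.edges if start == name]


# A small graph loosely following the single-PE-array template of Fig. 4 (a).
g = AcceleratorGraph()
g.add_node(IPNode("dram", "memory", "DRAM", 200, 8))
g.add_node(IPNode("axi_bus", "data_path", "AXI-bus", 200, 8, {"Bw": 512}))
g.add_node(IPNode("bram_in", "memory", "BRAM", 200, 8, {"Dt": "input"}))
g.add_node(IPNode("bram_w", "memory", "BRAM", 200, 8, {"Dt": "weight"}))
g.add_node(IPNode("pe_array", "computation", "MAC array", 200, 8))
for start, end in [("dram", "axi_bus"), ("axi_bus", "bram_in"),
                   ("axi_bus", "bram_w"), ("bram_in", "pe_array"),
                   ("bram_w", "pe_array")]:
    g.add_edge(start, end)
print(g.successors("axi_bus"))
```

Keeping the attributes on the node objects means a predictor or optimizer
can walk the same graph without a separate description per abstraction
level, which is the property the unified description is after.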
Figure 3: An illustrative example of the graph-based design space
description for a heterogeneous architecture to accelerate a residual
block in ResNet [36], where M and C denote the output and input channels,
respectively. [Diagram: DRAM connects through AXI-bus data path IPs (for
input, weight, and output) to on-chip buffers (memory IPs) for the CONV1
input, CONV1 weight, CONV1 output, CONV2 weight, and CONV2 output, which
serve the PE arrays (computation IPs) for CONV1 and CONV2. Two example IP
attribute lists are shown: (i) Freq = 200MHz, feature map prec = 11,
weight prec = 8, StM description: inter-IP pipeline = True, M = 16
(pipeline II = 2), C = 16 (unroll = 16); (ii) Impl = SRAM, Bw = 512,
Prec = 11, Freq = 200MHz, StM description: inter-IP pipeline = False,
M = 16, C = 16 (unroll = 16).]
Graph-based Accelerator Templates. Fig. 4 shows four graph-based
accelerator template examples for describing DNN accelerators that can be
translated into real hardware implementation by applying appropriate IP
attributes. Specifically, Fig. 4 (a) shows a spatial architecture based
on a single adder-tree based computation
IP, which is a commonly-used architecture on FPGA-based accelera-
tors; Fig. 4 (b) is a graph with 2 different computation IPs, including
depth-wise convolutional (denoted as DW_CONV) and normal con-
volutional (denoted as CONV) ones commonly adopted in compact
DNN models, and two BRAM IPs that handle the memory data
arrangement for the computation IPs; Fig. 4 (c) is an architecture
template for TPU [17] type DNN accelerators using a systolic ar-
ray; and Fig. 4 (d) shows the graph-based representation for DNN
accelerators with Eyeriss [21] type architectures, where the data
path IPs (i.e., NoC IPs in Fig. 4 (d)) between PEs describe the local
data reuse patterns of inputs, outputs, and weights.
Figure 4: An illustration of 4 architecture templates in our Hardware IP
Pool including 2 architectures for both state-of-the-art FPGA- and
ASIC-based DNN accelerators. [Diagram: (a) a PE array IP with BRAM IPs
for input, weight, and output, connected to off-chip DRAM through an
AXI-bus IP; (b) DW-CONV and CONV PE array IPs with BRAM IPs for input,
DW-CONV weight, CONV weight, and output, connected to off-chip DRAM
through an AXI-bus IP; (c) a 3×3 array of MAC IPs with 1D systolic NoC
IPs for input, weight, and output, SRAM IPs for input, weight, and
output, a bus IP, and off-chip DRAM; (d) a 3×3 array of PEs, each
containing RF IPs for weight, input, and output plus a MAC IP, with a
forward NoC IP for output, multicast NoC IPs for weight and input, SRAM
IPs for input, weight, and output, a bus IP, and off-chip DRAM.]
Figure 5: A toy example of IP's state machine attribute w/o and w/
considering inter-IP pipeline effects: (a) a simple architecture with 2
IPs, i.e., one data path IP and one computation IP; the task division,
state machine, and the run-time process when (b) excluding the inter-IP
pipeline and (c) considering the inter-IP pipeline, where SD and SC
denote the state for data path IP and computation IP, respectively.
[Diagram: the data path IP (Trans. [NxN]) and the computation IP (Comp.
[NxN]) process a 2N×2N task; panels (b) and (c) show the task division,
the state machine attribute, and the Trans./Comp. execution timelines.]
IP Attributes. Table 2 summarizes the attributes for the three types of
node IPs, namely memory (e.g., BRAM and off-chip DRAM), data path (e.g.,
bus), and computation IPs, that characterize the corresponding design, as
elaborated below: (1) The Implementation or Impl. attribute refers to the
required hardware resource for implementing the IPs, e.g., DRAM and SRAM
for implementing memory IPs, and AXI-bus and NoC for implementing data
path IPs; (2) The state machine or StM. attribute is used to describe
when the IPs will update their states between computation and
loading/unloading data, where each state defines both the needed input
address and generated output address. Fig. 5 shows that different
pipeline designs can be captured by the IPs' state machine attribute:
Fig. 5 (b) and (c) illustrate two kinds of designs (w/o and w/ inter-IP
pipeline) and their corresponding state machine definition, respec-
tively, where there are more states in Fig. 5 (c) for capturing the
inter-IP pipeline between data transfer and computation IPs; (3)
The data precision or Prec. attribute refers to the IPs’ bit precision;
(4) The clock Frequency or Freq. and energy/latency or E/L attributes
capture the operating clock frequency and required energy/latency
for IPs; and (5) The port/bus width or Bw and memory volume or Vol.
attributes refer to the port/bus width of data path IPs and the memory
volume of memory IPs, respectively.
5 THE PROPOSED CHIP PREDICTOR
5.1 Overview
Figure 6: An overview of the proposed Chip Predictor. [Diagram: the DNN
model and the graph definition (architecture selection, IP design, and
hardware mapping) are the inputs; the coarse-grained mode uses no
inter-IP pipeline, while the fine-grained mode enables the inter-IP
pipeline; the outputs are energy, latency, resource consumption, and the
system bottleneck (only for the fine-grained mode).]
Figure 7: A toy example using a systolic array, illustrating: (a) the
corresponding matrix-matrix multiplication, and (b) coarse-grained and
(c) fine-grained latency estimation. [Diagram: (a) two 3×3 matrices with
entries a00..a22 and b00..b22; (b) a 3×3 array of MACs 1 to 9 fed by rows
of a and columns of b, with the critical path marked and Cycles = 15;
(c) a timeline over cycles 0, 1, 4, 7 for MAC 1, MAC 2, and MAC 9, and
the MAC 2 state machine (idle/busy): at t = 0 the check finds a00 not
ready and b01 ready, so MAC 2 stays in the idle state; at t = 1 both are
ready, so it jumps to the busy state.]
As shown in Fig. 6, the proposed Chip Predictor accepts DNN models (e.g.,
number of layers, layer structure, precision, etc.), hardware
architectures (e.g., memory hierarchy, number of PEs, NoC design, etc.),
hardware mapping, and IP design (e.g., unit energy/delay cost of a
multiply-and-accumulate (MAC) operation and memory accesses to various
memory hierarchies), and then outputs the estimated energy consumption,
latency, and resource consumption when executing the DNN in the target
accelerator defined by the given hardware architecture and hardware
mapping. First, to capture the large search space and consider all the
design abstraction levels (including architecture, IP, and hardware
mapping levels), we construct a graph-based description that serves as
one input of Chip Predictor. Second, to match the different tradeoff
requirements
of the Chip Builder’s two-stage DSE, which aims for efficient and
accurate design space exploration and optimization, our Chip Pre-
dictor adopts a mixed-granularity prediction: (1) a coarse-grained
mode that can quickly provide IP performance estimation to enable
the identification of critical paths when the inter-IP pipeline is not
considered, in order to be used for the Chip Builder’s early-stage
architecture and IP exploration and selection; and (2) a fine-grained
mode that can perform accurate performance prediction by con-
sidering the pipeline dependency between IPs based on run-time
simulations, in order to be used for the Chip Builder’s 2nd-stage
DSE that targets IP-pipeline co-optimization.
5.2 The Chip Predictor’s Coarse-grained Mode
Overview. The Chip Predictor’s coarse-grained mode is analytical-
model-based, i.e., using equations to formulate the accelerators’
energy, critical path latency, and resource consumption given a
DNN model and a graph-based hardware design description (see
Fig. 6). Specifically, the energy and latency of IPs are first calcu-
lated using: (1) analytical equations as described below and (2) the
attributes of each IP, where the unit energy/latency costs are ob-
tained from single-IP RTL implementation or simulations; and the
energy and latency consumption of the whole DNN accelerator are
formulated by considering: (1) the total energy and latency of all
IPs when executing the DNN model and (2) the energy and latency
overhead of the CPU and on-chip controller.
Analytical-model-based Intra-IP Modeling. If we define the energy,
latency, and resource utilization as $E$, $L$, and $R$, respectively, and
use $ip_{comp}$, $ip_{dp}$, and $ip_{mem}$ to denote the computation IP,
data path IP, and memory IP, respectively, the energy and latency of the
computation IPs can be formulated by:
$$E_{ip_{comp}} = e_1 + (\#states) \times (e_2 + e_{mac} \times U) \tag{1}$$
$$L_{ip_{comp}} = l_1 + (\#states) \times l_{mac} \tag{2}$$
where $(\#states)$ denotes the total number of the states in the IP state
machine; $U$ denotes the unrolling factor (PE parallelism) for the
computation IP; $e_{mac}$ and $l_{mac}$ denote the unit energy and
latency costs for a MAC operation, respectively; $e_1$ and $l_1$ are the
energy and latency overhead for warming up, i.e., configuring the data
path and pre-loading data; and $e_2$ denotes the energy overhead of
run-time control of the CPU or on-chip logic units. Meanwhile, the energy
and latency of the data path IPs can be formulated by:
$$E_{ip_{dp}} = e_3 + (\#states) \times (e_4 + V \times e_{bit}) \tag{3}$$
$$L_{ip_{dp}} = l_2 + (\#states) \times \left(l_3 + \frac{V}{P_w} \times l_{bit}\right) \tag{4}$$
where $V$ denotes the total data volume (bits) needed to be transferred
when the IPs are called; $P_w$ denotes the port width for the
corresponding data path; $e_{bit}$ and $l_{bit}$ denote the unit energy
and latency costs for each bit of data access, respectively; $e_3$ and
$l_2$ are the energy and latency overhead for warming up, respectively;
and $e_4$ and $l_3$ denote the energy and latency overhead of the
run-time control of the CPU or on-chip logic units, respectively.
Analytical-model-based Inter-IP Modeling. For the system performance,
including energy, latency, and resource consumption of a convolutional
layer or a DNN building block (e.g., the Bundle in [31, 33]), the
resource and energy consumption are obtained by summing up that of all
the IPs in the graph, and the total latency can be calculated by summing
up the latency of all the IPs on the critical path of the graph, i.e.,
$$R_{mem} = \sum_{ip_{mem} \in G} Vol_{ip_{mem}} \tag{5}$$
$$R_{mul} = \sum_{ip_{comp} \in G} U_{ip_{comp}} + R_{mul}^{dec} \tag{6}$$
$$E = \sum_{ip \in G} E_{ip} \tag{7}$$
$$L = \max_{path \in G} \sum_{ip \in path} L_{ip} \tag{8}$$
where $G$ denotes the whole graph; $R_{mem}$ denotes the total memory
volume consumption for one type of memory; $R_{mul}$ denotes the total
number of multipliers used in both the computation IPs and when decoding
the memory address, with the latter denoted as $R_{mul}^{dec}$. Regarding
the latency estimation, the inter-IP pipeline effects are excluded in the
coarse-grained mode and are captured in the fine-grained mode of Chip
Predictor (see Section 5.3). As a toy example, Fig. 7 (b) and (c)
illustrate the latency estimation when operating a matrix-matrix
multiplication in a systolic array using both the coarse-grained mode and
fine-grained mode, where the resulting estimated latency results are 15
and 7 cycles, respectively.
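
Eqs. (5) to (8) amount to three sums and one longest-path search over the
graph. The sketch below is one way to write that down, assuming the graph
is acyclic and that per-IP values are already available; the dictionary
layout and names are hypothetical. The example builds a 3×3 MAC grid with
3 cycles per MAC and no forwarding cost, which is an assumed reading of
the coarse-grained toy case in Fig. 7 (b).

```python
from functools import lru_cache


def system_estimate(ips, edges, r_mul_dec=0):
    """ips: name -> {'E': ..., 'L': ..., optional 'Vol', optional 'U'}.
    edges: (start, end) pairs of a directed acyclic graph."""
    succ = {name: [] for name in ips}
    for start, end in edges:
        succ[start].append(end)

    @lru_cache(maxsize=None)
    def longest_from(name):
        # Latency of the slowest path that starts at this IP (Eq. (8)).
        tail = max((longest_from(n) for n in succ[name]), default=0)
        return ips[name]["L"] + tail

    return {
        "R_mem": sum(ip.get("Vol", 0) for ip in ips.values()),            # Eq. (5)
        "R_mul": sum(ip.get("U", 0) for ip in ips.values()) + r_mul_dec,  # Eq. (6)
        "E": sum(ip["E"] for ip in ips.values()),                         # Eq. (7)
        "L": max(longest_from(name) for name in ips),                     # Eq. (8)
    }


# 3x3 grid of MAC IPs; data moves right and down between neighbours.
macs = {f"mac_{r}_{c}": {"E": 1, "L": 3, "U": 1}
        for r in range(3) for c in range(3)}
links = [(f"mac_{r}_{c}", f"mac_{r}_{c + 1}") for r in range(3) for c in range(2)]
links += [(f"mac_{r}_{c}", f"mac_{r + 1}_{c}") for r in range(2) for c in range(3)]
result = system_estimate(macs, links)
print(result)
```

The longest path from the top-left to the bottom-right MAC visits five
IPs, so the coarse estimate is 5 × 3 = 15 cycles under these assumptions,
with no credit given for overlap between neighbouring MACs.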
5.3 The Chip Predictor’s Fine-grained Mode
Overview. In the fine-grained mode of the Chip Predictor, we adopt:
(1) Algorithm 1 to perform run-time simulations based on inter-IP
pipeline to obtain the corresponding inter-IP latency; and (2) the
Chip Predictor’s coarse-grained mode to get the IPs’ energy and
latency for estimating the intra-IP performance.
Implementation. The run-time simulation algorithm is de-
scribed in Algorithm 1, where each IP (denoted as ip) has (1) its
neighbour IPs on the graph defined as ip.prev and ip.next , respec-
tively, and ip will use the data from ip.prev as its inputs and pass its
outputs to ip.next ; and (2) a state machine to store different states
(including its needed inputs and generated outputs) through the
whole execution process. For each clock cycle in the simulation, ip
can jump to the next state when (1) it has finished generating all
the outputs in its current state (i.e., ip is in an idle status) and (2)
ip.prev has generated all the inputs ip needed for the next state.
If ip is in an idle status but its needed inputs are not ready from
ip.prev, it will continue to wait in the idle status, resulting in an
increase of the idle cycles associated with this IP. If ip is in a busy
status, it will generate its outputs and jump to an idle status when
it finishes generating all the outputs in this state.
Algorithm 1 Run-time simulation in the fine-grained Chip Predictor
Input: one accelerator design described by graph G
For each edge in G
    ip_start ← edge's starting node
    ip_end ← edge's ending node
    Add ip_start to ip_end.prev
    Add ip_end to ip_start.next
Initialize energy and latency: E = 0, cycles = 0
While not all inference outputs are stored back
    cycles ← cycles + 1
    For each ip in G
        If (ip is idle) & (all needed inputs ∈ outputs of ip.prev)
            ip ← busy
            ip jumps to the next state
        If (ip is idle) & (not all needed inputs ∈ outputs of ip.prev)
            ip.idle_cycles ← ip.idle_cycles + 1
        If (ip is busy) & (not all outputs of ip are ready)
            Update the ready outputs of ip
        If (ip is busy) & (all outputs of ip are ready)
            ip ← idle
            E ← E + E_ip
L ← cycles / (global clock frequency)
ip_bottleneck ← ip with minimum idle cycles
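A runnable sketch of Algorithm 1 under simplifying assumptions is given below. The `IP` fields, the fixed number of busy cycles per state, and the rule that state k of an IP needs state k of each predecessor are all hypothetical choices made for illustration; the per-cycle snapshot models synchronous hardware and is likewise an assumption, not something the algorithm listing prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class IP:
    """Hypothetical IP model: processes n_states states of equal length."""
    name: str
    n_states: int              # states (e.g., tiles) to process
    busy_cycles: int           # cycles to generate one state's outputs (>= 1)
    energy_per_state: float    # E_ip, charged when a state completes
    prev: list = field(default_factory=list)
    next: list = field(default_factory=list)
    done: int = 0              # states whose outputs are all ready
    remaining: int = 0         # busy cycles left in the current state
    busy: bool = False
    idle_cycles: int = 0


def simulate(ips, edges, clk_freq_hz):
    by_name = {ip.name: ip for ip in ips}
    for start, end in edges:                    # build ip.prev / ip.next
        by_name[end].prev.append(by_name[start])
        by_name[start].next.append(by_name[end])

    energy, cycles = 0.0, 0
    while any(ip.done < ip.n_states for ip in ips):
        cycles += 1
        # Outputs visible in this cycle are those finished in earlier cycles.
        visible = {ip.name: ip.done for ip in ips}
        for ip in ips:
            if ip.done == ip.n_states:          # all work finished: just idles
                ip.idle_cycles += 1
                continue
            if not ip.busy:
                # Assumption: state k needs state k of every predecessor.
                if all(visible[p.name] > ip.done for p in ip.prev):
                    ip.busy, ip.remaining = True, ip.busy_cycles
                else:
                    ip.idle_cycles += 1
            if ip.busy:
                ip.remaining -= 1               # update the ready outputs
                if ip.remaining == 0:           # all outputs ready
                    ip.busy = False
                    ip.done += 1
                    energy += ip.energy_per_state

    latency_s = cycles / clk_freq_hz
    bottleneck = min(ips, key=lambda ip: ip.idle_cycles)
    return energy, cycles, latency_s, bottleneck.name


# Three-stage chain with made-up parameters.
ips = [IP("load", 4, 2, 1.0), IP("compute", 4, 5, 4.0), IP("store", 4, 1, 0.5)]
edges = [("load", "compute"), ("compute", "store")]
energy, cycles, latency_s, bottleneck = simulate(ips, edges, clk_freq_hz=200e6)
print(energy, cycles, bottleneck)  # 22.0 23 compute
```

In this sketch an IP that has finished all its states keeps accumulating idle cycles, so the IP with the fewest idle cycles is the one that was busy for most of the run, matching the bottleneck criterion in the last line of Algorithm 1.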
For better understanding, Fig. 7 uses a toy example to show that
the Chip Predictor’s fine-grained mode (see Fig. 7 (c)) can more
accurately estimate the required latency than its coarse-grained
mode. In this 3×3 systolic array with the local-data-forwarding and
computation operations being pipelined, we assume each MAC unit
takes 3 cycles to do the computation and 1 cycle to forward the
data to its nearby MAC units. In the coarse-grained mode case, we
add the intra-IP latency in the graph’s critical path to estimate the
overall latency (see Fig. 7 (b)), resulting in an estimated latency of
15 cycles. In the fine-grained mode case (see Fig. 7 (c)), we define
the state machine for each MAC unit and adopt Algorithm 1 to
keep track of when each MAC unit jumps to the next state. In this
particular example, MAC 2 waits at cycle 0 because its required
input data a00 is not ready, and it jumps to the next state to start
computing at cycle 1, when all its required inputs are ready. The
fine-grained mode’s estimated latency (7 cycles, the same as the
ground truth) is thus more accurate because it models the
overlapped computation and data transfer in this example. In
practical designs, the overall latency is not determined by merely
one stage, so the Chip Builder will launch the Chip Predictor to
simulate the whole graph iteratively in order to generate an optimal
design for the whole accelerator system.
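One way to read the two numbers in this toy example is sketched below. The interpretation is an assumption rather than something stated in Fig. 7: the coarse-grained estimate is taken as the compute latency summed along the longest chain of MAC units (5 units in a 3×3 array), and the fine-grained estimate assumes that the MAC unit at row i and column j can start as soon as its data has been forwarded over i + j hops.

```python
ROWS, COLS = 3, 3
COMPUTE_CYCLES, FORWARD_CYCLES = 3, 1

# Coarse-grained (assumed reading): add the intra-IP compute latency of every
# MAC unit on the longest corner-to-corner chain, with no overlap.
critical_path_macs = ROWS + COLS - 1
coarse = critical_path_macs * COMPUTE_CYCLES

# Fine-grained (assumed reading): MAC(i, j) starts after i + j forwarding hops
# and then computes for COMPUTE_CYCLES, overlapping with its neighbors.
finish = [[(i + j) * FORWARD_CYCLES + COMPUTE_CYCLES for j in range(COLS)]
          for i in range(ROWS)]
fine = max(max(row) for row in finish)

print(coarse, fine)  # 15 7
```

Under this reading, the gap between 15 and 7 cycles comes entirely from overlapping each unit's computation with the forwarding to its neighbors.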
6 THE PROPOSED CHIP BUILDER
Fig. 2 elaborates the design flow of AutoDNNchip that leverages
the Chip Builder’s two-stage DSE engine. To effectively explore
the design space (e.g., the design factors in Table 1), AutoDNNchip
involves three major steps as shown in Fig. 2: (1) the 1st-stage
DSE: an early stage architecture and IP configuration exploration
to efficiently rule out infeasible designs using the Chip Predictor’s
coarse-grained mode; (2) the 2nd-stage DSE: an inter-IP pipeline
exploration and IP optimization to effectively boost the performance
of the remaining design candidates resulting from the 1st-stage DSE;
and (3) a design validation through RTL generation and execution.
Step I. Early Stage Architecture and IP Configuration Exploration. As shown in the middle part of Fig. 2, this step proceeds as follows. First, the DNN model from a mainstream machine learning framework is fed to the DNN parser to extract the DNN layer information, e.g., layer types (CONV, Pooling, ReLU, Reorg [31], etc.), feature map inter-connections (Concat, Add, etc.), and layer shapes (shapes of the weight and feature map tensors). Second, according to the given DNN model, performance requirements (e.g., latency and throughput), and hardware budgets (e.g., resource and power budgets of the FPGA or ASIC), a design space of size N1 is generated by fetching commonly-used or promising hardware architecture templates and hardware IP templates from the Hardware IP pool. For example, when the given resource budgets are tight, a folded hardware architecture will be chosen instead of a flattened one, whereas flattened structures, which facilitate IP pipelines, are preferred when the budgets are sufficient. Third, an architecture and IP configuration optimization is performed to rule out most of the infeasible choices and trim the design space down to N2 (N2 < N1) promising candidates, e.g., those that are more efficient and have a lower latency. This fast early exploration makes use of the analytical nature of the Chip Predictor’s coarse-grained mode.

Algorithm 2 IP-pipeline co-optimization using the Chip Builder
Input: design space D_G with N2 graphs
For each G in D_G
    For each edge in G
        ip_start ← edge's starting node
        ip_end ← edge's ending node
        Add ip_start to ip_end.prev
        Add ip_end to ip_start.next
    While simulated (using Algorithm 1) latency L_G does not converge
        ip ← simulated bottleneck IP (i.e., ip_bottleneck from Algorithm 1)
        If inter-IP pipeline is adopted for ip and ip.next
            allocate more resource to ip
        Else
            adopt inter-IP pipeline between ip and ip.next
        update the state machine of ip
        update the state machine of ip.next
Select top N_opt candidates in D_G
Step II. Inter-IP Pipeline Exploration and IP Optimization. This step accepts the resulting N2 designs and performs further exploration and IP optimization using Algorithm 2. First, inter-IP pipelines are inserted into different locations of the corresponding computation graphs, resulting in a new design space of size N3, i.e., N3 new graphs with different inter-IP pipeline designs. Second, for each of these graphs, the bottleneck IPs are recorded during Algorithm 1’s run-time simulations and then optimized by adopting a deeper inter-IP pipeline or allocating more resources, until the performance predicted by the Chip Predictor’s fine-grained mode converges, as shown in Algorithm 2. Third, the top Nopt design candidates are chosen according to the Chip Predictor’s predicted energy consumption and/or latency, and then passed to the next step for validation through RTL generation and execution.
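The bottleneck-driven loop of Algorithm 2 can be sketched for a single graph as follows. Everything concrete here is a hypothetical stand-in: the design is a simple chain of IPs, `latency` is a toy analytical model used in place of the Algorithm 1 simulation, the bottleneck is taken to be the IP with the largest per-tile time (instead of the fewest simulated idle cycles), and resources are abstract integer units under a single `budget`.

```python
import math


def latency(work, units, piped, tiles):
    """Toy stand-in for Algorithm 1: latency (cycles) of a chain of IPs.

    IPs joined by an inter-IP pipeline overlap their tiles; unpipelined
    neighbors run one after the other.
    """
    t = [math.ceil(w / u) for w, u in zip(work, units)]  # per-tile time per IP
    total, seg = 0, [t[0]]
    for i in range(1, len(t)):
        if piped[i - 1]:
            seg.append(t[i])
        else:
            total += sum(seg) + (tiles - 1) * max(seg)
            seg = [t[i]]
    total += sum(seg) + (tiles - 1) * max(seg)
    return total, t


def co_optimize(work, tiles, budget):
    n = len(work)
    units = [1] * n                  # resource units allocated to each IP
    piped = [False] * (n - 1)        # piped[i]: pipeline between IP i and i+1
    best, t = latency(work, units, piped, tiles)
    while True:
        b = max(range(n), key=lambda i: t[i])        # bottleneck IP
        trial_units, trial_piped = list(units), list(piped)
        if b == n - 1 or piped[b]:
            if sum(units) >= budget:                 # nothing left to allocate
                break
            trial_units[b] += 1                      # allocate more resource
        else:
            trial_piped[b] = True                    # adopt inter-IP pipeline
        new, new_t = latency(work, trial_units, trial_piped, tiles)
        if new >= best:                              # latency has converged
            break
        units, piped, best, t = trial_units, trial_piped, new, new_t
    return best, units, piped


best, units, piped = co_optimize(work=[8, 24, 4], tiles=4, budget=6)
print(best, units, piped)  # 40 [2, 3, 1] [True, True]
```

Running the same loop over every graph in the design space and sorting the results by predicted latency or energy would correspond to the final selection of the top candidates. The loop stops at the first step that does not reduce latency, so it can stop early on a plateau; a fuller search would need a less greedy stopping rule.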
Step III. Design Validation through RTL Generation and
Execution. In this step, we generate RTL code for the top Nopt
optimized designs through an automated code generation proce-
dure: (1) For the FPGA back-end, the generated files include the
testbench for a board-level implementation, the binary file for the
quantized-and-reordered weights, and the C-code for the HLS IP
implementation. We use Vivado [22] to actually generate the bit-
stream and meanwhile eliminate the designs that fail in place and
route (PnR) to guarantee that AutoDNNchip’s generated designs
are valid; (2) For the ASIC back-end, the generated files include the
RTL testbench for the DNN model, the quantized-and-reordered
weights, the synthesizable RTL code, and the memory specifica-
tions. The RTL code could be further passed to an EDA tool like
Design Compiler and IC Compiler to generate gate-level/layout
netlist, during which Memory Compilers could take the memory
specifications to generate the memory design. After this step, all
the output designs are fully validated with correct functionality.
7 EXPERIMENT RESULTS
In this section, we evaluate the proposed AutoDNNchip on 20 DNN models across 4 types of platforms (3 edge devices, i.e., an edge FPGA, TPU, and GPU, and 2 ASIC-based accelerators).
7.1 Validation of the Chip Predictor
Methodology and Setup. Table 3 summarizes the details of our validation experiments for the Chip Predictor, including the platforms, performance metrics, DNN models, methods to obtain the unit parameters, employed precision for the weights and activations, and frequency of the corresponding computation core.

Methodology. To conduct a solid validation, we compare the Chip Predictor’s predicted performance with the actual device-measured results on 3 edge devices (Ultra96 FPGA [37], edge TPU [18], and Jetson TX2 [38]) and with the paper-reported results of 2 published ASIC-based accelerators (Eyeriss [21] and ShiDianNao [20]), adopting the same experiment settings (e.g., clock frequency, DNN model and dataset, bit precision, architecture design, and dataflow).

Benchmark DNN Models and Datasets. For the 3 edge devices, we consider 15 representative compact/light-weight DNN models (see Table 4 and Table 5, where the models in Table 5 use the ImageNet dataset [39] and the models in Table 4 use the dataset of the System Design Contest at the DAC 2019 conference [40]); for the 2 published DNN accelerators, we use the same benchmark models and datasets as the original papers.

Unit Parameters. The unit energy/latency parameters are obtained through either real-device measurement or synthesized RTL implementation, as mentioned in Section 5. For the 3 edge devices, we measure the unit energy and latency by running the basic IP operations (such as memory accesses and MAC computations) over multiple sets of experiments under different settings, and average the measured energy and latency values to get the unit parameters. Specifically, for memory accesses, we vary the clock frequency, memory volume, port width, bit precision, and burst read length; for MAC operations, we vary the clock frequency, the total number of MACs, and the parallelism of the MACs. For the ASIC-based accelerators, the unit parameters are obtained either from the paper [21] or from gate-level simulations of the synthesized RTL implementation in the same CMOS technology.
Table 3: Experiment settings for the Chip Predictor’s cross-platform/model/design/dataset validation.

| Arch/Device | Metrics (a) | DNNs (b) | Unit Param. (c) | Precision <W,A> (d) | Freq. (MHz) (e) |
|---|---|---|---|---|---|
| Ultra96 [37] | E, L, R | Compact | Measured | <11, 9> | 220 |
| Edge TPU [18] | E, L | Compact | Measured | <8, 8> | 500 |
| Jetson TX2 [38] | E, L | Compact | Measured | <32, 32> | 1300 |
| Eyeriss [21] | E, L, R | AlexNet | Reported | <16, 16> | 250 |
| ShiDianNao [20] | E | Small | Synthesized | <16, 16> | 1000 |

(a) Metrics: E: energy, L: latency, R: resource.
(b) DNN benchmarks: Compact: see the 15 compact DNN models in Table 4 and Table 5; AlexNet [41]; and Small: DNNs used in [20] (< 5 convolutional/fully-connected layers).
(c) Methods to obtain the unit parameters.
(d) Bit precision for different types of data, i.e., <weight precision, activation precision>.
(e) Clock frequency for the computation core.
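Table 4: The 10 model variants of the SkyNet backbone [31].

| DNN | SK | SK1 | SK2 | SK3 | SK4 | SK5 | SK6 | SK7 | SK8 | SK9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Size (MB) | 1.75 | 1.79 | 2.11 | 1.18 | 1.77 | 3.21 | 3.79 | 3.05 | 0.96 | 1.95 |
| Layer # | 14 | 14 | 14 | 14 | 17 | 14 | 16 | 14 | 14 | 17 |
| Bypass | ✓ | ✓ | ✓ | ✓ | ✓ | - | - | - | - | - |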
Table 5: The 5 model variants of MobileNetV2 [42] using different channel scaling factors and input resolutions.

| DNN | V-Model 1 | V-Model 2 | V-Model 3 | V-Model 4 | V-Model 5 |
|---|---|---|---|---|---|
| Resolution | 128 | 128 | 224 | 224 | 224 |
| Channel scaling | 0.5 | 1.0 | 0.5 | 1.0 | 1.4 |
Validation of the Predicted Energy Consumption. We compare the Chip Predictor’s predicted energy with the measured energy on 3 edge devices, i.e., Ultra96 FPGA [37] (edge FPGA), edge TPU [18], and Jetson TX2 [38] (edge GPU), under the same settings (see Table 3).
Fig. 8 summarizes the validation results, and shows that the maximum prediction error of our Chip Predictor is 9.17% for all 15 DNN models across the 3 platforms. Specifically, the prediction error ranges from 0.89% to 8.13%, 2.12% to 7.67%, and 2.72% to 9.17% for the edge GPU, the edge FPGA, and the edge TPU, respectively, and the corresponding average prediction errors are 5.40%, 5.20%, and 6.05%. We notice that the energy consumption of the SkyNet and SK1-SK4 models is relatively large on the edge TPU. The reason is that these models contain unsupported operations (e.g., short-cut paths and feature map reorganization [31]) that need to be handled by the embedded CPU instead of the more efficient optimized tensor unit.
[Figure 8 plot omitted. Axes: energy (J) vs. DNN model (SK, SK1-SK9, MB1-MB5). Legend: FPGA (Ultra96) [11,9], Edge TPU [8,8], Edge GPU (Jetson TX2) [32,32]; results are marked by prediction error range (0%-1%, 1%-5%, 5%-10%).]
Figure 8: The energy prediction error of the Chip Predictor when using the 15 DNN models in Table 4 and Table 5 running on 3 edge devices including an edge FPGA [37], Edge TPU [18], and edge GPU [38].

In Fig. 9 and Table 6, we validate the proposed Chip Predictor against 2 state-of-the-art ASIC-based accelerators: Eyeriss [21] and ShiDianNao [20]. For Eyeriss [21], we first compare the predicted energy breakdown of the first and fifth convolutional layers of AlexNet, for which the maximum error is 5.15% and 1.64%, respectively, as shown in Fig. 9 (a). Since memory accesses dominate the energy consumption [21], we further compare the number of DRAM and SRAM accesses. In Fig. 9 (b), we present the error between the predicted and Eyeriss’s paper-reported results. The relatively large error of the SRAM accesses in the first convolutional layer is caused by its large stride (4 in this case), which is not supported since our predictor only considers the commonly used strides of 1 and 2 for simplicity; the Chip Predictor can be straightforwardly extended to include other stride values. The prediction errors of the DRAM accesses in the last three layers are also relatively large, because the input data are compressed to save DRAM accesses in [21] and we lack information regarding the input data sparsity. The validation results for ShiDianNao [20] are listed in Table 6. By comparing the average energy breakdown of the 4 IPs in [20] over 10 DNN benchmarks, we verify that the maximum prediction error is 9.59%, where the error is mainly due to the difference between our adopted commercial CMOS IP library and the one used in [20].

[Figure 9 plots omitted. (a) Energy breakdown (%) vs. hardware IP (Comp., RF, NoC, GB) for CONV1 and CONV5. (b) Number of DRAM/SRAM accesses (M) vs. AlexNet layer (CONV1-CONV5). Results are marked by prediction error range (0%-1%, 1%-5%, 5%-10%).]
Figure 9: The Chip Predictor’s energy prediction error considering the Eyeriss architecture [43]: (a) The energy breakdown for AlexNet’s 1st and 5th convolutional layers and (b) the # of DRAM and SRAM accesses of convolutional layers.
Table 6: The energy prediction error of the Chip Predictor when using the architecture of ShiDianNao [20]: The energy breakdown over 10 benchmarks.

| IP | Computation | Input SRAM | Output SRAM | Weight SRAM |
|---|---|---|---|---|
| Predicted (%) | 89.2 | 7.4 | 1.7 | 1.6 |
| Paper-reported (%) | 89.0 | 8.0 | 1.6 | 1.5 |
| Prediction error | 0.35% | -7.19% | 9.59% | 7.87% |
Validation of the Predicted Latency. The latency prediction of the Chip Predictor is validated against the measured results of the same 15 DNN models on the 3 edge devices, as shown in Fig. 10. The maximum latency prediction error of our Chip Predictor is 9.75%. Specifically, the prediction errors range from 0.89% to 9.75%, 1.78% to 5.98%, and 2.92% to 9.44% when using the edge GPU, the edge FPGA, and the edge TPU, respectively, and the corresponding average prediction errors are 4.85%, 3.73%, and 6.57%. Similar to the case in Fig. 8, the latency of the SkyNet and SK1-SK4 models is relatively large on the edge TPU because of the unsupported operations. Table 7 summarizes the latency prediction when running the 5 convolutional layers of AlexNet on Eyeriss [21], with the largest error being 4.12%. The latency predicted by the Chip Predictor is smaller than the paper-reported results, as the Chip Predictor does not consider the special scenario in which the accelerator needs to access the memory multiple times for a single wordline of data. Such a scenario only happens when one wordline of data is physically stored in multiple wordlines of the memory. The Chip Predictor can be extended to include such a case by configuring the corresponding memory data arrangements.
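The signed prediction error can be recomputed from the latency values listed in Table 7 with a few lines of Python. The sketch assumes that the error is defined relative to the paper-reported value, which the text does not state explicitly; because the tabulated latencies are rounded, the recomputed percentages only approximate the printed ones.

```python
# Latencies in ms, copied from Table 7.
predicted = {"CONV1": 16.04, "CONV2": 37.58, "CONV3": 21.09,
             "CONV4": 15.59, "CONV5": 9.79}
reported = {"CONV1": 16.5, "CONV2": 39.2, "CONV3": 21.8,
            "CONV4": 16.0, "CONV5": 10.0}


def rel_error_pct(pred, ref):
    """Signed relative error in percent (assumed definition)."""
    return 100.0 * (pred - ref) / ref


errors = {layer: rel_error_pct(predicted[layer], reported[layer])
          for layer in predicted}
worst = max(errors, key=lambda layer: abs(errors[layer]))

for layer, err in errors.items():
    print(f"{layer}: {err:+.2f}%")
print("largest error:", worst)  # largest error: CONV2
```

All five errors come out negative, in line with the observation that the predicted latency is smaller than the paper-reported one, and the largest deviation occurs for CONV2.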
[Figure 10 plot omitted. Axes: latency (ms) vs. DNN model (SK, SK1-SK9, MB1-MB5). Legend: FPGA (Ultra96) [11,9], Edge TPU [8,8], Edge GPU (Jetson TX2) [32,32]; results are marked by prediction error range (0%-1%, 1%-5%, 5%-10%).]
Figure 10: The latency prediction error of the Chip Predictor when using 15 DNN models on 3 edge devices including Ultra96 FPGA board, Edge TPU, and Jetson TX2 (edge GPU).

Table 7: The latency prediction error of the Chip Predictor when using the architecture of Eyeriss [21]: The latency when processing the 5 convolutional layers of AlexNet.

| AlexNet Layer | CONV1 | CONV2 | CONV3 | CONV4 | CONV5 |
|---|---|---|---|---|---|
| Predicted latency (ms) | 16.04 | 37.58 | 21.09 | 15.59 | 9.79 |
| Paper-reported latency (ms) | 16.5 | 39.2 | 21.8 | 16 | 10 |
| Prediction error | -2.88% | -4.12% | -3.24% | -2.56% | -2.14% |

Validation of the Predicted Resource Consumption. Table 8 summarizes the Chip Predictor’s predicted resource consumption based on the experiments using the Ultra96 FPGA [37]. Specifically, the predicted consumption of the 2 critical on-chip FPGA resources, DSP48E and BRAM18K, is validated against that obtained from the post-implementation utilization reports, with prediction errors within 4.2% and 3.2%, respectively. Note that the DSP48Es are the embedded multipliers in the FPGA and the BRAM18Ks are the main on-chip memory resource in the FPGA. In addition, the 6 cases, i.e., Bg. 1-6 in Table 8, correspond to 6 designs under varied resource budgets. For the validation over ASIC-based DNN accelerators, we consider the MAC utilization as the validation metric. The estimated MAC utilization for the 5 convolutional layers of AlexNet is the same as the paper-reported one for Eyeriss [21], because the MAC utilization is only determined by the parallelism level of the PE array in Eyeriss.
Table 8: The resource consumption prediction error of the Chip Predictor on the Ultra96 FPGA board when considering 6 different designs given 6 different resource budgets.

| Resource type | Val. | Bg. 1 | Bg. 2 | Bg. 3 | Bg. 4 | Bg. 5 | Bg. 6 |
|---|---|---|---|---|---|---|---|
| DSP48E | Predicted | 35 | 69 | 141 | 213 | 285 | 331 |
| DSP48E | Measured | 36 | 72 | 144 | 216 | 288 | 360 |
| DSP48E | Error | -3.2% | -4.2% | -2.4% | -1.2% | -1.0% | -0.8% |
| BRAM18K | Predicted | 65 | 87 | 175 | 265 | 354 | 446 |
| BRAM18K | Measured | 64 | 86 | 173 | 259 | 346 | 432 |
| BRAM18K | Error | +1.0% | +0.8% | +1.2% | +2.2% | +2.4% | +3.2% |
7.2 Evaluation for the Chip Builder and AutoDNNchip
In this subsection, we evaluate the performance of AutoDNNchip and the proposed Chip Builder, which makes use of the time-efficient and accurate Chip Predictor to perform an effective and efficient two-stage DSE. Specifically, we study the performance of the DNN accelerators generated and optimized by the Chip Builder and AutoDNNchip. First, we show experiment results that visualize the Chip Builder’s two-stage DSE process. Second, we study the performance improvement resulting from the Chip Builder’s second-stage IP-pipeline co-optimization in terms of the bottleneck blocks’ latency and idle cycle reduction. Finally, we validate the effectiveness of AutoDNNchip by comparing the performance of its generated FPGA- and ASIC-based accelerators (i.e., the corresponding RTL implementations) with that of state-of-the-art designs under the same conditions.
Evaluation Setup. In this set of experiments, we consider the
application-driven specifications and constraints summarized in
Table 9, where the throughput requirement and power budget are
set to meet real-time applications of visual recognition (e.g., im-
age classification and object detection [3]) on edge devices. For
the FPGA-based accelerator design, we use a state-of-the-art edge
device Ultra96 FPGA board [37] whose resource budget is fixed;
for the ASIC-based accelerator design, we evaluate our generated
designs through RTL simulations. Regarding the design space ex-
ploration, we use Algorithm 1 to perform the accelerator design
optimization, considering the architecture/dataflow search space
and the design factors in Table 1 for the IP/dataflow design.
Visualizing the Chip Builder’s Two-stage DSE Process. To demonstrate the effectiveness of the Chip Builder’s two-stage DSE engine, we visualize the DSE process when using AutoDNNchip to design an FPGA-based accelerator that is competitive with the award-winning state-of-the-art design in [31], given the same target performance specification/constraint, FPGA board, DNN model, and dataset. The FPGA-measured energy consumption of both the resulting design from AutoDNNchip and the reported design of [31] is marked in purple in Fig. 11. It demonstrates that: (1) the DSE engine of the Chip Builder can effectively trim down the design choices and generate optimized designs with better performance than the state-of-the-art design published in [31]. Without humans in the loop, AutoDNNchip can indeed automatically generate DNN accelerators that achieve optimized performance; (2) most of the design choices can be efficiently ruled out by the 1st stage of the DSE engine, i.e., the early stage exploration based on the Chip Predictor’s coarse-grained analytical performance estimation; and (3) the 2nd-stage IP-pipeline co-optimization of the Chip Builder can effectively boost the performance (here, the throughput of the DNN accelerators) by up to 36.46% and by 28.92% on average, compared to the designs resulting from the 1st-stage DSE. The final generated design candidates, in HLS code format, are passed to Vivado [22] for implementation. We then eliminate the designs that fail in the PnR step, as shown in Fig. 11, and find an optimal design among the remaining ones. As a reference point, the 1st-stage DSE takes about 0.65 ms for each design point and only about 0.8 hours to explore a total of 4.6 million design points when running on an Intel Core i5 CPU with a single thread, thanks to the analytical nature of the Chip Predictor.
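As a quick consistency check of the reported exploration time, assuming the per-point cost is roughly uniform across the design space:

$$4.6 \times 10^{6} \times 0.65\,\text{ms} = 2.99 \times 10^{3}\,\text{s} \approx 0.83\,\text{h},$$

which agrees with the stated figure of about 0.8 hours.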
Table 9: The considered application-driven specifications (i.e., throughput requirement) and constraints (i.e., power and resource budget) when evaluating the Chip Builder’s generated FPGA- and ASIC-based DNN accelerators.

| Target Back-end | Application | Opt. Obj. | Th./P. Req. | Res. Budget |
|---|---|---|---|---|
| Ultra96 FPGA | Object Detection | E, L | 20 FPS, 10 W | DSP=360, FF=141120, LUT=70560, BRAM=432 |
| ASIC | Vision Tasks [20] | E, L | 15 FPS, 600 mW | On-chip SRAM=128KB, # of MAC units=64 |
Evaluation of the Chip Builder’s 2nd-stage Optimization. Fig. 12 summarizes the evaluation experiments for the Chip Builder’s 2nd-stage optimization process. As described in Section 6, this stage targets an IP-pipeline co-optimization and thus can lead to a more balanced pipeline and a more efficient resource allocation. From Fig. 12, we can see that the Chip Builder’s 2nd-stage optimization achieves up to a 2.4× reduction in idle cycles when optimizing the design of SkyNet’s 6 blocks [31] on the Ultra96 edge FPGA board [37].
[Figure 11 plot omitted. Axes: energy per image (J) vs. latency (ms). Legend: optimized design and baseline; designs eliminated by the 1st stage DSE; candidates after the 1st stage DSE; candidates after the 2nd stage DSE. Annotated improvements: 25.89%, 33.59%, 13.22%, 35.39%, 36.46%, and 29.04%; some candidates are marked as failed in the PnR step.]
Figure 11: Visualizing the energy consumption per image and processing latency of the resulting designs from the Chip Builder’s 1st and 2nd stage optimization, when using AutoDNNchip to design an FPGA-based accelerator for meeting the performance of a state-of-the-art design [31] given the same performance specification/constraint, FPGA board, DNN model, and dataset (see Table 9).
[Figure 12 plot omitted. Axes: processing latency (M cycles) for Block 1 to Block 6. Legend: busy and idle cycles without and with the 2nd-stage optimization. Annotations: 1.9×, 1.7×, 1.5×, 1.9×, 1.2×, 2.4×.]
Figure 12: The busy and idle cycles of the bottleneck IPs in SkyNet’s 6 different blocks, before and after conducting the Chip Builder’s 2nd-stage IP-pipeline co-optimization, when using AutoDNNchip to generate designs for the Ultra96 FPGA board with the same target performance as [31].

Evaluation of AutoDNNchip’s Generated FPGA-based Accelerators. Fig. 11 shows that the DNN accelerator generated by AutoDNNchip can outperform the recent award-winning design [31]. We further conduct another set of experiments to compare the performance of AutoDNNchip’s generated DNN accelerators on the Ultra96 FPGA board with that of a mobile CPU (Pixel2 XL [32]), when both designs (1) adopt the settings in Table 3, using the same bit precision and the 10 DNN models in Table 4, and (2) try to minimize the latency, considering time-critical applications. Note that the DNN mapping to the mobile CPU is optimized using TensorFlow Lite [44]. Fig. 13 illustrates the corresponding latency vs. energy efficiency, where the results for the same DNN model are marked with markers of the same shape. We can see that the AutoDNNchip-generated accelerators consistently achieve a smaller latency than the baselines under the same DNN model and settings, while having a similar (<15% difference) energy efficiency. Specifically, the AutoDNNchip-generated accelerators achieve an average latency reduction of 3.86× while having a slightly better (10%) or worse (<15% difference) energy efficiency, demonstrating the effectiveness of AutoDNNchip in generating optimized FPGA-based accelerators.

[Figure 13 plot omitted. Axes: latency (ms) vs. energy efficiency (FPS/W). Legend: generated design on FPGA (Ultra96) [8,8]; mobile (Pixel2 XL) [8,8]; mean value. Annotation: latency reduction 3.86×.]
Figure 13: Processing latency and energy efficiency on the Ultra96 FPGA compared with a mobile device (Pixel2 XL) on 10 compact DNN models.
Evaluation of AutoDNNchip’s Generated ASIC-based Accelerators. Fig. 14 illustrates that AutoDNNchip can indeed generate ASIC-based accelerators that achieve an optimal tradeoff between latency and energy consumption, by visualizing the latency vs. energy consumption of the generated accelerators, where dots with different colors correspond to designs based on different hardware templates. Furthermore, we evaluate the performance of the AutoDNNchip-generated ASIC-based accelerators by comparing their energy consumption with that of a state-of-the-art ASIC-based accelerator [20] on 5 shallow neural networks, which are used in [20] for performance evaluation, with both subject to the same throughput constraint shown in Table 9. Fig. 15 shows the comparison, where all the energy consumption values are obtained from RTL implementation and simulation. We can see that the AutoDNNchip-generated ASIC-based accelerators consistently outperform [20] on all 5 networks, with energy consumption improvements ranging from 7.9% to 58.3%, demonstrating the effectiveness of AutoDNNchip in generating optimized ASIC-based accelerators.
For the aforementioned set of experiments, we first use the application-driven performance requirements and constraints (see Table 9) to perform design space exploration and then validate the generated designs using RTL simulations that adopt the same clock frequency (1 GHz) and technology (65 nm) as our baseline [20]. Specifically, the DSE process optimizes the accelerators’ energy-delay product while considering different: (1) hardware templates with three different architectures [17, 20, 21] (denoted as template 1/2/3 in Fig. 14), (2) memory sizes and numbers of PEs within the resource constraint, (3) memory allocations (i.e., input/weight/output buffer), and (4) memory access and reuse patterns.
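The selection criterion described above, i.e., minimizing the energy-delay product subject to the power and throughput constraints of Table 9, can be sketched as follows. The `Design` record, the candidate names, and their energy/latency values are hypothetical; the sketch also assumes that the average power can be approximated as energy per image divided by latency and that throughput is the inverse of latency.

```python
from typing import NamedTuple


class Design(NamedTuple):
    """Hypothetical design point produced by the DSE."""
    name: str
    energy_mj: float    # energy per image in mJ
    latency_ms: float   # processing latency per image in ms


def select_min_edp(designs, power_budget_w, min_fps):
    feasible = [
        d for d in designs
        if d.energy_mj / d.latency_ms <= power_budget_w   # mJ / ms == W
        and 1000.0 / d.latency_ms >= min_fps              # images per second
    ]
    # Energy-delay product as the objective; None if nothing is feasible.
    return min(feasible, key=lambda d: d.energy_mj * d.latency_ms, default=None)


# Made-up candidates from three templates.
candidates = [
    Design("t1-a", 1.1, 1.9),   # about 0.58 W
    Design("t2-a", 0.8, 1.2),   # about 0.67 W, exceeds the power budget
    Design("t3-a", 1.0, 1.7),   # about 0.59 W
    Design("t1-b", 2.0, 2.0),   # 1.0 W, exceeds the power budget
]
best = select_min_edp(candidates, power_budget_w=0.6, min_fps=15)
print(best.name)  # t3-a
```

The candidate with the lowest energy-delay product overall ("t2-a") is rejected here because it violates the power budget, which illustrates why the constraints have to be applied before the objective is minimized.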
[Figure 14 plot omitted. Axes: processing latency (ms) vs. energy (mJ). Legend: designs using template 1, template 2, and template 3; the optimized design; designs that exceed the power budget (power budget = 600 mW).]
Figure 14: Visualizing the latency vs. energy consumption per image of the ASIC-based accelerators in the design space pool, when using AutoDNNchip to design an ASIC-based accelerator for meeting the performance of a state-of-the-art ASIC-based accelerator [20], with both having the same performance constraints, DNN model, and dataset (see Table 9).

[Figure 15 plot omitted. Axes: normalized energy for Face Reco., Lenet-5, CFF, ConvNN, and Face align. Legend: baseline paper reported; AutoDNNchip optimized. Annotations: 1.2×, 1.2×, 1.3×, 1.1×, 1.6×.]
Figure 15: Comparing the normalized energy consumption between the AutoDNNchip generated ASIC-based accelerators and [20], when accelerating 5 shallow neural networks under the same throughput requirement.
8 CONCLUSIONS
To close the gap between the growing demand for DNN accelerators with various specifications and the time-consuming and challenging DNN accelerator design process, we develop AutoDNNchip, which can automatically generate both FPGA- and ASIC-based DNN accelerators. Experiments using over 20 DNN models and 4 platforms show that DNN accelerators generated by AutoDNNchip outperform state-of-the-art designs by up to 3.86×. Specifically, AutoDNNchip is made possible by the proposed one-for-all design space description, Chip Predictor, and Chip Builder. Experiments based on 15 DNN models and 4 platforms demonstrate that the Chip Predictor’s prediction error is smaller than 10% compared with real-measured results, and that the Chip Builder can effectively and efficiently perform design space exploration and optimization.
ACKNOWLEDGMENTS
This work is supported in part by the NSF RTML grant 1937592
and NSF 1801865, the IBM-Illinois Center for Cognitive Computing
System Research (C3SR), and XMotors.ai.
REFERENCES
[1] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-
Scale Image Recognition,” CoRR, vol. abs/1409.1556, 2014.
[2] Y. Wang, Z. Jiang, X. Chen, P. Xu, Y. Zhao, Y. Lin, and Z. Wang, “E2-Train: Training
State-of-the-art CNNs with Over 80% Energy Savings,” in Advances in Neural
Information Processing Systems, pp. 5139–5151, 2019.
[3] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object
detection with region proposal networks,” in Advances in neural information
processing systems, pp. 91–99, 2015.
[4] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig,
“The Microsoft 2016 conversational speech recognition system,” in Acoustics,
Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on,
pp. 5255–5259, IEEE, 2017.
[5] S. Liu, Y. Lin, Z. Zhou, K. Nan, H. Liu, and J. Du, “On-demand deep model
compression for mobile devices: A usage-driven model selection framework,”
in Proceedings of the 16th Annual International Conference on Mobile Systems,
Applications, and Services, pp. 389–400, ACM, 2018.
[6] Y. Wang, T. Nguyen, Y. Zhao, Z. Wang, Y. Lin, and R. Baraniuk, “Energynet:
Energy-efficient dynamic inference,” in Advances in Neural Information Processing
Systems (Workshop), 2018.
[7] J. Shen, Y. Fu, Y. Wang, P. Xu, Z. Wang, and Y. Lin, “Fractional Skipping:
Towards Finer-Grained Dynamic Inference,” in The Thirty-Fourth AAAI Conference
on Artificial Intelligence, 2020.
[8] J. Wu, Y. Wang, Z. Wu, Z. Wang, A. Veeraraghavan, and Y. Lin, “Deep k-Means:
Re-Training and Parameter Sharing with Harder Cluster Assignments for
Compressing Deep Convolutions,” in Thirty-fifth International Conference on Machine
Learning, 2018.
[9] Y. Wang, J. Shen, T.-K. Hu, P. Xu, T. Nguyen, R. Baraniuk, Z. Wang, and Y. Lin,
“Dual dynamic inference: Enabling more efficient, adaptive and controllable deep
inference,” IEEE Journal of Selected Topics in Signal Processing, 2019.
[10] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based
accelerator design for deep convolutional neural networks,” in Proceedings of
International Symposium on Field-Programmable Gate Arrays, pp. 161–170, ACM,
2015.
[11] Y. Lin, S. Zhang, and N. Shanbhag, “Variation-Tolerant Architectures for Convo-
lutional Neural Networks in the Near Threshold Voltage Regime,” in 2016 IEEE
International Workshop on Signal Processing Systems (SiPS), pp. 17–22, Oct 2016.
[12] S. Liu, A. Papakonstantinou, H. Wang, and D. Chen, “Real-time object track-
ing system on FPGAs,” in 2011 Symposium on Application Accelerators in High-
Performance Computing, pp. 1–7, IEEE, 2011.
[13] Z. Liu, Y. Dou, J. Jiang, J. Xu, S. Li, Y. Zhou, and Y. Xu, “Throughput-optimized
FPGA accelerator for deep convolutional neural networks,” ACM Transactions on
Reconfigurable Technology and Systems (TRETS), vol. 10, no. 3, p. 17, 2017.
[14] X. Zhang, X. Liu, A. Ramachandran, C. Zhuge, S. Tang, P. Ouyang, Z. Cheng,
K. Rupnow, and D. Chen, “High-performance video content recognition with
long-term recurrent convolutional network for FPGA,” in 2017 27th International
Conference on Field Programmable Logic and Applications (FPL), pp. 1–4, IEEE,
2017.
[15] C. Zhuge, X. Liu, X. Zhang, S. Gummadi, J. Xiong, and D. Chen, “Face recognition
with hybrid efficient convolution algorithms on FPGAs,” in Proceedings of the
2018 on Great Lakes Symposium on VLSI, pp. 123–128, ACM, 2018.
[16] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, “DNNBuilder:
an automated tool for building high-performance DNN hardware accelerators for
FPGAs,” in Proceedings of the International Conference on Computer-Aided Design,
p. 56, ACM, 2018.
[17] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates,
S. Bhatia, N. Boden, A. Borchers, et al., “In-datacenter performance analysis of a
tensor processing unit,” in 2017 ACM/IEEE 44th Annual International Symposium
on Computer Architecture (ISCA), pp. 1–12, IEEE, 2017.
[18] Google Inc., “Edge TPU.” https://coral.withgoogle.com/docs/edgetpu/faq/, ac-
cessed 2019-09-01.
[19] Y. Lin and J. R. Cavallaro, “Energy-efficient convolutional neural networks via sta-
tistical error compensated near threshold computing,” in 2018 IEEE International
Symposium on Circuits and Systems (ISCAS), pp. 1–5, May 2018.
[20] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam,
“Shidiannao: Shifting vision processing closer to the sensor,” in ACM SIGARCH
Computer Architecture News, vol. 43, pp. 92–104, ACM, 2015.
[21] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient
dataflow for convolutional neural networks,” in Computer Architecture (ISCA),
2016 ACM/IEEE 43rd Annual International Symposium on, pp. 367–379, IEEE Press,
2016.
[22] Xilinx, “Vivado High-Level Synthesis.” https://www.xilinx.com/
products/design-tools/vivado/integration/esl-design.html, accessed 2019-09-16.
[23] D. Chen, J. Cong, Y. Fan, G. Han, W. Jiang, and Z. Zhang, “xpilot: A platform-based
behavioral synthesis system,” SRC TechCon, vol. 5, 2005.
[24] D. Chen, J. Cong, Y. Fan, and L. Wan, “Lopass: A low-power architectural synthesis
system for FPGAs with interconnect estimation and optimization,” IEEE Transac-
tions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 4, pp. 564–577,
2009.
[25] K. Rupnow, Y. Liang, Y. Li, D. Min, M. Do, and D. Chen, “High level synthesis of
stereo matching: Productivity, performance, and software constraints,” in 2011
International Conference on Field-Programmable Technology, pp. 1–8, IEEE, 2011.
[26] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li, “DeepBurning: automatic generation of
FPGA-based learning accelerators for the neural network family,” in Proceedings
of the 53rd Annual Design Automation Conference, p. 110, ACM, 2016.
[27] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards uni-
formed representation and acceleration for deep convolutional neural networks,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
2018.
[28] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong,
“FP-DNN: An automated framework for mapping deep neural networks onto
FPGAs with RTL-HLS hybrid templates,” in 2017 IEEE 25th Annual International
Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 152–
159, IEEE, 2017.
[29] R. Venkatesan, Y. S. Shao, M. Wang, J. Clemons, S. Dai, M. Fojtik, B. Keller,
A. Klinefelter, N. Pinckney, P. Raina, et al., “MAGNet: A Modular Accelerator
Generator for Neural Networks,” in Proceedings of the International Conference on
Computer-Aided Design (ICCAD), 2019.
[30] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, and D. Chen, “Design flow of accelerat-
ing hybrid extremely low bit-width neural network in embedded FPGA,” in 2018
28th International Conference on Field Programmable Logic and Applications (FPL),
2018.
[31] X. Zhang, H. Lu, C. Hao, J. Li, B. Cheng, Y. Li, K. Rupnow, J. Xiong, T. Huang,
H. Shi, et al., “SkyNet: a Hardware-Efficient Method for Object Detection and
Tracking on Embedded Systems,” arXiv preprint arXiv:1909.09709, 2019.
[32] Google Inc., “Pixel Phone 2 XL.” https://store.google.com/product/pixel_3?srp=
/product/pixel_2/, accessed 2019-09-01.
[33] C. Hao, X. Zhang, Y. Li, S. Huang, J. Xiong, K. Rupnow, W.-m. Hwu, and D. Chen,
“FPGA/DNN Co-Design: An efficient design methodology for IoT intelligence on
the edge,” in Proceedings of the Design Automation Conference, p. 206, ACM, 2019.
[34] H. Kwon, M. Pellauer, and T. Krishna, “MAESTRO: an open-source infrastruc-
ture for modeling dataflows within deep learning accelerators,” arXiv preprint
arXiv:1805.02566, 2018.
[35] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan,
B. Khailany, S. W. Keckler, and J. Emer, “Timeloop: A Systematic Approach to DNN
Accelerator Evaluation,” in 2019 IEEE International Symposium on Performance
Analysis of Systems and Software (ISPASS), pp. 304–315, IEEE, 2019.
[36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in Proceedings of the IEEE conference on computer vision and pattern recognition,
pp. 770–778, 2016.
[37] Xilinx Inc., “Avnet Ultra96.” https://www.xilinx.com/products/boards-and-kits/1-
vad4rl.html, accessed 2019-09-01.
[38] NVIDIA Inc., “NVIDIA Jetson TX2.” https://www.nvidia.com/en-us/autonomous-
machines/embedded-systems/jetson-tx2/, accessed 2019-09-01.
[39] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale
hierarchical image database,” in CVPR, 2009.
[40] J. Hu, J. Goeders, P. Brisk, Y. Wang, G. Luo, and B. Yu, “2019 DAC system design
contest on low power object detection,” When Accuracy meets Power: 2019 DAC
System Design Contest on Low Power Object Detection, 2019.
[41] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep
Convolutional Neural Networks,” in Advances in Neural Information Processing
Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.),
pp. 1097–1105, Curran Associates, Inc., 2012.
[42] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2:
Inverted Residuals and Linear Bottlenecks,” in The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2018.
[43] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient
reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal
of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[44] Google Inc., “Tensorflow Lite.” https://www.tensorflow.org/lite, accessed 2019-
09-01.
