DNN-Chip Predictor: An Analytical Performance Predictor for DNN
  Accelerators with Various Dataflows and Hardware Architectures by Zhao, Yang et al.
Yang Zhao, Chaojian Li, Yue Wang, Pengfei Xu, Yongan Zhang, and Yingyan Lin
Department of Electrical and Computer Engineering, Rice University
arXiv:2002.11270v1 [cs.LG] 26 Feb 2020
ABSTRACT
The recent breakthroughs in deep neural networks (DNNs)
have spurred a tremendously increased demand for DNN accel-
erators. However, designing DNN accelerators is non-trivial
as it often takes months/years and requires cross-disciplinary
knowledge. To enable fast and effective DNN accelerator de-
velopment, we propose DNN-Chip Predictor, an analytical
performance predictor which can accurately predict DNN ac-
celerators’ energy, throughput, and latency prior to their actual
implementation. Our Predictor features two highlights: (1)
its analytical performance formulation of DNN ASIC/FPGA
accelerators facilitates fast design space exploration and opti-
mization; and (2) it supports DNN accelerators with different
algorithm-to-hardware mapping methods (i.e., dataflows) and
hardware architectures. Experiment results based on 2 DNN
models and 3 different ASIC/FPGA implementations show
that our DNN-Chip Predictor’s predicted performance differs
from chip measurements of FPGA/ASIC implementations by no
more than 17.66% across different DNN models, hardware
architectures, and dataflows. We will release code upon
acceptance.
Index Terms— DNN accelerator, ASIC, FPGA, design
simulator, design automation
1. INTRODUCTION
Deep Neural Networks (DNNs) have achieved record-breaking
performance in various applications, such as image classifica-
tion [1, 2] and natural language processing [3]. However, their
powerful performance often comes with a prohibitive complex-
ity [4, 5, 6, 7, 8]. Moreover, DNN-based applications often
require not only high accuracy, but also aggressive hardware
performance, including high throughput, low latency, and high
energy efficiency. As such, there has been intensive research
on DNN accelerators in order to take advantage of different
hardware platforms, such as FPGAs and ASICs, for improving
DNN acceleration efficiency [9, 10, 11, 12, 13, 14].
While DNN accelerators can be 1000× more efficient
than general purpose computing platforms [15], developing
DNN accelerators presents significant challenges, because: (1)
mainstream DNNs have millions of parameters and billions of
operations; (2) the design space of DNN accelerators is large
due to numerous design choices of architectures, hardware
IPs, DNN-to-accelerator mappings, etc.; and (3) algorithm/hardware
co-design is needed, because the same DNN functionality can be
decomposed in different ways that require different hardware IPs
and thus lead to dramatically different hardware
performance/energy/area trade-offs. Therefore, high-quality DNN
accelerators often take months/years
to design and require a large team of cross-disciplinary experts
with knowledge in DNN algorithms, micro-architectures, and
physical chip design. Such a barrier makes it difficult to scien-
tifically explore innovative DNN accelerator design and thus
limits DNNs’ more extensive applications.
To address the aforementioned challenges, we propose
DNN-Chip Predictor, an analytical performance predictor
which can efficiently and accurately predict DNN accelerators’
performance prior to time-consuming ASIC/FPGA hardware
implementation. Specifically, our Predictor formulates DNN
accelerators’ energy, throughput, and latency based on param-
eters that characterize the DNN models and corresponding
accelerators’ architectures and algorithm-to-hardware map-
ping methods (i.e., dataflows). Such a generic Predictor (1)
enables fast evaluation of DNN accelerator innovations and (2)
can be used as an efficient design exploration and optimization
tool for DNN accelerators, given their large design space. To
the best of our knowledge, our proposed Predictor is the first
that highlights the following three features simultaneously for
practical and wide adoption: (1) analytical and thus fast; (2)
covering both ASIC and FPGA DNN accelerators; and (3)
validated using different DNN models and accelerator designs
(i.e., architectures, dataflows, and process technologies).
2. BACKGROUND
DNN Accelerators. There have been intensive studies of
DNN accelerators. For example, the first well-optimized
FPGA DNN accelerator [16] uses loop tiling; the DianNao
series [13, 17] is an early effort on synthesis based ASIC ac-
celerators; Eyeriss proposes a row-stationary dataflow [14] to
reduce expensive DRAM accesses; and Google TPUs [11, 12]
use a systolic array to achieve high throughput.
DNN Accelerator Performance Prediction. DNNs often feature
high complexity, while there exist various opportunities
for reuse, pipeline, and resource allocation to maximize DNN
accelerators’ performance. Therefore, an accurate yet fast
performance predictor is desired to enable efficient design
space exploration and optimization with different performance
trade-offs. Various methods have been developed for predicting
or simulating DNN accelerators’ performance. Roofline
models [16, 18] and customized analytical models which are
closely tied to the specific design attributes [9, 19, 20] are used.
However, roofline models lack fine-grained estimation, and
customized models are not as general as desired. Timeloop [21]
and Eyeriss [22] use for and parallel-for to describe the tem-
poral and spatial mapping of DNN accelerators. Specifically,
Timeloop obtains the number of memory accesses and esti-
mates the latency by calculating the maximum isolated execu-
tion cycle across all hardware IPs based on a double-buffering
assumption. Accelergy [23] proposes a configuration language
to describe hardware architectures and depends on plug-ins,
e.g., Timeloop, to calculate the energy as in [14]. The work in
[24] adopts Halide [25], a domain-specific language for image
processing applications, and proposes a modeling framework
which is similar to that of [14]. MAESTRO [26] is the very
first to adopt a data-centric approach.
3. THE PROPOSED DNN-CHIP PREDICTOR
This section presents the proposed DNN-Chip Predictor which
is an analytical modelling framework to formulate DNN infer-
ence accelerators’ energy cost, latency, and throughput when
employing different dataflows and hardware architectures. We
first introduce the employed design space description method,
and then describe the developed performance models. The
advantages of the DNN-Chip Predictor are that it (1) matches
well with actual implementation results (prediction error <18%);
(2) is analytical and intuitive (directly tied to the DNN model and
accelerator parameters), facilitating its ease of use for time-
efficient design space exploration and optimization; and (3)
is programmer friendly and compatible with commonly used
DNN frameworks (e.g., PyTorch [27]) thanks to its adopted
generic description of DNN accelerators’ design space.
3.1. Design Space Description
For modeling DNN accelerators’ performance given their large
design space, one critical question is how to describe the whole
design space, i.e., cover all possible design choices, in a way
that is easy to follow? For ease of use and better visualization,
we adopt a nested for-loop description [14] to describe the
design space as shown in Fig. 1. Specifically, we employ (1)
the primitive, for, to describe the temporal operations of each
processing element (PE) as well as the temporal data tiling and
mapping operations at the DRAM, global buffer (GB) and
register file (RF) levels; and (2) the primitive, parallel-for, to
describe the spatial data tiling and mapping operations at the
network-on-chip (NoC) level (i.e., in the PE array). Without
loss of generality, we consider four levels of memory hierarchy,
i.e., off-chip DRAM, on-chip GB, NoC in the PE array, and RF
within the PEs. The design space of DNN accelerators mainly
includes two aspects: hardware architectures and dataflows.
Fig. 1: A nested for-loop description of DNN accelerators’
design space, using a CONV layer as an example, where
0, 1, 2, 3 denote the four memory hierarchies (i.e., RF, NoC,
GB, and DRAM, respectively), and M, C, R, S, E, F denote
the six dimensions of a CONV layer (i.e., input/output
channels, kernel width/height, and output feature map
width/height, respectively).
Hardware architecture. It can be described using a set of
architecture-dependent hardware parameters and
technology-dependent IP parameters. In particular, the
architecture-dependent hardware parameters include PE array
architectures (e.g., spatial array, systolic array, and adder tree),
number of PEs, NoC design (e.g., unicast, multicast, or
broadcast), memory hierarchies, and the storage capacity
and communication bandwidth of each memory hierarchy;
the technology-dependent IP parameters include the unit
energy/delay costs of (1) a MAC operation and (2) memory
accesses to various memory hierarchies, as well as (3) the
clock frequency.
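As a rough illustration, the two parameter groups above could be collected into plain configuration records. This is only a sketch: the class names, field names, units, and all numeric values below are hypothetical placeholders chosen for the example, not the Predictor's actual interface and not measured data.

```python
from dataclasses import dataclass

MEMORY_LEVELS = ("DRAM", "GB", "NoC", "RF")
PE_ARRAY_TYPES = ("spatial", "systolic", "adder_tree")
NOC_TYPES = ("unicast", "multicast", "broadcast")


@dataclass
class ArchitectureParams:
    """Architecture-dependent hardware parameters (hypothetical field names)."""
    pe_array: str          # one of PE_ARRAY_TYPES
    num_pes: int           # number of PEs in the array
    noc: str               # one of NOC_TYPES
    capacity_bits: dict    # storage capacity per memory level, in bits
    bandwidth_bits: dict   # bits transferred per cycle, per memory level

    def __post_init__(self):
        # Reject configurations outside the design choices listed in the text.
        if self.pe_array not in PE_ARRAY_TYPES:
            raise ValueError(f"unknown PE array type: {self.pe_array}")
        if self.noc not in NOC_TYPES:
            raise ValueError(f"unknown NoC design: {self.noc}")
        if self.num_pes <= 0:
            raise ValueError("num_pes must be positive")


@dataclass
class TechnologyParams:
    """Technology-dependent IP parameters (hypothetical field names)."""
    e_mac: float       # unit energy of one MAC operation
    e_access: dict     # unit energy of one access, per memory level
    clock_hz: float    # clock frequency


# Placeholder values for illustration only.
arch = ArchitectureParams(
    pe_array="spatial",
    num_pes=64,
    noc="multicast",
    capacity_bits={"GB": 8 * 64 * 1024, "RF": 8 * 512},
    bandwidth_bits={"DRAM": 64, "GB": 128, "RF": 16},
)
tech = TechnologyParams(
    e_mac=1.0,
    e_access={"DRAM": 200.0, "GB": 6.0, "NoC": 2.0, "RF": 1.0},
    clock_hz=200e6,
)
```

Keeping the architecture-dependent and technology-dependent groups in separate records mirrors the split made in the text, so the same architecture can be re-evaluated under a different set of unit costs.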
Dataflow. This describes how a DNN is temporally and
spatially scheduled to be executed in an accelerator. Specif-
ically, a dataflow answers the following questions: (1) how
to map and schedule the computations in the PE array and
within each PE?; and (2) what are the loop ordering and tiling
factors on the DRAM and global buffer levels? The former
captures the design choice of holding a certain type of data
locally in the PE once being fetched from the memories, e.g.,
row/weight/output stationary. The latter determines how data are
stored in SRAM and DRAM so that the chosen stationary scheme is
accommodated effectively. These two questions can be described
using three groups of parameters, defined below in the context of
the example in Fig. 1: (1) loop ordering factors for the
twenty-four nested for-loops associated with the six dimensions of
the 3D convolution operation and the four considered memory
hierarchies (i.e., DRAM, GB, NoC, and RF); (2) loop tiling factors
for the same twenty-four nested for-loops; and (3) data access
locations, i.e., at which of the nested for-loops the on-chip GB
and in-PE RFs are refreshed for the activations and weights.
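The nested for-loop description can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions: the `Loop` record, the helper names, the toy layer shape, and every tiling and ordering factor are hypothetical and chosen only so that the numbers are easy to check by hand. It builds the 6 x 4 = 24 loops, marks the NoC-level loops as parallel-for, and checks that the tiling factors of each dimension multiply back to the layer size.

```python
from dataclasses import dataclass
from math import prod

LEVELS = ("DRAM", "GB", "NoC", "RF")     # outermost to innermost
DIMS = ("M", "C", "R", "S", "E", "F")    # six CONV dimensions


@dataclass(frozen=True)
class Loop:
    level: str   # memory hierarchy the loop belongs to
    dim: str     # CONV dimension the loop iterates over
    bound: int   # loop tiling factor (loop bound)

    @property
    def primitive(self) -> str:
        # Spatial mapping is described only at the NoC level (the PE array).
        return "parallel-for" if self.level == "NoC" else "for"


def build_loop_nest(tiling, ordering):
    """tiling[level][dim] gives the bound; ordering[level] lists dims, outermost first."""
    return [Loop(level, dim, tiling[level][dim])
            for level in LEVELS for dim in ordering[level]]


def covers_layer(nest, layer_shape):
    """The tiling factors of each dimension must multiply to the layer size."""
    return all(prod(l.bound for l in nest if l.dim == d) == layer_shape[d]
               for d in DIMS)


# Toy CONV layer and dataflow (all values are made up for illustration).
layer_shape = {"M": 8, "C": 4, "R": 3, "S": 3, "E": 4, "F": 4}
tiling = {
    "DRAM": {"M": 2, "C": 1, "R": 1, "S": 1, "E": 2, "F": 1},
    "GB":   {"M": 1, "C": 2, "R": 1, "S": 1, "E": 1, "F": 2},
    "NoC":  {"M": 2, "C": 1, "R": 3, "S": 1, "E": 2, "F": 1},
    "RF":   {"M": 2, "C": 2, "R": 1, "S": 3, "E": 1, "F": 2},
}
ordering = {level: DIMS for level in LEVELS}  # same order at every level

nest = build_loop_nest(tiling, ordering)
num_active_pes = prod(l.bound for l in nest if l.level == "NoC")

for l in nest:
    print(f"{l.primitive:12s} {l.dim}{LEVELS.index(l.level)} in range({l.bound})")
```

In this toy mapping the product of the NoC-level bounds gives the number of PEs used spatially, which is one way to relate the loop description to the PE array size.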
Fig. 2: A high-level view of the DNN-Chip Predictor.
3.2. The DNN-Chip Predictor
3.2.1. Overview
Fig. 2 shows a high-level view of the proposed DNN-Chip
Predictor, which accepts DNN models (e.g., number of
layers, layer structure, bit-precision, etc.), hardware architec-
tures (e.g., memory hierarchy, number of PEs, NoC design,
etc.), dataflows (e.g., row/weight/output stationary, loop
tiling/ordering factors, etc.), and technology-dependent unit
costs (e.g., unit energy/delay cost of a MAC operation and
memory accesses to various memory hierarchies), and then
outputs the estimated energy consumption, latency, and
throughput when executing the DNN in a target accelerator. It
thus can be used to (1) validate DNN accelerator techniques
prior to the time- and cost-consuming DNN ASIC/FPGA
accelerator implementation, and (2) perform time-efficient
design space exploration and optimization.
3.2.2. The Proposed Analytical Models
This subsection introduces the Predictor’s analytical models.
Energy Models. DNN accelerators’ energy cost includes
both computational ($E_{comp}$) and data movement ($E_{DM}$) costs,
where $E_{comp} = N_{MAC} \times e_{MAC}$, with $N_{MAC}$ denoting the
total number of MACs in the DNN and $e_{MAC}$ the unit energy cost
of a MAC operation. Similarly, the data movement cost can be
calculated by multiplying the unit energy cost per access
($e_{DM_{i,j}}$, $j \in \{I,O,W\}$) with the total number of accesses
($N_{DM_{i,j}}$, $j \in \{I,O,W\}$) to the i-th memory hierarchy
(e.g., GB) for the j-th type of data (i.e., inputs (I), outputs (O),
and weights (W)):
$$E_{DM} = \sum_{i \in S_{Memory}} \sum_{j \in \{I,O,W\}} N_{DM_{i,j}} \times e_{DM_{i,j}} \tag{1}$$
where $S_{Memory} = \{DRAM \rightarrow GB, GB \rightarrow NoC, NoC \rightarrow RF, RF \rightarrow PE\}$
for inputs/weights; and
$S_{Memory} = \{DRAM \leftrightarrow GB, GB \leftrightarrow NoC, NoC \leftrightarrow RF, RF \leftrightarrow MAC\}$
for outputs.
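A minimal sketch of the computational cost and Eq. (1) is given below. It assumes the access counts are already known; the function name, the dictionary layout keyed by (memory level, data type), and all numbers are hypothetical placeholders rather than values from any chip.

```python
def energy_breakdown(n_mac, e_mac, n_dm, e_dm):
    """Return (E_comp, E_DM) for one layer.

    n_dm[(level, dtype)] is the number of accesses and e_dm[(level, dtype)]
    the unit energy per access, with dtype in {"I", "O", "W"}.
    """
    e_comp = n_mac * e_mac
    # Eq. (1): sum over memory levels and data types.
    e_dm_total = sum(count * e_dm[key] for key, count in n_dm.items())
    return e_comp, e_dm_total


# Placeholder access counts and normalized unit energies.
n_dm = {
    ("DRAM", "I"): 10, ("DRAM", "O"): 5, ("DRAM", "W"): 20,
    ("GB", "I"): 100, ("GB", "O"): 50, ("GB", "W"): 40,
}
e_dm = {key: (200.0 if key[0] == "DRAM" else 6.0) for key in n_dm}

e_comp, e_dm_total = energy_breakdown(n_mac=1000, e_mac=1.0, n_dm=n_dm, e_dm=e_dm)
print(e_comp, e_dm_total)
```

Keying both dictionaries by the same (level, data type) pair keeps the double sum of Eq. (1) as a single pass over the access counts.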
The key challenge is to obtain $N_{DM_{i,j}}$ for various memory
hierarchies and data types when using different DNN
models, hardware architectures, and dataflows. We are the first
to find that $N_{DM_{i,j}}$ can be calculated as the product of the
j-th data volume ($V_{ref_{i,j}}$) involved in each refresh and the
total number of such refreshes ($N_{ref_{i,j}}$) for the i-th memory:
$$N_{DM_{i,j}} = N_{ref_{i,j}} \times V_{ref_{i,j}} \tag{2}$$
To obtain $N_{ref_{i,j}}$ and $V_{ref_{i,j}}$, we propose an intuitive
methodology: (1) we first choose a refresh location in the nested
for-loops (see Fig. 1) for a given data type, which can be
straightforwardly decided once the dataflow is known; (2)
$N_{ref_{i,j}}$ is equal to the product of all the loop bounds in the
for-loops above the refresh location; and (3) $V_{ref_{i,j}}$ is equal
to the product of all the loop bounds in the for-loops that are
below the refresh location and associated with the particular type
of data. Once $N_{ref_{i,j}}$ and $V_{ref_{i,j}}$ are obtained, the
energy can be calculated as:
$$E_{DRAM} = \sum_{j \in \{I,O,W\}} N_{ref_{GB,j}} \times V_{ref_{GB,j}} \times e_{DM_{DRAM,j}} \tag{3}$$
$$E_{GB} = \sum_{j \in \{I,O,W\}} N_{ref_{RF,j}} \times V_{ref_{RF,j}} \times \frac{N_{PE}}{M_j} \times e_{DM_{GB,j}} \tag{4}$$
$$E_{NoC} = \sum_{j \in \{I,O,W\}} N_{ref_{RF,j}} \times V_{ref_{RF,j}} \times N_{PE} \times e_{DM_{NoC,j}} \tag{5}$$
$$E_{RF} = \sum_{j \in \{I,O,W\}} N_{MAC} \times e_{DM_{RF,j}} \tag{6}$$
where $N_{PE}$ is the number of active PEs and $M_j$ is the number
of PEs that share the same data.
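The refresh-location rule can be illustrated with a short Python sketch. It is a simplified reading of the three steps above, not the authors' implementation: the loop list, the refresh index, the unit energy, and the mapping from data types to "associated" dimensions are assumptions made for this example. Only weights and outputs are shown, because the input volume also depends on how the sliding window overlaps, which this sketch does not model.

```python
from math import prod

# Assumed association between data types and CONV dimensions:
# weights are indexed by (M, C, R, S), outputs by (M, E, F).
RELEVANT_DIMS = {
    "W": {"M", "C", "R", "S"},
    "O": {"M", "E", "F"},
}


def refresh_counts(loops, refresh_idx, dtype):
    """loops: list of (dim, bound), outermost first.

    Loops before refresh_idx are "above" the refresh location,
    loops from refresh_idx onward are "below" it.
    """
    above, below = loops[:refresh_idx], loops[refresh_idx:]
    n_ref = prod(bound for _, bound in above)                  # step (2)
    v_ref = prod(bound for dim, bound in below
                 if dim in RELEVANT_DIMS[dtype])               # step (3)
    return n_ref, v_ref


def dram_energy(n_ref, v_ref, e_dram):
    """One term of Eq. (3): N_ref x V_ref x unit DRAM access energy."""
    return n_ref * v_ref * e_dram


def gb_energy(n_ref, v_ref, n_pe, m_j, e_gb):
    """One term of Eq. (4), with N_PE active PEs and M_j PEs sharing the data."""
    return n_ref * v_ref * (n_pe / m_j) * e_gb


# Toy loop nest with the GB refresh placed below the two outermost loops.
loops = [("M", 2), ("E", 2), ("C", 4), ("R", 3), ("S", 3), ("F", 4)]
n_ref_w, v_ref_w = refresh_counts(loops, refresh_idx=2, dtype="W")
n_ref_o, v_ref_o = refresh_counts(loops, refresh_idx=2, dtype="O")

n_dm_w = n_ref_w * v_ref_w   # Eq. (2) for weights
e_dram_w = dram_energy(n_ref_w, v_ref_w, e_dram=200.0)  # placeholder unit energy
```

In this toy case the weights are refreshed 2 x 2 = 4 times and each refresh moves 4 x 3 x 3 = 36 weights, so Eq. (2) gives 144 weight accesses.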
Latency Models. Similarly, the latency of DNN accelerators
can be formulated as:
$$L = L_{setup} + \max\{L_{DRAM}, L_{GB}, L_{comp}\} \tag{7}$$
where $L_{comp}$, $L_{DRAM}$, $L_{GB}$, and $L_{setup}$ denote the
latency of computation in the PE array, accessing the DRAM from
the GB, accessing the GB from an RF in the PEs, and setting up
the first set of the weights and inputs, respectively. Denoting
the bit precision of inputs/outputs/weights as $N^{j}_{bit}$,
$j \in \{I,O,W\}$, we have:
$$L_{comp} = N_{MAC} \times t_{comp} \tag{8}$$
$$L_{DRAM} = \max_{j \in \{I,O,W\}} N_{ref_{GB,j}} \times \frac{V_{ref_{GB,j}} \times N^{j}_{bit}}{\min\{BW^{j}_{GB}, BW_{DRAM}\}} \tag{9}$$
$$L_{GB} = \max_{j \in \{I,O,W\}} N_{ref_{RF,j}} \times \frac{V_{ref_{RF,j}} \times N^{j}_{bit} \times N_{PE}}{BW^{j}_{GB}} \tag{10}$$
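$$L_{setup} = \max(L'_{DRAM}, L'_{GB}) \tag{11}$$
$$L'_{DRAM} = \max_{j \in \{I,W\}} \frac{V_{ref_{GB,j}} \times N^{j}_{bit}}{\min\{BW^{j}_{GB}, BW_{DRAM}\}} \tag{12}$$
$$L'_{GB} = \max_{j \in \{I,W\}} \frac{V_{ref_{RF,j}} \times N^{j}_{bit}}{\min\{BW^{j}_{RF}, BW^{j}_{GB}\}} \tag{13}$$
where $BW^{j}_{i}$ is the memory bandwidth of the i-th memory
hierarchy for the data type $j \in \{I,O,W\}$.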
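The latency model can be sketched in Python as follows. This is an illustrative reading of Eqs. (7)-(13) under several assumptions: the per-refresh volume V_ref is taken as the quantity transferred in Eqs. (10) and (13), t_comp is treated as an effective time per MAC for the whole PE array, bandwidths are in bits per cycle, and all function names, argument names, and numbers are hypothetical.

```python
def transfer_latency(n_ref, v_ref, n_bit, bandwidth, n_pe=1):
    """Cycles to move n_ref refreshes of v_ref elements of n_bit bits each."""
    return n_ref * v_ref * n_bit * n_pe / bandwidth


def predict_latency(n_mac, t_comp, refs_gb, refs_rf, n_bit,
                    bw_gb, bw_rf, bw_dram, n_pe):
    """refs_gb[j] and refs_rf[j] are (N_ref, V_ref) pairs for j in "I", "O", "W"."""
    l_comp = n_mac * t_comp                                            # Eq. (8)
    l_dram = max(transfer_latency(*refs_gb[j], n_bit[j],
                                  min(bw_gb[j], bw_dram))
                 for j in "IOW")                                       # Eq. (9)
    l_gb = max(transfer_latency(*refs_rf[j], n_bit[j], bw_gb[j], n_pe)
               for j in "IOW")                                         # Eq. (10)
    # Setup: a single refresh of inputs and weights at each level.
    l_setup_dram = max(transfer_latency(1, refs_gb[j][1], n_bit[j],
                                        min(bw_gb[j], bw_dram))
                       for j in "IW")                                  # Eq. (12)
    l_setup_gb = max(transfer_latency(1, refs_rf[j][1], n_bit[j],
                                      min(bw_rf[j], bw_gb[j]))
                     for j in "IW")                                    # Eq. (13)
    l_setup = max(l_setup_dram, l_setup_gb)                            # Eq. (11)
    return l_setup + max(l_dram, l_gb, l_comp)                         # Eq. (7)


# Placeholder inputs for illustration only.
latency = predict_latency(
    n_mac=1000, t_comp=0.125,
    refs_gb={"I": (4, 64), "O": (4, 16), "W": (2, 72)},
    refs_rf={"I": (8, 4), "O": (8, 2), "W": (4, 9)},
    n_bit={"I": 16, "O": 16, "W": 16},
    bw_gb={"I": 64, "O": 64, "W": 64},
    bw_rf={"I": 16, "O": 16, "W": 16},
    bw_dram=32, n_pe=4,
)
print(latency)
```

With these placeholder numbers the DRAM term (128 cycles) dominates the computation term (125 cycles) and the GB term (36 cycles), and the setup adds 36 cycles, so the sketch returns 164 cycles.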
4. EXPERIMENT RESULTS
We validate our proposed DNN-Chip Predictor by comparing
its predicted performance with actual chip measured ones
in [14], FPGA implementation results in [28], and synthesis
results based on a commercial CMOS technology, under the
same experiment settings (e.g., unit energy, clock frequency,
DNN model, architecture design and dataflow, etc).
Fig. 3: The # of (L) DRAM and (R) GB accesses in Eyeriss [29]
and our Predictor for AlexNet’s CONV layers.
Table 1: The energy breakdown from Eyeriss [29] and our
Predictor, for the CONV1 and CONV5 of AlexNet [30].
Layer   comp.           RF              NoC             GB
        Meas.   Pred.   Meas.   Pred.   Meas.   Pred.   Meas.   Pred.
CONV1   16.7%   18.7%   79.6%   74.4%   1.7%    4.8%    2.0%    2.0%
∆           2.08%          -5.15%           3.10%          -0.03%
CONV5   7.3%    7.5%    80.3%   79.1%   5.3%    7.0%    7.0%    6.3%
∆           0.26%          -1.16%           1.64%          -0.74%
Validation against Chip Measurements. For this set of
experiments, we compare our Predictor’s predicted performance
with Eyeriss’s chip measurement results using their
normalized unit energy [14]. First, Table 1 compares the
energy breakdown of AlexNet’s first and fifth CONV layers
(denoted as CONV1 and CONV5, respectively), showing that
the maximum difference is 5.15% and 1.64%, respectively.
Second, Fig. 3 compares the number of DRAM/GB ac-
cesses. The difference between the predicted number of
DRAM accesses and Eyeriss’s measured results is between
2.18% and 12.10%, while the difference in terms of GB
accesses is between -0.70% and 17.66%. Our Predictor’s
predicted number of DRAM accesses is smaller than that of Eyeriss
because the overhead of run-length compression (RLC) for sparse
activations depends on the input images, and we lack information
about which set of images was used in Eyeriss’s measurements.
Additionally,
Fig. 3 shows that the difference between the predicted number
of GB accesses and Eyeriss’s results is less than 5% except for
the CONV1 layer, where the relatively larger prediction error is
caused by its larger stride of 4. Specifically, a larger
stride leads to lower utilization of inputs fetched from the
GB, whereas our current Predictor considers the generic case
where the stride is 1, as this is more common in recent DNN
models. For better prediction accuracy, our Predictor can be
extended to cover other stride values by adding the corresponding
cases to the analytical models in Section 3.2.2.
Fig. 4: Comparison on the inference latency from Eyeriss [29]
and our Predictor when running AlexNet.
Table 2: The energy breakdown from synthesized results
and our Predictor for AlexNet’s CONV3-CONV5 layers.
Layer   comp. (%)               RF (%)                  GB (%)
        Syn.    Pred.   ∆       Syn.    Pred.   ∆       Syn.    Pred.   ∆
CONV3   38.76   34.49   4.26    60.99   65.25   4.26    0.24    0.25    0.01
CONV4   39.46   34.28   5.19    60.28   65.45   5.16    0.25    0.27    0.02
CONV5   31.13   25.85   5.28    68.65   73.91   5.26    0.22    0.24    0.02
Third, Fig. 4 compares the latency of executing AlexNet’s
five CONV layers, and shows that the predicted latencies and
Eyeriss’s measured ones differ by ≤15.51%. The predicted latency
is smaller than the measured one because our Predictor’s
analytical models do not consider the corner cycles in which
memory accesses and computation cannot be fully pipelined and
processing stalls occur. Finally, the predicted throughput of
executing AlexNet is 46.0 GOPS while the one measured by Eyeriss
is 51.6 GOPS, showing a prediction error of ≤11%.
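The throughput figure quoted above can be checked with a one-line calculation. The sketch below assumes that the prediction error is taken relative to the measured value; the helper name is hypothetical.

```python
def relative_error(predicted, measured):
    """Absolute prediction error as a fraction of the measured value."""
    return abs(predicted - measured) / measured


# Throughput values quoted in the text, in GOPS.
throughput_error = relative_error(predicted=46.0, measured=51.6)
print(f"{throughput_error:.2%}")  # about 10.85%, within the stated 11% bound
```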
Validation against FPGA Implementation. We com-
pare our Predictor’s predicted latency with FPGA measured
ones under the same DNN model and hardware configurations [31].
Specifically, for the FPGA implementation, we use the open-source
implementation of the award-winning design [31] in a
state-of-the-art design contest [32]. Fig. 5 shows that our Predictor’s
predicted latency differs from the FPGA-synthesized ones by
≤ 16.84%. Note that in FPGA implementations the GB can be
partitioned into smaller chunks to be accessed simultaneously
for increasing the parallelism and minimizing the latency. Our
current models do not include the overhead of this partition,
which is larger when the GB is partitioned into more chunks
for layers with a larger size, leading to a larger prediction error
for the CONV4/CONV5/CONV6 layers in Fig. 5.
[Fig. 5 plot: latency (ms, left axis) of the Predicted and FPGA synthesized results and prediction errors (%, right axis) for CONV1 to CONV7; bar labels as extracted: 0.99, 1.00, 1.94, 1.95, 4.15, 4.60, 8.84, 9.09, 17.14, 18.05, 17.32, 17.95, 1.96, 1.96; annotations: Max: 9.88, Min: 0.25.]
Fig. 5: Our Predictor’s predicted latency and the FPGA
measured one for the 7 CONV layers of SkyNet [28].
Validation against Synthesis Results. Table 2 compares
the Predictor’s energy breakdown with that from the synthesis
results for AlexNet’s CONV3-CONV5 layers for an in-house dedicated
accelerator synthesized in a commercial 65nm CMOS technology. It
can be seen from Table 2 that the difference between our
Predictor’s predicted energy breakdown and that from the synthesis
results is no more than 5.28%.
5. CONCLUSION
To close the gap between the growing demand for dedicated
DNN accelerators with various specifications and the time-
consuming and challenging DNN accelerator design, we de-
velop DNN-Chip Predictor, which can efficiently and effec-
tively predict an accelerator’s energy, latency, and resource
consumption. Such an analytical performance prediction tool
will facilitate fast development of innovations for not only
DNN accelerators but also hardware-aware efficient DNNs.
6. ACKNOWLEDGEMENT
The work is supported by the National Science Foundation
(NSF) through the Division of Electrical, Communications and
Cyber Systems (ECCS) (Award number: 1934767).
References
[1] Karen Simonyan et al., “Very deep convolutional net-
works for large-scale image recognition,” CoRR, vol.
abs/1409.1556, 2014.
[2] Yue Wang et al., “E2-Train: Training State-of-the-art
CNNs with Over 80% Energy Savings,” in Advances in
Neural Information Processing Systems, 2019, pp. 5139–
5151.
[3] Wayne Xiong et al., “The microsoft 2016 conversational
speech recognition system,” in Acoustics, Speech and
Signal Processing (ICASSP), 2017 IEEE International
Conference on. IEEE, 2017, pp. 5255–5259.
[4] Sicong Liu et al., “On-demand deep model compres-
sion for mobile devices: A usage-driven model selection
framework,” in Proceedings of the 16th Annual Interna-
tional Conference on Mobile Systems, Applications, and
Services. ACM, 2018, pp. 389–400.
[5] Yue Wang et al., “Energynet: Energy-efficient dynamic
inference,” in Advances in Neural Information Process-
ing Systems (Workshop), 2018.
[6] Jianghao Shen et al., “Fractional Skipping: Towards
Finer-Grained Dynamic Inference,” in The Thirty-Fourth
AAAI Conference on Artificial Intelligence, 2020.
[7] Junru Wu et al., “Deep k-Means: Re-Training and Pa-
rameter Sharing with Harder Cluster Assignments for
Compressing Deep Convolutions,” in Thirty-fifth Inter-
national Conference on Machine Learning, 2018.
[8] Yue Wang et al., “Dual dynamic inference: Enabling
more efficient, adaptive and controllable deep inference,”
IEEE Journal of Selected Topics in Signal Processing,
2019.
[9] Xiaofan Zhang et al., “DNNBuilder: an automated tool
for building high-performance dnn hardware accelerators
for FPGAs,” in Proc. of ICCAD, 2018.
[10] Yingyan Lin et al., “Predictivenet: An energy-efficient
convolutional neural network via zero prediction,” in
2017 IEEE International Symposium on Circuits and
Systems (ISCAS), May 2017, pp. 1–4.
[11] Google Inc., “Edge TPU,” https://cloud.google.com/tpu/,
accessed 2019-09-01.
[12] Google Inc., “Edge TPU,”
https://coral.withgoogle.com/docs/edgetpu/faq/,
accessed 2019-09-01.
[13] Zidong Du et al., “Shidiannao: Shifting vision process-
ing closer to the sensor,” in ACM SIGARCH Computer
Architecture News. ACM, 2015, vol. 43, pp. 92–104.
[14] Yu-Hsin Chen et al., “Eyeriss: A spatial architecture
for energy-efficient dataflow for convolutional neural
networks,” in Computer Architecture (ISCA), 2016
ACM/IEEE 43rd Annual International Symposium on.
IEEE Press, 2016, pp. 367–379.
[15] Song Han et al., “Eie: efficient inference engine on com-
pressed deep neural network,” in 2016 ACM/IEEE 43rd
Annual International Symposium on Computer Architec-
ture (ISCA). IEEE, 2016, pp. 243–254.
[16] Chen Zhang et al., “Optimizing fpga-based accelerator
design for deep convolutional neural networks,” in Pro-
ceedings of the 2015 ACM/SIGDA International Sympo-
sium on Field-Programmable Gate Arrays. ACM, 2015,
pp. 161–170.
[17] Tianshi Chen et al., “Diannao: A small-footprint high-
throughput accelerator for ubiquitous machine-learning,”
in ACM Sigplan Notices. ACM, 2014, vol. 49, pp. 269–
284.
[18] Tianqi Tang et al., “Mlpat: A power, area, timing model-
ing framework for machine learning accelerators,” The
Second International Workshop on Domain Specific Sys-
tem Architecture (DOSSA), 2019.
[19] Zhiqiang Liu et al., “Throughput-optimized fpga accel-
erator for deep convolutional neural networks,” ACM
Transactions on Reconfigurable Technology and Systems
(TRETS), vol. 10, no. 3, pp. 17, 2017.
[20] Ananda Samajdar et al., “Scale-sim: Systolic cnn ac-
celerator simulator,” arXiv preprint arXiv:1811.02883,
2018.
[21] Angshuman Parashar et al., “Timeloop: A systematic
approach to dnn accelerator evaluation,” in 2019 IEEE
International Symposium on Performance Analysis of
Systems and Software (ISPASS). IEEE, 2019, pp. 304–
315.
[22] Yu-Hsin Chen et al., “Eyeriss v2: A flexible accelerator
for emerging deep neural networks on mobile devices,”
arXiv preprint arXiv:1807.07928, 2018.
[23] Yannan Wu et al., “Accelergy: An architecture-level
energy estimation methodology for accelerator designs,”
2019.
[24] Xuan Yang et al., “Dnn dataflow choice is overrated,”
arXiv preprint arXiv:1809.04070, 2018.
[25] Jonathan Ragan-Kelley et al., “Halide: a language and
compiler for optimizing parallelism, locality, and recom-
putation in image processing pipelines,” Acm Sigplan
Notices, vol. 48, no. 6, pp. 519–530, 2013.
[26] Hyoukjun Kwon et al., “Understanding reuse, perfor-
mance, and hardware cost of dnn dataflows: A data-
centric approach,” in Proceedings of the 52nd Annual
IEEE/ACM International Symposium on Microarchitec-
ture. ACM, 2019, pp. 754–768.
[27] Adam Paszke et al., “Automatic differentiation in py-
torch,” 2017.
[28] Cong Hao et al., “Fpga/dnn co-design: An efficient
design methodology for iot intelligence on the edge,”
Proc. of DAC, 2019.
[29] Yu-Hsin Chen et al., “Eyeriss: An energy-efficient re-
configurable accelerator for deep convolutional neural
networks,” IEEE Journal of Solid-State Circuits, vol. 52,
no. 1, pp. 127–138, 2017.
[30] Alex Krizhevsky et al., “Imagenet classification with
deep convolutional neural networks,” in Advances in
Neural Information Processing Systems 25, F. Pereira,
C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds.,
pp. 1097–1105. Curran Associates, Inc., 2012.
[31] Xiaofan Zhang et al., “Skynet: A champion model for
dac-sdc on low power object detection,” arXiv preprint
arXiv:1906.10327, 2019.
[32] Xilinx, Nvidia, and DJI, “Dac 2019 system design con-
test,” 2019.
