An End-to-End HW/SW Co-Design Methodology to Design Efficient Deep
  Neural Network Systems using Virtual Models by Klaiber, Michael J. et al.
An End-to-End HW/SW Co-Design Methodology to Design
Efficient Deep Neural Network Systems using Virtual Models
Michael J. Klaiber
michael.klaiber@de.bosch.com
Robert Bosch Corporate Research
Renningen, Germany
Sebastian Vogel
sebastian.vogel@de.bosch.com
Robert Bosch Corporate Research
Renningen, Germany
Axel Acosta
axel.acosta@de.bosch.com
Robert Bosch Corporate Research
Renningen, Germany
Robert Korn
robert.korn@de.bosch.com
Robert Bosch Corporate Research
Renningen, Germany
Leonardo Ecco
leonardo.ecco@de.bosch.com
Robert Bosch Corporate Research
Renningen, Germany
Kristine Back
kristine.back@de.bosch.com
Robert Bosch Corporate Research
Renningen, Germany
Andre Guntoro
andre.guntoro@de.bosch.com
Robert Bosch Corporate Research
Renningen, Germany
Ingo Feldner
ingo.feldner@de.bosch.com
Robert Bosch Corporate Research
Renningen, Germany
ABSTRACT
End-to-end performance estimation and measurement of deep neural
network (DNN) systems become more important with increasing
complexity of DNN systems consisting of hardware and software
components. The methodology proposed in this paper aims at a
reduced turn-around time for evaluating different design choices of
hardware and software components of DNN systems. This reduc-
tion is achieved by moving the performance estimation from the
implementation phase to the concept phase by employing virtual
hardware models instead of gathering measurement results from
physical prototypes. Deep learning compilers introduce hardware-
specific transformations and are, therefore, considered a part of
the design flow of virtual system models to extract end-to-end per-
formance estimations. To validate the run-time accuracy of the
proposed methodology, a system processing the DilatedVGG DNN
is realized both as virtual system model and as hardware implemen-
tation. The results show that up to 92 % accuracy can be reached in
predicting the processing time of the DNN inference.
CCS CONCEPTS
• Computing methodologies → Modeling methodologies; •
Hardware → Best practices for EDA; Methodologies for EDA.
KEYWORDS
Virtual Prototyping, Hardware Model, Simulation Methodology,
Design Space Exploration
Embedded Systems Week 2019, October 13–18, 2019, New York, NY, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-7652-5/19/10. . . $15.00
https://doi.org/10.1145/3372394.3372396
1 INTRODUCTION AND MOTIVATION
Systems that perform the inference phase of trained deep neural
networks (DNNs) have become more important in many fields, from
automotive to Industry 4.0 applications. In the early days of neural
network research, there was a limited number of computational
kernels, which lent themselves to simple hardware structures for
efficient execution. With the emergence of graphics processing units
(GPUs), deeper neural network structures became feasible to train and
deploy, as GPUs were able to map those computational kernels
efficiently. Modern DNNs, however, have become more complex in
structure and operations, e.g. Inception-v4 or NASNet [3, 9].
Future DNNs are expected to become even more complex in structure
and in the number of different operations, e.g. DNNs generated
by the emerging field of neural architecture search [1, 9]. Designing
systems to process current and future DNNs therefore requires
an efficient and powerful methodology to balance compute and
communication resources.
Models of hardware or software components that only mimic
the timing behavior and the memory transactions of a component
while neglecting functional computation are referred to as
non-functional virtual models. The methodology used is in essence
similar to transaction-level modeling. All models introduced in the
following are implicitly non-functional. Virtual hardware models
are executable high-level descriptions of hardware elements (e.g.
CPUs, interconnects or memories) that annotate operations with
simulation cycles. A deep learning compiler breaks the DNN graph
down into a graph in which each node represents a memory access
or processing cycles on a virtual hardware model. This graph is
called the task graph and is effectively a virtual software model.
A combination of multiple virtual hardware models (e.g. as shown
in Figure 1) and a task graph is referred to as an abstract virtual
system model (AVSM). Due to the high abstraction level of an AVSM,
these models are much faster than a simulation at register-transfer
level (RTL) and allow system aspects, e.g. timing behavior, to be
modeled within a tolerable error.
©ACM 2019 This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive version was published in
Embedded Systems Week 2019, INTelligent Embedded Systems Architectures and Applications Workshop 2019, https://doi.org/10.1145/3372394.3372396
arXiv:1910.11632v2 [cs.LG] 18 Nov 2019
In a DNN system, the requirements on processing elements,
as well as the requirements on communication infrastructure, are
mainly driven by the topology and the arithmetic operations of
the target DNNs. When evaluating novel concepts for DNN
systems that consist of both hardware and software components,
an accepted approach is to implement a prototype and measure
its performance on the target DNNs. This requires development of
software and hardware components (and possibly manufacturing
of hardware) for the evaluation of one specific concept. The huge
design space for DNN systems, spanning the algorithmic domain, the
software domain and the hardware domain, makes finding and
evaluating efficient concepts very time-consuming.
There are also analytical approaches [2, 7, 8] for designing DNN
systems, e.g. by analyzing bandwidth and computation requirements
of the DNN and applying techniques like loop tiling or
transformation [8]. However, deep learning compilers, which transform
the DNN graph and hence create the task graph, are often neglected in
this optimization process. Some approaches use statistical methods for
performance estimation. They describe a system by the frequency at
which events of previously known communication patterns occur,
whereas simulation considers causality. Therefore,
simulation is better suited to detecting communication bottlenecks
and blocking behavior.
The lack of an end-to-end methodology that considers both
hardware architecture and software tool chain is particularly
apparent from the large number of publications that describe
implementations of DNN systems for one specific design point.
In this paper, we address these challenges by proposing a
simulation-based end-to-end methodology that uses virtual hardware
models in combination with a deep learning compiler to evaluate
the performance of novel concepts for DNN systems.
2 METHODOLOGY
Evaluating novel concepts for DNN systems by implementing physical
prototypes requires a full iteration of the hardware and software
development cycle for only a single design point. For entirely new
hardware concepts, designing and fully implementing an extension
of the tool chain might even be necessary. This significantly limits
the number of iterations in the development cycle within the scope
of a project, i.e. it limits the design space that can be explored.
Figure 1 shows the implementation-based prototyping flow, as well as
the virtual-system-based prototyping flow.
The DNN system architecture shown in Figure 2 is based on the
common properties of state-of-the-art DNN systems and, therefore,
serves as the starting point to evaluate the methodology proposed in
the following. Feature maps, weights and intermediate data are stored in
external memory. The matrix multiplications and other arithmetic
operations are performed by the Neural Complex Engine (NCE).
Memory transactions are carried out by direct memory access
(DMA) hardware. All components of the system are connected
by an interconnect and controlled by a house-keeping processor
(HKP) that executes the task graph. The flows from Figure 1(a) are
exemplified for the DNN system architecture shown in Figure 2.
Figure 1: Similarities and differences of implementation-
based and virtual-system-based prototyping flows.
(Diagram labels: Deep Neural Network Graph, Deep Learning Compiler,
Hardware Constraints, HW-adapted Task Graph, Physical Annotations,
Virtual-system-based Prototyping, Virtual Hardware Model, Simulation
Kernel, Simulation Results, Implementation-based System Prototyping,
Compiler Toolchain, Hardware Implementation, Executable, Measurement
Results.)
Figure 2: Base architecture of DNN system.
(Diagram labels: External Memory, DMA, HKP, NCE, Multiplier Array,
On-chip Memory.)
In both flows, the deep learning compiler converts the DNN
graph to a hardware-adapted task graph according to the hardware
constraints that are either provided by the virtual hardware model
or the hardware implementation. The resulting task graph considers
the memory hierarchy, the on-chip memory sizes and the supported
operations of the DNN system.
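How on-chip memory size can shape the task graph is sketched below. This is an illustrative lowering rule, not the compiler used in the paper: the function `lower_conv_layer`, its byte-based tiling heuristic and the example sizes are assumptions, and effects such as overlapping tile borders or double buffering are ignored.

```python
import math


def lower_conv_layer(name, input_bytes, weight_bytes, output_bytes, onchip_bytes):
    """Split one convolution into Load/Conv2D/Store tasks that fit on-chip.

    Returns a list of (kind, task name, bytes moved) tuples. Weights are
    assumed to stay resident; the remaining space is shared by one input
    tile and one output tile (a simplifying assumption).
    """
    if weight_bytes >= onchip_bytes:
        raise ValueError("weights alone do not fit into on-chip memory")
    budget = onchip_bytes - weight_bytes
    n_tiles = math.ceil((input_bytes + output_bytes) / budget)
    tasks = [("Load", f"{name}_weights", weight_bytes)]
    for i in range(n_tiles):
        tasks.append(("Load", f"{name}_in_{i}", math.ceil(input_bytes / n_tiles)))
        tasks.append(("Conv2D", f"{name}_tile_{i}", 0))  # compute task, no bus traffic
        tasks.append(("Store", f"{name}_out_{i}", math.ceil(output_bytes / n_tiles)))
    return tasks


# Hypothetical layer: 600 kB input, 100 kB weights, 400 kB output,
# mapped onto a hypothetical 350 kB on-chip memory.
tasks = lower_conv_layer("conv", 600_000, 100_000, 400_000, 350_000)
n_compute_tasks = sum(1 for kind, _, _ in tasks if kind == "Conv2D")
```

Changing `onchip_bytes` changes the number of tiles and therefore the number of memory transactions, which is why the compiler has to be part of any end-to-end estimate.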
The implementation-based prototyping flow requires an implementation
of all the hardware at RTL. There are two ways to evaluate
the performance: a simulation at RTL or a performance analysis
after manufacturing. RTL simulation has the disadvantage that
running a single inference of a DNN requires several hours or days,
depending on the DNN’s complexity. Manufacturing the system, of
course, provides the most accurate measurement results, but has a
significantly longer turn-around time.
By contrast, our virtual-system-based prototyping flow requires
virtual hardware models for all hardware components. Compared to
an RTL implementation, these components are described at a higher
level, which results in a shorter implementation time. To determine
the system performance, the task graph is deployed on the virtual
model of the HKP, which controls the execution of the virtual
model of the NCE.
In the implementation-based flow, careful engineering is required
to meet physical constraints. A virtual-system-based approach aims
at faster evaluation by modeling a high-level system description.
Since such a description carries no physical properties of its own,
physical annotations, such as clock frequency, are imported
into the AVSM.
The described properties enable the AVSM methodology to assess
the performance in either a bottom-up or a top-down manner. If the
DNN system’s target performance is known, it is possible to derive
physical requirements (e.g. the required frequency) of components
such as the NCE (top-down). If the physical annotations of a
component are already available, the performance and scalability
at system level can be estimated accurately (bottom-up).
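The two directions can be illustrated with a small calculation. This is a sketch under a strong assumption: the layer is purely compute-bound, so its processing time is the number of NCE cycles divided by the clock frequency. The function names and the cycle count are hypothetical.

```python
def required_frequency_hz(compute_cycles, target_time_s):
    """Top-down: clock frequency needed to meet a target processing time."""
    return compute_cycles / target_time_s


def processing_time_s(compute_cycles, frequency_hz):
    """Bottom-up: processing time that follows from a known clock frequency."""
    return compute_cycles / frequency_hz


cycles = 2_500_000                                   # hypothetical cycle count of one layer
f_required = required_frequency_hz(cycles, 10e-3)    # target: 10 ms per layer
t_estimated = processing_time_s(cycles, 250e6)       # annotation: 250 MHz
```

For communication-bound layers the same reasoning would have to be applied to the memory and bus annotations instead, which is where simulation of the full AVSM becomes necessary.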
3 PRELIMINARY RESULTS
The flow outlined in Figure 1 is implemented as a Python framework
that we developed specifically for modeling abstract virtual
system models (AVSMs). This framework consists of a library of
parametrizable components to describe hardware components, a
compiler interface to transform the internal graph representation
into a hardware-adapted task graph, and a model generation engine.
Each instance of an AVSM is described by a system description
file that defines the topology of the virtual hardware models of
the NCE, the memory sub-system and the bus. It also contains
the physical annotations, such as the frequency of the NCE or
the memory frequency. The model generation engine then uses
the system description file and the hardware-adapted task graph
to automatically generate an executable SystemC model that is
simulated in Synopsys Platform Architect.
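The paper does not show the format of the system description file. Purely as an illustration, such a description could look like the following Python structure. All key names, the `validate` helper and the memory and bus values are hypothetical; only the 32 × 64 multiplier array and the 250 MHz NCE clock are taken from the prototype described in this section.

```python
# Hypothetical system description; key names are illustrative only.
system_description = {
    "nce": {"multiplier_rows": 32, "multiplier_cols": 64, "frequency_hz": 250e6},
    "memory": {"frequency_hz": 800e6, "bus_width_bits": 64},     # hypothetical values
    "bus": {"connects": ["hkp", "dma", "nce", "memory"]},        # hypothetical topology
}

REQUIRED_COMPONENTS = ("nce", "memory", "bus")


def validate(description):
    """Check that all components needed to generate a model are present."""
    missing = [c for c in REQUIRED_COMPONENTS if c not in description]
    if missing:
        raise ValueError(f"missing components: {missing}")
    return True


def peak_macs_per_second(description):
    """Peak multiply-accumulate rate implied by the physical annotations."""
    nce = description["nce"]
    return nce["multiplier_rows"] * nce["multiplier_cols"] * nce["frequency_hz"]


is_valid = validate(system_description)
peak = peak_macs_per_second(system_description)
```

Keeping topology and physical annotations in one declarative description is what allows a generator to produce a new executable model for every design point without manual changes.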
As run-time is one of the major advantages of the presented
virtual-system-based prototyping flow, Figure 3 shows the total
run-time to build an AVSM from the system description file and
simulate all layers of the DilatedVGG neural network. The total
processing time on an Intel Xeon CPU E5620 running at 2.40 GHz
is around 22.6 minutes (1353.54 seconds). Generation of the
hardware-adapted task graph and the hardware models of the AVSM
takes 16.64 seconds. The simulation of the resulting SystemC model
takes 105.82 seconds. Approximately 91% of the processing time is
spent on importing the hardware-adapted task graph, exporting
the results and building the SystemC model. This part
of the flow has not yet been optimized for performance and therefore
offers considerable potential for further improvement.
To evaluate the presented methodology quantitatively, an AVSM
and an FPGA implementation of the DNN system architecture
shown in Figure 2 were created. The physical prototype [4] was
implemented on a Xilinx Virtex-7 FPGA platform, with an NCE that
has 32 × 64 multipliers running at a clock frequency of 250 MHz.
Task                                  Time [s]   Share
Simulation                              105.82    7.8%
Tool import/export and model build     1231.08   91.0%
ML compiler and graph generation         16.64    1.2%
Total                                  1353.54
Figure 3: Distribution of run-time for generation and
simulation of AVSM.
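The shares reported in Figure 3 follow directly from the listed times. A short check, using only the values from the figure (the variable names are ours):

```python
# Run-times in seconds as listed in Figure 3.
run_time_s = {
    "Simulation": 105.82,
    "Tool import/export and model build": 1231.08,
    "ML compiler and graph generation": 16.64,
}

total_s = sum(run_time_s.values())
shares_percent = {task: 100.0 * t / total_s for task, t in run_time_s.items()}
total_minutes = total_s / 60.0

for task, share in shares_percent.items():
    print(f"{task:40s} {run_time_s[task]:9.2f} s  {share:5.1f} %")
print(f"{'Total':40s} {total_s:9.2f} s  ({total_minutes:.1f} min)")
```

The calculation makes the optimization target explicit: the simulation itself accounts for less than a tenth of the turn-around time.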
Figure 4: Gantt chart showing simulation of tasks and usage
of computation and communication resources.
(Two panels, Communication-bound Layer and Compute-bound Layer, each
with the rows DMA and NCE over Time [s]; task types: Load, Store,
Compute; example tasks: Load_4, Conv2D_8, Store_9.)
Figure 5 compares the processing time of the physical prototype
and the AVSM for processing a slightly modified version of the
DilatedVGG DNN [6]. The total processing time for a single inference
deviates by 8.3 % between the physical prototype and the
AVSM. Individual layers deviate by between 0.6 % and 11.2 %. This
deviation results from the high-level model of the memory sub-system
and could be reduced by adding more hardware properties
to the virtual hardware models.
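The deviation metric can be written down in a few lines. The sketch below assumes that deviation is the absolute difference relative to the measured time, and that accuracy is one minus this deviation; the paper does not define either term explicitly. The two processing times are hypothetical placeholders chosen to reproduce a deviation of 8.3 %, not measured values.

```python
def relative_deviation(measured, simulated):
    """Absolute deviation of the simulated time, relative to the measurement."""
    return abs(simulated - measured) / measured


def accuracy(measured, simulated):
    """Assumed definition: one minus the relative deviation."""
    return 1.0 - relative_deviation(measured, simulated)


# Hypothetical total processing times in milliseconds (placeholders only).
measured_ms = 100.0
simulated_ms = 91.7

deviation = relative_deviation(measured_ms, simulated_ms)
acc = accuracy(measured_ms, simulated_ms)
```

Under this reading, a total deviation of 8.3 % corresponds to an accuracy of about 92 %, which matches the figure quoted in the abstract.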
Figure 5: Comparison of HW implementation and AVSM.
(Bar chart of processing time [ms], axis from 0.0 to 15.0, per layer:
Conv1_1, Conv1_2_Pool1, Conv2_1, Conv2_2_Pool2, Conv3_1, Conv3_2,
Conv3_3, Conv4_0, Conv4_1, Conv4_2, Conv4_3, Conv4_4, Conv4_5, Dense1,
Upsampling; series: Hardware Implementation on FPGA, Abstract Virtual
System Model.)
Virtual-system-based prototyping makes it possible to track
computation time at the level of individual operations and the
traffic on the bus for each memory transaction. Therefore, a detailed
analysis of the performance and efficiency of a design point of a DNN
system is possible. The Gantt chart in Figure 4 shows the
dependencies between memory transactions and computations, as well as
the usage of the communication resources and the computation
resources for communication-bound and compute-bound layers.
For the compute-bound layers, the hardware model of the
NCE is occupied continuously, while the hardware model of the DMA is
partially vacant; in the communication-bound case, it is the other way
around. The detailed level of observability of the AVSM therefore
allows a virtual performance analysis of each layer to identify
potential performance bottlenecks.
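The per-layer analysis described above can be approximated by computing resource utilization from the simulated task intervals. The following sketch is illustrative only: the interval values, the 90 % threshold and the function names are assumptions, and intervals of one resource are assumed not to overlap.

```python
def utilization(intervals, t_start, t_end):
    """Fraction of the window [t_start, t_end] in which a resource is busy."""
    busy = sum(end - start for start, end in intervals)
    return busy / (t_end - t_start)


def classify(dma_util, nce_util, threshold=0.9):
    """Label a layer by the resource that is occupied almost continuously."""
    if nce_util >= threshold and dma_util < threshold:
        return "compute-bound"
    if dma_util >= threshold and nce_util < threshold:
        return "communication-bound"
    return "neither"


# Hypothetical busy intervals of one layer, in simulation cycles.
nce_busy = [(50, 500), (500, 980)]
dma_busy = [(0, 50), (300, 350), (980, 1000)]

nce_util = utilization(nce_busy, 0, 1000)
dma_util = utilization(dma_busy, 0, 1000)
label = classify(dma_util, nce_util)
```

In this example the NCE is busy for 93 % of the window and the DMA for 12 %, so the layer is labeled compute-bound, mirroring the pattern visible in the Gantt chart.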
The roofline model [5] in Figure 6 visualizes the performance and
efficiency of the AVSM of the DNN system specified above
for processing each layer of the DilatedVGG DNN. The
layers are represented by dots, where the size of a dot shows the
execution time in relation to the time required for a single inference
of the neural network. Most layers are fairly close to the
peak-performance ceiling of the roofline (e.g. Conv4_0 – Conv4_5),
which indicates that these layers are compute-bound. Increasing the
peak performance of the DNN system could therefore accelerate the
processing of these layers. Figure 7 zooms into the part
of Figure 6 that contains the compute-bound layers. The layers
Dense1, Upsampling and Conv1_1
are neither compute-bound nor communication-bound. Therefore,
increasing the peak performance or the bus bandwidth of the DNN
system would not necessarily have an effect on their execution
time. Possibilities for accelerating those layers range from software
approaches, like changing how the deep learning compiler maps
and transforms individual operations, to optimizations of low-level
hardware, like the arrangement of the multiplier array or the
hierarchy of the on-chip memory in the NCE. Both the software and
the hardware changes can be made in the AVSM. This underlines
the importance of an end-to-end flow for optimizing DNN systems.
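The roofline bound itself is a one-line formula: attainable performance is the minimum of the peak performance and the product of arithmetic intensity and memory bandwidth [5]. The sketch below applies it under stated assumptions: the peak is derived from the 32 × 64 multipliers at 250 MHz with each multiply-accumulate counted as two operations, and the bandwidth of 10 GB/s is a hypothetical value, since the paper does not state one.

```python
def attainable_performance(arithmetic_intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """Roofline bound in OP/s for a given arithmetic intensity in OP/Byte."""
    return min(peak_ops_per_s, arithmetic_intensity * bandwidth_bytes_per_s)


def ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
    """Arithmetic intensity at which the bound switches from bandwidth to compute."""
    return peak_ops_per_s / bandwidth_bytes_per_s


# Assumption: 2 OP per multiply-accumulate, 32 x 64 multipliers at 250 MHz.
PEAK_OPS = 2 * 32 * 64 * 250e6
# Hypothetical external memory bandwidth.
BANDWIDTH = 10e9

ridge = ridge_point(PEAK_OPS, BANDWIDTH)
bound_high_intensity = attainable_performance(2000, PEAK_OPS, BANDWIDTH)
bound_low_intensity = attainable_performance(10, PEAK_OPS, BANDWIDTH)
```

Under these assumptions the peak is about 10^12 OP/s, which is consistent with the range shown for the compute-bound layers in Figure 7. A layer whose effective performance lies well below this bound is limited by neither resource, as observed for Dense1, Upsampling and Conv1_1.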
Figure 6: Roofline model of AVSM executing DilatedVGG.
(Log-log plot of Effective Performance [OP/s], 10^8 to 10^12, over
Arithmetic Intensity [OP/Byte], 10^-2 to 10^6; labeled layers: Conv1_1,
Dense1, Upsampling; dot-size legend: 10%, 90%.)
Figure 7: View on compute-bound layers in Figure 6.
(Effective Performance [OP/s] between 9 × 10^11 and 10^12 over
Arithmetic Intensity [OP/Byte] between 10^3 and 3 × 10^3; labeled
layers include Conv1_2_Pool1, Conv2_1, Conv2_2_Pool2, Conv3_1 and
Conv3_2.)
4 CONCLUSION
The presented results show that AVSMs are a promising alternative
to classical implementation-based physical prototypes. Compared
to an implementation at RTL, the turn-around time for generating
performance estimations of DNN systems is significantly
shorter with AVSMs. The end-to-end exploration of the huge design
space of DNN systems can be carried out at the click
of a button. The tight coupling of the deep learning compiler to
the AVSMs in the proposed methodology provides accurate results,
with less than 9 % deviation in total processing time for the
evaluated case.
REFERENCES
[1] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2018. Neural architecture
search: A survey. arXiv preprint arXiv:1808.05377 (2018).
[2] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi. 2016. Design space exploration of
FPGA-based Deep Convolutional Neural Networks. In 2016 21st Asia and South
Pacific Design Automation Conference (ASP-DAC). 575–580.
https://doi.org/10.1109/ASPDAC.2016.7428073
[3] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. 2017.
Inception-v4, inception-resnet and the impact of residual connections on learning.
In Thirty-First AAAI Conference on Artificial Intelligence.
[4] Sebastian Vogel, Jannik Springer, Andre Guntoro, and Gerd Ascheid. 2019. Efficient
Acceleration of CNNs for Semantic Segmentation on FPGAs. In Proceedings of the
2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
(FPGA ’19). ACM, New York, NY, USA, 309–309.
https://doi.org/10.1145/3289602.3294006
[5] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An
Insightful Visual Performance Model for Multicore Architectures. Commun. ACM
52, 4 (April 2009), 65–76. https://doi.org/10.1145/1498765.1498785
[6] Fisher Yu and Vladlen Koltun. 2015. Multi-Scale Context Aggregation by Dilated
Convolutions. arXiv:cs.CV/1511.07122
[7] Ye Yu, Yingmin Li, Shuai Che, Niraj K Jha, and Weifeng Zhang. 2019. Software-
Defined Design Space Exploration for an Efficient AI Accelerator Architecture.
arXiv preprint arXiv:1903.07676 (2019), 1–11.
[8] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong.
2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural
Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA ’15). ACM, New York, NY, USA, 161–170.
https://doi.org/10.1145/2684746.2689060
[9] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning
transferable architectures for scalable image recognition. In Proceedings of the
IEEE conference on computer vision and pattern recognition. 8697–8710.
