Deploying Deep Neural Networks in the Embedded Space by Venieris, Stylianos I. et al.
Deploying Deep Neural Networks in the Embedded Space
Stylianos I. Venieris, Alexandros Kouris and Christos-Savvas Bouganis
Dept. of Electrical and Electronic Engineering, Imperial College London
{stylianos.venieris10,a.kouris16,christos-savvas.bouganis}@imperial.ac.uk
ABSTRACT
Recently, Deep Neural Networks (DNNs) have emerged as the dom-
inant model across various AI applications. In the era of IoT and
mobile systems, the efficient deployment of DNNs on embedded
platforms is vital to enable the development of intelligent applica-
tions. This paper summarises our recent work on the optimised
mapping of DNNs on embedded settings. By covering such diverse
topics as DNN-to-accelerator toolflows, high-throughput cascaded
classifiers and domain-specific model design, the presented set of
works aims to enable the deployment of sophisticated deep learning
models on cutting-edge mobile and embedded systems.
1 INTRODUCTION
The effective mapping of DNN inference on embedded platforms
can offer multiple benefits for mobile AI applications. These include:
1) enabling the processing of high resolution inputs without com-
promising the user experience due to high-latency access of cloud
services, 2) enabling the processing of multiple data sources in
high-throughput applications, 3) reducing response time and 4)
complying with the power constraints of embedded platforms.
In this context, custom hardware accelerators offer unique op-
portunities for tailor-made solutions that meet the system-level con-
straints while providing higher-performance execution and lower
power consumption than conventional programmable architectures.
However, the inherent complexity of mapping applications on such
specialised platforms hinders their adoption by deep learning
practitioners. In our recent work, we have tackled a number of
critical problems to enhance the accessibility of such platforms.
The scope of our work ranges from software infrastructure for
the automated generation of DNN accelerators to DNN model de-
sign with application-specific optimisations. The key categories
involve: 1) CNN-to-accelerator toolflows for the automated gen-
eration of CNN accelerators, targeting both high-throughput and
low-latency settings [21–23]; 2) the exploitation of the resilience
of CNNs to low-precision arithmetic to achieve substantial perfor-
mance gains [10, 11]; 3) automated synthesis of optimised architec-
tures for emerging multi-CNN applications that employ multiple
models for different tasks [25]; 4) LSTM accelerators that enable de-
ployment in latency-critical applications with limited computation
time budget [19]; and 5) design of domain-specific DNN models
tailored to both the target system’s accuracy requirements and com-
pute capabilities [12]. The rest of the paper presents a high-level
view of our work.
2 CNN-TO-ACCELERATOR AUTOMATION
The success of CNNs has come with an increase in compute and
memory requirements. In this context, FPGAs stand as a promising
platform that can meet both the compute requirements and the
2nd International Workshop on Embedded and Mobile Deep Learning, Munich, Germany,
June 2018.
[Figure 1 diagram: the deep learning expert supplies the network description, the FPGA target platform specifications and the performance requirements; fpgaConvNet performs automated design space exploration and network hardware mapping to produce a bitstream.]
Figure 1: Overview of fpgaConvNet’s flow.
power constraints of emerging CNN applications. Currently, several
obstacles stand as a barrier between deep learning practitioners
and FPGAs. From a development perspective, FPGA system devel-
opment requires expertise in hardware design and familiarity with
FPGA toolchains, two skills that typically do not fall within the
skillset of deep learning scientists. Moreover, due to the design flex-
ibility of FPGAs, the possible mappings of a CNN on an FPGA lie on
a high-dimensional design space that cannot be explored manually.
From an application perspective, the diversity of CNN application
domains results in a wide spectrum of performance needs. Span-
ning from the high-throughput needs of multi-sensor systems to
latency-critical self-driving cars, the underlying hardware has to be
optimised for the particular performance metric of interest. In this
context, there is a need for frameworks that abstract the low-level
details of FPGAs and automate the generation of FPGA-based CNN
accelerators, optimised for the needs of the target application [27].
2.1 The fpgaConvNet Toolflow
fpgaConvNet is a toolflow whose goal is to automate the mapping of
CNNs on FPGAs [21–23, 26]. Starting from a high-level description
of a CNN model, fpgaConvNet (Fig. 1) considers both the supplied
model’s workload and the application-level performance needs, in-
cluding the required throughput and latency, and generates an opti-
mised accelerator for the target FPGA device. At the hardware level,
fpgaConvNet employs a highly customisable streaming architec-
tural template which exploits the parallelism both within and across
layers and supports large networks by posing no constraints on
the model size. To tailor the generated hardware to the CNN-FPGA
pair, fpgaConvNet employs an analytical Synchronous Dataflow
model [13] for capturing both CNN workloads and hardware map-
pings. This formulation enables the fast exploration of the design
space by means of a set of algebraic operations that correspond to a
wide range of optimisations and modify the performance-resource
cost space of the implementation. To yield the final hardware de-
sign, fpgaConvNet casts design space exploration as a mathematical
optimisation problem with an objective function that captures the
throughput and latency requirements of the target application and
automatically generates the resulting hardware design by means
of code generation. Overall, in low-power embedded and mobile
settings, fpgaConvNet’s designs demonstrate performance gains of
arXiv:1806.08616v1 [cs.CV] 22 Jun 2018
[Figure 2 plot: throughput (GOp/s) against latency (s) at batch size 1 for LeNet-5, CIFAR-10, Sign Recogn CNN, Scene Label CNN, VGG16 and AlexNet, each in throughput-driven and latency-driven mode; the two LeNet-5 points coincide.]
Figure 2: Throughput-driven vs. latency-driven mode.
up to 6.65× over highly optimised embedded GPU implementations
when operating under the same power budget.
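To make the optimisation view concrete, the following minimal sketch casts design space exploration as a constrained search with a selectable objective. It is not fpgaConvNet’s Synchronous Dataflow model or its optimiser: the per-layer workloads, the single unroll factor per layer, the DSP-only resource model, the clock frequency and the exhaustive search are all simplifying assumptions made purely for illustration.

```python
from itertools import product

# Hypothetical per-layer workloads in multiply-accumulate operations (MACs).
LAYER_MACS = [4_000_000, 8_000_000, 2_000_000]
DSP_BUDGET = 24            # assumed number of DSP blocks on the target device
CLOCK_HZ = 100_000_000     # assumed clock frequency
UNROLL_CHOICES = [1, 2, 4, 8, 16]  # assumed: one DSP per unrolled MAC unit


def pipeline_throughput(unrolls):
    """Inputs per second of a streaming pipeline: set by its slowest stage."""
    cycles = [macs / u for macs, u in zip(LAYER_MACS, unrolls)]
    return CLOCK_HZ / max(cycles)


def pipeline_latency(unrolls):
    """Seconds for one input to pass through all stages, one after another."""
    return sum(macs / u for macs, u in zip(LAYER_MACS, unrolls)) / CLOCK_HZ


def explore(objective):
    """Exhaustively search the unroll factors that fit the DSP budget."""
    feasible = [u for u in product(UNROLL_CHOICES, repeat=len(LAYER_MACS))
                if sum(u) <= DSP_BUDGET]
    if objective == "throughput":
        return max(feasible, key=pipeline_throughput)
    return min(feasible, key=pipeline_latency)


best_tp = explore("throughput")
best_lat = explore("latency")
print("throughput-driven:", best_tp, pipeline_throughput(best_tp), "inputs/s")
print("latency-driven:   ", best_lat, pipeline_latency(best_lat), "s")
```

Changing only the objective passed to `explore` moves the search to a different point of the same design space, which is the essence of supporting both throughput-driven and latency-driven applications with one toolflow.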
2.2 A Latency-Driven Methodology
The majority of existing CNN implementations on CPUs [17], GPUs
[3] and FPGAs [5] are optimised for high throughput. Emerging
new AI systems, from autonomous drones [14] and cars [4] to low
response-time mobile applications, require the very low-latency ex-
ecution of CNN inference without the overhead of batch processing.
To enable this type of application, fpgaConvNet can place latency
at the centre of optimisation and generate latency-optimised hard-
ware designs for the target CNN-FPGA pair [24]. The latency-driven
methodology comprises a run-time configurable architecture that
enables the high-performance execution of CNNs without the la-
tency penalty imposed by batching, together with a latency-centric
optimiser that guides the design space exploration to previously
unreachable low-latency regions (Fig. 2). fpgaConvNet’s latency-
driven flow delivers up to 73.45× and 5.61× improvements in la-
tency over throughput-optimised designs of AlexNet and VGG16
respectively.
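The latency penalty of batching can be illustrated with simple arithmetic. The sketch below assumes a hypothetical accelerator with a fixed per-batch overhead (for instance, loading weights from off-chip memory) and a fixed per-input compute time; both figures are invented and are not measurements of fpgaConvNet.

```python
def batch_metrics(batch_size, overhead_s=0.040, per_input_s=0.010):
    """Return (throughput in inputs/s, latency in s) for one batch.

    overhead_s and per_input_s are assumed values for illustration only.
    """
    batch_time = overhead_s + batch_size * per_input_s
    # Every input in the batch waits for the whole batch to finish.
    return batch_size / batch_time, batch_time


for b in (1, 4, 16, 64):
    throughput, latency = batch_metrics(b)
    print(f"batch={b:3d}  throughput={throughput:6.1f}/s  "
          f"latency={latency * 1e3:6.1f} ms")
```

Under these assumptions, larger batches amortise the fixed overhead and raise throughput, but every input then waits for the whole batch, which is why a latency-driven design avoids relying on batching.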
3 PRECISION
The state-of-the-art accuracy of DNNs in machine vision tasks
is achieved at the expense of high computational and memory
requirements that often prohibit their deployment in real-world
embedded applications [9]. A common strategy to alleviate that
cost is the use of reduced precision. The majority of approaches
assume the availability of the training set and employ a retraining
step to restore the quantised model’s accuracy [6, 7]. However, in
privacy-aware scenarios, such data are not available and hence
methods for reducing the precision without retraining are required.
Moreover, in applications with low error tolerance, the accuracy
loss due to quantisation often prohibits the use of low-precision
arithmetic. In this context, there is a need for efficient methods of
executing DNNs that combine the gains of reduced precision with
negligible accuracy drop.
3.1 CascadeCNN’s Approach
CascadeCNN [10, 11] introduces an automated toolflow for gener-
ating a high-throughput cascade of CNN classifiers that pushes the
performance of precision-quantised CNNs. Our key observation
is that not all inputs of a CNN require the same level of precision
in the computation to yield a confident prediction. In this respect,
CascadeCNN exploits this property and generates a two-stage ar-
chitecture of precision-quantised models (Fig. 3). The first stage
consists of an excessively low-precision processing unit that en-
ables rapid classification prediction. The outputs from the first stage
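[Figure 3 diagram: processing units built from arrays of multipliers and adders with a configurable number of MACCs per PE, a confidence evaluation unit (CEU) with PASS and FAIL outputs, and memory.]
Figure 3: CascadeCNN’s high-level architecture.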
are fed to a confidence evaluation unit which estimates the pre-
diction confidence. The samples that are detected as misclassified
are recomputed on a high-precision unit to restore the application-
level accuracy and comply with the user-specified error tolerance.
Overall, CascadeCNN considers the error tolerance and the input
CNN-FPGA pair to select a quantisation scheme, configure the confi-
dence evaluation mechanism and generate the cascaded low- and
high-precision processing units. CascadeCNN’s designs demon-
strate a performance boost of up to 55% for VGG16 and 48% for
AlexNet over baseline designs achieving the same accuracy.
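The two-stage control flow described above can be sketched in a few lines. This is an illustrative sketch rather than CascadeCNN’s implementation: the two toy models, the threshold value and the confidence test (the gap between the two highest class probabilities) are assumptions chosen for brevity.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confident(probs, threshold):
    """Assumed confidence test: gap between the top two probabilities."""
    top, second = sorted(probs, reverse=True)[:2]
    return (top - second) >= threshold


def cascade_predict(sample, low_model, high_model, threshold=0.5):
    """Return (predicted class, stage that produced the prediction)."""
    probs = softmax(low_model(sample))
    if confident(probs, threshold):
        return probs.index(max(probs)), "low"
    # Low confidence: recompute the sample on the high-precision unit.
    probs = softmax(high_model(sample))
    return probs.index(max(probs)), "high"


# Toy stand-ins for the two processing units: the "low-precision" model
# coarsely rounds the logits that the "high-precision" model produces.
def high_precision_model(sample):
    return [sample, 1.0 - sample, 0.1]


def low_precision_model(sample):
    return [round(v * 2) / 2 for v in high_precision_model(sample)]


for x in (3.0, 0.6):
    print(x, cascade_predict(x, low_precision_model, high_precision_model))
```

Only the samples that fail the confidence test pay for the high-precision pass, so the average cost per input stays close to that of the low-precision unit whenever most inputs are easy.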
4 MULTI-DNN SYSTEMS
In the construction of complex AI systems, DNN models are used
as building blocks of a larger system. In this respect, multi-DNN
systems have emerged, employing several models, each one trained
for a different subtask. In particular, in the emerging field of intel-
ligent autonomous systems, such as drones [20] and self-driving
cars [2], the system’s perception is largely based on computer vi-
sion tasks, such as object detection [18] and semantic and instance
segmentation [1, 8]. Such systems require the concurrent execution
of these subtasks and hence the parallel and continuous execution
of multiple models.
Nevertheless, deploying multiple models on a target platform
poses a number of challenges. From a resource allocation perspec-
tive, with each model targeting a different task, the performance
constraints, such as required throughput and latency, vary accord-
ingly. This property calls for an architecture that, rather than being
model-agnostic, captures and reflects the performance require-
ments of each model. Moreover, in resource-constrained
setups, multiple DNNs compete for the same pool of resources and
hence resource allocation between models becomes a critical factor.
In this respect, the mapping of multiple DNNs is a multidimensional
design problem that encompasses both the performance needs of
each model and the resource constraints of the target platform.
4.1 f-CNNx: Deploying Multiple CNNs
f-CNNx [25] is a toolflow which addresses the challenge of map-
ping multiple CNNs on a target FPGA platform while meeting the
required performance for each model. f-CNNx exploits the struc-
ture of CNN workloads and the fine-grained control over resource
allocation of FPGAs to yield latency-optimised designs. From a
hardware perspective, the developed toolflow introduces a highly
parametrised multi-CNN architecture (Fig. 4) that allows the fine-
grained allocation of resources among CNNs and the deterministic
scheduling of competing memory transfers. f-CNNx explores a wide
range of resource and bandwidth allocations and incorporates the
application-level importance of each model by means of multiob-
jective cost functions to guide the design space exploration to the
optimum hardware design. Overall, f-CNNx overcomes the limi-
tations of other parallel platforms by yielding up to 6.8× gains in
[Figure 4 diagram: CNN Engines 1 to N on the FPGA, each a pipeline of convolutional and pooling layers; convolutional layers comprise C-PEs with weights memories and support PE folding and dot-product unit folding; a multi-CNN hardware scheduler connects the engines to off-chip memory.]
Figure 4: Parallel architecture for multiple CNNs.
performance-per-Watt over highly optimised embedded GPU de-
signs in multi-CNN settings. To the best of our knowledge, this work
addresses for the first time in the literature the latency-optimised
mapping of multiple CNNs.
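As a rough illustration of importance-weighted resource allocation between models, the sketch below splits a DSP budget between two hypothetical CNNs by minimising a weighted cost. The model names, workloads, latency targets, weights, the compute-bound latency model and the coarse exhaustive search are all assumptions and do not reproduce f-CNNx’s cost functions or architecture; memory bandwidth allocation is omitted entirely.

```python
from itertools import product

# Hypothetical models: workload in MACs, latency target in seconds and an
# application-level importance weight. None of these values come from f-CNNx.
MODELS = {
    "detector":  {"macs": 6_000_000, "target_s": 0.010, "weight": 3.0},
    "segmenter": {"macs": 12_000_000, "target_s": 0.050, "weight": 1.0},
}
TOTAL_DSPS = 32           # assumed DSP budget of the target FPGA
CLOCK_HZ = 100_000_000    # assumed clock frequency
STEP = 4                  # granularity of the allocation search


def latency(macs, dsps):
    """Assumed model: one MAC per DSP per cycle, compute-bound execution."""
    return macs / (dsps * CLOCK_HZ)


def cost(allocation):
    """Weighted sum of each model's latency normalised by its target."""
    return sum(m["weight"] * latency(m["macs"], d) / m["target_s"]
               for m, d in zip(MODELS.values(), allocation))


def allocate():
    """Search the DSP splits that use no more than the available DSPs."""
    options = range(STEP, TOTAL_DSPS, STEP)
    feasible = [a for a in product(options, repeat=len(MODELS))
                if sum(a) <= TOTAL_DSPS]
    return min(feasible, key=cost)


best = allocate()
for name, dsps in zip(MODELS, best):
    print(name, dsps, "DSPs,", latency(MODELS[name]["macs"], dsps), "s")
```

Raising a model’s weight or tightening its latency target shifts resources towards it, which is the behaviour a multiobjective cost function is meant to capture.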
5 COMPUTING UNDER TIME CONSTRAINTS
Modern intelligent systems such as mobile robots and UAVs that
employ DNNs to perceive and interact with their surroundings
often operate under time-constrained, latency-critical settings. In
such scenarios, the output of a DNN would typically yield an action,
such as a manoeuvre to avoid an obstacle. For such decision-making
to happen both in real-time and with the best possible outcome,
obtaining the most informative output from a DNN given a con-
straint in computation time is vital. In this respect, the design of
computing systems that exploit the runtime-accuracy trade-off of
DNNs is necessary to enable the timely operation of such systems.
5.1 Approximate FPGA-based LSTMs
With a focus on the high-performance deployment of LSTMs under
time-constrained settings, [19] presents a framework that comprises
an approximate computing scheme together with a novel FPGA-
based hardware architecture for LSTMs. The proposed framework
(Fig. 5) employs an iterative approximation method to compress and
prune the target LSTM and explore the computation time-accuracy
trade-off. Internally, the framework co-optimises the LSTM approx-
imation and the hardware design in order to meet the computation
time constraints. By targeting a real-life image captioning applica-
tion, the designs generated by the developed framework demon-
strate 6.5× less time to achieve the same application-level accuracy
over a baseline accelerator, while reaching an average of 25× higher
accuracy under the same computation time constraints.
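The computation time-accuracy trade-off can be illustrated on a single weight matrix. The sketch below covers magnitude pruning only and uses the number of non-zero weights as a stand-in for computation time; the compression step, the hardware model and the co-optimisation of [19] are not reproduced, and the matrix, input vector and budgets are invented values.

```python
def matvec(weights, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def prune_to(weights, keep):
    """Keep the `keep` largest-magnitude weights and zero out the rest."""
    flat = sorted((abs(w) for row in weights for w in row), reverse=True)
    cutoff = flat[keep - 1]
    return [[w if abs(w) >= cutoff else 0.0 for w in row] for row in weights]


def relative_error(approx, exact):
    """Euclidean error of the approximate output relative to the exact one."""
    num = sum((a - e) ** 2 for a, e in zip(approx, exact)) ** 0.5
    den = sum(e ** 2 for e in exact) ** 0.5
    return num / den


def best_under_budget(weights, x, mac_budget):
    """Prune so that the non-zero count fits the budget; report the error."""
    total = sum(len(row) for row in weights)
    keep = max(1, min(mac_budget, total))
    pruned = prune_to(weights, keep)
    return pruned, relative_error(matvec(pruned, x), matvec(weights, x))


# Toy 3x3 gate matrix and input vector (illustrative values only).
W = [[0.9, -0.05, 0.3], [0.02, -0.7, 0.1], [0.4, 0.01, -0.6]]
x = [1.0, 2.0, -1.0]
for budget in (3, 5, 9):
    _, err = best_under_budget(W, x, budget)
    print(budget, "MACs -> relative error", round(err, 4))
```

A tighter budget keeps fewer weights and increases the output error, so a time constraint translates directly into a choice of approximation level.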
6 DOMAIN-SPECIFIC DNN DESIGN
Conventionally, the deep learning community’s design method-
ology for DNN models focuses on maximising the accuracy on
the target task, while largely neglecting the implications on the
inference-time computational cost. By considering domain-specific
properties in order to guide the design of DNNs, more efficient
models can be constructed which both meet the required accuracy
and lie within the compute capabilities of the target platform.
6.1 The DroNet Vehicle Detector
Drones are emerging as a promising technology for a broad range
of applications from domains such as agriculture, security, emer-
gency response and infrastructure monitoring [15, 16]. A promi-
nent application includes the detection of vehicles for emergency
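[Figure 5 diagram: inputs include the number of DSPs and the memory bandwidth; the weight matrix is progressively sparsified, parametrised by N_steps and NZ; tile sizes <T_r, T_c>; an error-versus-time plot guides the selection.]
Figure 5: Overview of approximate LSTMs design flow.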
response and traffic monitoring. In this scenario, drones operate au-
tonomously and employ CNN models for the detection of vehicles.
[12] presents an end-to-end investigation of different single-shot
CNN detectors for drone-based vehicle detection. Starting from the
dataset collection and model design down to the deployment on
the Odroid-XU4 and Raspberry Pi 3 embedded hardware platforms,
this work presents an exploration over the structure of the CNN.
To find the CNN that optimises both accuracy and computation
cost, a custom metric is employed that, given a model instance,
captures both the detection accuracy and the achieved runtime
on the target hardware platform. By following this methodology,
the resulting detection CNN yields the highest performing balance
between detection accuracy and fast execution on the two target
embedded platforms.
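A metric of this kind can be sketched as a weighted score over candidate models. The formula, the weights, the target frame rate and the candidate figures below are hypothetical and are not the metric or the results of [12]; they only show how accuracy and measured runtime can be folded into a single selection criterion.

```python
# Hypothetical candidate detectors: (name, detection accuracy in [0, 1],
# measured frames per second on the target board). Values are made up.
CANDIDATES = [
    ("large",  0.95, 2.0),
    ("medium", 0.92, 8.0),
    ("small",  0.75, 12.0),
]


def score(accuracy, fps, target_fps=10.0, accuracy_weight=0.6):
    """Weighted sum of accuracy and frame rate, the latter capped at target."""
    speed = min(fps / target_fps, 1.0)
    return accuracy_weight * accuracy + (1.0 - accuracy_weight) * speed


def select(candidates):
    """Return the candidate with the highest combined score."""
    return max(candidates, key=lambda c: score(c[1], c[2]))


for name, acc, fps in CANDIDATES:
    print(name, round(score(acc, fps), 3))
print("selected:", select(CANDIDATES)[0])
```

Because the frame rate is measured on the target board, the same candidates can rank differently on different platforms, which is why the selection is tied to the deployment hardware.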
7 CONCLUSION
The presented set of works focuses on bridging the gap between
the deep learning community and the deployment of models in the
embedded space. The fpgaConvNet toolflow automates the opti-
mised generation of FPGA-based CNN accelerators targeting both
high-throughput and latency-driven applications. In the context
of quantised CNNs, CascadeCNN alleviates the need for training
data availability and enables in this way the use of high-throughput
CNN accelerators that employ extremely low-precision arithmetic
in privacy-critical applications. By targeting multi-CNN systems,
f-CNNx paves the way in executing multiple CNNs under latency
constraints on FPGAs. Moreover, an approximate computing method-
ology for the deployment of LSTMs in time-constrained applications
is presented, enabling latency-critical systems to make informed
decisions in real-time. Finally, by considering domain-specific op-
timisations, a model design methodology has been developed to
construct DNNs that are optimised for both the task-level accuracy
and compute capabilities of the target platform.
REFERENCES
[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. 2017. SegNet: A Deep Convolu-
tional Encoder-Decoder Architecture for Scene Segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence 39, 12 (2017), 2481–2495.
[2] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. 2015. DeepDriving:
Learning Affordance for Direct Perception in Autonomous Driving. In 2015 IEEE
International Conference on Computer Vision (ICCV). 2722–2730.
[3] Sharan Chetlur et al. 2014. cuDNN: Efficient Primitives for Deep Learning. CoRR
(2014).
[4] A. Geiger, P. Lenz, and R. Urtasun. 2012. Are we ready for autonomous driving?
The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision
and Pattern Recognition. 3354–3361.
[5] Yijin Guan et al. 2017. FP-DNN: An Automated Framework for Mapping Deep
Neural Networks onto FPGAs with RTL-HLS Hybrid Templates. In 2017 IEEE
25th Annual International Symposium on Field-Programmable Custom Computing
Machines (FCCM). 152–159.
[6] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang. 2018.
Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
37, 1 (Jan 2018), 35–47.
[7] Song Han, Huizi Mao, and William J. Dally. 2016. Deep Compression: Compressing
Deep Neural Network with Pruning, Trained Quantization and Huffman Coding.
International Conference on Learning Representations (ICLR) (2016).
[8] K. He, G. Gkioxari, P. Dollár, and R. Girshick. 2017. Mask R-CNN. In 2017 IEEE
International Conference on Computer Vision (ICCV). 2980–2988.
[9] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara,
Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al.
2017. Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors. In
Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE,
3296–3297.
[10] Alexandros Kouris, Stylianos I. Venieris, and Christos-Savvas Bouganis. 2018.
CascadeCNN: Pushing the performance limits of quantisation. In SysML.
[11] Alexandros Kouris, Stylianos I. Venieris, and Christos-Savvas Bouganis. 2018.
CascadeCNN: Pushing the Performance Limits of Quantisation in Convolutional
Neural Networks. In 2018 28th International Conference on Field Programmable
Logic and Applications (FPL). 1–8.
[12] C. Kyrkou, G. Plastiras, T. Theocharides, S. I. Venieris, and C. S. Bouganis. 2018.
DroNet: Efficient Convolutional Neural Network Detector for Real-Time UAV
Applications. In 2018 Design, Automation & Test in Europe Conference & Exhibition
(DATE). 967–972.
[13] E. A. Lee and D. G. Messerschmitt. 1987. Synchronous Data Flow. Proc. IEEE
(Sept 1987).
[14] Antonio Loquercio, Ana I Maqueda, Carlos R del Blanco, and Davide Scaramuzza.
2018. Dronet: Learning to fly by driving. IEEE Robotics and Automation Letters 3,
2 (2018), 1088–1095.
[15] Nathan Michael, Shaojie Shen, Kartik Mohta, Vijay Kumar, Keiji Nagatani, Yoshito
Okada, Seiga Kiribayashi, Kazuki Otake, Kazuya Yoshida, Kazunori Ohno, et al.
2014. Collaborative mapping of an earthquake damaged building via ground and
aerial robots. In Field and Service Robotics. Springer, 33–47.
[16] J. Nikolic, M. Burri, J. Rehder, S. Leutenegger, C. Huerzeler, and R. Siegwart.
2013. A UAV system for inspection of industrial facilities. In 2013 IEEE Aerospace
Conference. 1–8. https://doi.org/10.1109/AERO.2013.6496959
[17] Samyam Rajbhandari, Yuxiong He, Olatunji Ruwase, Michael Carbin, and Trishul
Chilimbi. 2017. Optimizing CNNs on Multicores for Scalability, Performance
and Goodput. In Proceedings of the Twenty-Second International Conference on
Architectural Support for Programming Languages and Operating Systems (ASPLOS
’17). ACM, New York, NY, USA, 267–280.
[18] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2017. Faster R-CNN:
Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Transactions on Pattern Analysis and Machine Intelligence 39, 6 (June 2017), 1137–
1149.
[19] Michalis Rizakis, Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas
Bouganis. 2018. Approximate FPGA-based LSTMs under Computation Time
Constraints. In Applied Reconfigurable Computing - 14th International Symposium,
ARC 2018, Santorini, Greece, May 2 - 4, 2018, Proceedings. 3–15.
[20] Nikolai Smolyanskiy et al. 2017. Toward Low-Flying Autonomous MAV Trail
Navigation using Deep Neural Networks for Environmental Awareness. In 2017
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[21] Stylianos I. Venieris and Christos-Savvas Bouganis. 2016. fpgaConvNet: A Frame-
work for Mapping Convolutional Neural Networks on FPGAs. In 2016 IEEE
24th Annual International Symposium on Field-Programmable Custom Computing
Machines (FCCM). 40–47.
[22] Stylianos I. Venieris and Christos-Savvas Bouganis. 2017. fpgaConvNet: A
Toolflow for Mapping Diverse Convolutional Neural Networks on Embedded
FPGAs. In NIPS 2017 Workshop on Machine Learning on the Phone and other
Consumer Devices.
[23] Stylianos I. Venieris and Christos-Savvas Bouganis. 2017. fpgaConvNet: Au-
tomated Mapping of Convolutional Neural Networks on FPGAs (Abstract
Only). In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays. ACM, 291–292.
[24] S. I. Venieris and C. S. Bouganis. 2017. Latency-Driven Design for FPGA-based
Convolutional Neural Networks. In 2017 27th International Conference on Field
Programmable Logic and Applications (FPL). 1–8.
[25] S. I. Venieris and C. S. Bouganis. 2018. f-CNNx: A Toolflow for Mapping Multiple
Convolutional Neural Networks on FPGAs. In 2018 28th International Conference
on Field Programmable Logic and Applications (FPL). 1–8.
[26] Stylianos I. Venieris and Christos-Savvas Bouganis. 2018. fpgaConvNet: Map-
ping Regular and Irregular Convolutional Neural Networks on FPGAs. IEEE
Transactions on Neural Networks and Learning Systems (2018).
[27] Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. 2018.
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and
Future Directions. ACM Comput. Surv. 51, 3, Article 56 (June 2018), 39 pages.
