Expanding a robot's life: Low power object recognition via FPGA-based
  DCNN deployment by Mousouliotis, Panagiotis G. et al.
Panagiotis G. Mousouliotis, Konstantinos L. Panayiotou, Emmanouil G. Tsardoulias,
Loukas P. Petrou, Andreas L. Symeonidis
School of Electrical and Computer Engineering, Faculty of Engineering,
Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
Abstract—FPGAs are commonly used to accelerate domain-
specific algorithmic implementations, as they can achieve im-
pressive performance boosts, are reprogrammable and exhibit
minimal power consumption. In this work, the SqueezeNet DCNN
is accelerated using an SoC FPGA so that the resulting object
recognition resource can be employed in a robotic application.
Experiments are conducted to investigate the performance and
power consumption of the implementation in comparison to
deployment on other widely used computational systems.
Index Terms—Robotics, CNN, Deep Learning, Computer Vi-
sion, FPGA, Distributed Robotic Architecture
I. INTRODUCTION
Object recognition is considered one of the most fundamental
abilities an autonomous robot should exhibit. Traditionally,
object recognition has been performed via feature extraction (e.g.
SIFT, SURF, ORB); nevertheless, recent advances in deep
learning have crowned DCNNs (Deep Convolutional Neural
Networks) as the kings of the field, since they have outperformed
previous approaches in terms of recognition accuracy. Leaving
aside their long training times, one of the main drawbacks of
DCNNs is the number of convolution operations needed to perform
a forward pass in the NN, i.e. to execute a classification
operation given an image as input. Modern CPUs and/or GPUs are
quite capable of executing such operations at high rates, but
their power consumption is rather high, which has a direct impact
on the battery life of a robot (especially in the case of
drones).
This paper presents a real-world robotics application and an
evaluation of the work in [1]. Specifically, a distributed
robotic architecture is proposed, which embeds a DCNN, namely
the SqueezeNet v1.1 DCNN (SqN), deployed in a System on Chip
(SoC) FPGA. SqN is accelerated by the SqueezeJet FPGA
accelerator (SqJ) [1] and is remotely executed on a Xilinx Zynq
platform. This approach achieves performance comparable to
common CPUs at lower power consumption, while also providing
remotely accessible resources for object recognition tasks.
II. STATE OF THE ART
Although little research presents full FPGA-based DCNN systems
for robotic applications, many approaches exist where FPGAs are
used either in robotics or for vision applications (including
DCNNs). As far as FPGA employment in robots is concerned, the
work in [2] implemented the scan-matching genetic-based SMG-SLAM
algorithm on a Xilinx Virtex-5, achieving almost 15 times faster
iteration times in comparison to the original algorithm. Also,
interesting control and navigation-oriented implementations
have been proposed, such as [3], which investigates an
FPGA-based PID motion control system for small, self-adaptive
systems, and [4], which removes a servo control loop from the
digital signal processor (DSP) and implements a high-speed
servo loop in an FPGA.
Besides the robotics domain, several FPGA implementations of
vision-related algorithms exist (including DCNNs) which are
described as "robotics-suitable" but have not been tested in
real-life conditions. For example, in [5], a parallel
implementation of low-level image filtering was created in an
FPGA-based system, whereas in [6] and [7] common feature
extractors such as SURF, the Harris-Stephens corner detector and
ORB were implemented in FPGAs.
Regarding NN implementations, several DCNN FPGA acceleration
approaches exist, such as [8], where a scalable hardware
architecture to implement large-scale CNNs and state-of-the-art
multi-layered artificial vision systems is presented; [9], where
an efficient implementation for accelerating DCNNs on a mobile
platform (Xilinx Zynq-7000) in a pipelined manner is described;
[10], which presents an FPGA implementation of a CNN designed
for portability and power efficiency; and [11], where a CNN
accelerator design on an embedded FPGA for ImageNet large-scale
image classification is proposed. Finally, architectures such as
Angel-Eye [12] offer quantization strategies and compilation
tools that take into account requirements on memory,
computation, and the flexibility of the system.
In our work, we propose offloading the execution of the SqN
DCNN to the SoC FPGA of the Xilinx ZC702 board using the SqJ
accelerator, invoked from an application deployed on the ARM
Cortex-A9 core of the same SoC FPGA. The inference of the SqN
DCNN is exposed via a single Microservice (uService) to a
robotics controller in order to provide remote access, so that
the board acts as a stand-alone physical node capable of
performing object recognition tasks.
arXiv:1804.00512v1 [cs.CV] 23 Mar 2018
Fig. 1. Simplified R4A architecture conceptualization
III. IMPLEMENTATION
A. System architecture
System design and deployment processes follow the R4A
methodology and system architecture [13], which exposes
robotic (or device) resources in a unified and agnostic way.
In Figure 1, each device contains an LLCA (Low Level Core
Agent) that includes hardware/device drivers and exposes raw
functionalities, and an HLCA (High Level Core Agent) that can
manage the operation mode of the respective LLCA, perform
pre/post-processing, abstract functionalities by making them
vendor-agnostic, and expose them via over-network transports.
As far as deployment is concerned, the LLCA lives within the
physical boundaries of each robot/device, while relaxed
robot/device boundaries apply to the other components of the
R4A architecture (HLCA, Robot Memory and Resource Transports).
The latter can be deployed either in-robot or on a remote node
or platform.
Conforming to the R4A architecture, an LLCA has been developed
and deployed on the Xilinx Zynq ZC702 embedded board, exposing
the inference operation of SqN via a single uService, in order
to provide remote access to object recognition tasks. The
uService is implemented via a headless protocol for
transmitting raw image data over a plain TCP connection,
whereas the response data consists of the top-5 object classes
(ILSVRC 2012 dataset; see footnote 1) along with their
probabilities. As denoted in Figure 2, the robotic resources
layer is located on the physical node running on the robotic
platform, and includes several LLCA instances which offer
functionalities to the application layer.
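As an illustration of what such a header-less exchange might look like from the client side, the following Python sketch sends one raw RGB frame over an already connected TCP socket and decodes a top-5 reply. The paper does not specify the wire format, so everything below is assumed for the sake of the example: the 227x227x3 frame size, the reply layout of five (uint16 class index, float32 probability) pairs in little-endian order, and the names FRAME_BYTES, REPLY_FORMAT, recv_exact and classify_frame.

```python
import socket
import struct

# Hypothetical wire format, NOT taken from the paper.
FRAME_WIDTH = 227
FRAME_HEIGHT = 227
FRAME_BYTES = FRAME_WIDTH * FRAME_HEIGHT * 3  # raw 8-bit RGB, no header
REPLY_FORMAT = "<" + "Hf" * 5                 # 5 x (class index, probability)
REPLY_BYTES = struct.calcsize(REPLY_FORMAT)


def recv_exact(sock, count):
    """Read exactly `count` bytes from a stream socket."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def classify_frame(sock, frame):
    """Send one raw frame and return [(class_index, probability), ...]."""
    if len(frame) != FRAME_BYTES:
        raise ValueError("frame must be exactly FRAME_BYTES long")
    # No header is sent: the receiver is assumed to know the frame size.
    sock.sendall(frame)
    fields = struct.unpack(REPLY_FORMAT, recv_exact(sock, REPLY_BYTES))
    # Pair up the interleaved class indices and probabilities.
    return list(zip(fields[0::2], fields[1::2]))
```

A caller would typically obtain `sock` from `socket.create_connection((host, port))`, where the host and port are deployment-specific. Because the stream carries no length field, both ends have to agree on the frame size in advance, which is the main trade-off of a header-less design.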
SqN-HLCA is deployed on the robotic platform, acting as a proxy
to the SqN-LLCA. It consists of a controller (SqNProxyAgent)
for managing the internal state and a stack of preprocessing
operations that are applied to the images published by a ROS
Node [14] that implements the hardware interface of USB cameras
(UsbCamNodeROS). SqN-HLCA provides three interfaces: a
ROS-Service for setting the internal state of the SqN-HLCA, a
ROS-Publisher for streaming the classification results, and a
ROS-Service for one-shot object recognition operations. An
interesting observation is that SqN-HLCA is at the same time an
LLCA of the robot, since it offers raw resources to the robot
controller (ObjectRecognition-LLCA). In Figure 2, the orange
dashed lines indicate the boundaries of the current
implementation. To further expose robot functionalities in an
agnostic manner, a corresponding HLCA must ultimately be
deployed for each LLCA.
1 http://www.image-net.org/challenges/LSVRC/2012/
B. FPGA accelerator implementation
The tools employed for the FPGA accelerator development are:
(1) Xilinx Vivado HLS (VHLS), which converts high-level
programming language source code (such as C/C++) to hardware
description language (HDL) code that can then be synthesized
and mapped to FPGA devices, and (2) the Xilinx SDSoC
Environment, which raises the abstraction level of FPGA
application development further than VHLS by providing
C/C++/OpenCL HLS capabilities, directive-driven interface
synthesis, and automated (Standalone, Linux) OS image
generation for the developed application.
SqJ accelerates all but the first SqN convolutional layer,
because that layer differs architecturally from all the other
SqN convolutional layers; a dedicated accelerator is designed
for the first layer. The common characteristics of the
SqJ-accelerated layers are a stride equal to 1, input channel
dimensions whose greatest common divisor (GCD) is 16, and
output channel dimensions that are divisible by a power of 2.
Since the HLS compiler cannot unroll loops with variable
bounds, the input channel GCD is used as a design parameter of
a pipelined multiply-accumulate unit (MAC-16) which performs 16
MACs in every clock cycle of its operation. Using the output
channel characteristic, the MAC-16 unit is replicated 2^n
(n = 2, 3, ...) times and the resulting architecture is used to
concurrently calculate 2^n output channels. Since the
parallelism exploitation is focused on the input and output
channel dimensions, SqJ can support both 1×1 and 3×3 kernel
sizes.
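The parallelisation scheme described above can be mimicked in software. The following Python sketch is only a behavioural model under simplifying assumptions: the loop order, the function names (mac16, conv_pixel) and the one-cycle-per-step counting are illustrative choices, not the SqJ implementation, and the cycle count ignores pipeline fill latency and memory stalls.

```python
MAC_WIDTH = 16  # input channels consumed by one MAC-16 unit per cycle


def mac16(acc, activations, weights):
    """One MAC-16 step: add 16 products to the accumulator."""
    assert len(activations) == MAC_WIDTH and len(weights) == MAC_WIDTH
    return acc + sum(a * w for a, w in zip(activations, weights))


def conv_pixel(window, weights, num_units):
    """Compute every output channel of a single output pixel.

    window[k][c]     : input value at kernel position k, input channel c
    weights[o][k][c] : weight of output channel o at position k, channel c
    num_units        : number of MAC-16 units assumed to run in parallel
    Returns (outputs, cycles), where cycles is the modelled cycle count.
    """
    out_channels = len(weights)
    kernel_positions = len(window)  # 1 for a 1x1 kernel, 9 for 3x3
    in_channels = len(window[0])
    assert in_channels % MAC_WIDTH == 0
    assert out_channels % num_units == 0

    outputs = [0] * out_channels
    cycles = 0
    for o_base in range(0, out_channels, num_units):
        for k in range(kernel_positions):
            for c_base in range(0, in_channels, MAC_WIDTH):
                # In hardware all units would fire in the same cycle,
                # each one working on a different output channel.
                for unit in range(num_units):
                    o = o_base + unit
                    outputs[o] = mac16(
                        outputs[o],
                        window[k][c_base:c_base + MAC_WIDTH],
                        weights[o][k][c_base:c_base + MAC_WIDTH],
                    )
                cycles += 1
    return outputs, cycles
```

In this model the kernel size only changes the number of kernel positions that are iterated over, which is why the same structure serves both 1×1 and 3×3 kernels. The modelled cycle count per output pixel is (output channels / units) × kernel positions × (input channels / 16).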
The Ristretto tool [15] is used to quantize the SqN parameters
(weights, bias) to 8 bits and the SqN feature maps (fmaps) to
16 bits with 0.88% top-5 accuracy loss without performing any
fine-tuning. Parameter and fmap quantization aims at: (1)
making the SqJ design smaller, so that it requires far fewer
FPGA resources than the floating-point design and fits into
low-end FPGA devices, and (2) storing the SqN parameters in
Block RAM (BRAM) FPGA resources, since they are used for the
calculation of every multi-channel pixel of the output fmap,
thus avoiding unnecessary memory accesses which introduce
additional latency and power consumption. In addition to the
parameter buffering design choice, an input fmap tile buffer
(ITB) and an input fmap tile buffer window (ITBW) are used.
After the initialization of these two buffers, SqJ consumes and
produces data pixel-by-pixel (in SqJ jargon, a pixel consists
of all the channels at a specific (x, y) location in the fmap
volume [1]). The ITB and ITBW are designed to take advantage of
the spatial locality of the convolution input data and minimize
unnecessary data movement. Figure 3 shows, for simplicity, SqJ
implemented with 4 MAC-16 units. In this work, an SqJ with 8
MAC-16 units is used and it runs at 100 MHz. The FPGA resource
utilization of SqJ is shown in Table I.
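To make the effect of reduced bit widths concrete, the following Python sketch rounds a real value to signed fixed point and clips it to the representable range. It is a generic illustration, not Ristretto's actual procedure: the function name quantize and the fractional lengths used in the example calls are assumptions, whereas a tool like Ristretto chooses the number format per layer from the data.

```python
def quantize(value, bit_width, frac_bits):
    """Round `value` to signed fixed point with saturation.

    bit_width : total number of bits, including the sign bit
    frac_bits : number of fractional bits (assumed, chosen by the caller)
    Returns (stored_integer, represented_value).
    """
    scale = 2.0 ** frac_bits
    lowest = -(1 << (bit_width - 1))
    highest = (1 << (bit_width - 1)) - 1
    # Python's round() resolves exact ties to the nearest even integer.
    stored = max(lowest, min(highest, round(value * scale)))
    return stored, stored / scale


# An 8-bit "weight" and a 16-bit "fmap" value with assumed formats.
weight_code, weight_value = quantize(0.3, bit_width=8, frac_bits=7)
fmap_code, fmap_value = quantize(3.14159, bit_width=16, frac_bits=8)
print(weight_code, weight_value)
print(fmap_code, fmap_value)
```

Values outside the representable range saturate instead of wrapping around, so with more fractional bits the resolution improves while the largest value that can be stored shrinks. Storing a parameter in 8 bits rather than as a 32-bit float also cuts its memory footprint by a factor of four, which is what makes on-chip BRAM buffering attractive.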
Fig. 2. Overall system architecture
Fig. 3. SqJ Block Diagram
TABLE I
SQJ RESOURCE UTILIZATION ON THE XC7Z020 SOC FPGA
Resource    Used      Available   Utilization (%)
LUT         20148     53200       37.87
LUTRAM      1273      17400       7.32
FF          29568     106400      27.55
BRAM        134.5     140         96.07
DSP         192       220         87.27
IV. EXPERIMENTS / RESULTS
Two machines, an Intel NUC and an Ultrabook, both with Intel
low-power processors, namely the Intel Core i3-7100U@2.4GHz
and the Intel Core i5-3337U@1.8GHz, are used as alternative
computation nodes to the Xilinx ZC702. Specifically, the SqN
application is offloaded to 4 different computation node
configurations and latency and power measurements are
acquired, as shown in Table II. Additionally, Table II reports
local SqN runs and provides latency results per CNN layer. All
latency results are the average of 100 iterations, and the
power consumption results are acquired: (1) using Intel PCM
(footnote 2) while the computation node serves 1000 image
recognition requests, in the case of the Intel CPUs, and (2)
using Xilinx XPE (footnote 3) in the case of the Xilinx ZC702.
SqN consists of a single-threaded, 32-bit floating-point C++
function accelerated with single-instruction multiple-data
(SIMD) instruction set extensions (Intel AVX, ARM NEON) and
executed on a single core of the target computation nodes.
The results show a remote application performance, in frames
per second (fps) and based on the End-To-End row of Table II,
of 4.16fps for the i3 core, 2.92fps for the i5 core, 0.19fps
for the ARM core, and 2.62fps for the ARM+SqJ cores (SqJ
executes only the Conv+Fire layers). The Total Conv+Fire and
Chip Power results indicate that the ARM+SqJ configuration is
slightly faster at executing convolution operations than the
i5 core while drawing 2.68 times less power (2.227 W versus
5.9883 W). Although the Intel i3 core operates at a higher
frequency than the i5, it consumes less power due to the newer
technology used (14nm vs 22nm) and architecture improvements.
Finally, SqJ provides the remote SqN application with a 13.487
times speedup compared to the ARM-only configuration.
2 https://www.intel.com/software/pcm
3 https://www.xilinx.com/products/technology/power/xpe.html
TABLE II
SQN REMOTE/LOCAL APPLICATION RESULTS
Experimental        NUC               Ultrabook         ZC702         ZC702
platforms           Intel-i3@2.4GHz   Intel-i5@1.8GHz   ARM@667MHz    ARM@667MHz
                                                                      +SqJ@100MHz
SqN Remote Application Latency Results (ms)
Img Preprocessing   10.0961           10.4999           10.0174       10.3446
SqN Inference       181.8990          285.5170          5057.6200     323.3100
Net Transfer        58.2517           57.3641           91.6487       58.4605
End-To-End          240.1507          342.8810          5149.2687     381.7705
Total               250.2468          353.3810          5159.2861     392.1151
SqN Local Application Per Layer Latency Results (ms)
1:Conv              25.5531           35.6304           297.3461      26.4994
2:Maxpool           2.2457            3.4679            28.7091       22.7482
3:Fire              16.6766           25.4867           446.0529      32.7412
4:Fire              17.8092           26.8687           474.0225      34.8575
5:Maxpool           1.5101            2.1909            27.3655       18.0697
6:Fire              14.167            20.6089           450.0639      17.8422
7:Fire              15.1649           22.0343           482.4270      19.0028
8:Maxpool           0.06697           1.0116            14.4056       9.4262
9:Fire              7.7804            11.1605           258.0127      8.6744
10:Fire             8.2085            11.6817           273.4767      8.8977
11:Fire             13.7099           19.3248           497.9448      12.2668
12:Fire             14.2955           20.0220           517.3455      12.8121
13:Conv             36.3992           49.9700           1258.8026     49.5907
14:Avgpool          1.6158            1.5544            5.7776        5.7192
15:Softmax          0.0277            0.0420            0.2242        0.2255
Total Conv+Fire     169.7643          242.7880          4955.4947     223.1848
Total               175.23057         251.0548          5031.9767     279.3736
SqN Remote Application CPU/SoC Power Results (Watts)
Technology          14nm              22nm              28nm          28nm
Chip Power          4.1187            5.9883            1.629         2.227
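The derived figures quoted in this section follow directly from Table II. The short Python sketch below recomputes them from the End-To-End latency and Chip Power rows; the dictionary and function names are illustrative, and the numeric inputs are copied from the table.

```python
# End-To-End latency (ms) and chip power (W), copied from Table II.
END_TO_END_MS = {
    "i3": 240.1507,
    "i5": 342.8810,
    "ARM": 5149.2687,
    "ARM+SqJ": 381.7705,
}
CHIP_POWER_W = {"i3": 4.1187, "i5": 5.9883, "ARM": 1.629, "ARM+SqJ": 2.227}


def frames_per_second(latency_ms):
    """Convert a per-frame latency in milliseconds to a frame rate."""
    return 1000.0 / latency_ms


fps = {name: frames_per_second(ms) for name, ms in END_TO_END_MS.items()}
speedup = END_TO_END_MS["ARM"] / END_TO_END_MS["ARM+SqJ"]
power_ratio = CHIP_POWER_W["i5"] / CHIP_POWER_W["ARM+SqJ"]

for name in END_TO_END_MS:
    print(f"{name:8s} {fps[name]:.2f} fps")
print(f"SqJ speedup over ARM-only: {speedup:.3f}x")
print(f"i5 / ARM+SqJ chip power:   {power_ratio:.2f}x")
```

The recomputed values agree with those quoted in the text up to rounding in the last digit: the text appears to truncate the speedup and the power ratio rather than round them.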
V. CONCLUSION / FUTURE WORK
In this paper, a distributed implementation of the SqueezeNet
CNN is proposed, exploiting the low power consumption of
FPGA-based embedded systems. The application execution is
distributed between the Xilinx ZC702 board and a ROS-enabled
node. Performance is expressed in terms of both execution time
and power consumption, and the results indicate execution
times comparable to common CPUs at lower power consumption.
However, most robotic applications require not only detecting
the presence of objects but also localizing them in the scene,
or even on a global map. For this purpose, the SqueezeDet
network [16] can be used to support both recognition and
multi-object localization tasks. Furthermore, the SqJ
convolutional hardware accelerator could be redesigned to
support: (1) Maxpool layers, since they account for a
considerable share (almost 20%) of the total inference time
when run on a mobile ARM core, and (2) streaming execution, to
avoid memory accesses for fmaps (this requires additional BRAM
resources).
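The Maxpool share mentioned above can be estimated from the per-layer latencies in Table II. The sketch below assumes that the "almost 20%" figure refers to the ARM+SqJ configuration, where the Conv+Fire layers are accelerated and Maxpool still runs on the ARM core; the ARM-only column is included for comparison. Variable and function names are illustrative.

```python
# Local per-layer latencies (ms) from Table II: Maxpool layers 2, 5, 8.
MAXPOOL_MS = {
    "ARM": [28.7091, 27.3655, 14.4056],
    "ARM+SqJ": [22.7482, 18.0697, 9.4262],
}
# "Total" row of the local per-layer results (ms).
TOTAL_MS = {"ARM": 5031.9767, "ARM+SqJ": 279.3736}


def maxpool_share(config):
    """Fraction of the local inference time spent in Maxpool layers."""
    return sum(MAXPOOL_MS[config]) / TOTAL_MS[config]


for config in MAXPOOL_MS:
    print(f"{config:8s} Maxpool share: {maxpool_share(config):.1%}")
```

Under this assumption the share is small when everything runs on the ARM core, because convolutions dominate, and grows to roughly 18% once the convolutions are offloaded to SqJ. This is consistent with Maxpool being named as the next candidate for acceleration.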
REFERENCES
[1] Panagiotis G. Mousouliotis, Loukas Petrou, “SqueezeJet: High-level
Synthesis Accelerator Design for Deep Convolutional Neural Networks”,
accepted in ARC 2018 : 14th International Symposium on Applied
Reconfigurable Computing
[2] Mingas, Grigorios, Emmanouil Tsardoulias, and Loukas Petrou. “An
FPGA implementation of the SMG-SLAM algorithm.” Microprocessors
and Microsystems 36, no. 3 (2012): pp. 190-204.
[3] Zhao, Wei, Byung Hwa Kim, Amy C. Larson, and Richard M. Voyles.
“FPGA implementation of closed-loop control system for small-scale
robot.” In Advanced Robotics, 2005. ICAR’05. Proceedings., 12th In-
ternational Conference on, pp. 70-77. IEEE, 2005.
[4] Shao, Xiaoyin, and Dong Sun. “Development of a new robot controller
architecture with FPGA-based IC design for improved high-speed perfor-
mance.” IEEE Transactions on Industrial Informatics 3, no. 4 (2007): pp.
312-321.
[5] GholamHosseini, Hamid, and Shuying Hu. “A high speed vision system
for robots using FPGA technology.” In Mechatronics and Machine Vision
in Practice, 2008. M2VIP 2008. 15th International Conference on, pp. 81-
84. IEEE, 2008.
[6] Svab, Jan, Tomas Krajnik, Jan Faigl, and Libor Preucil. “FPGA based
speeded up robust features.” In Technologies for Practical Robot Appli-
cations, 2009. TePRA 2009. IEEE International Conference on, pp. 35-41.
IEEE, 2009.
[7] Weberruss, Josh, Lindsay Kleeman, David Boland, and Tom Drummond.
“FPGA acceleration of multilevel ORB feature extraction for computer
vision.” In Field Programmable Logic and Applications (FPL), 2017 27th
International Conference on, pp. 1-8. IEEE, 2017.
[8] Farabet, Clément, Berin Martini, Polina Akselrod, Selçuk Talay, Yann
LeCun, and Eugenio Culurciello. “Hardware accelerated convolutional
neural networks for synthetic vision systems.” In Circuits and Systems
(ISCAS), Proceedings of 2010 IEEE International Symposium on, pp.
257-260. IEEE, 2010.
[9] Jin, Jonghoon, Vinayak Gokhale, Aysegul Dundar, Bharadwaj Krishna-
murthy, Berin Martini, and Eugenio Culurciello. “An efficient implemen-
tation of deep convolutional neural networks on a mobile coprocessor.” In
Circuits and Systems (MWSCAS), 2014 IEEE 57th International Midwest
Symposium on, pp. 133-136. IEEE, 2014.
[10] Bettoni, Marco, Gianvito Urgese, Yuki Kobayashi, Enrico Macii, and
Andrea Acquaviva. “A Convolutional Neural Network Fully Implemented
on FPGA for Embedded Platforms.” In CAS (NGCAS), 2017 New
Generation of, pp. 49-52. IEEE, 2017.
[11] Qiu, Jiantao, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou,
Jincheng Yu et al. “Going deeper with embedded fpga platform for
convolutional neural network.” In Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pp. 26-35.
ACM, 2016.
[12] Guo, Kaiyuan, Lingzhi Sui, Jiantao Qiu, Jincheng Yu, Junbin Wang,
Song Yao, Song Han, Yu Wang, and Huazhong Yang. “Angel-Eye:
A Complete Design Flow for Mapping CNN onto Embedded FPGA.”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems (2017).
[13] Panagiotis Doxopoulos, Konstantinos Panayiotou, Emmanouil Tsardou-
lias, Andreas L. Symeonidis, “Creating an extrovert robotic assistant
via IoT networking devices”, In International Conference on Cloud and
Robotics 2017, Saint Quentin, France, 2017
[14] Quigley, Morgan, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote,
Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng. “ROS: an open-source
Robot Operating System.” In ICRA workshop on open source software,
vol. 3, no. 3.2, p. 5. 2009.
[15] Gysel, Philipp, Mohammad Motamedi, and Soheil Ghiasi. “Hardware-
oriented approximation of convolutional neural networks.” arXiv preprint
arXiv:1604.03168 (2016).
[16] Wu, Bichen, Forrest N. Iandola, Peter H. Jin, and Kurt Keutzer.
“SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural
Networks for Real-Time Object Detection for Autonomous Driving.” arXiv
preprint arXiv:1612.01051 (2016).
