Exploring NEURAGHE: A Customizable Template for APSoC-based CNN Inference at the Edge by Meloni, Paolo et al.
This is the post peer-review accepted manuscript of: 
 
Paolo Meloni, Daniela Loi, Gianfranco Deriu, Marco Carreras, Francesco Conti, 
Alessandro Capotondi, Davide Rossi, "Exploring NEURAGHE: A Customizable 
Template for APSoC-based CNN Inference at the Edge", 
 
in IEEE Embedded Systems Letters. doi: 10.1109/LES.2019.2947312 
 
The published version is available online at: 
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8869860  
 
 
 
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be 
obtained for all other uses, in any current or future media, including reprinting/republishing 
this material for advertising or promotional purposes, creating new collective works, for 
resale or redistribution to servers or lists, or reuse of any copyrighted component of this work 
in other works. 
 
IEEE EMBEDDED SYSTEMS LETTERS, VOL. XX, NO. YY, ZZ 2018 1
Exploring NEURAGHE: A Customizable Template
for APSoC-based CNN Inference at the Edge
Paolo Meloni, Daniela Loi, Gianfranco Deriu, Marco Carreras, Francesco Conti, Alessandro Capotondi
and Davide Rossi
Abstract—The NEURAGHE architecture has proved to be a
powerful accelerator for Deep Convolutional Neural Networks
running on heterogeneous architectures based on Xilinx Zynq-
7000 APSoCs. NEURAGHE exploits the processing system and
the programmable logic available in these devices, to improve
performance through parallelism and to widen the scope of
use-cases that can be supported. In this work, we extend the
NEURAGHE template-based architecture to guarantee design-time
scalability to multi-processor SoCs with vastly different cost, size
and power envelopes, such as Xilinx’s Z-7007s, Z-7020 and Z-7045.
The proposed architecture achieves state-of-the-art performance
and cost effectiveness in all the analyzed configurations, reaching
up to 335 GOps/s on the Z-7045.
I. INTRODUCTION
Convolutional Neural Networks (CNNs) are one of the
most promising classes of deep learning algorithms due to
remarkable performance achieved in a broad area of appli-
cations, ranging from speech recognition to computer vision
and natural language processing [1]. However, improvements
in the accuracy of a CNN come at the expense of increased
computational and memory workload. The execution of CNN
algorithms involves a huge number of Multiply Accumulate
(MAC) operations representing the core of the convolution ker-
nels, and requires significant memory storage and bandwidth
for storing and accessing weights efficiently. Therefore, there
is a growing need for low-cost hardware platforms capable
of running computationally intensive CNN-based applications
fast and efficiently. Due to the intrinsic parallelism in such
convolutions, Field-Programmable Gate Arrays (FPGA) are a
very promising target technology for the implementation of
hardware accelerators for different kinds of neural networks
[2], [3], [4], [5]. By exploiting dedicated hardware, FPGAs make it
possible to realize very flexible architectures. They can integrate a
high number of parallel Digital Signal Processing (DSP)
units, usable to efficiently implement MAC operations and to
deliver high throughput at limited clock frequencies. More-
over, FPGAs offer a significant amount of memory resources,
which can be used to create temporary storage buffers for
partial convolution results, and other hardware primitives,
such as registers and look-up tables (LUTs), suitable to
implement the glue logic around the accelerators. Based on
these considerations, in [6] we have proposed NEURAGHE,
a programmable CNN accelerator exploiting the synergistic
execution on the ARM processor and on the programmable logic
of Xilinx Zynq Z-7000 All Programmable System-on-Chip
(APSoC). Automated generation of CNN processors, based
on high-level synthesis or data-flow generators, has often
been argued to be more effective than a template-based
design in maximizing the utilization of the available FPGA
resources [2], [7]. Contrary to this line of thought, in this work
we show how the NEURAGHE architecture can be tailored to
support different trade-off optimization scenarios with respect
to cost, computation efficiency and power consumption, and
to meet a wide scope of use-cases requiring at-the-edge CNN
inference. We demonstrate that, thanks to its unique flexibility,
NEURAGHE can fit on a wide range of devices, from
low-end IoT nodes to mid-to-high end embedded processing
platforms. NEURAGHE outperforms all existing architectures
on Z-7045 using 8-bit and 16-bit data (delivering up to 335
GOps/s or 173 GOps/s, respectively), achieves state-of-the-art
performance on Z-7020 using 8-bit data (up to ∼85 GOps/s),
and can be used even on tiny devices such as Z-7007s.
P. Meloni, D. Loi, G. Deriu and M. Carreras are with DIEE, University of
Cagliari, Cagliari, 09123 Italy.
F. Conti, A. Capotondi and D. Rossi are with DEI, University of Bologna,
Bologna 40126 Italy.
Manuscript received XX YY, 20XX; revised XX YY, 20XX.
This work has received funding from the European Union’s Horizon 2020
Research and Innovation Programme under grant agreement No. 780788.
II. NEURAGHE ARCHITECTURAL TEMPLATE
The NEURAGHE1 architecture consists of a hierarchical
structure that overlaps with the hardware organization of
Xilinx Zynq SoCs (adaptable to other APSoCs on the market).
It contains a General-Purpose Processor (GPP), i.e. the ARM-
based Processing System (PS) in the Zynq, a memory-mapped
off-chip DDR (as available on the Zynq), and a set of Convo-
lution Specific Processors (CSP), hosted on the Programmable
Logic (PL), acting as accelerator. Fig. 1 illustrates the overall
system-level organization of the NEURAGHE architecture.
NEURAGHE’s CSPs are designed to perform the requested
convolutions autonomously, relieving the GPP from any
control-related action. The design enables a synergistic use
of the GPP, which takes care of actual computing workloads, such
as data marshaling or linear layers, while the CSPs process
convolutions. This execution model is available through a
dedicated software library provided with NEURAGHE [6],
optimized to exploit the NEON extension and the two cores
available on the ARM system. Each Convolution Specific
Processor is composed of: i) a RISC microcontroller (µC)
executing a control firmware for synchronizing transfers and
convolution computations; ii) a Convolution Engine (CE) that
represents the computational core of the accelerator, with a
1 The name of the accelerator template derives from the ancient megalithic
edifices named nuraghes, typical of the prehistoric culture in Sardinia.
The different configurations are named after the most important nuraghes.
https://en.wikipedia.org/wiki/Nuraghe


TABLE II
EVALUATION OF DIFFERENT NEURAGHE CONFIGURATIONS (∗ indicates results based on inter-frame parallelism; for dual configurations, the other number indicates intra-frame parallelism).
Configuration             LOSA        ARRUBIU            SABINA      LOSA        ARRUBIU            SABINA      BANZOS      BANZOS
                          single 4×4  dual 2×4           single 2×2  single 4×4  dual 2×4           single 2×2  single 1×1  single 1×1
Xilinx Zynq SoC           Z-7045      Z-7045             Z-7020      Z-7045      Z-7045             Z-7020      Z-7007s     Z-7007s
(Price)                   ($2500)     ($2500)            ($450)      ($2500)     ($2500)            ($450)      ($89)       ($89)
DSP [#]; Freq [MHz]       864; 140    864; 140           216; 120    864; 140    864; 140           216; 120    54; 80      54; 80
Benchmark net             ResNet-18   ResNet-18          ResNet-18   VGG-16      VGG-16             VGG-16      SqueezeNet  All-CNN-C
GOps/s (16-bit)           61.91       59.84∗ / 53.52     27.51       172.67      188∗ / 170.81      42.48       7.46        6.21
GOps/s per Watt (16-bit)  6.19        5.98∗ / 5.35       7.86        17.26       18.8∗ / 17.08      12.56       2.98        2.49
GOps/s per k$ (16-bit)    25          21.4∗ / 23.93      61.13       69          75.2∗ / 68.3       97.5        95          27.93
GOps/s (8-bit)            111.12      63.94∗ / 73.27     49.61       335.09      370.44∗ / 296.27   84.77       14.05       10.62
TABLE III
COMPARISON BETWEEN NEURAGHE AND OTHER ALTERNATIVES IN LITERATURE WHEN USING 8- OR 16-BIT ON VGG-16.
Work                           This work  Venieris et al. [7]  Sharma et al. [8]  Guo et al. [9]  This work  Venieris et al. [10]  Guo et al. [9]
Xilinx Zynq SoC                Z-7020     Z-7020               Z-7020             Z-7020          Z-7045     Z-7045                Z-7045
Frequency [MHz]                120        125                  150                214             140        125                   150
Performance (16-bit) [GOps/s]  42.48      48.53                31.38              -               172.67     155.81                137
Performance (8-bit) [GOps/s]   84.77      -                    -                  84.30           335.09     -                     292
on smaller features and performance overheads related mostly
to the loading of line buffers have a higher impact. Moreover,
input data tiles have to overlap with each other, generating an
additional overhead in terms of repeated MAC operations to be
paid when computing border pixels. Exploiting inter-frame
parallelism is not possible in every use-case. For example, in
ResNet-18, the GPP in the architecture is used very frequently
along the CNN datapath for marshaling and shortcuts. Thus,
it cannot serve as a companion to two different CSPs without
compromising the overall scheduling efficiency with respect
to a single-CSP configuration like LOSA (see Table II).
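The 16-bit figures in Table II make this trade-off concrete. The short Python sketch below computes the throughput of the dual-CSP ARRUBIU configuration relative to the single-CSP LOSA configuration, using only values reported in Table II. The dictionary and function names are illustrative and are not part of the NEURAGHE software.

```python
# 16-bit throughput in GOps/s, copied from Table II (all on Z-7045).
# Names below are illustrative; they are not part of the NEURAGHE library.
TABLE_II_16BIT = {
    "ResNet-18": {
        "LOSA": 61.91,
        "ARRUBIU inter-frame": 59.84,
        "ARRUBIU intra-frame": 53.52,
    },
    "VGG-16": {
        "LOSA": 172.67,
        "ARRUBIU inter-frame": 188.0,
        "ARRUBIU intra-frame": 170.81,
    },
}


def relative_to_losa(net: str) -> dict:
    """Return each ARRUBIU result as a ratio over the LOSA result for `net`."""
    rows = TABLE_II_16BIT[net]
    baseline = rows["LOSA"]
    return {name: value / baseline for name, value in rows.items() if name != "LOSA"}


if __name__ == "__main__":
    for net in TABLE_II_16BIT:
        for name, ratio in relative_to_losa(net).items():
            print(f"{net:10s} {name:20s} {ratio:.2f}x LOSA")
```

With these figures, inter-frame parallelism gains roughly 9% over LOSA on VGG-16 but falls about 3% below it on ResNet-18, in line with the observation above.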
C. Exploring CE arithmetic precision
In NEURAGHE configurations processing 16-bit activations
and weights, each DSP slice in the programmable logic performs
one MAC operation per cycle. NEURAGHE allows the
programmer to select at runtime an operating mode processing
8-bit data, in which each slice executes 2 MAC operations per
cycle, potentially improving performance by a factor of
2. As shown in Table II, the speed-up measured in 8-bit operating
modes is very close to the theoretical limit.
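As a rough sanity check of this factor of 2, the sketch below estimates the peak throughput from the DSP counts and clock frequencies in Table II and compares the measured 8-bit and 16-bit VGG-16 figures. It assumes that every listed DSP slice is used for MACs and that one MAC counts as two operations (a common convention that is not stated in this section). The function and variable names are illustrative.

```python
# Rough peak-throughput model; figures are copied from Table II.
# Assumptions (not stated in the text): all listed DSP slices perform MACs,
# and one MAC is counted as 2 operations (multiply + add).
OPS_PER_MAC = 2


def peak_gops(dsp_slices: int, freq_mhz: float, macs_per_slice: int) -> float:
    """Peak GOps/s = slices * MACs per slice per cycle * ops per MAC * frequency."""
    return dsp_slices * macs_per_slice * OPS_PER_MAC * freq_mhz * 1e6 / 1e9


# (DSP slices, frequency in MHz, measured VGG-16 GOps/s at 16-bit and at 8-bit)
VGG16_RESULTS = {
    "LOSA (Z-7045)": (864, 140, 172.67, 335.09),
    "SABINA (Z-7020)": (216, 120, 42.48, 84.77),
}

if __name__ == "__main__":
    for name, (dsp, mhz, gops16, gops8) in VGG16_RESULTS.items():
        peak16 = peak_gops(dsp, mhz, macs_per_slice=1)  # 16-bit mode
        peak8 = peak_gops(dsp, mhz, macs_per_slice=2)   # 8-bit mode
        print(f"{name}: peak {peak16:.2f} / {peak8:.2f} GOps/s (16-/8-bit), "
              f"measured 8-bit speed-up {gops8 / gops16:.2f}x")
```

Under these assumptions, LOSA peaks at 241.92 GOps/s (16-bit) and 483.84 GOps/s (8-bit), and the measured VGG-16 speed-ups are about 1.94x on LOSA and 2.00x on SABINA.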
D. Comparison With S.o.A.
In Table III, we show a comparison, on VGG-16, between
the proposed platform and some related work. The availability
of design-time configurability and runtime tuning (selection
between 8- and 16-bit data precision) makes NEURAGHE
more flexible than the alternatives. On Z-7020, with
respect to [7], NEURAGHE provides comparable 16-bit
performance (42.48 vs. 48.53 GOps/s), but supports more data types.
The same holds for the comparison with [9], which does not provide
results for 16-bit and provides slightly lower performance for 8-bit.
Another alternative in the literature is [8]. NEURAGHE is more
flexible, since [8] only supports 16-bit, and provides higher
performance, being ∼35% faster. On Z-7045, NEURAGHE
outperforms competitors for all considered data precision
values. NEURAGHE is ∼10% faster than the work in [10] (∼20% if
we consider ARRUBIU) when using 16-bit, and ∼14% faster
than the work in [9] for 8-bit. Moreover, to the best of our
knowledge, NEURAGHE is the only alternative presenting an
implementation on very low-cost platforms, as done with BANZOS
(see the rightmost columns of Table II). This configuration is
suitable for building low-cost FPGA-accelerated IoT
nodes. We also report the performance achieved by BANZOS on
a lightweight CNN [11], indicated as All-CNN-C, performing
classification on low-resolution images. In this case, it supports
∼7 FPS at 16-bit precision and ∼13 FPS when using 8-bit.
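The relative speed-ups quoted in this comparison follow from the VGG-16 figures in Table III (and, for ARRUBIU, Table II). A minimal Python check, with illustrative names that are not part of any NEURAGHE tooling:

```python
# Throughput figures in GOps/s, copied from Tables II and III (VGG-16).
def percent_faster(ours: float, theirs: float) -> float:
    """Relative advantage of `ours` over `theirs`, in percent."""
    return (ours / theirs - 1.0) * 100.0


# label -> (NEURAGHE throughput, competitor throughput)
COMPARISONS = {
    "Z-7020, 16-bit, vs [8]": (42.48, 31.38),
    "Z-7045, 16-bit, vs [10]": (172.67, 155.81),
    "Z-7045, 16-bit, ARRUBIU vs [10]": (188.0, 155.81),
    "Z-7045, 8-bit, vs [9]": (335.09, 292.0),
}

if __name__ == "__main__":
    for label, (ours, theirs) in COMPARISONS.items():
        print(f"{label}: {percent_faster(ours, theirs):+.1f}%")
```

The computed advantages are about 35.4%, 10.8%, 20.7% and 14.8%, matching the rounded values given in the text.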
VI. CONCLUSION
In this paper we present NEURAGHE, a customizable
architecture for CNN acceleration on FPGA-endowed SoCs,
focusing on its parametrization capabilities. Changing param-
eters of the NEURAGHE architecture leads to a variety of
configurations that may be used in different use-cases to fit in
different embedded systems, ranging from entry-level to high-
end. This offers different trade-off optimization scenarios with
respect to performance, cost and power consumption.
REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep Learning,” Nature, vol. 521,
pp. 436–444, 2015.
[2] M. Blott et al., “FINN-R: An end-to-end deep-learning framework for
fast exploration of quantized neural networks,” ACM Trans. Reconfig-
urable Technol. Syst., vol. 11, no. 3, pp. 1–23, Dec. 2018.
[3] S. Mittal, “A survey of FPGA-based accelerators for Convolutional
Neural Networks,” Neural Computing and Applications, vol. 30, pp.
1–31, 2018.
[4] D. Pani et al., “An FPGA platform for real-time simulation of spiking
neuronal networks,” Frontiers in Neuroscience, vol. 11, p. 90, 2017.
[5] C. Zhang et al., “Optimizing FPGA-based Accelerator Design for Deep
Convolutional Neural Networks,” in Proceedings of ACM/SIGDA Inter-
national Symposium on Field-Programmable Gate Arrays, ser. FPGA
’15. New York, NY, USA: ACM, 2015, pp. 161–170.
[6] P. Meloni et al., “NEURAghe: Exploiting CPU-FPGA Synergies for
Efficient and Flexible CNN Inference Acceleration on Zynq SoCs,” ACM
Trans. Reconfigurable Technol. Syst., vol. 11, no. 3, pp. 1–24, 2018.
[7] S. I. Venieris and C.-S. Bouganis, “Latency-driven design for FPGA-
based convolutional neural networks,” 27th International Conference on
Field Programmable Logic and Applications (FPL), pp. 1–8, 2017.
[8] H. Sharma et al., “From high-level deep neural models to FPGAs,”
49th Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO), pp. 1–12, 2016.
[9] K. Guo et al., “Angel-eye: A complete design flow for mapping CNN
onto embedded FPGA,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 37, pp. 35–47, 2018.
[10] S. I. Venieris and C.-S. Bouganis, “fpgaConvNet: Mapping regular and
irregular convolutional neural networks on FPGAs,” IEEE Transactions
on Neural Networks and Learning Systems, 2018.
[11] J. T. Springenberg et al., “Striving for simplicity: The All Convolutional
Net,” CoRR, vol. abs/1412.6806, 2015.
