A scalable and efficient convolutional neural network accelerator using
  HLS for a System on Chip design by Bjerge, Kim et al.
Kim Bjerge (a,∗), Jonathan Horsted Schougaard (b), Daniel Ejnar Larsen (b)
(a) School of Engineering, Aarhus University, Finlandsgade 22, 8200 Aarhus N, Denmark
(b) Department of Engineering, Aarhus University, Finlandsgade 22, 8200 Aarhus N, Denmark
Abstract
This paper presents a configurable Convolutional Neural Network Accelerator (CNNA) for a System on
Chip design (SoC). The goal was to accelerate inference of different deep learning networks on an embedded
SoC platform. The presented CNNA has a scalable architecture which uses High Level Synthesis (HLS) and
SystemC for the hardware accelerator. It is able to accelerate any Convolutional Neural Network (CNN)
exported from Python and supports a combination of convolutional, max-pooling, and fully connected layers.
A training method with fixed-point quantized weights is proposed and presented in the paper. The CNNA is
template-based, enabling it to scale for different targets of the Xilinx Zynq platform. This approach enables
design space exploration, which makes it possible to explore several configurations of the CNNA during C-
and RTL-simulation, fitting it to the desired platform and model. The CNN VGG16 was used to test the
solution on a Xilinx Ultra96 board using PYNQ. The result gave a high level of accuracy in training with
an auto-scaled fixed-point Q2.14 format compared to a similar floating-point model. It was able to perform
inference in 2.0 seconds, while having an average power consumption of 2.63 W, which corresponds to a
power efficiency of 6.0 GOPS/W.
Keywords: System On Chip, FPGA, High Level Synthesis, Convolutional Neural Network, PYNQ
1. Introduction
In recent years, deep learning with Convolutional
Neural Networks (CNNs) has been applied in many
different fields such as image classification [1],[2],
object detection [3],[4] and recognition [5]. In most
cases, state-of-the-art CNN models run on a server
in the cloud. However, with the increase of Internet
of Things (IoT), there is a demand for embedding
the deep neural networks into mobile edge comput-
ing. This is especially true for computer vision sys-
tems, where the amount of collected data is high
and analyses of images must be carried out in real-
time.
As CNNs continue to be applied to increasingly
complex problems, throughput, latency and
energy efficiency present challenges on embedded
devices with Central Processing Units (CPUs) or
Graphics Processing Units (GPUs). Due to several
attractive features, Field Programmable Gate
Arrays (FPGAs) present promising platforms for
Hardware (HW) acceleration of CNNs as reported
in [6],[7],[8],[9]. CNNs that are optimized for
fixed-point data or use binary neural networks
achieve even better performance [10],[11],[12],[13].
In general, FPGAs provide higher performance
than CPUs and have a better energy efficiency than
both CPUs and GPUs.
∗Corresponding author
Email address: kbe@ase.au.dk (Kim Bjerge)
Historically, the long design time and need for
HW experts have limited the use of FPGAs. Here,
the high-level synthesis tools have enabled auto-
matic compilation from imperative high-level pro-
grams to low-level specifications in a Hardware De-
scription Language (HDL) [14]. It is, however, still
a challenge to accelerate large-scale CNNs [15] on
an FPGA, since model parameters typically require
far more memory than the on-chip capacity of the
FPGAs. Another challenge is to find an optimal
configuration for a given HW accelerator design due
to the long design time.
Preprint submitted to Journal of Systems Architecture October 8, 2020
arXiv:2004.13075v2 [cs.CV] 7 Oct 2020
The scope of our work is to develop a generic
and flexible architecture, which can accelerate the
inference of CNNs on a Multi-Processor System on
Chip design (MPSoC). It presents the design of the
HW/SW architecture, i.e. the programmable logic
that will reside in the FPGA fabric and the de-
sign of the software. The architecture is generic so
that it can accept major CNNs such as AlexNet [16]
and VGG16 [2], which can be exported from a deep
learning framework such as Keras [17]. It is devel-
oped in the PYNQ [18] framework using Python
and SystemC [19] in order to create a generic template-based
HW accelerator. To find the optimal
design, this study uses a SystemC-based simulation
to explore the design space of the optimal configuration
parameters of the Convolutional Neural
Network Accelerator (CNNA). The design model is
translated to an HDL specification using High Level
Synthesis (HLS). Our paper discusses the precision,
speed and power consumption of the accelerator as
well as the fixed-point retraining of the CNN.
1.1. Related work
In this section, the current state-of-the-art HW-
based CNNAs that inspired the architecture pre-
sented in this paper will be discussed.
The Microsoft model [20] is an architecture developed
by Microsoft for accelerating CNNs for a cloud
server solution with several FPGA-cards. The ar-
chitecture uses a top-level controller to control the
data-flow with a PCI memory interface. It has mul-
tiple input buffers, one kernel weight buffer, a large
array of Processing Element Arrays (PEAs) and
lastly, a data redistribution block. It uses a Di-
rect Memory Access (DMA) channel to load data
in from PC memory to the buffers. On the FPGA
it uses PEA blocks to perform dot product calcula-
tions of the values in the input buffer and the weight
buffer. The result of the dot product is saved into
the next input buffer.
ZynqNet [21] is based on the architecture of the
Microsoft model. However, it focuses on making it
work for both training and inference. It is built for
a System on Chip (SoC) design instead of a server
solution. The proposed solution seems promising,
although it appears to have a few bottlenecks due
to a purely C-based HLS implementation of the so-
lution. It uses a Circular Line Buffer (CLB) for
input data handling and a memory-mapped master
interface to get data from the main memory, i.e.
weights and input data are transferred using the
memory-mapped interface.
FINN-R [11] is an end-to-end deep-learning
framework for fast exploration of Quantized Neural
Networks (QNNs). It is a framework built upon
the FINN accelerator [22], which is a QNN accelerator built
for FPGAs. FINN-R consists of a cascade of
multiple layer accelerators that are optimized for
a pipelined architecture. This design reduces the
transfer of data between the main memory and the
accelerators. The difficult part is to balance the lay-
ered accelerators in order to prevent bottlenecks or
resource waste. However, the framework does not
solve the problem of different throughput for each
layer. FINN-R optimizes the generated HW using
HLS, allowing fast exploration of QNN to create the
perfect accelerator for a specific FPGA target.
To accelerate and develop CNNs on reconfigurable
HW (FPGAs), a survey of the current state-of-the-art
toolflows was published in 2018 by Venieris
et al. [23]. The survey compares the performance
of fpgaConvNet [24, 25], DNNWEAVER [26],
Angel-Eye [27], DeepBurning [28] and Caffeine [29].
The listed toolflows cover mapping of the classic
AlexNet and VGG16 to the Xilinx Zynq or
UltraScale platforms. The deep learning framework
Caffe by Berkeley AI Research is the most
widely supported front end for these state-of-the-art
toolflows.
The above-mentioned toolflows can be divided
into two main categories of architectures: streaming
architecture and single computation engine. FINN-R,
fpgaConvNet and DeepBurning are in the category
of streaming architectures. This chained and
pipelined architecture can achieve high performance,
but the optimal HW design needs to be
found and synthesized for each specific CNN. The
single computation engine, on the other hand, ex-
ecutes CNN layers sequentially, which means that
the same HW engine is able to handle many differ-
ent CNNs. The engine is controlled from SW, and
data must be moved from CPU memory to the on-
chip FPGA memory during processing of the CNN
layers. The advantage of this approach is that the
same HW can be used for several CNN architectures
without reconfiguration.
Tunable parameters must be optimized for the
available resources of the FPGA devices, which is
the case for DNNWEAVER, Angel-Eye and Caffeine.
Angel-Eye uses a compiler to translate the input
CNN to custom instructions. A similar approach
is used by DNNWEAVER, which utilizes a macro
dataflow instruction set architecture and supports
FPGAs from both Xilinx and Altera. Of the above-mentioned
toolflows, only fpgaConvNet supports
special layers with irregular dataflow including the
inception, residual and dense blocks that are required
for the newest deep neural networks such
as ResNet [30], DenseNet [31], InceptionNet and
GoogLeNet [32].
The single computation engines build their archi-
tectures around a PEA with a buffer for handling
input data, which could be a CLB or a row buffer.
A streaming design, such as FINN-R, uses the output
to feed the next accelerator. Consequently, the
streaming architecture has a large memory to cache
layered outputs directly to the next input buffer so
that the data are ready for the next CNN layer.
However, due to limited internal memory, this approach
is not feasible for all FPGAs. Therefore,
there is a need for reloading the input data from
the main memory. An example of this is the Microsoft
model. Other architectures use the main
memory to cache the data between layers. FINN-R
and fpgaConvNet do this for each block of layers.
The CNNA developed in this work has some elements
in common with the solutions presented
above. It uses the main memory to store data between
layers and uses the single computation engine
approach. In addition, the architecture is built
around a PEA with two buffering systems: one for
the weights and one for input image data, the latter
of which uses a CLB. The above architectures are
very similar, but the major difference lies in the details
of the CLB, which enables efficient pipelining
and data alignment.
The CNNA architecture in our work supports any
input size and layer depth, stride, zero padding
and window size. This makes the accelerator more
flexible and enables it to run nearly any CNN
model that uses convolution, pooling and fully con-
nected layers. It can be used with most CNNs
during run-time inference without the need for re-
compiling. The accelerator is developed to work
with PYNQ [33],[18] and uses an Application Pro-
gramming Interface (API) similar to Keras [17]. In
summary, this paper makes the following contribu-
tions.
• We present a generic CNN architecture con-
sisting of a single computation engine with five
core elements (weight buffer, data buffer, PEA,
pooling block and output handler) to perform
FPGA acceleration of CNN models.
• Stitching is used for convolutional layers that
are too large to execute in a single processing
pass and is used to split complex convolutions
into sub-convolutions.
• Dynamic auto scaling is used during training
to minimize the accuracy difference between the
floating-point model and the quantized fixed-point
accelerator.
• A template-based SystemC design with an executable
model is proposed for design space exploration.
The template model is synthesized
to a Xilinx IP core with HLS and controlled
from the host CPU using the PYNQ framework
and Python.
2. Design methods
In this section, we will briefly describe the design
methods and concepts used as a basis for designing
and implementing the architecture for the CNNA.
In our work, SystemC is used with the design flow
described in [34, ch. 1],[35]. It is an efficient way in
which an IP can be written and verified using HLS.
SystemC is able to model the HW structure and
concurrency in a more optimal manner than pure
C and C++. It is an IEEE standard (IEEE Std
1666-2011) [19], based on a C++ template library
made for HW/SW co-design.
The use of templates makes the IP core config-
urable and portable to explore different solutions
and platforms, whereas custom designs are less flex-
ible. It is much faster to recompile and simulate a
template-based IP core than to write a custom IP
that may be more optimal. By use of SystemC the
desired HW architecture can be controlled and de-
signed via modules with parallel clocked threads.
With HLS directives it is possible to control the
synthesized threads and achieve a desired unroll,
pipeline and iteration interval of the synthesized
RTL code.
PYNQ (Productivity for Zynq) [36] is an open-source
framework for creating applications on a
Zynq MPSoC. The system design in this work is
based on PYNQ for controlling the IP directly from
Python. This framework is divided into three lay-
ers: application, SW and HW.
The application layer, which hosts the user-code,
is described in [36, ch. 22]. This is usually Python
code in the form of a Jupyter notebook that runs on
the ARM CPU inside the Zynq MPSoC. The mid-
dle layer in the PYNQ framework is the SW layer.
Figure 1: Block diagram of the system architecture covering
CPU (Zynq Ultrascale+ MPSoC), memory (DDR RAM),
HW IP core accelerator (CNNA), five DMAs for inputs
(X), outputs (Y), weights (W), control (CTRL) and splits
(XBUF).
This layer contains the Python libraries and the in-
teraction with the IP inside the FPGA through the
OS drivers. Several drivers are provided through
the PYNQ libraries for interacting with the IP. The
interface is called an overlay and is used to pro-
gram the FPGA and manage the IP. The last HW
layer in the PYNQ framework is the bit-file pro-
grammed into the FPGA. The interaction between
the SW layer and the HW layer is done using DMA
or memory-mapped interfaces.
3. System architecture
The SoC design consists of three main elements:
the FPGA (i.e. the Programmable Logic (PL)), the
dual-core CPU and memory (i.e. Dynamic Random
Access Memory (DRAM)). The goal is to run a
CNN consisting of convolutional, max-pooling and
fully connected layers computed in the same IP core
inside the FPGA logic. The responsibility of the
CPU is to let the user control the HW acceleration
so that the IP core processes the CNN layers in the
correct sequential order. Figure 1 shows that the
system uses DMA to transfer data and weights be-
tween the CPU and IP core accelerator. The CPU
controls the DMA data block transfer and converts
the memory interface to the streaming interface of
the IP core.
The system interacts in different manners de-
pending on which scenario it needs to execute.
There are three main scenarios: preprocessing, ini-
tialization and inference.
Preprocessing. The first scenario converts the
weights to fixed-point and realigns and scales the
weights so that they are ready for the system to
use. Preprocessing, which can be done offline on any computer,
also calculates parameters such as the layer output size and
which layers need to be split. The weights are
transformed from floating-point to fixed-point rep-
resentation in the chosen format, and aligned and
rounded off correctly, as described later. Finally,
the weights are saved in an h5-file, which is the
standard format for storing weights in Keras [17],
and can be transferred to the HW target.
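As a minimal sketch of the kind of parameter computed during preprocessing, the output row length of a square layer follows the standard convolution arithmetic. The function name and arguments below are hypothetical and are not taken from the CNNA software.

```python
def conv_output_size(row_size: int, window: int, stride: int = 1, zero_pad: int = 0) -> int:
    """Output row length of a square convolution or pooling layer."""
    return (row_size + 2 * zero_pad - window) // stride + 1


# VGG16-style first block: a 3x3 convolution with zero padding 1,
# followed by 2x2 pooling with stride 2.
conv_rows = conv_output_size(224, window=3, stride=1, zero_pad=1)
pool_rows = conv_output_size(conv_rows, window=2, stride=2)
```

With these values the convolution preserves the 224-pixel row length and the pooling halves it.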
Initialization. The HW target needs to be config-
ured and initialized for a particular fixed-point res-
olution by using the synthesized bit-file of the op-
timized CNNA. The bit-file contains the CNNA IP
core and interconnection setup to the CPU for the
specified HW target. This is done by using a spec-
ification of the model in the form of a JSON-file
and an h5-file containing the weights, which are al-
ready realigned and quantized in the preprocessing
scenario. It starts by calculating the buffer size and
getting the properties of the loaded CNNA. When
this is done, the SW allocates the needed resources
and readies the SW for inference by allocating the
buffers for each layer in the CNN.
Inference. When using the system, predicting an
image will be the most commonly used task. This
task is shown in the sequence diagram in figure 2.
Here, the user calls the method predict, which re-
turns the predicted class of the image when the in-
ference is done. The image, which is the parameter
to the method predict, is stored internally in a con-
tiguous array, i.e. an array which can be used by
the PYNQ DMA library. Depending on the CNN,
several layers are executed in the correct order, i.e.
convolution, pooling or fully connected layer. All
parameters controlling the CNN execution are sent
at the start of the predict method.
The convolution is done by the CPU initiating
four different tasks in parallel. It sets up the data
transfer for the input control data CTRL, the input
data X, the output Y and XBUF. Each of these data
transfers is handled by the DMA, which streams
the content of the buffer from DRAM to the CNNA.
The fully connected layer is executed similarly to
both pooling and convolution. It starts four different
DMAs: one for each of the input data X,
the weights W, the output Y and the configuration
through the CTRL.
Figure 2: Sequence diagram of the interaction
between SW control (PYNQ) and HW accelerator (FPGA)
during inference.
Two interfaces are used. The AXI streaming interface
used for the DMA transmits the data between the
CPU and the IP. It is implemented with a
functional deterministic guarantee so that no race
condition can happen, which makes the entire IP
very stable. The other interface, AXI-lite [37], is
memory-mapped and is only used for reading status
registers.
3.1. Software control and stitching
Some convolutional layers are too large to be pro-
cessed as a single CNNA iteration. This means that
they are split into several sub-convolutions. How-
ever, the result is returned from the IP core with
the depth first and thus needs to be stitched to-
gether with the later result. This is done through
the IP core which has a DMA-channel (XBUF) for
this purpose, as shown in figure 1. An example of
stitching can be seen in figure 3. The shown ex-
ample illustrates the output of two pixels, a and b,
from a convolution with a depth that needs to be
split.
The stitching is done by using two equally sized
buffers, which both have a size equal to the expected
output size. The size in this example is six.
The first convolution only uses the first buffer as
the output, and pixels a and b are both updated
by the DMA. However, only the first third of the
depth is calculated in the first convolution. The
second convolution calculates the next third of the
output. However, these outputs need to be stitched
in between the previous outputs. This is done by
using the output of the first convolution as a stitch
buffer. The IP core is instructed to use the first part
of each pixel and to append the result to each pixel
depth-wise. The result of this stitching is sent to
the output buffer[1]. The third convolution takes
two thirds from the stitch buffer and the last third
from the output for each pixel. The output of the
stitched convolution is in the buffer, which is the
one used as output in the last stitching.
Figure 3: Example of buffer stitching with a split of three
shown for a single pixel.
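The depth-wise stitching can be sketched in NumPy as follows. This is an illustrative software model under simplifying assumptions, not the IP core's implementation: the function `stitch` and its arguments are hypothetical, and each buffer here holds only the valid part of the output rather than a fixed-size allocation.

```python
import numpy as np


def stitch(stitch_buf, new_out, n_pixels, done_depth, new_depth):
    """Interleave an earlier partial result with a new split, pixel by pixel.

    stitch_buf holds done_depth values per pixel (depth first);
    new_out holds new_depth values per pixel from the current sub-convolution.
    """
    old = np.asarray(stitch_buf).reshape(n_pixels, done_depth)
    new = np.asarray(new_out).reshape(n_pixels, new_depth)
    return np.concatenate([old, new], axis=1).reshape(-1)


# Two pixels (a, b) with a total depth of three, computed as three splits
# of one channel each, so the final output size is six.
split_outputs = [np.array([10, 20]), np.array([11, 21]), np.array([12, 22])]

buffers = [split_outputs[0], None]   # the first convolution writes buffer 0 directly
src = 0
for k, part in enumerate(split_outputs[1:], start=1):
    dst = 1 - src                    # alternate between the two buffers
    buffers[dst] = stitch(buffers[src], part, n_pixels=2, done_depth=k, new_depth=1)
    src = dst

result = buffers[src]                # depth-first output: a0 a1 a2 b0 b1 b2
```

With three splits the final result ends up in the first buffer again, since the role of stitch buffer and output buffer alternates on every pass.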
However, the fully connected layers can also be
too large to be processed at once, in which case
the splits are handled differently. A fully connected
layer generates a single value per output. The
buffer will be filled with values from the first split
when it runs the first time. The second time it runs
it will get the next outputs, which must then be
placed after the first split in the buffer. All the
splits need to have an adjusted number of bytes to
ensure that the correct amount of data is received.
When all the splits have been processed the result
is in the same buffer. This means that the fully
connected layer only needs a single buffer, contrary
to the convolution layers, which need two.
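For comparison, a split fully connected layer only requires each split to be written after the previous one in a single buffer. A minimal sketch, with the hypothetical helper `run_fc_splits` standing in for the DMA transfers:

```python
import numpy as np


def run_fc_splits(split_results, n_outputs):
    """Place the outputs of each split one after another in a single buffer."""
    out = np.zeros(n_outputs, dtype=np.int32)
    offset = 0
    for part in split_results:
        out[offset:offset + len(part)] = part   # next split lands after the previous one
        offset += len(part)
    if offset != n_outputs:
        raise ValueError("splits must cover every output exactly once")
    return out


# Ten outputs computed in three splits; the last split is shorter.
fc_out = run_fc_splits([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]], n_outputs=10)
```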
3.2. CNN hardware accelerator
Figure 4 shows the architecture overview of the
main elements of the CNNA. The CNNA works as
an accelerator for a single layer at a time. This
means that the accelerator needs to be reconfigured
for each layer, which is done through the streaming
interface CTRL.
Figure 4: Illustration of the CNNA architecture with five
streaming buffer inputs for control, stitching, weights and
data as well as one result output buffer. A number of Processing
Elements (PE) are used to accelerate the pooling and
multiplications for each neuron in the network. The CNNA
will be executed several times and typically once for each layer
during inference.
The streaming interface W is used to load the
weights, which can consist of multiple kernels, and
cache them in the weight buffer. This means that
they can be used multiple times without reloading
from DRAM. The streaming interface X is used to
stream in the input data, which can either be an
image or the output from the previous layer. XBUF
is an interface that is used when a convolutional
layer is split into several splits, which need to be
stitched together correctly. The last streaming in-
terface is Y, which streams out the output values
of the operation.
The accelerator is built for three different opera-
tions: convolution, pooling and fully connected lay-
ers. During convolution acceleration, it uses the
weight buffer, the data buffer and the PEA. The
pooling operation is done by using only the data
buffer and the pooling block. When executing the
fully connected layers, the weight buffer and the
data buffer are simply set to forward the data di-
rectly to the PEA, thus generating a dot product
of the two. The following sections describe the five
core elements comprising the design of the CNNA.
3.2.1. Weight buffer
The weight buffer is used for caching the weight
kernels. This caching is necessary, because the con-
volution requires the kernels to be used once for
each output pixel.
An illustration of the weight buffer module can
be seen in figure 5, which shows the modules inside
of it. The Iteration Interval (II) and Bandwidth
(BW) of the weight input package are changed during
resizing as illustrated in the figure. It shows
how the resize module changes the BW from the input
BW, $BW_{(in)}$, by a factor of $resize_{factor}$. It also
splits the raw package into smaller packages. The
realign module splits the raw package into smaller
packages. The splitter separates the data stream
into N different stream buffers, each of which has
the same BW as the resized BW, $BW_{(resize)}$. Each
stream buffer sends the kernel to a Processing Element
(PE) X times.
Figure 5: Weight buffer with alignment, illustrating how kernel
weight data are aligned so that a specific kernel is placed in
the correct spot. The illustration shows how the weight data
of kernel 0 are sent to stream buffer 1. The II and BW of the
weight input package are changed by a factor of $resize_{factor}$.
The yellow part of the image cube shows which part of the
kernel is sent.
The realignment in the weight buffer is compli-
cated. Firstly, this is due to the bias value, which
uses a complete package. Secondly, it is compli-
cated because the kernels need to match the or-
der in which the three-dimensional window comes
from the data buffer, i.e. have the same positions
and depths as the data buffer. In figure 5, the first
package contains the bias values transferred to the
weight buffer. It shows that this single value
uses a complete resized package on its own. This is followed by
N other bias packages. After all bias packages are
sent, the weight packages are sent. In this example,
the weight packages contain a 3 × 3 × 4 window.
The stream buffers receive the values one after the
other.
3.2.2. Data buffer
The data buffer is used to handle the input data
stream and create the windows on which the con-
volution or pooling is carried out.
An image typically consists of three chan-
nels, RGB, which can be visualized as a three-
dimensional cube. Such a three-dimensional image
is illustrated in figure 6. The image is stored in
raster order, i.e. first, pixel (0,0) channel 0, then
channel 1 of the same pixel followed by the last
channel. This is followed by the same three chan-
nels for the pixel one row down, which means that
the Z-axis is processed first, then the Y-axis second
and the X-axis last. Raster order is the order in
which the image data is streamed to the CNNA.
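The streaming order described above can be illustrated with a short index calculation. This is a sketch assuming a square image stored as an array indexed [x, y, z]; the helper `stream_index` is hypothetical.

```python
import numpy as np


def stream_index(x, y, z, rows, depth):
    """Position of element (x, y, z) in the stream: Z fastest, then Y, then X."""
    return (x * rows + y) * depth + z


rows, depth = 4, 3                       # hypothetical 4x4 image with three channels
image = np.arange(rows * rows * depth).reshape(rows, rows, depth)  # indexed [x, y, z]
stream = image.reshape(-1)               # C-order flattening yields the same ordering
```

Here the three channels of pixel (0,0) occupy stream positions 0 to 2, the pixel one step along the Y-axis starts at position 3, and the first pixel of the next X position starts at position 12.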
The CLB can be considered the brain of the
CNNA, because it allows it to increase the BW
and removes the need to realign the input data for
each layer. The parameters that the actor can set
through the control interface are:
• Row size — The row size of the input image,
i.e. the Y-axis length. It is assumed that the
image is square.
• Depth — The depth is N-channels of the input
image, i.e. the Z-axis of the image. This should
be divisible by the BW.
• Stride — Stride is a convolution parameter.
• Window size — The size of the window. If the
window size is 3, the real window size would
be 3 × 3 × depth. This is also a convolution
parameter.
• Zero pad — A convolution parameter setting
the size of the zero padding around the image.
• Replay — How many times the CLB should
resend a single window.
After setting up the CLB with the parameters
through the control interface, the image data can
flow into the CLB. The CLB consists of two parts:
a line buffer for storing $N_{\text{lines}}$ previous lines and
a shift buffer for storing $N_{\text{pixels}}$ previous pixels for
each line. These parts are explained in detail below:
Line buffers. The first module in the CLB, where
the image data is ordered and stored is the line
buffer. This module streams one row of the image
with all channels at the same time. The number of
line buffers is equal to the maximum window size
minus one, $N_{\text{line buffer}} = \text{window}_{\text{size}} - 1$. This is because
only the N previous lines are needed to construct
a window. The illustration of the data flow of
the line buffer in figure 6 shows that the N − 1 previous
lines are stored inside the line buffer and sent
out individually. This means that the BW increases
by a factor of $\text{window}_{\text{size}}$. It is also indicated that
the buffer is stored circularly. This is handled by
the pointer, which can be seen in figure 6. This
pointer will increase each time a new input is received.
After receiving a whole line, the line buffers
will rotate, i.e. the first line will be moved to the
back, the second line will be pushed forward and the
pointer will be reset. This is done by multiplexing
logic in the implemented design.
Shift buffers. After the line buffers, the data
reaches the shift buffers. These buffers are used
for getting the N previous pixels from each line,
i.e. having all the pixels needed for a convolution
window, as shown in figure 6. The shift buffers
have another important function as well. They re-
play the window for the convolution if there are not
enough PEs to run all the dot products in the con-
volution operation at once. The shift buffers are
on-chip RAM-based shift buffers and consist of two
pointers. The write pointer is essentially controlled
by counting up whenever data is written and moving
it back to the start of the shift buffer when the end
has been reached. The read pointer, however, is
controlled by logic, which tells the shift buffer that
it needs the N previous samples. This will be han-
dled by the shift buffer, which also calculates its
new positions.
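The interplay of line buffers and shift buffers can be sketched as a small behavioural model in Python. It is a simplified illustration under stated assumptions (single channel, stride one, no zero padding, no replay), and the function `clb_windows` is hypothetical. The hardware rotates the lines with multiplexing logic, whereas this sketch copies values for clarity.

```python
from collections import deque

import numpy as np


def clb_windows(image, window):
    """Yield every window x window patch of a 2-D image from a pixel stream.

    (window - 1) stored lines plus the incoming line form a column of
    `window` pixels, and a shift buffer keeps the last `window` columns.
    """
    n_lines, line_len = image.shape
    line_buf = np.zeros((window - 1, line_len), dtype=image.dtype)
    for r in range(n_lines):
        shift_buf = deque(maxlen=window)          # restarts at the beginning of each line
        for c in range(line_len):                 # c plays the role of the line buffer pointer
            pixel = image[r, c]
            # The stored lines and the new pixel leave together: BW grows by `window`.
            column = np.append(line_buf[:, c], pixel)
            # Rotate the stored lines at this position and keep the new pixel.
            line_buf[:-1, c] = line_buf[1:, c]
            line_buf[-1, c] = pixel
            shift_buf.append(column)
            if r >= window - 1 and len(shift_buf) == window:
                yield np.stack(list(shift_buf), axis=1)


image = np.arange(25).reshape(5, 5)
windows = list(clb_windows(image, window=3))
```

For the 5 × 5 example with a window size of 3, the model emits nine windows, each equal to the corresponding 3 × 3 slice of the input.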
3.2.3. Processing element array
The heart of the CNNA is the PEA. Each PE
performs HW acceleration of a dot product with a
small range of activation functions, e.g. linear or
ReLU [38]. The PE operation can be written as
shown in equation 1, which is a dot product of the
two equal-length vectors $\vec{x}$ and $\vec{w}$.
$$\mathrm{PE}(\vec{x}, \vec{w}) = f\left(\sum_{i=0}^{N-1} x_i \cdot w_i\right) \tag{1}$$
Each PE receives data frames in pairs from the
weight buffer and the data buffer, i.e. one from
each. The acceleration of the PE is done by run-
ning the multiplications in parallel and totaling the
results afterward, as illustrated in figure 7. This
data is dotted together and followed by the activa-
tion function.
Figure 7 shows how the PE has two inputs: W,
the weight input, and X, the data input. When
the PE has received a frame on both W and X,
the data frames are dotted together and the bias
is added to the result. The result is forwarded to
the next part, which is the pipelined PE summer.
This part accumulates the results that it receives
from the PE dot product. It will keep on
accumulating until it receives the last flag. When
this happens, it will multiply the accumulated value
by a factor set by the actor, i.e. the control interface,
and apply the activation, which is also set by
the actor through the control interface. Lastly, the result
is streamed out through the port Y, and the accumulated
result is reset. The whole PE is synthesized
with an II of one, which means that new inputs can
be processed in each clock cycle. HLS tries to achieve
this with a summer tree or a cascade of
Multiply-Accumulates (MACs) processed in a long
pipeline.
Figure 6: An illustration of the flow of data through the CLB. It consists of two parts: a line buffer for storing $N_{\text{lines}}$ previous
lines and a shift buffer for storing $N_{\text{pixels}}$ previous pixels for each line. The leftmost image cube illustrates a single pixel 202,
which is written to the line buffer. The middle cube illustrates which data is saved in which line buffer and how the new line
replaces the first line. The rightmost cube illustrates what data is in the shift buffers. The missing part illustrates how much
more data is needed from the line buffers before it has a complete window in the shift buffer. The read pointer on the shift
buffers is used for getting the N previous samples and for generating the output from the shift buffers.
Figure 7: Illustration of the PE design. It consists of a parallel
multiplier array followed by a summing tree and lastly
an accumulator, scale and the activation function logic. The
PE dot product and summer are executed in a parallel and
pipelined manner.
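The PE behaviour described above can be summarized in a small software model. This is a sketch rather than the synthesized design: the class and method names are hypothetical, and it assumes that the bias is added once per output before scaling.

```python
import numpy as np


def relu(v):
    return max(v, 0.0)


class ProcessingElement:
    """Behavioural model of one PE: dot product, accumulate, scale, activate."""

    def __init__(self, scale=1.0, activation=relu):
        self.scale = scale            # factor set through the control interface
        self.activation = activation  # e.g. linear or ReLU
        self.acc = 0.0

    def push(self, x_frame, w_frame, last=False, bias=0.0):
        """Consume one X/W frame pair; return an output only on the last frame."""
        # In hardware the products are formed in parallel and summed in a tree.
        self.acc += float(np.dot(x_frame, w_frame))
        if not last:
            return None
        y = self.activation((self.acc + bias) * self.scale)  # assumed order: bias, scale, f
        self.acc = 0.0                                        # reset for the next output
        return y


pe = ProcessingElement(scale=0.5)
first = pe.push([1.0, 2.0], [3.0, 4.0])                      # partial sum, no output yet
y = pe.push([1.0, 1.0], [-1.0, 2.0], last=True, bias=1.0)    # last flag triggers the output
```

In this example the two frames accumulate to 12, the bias raises the sum to 13, and the scale factor of 0.5 followed by ReLU gives 6.5.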
3.2.4. Pooling
The pooling element is used to accelerate the
pooling operation. It gets its input from the data
buffer and sends the output to the output handling
part, thus bypassing the PEA, which is not used in
pooling.
Figure 8: Illustration of the logic of the pooling accelerator.
The data is received as a small slice (the purple cube), which
is the output of the CLB. It compares the input data with
the current pooling value. If it is the first value, it saves it.
Afterwards, the pooling logic is executed. If it is max-pooling, it checks
if the input is larger than the stored value, and stores the
largest one. When a whole window has been run through the
accelerator, it will start streaming out the calculated pixels.
These steps are repeated for all the windows created by the
CLB.
The reason for placing the pooling operator inside
the CNNA is to reuse the CLB hardware. The
pooling accelerator receives its input directly from
the CLB, and the output from the pooling goes di-
rectly to the output.
When looking at figure 8, it can be seen that
the pooling block consists of logic for handling the
pooling operation, e.g. max-pooling, and RAM for
buffering a single pixel. The pooling logic is con-
trolled by the actor and is used for setting the depth
of the current image and the size of the window, e.g.
2 × 2 × depth or 3 × 3 × depth. The last parameter
controls what type of pooling operator should be
run, i.e. max-, min- or average-pooling.
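A minimal sketch of the pooling logic, assuming that the window arrives as a sequence of depth-wide slices and that one pixel is buffered between slices. The function `pool_window` is hypothetical, and the way the average is formed (running sum divided once at the end) is an assumption.

```python
import numpy as np


def pool_window(slices, mode="max"):
    """Reduce the depth-wide slices of one window to a single output pixel."""
    if mode not in ("max", "min", "average"):
        raise ValueError(f"unknown pooling mode: {mode}")
    stored = None                      # models the RAM buffering a single pixel
    count = 0
    for s in slices:
        s = np.asarray(s, dtype=float)
        if stored is None:
            stored = s.copy()          # the first value is saved as it is
        elif mode == "max":
            stored = np.maximum(stored, s)
        elif mode == "min":
            stored = np.minimum(stored, s)
        else:
            stored = stored + s        # running sum, divided once at the end
        count += 1
    return stored / count if mode == "average" else stored


# A 2 x 2 window of an image with depth 2, given as four slices.
window = [[1, 8], [5, 2], [3, 3], [4, 7]]
max_pixel = pool_window(window, "max")
avg_pixel = pool_window(window, "average")
```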
3.2.5. Output handler
The output handling element plays a major role
in getting the output of the CNNA into the cor-
rect shape and alignment before streaming it out
through interface Y. It merges the results from the
PEA when the PEA is used and, if required, stitches
the data with XBUF, i.e. when a convolution operation
has been split into more than one convolution,
the old output is interleaved into
the new output. Splitting the convolution happens
when too many kernels need to be stored in the
weight buffer. It also handles the output of a pool-
ing operation, which simply means forwarding the
output of the pooling element.
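The idea of splitting a convolution over its kernels and stitching the partial outputs together can be illustrated in software. The sketch below is a simplified model under stated assumptions: it uses 1 × 1 kernels, and the names `conv_1x1`, `conv_1x1_split` and `max_kernels` are hypothetical. The exact interleaving layout used by the CNNA is not reproduced here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conv_1x1(pixels, kernels):
    """Reference: one output channel per kernel for every pixel."""
    return [[dot(k, px) for k in kernels] for px in pixels]


def conv_1x1_split(pixels, kernels, max_kernels):
    """Run the same operation in passes of at most max_kernels kernels.

    x_buf holds the output of the earlier passes; each new pass is stitched
    onto it pixel by pixel, so the channel order matches the unsplit result.
    """
    x_buf = [[] for _ in pixels]
    for start in range(0, len(kernels), max_kernels):
        part = conv_1x1(pixels, kernels[start:start + max_kernels])
        x_buf = [old + new for old, new in zip(x_buf, part)]
    return x_buf


pixels = [[1, 2], [3, 4], [5, 6]]                        # three pixels, depth 2
kernels = [[1, 0], [0, 1], [1, 1], [2, -1], [-1, 3]]     # five 1 x 1 kernels
assert conv_1x1_split(pixels, kernels, max_kernels=2) == conv_1x1(pixels, kernels)
```

With five kernels and room for two at a time, three passes are needed, and the stitched result is identical to the unsplit one.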
4. Training for fixed-point
To overcome the challenge of the CNNA us-
ing fixed-point values, an emulation of fixed-point
needs to be made in order for the CNN to be trained
and calculated correctly. This is mostly due to the
large dynamic range of the weights.
This emulation is shown in equation 2, where
$Q_{[I.F]}(x)$ is the fixed-point representation of $x$ in
the fixed-point format Q[I].[F] [39]. Here, $I$ is the
number of integer bits and $F$ is the number of fractional
bits. First, the number $x$ is scaled up by $2^{F}$
and then rounded off to resolve the limited resolution
of fixed-point numbers. This is followed by
what is essentially a saturation of the number to
the range of the fixed-point number, i.e. between
$-2^{I+F-1}$ and $2^{I+F-1}-1$. Lastly, the number is
scaled down by the same factor it was scaled up
by. This results in a value that can be interpreted
correctly by the CNNA.
$$ Q_{[I.F]}(x) = \max\left(-2^{I+F-1},\ \min\left(2^{I+F-1}-1,\ \operatorname{round}(x \cdot 2^{F})\right)\right) \cdot 2^{-F} \tag{2} $$
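Equation 2 can be emulated directly in software. The following Python sketch is one possible reading of it; the function name `quantize` is hypothetical, and the rounding mode (round half up) is an assumption, since the text only states that the value is rounded.

```python
import math


def quantize(x, int_bits, frac_bits):
    """Emulate the fixed-point format Q[I].[F] as described by equation 2."""
    scaled = math.floor(x * 2 ** frac_bits + 0.5)     # round half up (assumed)
    lowest = -(2 ** (int_bits + frac_bits - 1))
    highest = 2 ** (int_bits + frac_bits - 1) - 1
    saturated = max(lowest, min(highest, scaled))     # clamp to the word length
    return saturated * 2 ** -frac_bits                # scale back down


print(quantize(0.3, 2, 6))     # 0.296875  (19 / 64)
print(quantize(5.0, 2, 6))     # 1.984375  (saturated at 127 / 64)
print(quantize(-5.0, 2, 6))    # -2.0      (saturated at -128 / 64)
```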
4.1. Quantized weights
The weights are quantized as a constraint to the
optimizer, which executes the backpropagation [40].
This constraint is set to quantize all weights after
each update using equation 2. This results in the
Stochastic Gradient Descent (SGD) update formula
shown in equation 3, where $Q_{[I.F]}(x)$ is the
quantization function shown in equation 2,
$W^{(l,t=t-1)}_{ij}$ is the previous weight,
$W^{(l,t=t)}_{ij}$ is the new weight, and
$\alpha$ is the learning rate.
$$ W^{(l,t=t)}_{ij} = Q_{[I.F]}\left(W^{(l,t=t-1)}_{ij} - \alpha \nabla^{(l,t=t-1)} W_{ij}\right) \tag{3} $$
However, this introduces a problem that makes
the training freeze. The cause of the problem is that
the update to the weights is too small to move a
weight from one quantized value to another. The
effect of a too-small update can be seen in
the example shown in equation 4. It is not possible
to update a single weight in Q2.6 with a value
smaller than the smallest quantized value, in this
case $2^{-6} = 0.015625$. The example shows a weight
with value 1.671875 being updated by a too-small
value: 0.0015624. Updating the quantized weight
value does not result in a change, which causes the
training to freeze.
$$ W^{(l)}_{ij} = Q_{[2.6]}(1.671875 - 0.0015624) = 1.671875 \tag{4} $$
To solve this issue, an extra copy of the weights
W is saved so that the forward pass, i.e. inference,
is calculated using the quantized weights, and the
SGD is calculated using unquantized weights. This
means that the weights do not get stuck between
quantization steps. This is also known as lazy up-
date SGD [41]. In this way, the weights W are
saved and the quantized weights WQ are used for
the forward pass, which can be seen in equations 5
and 6.
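$$ W^{(l),t=\tau}_{ij} = \left(W^{(l),t=\tau-1}_{ij} - \alpha\left(\nabla^{(l),t=\tau-1} W\right)_{ij}\right) \tag{5} $$

$$ WQ^{(l),t=\tau}_{ij} = Q_{[I.F]}\left(W^{(l),t=\tau}_{ij}\right) \tag{6} $$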
By using these equations, the optimizer can train
the CNN even though the changes are too small to
be significant when quantized.
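The difference between the two update schemes can be reproduced with the numbers from equation 4. The Python sketch below is illustrative only: `quantize` is a hypothetical helper that follows equation 2 (with round half up as an assumed rounding mode), and the constant update `STEP` stands in for the product of learning rate and gradient.

```python
import math


def quantize(x, int_bits, frac_bits):
    """Fixed-point emulation of equation 2 (round half up is an assumption)."""
    scaled = math.floor(x * 2 ** frac_bits + 0.5)
    limit = 2 ** (int_bits + frac_bits - 1)
    return max(-limit, min(limit - 1, scaled)) * 2 ** -frac_bits


STEP = 0.0015624        # alpha * gradient, taken from the example in equation 4
w_direct = 1.671875     # weight that is re-quantized after every update (eq. 3)
w_float = 1.671875      # unquantized copy W (eq. 5)

for _ in range(10):
    w_direct = quantize(w_direct - STEP, 2, 6)
    w_float = w_float - STEP
w_quant = quantize(w_float, 2, 6)    # WQ used in the forward pass (eq. 6)

print(w_direct)   # 1.671875, the weight never moved
print(w_quant)    # 1.65625, one quantization step lower
```

After ten small updates the directly quantized weight is unchanged, while the unquantized copy has accumulated enough change to cross one quantization step.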
4.2. Dynamic range scaling
The small kernels in the first convolutional layers
of the CNN VGG16 have large weights, i.e. close
to 1 or −1, but the fully connected layers have very
small weights that only use the lowest bits, even
in Q2.14. This means that the CNN needs more
fractional bits. However, this is possible to solve
by dynamically scaling the weights and the output.
This is carried out with integers in [42].
The following will show how this can be carried
out on fixed-point values as well. It has been found
that the dynamic range of each kernel is almost the
same for each layer. This knowledge can be used
to add scaling to each layer in order to change the
dynamic range of the weights. For example, based
on the given weights

$$ W = \begin{bmatrix} 0.11 & 0.024 & -0.30 \\ -0.05 & 0.002 & 0.1 \end{bmatrix} $$

and a fixed-point format Q[I].[F], which, for simplicity,
is able to store a maximum value of 1, denoted
$Q^{MAX}_{[I.F]}$, a scaling can be found. To find the scaling
needed for a better dynamic range, equation 7 can
be used. This equation takes the maximum
absolute value of the weights and divides it
by the maximum value of the fixed-point format.
$$ scale^{(l)} = \frac{\max_{i}\left(\left|W^{(l)}_{i}\right|\right)}{Q^{MAX}_{[I.F]}} = |-0.30| = 0.30 \tag{7} $$
The scaled value of the weights can now be calculated
as shown in equation 8, which divides the
weights by $scale^{(l)}$. This shows that the maximal
absolute value is now 1 (the entry −1).
$$ W^{(l)}_{scale} = \frac{W^{(l)}}{scale^{(l)}} = \begin{bmatrix} 0.367 & 0.08 & -1 \\ -0.167 & 0.00667 & 0.333 \end{bmatrix} \tag{8} $$
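Equations 7 and 8 can be checked numerically with the example weights. This is a minimal Python sketch; the variable names are chosen for illustration, and `Q_MAX = 1.0` reflects the simplification made in the text.

```python
W = [[0.11, 0.024, -0.30],
     [-0.05, 0.002, 0.1]]
Q_MAX = 1.0   # simplification from the text: the format stores at most 1

scale = max(abs(w) for row in W for w in row) / Q_MAX     # equation 7
W_scale = [[w / scale for w in row] for row in W]         # equation 8

print(scale)   # 0.3
print([[round(w, 3) for w in row] for row in W_scale])
```

The entry with the largest magnitude maps to −1, so the scaled weights use the full range of the assumed format.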
Using this scale factor, the output of a layer is
calculated as shown in equation 9, which has an
added multiplication of the quantized value of the
scale factor, where $z^{(l)}_{scale}$ is the scaled output of
layer $l$, $W^{(l)}_{scale}$ are the scaled weights,
$\vec{a}^{l-1}$ is the output from the previous layer
and $scale^{(l)}$ is the scale factor of the layer $l$.
$$ z^{(l)}_{scale} = Q_{[I.F]}\left(W^{(l)}_{scale} \cdot \vec{a}^{l-1}\right) \cdot Q_{[I_{scale}].[F_{scale}]}\left(scale^{(l)}\right) \tag{9} $$
Because of the quantization, it cannot be guaranteed
that the outputs are the same, but they should
be very similar, i.e. $z^{(l)} \simeq z^{(l)}_{scale}$. The main
difference between the scaled and unscaled version is
that $z^{(l)}_{scale}$ is better suited for the bit range of the
fixed-point format than $z^{(l)}$.
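The claim that the scaled and unscaled outputs stay close can be checked with a small numeric experiment based on equations 7 to 9. The sketch below is illustrative: the input vector `a_prev` is made up, Q2.14 is assumed for both the data and the scale factor, and `quantize` is a hypothetical helper following equation 2 with round half up.

```python
import math


def quantize(x, int_bits, frac_bits):
    """Fixed-point emulation of equation 2 (round half up is an assumption)."""
    scaled = math.floor(x * 2 ** frac_bits + 0.5)
    limit = 2 ** (int_bits + frac_bits - 1)
    return max(-limit, min(limit - 1, scaled)) * 2 ** -frac_bits


W = [[0.11, 0.024, -0.30],
     [-0.05, 0.002, 0.1]]
a_prev = [0.5, 0.25, 0.125]     # hypothetical output of the previous layer

scale = max(abs(w) for row in W for w in row)          # equation 7 with Q_MAX = 1
W_scale = [[w / scale for w in row] for row in W]      # equation 8

# Unscaled floating-point reference z and scaled fixed-point output (equation 9).
z = [sum(w * a for w, a in zip(row, a_prev)) for row in W]
z_scale = [
    quantize(sum(w * a for w, a in zip(row, a_prev)), 2, 14) * quantize(scale, 2, 14)
    for row in W_scale
]

for exact, approx in zip(z, z_scale):
    print(f"{exact:+.6f}  {approx:+.6f}")
```

For this input the two outputs agree to within about $10^{-5}$, which is below the resolution of Q2.14 ($2^{-14} \approx 6.1 \cdot 10^{-5}$).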
5. Design space exploration
The template-based IP core written in SystemC
has a number of parameters that must be selected
in order to achieve an optimal solution. The executable
model of the IP core gives an approximate
estimate of the time performance and FPGA resource
usage. When optimizing the IP core for an
FPGA, it can be an enormous task to generate a
design and find the optimal design parameters. It
takes approximately one hour to synthesize the HLS
code to RTL code. If this was to be carried out for
Figure 9: Results of the C-simulation of the combined test
of all fixed-point candidates, showing the average resource
usage of BRAM and DSP versus latency. $PE_{BW\{\times\}}$ is
set to 128 for all solutions. The plotted text for a candidate is
in the format $[PE_{N}, DB^{(output)}_{BW\{\times\}}, kernels^{(R^{3\times3\times512+1})}_{N}]$.
The candidates are split into three groups by word length
($data^{(W)}_{size}$): 8-bit, 16-bit and 32-bit versions. It took
2 minutes to create the estimate for one candidate solution.
all possible combinations of parameters, it would
take weeks, months or even years, since the archi-
tecture has such a large number of parameters, e.g.
BW between modules, FIFO-depth, the number of
PEs, etc. The high level model of the IP in Sys-
temC can be simulated faster than the RTL code
by a factor of 50-200 times, depending on the size
of the accelerator. It is possible to use a heuristic
approach to find the optimal solutions for a cer-
tain fixed-point resolution constrained by the given
target device by evaluating several simulated solu-
tions.
The design parameters are used for tuning the
CNNA design in order to find a balance between
precision, speed and resources. The CNNA tuning
parameters used are as follows:
$data^{(W)}_{size}$. The word length of the fixed-point data
format in bits, i.e. I+F. Has an impact on precision.

$PE_{BW\{\times\}}$. The internal BW with an element size
of $data^{(W)}_{size}$ used by the CNNA.

$PE_{N}$. The number of PEs in the PEA, which is limited
by the size of the FPGA fabric.

$DB^{(output)}_{BW\{\times\}}$. The output BW multiplier after the
CLB. Normally this will be set to the same value
as $CLB^{(rows)}_{N}$, but it can be set to a lower number in
order to allow the PE to run with lower BW and
potentially have a bigger PEA. The internal BW
in the PE will be $DB^{(output)}_{BW\{\times\}} \cdot PE_{BW\{\times\}}$, with an
element size of $data^{(W)}_{size}$. The BW used inside the
weight buffer is also equal to $DB^{(output)}_{BW\{\times\}} \cdot PE_{BW\{\times\}}$.

$kernels^{(R^{3\times3\times512+1})}_{N}$. Used to calculate

$$ WB^{(buffer)}_{size} = \left(\frac{(3 \times 3 \times 512)}{DB^{(output)}_{BW\{\times\}} \cdot PE_{BW\{\times\}}} + bias_{size}\right) \cdot kernels^{(R^{3\times3\times512+1})}_{N} $$
Here (3 × 3 × 512) is chosen from the largest layer
in the CNN, which, in this case, is the VGG16 [2].
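The weight buffer expression can be evaluated for candidate parameter sets. The Python sketch below is a hedged illustration: the function name `weight_buffer_size` is hypothetical and `bias_size=1` is an assumed value, since the size of the bias term is not stated here.

```python
def weight_buffer_size(db_output_bw, pe_bw, kernels_n, bias_size=1,
                       largest_kernel=3 * 3 * 512):
    """Evaluate the weight buffer size expression; bias_size=1 is an assumption."""
    per_kernel = largest_kernel / (db_output_bw * pe_bw) + bias_size
    return per_kernel * kernels_n


# PE_BW = 128, DB_output_BW = 3, kernels_N = 32
print(weight_buffer_size(db_output_bw=3, pe_bw=128, kernels_n=32))   # 416.0
# PE_BW = 128, DB_output_BW = 1, kernels_N = 42
print(weight_buffer_size(db_output_bw=1, pe_bw=128, kernels_n=42))   # 1554.0
```

A wider internal BW reduces the number of buffer entries needed per kernel, which is why the two parameters interact when fitting the design to the available BRAM.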
The tuning of the CNNA can be expressed as
a vector, $\vec{\beta}$, with 5 hyper-parameters:

$$ \vec{\beta} = \left[ data^{(W)}_{size},\ PE_{BW\{\times\}},\ PE_{N},\ DB^{(output)}_{BW\{\times\}},\ kernels^{(R^{3\times3\times512+1})}_{N} \right] $$
To measure the performance of the different
CNNA configurations, a simulation was made. It
consisted of five different elements: two pooling op-
erations, two convolution operations and a single
fully connected operation. They were executed in-
dividually but evaluated together.
When looking at the latency for the combined
simulation test, i.e. the five simulations carried out
consecutively, the dominant candidates
all have $DB^{(output)}_{BW\{\times\}} = 1$ regardless of word
length (see figure 9). The figure shows that the
faster the accelerator, the higher the number of
PEs.
Two models were created of each configuration.
One was a C-simulation, i.e. a
simulation that used the SystemC HLS code directly.
The other was an RTL-simulation, which used
the RTL code generated from the SystemC model
and was only run for the best solutions. The latter was clock
cycle accurate and the execution time was precise.
Several candidates were identified and are shown in
greater detail in table 1. The table shows the number
of Digital Signal Processing slices (DSPs) and
BRAMs used, as well as the total latency for C- and
RTL-simulation. Candidates marked with a
"÷" used more resources than were available on the
tested target platform.
Table 1: Design space exploration of resource usage and latency of possible CNNA candidates using C- and RTL-simulation. A
”÷” means RTL-simulation performed, but insufficient space on target platform (Ultra96). A ”-” means RTL-simulation not
performed.
Parameters              resource   DSPs  DSPs   BRAMs  BRAMs  latency  latency
$\vec{\beta}$           average %        (RTL)         (RTL)  [ms]     [ms] (RTL)
[8, 128, 8, 3, 42]      70         384   359    144    249    4.60     6.66
[8, 128, 8, 1, 42]      34         128   125    137    185    4.95     7.28
[8, 128, 16, 3, 42]     124        768   -      152    -      4.26     -
[8, 128, 16, 1, 42]     52         256   245    139    193    4.03     6.66
[16, 128, 8, 3, 32]     54         192   360    233    377    7.47     8.40
[16, 128, 8, 1, 32]     35         64    -      227    -      7.61     -
[16, 128, 16, 3, 32]    81         384   ÷      239    ÷      6.98     ÷
[16, 128, 16, 1, 32]    44         128   293    229    349    6.27     8.52
[32, 128, 8, 3, 20]     94         384   -      355    -      13.19    -
[32, 128, 8, 1, 20]     58         128   165    351    336    12.92    14.53
[32, 128, 16, 3, 20]    148        768   -      359    -      12.42    -
[32, 128, 16, 1, 20]    76         256   325    353    408    10.73    12.81
The candidates with the lowest latency were synthesized
and tested using RTL-simulation, which simulates
the real HDL code generated. This also gives
a more precise resource usage, which only differs
slightly from the one estimated using C-simulation.
The execution time is also shown and is slightly
higher (approx. 2 ms) than the estimated value.
On average, the compilation and C-simulation
took 2 minutes for each solution. The HLS synthesis
and RTL-simulation took 1-7 hours.
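One simple way to shortlist candidates from the C-simulation results is a Pareto filter on resource usage and latency. The Python sketch below applies such a filter to the C-simulation columns of table 1. It is an illustrative selection rule and not necessarily the heuristic used in this work; the function names are hypothetical.

```python
# (beta, average resource usage in %, C-simulation latency in ms) from table 1
candidates = [
    ([8, 128, 8, 3, 42], 70, 4.60), ([8, 128, 8, 1, 42], 34, 4.95),
    ([8, 128, 16, 3, 42], 124, 4.26), ([8, 128, 16, 1, 42], 52, 4.03),
    ([16, 128, 8, 3, 32], 54, 7.47), ([16, 128, 8, 1, 32], 35, 7.61),
    ([16, 128, 16, 3, 32], 81, 6.98), ([16, 128, 16, 1, 32], 44, 6.27),
    ([32, 128, 8, 3, 20], 94, 13.19), ([32, 128, 8, 1, 20], 58, 12.92),
    ([32, 128, 16, 3, 20], 148, 12.42), ([32, 128, 16, 1, 20], 76, 10.73),
]


def dominates(a, b):
    """a dominates b if it is no worse in both metrics and better in one."""
    return a[1] <= b[1] and a[2] <= b[2] and (a[1] < b[1] or a[2] < b[2])


def pareto_front(group):
    """Keep the candidates that no other candidate in the group dominates."""
    return [c for c in group if not any(dominates(o, c) for o in group)]


for word_length in (8, 16, 32):
    group = [c for c in candidates if c[0][0] == word_length]
    for beta, usage, latency in pareto_front(group):
        print(word_length, beta, usage, latency)
```

Within each word length, the two candidates that survive the filter are the ones with an output BW multiplier of 1, which agrees with the observation made from figure 9.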
The optimal parameters were found using two
different fixed-point formats ($data^{(W)}_{size}$): Q2.14 and
Q2.6, i.e. a word length of 16 bits and 8 bits,
respectively. These were chosen because of the area
constraints of the FPGA on the Xilinx Ultra96
board [43]. However, 32 bits would have been
possible with a larger FPGA.
Finally, three different configurations of the
CNNA were chosen for the final test of the system,
one of which used 16-bit fixed-point format Q2.14,
while the two others used 8-bit fixed-point format
Q2.6.
6. Results and discussion
The dataset DETECT [44] was used to verify
the system. This dataset consisted of 29 classes
of micro-invertebrates suspended in alcohol. Only
the first five classes were used in the first test, while
the second test used all 29 classes. Cifar-100 [45]
and ImageNet [46] were used for comparison with
other common datasets and to validate the results.
Figure 10: Training with five classes. Gray: floating-point,
orange: fixed-point Q2.14 with auto-scaling, blue: fixed-
point Q2.14 without auto-scaling, red at the bottom: fixed-
point Q2.6 with and without auto-scale.
The training was carried out over the span of 100
epochs.
The CNN used was VGG16 [2]. The convolutional
blocks of this CNN are followed by two
dense fully connected layers with either 4096 or
1024 neurons. Its final fully connected layer has
either five or 29 neurons, depending on the number
of classes. The training was performed on
two fixed-point formats, Q2.14 and Q2.6, and
tested on three configurations, which will be denoted
$CNNA_{16}$, $CNNA^{1}_{8}$ and $CNNA^{2}_{8}$. $CNNA_{16}$
uses the tuning parameters $\vec{\beta}$ = [16, 128, 8, 3, 32],
$CNNA^{1}_{8}$ uses $\vec{\beta}$ = [8, 128, 16, 1, 42] and
$CNNA^{2}_{8}$ uses $\vec{\beta}$ = [8, 128, 8, 3, 42].
Figure 11: Training with 29 classes. Blue: floating-point,
red: fixed-point Q2.14 with auto-scale, light blue: fixed-
point Q2.14 without auto-scaling.
The accuracy, performance and power consump-
tion of the proposed system will be presented and
discussed in this section.
6.1. Accuracy
The CNN was trained using the small dataset in
order to find suitable candidates faster, since it is
faster to train for five classes than for 29 classes.
If the accuracy of a fixed-point format is poor on
five classes, it will likely be as poor, or worse, when
training on 29 classes. Therefore, initial training
was carried out on the small dataset.
Figure 10 shows that most of the trained models
faced issues and obtained low accuracy when us-
ing fixed-point format. The only quantized version
that obtained a high level of accuracy was the one
using fixed-point format Q2.14. It is unknown why
the training with fixed-point format Q2.14 and no
auto-scaling makes a sudden dive after 10 epochs.
However, it could be caused by the learning rate
being too high or too low, or by too few neurons in
the fully connected layers. The best results were
achieved with fixed-point format Q2.14 and auto-scaling,
which converges towards an accuracy of almost
100%. None of the fixed-point Q2.6 versions
could be trained to achieve any useful results.
Table 2 shows the results of the training with five
classes. The training that used Q2.14 with no
auto-scaling only performed well with 4096 neurons,
where it reached approximately 83%. The table shows the
number of neurons in the fully connected layers,
$N_{neurons}$, as well as the training and validation
accuracy in percent. Validation was performed on a dataset not
used for training.
Table 2: Training results for the training of VGG16 on five
classes.
Type auto-scale Nneurons validate train
float n/a 1024 97.5 100.0
Q2.6 yes 1024 20.3 20.9
Q2.14 no 1024 24.3 23.6
Q2.14 yes 1024 94.2 98.8
float n/a 4096 97.9 99.5
Q2.14 no 4096 83.2 83.6
Q2.14 yes 4096 91.7 97.6
A final test was performed on all 29 classes of
DETECT with the candidates that performed well
in the previous test. The best candidates were
the floating-point version for reference and the versions
that used fixed-point format Q2.14, both with
and without auto-scaling. As is evident from figure 11,
only the training that used fixed-point format
Q2.14 and auto-scaling achieved promising results.
It shows that it is much more difficult to
train the CNNA when using quantization, because
details are lost due to the limited range of the
fixed-point numbers. In addition, it takes many more
iterations for the training to reach the same accuracy
level as the floating-point format.
The first 29 classes from ImageNet and Cifar-100
were also used for training. The validation results
in table 3 show that, comparing the Q2.14 format
with floating-point, the accuracy drops by 3.5%
and 3.2%. For DETECT the drop is 4%, which
is higher compared to training with ImageNet and
Cifar-100.
Table 3: Results for the training of the 16-bit fixed-point
VGG16 on 29 classes from the datasets DETECT (DET),
ImageNet (Image) and Cifar-100 (Cifar).
Type data auto Nneurons val. train
float DET n/a 1024 88.0 100.0
Q2.14 DET no 1024 5.0 5.0
Q2.14 DET yes 1024 86.4 94.4
float DET n/a 4096 84.0 99.4
Q2.14 DET no 4096 5.0 5.1
Q2.14 DET yes 4096 86.5 92.9
float Image n/a 4096 83.0 99.2
Q2.14 Image yes 4096 79.5 88.1
float Cifar n/a 4096 80.5 99.5
Q2.14 Cifar yes 4096 77.3 89.3
6.2. Performance
The Xilinx Ultra96 board [43] was used to evaluate
the performance of the system using a HW
clock of 100 MHz and 172.22 MHz for the CNNA IP
core. The inference time was measured for the
configurations $CNNA_{16}$, $CNNA^{1}_{8}$ and
$CNNA^{2}_{8}$, and the inference times are shown in table
4. The timing performance was measured on the
Ultra96 board during inference of the quantized and
trained VGG16 model with five classes. The mean
time and variance are computed from 30 measurements.
The fastest model, $CNNA^{1}_{8}$, took 1.22 sec per image,
while the slowest, $CNNA_{16}$ at 100 MHz, took 2.20
sec per image.
Table 4: Average inference time and variance using VGG16
for five classes using four different IP cores.
                      $CNNA_{16}$   $CNNA_{16}$   $CNNA^{1}_{8}$   $CNNA^{2}_{8}$
                      100 MHz       172 MHz       172 MHz          172 MHz
avg [sec]             2.20          1.96          1.22             1.49
var [$\cdot 10^{-3}$] 0.25          0.30          0.20             0.11
The different layers have different execution
times, as shown in table 5. As expected, the execution
time in convolutional layers depended on the
number of bits in the fixed-point format. However,
pooling took approximately the same time for all
tested IPs, since pooling is independent of the
fixed-point format. The table shows that the IP $CNNA^{1}_{8}$
obtained the best performance due to the larger
number of PEs (16). Note that $CNNA^{2}_{8}$ was slightly
faster than $CNNA^{1}_{8}$ in the convolutional layers, even
with fewer PEs, due to the higher bandwidth of
the output multiplier. There is a large number of
splits (512) in the dense 1 and dense 2 layers, and
they consume more than half of the total execution
time for all three CNNA configurations. On average,
32% of the time is used to set up the DMAs from
PYNQ, which could be optimized with a scatter-gather
DMA. In such a solution, the DMA would
initiate the transfer for the next location of DRAM data
without involving the CPU. A larger FPGA with
more on-chip memory could also be a solution to
lower the number of splits and optimize the performance
further.
6.3. Power consumption
The power consumption of the design with
$CNNA_{16}$, $CNNA^{1}_{8}$ and $CNNA^{2}_{8}$ was measured on
the Ultra96 board during inference of the trained
Table 5: Time of execution of each VGG16 layer in [ms]
using four different IP cores.
layer       $CNNA_{16}$   $CNNA_{16}$   $CNNA^{1}_{8}$   $CNNA^{2}_{8}$
            100 MHz       172 MHz       172 MHz          172 MHz
l1 conv1    19.3          17.0          19.9             21.4
l1 conv2    111           84.1          61.3             60.1
l1 pool     18.1          13.7          12.9             17.1
l2 conv1    55.4          42.9          31.3             30.5
l2 conv2    108           81.3          60.2             56.3
l2 pool     8.98          6.87          6.36             8.40
l3 conv1    56.0          43.5          33.1             29.9
l3 conv2    112           84.2          64.2             59.1
l3 conv3    110           85.5          63.0             57.8
l3 pool     4.51          3.48          3.25             4.23
l4 conv1    64.6          51.5          37.9             35.3
l4 conv2    126           97.9          76.0             70.5
l4 conv3    123           102.0         73.5             67.8
l4 pool     2.32          1.83          1.71             2.19
l5 conv1    46.6          41.0          30.7             29.3
l5 conv2    49.1          39.7          29.7             28.1
l5 conv3    45.8          39.5          29.6             27.8
l5 pool     0.74          0.62          0.59             0.69
dense 1     767           737           364              509
dense 2     393           397           197              362
dense 3     1.55          1.62          1.18             1.50
VGG16 model with five classes. The measured voltage
of the power supply to the board was multiplied
by the measured current to compute the power
consumption. The mean and maximum power during
inference are calculated as a mean of 10 inferences.
The power consumption of the IP core is defined
as the difference between the power of the Ultra96 board
when idling and the power during inference. The idle power
consumption was measured at $P_{idle}$ = 3.055 W
over a five-minute period.
Table 6: Average and peak power consumption in watt of
the Ultra96 board and the IP core during inference.
              $CNNA_{16}$   $CNNA_{16}$   $CNNA^{1}_{8}$   $CNNA^{2}_{8}$
              100 MHz       172 MHz       172 MHz          172 MHz
$P_{avg}$     5.28          5.68          4.71             4.80
$P_{peak}$    6.60          7.14          5.76             6.35
$P_{IPavg}$   2.23          2.63          1.66             1.74
$P_{IPpeak}$  3.55          4.09          2.71             3.30
Table 6 shows that the mean power consump-
tion of the Ultra96 board for all tests was between
4.7−5.7 W out of which the IP core only consumes
approximately 2 W. This means that running the
Figure 12: Power consumption of the tested solutions
$CNNA^{1}_{8}$, $CNNA^{2}_{8}$ and $CNNA_{16}$ during inference at
172 MHz.
IP did not affect the average power consumption.
However, because they run for a shorter amount of
time, the fixed-point IPs with a low number of bits
used less energy per inference. The CNNA16 with
a 100 MHz clock was 0.24 sec slower but consumed
less power than the version with a 172 MHz clock.
Table 6 shows that the peak power consumption
was almost the same for all tested IPs in the range
from 2.7 W to 4.1 W.
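The energy per inference follows from the mean IP power and the inference time. The Python sketch below combines the averages reported in table 4 and table 6; the dictionary keys are shorthand labels chosen for illustration, and the calculation ignores the idle power of the board.

```python
# name: (average inference time [s], mean IP power P_IPavg [W]) from tables 4 and 6
cores = {
    "CNNA16 @ 100 MHz": (2.20, 2.23),
    "CNNA16 @ 172 MHz": (1.96, 2.63),
    "CNNA18 @ 172 MHz": (1.22, 1.66),   # first 8-bit configuration
    "CNNA28 @ 172 MHz": (1.49, 1.74),   # second 8-bit configuration
}

energy = {name: t * p for name, (t, p) in cores.items()}   # joule per inference
for name, joule in energy.items():
    print(f"{name}: {joule:.2f} J")
```

Under these numbers, both 8-bit configurations use roughly half the energy per inference of the 16-bit configurations, and the 16-bit core at 100 MHz uses slightly less energy than the same core at 172 MHz.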
Figure 12 shows that the power consumption is
largest in the beginning of the inference, i.e. in
the convolution blocks of the CNN. The power con-
sumption dropped during execution of the fully
connected layers. This indicates that most of the
FPGA logic was in action during convolution, while
less logic was used during computing of the fully
connected layers and pooling. Pooling activity cor-
responds to the big dips in power consumption in
the first half of the inference.
7. Comparison with state-of-the-art CNNs
We have chosen to evaluate our work against the
current state-of-the-art toolflows presented in [23],
which use a fixed-point resolution of 8 or 16 bits to
perform FPGA acceleration of the VGG16 network
by targeting the Xilinx Zynq and UltraScale
platforms. The purpose of this is to compare our work
with other tools that have mapped the same CNN
on similar FPGA devices from the same vendor.
Accuracy. Our fixed-point training method only
performed well for 16-bit quantization. DoReFa-Net [47]
proposes a method for training CNNs with
low-bit quantization. The method demonstrates
high accuracy using AlexNet with only 1-bit
weights. FINN-R [11] uses quantized neural
networks for low-bit quantization of different CNN
topologies with high accuracy. Angel-Eye [27] also
proposes a dynamic quantization strategy, where
the network is initially trained with a floating-point
format. The radix position of the fixed-point data is
chosen differently for each layer based on statistics,
and an optimal radix point is chosen. The network
is converted back to floating-point and fine tuned.
This method achieves a high level of accuracy for
both 16 and 8-bit formats with VGG16.
Performance. To compare the different solutions,
the performance needs to be expressed in Giga
Operations Per Second (GOPS). The performance
result is normalized relative to the number of Look Up
Tables (LUTs) and DSPs as a measure for available
resources on the target device. This performance
density measure is used to compare the VGG16
mapped to different FPGA devices. The through-
put performance is calculated as the number of Giga
Operations (GOP) performed by the CNN relative
to the inference time in seconds. In the case of the
VGG16 network, the total number is 30.76 GOP
out of which 30.7 GOP is performed in the con-
volutional (CONV) layers. We have presented the
performance for CONV layers and all layers of the
VGG16 model since some solutions do not acceler-
ate the FC layers.
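The total throughput can be reproduced from the operation count and the measured inference times. The Python sketch below uses the 30.76 GOP stated above and the average inference times from table 4; the dictionary keys are shorthand labels chosen for illustration.

```python
TOTAL_GOP = 30.76    # operations in all VGG16 layers, from the text
# Average inference time in seconds (table 4).
inference_time = {"CNNA16-100": 2.20, "CNNA16-172": 1.96, "CNNA18-172": 1.22}

throughput = {name: TOTAL_GOP / t for name, t in inference_time.items()}
for name, gops in throughput.items():
    print(f"{name}: {gops:.1f} GOPS")
# CNNA16-100: 14.0 GOPS
# CNNA16-172: 15.7 GOPS
# CNNA18-172: 25.2 GOPS
```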
The results are shown in table 7, which indi-
cates that the CNNA performance (Total) is lower
than the comparable state-of-the-art architectures.
The best performance of our 16-bit solution is
29.1 GOPS. This is lower than the 31.4 GOPS for
DnnWeaver, which has the worst performance of
the state-of-the-art solutions.
The Ultra96 target used in our evaluation is small
and low-cost compared to the ones used in some of
the examples, e.g. Zynq XC7Z045 and UltraScale
KU060. If a larger and more expensive target such
as the Xilinx ZCU104 evaluation kit [48] was used,
it would be possible to increase the number of PEs,
thereby achieving a higher throughput and
performance.
The performance density measure is also lower
than most of the other architectures and only similar
to DnnWeaver. Angel-Eye and Caffeine both
Table 7: Table comparing the CNNA with state-of-the-art CNN accelerators: DnnWeaver [26], fpgaConvNet [24], Angel-Eye [27]
and Caffeine [29]. All solutions target the Xilinx Zynq devices except for Caffeine, which uses the Kintex UltraScale FPGA.
The power efficiency, performance density and throughput performance are listed for the different solutions. The performance
density is only shown for the CONV layers.
Technique-MHz        Power       Power E.   Density      Density       Perfor.   Perfor.   Xilinx    Fix.
                     Efficiency  (Conv)     [GOPS/DSP]   [GOPS/kLUT]   (Conv)    (Total)   Device    [bits]
                     [GOPS/W]    [GOPS/W]                              [GOPS]    [GOPS]
DnnWeaver-150        n/a         n/a        0.143        0.59          31.4      n/a       XC7Z020   16
fpgaConvNet-125      n/a         7.27       0.221        0.91          48.5      12.7      XC7Z020   16
fpgaConvNet-125      n/a         n/a        0.173        0.71          156       n/a       XC7Z045   16
Angel-Eye-214        n/a         24.1       n/a          n/a           85.3      n/a       XC7Z020   8
Angel-Eye-150        14.2        n/a        0.209        0.86          188       137       XC7Z045   16
Caffeine-200         10.64       12.4       0.187        1.55          310       266       KU060     16
$CNNA_{16}$-100      6.28        11.86      0.078        0.62          26.4      14.0      ZU3EG     16
$CNNA_{16}$-172      5.99        11.08      0.081        0.52          29.1      15.7      ZU3EG     16
$CNNA^{1}_{8}$-172   15.22       22.94      0.155        0.66          38.0      25.2      ZU3EG     8
$CNNA^{2}_{8}$-172   11.83       22.53      0.110        0.61          39.5      20.7      ZU3EG     8
have a much higher performance density relative
to their usage of LUT and DSP resources on the FPGA.
Power efficiency. The power efficiency depends
on the efficiency of both data communication
and computation.
The SmartShuttle [49] solution optimizes
CNN off-chip memory access. Observing that over
80% of the energy is consumed by DRAM accesses, its
authors benchmark the data volume of DMA
requests during inference of the 13 CONV layers
in VGG16. As a benchmark, SmartShuttle measures
221.3 MB. Simulated with an on-chip buffer of
512 KB, however, they can lower the DRAM access
volume to 160 MB. Our $CNNA_{16}$ measures a data
volume of 211.7 MB transferred for the same feature
layers including pooling. However, we use more
on-chip memory for weight and data buffers than
SmartShuttle. The design of the CLB
in our CNNA ensures that weights are only transferred
once from DRAM, which is similar to what
SmartShuttle achieves with the weight reuse oriented
scheme (WRO) they propose. The last three
FC layers of the $CNNA_{16}$ transfer a volume of
273.8 MB, which is not considered by SmartShuttle
and accounts for most of the data communication.
The computation power efficiency is calculated
as the number of operations per second relative
to the mean power consumption of the CNNA,
which we measured earlier (GOPS/W). Compared
to many of the current state-of-the-art accelerators,
the CNN accelerator in this work performs quite
well in terms of power efficiency. When using 16-bit
fixed-point weights at 100 MHz, its total power
efficiency is 0.44x that of Angel-Eye and 0.59x
that of Caffeine. With nearly the same efficiency
of 12 GOPS/W, the power efficiency of the
CONV layers is considered comparable with Caffeine.
The performance bottleneck in our CNN accelerator
is the fully connected layers, where splits
are performed 512 times with a high DRAM access.
The fpgaConvNet on the Zynq XC7Z020 has
a worse efficiency of 7.3 GOPS/W compared to the
$CNNA_{16}$ with 11.9 GOPS/W. While Angel-Eye's
fixed-point 8-bit solution with 24.1 GOPS/W is the best of
all the compared state-of-the-art solutions in terms
of efficiency, the 8-bit CNNA with 23.0 GOPS/W
is a close second.
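The total power efficiency can be recomputed from the operation count, the inference time and the mean IP power. The Python sketch below uses values from tables 4, 6 and 7; the dictionary keys are shorthand labels chosen for illustration. Because the published inputs are rounded, the results agree with table 7 only to within a few hundredths.

```python
TOTAL_GOP = 30.76    # operations in all VGG16 layers
# name: (average inference time [s], mean IP power [W]) from tables 4 and 6
cores = {"CNNA16-100": (2.20, 2.23), "CNNA16-172": (1.96, 2.63),
         "CNNA18-172": (1.22, 1.66)}

efficiency = {n: TOTAL_GOP / t / p for n, (t, p) in cores.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f} GOPS/W")

# Ratio to the total power efficiency of the 16-bit reference designs (table 7).
ratio_angel_eye = efficiency["CNNA16-100"] / 14.2
ratio_caffeine = efficiency["CNNA16-100"] / 10.64
print(f"{ratio_angel_eye:.2f} {ratio_caffeine:.2f}")   # 0.44 0.59
```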
8. Conclusion
In this paper, an architecture for a SoC design
was presented. The presented architecture imple-
ments the different operations necessary for a deep
neural network to perform close to real-time in-
ference. The architecture was implemented using
Python and HLS for the IP core and was able to
run on the Ultra96 board using PYNQ. The inter-
face for the system is similar to Keras and should
be familiar to most engineers working in the field
of machine learning.
The CNNA is able to accelerate deep learning
algorithms that use any sequence of convolutional,
max-pooling and fully connected layers. The layer
operations can support many different parameters
and will be able to perform inferences using most
modern CNNs. The network weights can use any
8-, 16- or 32-bit fixed-point format when exported
from Keras with the weights auto-scaled correctly.
A training method was proposed which achieved
high inference accuracy, both when using
fixed-point and floating-point weights. The VGG16
architecture chosen for testing in this paper was
able to perform inference in 2.0 sec per image when
using the fixed-point format Q2.14 and 1.2 sec when
using fixed-point format Q2.6. The IP core alone
consumes a peak power of 4.1 W with a mean power
between 1.5 and 2.7 W and has a power efficiency
between 6.0 and 15.2 GOPS/W, depending on the
fixed-point format.
Compared to similar state-of-the-art solutions
for mapping the VGG16 network to Xilinx platforms,
our solution demonstrates a comparable energy
efficiency, especially for the convolutional layers.
In future work, the CNNA needs to be extended
with special layers to support deep neural
networks such as ResNet, DenseNet, InceptionNet
and GoogLeNet. The special layers with irregular
dataflow will be implemented in the SW controlling
part of the proposed architecture.
Acknowledgments
We would like to thank Freia Martensen for language
editing and proofreading the article.
References
[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet
classification with deep convolutional neural networks,
Communications of the ACM. doi:10.1145/3065386.
[2] K. Simonyan, A. Zisserman, Very deep convolutional
networks for large-scale image recognition, in: 3rd In-
ternational Conference on Learning Representations,
ICLR 2015 - Conference Track Proceedings, 2015.
arXiv:1409.1556.
[3] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You
only look once: Unified, real-time object detection, in:
Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, 2016.
arXiv:1506.02640, doi:10.1109/CVPR.2016.91.
[4] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards
Real-Time Object Detection with Region Proposal
Networks, IEEE Transactions on Pattern Analysis
and Machine Intelligence. arXiv:1506.01497,
doi:10.1109/TPAMI.2016.2577031.
[5] K. He, G. Gkioxari, P. Dollar, R. Girshick, Mask R-CNN,
IEEE Transactions on Pattern Analysis and Machine
Intelligence. doi:10.1109/TPAMI.2018.2844175.
[6] A. Shawahna, S. M. Sait, A. El-Maleh, FPGA-Based
accelerators of deep learning networks for learning and
classification: A review (2019). arXiv:1901.00121,
doi:10.1109/ACCESS.2018.2890150.
[7] S. Mittal, J. S. Vetter, A survey of methods for an-
alyzing and improving gpu energy efficiency (2014).
arXiv:1404.4629, doi:10.1145/2636342.
[8] S. Mittal, A survey of FPGA-based accelerators for
convolutional neural networks (2018). doi:10.1007/
s00521-018-3761-1.
[9] W. Ding, Z. Huang, Z. Huang, L. Tian, H. Wang,
S. Feng, Designing efficient accelerator of depthwise
separable convolutional neural network on FPGA, Journal
of Systems Architecture. doi:10.1016/j.sysarc.2018.12.008.
[10] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv,
Y. Bengio, Binarized neural networks, in: Advances in
Neural Information Processing Systems, 2016. arXiv:
1602.02505.
[11] M. Blott, T. B. Preußer, N. J. Fraser, G. Gambardella,
K. O’Brien, Y. Umuroglu, M. Leeser, K. Vissers, FINN-R:
An end-to-end deep-learning framework for fast
exploration of quantized neural networks, ACM Transactions
on Reconfigurable Technology and Systems.
arXiv:1809.04570, doi:10.1145/3242897.
[12] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra,
G. Venkatesh, D. Marr, Accelerating binarized neu-
ral networks: Comparison of FPGA, CPU, GPU, and
ASIC, in: Proceedings of the 2016 International Con-
ference on Field-Programmable Technology, FPT 2016,
2017. doi:10.1109/FPT.2016.7929192.
[13] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H.
Lin, M. Srivastava, R. Gupta, Z. Zhang, Acceler-
ating binarized convolutional neural networks with
software-programmable fpgas, in: Proceedings of the
2017 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, FPGA ’17, ACM, New
York, NY, USA, 2017, pp. 15–24. doi:10.1145/
3020078.3021741.
URL http://doi.acm.org/10.1145/3020078.3021741
[14] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort,
A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi,
J. Anderson, K. Bertels, A Survey and Evaluation of
FPGA High-Level Synthesis Tools, IEEE Transactions
on Computer-Aided Design of Integrated Circuits and
Systems. doi:10.1109/TCAD.2015.2513673.
[15] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, L. Wang,
A high performance FPGA-based accelerator for large-
scale convolutional neural networks, in: FPL 2016 -
26th International Conference on Field-Programmable
Logic and Applications, 2016.
doi:10.1109/FPL.2016.7577308.
[16] A. Krizhevsky, I. Sutskever, G. E. Hinton, 2012
AlexNet, Advances in Neural Information Processing
Systems. arXiv:1102.0183,
doi:10.1016/j.protcy.2014.09.007.
[17] F. Chollet, et al., Keras, https://keras.io (2015).
[18] L. Stornaiuolo, M. Santambrogio, D. Sciuto, On how
to efficiently implement deep learning algorithms on
PYNQ Platform, in: Proceedings of IEEE Computer
Society Annual Symposium on VLSI, ISVLSI, 2018.
doi:10.1109/ISVLSI.2018.00112.
[19] Accellera Systems Initiative, IEEE Standard for Standard
SystemC Language Reference Manual, IEEE Std 1666-2011
(Revision of IEEE Std 1666-2005).
[20] K. Ovtcharov, O. Ruwase, J.-y. Kim, J. Fowers,
K. Strauss, E. S. Chung, Accelerating Deep Convo-
lutional Neural Networks Using Specialized Hardware,
Microsoft Research Whitepaper.
[21] D. Gschwend, ZynqNet: An FPGA-accelerated embedded
convolutional neural network.
URL https://github.com/dgschwend/zynqnet
[22] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott,
P. Leong, M. Jahre, K. Vissers, FINN: A frame-
work for fast, scalable binarized neural network
inference, in: FPGA 2017 - Proceedings of the
2017 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, 2017. arXiv:1612.07119,
doi:10.1145/3020078.3021744.
[23] S. I. Venieris, A. Kouris, C. S. Bouganis, Toolflows for
mapping convolutional neural networks on FPGAs: A
survey and future directions (2018). arXiv:1803.05900,
doi:10.1145/3186332.
[24] S. I. Venieris, C. S. Bouganis, FpgaConvNet: A Frame-
work for Mapping Convolutional Neural Networks on
FPGAs, in: Proceedings - 24th IEEE International
Symposium on Field-Programmable Custom Computing
Machines, FCCM 2016, 2016.
doi:10.1109/FCCM.2016.22.
[25] S. I. Venieris, C. S. Bouganis, FpgaConvNet: Mapping
Regular and Irregular Convolutional Neural Networks
on FPGAs, IEEE Transactions on Neural Networks and
Learning Systems. doi:10.1109/TNNLS.2018.2844093.
[26] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim,
C. Shao, A. Mishra, H. Esmaeilzadeh, From high-level
deep neural models to FPGAs, in: Proceedings of the
Annual International Symposium on Microarchitecture,
MICRO, 2016. doi:10.1109/MICRO.2016.7783720.
[27] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han,
Y. Wang, H. Yang, Angel-Eye: A complete design flow
for mapping CNN onto embedded FPGA, IEEE Transactions
on Computer-Aided Design of Integrated Circuits and
Systems. doi:10.1109/TCAD.2017.2705069.
[28] Y. Wang, J. Xu, Y. Han, H. Li, X. Li, DeepBurn-
ing: Automatic generation of FPGA-based learning
accelerators for the neural network family, in:
Proceedings - Design Automation Conference, 2016.
doi:10.1145/2897937.2898003.
[29] C. Zhang, Z. Fang, P. Zhou, P. Pan, J. Cong, Caffeine:
Towards uniformed representation and acceleration for
deep convolutional neural networks, in: IEEE/ACM
International Conference on Computer-Aided Design,
Digest of Technical Papers, ICCAD, 2016.
doi:10.1145/2966986.2967011.
[30] Y. Y. Huang, W. Y. Wang, Deep residual learning
for weakly-supervised relation extraction, in: EMNLP
2017 - Conference on Empirical Methods in Natural
Language Processing, Proceedings, 2017.
arXiv:1707.08866, doi:10.18653/v1/d17-1191.
[31] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Wein-
berger, Densely connected convolutional networks, in:
Proceedings - 30th IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2017, 2017.
arXiv:1608.06993, doi:10.1109/CVPR.2017.243.
[32] G. Zeng, Y. He, Z. Yu, X. Yang, R. Yang, L. Zhang,
InceptionNet/GoogLeNet - Going Deeper with Convolutions,
CVPR. arXiv:1409.4842, doi:10.1002/jctb.4820.
[33] Xilinx, PYNQ: Python productivity for Zynq.
URL http://www.pynq.io/
[34] Xilinx, UG902 - Vivado Design Suite User Guide -
High-Level Synthesis, 2019 Edition (July 2019).
[35] Accellera Systems Initiative, SystemC Synthesizable
Subsets, 1st Edition (January 2015).
[36] L. H. Crockett, D. Northcote, C. Ramsay, F. D. Robin-
son, R. W. Stewart, Exploring Zynq® MPSoC With
PYNQ and Machine Learning Applications, Strath-
clyde Academic Media, 2019.
[37] Xilinx, UG761 - AXI Reference Guide, v13.1 Edition
(March 2011).
[38] B. Xu, R. Huang, M. Li, Revise saturated activation
functions, CoRR abs/1602.05980. arXiv:1602.05980.
URL http://arxiv.org/abs/1602.05980
[39] E. Oberstar, Fixed-Point Representation & Fractional
Math Revision 1.2 (August 2007).
doi:10.13140/RG.2.1.3602.8242.
[40] B. J. Wythoff, Backpropagation neural networks:
A tutorial, Chemometrics and Intelligent Laboratory
Systems 18 (2) (1993) 115–155.
doi:10.1016/0169-7439(93)80052-J.
URL http://www.sciencedirect.com/science/article/pii/016974399380052J
[41] H. Park, J. H. Lee, Y. Oh, S. Ha, S. Lee, Train-
ing deep neural network in limited precision, CoRR
abs/1810.05486. arXiv:1810.05486.
URL http://arxiv.org/abs/1810.05486
[42] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang,
A. Howard, H. Adam, D. Kalenichenko, Quantization
and training of neural networks for efficient integer-
arithmetic-only inference, in: The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
2018.
[43] 96 Boards, Ultra96-v2 developer board.
URL https://www.96boards.org/product/ultra96/
[44] Detect [online] (June 2017) [cited 7/5-2018].
[45] A. Krizhevsky, V. Nair, G. Hinton, CIFAR-10 and
CIFAR-100 datasets (2009).
[46] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bern-
stein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale
Visual Recognition Challenge, International Journal of
Computer Vision (IJCV) 115 (3) (2015) 211–252.
doi:10.1007/s11263-015-0816-y.
[47] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, Y. Zou,
DoReFa-Net: Training low bitwidth convolutional neural
networks with low bitwidth gradients (2018).
arXiv:1606.06160v3, doi:10.1145/1449956.1450053.
[48] Xilinx, Zynq UltraScale+ MPSoC ZCU104 evaluation kit.
URL https://www.xilinx.com/products/boards-and-kits/zcu104.html
[49] J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu,
X. Li, SmartShuttle: Optimizing off-chip memory ac-
cesses for deep learning accelerators, in: Proceedings
of the 2018 Design, Automation and Test in Europe
Conference and Exhibition, DATE 2018, 2018.
doi:10.23919/DATE.2018.8342033.
