Analysis of Cyclone V in computer vision applications by Frank, Justin
© 2020 Justin Frank
provided by Illinois Digital Environment for Access to Learning and Scholarship Repository




Submitted in partial fulfillment of the requirements
for the degree of Bachelor of Science in Electrical and Computer Engineering
in the Undergraduate College of the





ABSTRACT
As embedded computing becomes more common in computer vision applications,
FPGAs have become a common solution to accelerate inference. Within
Intel’s line of FPGAs there is readily available documentation of the high
cost and high power Arria and Stratix lines, but much less has been
published on the performance of the lower cost and lower power Cyclone
series devices.
Data was collected for two popular frameworks: the semi-closed source
OpenVINO and the open source PipeCNN project. For OpenVINO, inference
time and power consumption were measured for an array of popular models
across multiple CPU frequencies, multiple FPGA bitstreams, and multiple
execution modes. For PipeCNN a design space exploration was carried out to
obtain optimal performance and power numbers for a set of popular supported
networks.
For OpenVINO it was found that for most models heterogeneous inference
outperformed CPU only inference. Further, it was found that heterogeneous
inference in general uses comparable power to CPU only inference.
For PipeCNN it was found that performance had no strong tie to maximum
utilization of any one resource on the FPGA.
Overall these results show a compelling case for the use of Cyclone series
FPGAs in embedded computing applications that require fast computer vision
inference in relatively low cost and low power form factors.
Subject Keywords: Computer Vision; FPGA; Cyclone V; OpenVINO; PipeCNN
To my parents, for their love and support.
ACKNOWLEDGMENTS
Thank you to the UIUC ECE department for providing funding to purchase
the developer board required for this thesis.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND
2.1 Intel DLA
2.2 OpenVINO
2.3 OpenVINO Starter Kit
2.4 PipeCNN
CHAPTER 3 METHODOLOGY
CHAPTER 4 RESULTS
4.1 OpenVINO
4.2 PipeCNN
CHAPTER 5 CONCLUSION
REFERENCES
APPENDIX A APPENDIX
LIST OF TABLES
4.1 Table of OpenVINO and PipeCNN Results
A.1 OpenVINO Benchmarking
A.2 PipeCNN Benchmarking
A.3 OpenVINO Wall Power
LIST OF FIGURES
2.1 DLA Architecture Overview ©2018 IEEE [14]
2.2 DLA VLIW System ©2018 IEEE [14]
2.3 Group Slicing Example ©2018 IEEE [14]
2.4 PipeCNN Architecture Overview ©2017 IEEE [23]
2.5 PipeCNN Convolution Unit ©2017 IEEE [23]
2.6 PipeCNN memRD/memWR Dimensions ©2017 IEEE [23]
2.7 PipeCNN Windowing Scheme ©2017 IEEE [23]
4.1 Illustration of the Effect of Bitstream on OpenVINO Performance
4.2 Illustration of the Effect of CPU Frequency on OpenVINO Performance
4.3 Illustration of the Effect of CPU Core Usage on OpenVINO Performance
4.4 OpenVINO Performance Across Network Topologies
4.5 OpenVINO Power Usage
4.6 Illustration of the Effect of Extra Padding Operations on Runtime
4.7 The Relation of Resource Utilization and Runtime




CHAPTER 1 INTRODUCTION
Interest in computer vision has increased rapidly in recent years [1].
Computer vision based systems have become critical tools in many domains.
These domains range from assessing the quality of fruits and vegetables via
detecting visual defects on the surface of the produce [2], to allowing robots to
navigate using images of their environments [3], to automatically separating
healthy liver tissue from tumors in medical imaging data [4].
Recently deep learning methods have become a key approach to solving
computer vision problems [5]. This is evidenced by the fact that in both 2015
and 2017 deep neural networks took first place in ILSVRC classification tasks
[6], [7], [8].
As deep learning models grow more complex to chase greater accuracy,
the computational power required to perform inference is increasing [9]. A
great deal of emphasis has been placed on developing deep learning solutions
to perform inference in a data center setting. In this setting the primary
constraint is computational power per unit cost [10], which is closely tied to
computational power per watt. In this setting GPUs have been a popular
solution for their wide availability and ready support in popular software
tools. Recently FPGA and ASIC accelerators have been finding a place
in the data center as they offer compelling performance compared to GPUs
[11]. The Google Tensor Processing Unit (TPU) is used extensively internally
by Google to accelerate inference as well as training. In addition to using
the TPUs internally, Google allows consumers to buy compute time on the
devices. TPU V3, the most current hardware revision at the time of writing
this thesis, is a system consisting of 4 ASIC chips. Each chip consists of
2 TPU cores. Each core is built around 4 128x128 matrix units operating
with bfloat16 precision. Each core has, in addition to the matrix units,
dedicated scalar and vector arithmetic units and access to 16GB of HBM [12].
Google achieved a throughput score of 261,587.00 on the MLPerf ResNet-50
v1.5 offline inference benchmark¹ using 8 TPU V3s. By comparison, NVIDIA²
achieved a throughput score of 44,977.80 with a Supermicro
4029GP-TRT-OTO-28 8xT4 system with 8 T4 GPUs on the same benchmark. The
TPU system thus delivers an increase in throughput of over 480% relative to
the NVIDIA GPU system [13].
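As a quick check of the comparison above, the relative increase can be computed directly from the two throughput scores quoted in the text. This is only a worked arithmetic sketch; the variable names are illustrative.

```python
# Throughput scores quoted above for the MLPerf ResNet-50 v1.5 offline benchmark.
tpu_v3_throughput = 261_587.00  # system with 8 TPU V3s
t4_throughput = 44_977.80       # system with 8 T4 GPUs

# Ratio of the two scores, and the same ratio expressed as a percent increase.
speedup = tpu_v3_throughput / t4_throughput
percent_increase = (speedup - 1.0) * 100.0

print(f"speedup: {speedup:.2f}x, increase: {percent_increase:.1f}%")
```

The ratio is a little under 6, which corresponds to the "over 480%" increase stated in the text.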
Moving from ASIC to FPGA devices, Intel’s distribution of OpenVINO
officially supports accelerating inference on Arria 10 GX based accelerator
cards. These devices utilize the Intel DLA IP that will be discussed later
in this thesis. OpenVINO further allows inference to be executed heteroge-
neously utilizing both the host CPU and the DLA IP on the FPGA. This
allows for easy implementation of custom layers that may not be supported
by the DLA IP. Additionally with the correct licensing the DLA IP allows
custom operations to be implemented with relative ease [14]. Previous work
has shown an up to 2.52 times increase in throughput when performing inference
on a HEP-CNN model with heterogeneous execution using an Arria 10 PAC
and a Skylake CPU compared to inference using only the CPU [15]. Similar
to Intel’s solution Xilinx’s xfDNN software stack coupled with their xDNN
IP core allows easy deployment of deep neural networks to Xilinx Alveo ac-
celerator cards. Like OpenVINO, xfDNN supports heterogeneous execution
for networks that may include custom layers unsupported by the xDNN IP.
The xDNN IP is built around a systolic processing array for accelerating
convolution, along with a set of modules to implement common activation
functions, and hardware modules to handle element-wise addition and pool-
ing operations. There is also hardware based support for image tiling to
better handle large image sizes. Execution pathways for different commands
are parallelized in the xDNN IP to allow operations to run in parallel when
possible. xDNN bitstreams can be optimized to run in a maximum through-
put mode or in a minimum latency mode depending on the application. An
Alveo U250 accelerator card with a throughput optimized xDNN IP core was
able to achieve over 370% greater throughput than an NVIDIA V100 GPU
on a GoogleNet V1 inference task [16].
¹ MLPerf v0.5 Inference Closed ResNet-v1.5 offline. Retrieved from
www.mlperf.org 1 November 2020, entry 0.5-18. MLPerf name and logo are
trademarks. See www.mlperf.org for more information.
² MLPerf v0.5 Inference Closed ResNet-v1.5 offline. Retrieved from
www.mlperf.org 1 November 2020, entry 0.5-24. MLPerf name and logo are
trademarks. See www.mlperf.org for more information.
In addition to more traditional datacenter-centric applications, embedded
applications have spurred recent demand for computer vision capabilities [17].
In particular deep learning based computer vision solutions have found use
in applications where offloading inference to datacenters is infeasible, such
as autonomous underwater vehicles using visual input for obstacle avoidance
[18]. In embedded applications constraints such as fixed power budgets and
hardware cost requirements may outweigh the desirability of pure compu-
tations per W and make hardware solutions developed for the datacenter
unsuitable or undesirable. Multiple lower power and lower cost inference
devices currently exist to fill this demand. In addition to supporting Arria
10 devices Intel’s distribution of OpenVINO also supports Movidius Myriad
X ASIC devices. Myriad X combines a custom hardware based deep neu-
ral network accelerator with 16 SHAVE vector cores and an array of vision
processing modules all on one chip [19]. Intel claims performance figures
of 4 TOPS (trillion operations per second). Similarly, Google has released
an edge-focused ASIC dubbed the Edge TPU. Google claims up to 4 TOPS
peak performance and 2 TOPS per W. The device draws 2 W of power when
operating [20]. Google’s released benchmarks show that their Coral
development board, featuring an Edge TPU and an NXP i.MX 8M SoC, produced
lower inference times than a Xeon Gold 6154 CPU on all the deep neural
network topologies tested. In the case of Resnet-152 v2 the Coral
development board achieved over 1000 percent lower times per inference than
the Xeon CPU [21]. Moving back to FPGAs, in addition to the previously
discussed datacenter-centric Alveo cards, Vitis AI also allows deploying
models to Zynq devices. Xilinx provides a DPU bitstream for Zynq SoCs for
accelerating deep learning inference in embedded devices. The DPU, much like
the DLA and xDNN, implements an array of common deep learning opera-
tions in hardware. A Resnet-50 model deployed to a Xilinx ZCU102 with the
DPU IP core managed to achieve 163.4 FPS throughput [22]. Additionally
an open source solution, PipeCNN [23], exists for the complete line of Intel
FPGAs. PipeCNN also has limited support for Xilinx devices. The archi-
tecture of PipeCNN will be discussed later in this thesis. PipeCNN achieved
an execution time of 140 ms for an Alexnet model on a Cyclone-V chip and
15 ms for the same model running on a Stratix-V chip.
This thesis proposes using OpenVINO to deploy deep learning models to
Cyclone V chips as a possible alternative to the embedded solutions presented
above. Currently no Cyclone V devices are officially supported by Intel’s
distribution of OpenVINO, and the Terasic OpenVINO Starter Kit is the only
commercially available Cyclone V product with a bitstream that supports
OpenVINO. Currently very little has been published about the performance
or power usage of the Cyclone V devices as an OpenVINO target. This thesis
hopes to fill that gap in knowledge by presenting performance and power
usage figures for heterogeneous inference with the Terasic OSK accelerator






CHAPTER 2 BACKGROUND
2.1 Intel DLA
The Intel DLA (deep learning accelerator) is an IP core designed for
accelerating deep neural network inference. It is designed to be general
enough to support a large number of neural network topologies but also
customizable enough to achieve reasonable performance for any given network.
Of particular note is that a single bitstream can support running inference
on multiple networks, in contrast to systems like PipeCNN that require
resynthesis for every network. The DLA IP supports multiple floating
point number representations, but the Terasic bitstreams used in this thesis
use the FP11 representation.
The DLA is primarily based around four components: a PE array, an
activation module, an Xbar module, and a series of memory modules. The
PE array is a systolic array of individual processing elements (PEs) designed
to compute dot products, which allows the array to compute matrix
operations such as convolution. Vectorization is used to support parallel
computations. In particular, C_VEC controls the depth of the input vector,
K_VEC controls the depth of the output vector, and Q_VEC and P_VEC
control the width and height of the images in the input vector, as seen in
figure 2.1. Furthermore, there appears to be a hardware module following the
PE array that handles a set of predetermined activation functions, but this is
not elaborated upon in the original DLA paper beyond a box in figure 2.1.
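To make the role of the vectorization parameters more concrete, the following is a minimal software sketch of how a set of dot products might be tiled by input depth and output depth. It is an illustration only and not the DLA implementation: the function name and the step counter are hypothetical, the image width and height vectorization is ignored, and the two innermost loops stand in for work that the hardware would perform in parallel.

```python
def tiled_dot_products(features, filters, c_vec, k_vec):
    """Compute out[k] = sum_c features[c] * filters[k][c] in tiles.

    Each (k0, c0) tile covers up to k_vec output channels and up to c_vec
    input channels, which is the work assumed to happen in one parallel step.
    Returns the outputs and the number of tiles (steps) processed.
    """
    num_c = len(features)
    num_k = len(filters)
    out = [0.0] * num_k
    steps = 0
    for k0 in range(0, num_k, k_vec):        # tile over output depth
        for c0 in range(0, num_c, c_vec):    # tile over input depth
            steps += 1
            for k in range(k0, min(k0 + k_vec, num_k)):
                for c in range(c0, min(c0 + c_vec, num_c)):
                    out[k] += features[c] * filters[k][c]
    return out, steps


outputs, steps = tiled_dot_products(
    [1, 2, 3, 4],
    [[1, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 2]],
    c_vec=2,
    k_vec=2,
)
```

Under these assumptions, larger tile sizes reduce the number of steps for the same result, which is the intuition behind a wider bitstream performing better.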
Following the activation module is the Xbar, an interface for adding custom
functions to the DLA. The Xbar is generated automatically when the
bitstream is synthesized and appears to be mainly a collection of
multiplexers. The Xbar is sensitive to the order in which functions attached
to it are meant to be executed and can be configured to conserve logic if
specific functions are guaranteed to always be executed in a specific order
relative to each other. Additionally the Xbar supports the use of width
adapters to control the bandwidth required by each function module. This
allows for logic to be conserved by decreasing bandwidth for less commonly
used functions.
The final major hardware components comprising the DLA are the filter
caches and the stream buffer. Each PE in the PE array contains a filter cache
for storing the corresponding filter values for operations. The filter caches
are double buffered so that as the filter stored in a buffer is being used for a
computation a new filter can be loaded in from external memory. The stream
buffer is used for storing intermediate results during computation.
Figure 2.1: DLA Architecture Overview ©2018 IEEE [14]
A VLIW system, as shown in figure 2.2, is used to deliver instructions to
the above components. Before discussing the VLIW system the notion of a
subgraph must be defined. A subgraph is “a list of functions that can be
implemented on the DLA without writing to a buffer, except at the very end of
subgraph execution”¹ [14]. For each subgraph to be executed correctly by the
DLA the parameters of each module must be adjusted to match the current
operations. The VLIW system allows the DLA to be configured in such
a manner. As illustrated in the figure, the VLIW system consists of a VLIW
reader kernel and a series of transport kernels. These kernels are arranged in
a cascade with the VLIW reader at the top of the cascade followed by a
transport kernel for each runtime-reconfigurable module in the DLA. The VLIW
reader kernel’s job is to read instructions in from off-chip memory. These
instructions are VLIW (Very Long Instruction Word) instructions. In essence
each VLIW instruction consists of a series of subinstructions (represented in
different colors in the figure), one for each module, concatenated together.
Each of these subinstructions contains a header denoting which module
it belongs to (shown as the darker colored square in each subinstruction
in the figure). As the VLIW instruction moves through the cascade each
transport kernel strips the subinstruction for its corresponding module off
of the VLIW instruction and forwards the subinstruction, minus the header,
to the transport kernel’s corresponding module. The transport kernel then
passes the remainder of the VLIW instruction to the next transport kernel
in the cascade. This system is not only simple but also easily expanded. If a
new module is added via the Xbar there is no need to redesign the entire
instruction pathway. A transport kernel for the new module can simply
be cascaded after the last transport kernel in the existing cascade.
¹ ©2018 IEEE
Figure 2.2: DLA VLIW System ©2018 IEEE [14]
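The cascade described above can be sketched in software as follows. This is a simplified model written for illustration, not the DLA's OpenCL kernels: the function name, the module names, and the payload strings are all hypothetical, and a VLIW instruction is modeled as a plain list of (header, payload) pairs.

```python
def run_cascade(vliw_word, module_order):
    """Pass a VLIW word through one transport kernel per module.

    vliw_word: list of (header, payload) subinstructions.
    module_order: module names in cascade order.
    Returns ({module: payload}, subinstructions left after the last kernel).
    """
    delivered = {}
    remaining = list(vliw_word)
    for module in module_order:              # one transport kernel per module
        passed_on = []
        for header, payload in remaining:
            if header == module:
                delivered[module] = payload  # header stripped, payload forwarded
            else:
                passed_on.append((header, payload))
        remaining = passed_on                # remainder goes to the next kernel
    return delivered, remaining


word = [("pe_array", "cfg_a"), ("activation", "relu"), ("pool", "2x2")]
delivered, leftover = run_cascade(word, ["pe_array", "activation", "pool"])
```

Extending the cascade for a new module only requires appending its name to `module_order`, which mirrors the expandability argument made above.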
2.1.2 Graph Compiler
Next the graph compiler that accompanies the DLA will be discussed. The
graph compiler is responsible for generating the VLIW instructions that
reconfigure the DLA at runtime. The graph compiler works in three main
passes: a “slicing” pass, a “scheduling” pass, and an “allocation” pass.
During the “slicing” pass the graph compiler breaks the neural network
into subgraphs. Ideally a single layer would never span more than
one subgraph; however, as the sizes of feature tensors increase, feature
tensors and filter tensors may no longer fit into the stream buffer or the
filter caches. If this is the case the feature or filter tensors must be
sliced and the single layer distributed across multiple subgraphs. Features
and filters can be sliced across width, height, or depth, but each of these
slicing paradigms usually incurs the cost of extra computations, and it is the
job of the slicing pass of the graph compiler to minimize these excess
operations. Additionally the slicing pass implements a strategy referred
to as group slicing. After a tensor is sliced, rather than computing a single
convolution for all slices of the tensor before moving on to the next
convolution, multiple sequential convolution operations are, when possible,
performed on a single slice before the other slices are processed. This
procedure, illustrated in figure 2.3, minimizes the accesses to external
memory needed to store intermediate results and retrieve new slices.
Figure 2.3: Group Slicing Example ©2018 IEEE [14]
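The difference between the two processing orders can be illustrated with a small sketch. The functions below are hypothetical and only enumerate the order in which (layer, slice) pairs would be processed; the number of times the working slice changes is used as a rough proxy for external memory round trips, which is an assumption made for illustration. Overlap between neighboring slices is ignored.

```python
def layer_major_order(num_layers, num_slices):
    """Finish each layer on every slice before starting the next layer."""
    return [(layer, s) for layer in range(num_layers) for s in range(num_slices)]


def group_sliced_order(num_layers, num_slices):
    """Run all sequential layers on one slice before moving to the next slice."""
    return [(layer, s) for s in range(num_slices) for layer in range(num_layers)]


def count_slice_switches(order):
    """Count how often consecutive steps work on a different slice."""
    return sum(1 for a, b in zip(order, order[1:]) if a[1] != b[1])


plain = count_slice_switches(layer_major_order(3, 4))
grouped = count_slice_switches(group_sliced_order(3, 4))
```

For three layers and four slices, the layer-by-layer order changes slice at every step, while the group sliced order changes slice only when a slice has passed through all three layers.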
The “scheduling” pass determines the order in which subgraphs are to be
executed with the intention of minimizing buffer spillover to external memory.
For many traditional networks there is no choice in scheduling as every layer
depends upon the results of all the previous layers, but some newer networks
have branching structures that allow for flexibility in the order layers are
executed. In these cases the ordering of layers can have a significant effect on
the number of buffer spillovers that occur. To avoid overly long compile times
the scheduling pass does not perform an exhaustive search to determine the
optimal schedule to minimize spillover but rather resorts to using a priority
queue to determine the schedule.
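A priority-queue based schedule of this kind can be sketched as below. The source does not state which priority metric the graph compiler uses, so the `priority` values here are purely hypothetical, as are the function and node names. The sketch only shows the general mechanism: among the subgraphs whose dependencies are satisfied, the one with the best priority is scheduled next.

```python
import heapq


def schedule(deps, priority):
    """Order nodes so that each one runs after all of its dependencies.

    deps: {node: set of nodes it depends on}
    priority: {node: number}; lower values are scheduled first among ready nodes.
    """
    remaining = {n: set(d) for n, d in deps.items()}
    ready = [(priority[n], n) for n, d in remaining.items() if not d]
    heapq.heapify(ready)
    order = []
    while ready:
        _, node = heapq.heappop(ready)
        order.append(node)
        for other, d in remaining.items():
            if node in d:
                d.remove(node)
                if not d:  # all dependencies satisfied, node becomes ready
                    heapq.heappush(ready, (priority[other], other))
    return order


# A small branching network: two parallel branches that are later merged.
deps = {"in": set(), "a": {"in"}, "b": {"in"}, "add": {"a", "b"}}
priority = {"in": 0, "a": 2, "b": 1, "add": 0}
order = schedule(deps, priority)
```

This greedy choice avoids searching every valid ordering, at the cost of possibly missing the schedule with the fewest spillovers.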
The “allocation” pass handles memory operations—in particular, how data
is moved in and out of the stream buffer for a given subgraph and reducing
fragmentation inside the stream buffer. In general the stream buffer is simply
used with input tensors being filled into the buffer starting from one side and
output tensors being stored from the opposing side. Additionally when data
overflows from on-chip memory the allocation pass is responsible for assigning
addresses for this data in external memory.
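The two-sided use of the stream buffer can be sketched as a simple address calculation. The function below is a hypothetical illustration, not the allocation pass itself: it assumes one input tensor and one output tensor measured in the same units as the buffer, and it reports a spill whenever the two would overlap.

```python
def allocate_stream_buffer(buffer_size, input_size, output_size):
    """Place inputs from the low end and outputs from the high end of the buffer.

    Returns (input_range, output_range) as half-open (start, end) pairs, or
    None when the tensors do not both fit and data would spill off chip.
    """
    if input_size + output_size > buffer_size:
        return None
    input_range = (0, input_size)
    output_range = (buffer_size - output_size, buffer_size)
    return input_range, output_range


fits = allocate_stream_buffer(100, 30, 50)
spills = allocate_stream_buffer(100, 60, 50)
```

Filling from opposite ends leaves any unused space as one contiguous region in the middle of the buffer, which is one way to limit fragmentation.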
2.2 OpenVINO
OpenVINO is a general framework for deploying deep neural networks.
OpenVINO works by automatically converting models trained in popular
frameworks such as Caffe and PyTorch, assuming the network is composed of
supported layers, into an intermediate representation (IR) model. This IR
model can then be used to perform inference on a variety of supported
devices such as x86-64 CPUs, GPUs, Myriad VPUs, and supported FPGA
devices programmed with an Intel DLA bitstream. In particular OpenVINO
allows running inference in either heterogeneous or homogeneous mode. In
homogeneous mode inference is run entirely on a single device, whereas in
heterogeneous mode different portions of a model are executed on different
devices. Of particular note for this thesis, when using an FPGA this allows
models with functions not supported by the FPGA bitstream to be run
heterogeneously, using the CPU to compute the layers unsupported by the
bitstream.
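The fallback behavior of heterogeneous mode can be sketched without any OpenVINO code as a simple assignment of layers to devices. The function, layer names, and layer types below are hypothetical and only illustrate the idea; in OpenVINO itself this behavior is requested through the heterogeneous plugin, for example with a device string such as HETERO:FPGA,CPU.

```python
def assign_devices(layers, fpga_supported):
    """Map each layer to "FPGA" when its type is supported, otherwise to "CPU".

    layers: list of (layer_name, layer_type) pairs in execution order.
    fpga_supported: set of layer types the bitstream is assumed to implement.
    """
    return [
        (name, "FPGA" if kind in fpga_supported else "CPU")
        for name, kind in layers
    ]


layers = [
    ("conv1", "Convolution"),
    ("custom", "MyCustomOp"),
    ("fc", "FullyConnected"),
]
assignment = assign_devices(layers, {"Convolution", "FullyConnected"})
```

Each switch between devices in such an assignment implies a data transfer between the host and the accelerator, so networks with many unsupported layers may benefit less from the FPGA.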
2.3 OpenVINO Starter Kit
The Terasic OpenVINO Starter Kit (OSK), now referred to as the Starter
Platform for OpenVINO Toolkit, is used as the reference Cyclone V device
for this thesis. Specifically the device on the board is a Cyclone V GT
5CGTFD9D5F27C7N with 301,000 logic elements (LEs). Additionally on the board
are 64MB of SDRAM and 1GB of DDR3 SDRAM. The board supports PCIe Gen2 x4.
Terasic provides two OpenVINO compatible bitstreams for the OSK board:
an 8x16 clamp architecture and an 8x8 norm architecture.
The 8x8 bitstream has a K_VEC size of 8 and a C_VEC size of 8. The
bitstream supports FP11 representations. A stream buffer depth of 10,000
bits is supported by the bitstream. Additionally the bitstream includes a
pool and norm module on the Xbar. A hardware PReLU activation is also
included in the bitstream.
The 8x16 bitstream has a K_VEC size of 16 and a C_VEC size of 8. Like
the 8x8 bitstream this bitstream also supports FP11 representation and has
a stream buffer depth of 10,000 bits. The bitstream includes a pooling and




2.4 PipeCNN
PipeCNN is an open-source, OpenCL-based neural network accelerator for
FPGAs. In contrast to the Intel DLA, the PipeCNN bitstream must be
resynthesized for every unique network. PipeCNN is primarily built around a
cascade of four kernels: a memory reader module (memRD), a convolution
module, a pooling module, and a memory writer module (memWR), cascaded
in that order. This structure can be seen in figure 2.4. The convolution
kernel can accelerate both convolution layers and fully connected layers.
The overall convolution kernel consists of multiple parallel convolution
units, with each convolution unit consisting of a multiply-adder tree with a
delayed accumulator buffer, as seen in figure 2.5.
Figure 2.4: PipeCNN Architecture Overview ©2017 IEEE [23]
Figure 2.5: PipeCNN Convolution Unit ©2017 IEEE [23]
The two data mover kernels, memRD and memWR, can function in two
different modes depending on the layer that is being computed, either
convolution or fully connected (FC). As their respective names suggest, memRD
reads in input tensors from memory and memWR writes the result of the
computation back to memory. When functioning in convolution mode, memRD
fetches $[(\frac{W-K}{S}+1) \ast K, (\frac{W-K}{S}+1) \ast K, C' \ast M]$
data points to pass to the convolution kernel, and memWR writes an output of
dimensions $[\frac{W-K}{S}+1, \frac{W-K}{S}+1, M]$, where W is the
width and height of the square input image, K is the width and height of the
convolution kernel, S is the stride associated with the convolution operation,
C corresponds to the depth of the input tensor, and M corresponds to the
depth of the output tensor. This can be seen in figure 2.6. The input tensor
is further partitioned into $[K, K, C']$ work groups that are computed upon
in parallel. When the data mover kernels are functioning in FC mode the
weight and input tensors are both simple 1D vectors; however, to maximize
resource usage these 1D vectors are grouped together in batches to form 3D
tensors.
Figure 2.6: PipeCNN memRD/memWR Dimensions ©2017 IEEE [23]
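Using the symbols defined above (W, K, S, and M), the size of the output tensor written by memWR can be worked through with a short sketch. It assumes the standard output size formula for an unpadded convolution, (W − K)/S + 1, with integer division; the function names are illustrative and the example numbers are those commonly quoted for the first Alexnet convolution layer.

```python
def conv_output_size(w, k, s):
    """Output width/height for a square W x W input, K x K filter, stride S.

    Assumes no padding, so the filter only visits positions fully inside
    the input.
    """
    return (w - k) // s + 1


def output_dims(w, k, s, m):
    """Output tensor dimensions [W_out, W_out, M] for M output channels."""
    out = conv_output_size(w, k, s)
    return [out, out, m]


# Example: 227 x 227 input, 11 x 11 filter, stride 4, 96 output channels.
dims = output_dims(227, 11, 4, 96)
```

With these example values each spatial dimension of the output is (227 − 11)/4 + 1 = 55.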
In addition to the above, some strategies are adopted to increase
performance. First, multiple convolution units are instantiated in parallel,
and the number of convolution units is controlled by CU_NUM. Additionally,
inputs are vectorized into [K, K, VEC_SIZE] tensors, with each tensor being
processed by an independent convolution unit. Finally, data is cached in
on-chip memory to reduce external memory accesses. This is done by caching
FT_NUM times the area of the convolution filter worth of data in on-chip
memory to be quickly accessed for successive convolution operations, as seen
in figure 2.7.
Figure 2.7: PipeCNN Windowing Scheme ©2017 IEEE [23]
Additionally PipeCNN exposes three parameters for fitting designs to an
FPGA. LANE_NUM controls the number of convolution units that are
instantiated on the chip. VEC_SIZE controls the length of the input vector
for each of the convolution units. Finally CONV_GP_SIZE_X controls the




CHAPTER 3 METHODOLOGY
For the experiments described below the Terasic OSK board described in the
background section was used. The OSK board was connected to a 30 cm
PCIe Gen 3 x16 riser cable (LINKUP PCIEXT22SR-040) that was in turn
connected to a PCIe Gen 3 x8 slot on the host system’s motherboard. The
OSK board was powered via the 12V PCIe power lanes, and the external
power supply was left unconnected. The host PC was equipped with an AMD
Opteron X3421 APU with 4 cores that can run at 1.40, 1.80, and 2.10 GHz,
and with 8GB of 2400 MHz ECC memory. The host PC was
running a custom Ubuntu 16.04 image provided by Terasic with OpenVINO
2019.1.094 installed.
Additionally, in order to make rough measurements of the power being used by
the OSK board, a power measurement system similar to that described by
NLeSC¹ was used. The system consisted of an Allegro ACS712 20A chip [24]
spliced inline with the 12V PCIe power lane cables in the riser cable. The
output of the ACS712 was then sent to an Arduino Leonardo, where power
and energy values were computed and reported to the host computer.
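The conversion from raw sensor readings to power and energy can be sketched as follows. This is not the firmware used in this thesis; it is an illustration that assumes a 10-bit ADC with a 5 V reference, a zero-current sensor output of 2.5 V, the typical 100 mV/A sensitivity of the 20 A variant of the ACS712, and a fixed 12 V rail. All function and parameter names are hypothetical.

```python
def adc_to_amps(adc_count, vref=5.0, adc_max=1023,
                zero_offset_v=2.5, sensitivity_v_per_a=0.100):
    """Convert a raw ADC reading of the sensor output to a current in amperes.

    All default values are assumptions (see the text above), not measurements.
    """
    volts = adc_count * vref / adc_max
    return (volts - zero_offset_v) / sensitivity_v_per_a


def power_and_energy(adc_counts, sample_period_s, rail_voltage=12.0):
    """Average power (W) and total energy (J) for evenly spaced samples."""
    powers = [adc_to_amps(c) * rail_voltage for c in adc_counts]
    avg_power = sum(powers) / len(powers)
    energy = sum(p * sample_period_s for p in powers)
    return avg_power, energy


# Three hypothetical readings taken 10 ms apart.
avg_w, energy_j = power_and_energy([560, 565, 570], sample_period_s=0.010)
```

Because the rail voltage is assumed constant rather than measured, figures produced this way should be treated as rough estimates, in line with the description above.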
To assess the suitability of the OSK for common computer vision tasks,
benchmark values were recorded for a variety of networks running
heterogeneously across the OSK and host CPU via OpenVINO. The OpenVINO C++
benchmark tool² was used to obtain benchmarking numbers. The tool was run in
two different modes, synchronous and asynchronous. In
synchronous mode a single request is generated and inference is performed
before creating another request. In asynchronous mode multiple requests are
generated at a time and then processed asynchronously. The exact number of
requests is automatically determined by the benchmarking tool. Synchronous
mode is designed to benchmark performance in latency sensitive applications
and asynchronous mode is designed to benchmark performance in throughput
sensitive applications. Each of the selected networks was benchmarked with
3000 inference requests in both synchronous and asynchronous mode. Each
model was benchmarked with inference being performed heterogeneously and
CPU only. Additionally models were benchmarked across all three available
CPU frequencies. Benchmarks were also run using one, two, three, and four
CPU cores. Since none of the networks used execute in a different manner
depending upon input data, a single set of images was used to test all the
networks. Additionally the power sensor system was used to collect rough
measurements of the average power and total energy used by the OSK during
the benchmarking task. The following networks were selected for
benchmarking based on their popularity as well as the specific features
described below.
¹ NLeSC PowerSensor: https://github.com/NLeSC/PowerSensor
² https://docs.openvinotoolkit.org/latest/openvino_inference_engine_samples_benchmark_app_README.html
Resnet was selected for its widespread use³, and in particular Resnet-18 and
Resnet-152 [8] were selected to typify two extremes of layer depth. Resnet-50
was also benchmarked to provide a comparison to Resnet-50 performance
using PipeCNN.
Facenet⁴ and Insightface⁵ were selected as common face recognition networks
and were used to evaluate the performance of the OSK in face recognition
applications [25].
Shufflenet-V2⁶ and Efficientnet-B0⁷ were chosen for their computational
efficiency and for the fact that they were specifically designed for resource
limited applications [26], [27].
Unet⁸ was chosen as a contrast to Shufflenet-V2 and Efficientnet-B0, as it
is known to be a complex model [4].
Additionally Alexnet and VGG-16⁹ were chosen to provide a comparison
between the PipeCNN benchmarking demo and OpenVINO.
³ The Resnet model used was from torchvision:
https://pytorch.org/docs/stable/torchvision/index.html
⁴ The Facenet model used is the CASIA model from:
https://github.com/davidsandberg/facenet
⁵ The model used is: LResNet100E-IR,ArcFace@ms1m-refine-v2
⁶ The Shufflenet-V2 model used is from:
https://pytorch.org/hub/pytorch_vision_shufflenet_v2
⁷ The Efficientnet model used is from:
https://github.com/lukemelas/EfficientNet-PyTorch
⁸ The U-Net model used is from torch hub:
https://pytorch.org/hub/mateuszbuda_brain-segmentation-pytorch_unet
⁹ The Resnet-50, Alexnet, and VGG-16 models used were from torchvision
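The relationship between the figures reported in the synchronous and asynchronous benchmarking modes described earlier in this chapter can be illustrated with a small sketch. The function name is hypothetical and the timings are invented for illustration; they are not measurements from this thesis.

```python
def summarize(request_latencies_s, wall_time_s):
    """Return (average latency in ms, throughput in inferences per second)."""
    n = len(request_latencies_s)
    avg_latency_ms = 1000.0 * sum(request_latencies_s) / n
    throughput = n / wall_time_s
    return avg_latency_ms, throughput


# Synchronous: requests run back to back, so wall time is the sum of latencies.
sync_latencies = [0.020] * 4
sync_latency_ms, sync_fps = summarize(sync_latencies, sum(sync_latencies))

# Asynchronous: four requests in flight at once. Each request is assumed to
# take longer, but the requests overlap, so the wall time is shorter.
async_latencies = [0.050] * 4
async_latency_ms, async_fps = summarize(async_latencies, wall_time_s=0.050)
```

In this invented example the asynchronous run has the higher per-request latency but also the higher throughput, which is why the two modes are reported separately.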
Furthermore “at the wall” energy usage measurements were collected. A
power sensor (P3 International P4460) was used to measure the power draw
at the outlet of the host system. Benchmarking tasks for each network were
run for five hours at a time while collecting power draw data from the wall
outlet. This gives some reasonable comparison to a baseline measurement of
how much power is used by both the CPU and FPGA during inference for
the different networks.
To find the optimal performance of PipeCNN on the Cyclone V hardware
a design space exploration was performed. In particular the exploration
focused on optimizing three relevant parameters for the Alexnet, Resnet-50,
and VGG-16 networks implemented in the benchmarking demo provided by the
PipeCNN project. The three parameters of interest are LANE_NUM, the total
number of parallel convolution units; VEC_SIZE, the number of integers
grouped into a vector; and CONV_GP_SIZE_X, which controls the manner
in which data is loaded from host memory. Each of these parameters was
incremented from its base value and a new bitstream was generated until
designs no longer fit on the target device. Each of these bitstreams was then
loaded onto the FPGA device and the PipeCNN benchmarking demo was run to
collect runtime data. Additionally the average power draw of the FPGA board and





CHAPTER 4 RESULTS
4.1 OpenVINO
The OpenVINO testing yielded promising results. Running in heterogeneous
mode, OpenVINO generally showed superior results to both OpenVINO
executing in CPU only mode and PipeCNN.
Testing revealed that for the Terasic OSK board the supplied 8x16 DLA
bitstream yields higher performance than the 8x8 DLA bitstream across the
tested models. This is illustrated in figure 4.1 where it can be seen across all
models that the average latency is lower and average throughput is higher
for the 8x16 bitstream than for the 8x8 bitstream. As a result only the 8x16
bitstream shall be considered for the analysis below.
Figure 4.1: Illustration of the Effect of Bitstream on OpenVINO
Performance
Performance of heterogeneous execution appears to be largely independent
of CPU frequency. Figure 4.2 illustrates this. Plots A and B show that aver-
age latency is heavily influenced by CPU frequency when executing in CPU
only mode (A), whereas CPU frequency has a negligible effect on latency
when running in heterogeneous mode (B). Similarly it can be seen that av-
erage throughput is heavily influenced by CPU frequency when executing in
CPU only mode (C) but is negligibly influenced when running in heteroge-
neous mode (D). In both the case of latency and throughput these relations,
or lack thereof, can be observed for all three model topologies tested.
Figure 4.2: Illustration of the Effect of CPU Frequency on OpenVINO
Performance
It can additionally be seen in figure 4.3 that CPU core count has a negligi-
ble effect on latency across the tested topologies. This further supports that
heterogeneous execution tends to perform well with a lower cost/performance
host CPU.
Figure 4.3: Illustration of the Effect of CPU Core Usage on OpenVINO
Performance
Further, it can be seen in figure 4.4 that among the full set of networks
tested, including the extended networks, the majority of networks show
greater performance, in both throughput and latency, when inference is
performed heterogeneously than when performed with just the CPU. The
exceptions to this trend are Shufflenet V2 and Efficientnet B0.
Figure 4.4: OpenVINO Performance Across Network Topologies
Additionally, it was found that energy usage was comparable between
performing inference in heterogeneous mode and in CPU only mode. Figure 4.5
shows this for the tested networks.
Figure 4.5: OpenVINO Power Usage
4.2 PipeCNN
The PipeCNN bitstreams perform worse than the OpenVINO bitstreams, as
shown in table 4.1. However, the PipeCNN bitstreams do offer an interesting
insight into the relationship between resource usage and compute
performance. Unfortunately, time did not allow for an exhaustive parameter
exploration, but the collected results are analyzed in this section.
Table 4.1: Table of OpenVINO and PipeCNN Results
Of note, certain combinations of bitstream parameters result in additional
padding operations being required. If the number of input channels for a
certain layer is not a multiple of LANE_NUM, then additional all-zero channels
must be added. In the case of VGG-16 and Resnet this padding appears to
greatly increase runtime, as seen in figure 4.6, whereas Alexnet shows a less
dramatic increase from the extra required padding. It is theorized that this
difference in effect is due to the significantly greater depth of VGG-16 and
Resnet compared to Alexnet. The extra depth increases the number of required
padding operations, so the padding adds more latency than it does in a
shallower network. As a result, these bitstreams are excluded from further
analysis, as they are a poor representation of the maximum performance that
can be achieved by PipeCNN on the hardware.
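The padding rule described above can be illustrated with a short sketch. It assumes, as stated in the text, that a layer's input channel count is rounded up to the next multiple of LANE_NUM by appending all-zero channels. The function names and the example channel counts are hypothetical and are not taken from the PipeCNN source.

```python
def padded_channels(num_channels: int, lane_num: int) -> int:
    """Round num_channels up to the next multiple of lane_num."""
    if num_channels <= 0 or lane_num <= 0:
        raise ValueError("channel count and lane_num must be positive")
    # Ceiling division without floating point.
    groups = -(-num_channels // lane_num)
    return groups * lane_num


def zero_channels_added(num_channels: int, lane_num: int) -> int:
    """Number of all-zero channels that must be appended to the layer."""
    return padded_channels(num_channels, lane_num) - num_channels


if __name__ == "__main__":
    # Hypothetical layer widths, chosen only to show both cases.
    for channels in (64, 96, 100):
        print(channels, zero_channels_added(channels, lane_num=16))
```

A layer whose width is already a multiple of LANE_NUM needs no extra channels, which is consistent with only certain parameter combinations triggering the additional padding.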
Figure 4.6: Illustration of the Effect of Extra Padding Operations on
Runtime
Figure 4.7 shows the runtime for the generated PipeCNN bitstreams against
the utilization of various resources. In plots A, B, C, and E a general trend
can be seen: runtime for the VGG-16 bitstreams, and to a lesser extent the
Alexnet bitstreams, decreases as the utilization metric of interest
increases. This is largely expected, as increased resource usage should
naturally enable greater computational power, all else being equal.
However, plot D, runtime versus DSP utilization, does not follow the trend
of the other plots: no trend can be observed between increased DSP
utilization and runtime.
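One way to make the presence or absence of a trend more concrete is to compute a correlation coefficient between a utilization metric and runtime. The sketch below is an illustration only: it was not part of the analysis in this thesis, the function name is hypothetical, and the sample values are placeholders rather than measurements from table A.2.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values, NOT measured data: utilization (%) and runtime (ms).
    utilization = [20.0, 35.0, 50.0, 65.0]
    runtime_ms = [400.0, 310.0, 250.0, 180.0]
    print(f"r = {pearson_r(utilization, runtime_ms):.2f}")
```

Under this measure, a value near -1 would correspond to the decreasing trend described for plots A, B, C, and E, while a value near 0 would correspond to the absence of a trend described for plot D.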
Figure 4.7: The Relation of Resource Utilization and Runtime
The relationships between runtime and the parameters across which the
design space exploration was conducted were also analyzed, as seen in figure
4.8. In plot A it can be seen that runtime, for VGG-16 and to a lesser extent
Alexnet, tends to decrease as LANE_NUM increases. LANE_NUM corresponds to
the number of parallel convolution units in the design. Since convolution is
one of the most expensive operations in a CNN in terms of computation, it
stands to reason that increasing the number of convolution units working in
parallel would decrease runtime. In plot B runtime can be seen to decrease
slightly as VEC_SIZE increases for VGG-16, but this relationship is not very
strong. In plot C no relationship can be observed between CONV_GP_SIZE_X and
runtime.




The general results indicate that the Cyclone V chip as an OpenVINO target
makes a compelling solution for low cost computer vision applications.
OpenVINO on the Cyclone V chip performs better than a comparable open source
accelerator framework, PipeCNN, on the same device. Additionally, and perhaps
more importantly, OpenVINO inference utilizing both the CPU and the Cyclone
V device performs significantly better than inference utilizing only a
desktop class CPU. Further, it has been shown that inference utilizing the
Cyclone V uses power comparable to that of utilizing only the CPU.
There is certainly also room for further exploration. First, it would
be beneficial to evaluate the performance of OpenVINO running in
heterogeneous mode with the Cyclone V along with a lower cost and lower power
processor like a Celeron or Atom. Additionally, it was not possible to obtain
a source code license for the Intel DLA for this thesis. In the future it would
be prudent to explore the ability of custom Xbar modules to affect the
performance of the different networks investigated or to allow performing
inference entirely on the FPGA without use of the CPU. Finally, a more
reliable power measurement system would likely yield cleaner data.
REFERENCES
[1] “Web of science.” [Online]. Available: http://wcs.webofknowledge.
com/RA/analyze.do?product=WOS\&SID=8E22oRzqlnqIEcsNEpl\
&field=PY\ PublicationYear\ PublicationYear\ en\&yearSort=true
[2] B. Zhang, W. Huang, J. Li, C. Zhao, S. Fan, J. Wu, and C. Liu, “Prin-
ciples, developments and applications of computer vision for external
quality inspection of fruits and vegetables: A review,” Food Research
International, vol. 62, pp. 326–343, 2014.
[3] L. B. Marinho, P. P. Reboucas Filho, J. S. Almeida, J. W. M. Souza,
A. H. S. Junior, and V. H. C. de Albuquerque, “A novel mobile robot
localization approach based on classification with rejection option using
computer vision,” Computers & Electrical Engineering, vol. 68, pp. 26–
43, 2018.
[4] X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P.-A. Heng, “H-
DenseUNet: hybrid densely connected unet for liver and tumor seg-
mentation from CT volumes,” IEEE Transactions On Medical Imaging,
vol. 37, no. 12, pp. 2663–2674, 2018.
[5] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis,
“Deep learning for computer vision: A brief review,” Computational
Intelligence and Neuroscience, vol. 2018, 2018.
[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and
L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” In-
ternational Journal of Computer Vision (IJCV), vol. 115, no. 3, pp.
211–252, 2015.
[7] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2018, pp. 7132–7141.
[8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for im-
age recognition,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016, pp. 770–778.
24
[9] A. Canziani, A. Paszke, and E. Culurciello, “An analysis of deep
neural network models for practical applications,” arXiv preprint
arXiv:1605.07678, 2016.
[10] Y. LeCun, “1.1 deep learning hardware: Past, present, and future,”
in 2019 IEEE International Solid-State Circuits Conference-(ISSCC).
IEEE, 2019, pp. 12–19.
[11] E. Nurvitadhi, J. Sim, D. Sheffield, A. Mishra, S. Krishnan, and D. Marr,
“Accelerating recurrent neural networks in analytics servers: Compari-
son of FPGA, CPU, GPU, and ASIC,” in 2016 26th International Con-
ference on Field Programmable Logic and Applications (FPL). IEEE,
2016, pp. 1–4.
[12] “System architecture.” [Online]. Available: https://cloud.google.com/
tpu/docs/system-architecture
[13] “Mlperf inference v0.5 results.” [Online]. Available: https://mlperf.
org/inference-results/
[14] M. S. Abdelfattah, D. Han, A. Bitar, R. DiCecco, S. O’Connell,
N. Shanker, J. Chu, I. Prins, J. Fender, A. C. Ling et al., “DLA: Com-
piler and FPGA overlay for neural network inference acceleration,” in
2018 28th International Conference on Field Programmable Logic and
Applications (FPL). IEEE, 2018, pp. 411–4117.
[15] C. Jiang, D. Ojika, T. Kurth, S. Vallecorsa, B. Patel, H. Lam et al.,
“Acceleration of scientific deep learning models on heterogeneous com-
puting platform with intel® FPGAs,” in International Conference on
High Performance Computing. Springer, 2019, pp. 587–600.
[16] N. author, “Accelerating dnns with xilinx alveo accelerator cards,” Xil-
inx, White Paper na, Oct. 2014.
[17] D. Bhowmik and K. Appiah, “Embedded vision systems: A review of
the literature,” in International Symposium on Applied Reconfigurable
Computing. Springer, 2018, pp. 204–216.
[18] J. O. Gaya, L. T. Gonçalves, A. C. Duarte, B. Zanchetta, P. Drews,
and S. S. Botelho, “Vision-based obstacle avoidance using deep learn-
ing,” in 2016 XIII Latin American Robotics Symposium and IV Brazilian
Robotics Symposium (LARS/SBR). IEEE, 2016, pp. 7–12.
[19] “Myriad x,” 2019. [Online]. Available: https://www.movidius.com/
myriadx
[20] “Coral Accelerator Module data sheet,” Google.
25
[21] “Edge tpu performance benchmarks.” [Online]. Available: https:
//coral.ai/docs/edgetpu/benchmarks/
[22] Zynq DPU v3.2 Product Guide, Xilinx, 2020.
[23] D. Wang, K. Xu, and D. Jiang, “PipeCNN: An OpenCL-based open-
source FPGA accelerator for convolution neural networks,” in 2017 In-
ternational Conference on Field Programmable Technology (ICFPT).
IEEE, 2017, pp. 279–282.
[24] “Acs712 data sheet,” Allegro.
[25] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified em-
bedding for face recognition and clustering,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–
823.
[26] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical
guidelines for efficient cnn architecture design,” in Proceedings of the
European Conference on Computer Vision (ECCV), 2018, pp. 116–131.
[27] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for con-





This appendix includes 3 datasets. The first dataset is a collection of the
performance of the Terasic OSK as an OpenVINO target, the second dataset
is a collection of the performance of the Terasic OSK as a PipeCNN target,
and the third is a collection of wall outlet power measurements for the OSK
and host computer while OpenVINO is performing inference.
First is table A.1 for the OpenVINO benchmarking. Power and Energy
report only the usage metrics for the OSK board and do not factor in the
power usage of the CPU. Additionally, the measurement system is not entirely
reliable and occasionally fails to retrieve a valid measurement; in this case
power and energy are reported as -1. Should inference fail, latency and

Second is table A.2 for the PipeCNN benchmarking. The totalRuntime
and timePerBatch columns occasionally have a value of -1; this indicates the
benchmarking tool failed to run with that bitstream loaded on the board.
Additionally, the warning column indicates whether or not a padding warning
was given, as discussed in the results section. totalResources is the
percentage of total resource utilization. The network topologies are denoted
as A for Alexnet, V for VGG-16, and R for Resnet-50. The Power and Energy
columns have the same meaning as in table A.1. Intel FPGA SDK for OpenCL 18.1
was used to collect the PipeCNN data. Unfortunately, the Terasic provided
board support package for the OSK board is made for Intel FPGA SDK for
OpenCL 17.1. No released version of PipeCNN is designed to use Intel FPGA
SDK for OpenCL 17.1. This SDK incompatibility is the likely reason that the
PipeCNN performance data is worse than previously published performance

Last is table A.3 for the “at the wall” power measurements for the OSK
running the OpenVINO benchmark. Only the 8x16 bitstream was used for
this data. The benchmarking method used for this data is the same as the
main OpenVINO benchmarking in table A.1 except inference was run for 5
hours instead of stopping at 3000 iterations.
Table A.3: OpenVINO Wall Power
Topology        Device    Energy (kWh)
Shufflenet V2   cpu       0.16
Shufflenet V2   hetero    0.17
Resnet-18       hetero    0.16
Resnet-18       cpu       0.17
Alexnet         cpu       0.17
Alexnet         hetero    0.18
VGG-16          hetero    0.16
VGG-16          cpu       0.16
Resnet-50       hetero    0.17
Resnet-50       cpu       0.17
Unet            cpu       0.17
Unet            hetero    0.16
Efficientnet    hetero    0.17
Efficientnet    cpu       0.15
Insightface     cpu       0.15
Insightface     hetero    0.15
Resnet-152      cpu       0.15
Resnet-152      hetero    0.14
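Because each run in table A.3 lasted 5 hours, the energy figures can be converted to an average wall power by dividing by the run time. The sketch below does this for a few rows of the table. The function and variable names are hypothetical, and the result is only as precise as the two-digit energy readings.

```python
RUN_HOURS = 5.0  # each wall-power run lasted 5 hours, as stated above


def average_power_watts(energy_kwh: float, hours: float = RUN_HOURS) -> float:
    """Average power in watts for a given energy (kWh) over a duration (h)."""
    if hours <= 0:
        raise ValueError("duration must be positive")
    return energy_kwh * 1000.0 / hours


if __name__ == "__main__":
    # (topology, device, energy in kWh) copied from table A.3.
    rows = [
        ("VGG-16", "hetero", 0.16),
        ("VGG-16", "cpu", 0.16),
        ("Resnet-152", "cpu", 0.15),
        ("Resnet-152", "hetero", 0.14),
    ]
    for topology, device, energy_kwh in rows:
        watts = average_power_watts(energy_kwh)
        print(f"{topology:<12} {device:<7} {watts:.0f} W")
```

For example, 0.16 kWh over 5 hours corresponds to an average of 32 W at the wall.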
