2L-3W: 2-Level 3-Way Hardware-Software Co-Verification for the Mapping
  of Deep Learning Architecture (DLA) onto FPGA Boards by Odetola, Tolulope A. et al.
Tolulope A. Odetola, Katie M. Groves, and Syed Rafay Hasan
Abstract—FPGAs have become a popular choice for deploying deep learning architectures (DLA). There are many researchers that
have explored the deployment and mapping of DLA on FPGA. However, there is a growing need for design-time
hardware-software co-verification of these deployments. To the best of our knowledge, this is the first work that proposes a 2-Level
3-Way (2L-3W) hardware-software co-verification methodology and provides a step-by-step guide for the successful mapping,
deployment and verification of DLA on FPGA boards. The 2-Level verification ensures that the implementation at each stage
(software and hardware) follows the desired behavior. The 3-Way co-verification provides a cross-paradigm (software, design and
hardware) layer-by-layer parameter check to assure the correct implementation and mapping of the DLA onto FPGA boards. The
proposed 2L-3W co-verification methodology has been evaluated over several test cases. In each case, three sets of results are
compared to obtain a layer-by-layer similarity score: the prediction and layer-by-layer output of the DLA deployed on a PYNQ FPGA
board (hardware), the intermediate layer-by-layer output of the DLA implemented in Vivado HLS (design), and the prediction and
layer-by-layer output at the software level (Caffe deep learning framework). The comparison is performed by a fully
automated Python script. The resulting similarity score indicates the degree of success of the DLA mapping to the FPGA or, in the
case of unsuccessful mapping, helps identify at design time the layer to be debugged. We demonstrate our technique on the LeNet
DLA and a Caffe-inspired Cifar-10 DLA, and the co-verification results yielded layer-by-layer similarity scores of 99%.
Index Terms—Deep Learning, Convolutional Neural Network, Hardware-Software Co-Verification, FPGA, High Level Synthesis.
1 INTRODUCTION
The convolutional neural network (CNN), a well-known Deep
Learning Architecture (DLA) that evolved from the artificial
neural network, has been applied extensively in applications
such as video surveillance, mobile robot vision and image
search engines in data centers [1]. In general, deep
learning uses a multi-layer neural network model to extract
high-level features, which are combinations of low-level
abstractions, in order to classify mutually exclusive
properties of image data [2]. This helps in finding the
distributed data features needed to solve complex problems
in machine learning [3].
Due to the specific computation pattern of CNNs, cloud
computing has been employed to perform classification with
deep learning models, but this raises concerns about privacy
[4]–[16], security [17] and latency. General-purpose
processors are also not efficient for CNN implementation and
can hardly meet the performance requirement [18]. Thus,
various accelerators based on FPGA (Field Programmable Gate
Array), GPU (Graphics Processing Unit), and even ASIC
(Application Specific Integrated Circuit) designs have been
proposed to improve the performance of CNN designs [19].
Among these approaches, FPGA-based accelerators have
attracted more attention because they offer good
performance, high energy efficiency, fast prototyping,
and reconfigurability [1].
• Tolulope A. Odetola, Katie M. Groves, and Syed Rafay Hasan are with
the Department of Electrical & Computer Engineering, Tennessee Tech
University, Cookeville, TN 38505 USA.
To take advantage of what FPGAs have to offer, several
approaches such as [20], [21] and [22] have been proposed
to enable efficient optimizations for the deployment and
successful mapping of DLAs onto FPGA boards. These
optimizations help to reduce latency and to conserve area
and memory on the hardware (FPGA) [23]. The majority of
these mappings and optimizations are validated only at the
point of final prediction, through the measured accuracy.
Hence, a layer-by-layer design-time verification mechanism
for mapping a DLA from the software paradigm to the hardware
paradigm has not been addressed. Verification is crucial
in hardware design, as it accounts for about 80% of
modern hardware design time [24].
Many researchers understand the crucial nature of
verification in the design and mapping of DLA. However, the
mapping of a DLA to an FPGA board passes through distinct
phases, from the software to the design and the eventual
mapping onto the FPGA board (hardware). Several approaches
have been adopted to verify the workings and correctness of
the DLA. Xiang et al. [25] propose a software simulation
based approach to verify the correctness of multilayer
neural networks by measuring the maximum sensitivity of the
layers of the network. Similarly, Dwarakanath et al. [26]
propose a software based approach to verify the correctness
of image classifiers by building relationships between
subsequent layer-by-layer
outputs corresponding to different inputs. These
verification approaches are limited to the software-level
layer-by-layer output and the accuracy of the final
prediction. They do not provide a means of verifying the
implementation correctness of the mapping of DLAs onto
hardware, and they cannot be applied directly to the mapping
of DLAs to FPGA boards.
arXiv:1911.05944v1 [cs.LG] 14 Nov 2019
Other approaches that focus on the mapping of DLA
onto FPGA boards involve hardware-software co-design.
Guo et al. [20] propose a design flow for mapping CNNs
onto embedded FPGAs using data quantization to reduce the
bit-width of CNN models without compromising much on
accuracy. Similarly, Jiandong et al. [27] propose a
collaborative framework to optimize OpenCL-based CNN
designs. These co-design techniques can only validate the
correctness of the implementation based on the accuracy of
the final prediction.
Very recently, Cong et al. [28] proposed a time-saving
co-design methodology that simultaneously searches possible
design options to auto-generate efficient DNNs optimized
for FPGA deployment. This design approach is fully automated
from the software to the hardware deployment. However, it
only validates the design based on the prediction accuracy
of the model.
One shortcoming of the existing literature on mapping
DLAs to FPGAs is the lack of a complete hardware-software
co-verification scheme that checks the hardware
implementation against its software-based counterpart.
Some of the above-mentioned approaches ([20], [27] and
[28]) can only validate or debug the deployment at the
stage of final prediction, while others ([25] and [26])
verify the DLA only in a software environment. During the
mapping of a DLA to an FPGA, if the prediction in design is
wrong or does not correspond to the software
implementation, the traditional verification approaches are
not able to analyze the layer-by-layer feature values of
the DLA at design time. In this paper, we work with the
premise that, for a sustainable DLA design environment,
co-verification at the three stages of design (software
stage, hardware design stage and hardware deployment stage)
is crucial. Hence, there is a need for a complete
hardware-software co-verification methodology that shows
the step-by-step, end-to-end process of deployment and
verification of the inference phase (the forward
propagation path) of DLAs on FPGA boards. To the best of
the authors' knowledge, no such methodology exists so far
in the literature.
In this paper, we propose a coherent 2-Level 3-Way (2L-3W)
hardware-software co-verification approach. Our 2-level
verification approach is divided into the software
inference level and the hardware inference level of the
DLA. Our 3-way co-verification technique provides a means
of assuring that the software design, hardware design and
hardware mapping of the DLA are coherent and correctly
implemented.
The following are the contributions of this work:
• A step-by-step and end-to-end methodology for the
mapping of DLAs onto FPGA boards
• A 2-level verification approach to ensure the implemen-
tation correctness of a designed DLA in both software
and hardware
• A 3-way layer-by-layer co-verification technique that
ensures successful mapping of DLA to FPGA boards
The remainder of this paper is organized as follows: Sec-
tion II provides some preliminary information. Section III
discusses the proposed methodology of hardware-software
co-verification of DLA. Section IV describes the experimen-
tal validation of the co-verification methodology setup on
Xilinx PYNQ FPGA board. Section V shows the results
and lessons learned. Section VI discusses the related work.
Section VII compares the co-verification methodology with
different approaches. Section VIII concludes the paper.
2 PRELIMINARIES
This section introduces the concepts used in the rest of
this paper.
2.1 High Level Synthesis (HLS)
Hardware accelerators like FPGAs provide a means to
achieve moderate performance with low power consumption,
massive memory parallelism and short time to market [29].
To ensure proper deployment of a DLA on an FPGA,
hardware-software co-verification is essential.
Hardware-software co-verification helps to ensure that the
behavior of the embedded system software is consistent with
the hardware design.
Hardware design using Hardware Description Languages
(HDL) can be time consuming and difficult to debug and
verify [30]. High Level Synthesis (HLS) offers flexibility
by taking C/C++ code (a high-level language) annotated with
a set of directives and automatically generating HDL, such
as VHDL or Verilog, for hardware implementation on FPGA.
2.2 Deep Learning Framework: Caffe
In this paper, the Caffe deep learning framework is adopted
because of its popularity, support and easy-to-use
interface. It makes it easy to experiment with popular
pre-trained models [31]. Caffe provides toolkits for
training, fine-tuning and deploying DLAs [32]. In Caffe,
the DLA is designed and configured using prototxt files
prior to training. After training, Caffe generates a
caffemodel file containing the trained parameters (weights
and biases) of the DLA. The parameters in the caffemodel
file can be accessed using Python libraries.
2.3 Network Surgery
DLAs tend to have stacked layers. Each layer contains
learnable parameters (weights and biases) [33]. For proper
replication and deployment of the DLA on hardware, these
learnable parameters must be accessed and extracted.
During the inference phase, network surgery gives access
to the output of each layer when unseen data is passed
through the DLA. The output of a layer is called a Blob.
Network surgery thus allows access to, and extraction of,
the DLA parameters and Blobs.
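As a rough illustration of what network surgery exposes, the sketch below walks a loaded network and collects each layer's Blob and parameters into plain Python lists. It assumes a pycaffe-style object whose `blobs` mapping holds arrays under `.data` and whose `params` mapping holds a weight entry followed by a bias entry per layer. The helper names `flatten`, `collect_blobs` and `collect_params` are hypothetical and are not taken from the paper.

```python
def flatten(values):
    """Flatten nested lists (or array-likes exposing tolist()) to a flat list of floats."""
    if hasattr(values, "tolist"):
        values = values.tolist()
    if isinstance(values, (list, tuple)):
        flat = []
        for item in values:
            flat.extend(flatten(item))
        return flat
    return [float(values)]


def collect_blobs(net):
    """Return {layer_name: flat list of output values} for every Blob in the network."""
    return {name: flatten(blob.data) for name, blob in net.blobs.items()}


def collect_params(net):
    """Return {layer_name: (weights, biases)} as flat lists.

    Assumes index 0 holds the weights and index 1 the biases, as in pycaffe.
    """
    return {name: (flatten(p[0].data), flatten(p[1].data))
            for name, p in net.params.items()}
```

With pycaffe, `net` would typically be created with `caffe.Net(prototxt_path, caffemodel_path, caffe.TEST)` and passed to these helpers after a call to `net.forward()`; both paths are placeholders.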
Fig. 1: General Deep Learning Approach.
2.4 Chosen FPGA Board: PYNQ
The hardware environment chosen is the PYNQ-Z1 FPGA
[34]. PYNQ-Z1 is built upon Xilinx ZYNQ SoC technol-
ogy and is used to develop applications for ZYNQ-7000
based devices [35]. The PYNQ platform offers designers
the privilege of exploiting the programmable logic of the
FPGA board from a Python environment [35]. Xilinx
provides Python packages that facilitate interaction with
hardware modules using overlays.
2.5 Python Overlay
Overlays, or hardware libraries, are configurable FPGA
designs that extend the user application from the ZYNQ
processor of a PYNQ board into the programmable logic.
Overlays can be loaded onto the FPGA dynamically, like a
software library. PYNQ overlays are created by hardware
designers and wrapped with PYNQ’s Python Overlay API.
This allows a Python interface to program and control
specialized hardware overlays [36].
2.6 Data Transfer: AXI Direct Memory Access (AXI
DMA)
AXI DMA transfers data between memory and AXI4-Stream-type
target peripherals [37]. The AXI DMA IP in Vivado
provides high-bandwidth direct memory access between an
AXI4 memory-mapped interface and the AXI4-Stream ports of
IP (Intellectual Property) cores [38]. PYNQ supports the
AXI central DMA IP with the PYNQ DMA class [36]. DMA
can be used for high-performance burst transfers between
the Processing System (PS) DRAM and the Programmable Logic
(PL). It helps to offload data movement from the Central
Processing Unit (CPU) in processor-based systems [38]. AXI
DMA data movement between system memory and the stream
target takes two paths: the AXI4 Read Master feeds the
memory-mapped to stream (MM2S) Master, and the stream to
memory-mapped (S2MM) Slave feeds the AXI4 Write Master.
2.7 General Deep Learning Approach
Fig. 1 shows the general end-to-end approach from the
training of a DLA to its deployment on an FPGA board. This
includes the following steps:
• Network Training: This stage takes place after the DLA
has been designed. Training determines the set of
parameters that maximizes the DLA’s accuracy by
leveraging gradient descent (back propagation). It
involves a number of forward and backward propagations,
based on the number of iterations specified in the model
design. Network training is done with CPUs or GPUs on
software frameworks such as Caffe, TensorFlow and so on.
• Network Testing: This stage is also referred to as the
inference stage. The trained model is used to classify
unseen data and predict a result with a degree of accuracy.
• C++ Layer-by-Layer Abstraction: Model design, training
and testing are usually done in a Python environment.
For hardware design, the model design is converted
from the prototxt syntax adopted in Caffe (which is used
through Python libraries in model training) to the C++
syntax used in hardware design. In this stage, every
layer is designed in C++ as stipulated in the model
design in the prototxt. All conditions in terms of layer
outputs, kernel sizes, stride sizes and so on for each
respective layer are obeyed during this conversion.
• Vivado HLS (Hardware Design): Vivado HLS provides
an environment for the simulation and synthesis of
the C++ code of the model design. After successful
synthesis, Vivado HLS allows for IP generation of the
model design.
• Deployment over FPGA (PYNQ Board): In this stage,
the IP generated from Vivado HLS is converted to
bitstreams and deployed on the FPGA board.
The verification part shown on the right hand side of Fig. 1
is something outside the realm of the general deep learning
deployment methodology.
3 HARDWARE-SOFTWARE CO-VERIFICATION OF
DLA INFERENCE PHASE
In this paper we propose a novel 2L-3W hardware-software
co-verification concept for DLA deployment on FPGA boards.
To achieve this, the Caffe software framework is used for
the software implementation (training and testing) and
Vivado HLS for the hardware design synthesis. Finally, our
approach uses the Xilinx PYNQ FPGA board for hardware
implementation.
Before explaining our proposed co-verification approach,
it is worth noting that we obtain the trained model a priori
using the Caffe deep learning framework. This trained model
is called the Caffemodel file in the Caffe framework.
Furthermore, we design the feed-forward path of the trained
DLA using Vivado HLS.
Fig. 2 shows the different levels and sections of the co-
verification methodology. These sections are discussed as
follows:
3.1 Level 1: Inference Phase Software Verification
This is the first level of the proposed 2L-3W co-verification
methodology. In this phase, the trained DLA is collected.
As shown in Fig. 2, the images from the training dataset
that are correctly predicted by the trained DLA are passed
through the trained model. Network surgery is used to get
the Blobs (layer-by-layer output features) of each layer of
the DLA. These Blobs are used to investigate the numerical
distribution of each respective layer Blob. Statistical
properties such as the range, minimum, maximum and standard
deviation of each Blob are also collected over a given
number of training images, generalized, and written to the
Software Properties Verification File (SPVF), as shown in
Fig. 2. The SPVF contains boundary values for each element
in the Blob of each layer. This forms a benchmark for
comparing the numerical distribution and statistical
properties of the Blobs of subsequent images (test images)
that are passed through the model. This serves as the
Inference Phase Software Verification shown in Fig. 2.
Fig. 2: Verification Approach.
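The construction of an SPVF entry for one layer can be sketched as follows. This is a minimal illustration under assumptions: each Blob is already flattened to a list of floats, the per-image statistics are averaged across images to obtain the generalized properties, and the population standard deviation is used. The function names and the dictionary layout are hypothetical and do not reflect the authors' file format.

```python
import statistics


def blob_stats(blob):
    """Statistical properties of one flattened Blob."""
    lo, hi = min(blob), max(blob)
    return {"min": lo, "max": hi, "range": hi - lo,
            "mean": statistics.fmean(blob),
            # Assumption: population standard deviation.
            "std": statistics.pstdev(blob)}


def build_spvf_entry(blobs):
    """Summarise one layer over many images.

    blobs: list of flattened Blobs (one per correctly predicted image),
    all of the same length.
    """
    per_image = [blob_stats(b) for b in blobs]
    # Generalised properties: average each statistic over the images.
    generalised = {key: statistics.fmean([s[key] for s in per_image])
                   for key in per_image[0]}
    # Boundary values: per-element minimum and maximum across images.
    bounds = [(min(col), max(col)) for col in zip(*blobs)]
    return {"stats": generalised, "bounds": bounds}
```

Calling `build_spvf_entry` once per layer and storing the results under the layer name would give a structure that later stages can be checked against.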
3.2 Level 2: Inference Phase DLA Mapping From Soft-
ware to Hardware
This is the second level of the proposed 2L-3W co-
verification methodology. This level is divided into 6 sec-
tions. After the software verification is done, the DLA is
mapped to FPGA and test images are used to verify the
implementation correctness of mapping the DLA from soft-
ware to hardware (FPGA).
3.2.1 Section A: Parameter Extraction Using Network
Surgery
This stage of the co-verification methodology is shown as
Section A in Fig. 2. Here, the parameters of the trained
model (i.e. the Caffemodel file) are obtained using network
surgery in Caffe. Unseen data (data not used in the
training phase), shown as input data in Fig. 2, is passed
through the trained Caffemodel. This stage is carried out
in the software environment. The prediction and the
layer-by-layer outputs (Blobs) are extracted and written to
a file called File_SW, as shown in Fig. 2. The numerical
distribution and statistical properties of the Blobs
written to File_SW are then compared with, and validated
against, the properties written to the SPVF generated in
Level 1. Lines 1 to 7 of Algorithm 1 show what actions need
to be taken if the DLA requirement is not met.
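One way to picture the check of File_SW against the SPVF is an element-wise boundary test. The sketch below assumes the SPVF stores one (minimum, maximum) pair per Blob element; the function name `out_of_bounds` and the optional `margin` parameter are hypothetical additions for illustration.

```python
def out_of_bounds(blob, bounds, margin=0.0):
    """Return indices of Blob elements that fall outside their SPVF boundary values.

    blob:   flattened layer output for one test image
    bounds: list of (minimum, maximum) pairs, one per element
    margin: assumed slack added on both sides of each interval
    """
    if len(blob) != len(bounds):
        raise ValueError("Blob and boundary list differ in length")
    return [i for i, (value, (lo, hi)) in enumerate(zip(blob, bounds))
            if not (lo - margin <= value <= hi + margin)]
```

An empty result means the layer passes; a non-empty result points to the elements, and hence the layer, to inspect.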
3.2.2 Section B: Parsing of Weights from Caffe to Vivado
HLS
The parameters of the Caffemodel are obtained in Python
and passed through a data cleaning process that converts
the layer-by-layer parameters into the C++ syntax required
by Vivado HLS. These converted parameters are then
incorporated into the Vivado HLS synthesis of the hardware
design. This is further explained in the simulation of HLS
design section (Section D). Lines 8 to 11 of Algorithm 1
summarize this section.
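The data cleaning step can be illustrated with a small converter that renders nested Python lists as a C++ array initializer. This is only a sketch: the function name `to_cpp_array`, the choice of `const float`, and the single-declaration output are assumptions, since the paper does not show the exact format used.

```python
def array_shape(values):
    """Shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> [2, 2]."""
    shape = []
    while isinstance(values, (list, tuple)):
        shape.append(len(values))
        values = values[0]
    return shape


def brace_init(values):
    """Render nested lists as a brace-enclosed C++ initializer."""
    if isinstance(values, (list, tuple)):
        return "{" + ", ".join(brace_init(v) for v in values) + "}"
    # repr() keeps enough digits to round-trip a Python float; the 'f'
    # suffix marks a single-precision literal in C++.
    return repr(float(values)) + "f"


def to_cpp_array(name, values, ctype="const float"):
    """Return a C++ declaration such as: const float w[2][2] = {{...}, {...}};"""
    dims = "".join("[%d]" % d for d in array_shape(values))
    return "%s %s%s = %s;" % (ctype, name, dims, brace_init(values))
```

For example, `to_cpp_array("conv1_b", [0.5, -1.0])` yields `const float conv1_b[2] = {0.5f, -1.0f};`, which could be written to a header included by the HLS source.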
3.2.3 Section C: Streaming of Input Data to the Hardware
Design
This section is in the hardware design stage. Here, the
input data (unseen data) used in Section A is read using
the OpenCV C++ library and converted to a stream of data
using the HLS stream library. The stream of data is passed
to the designed IP in the HLS design (Section D). Line 12
of Algorithm 1 summarizes this section.
3.2.4 Section D: Simulation of HLS Design to Provide
Layer-by-layer Output
This section is shown in Fig. 2 as Section D. It assumes
that the C++ adaptation of each layer of the DLA has been
completed to form the IP in Vivado HLS. The weights
obtained by parsing the Caffemodel file (Section B) are
imported and merged appropriately with the IP. Unseen
image data is read using OpenCV (as shown in Section C)
and converted to a stream of data. This stream of data is
passed as an input to the designed IP. After simulation of
the IP, the layer-by-layer output and the prediction are
written to a specified file (denoted as File_Design), as
shown in Fig. 2. Lines 13 to 16 of Algorithm 1 show what
actions need to be taken if the verification does not meet
the requirement.
3.2.5 Section E: Hardware Deployment and On-board Ver-
ification
The generated IP is synthesized to obtain a bitstream and
a .tcl file in the Xilinx Vivado environment, as in the
conventional design flow. These files are imported to the
PYNQ board as an Overlay to be called in the Python
environment. Special provisions are made so that the output
of each stage of the DLA can be compared against the output
of the simulation of the HLS design (explained in Section
D) and the output of the software deep learning framework
(the Caffe layer-by-layer output explained in Section A).
Fig. 3 illustrates the comparison. It shows that all layers
of the DLA are synthesized as a separate module. For
example, the output of conv1_dma (shown on the right-hand
side of Fig. 3) corresponds to the output of the conv1
layer in the software (shown on the left-hand side of
Fig. 3). To store the values and automate the process, we
store the output of each stage in the Python environment in
a separate file (denoted as File_HW), as shown in Fig. 2.
Lines 18 to 24 of Algorithm 1 summarize this section.
3.2.6 Section F: Co-verification
To automate our methodology, the outputs of all three
stages need to be compared seamlessly. To achieve this, a
Python script is written that compares the
software-verified layer-by-layer outputs of each stage,
which are stored in File_SW, File_Design and File_HW, for
our three-way verification approach. Lines 25 to 28 of
Algorithm 1 summarize this section and suggest possible
actions if the verification does not meet the requirement.
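A three-way comparison of this kind might look like the sketch below, which reports the first layer whose outputs diverge between any two stages. It assumes the three files have already been parsed into dictionaries mapping layer names to flat lists of floats. The function name, the layer ordering argument and the tolerances are hypothetical choices, not the authors' script.

```python
import math


def first_divergent_layer(layer_order, sw, design, hw, rel_tol=1e-2):
    """Return (layer, stage_pair) for the first mismatch, or None if all layers agree.

    sw, design, hw: {layer_name: flat list of floats} parsed from
    File_SW, File_Design and File_HW respectively.
    """
    pairs = (("SW vs Design", sw, design),
             ("SW vs HW", sw, hw),
             ("Design vs HW", design, hw))
    for layer in layer_order:
        for label, a, b in pairs:
            x, y = a[layer], b[layer]
            # Assumed tolerances: 1% relative, 1e-6 absolute for values near zero.
            same = len(x) == len(y) and all(
                math.isclose(u, v, rel_tol=rel_tol, abs_tol=1e-6)
                for u, v in zip(x, y))
            if not same:
                return layer, label
    return None
```

Walking the layers in network order means the first reported layer is the earliest point at which the mapping went wrong, which is the layer to debug first.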
4 EXPERIMENTAL VALIDATION OF HARDWARE-
SOFTWARE CO-VERIFICATION
To validate our methodology, we implemented 2 DLAs. The
first DLA is LeNet and the other is a Caffe-inspired
Cifar-10 DLA, as shown in Figs. 3 and 4, respectively. Both
DLAs are implemented on the Xilinx PYNQ FPGA board, and the
implementation processes are for the most part the same.
For clarity, the discussion in this section explains the
process for the LeNet DLA.
Fig. 3: LeNet DLA and hardware configuration for the output of each layer on FPGA
Fig. 4: Cifar-10 DLA and hardware configuration for the output of each layer on FPGA
Algorithm 1: 2L-3W Hardware-software Co-verification Methodology
Require: Design, Configure and training of Model
1: Testing of model on unseen data (D) in Caffe
2: if Testing = Fails then
3:   Action: Retrain and re-design or re-configure model
4: else
5:   Perform network surgery on model
6:   Obtain layer-by-layer Blob and obtain the numerical distribution and generalized statistical properties (Range, Maximum, Minimum, Mean and Standard deviation) for correctly predicted images in training sets and write to SPVF
7: end if
8: Extract Blobs (layer-by-layer output) from testing (of unseen data), write to file (File_SW)
9: Compare File_SW with SPVF generated in Level 1
10: Extract parameters (weights and biases) of Model
11: Convert the parameters from Python syntax to C++
12: Implement C++ representation of each layer of the model design in Vivado HLS
13: Incorporate model design parameters with model design in Vivado HLS
14: Simulate model design with unseen data (D) used in model testing
15: Write layer-by-layer output of the result of simulation of the model in Vivado HLS to file (File_Design)
16: Compare value-to-value of respective layer-by-layer output between Vivado HLS and Caffe
17: if Vivado HLS Output != Caffe Output then
18:   Action: Redesign C++ algorithm and check for error using layer-by-layer output values
19: else
20:   Generate IP from model design in Vivado HLS
21: end if
22: Configure IP in Vivado block design
23: Generate bit-stream
24: Deploy bit-stream on board
25: Import bit-stream in Python Overlay
26: Run bit-stream with unseen data (D) and write layer-by-layer output to file (File_HW)
27: Perform hardware-software verification with results
28: if FPGA Output != Vivado HLS Output or Caffe Output then
29:   Action: Redesign C++ algorithm and re-generate bitstream
30: else
31:   End Deployment
32: end if
The LeNet DLA is designed and trained in Caffe. After
training, several steps are taken to test the trained model
and to validate the implementation correctness of the model.
A total of 100 test images are passed through the model,
which yields an accuracy of 97%. After verifying the
accuracy, 1000 correctly predicted images are passed through
the LeNet DLA to obtain the numerical distribution of each
respective Blob (Blobs are explained in Section II) using
network surgery. The mean, range, maximum, minimum and
standard deviation are obtained and averaged over the 1000
images to get generalized statistical properties of each
respective Blob, as shown in Fig. 2. The boundary values
(the minimum and maximum of each element across all the
chosen training images) for each element in the Blob are
also obtained. These properties and boundary values are
written to a specified file called the SPVF. To illustrate
this, Fig. 5 shows the numerical distribution of the outputs
from the first fully connected layer of the LeNet DLA. The
Blobs of the first fully connected layer for unseen data are
compared with this distribution to verify them. The SPVF for
the first fully connected layer shows that the Blobs follow
a Gaussian distribution. The same procedure is carried out
for all the layers in the LeNet DLA. This concludes Level 1
of the 2L-3W hardware-software co-verification, which is
shown as “Inference Phase Software Verification” in Fig. 2.
Fig. 5: Software Verification for Fully Connected Layer 1 for
LeNet DLA
For the second level of the 2L-3W hardware-software
co-verification, labelled “Level 2: Inference Phase DLA
Mapping” in Fig. 2, an unseen image is passed through the
trained DLA, and the Blobs of each layer are obtained using
network surgery (explained in Section II) and written to a
specified file denoted by File_SW in Fig. 2. The Blobs
written to File_SW are then verified against the SPVF: each
element in the Blob of each layer is compared with the
boundary values in the SPVF. The code snippet that allows
access to the layer-by-layer output using network surgery
for one of the DLA layers is shown in Image 3 of Fig. 6.
Following the proposed 2L-3W co-verification methodology in
Fig. 2, in Section B the parameters (weights and biases) of
the DLA are obtained and parsed into the HLS design. Each
of the layers defined in the Caffe framework is also defined
in Vivado HLS to maintain the same prediction accuracy
from Caffe to the PYNQ hardware. As shown in Section C
in Fig. 2, a stream of input data (the same data used in
testing the Caffe model) is used in simulating the HLS
design layers with the parsed parameters. The layer-by-layer
output of the simulation is written to a specified file
denoted by File_Design in Fig. 2. The layer-by-layer output
of the DLA written to File_Design is then verified with the
SPVF file.
After successful verification, the DLA is optimized to fit
the PYNQ FPGA board and is then synthesized and packaged
as an IP. Vivado HLS contains built-in directives known as
pragmas (shown in the first two lines of Image 4 of Fig. 6)
that specify how data is written to the IP (shown in
Section C in Fig. 2) and how data is read from the IP. The
pragma used to allow data flow is called “interface axis
port”. This axis port is important because it allows an
actual physical AXI4-Stream port to be used later in the
block design. The AXI4-Stream ports allow the Blobs from
each layer to be viewed in the Python environment.
Fig. 6: Illustration of Conv1 for Caffe to HLS Design to Block Design to Python
Fig. 7: The IP integration with the Zynq Processor
To generate an Overlay that will be exported to the
PYNQ board for the LeNet DLA, the generated IP is imported
into Vivado, where each axis port defined in Vivado HLS is
now declared as an AXI4-Stream port on the IP. An example
of this can be seen in Fig. 3, where the LeNet_DLA IP has
ports representing the Blobs of each layer. The
AXI4-Streams are written to and read from via AXI DMA, as
shown in Fig. 3. Each of these AXI DMAs interacts with the
Python Overlay APIs to write data to and read data from
the AXI DMA. The connections in Fig. 3 are collapsed into
a hierarchy called “LeNet_DLA”, shown in Fig. 7, which is
called in the Python environment. As shown in the block
design of Fig. 7, the LeNet_DLA transfers data to and from
the ZYNQ processor via the axi_interconnect_0 and
axi_interconnect_1 modules, respectively. When all the
connections are routed, the connections in the block diagram
are validated, synthesized, and implemented. After the
implementation, a bitstream file and a .tcl file are
generated, which are exported to the PYNQ FPGA board to
create an Overlay to be called from the Python environment.
As shown in the Hardware Deployment and On-board
Verification phase (Section E) in Fig. 2, the Python
Overlay API is imported into the Jupyter Notebook, which
allows reading from and writing to the IP on the hardware
of the PYNQ board via AXI DMA. In the Python environment,
the Python Overlay library uses AXI DMA APIs to call the
AXI DMAs created in the block diagram directly and allows
an image vector to be written as a stream to the IP for
processing. After the execution, the output of each layer
is written to its respective AXI DMA, from which it is read
into the Python environment. These outputs are verified
with the SPVF and written to a specified file (File_HW).
The prediction is read from the output register specified
in Vivado HLS.
To illustrate this co-verification process, Fig. 6 shows
how the output of the conv1 layer defined in Caffe is
written in Vivado HLS with its respective number of
outputs. In Vivado HLS, IN_DATA and OUT_CONV1 are defined
as AXI4-Streams, which provide the actual ports through
which the input image is streamed in (IN_DATA) and the Blob
is streamed out (OUT_CONV1), as shown in Fig. 6. Importing
the IP into the Vivado block design shown in Fig. 6 shows
that IN_DATA and OUT_CONV1 have their own ports to be
connected to an AXI DMA. OUT_CONV1 is written to conv1_dma
(shown in Image 6 of the code snippet in Fig. 6) in the
Python environment. Buffers are created and assigned to
their AXI DMA so that data can be passed to and from the
AXI DMA. Once the IP is signaled to start through the
Python environment, the AXI DMA returns its values to the
buffer, which can then be viewed in the Python environment.
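The buffer handling described above can be sketched as a small helper. The `sendchannel` and `recvchannel` attributes with their `transfer()` and `wait()` methods follow the PYNQ DMA class; everything else (the function name, passing the DMA objects and buffers in as arguments, and the ordering of the calls) is an assumption for illustration, and the buffer allocation call itself depends on the PYNQ release and is left out.

```python
def run_layer_readout(in_dma, out_dmas, in_buffer, out_buffers):
    """Stream one input through the IP and collect every layer's output.

    in_dma:      DMA object feeding the input stream (e.g. IN_DATA)
    out_dmas:    {layer_name: DMA object} receiving each layer's stream
    in_buffer:   contiguous buffer holding the flattened image
    out_buffers: {layer_name: contiguous buffer sized for that layer's Blob}
    """
    # Arm every receive channel before sending, so no output data is missed.
    for name, dma in out_dmas.items():
        dma.recvchannel.transfer(out_buffers[name])
    in_dma.sendchannel.transfer(in_buffer)
    in_dma.sendchannel.wait()
    for dma in out_dmas.values():
        dma.recvchannel.wait()
    # Copy the results out of the DMA buffers into ordinary Python lists.
    return {name: list(buf) for name, buf in out_buffers.items()}
```

On a board, the DMA objects would typically be attributes of a loaded `pynq.Overlay` (for example `overlay.conv1_dma`, matching the name in Fig. 6). Starting the IP itself, if it is not free-running, is not shown here.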
The Caffe software framework generates File_SW at the end
of Section A of Fig. 2. The Vivado HLS design simulation
generates the layer-by-layer output features of the DLA,
which are stored in File_Design, as shown in Section D of
Fig. 2. The layer-by-layer output of the AXI DMA of each
respective layer is written to File_HW, as depicted in
Section E of Fig. 2. Finally, as shown in Fig. 2, Section F
of our 2L-3W co-verification compares the output of each
layer at each stage of hardware-software co-design.
5 RESULTS AND LESSONS LEARNED
The LeNet DLA for the MNIST dataset and the Caffe-inspired
Cifar-10 DLA for the Cifar-10 dataset are shown in Figs. 3
and 4. They are implemented on the PYNQ hardware using the
methodology shown in Fig. 2.
The LeNet DLA consists of 8 layers, excluding the data
and prob layers, as shown in Fig. 3. The data layer passes
a 28x28 hand-written image of a digit through the layers
designed in Caffe and also through the layers designed in
Vivado HLS and on the PYNQ FPGA. The results are shown in
Table 1.
Tables 1 and 3 show the values of subsections of the
arrays output by the Conv1, Pool1 and Conv2 layers, as
written to the respective files, for the LeNet DLA and the
Cifar-10 DLA respectively. The 3-way verification is
performed by the Python script, which compares the output
values of each layer and the prediction written to the
files and returns a similarity score per layer. The
similarity score is defined as the metric for measuring
element-by-element similarity, in terms of magnitude, of
the values stored in the arrays produced by each layer and
written to the three files (File_SW, File_Design, File_HW).
The similarity score per layer for the design stage
($SC_{Des}$) is given as:
$$SC_{Des} = \frac{\sum_{i=0}^{n}\left(1 - \frac{X_1 - Y_1}{X_1}\right)}{n} \tag{1}$$
where:
$SC_{Des}$ = Similarity score for a layer in the design stage
$i$ = $i$th element written to a particular file
$n$ = Number of parameters to be compared in the layer
$X_1 = \max(|E_i|_{SW}, |E_i|_{Des})$
$Y_1 = \min(|E_i|_{SW}, |E_i|_{Des})$
$|E_i|_{SW}$ = Absolute value of the $i$th element value written to the File_SW file
$|E_i|_{Des}$ = Absolute value of the $i$th corresponding element value written to the File_Design file
Similarly, the similarity score per layer for the deployment
stage ($SC_{HW}$) is given as:

$$SC_{HW} = \frac{\sum_{i=0}^{n}\left(1 - \frac{X_2 - Y_2}{X_2}\right)}{n} \qquad (2)$$

where:
$SC_{HW}$ = similarity score for a layer in the hardware
deployment stage
$i$ = index of the $i$th element written to a particular file
$n$ = number of parameters to be compared in the layer
$X_2 = \max(|E_i|_{SW}, |E_i|_{HW})$
$Y_2 = \min(|E_i|_{SW}, |E_i|_{HW})$
$|E_i|_{SW}$ = absolute value of the $i$th element written to
the File_SW file
$|E_i|_{HW}$ = absolute value of the corresponding $i$th element
written to the File_HW file
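Since 1 − (X − Y)/X = Y/X, each term of Eqs. (1) and (2) is simply the ratio of the smaller magnitude to the larger one. The following Python sketch computes the score for one layer under two assumptions that the equations leave open: the sum is taken over exactly the n compared elements, and a pair in which both values are zero is counted as a perfect match (ratio 1) to avoid a division by zero. The function name similarity_score is illustrative and is not taken from the authors' script.

```python
import numpy as np

def similarity_score(reference, candidate):
    """Mean of min(|a|, |b|) / max(|a|, |b|) over corresponding elements."""
    ref = np.abs(np.asarray(reference, dtype=np.float64)).ravel()
    cand = np.abs(np.asarray(candidate, dtype=np.float64)).ravel()
    if ref.size == 0 or ref.shape != cand.shape:
        raise ValueError("layer outputs must be non-empty and of equal size")

    x = np.maximum(ref, cand)  # X in Eqs. (1) and (2)
    y = np.minimum(ref, cand)  # Y in Eqs. (1) and (2)

    # Assumption: if both magnitudes are zero the elements are identical,
    # so the ratio is taken as 1 instead of dividing by zero.
    ratio = np.ones_like(x)
    nonzero = x > 0
    ratio[nonzero] = y[nonzero] / x[nonzero]
    return float(ratio.mean())

# Made-up values standing in for one layer of File_SW and File_HW.
sw_layer = [0.5, -2.0, 0.0, 4.0]
hw_layer = [0.5, -1.0, 0.0, 5.0]
score = similarity_score(sw_layer, hw_layer)  # ratios: 1, 0.5, 1, 0.8
```

Because the score is built from absolute values, it compares magnitudes only; two elements that differ only in sign contribute a ratio of 1.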
Table 1 shows snippets of the layer-by-layer output values
of the LeNet DLA written to File_SW in the software stage,
File_Design in the design stage, and File_HW in the hardware
deployment stage. These values are used to obtain the
similarity scores in the design stage and deployment stage,
respectively.
Prior to the design stage, the DLA is trained in the Caffe
software environment using the float (32-bit precision)
data type. Hence the parameters and the Blobs of the DLA
are of float data type. The layer-by-layer outputs (Blobs) are
obtained and written to File_SW. In the design stage, the
DLA is simulated with float parameters and Blobs, and the
layer-by-layer output of the design stage is written to
File_Design. The values written to File_Design are compared
with the layer-by-layer outputs written to File_SW to obtain
the similarity scores at the design stage, as shown in Table 5a.
Once the result shows desirable similarity scores, an IP is
generated from the hardware design, exported, and configured
in Vivado to generate a bit-stream file that is deployed on the
PYNQ FPGA board. The layer-by-layer values output by the
PYNQ FPGA after deployment are written to File_HW to
obtain the layer-by-layer similarity score for the deployment
stage. The similarity score for the deployment stage for the
LeNet DLA is also shown in Table 5a.
From Table 5a, a similarity score of approximately 99% is
obtained for most layers of the LeNet DLA (0.969 to 0.99999
across layers) in both the design stage and the deployment
stage using the float data type.
FPGAs characteristically have limited area and hardware
resources (DSPs, LUTs, flip-flops, BRAM). For scalability, one
strategy to ensure that a large DLA fits on the FPGA board is
to truncate the bit-widths of its parameters and Blobs using
the Arbitrary Precision (AP) libraries available in the hardware
design stage in Vivado HLS. This truncation reduces the
memory and computation requirements of the large DLA.
For the LeNet DLA in this work, the parameters and Blobs
are truncated from 32-bit precision to 8-bit and 24-bit
precision, respectively. The truncation reduces the area of
the DLA on the board without compromising accuracy, as the
truncated parameters and Blobs are tested with 100 images
TABLE 1: Snippet of results from the layer-by-layer output of the LeNet DLA implementation using the default float (32-bit) data type.
The first column shows the Caffe output (Software), the second column shows the Vivado HLS output (Design) and the third column
shows the PYNQ FPGA (Hardware) output.
Columns: Layer, Caffe Output, Vivado HLS Output, PYNQ FPGA Output
Rows (cell values not recoverable): Data, conv1, pool1, conv2, prediction
TABLE 2: Snippet of results from the layer-by-layer output of the LeNet DLA implementation using Arbitrary Precision for
bit-width reduction in hardware design and deployment. The first column shows the Caffe output (Software), the second column
shows the Vivado HLS output (Design) and the third column shows the PYNQ FPGA (Hardware) output.
Columns: Layer, Caffe Output, Vivado HLS Output, PYNQ FPGA Output
Rows (cell values not recoverable): Data, conv1, pool1, conv2, prediction
TABLE 3: Snippet of results from the layer-by-layer output of the Cifar-10 DLA implementation using the default float (32-bit)
data type. The first column shows the Caffe output (Software), the second column shows the Vivado HLS output (Design) and the
third column shows the PYNQ FPGA (Hardware) output.
Columns: Layer, Caffe Output, Vivado HLS Output, PYNQ FPGA Output
Rows (cell values not recoverable): Data, conv1, pool1, conv2, prediction
TABLE 4: Snippet of results from the layer-by-layer output of the Cifar-10 DLA implementation using Arbitrary Precision for
bit-width reduction in hardware design and deployment. The first column shows the Caffe output (Software), the second column
shows the Vivado HLS output (Design) and the third column shows the PYNQ FPGA (Hardware) output.
Columns: Layer, Caffe Output, Vivado HLS Output, PYNQ FPGA Output
Rows (cell values not recoverable): Data, conv1, pool1, conv2, prediction
and they show predictions consistent with the hardware
design and deployment using the float data type. This
truncation changes the values of the parameters and hence
the Blobs, as shown in Table 2. The similarity scores
TABLE 5: Table of Results of Similarity Scores for LeNet DLA
(a) Table of Similarity Scores and Parameters Compared for Layer-by-Layer
Output of LeNet DLA When Hardware is Designed With Float Data Type
Layer   Files Compared            Similarity Score   Parameters Compared
conv1   File_SW vs. File_Design   0.99999            3456
        File_SW vs. File_HW       0.99999
pool1   File_SW vs. File_Design   0.99999            864
        File_SW vs. File_HW       0.99999
conv2   File_SW vs. File_Design   0.98153            1024
        File_SW vs. File_HW       0.98153
pool2   File_SW vs. File_Design   0.96887            256
        File_SW vs. File_HW       0.96887
conv3   File_SW vs. File_Design   0.99057            120
        File_SW vs. File_HW       0.99057
fc1     File_SW vs. File_Design   0.99333            84
        File_SW vs. File_HW       0.99333
fc2     File_SW vs. File_Design   0.99088            10
        File_SW vs. File_HW       0.99088
(b) Table of Similarity Scores and Parameters Compared for Layer-by-Layer
Output of LeNet DLA When Hardware is Designed With Arbitrary Precision
Data Type
Layer   Files Compared            Similarity Score   Parameters Compared
conv1   File_SW vs. File_Design   0.82564            3456
        File_SW vs. File_HW       0.82564
pool1   File_SW vs. File_Design   0.84104            864
        File_SW vs. File_HW       0.84104
conv2   File_SW vs. File_Design   0.74304            1024
        File_SW vs. File_HW       0.74304
pool2   File_SW vs. File_Design   0.68196            256
        File_SW vs. File_HW       0.68196
conv3   File_SW vs. File_Design   0.64940            120
        File_SW vs. File_HW       0.64940
fc1     File_SW vs. File_Design   0.66700            84
        File_SW vs. File_HW       0.66700
fc2     File_SW vs. File_Design   0.77909            10
        File_SW vs. File_HW       0.77909
of the design and deployment stages are obtained as shown
in Table 5b.
From Table 5b, similarity scores ranging from 65% to 84%
are obtained in the design and deployment stages when the
layer-by-layer output values written to File_Design and
File_HW are compared with the layer-by-layer output values
written to File_SW. This drop in similarity scores is due to the
bit-width truncation of the parameters and Blobs of the LeNet
DLA in the design stage and hence the deployment stage.
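The effect of bit-width truncation on individual values can be illustrated with a small Python emulation of a fixed-point type. This is only a sketch: the paper states the total bit-widths (8 bits for parameters, 24 bits for Blobs) but not the split between integer and fractional bits, so the int_bits value below is hypothetical. Dropping the extra fractional bits by rounding toward negative infinity and saturating on overflow are likewise assumptions of this sketch, not a statement about the configuration used in the Vivado HLS design.

```python
import numpy as np

def truncate_fixed(values, total_bits, int_bits):
    """Emulate a signed fixed-point value with `total_bits` bits in total,
    of which `int_bits` (including the sign bit) are integer bits."""
    frac_bits = total_bits - int_bits
    scale = 2.0 ** frac_bits

    # Drop the fractional bits that do not fit (round toward -infinity).
    q = np.floor(np.asarray(values, dtype=np.float64) * scale)

    # Assumption: values outside the representable range saturate.
    lowest = -(2 ** (total_bits - 1))
    highest = 2 ** (total_bits - 1) - 1
    q = np.clip(q, lowest, highest)
    return q / scale

# Made-up float weights truncated to 8 bits with 2 integer bits (hypothetical).
weights = [0.3, -0.7, 0.05]
weights_8bit = truncate_fixed(weights, total_bits=8, int_bits=2)
# 0.3 -> 19/64, -0.7 -> -45/64, 0.05 -> 3/64
```

Small weights lose proportionally more of their magnitude than large ones (0.05 becomes 0.046875 here), which is one way an element-wise magnitude ratio can fall even when the final prediction is unchanged.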
The prediction values written to the three files are
equivalent and consistent, which indicates the successful
implementation of the DLA on the PYNQ FPGA. Based on
the similarity score provided by the Python script,
recommendations can be made on where to debug or redesign
if the similarity score falls below a certain threshold. This
helps to avoid blind debugging of results during the hardware
implementation of a DLA. A total of 100 images are used to
validate our 3-way verification methodology, and the
predictions are consistent in all cases.
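The threshold check described above can be sketched in a few lines of Python. The scores are the LeNet arbitrary-precision values reported in Table 5b; the function name layers_to_debug and the threshold of 0.70 are hypothetical, since the paper does not fix a particular threshold.

```python
def layers_to_debug(scores, threshold):
    """Return, in network order, the layers whose score is below `threshold`."""
    return [layer for layer, score in scores.items() if score < threshold]

# Similarity scores per layer, taken from Table 5b (File_SW vs. File_HW).
lenet_ap_scores = {
    "conv1": 0.82564,
    "pool1": 0.84104,
    "conv2": 0.74304,
    "pool2": 0.68196,
    "conv3": 0.64940,
    "fc1": 0.66700,
    "fc2": 0.77909,
}

THRESHOLD = 0.70  # hypothetical value chosen only for illustration
flagged = layers_to_debug(lenet_ap_scores, THRESHOLD)
print(flagged)  # ['pool2', 'conv3', 'fc1']
```

Keeping the layers in network order means the first flagged entry is the earliest point at which the outputs diverge, which is a reasonable place to begin inspecting the design.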
Tables 3 and 4 show similar results for the implementation
of the DLA shown in Fig. 4 for the Cifar-10 dataset. This DLA
also consists of 8 layers and accepts a 32x32 input image. Like
Tables 1 and 2, Tables 3 and 4 show subsections of the
2D matrices written to File_SW, File_Design and File_HW
for the float data type and for the implementation using the
arbitrary precision data type for bit-width reduction,
respectively. When compared with File_SW, the Python script
returns a similarity score of about 99% for the values written
to the files in the design stage (File_Design) and deployment
stage (File_HW) using the float data type, as seen in Table 6a,
and a similarity score ranging from 65% to 84%, as seen in
Table 6b, for the values written to the files in the design
stage (File_Design) and deployment stage (File_HW) using
the arbitrary precision data type. The arbitrary precision
predictions are consistent with the predictions obtained
using the float data type for 100 images.
6 RELATED WORK
Several approaches in the existing literature have been adopted
to achieve efficient mapping of DLAs to FPGA boards. Guo
et al. [20] propose a design flow for mapping CNNs onto
embedded FPGAs. In [20], data quantization is introduced
to reduce the bit-width of CNN models, achieving smaller
memory and computation requirements with negligible
accuracy loss. A compiler that maps the CNN to the FPGA is
also proposed.
Kästner et al. [39] propose a tool flow for the
hardware/software codesign implementation of CNNs on
PYNQ FPGAs. FPGAs possess Dynamic Partial Reconfiguration
(DPR) capabilities that enable the exchange of logic
partitions within the FPGA fabric. This property offers a
major advantage for designing hardware architectures, using
high-level synthesis, that can adapt and reconfigure the
hardware according to the characteristics of the DLA.
Mu et al. [27] propose a collaborative framework
to optimize OpenCL-based CNN designs. LoopTree is
introduced to capture the main features of an OpenCL-based
hardware design. Hardware
TABLE 6: Table of Results of Similarity Scores for Cifar-10 DLA
(a) Table of Similarity Scores and Parameters Compared for Layer-by-Layer
Output of Cifar-10 DLA When Hardware is Designed With Float Data Type
Layer         Files Compared            Similarity Score   Parameters Compared
conv1         File_SW vs. File_Design   0.99889            5120
              File_SW vs. File_HW       0.99889
pool1_relu1   File_SW vs. File_Design   0.99885            1280
              File_SW vs. File_HW       0.99885
conv2_relu2   File_SW vs. File_Design   1.00510            2560
              File_SW vs. File_HW       1.00510
pool2         File_SW vs. File_Design   0.99896            640
              File_SW vs. File_HW       0.99896
conv3_relu3   File_SW vs. File_Design   0.99732            960
              File_SW vs. File_HW       0.99732
pool3         File_SW vs. File_Design   1.00856            240
              File_SW vs. File_HW       1.00856
fc1           File_SW vs. File_Design   0.98541            50
              File_SW vs. File_HW       0.98541
fc2           File_SW vs. File_Design   0.99964            10
              File_SW vs. File_HW       0.99964
(b) Table of Similarity Scores and Parameters Compared for Layer-by-Layer
Output of Cifar-10 DLA When Hardware is Designed With Arbitrary Precision
Data Type
Layer         Files Compared            Similarity Score   Parameters Compared
conv1         File_SW vs. File_Design   0.99889            5120
              File_SW vs. File_HW       0.99889
pool1_relu1   File_SW vs. File_Design   0.99885            1280
              File_SW vs. File_HW       0.99885
conv2_relu2   File_SW vs. File_Design   1.00510            2560
              File_SW vs. File_HW       1.00510
pool2         File_SW vs. File_Design   0.99896            640
              File_SW vs. File_HW       0.99896
conv3_relu3   File_SW vs. File_Design   0.99732            960
              File_SW vs. File_HW       0.99732
pool3         File_SW vs. File_Design   1.00856            240
              File_SW vs. File_HW       1.00856
fc1           File_SW vs. File_Design   0.98541            50
              File_SW vs. File_HW       0.98541
fc2           File_SW vs. File_Design   0.99964            10
              File_SW vs. File_HW       0.99964
design specifications such as loop orders, loop tiling, Block
RAM (BRAM) and Double Data Rate (DDR) configurations,
and OpenCL attributes are utilized. A coarse-grained model
is then employed to evaluate the performance of LoopTree
and to find candidate designs. Finally, a fine-grained model
is employed to tune the candidate designs to obtain the best
design for deployment on the hardware. In addition, [40]
proposes weight compression and weight sharing for neural
networks to allow efficient hardware resource utilization,
enabling large neural network models to fit in ASICs and
FPGAs.
Xiang et al. [25] propose a software simulation-based
approach for the verification of multilayer neural networks.
Their algorithm measures the maximum sensitivity of the
output over a finite number of simulations corresponding to
different bounded inputs. The sensitivity of the network is
given as the mathematical expectation of output deviations
due to input and weight deviations with respect to overall
input and weight values in a given continuous interval. The
maximum sensitivity is used to measure the maximum
deviation of the outputs caused by bounded disturbances
around the input. It represents the output reachable sets
of the network and is computed layer-by-layer. These
measurements are used for the verification of the
layer-by-layer output of the network.
Dwarakanath et al. [26] propose a software-based approach
for the verification of machine learning-based image
classifiers using metamorphic testing. This approach builds
multiple relationships between the outputs of a classifier
for different inputs to derive the degree of correctness of
the implementation of the classifier, and it is designed to
detect implementation bugs. The metamorphic tests permute
the training and testing input features, the training
instances and the layers, and also scale the test data
samples of the image classifier to generate different outputs.
Choi et al. [41] propose a stochastic functional verification
method for designing DNN-based systems. In this
approach, synthetic datasets are generated in a virtual
environment and added to the training set for a DNN.
The DNN is trained with both datasets and validated with
validation subsets of both datasets. A comparison metric
such as class-wise average precision is used to compare
the performance of the model on both validation datasets
against a predefined threshold. For a DNN under verification,
the DNN is trained with synthetic datasets and the
comparison metric is obtained. The similarity between the
comparison metric and the predefined threshold is used to
validate the verification.
Hao et al. [28] propose a time-saving co-design
methodology that simultaneously searches possible design
options to auto-generate efficient DNNs optimized for
FPGA deployment. [28] introduces a template for the
generation of DNNs with efficient performance and hardware
resource utilization. An automatic HLS generator is proposed
to help translate the auto-generated DNN to synthesizable
C code for hardware deployment.
Reference [42] is a GitHub repository that contains the C++
code (design code) for mapping the LeNet DLA onto hardware.
The repository shows the weights and algorithms of each layer.
The code is an already finished DLA for an FPGA board.
The repository does not state which framework was used to
train the DLA and does not provide a means of debugging and
validating the output of each layer in order to accomplish
design-time verification at every stage.
References [43] and [44] give an introduction to the
deployment of machine learning on hardware. They only
show stacks and block diagrams of how neural networks
are utilized on hardware, along with the number of
parameters and MACC (Multiply-Accumulate) units required
by a DLA. They do not give a full picture from training to
testing and successful deployment of DLAs on FPGA
boards and other hardware.
These approaches are either limited to the software envi-
ronment or they do not take into consideration the verifica-
tion of the implementation correctness of the DLA mapping
onto hardware across all the design stages involved.
7 COMPARISON WITH STATE-OF-THE-ART
Some state-of-the-art approaches have been adopted to
ascertain the implementation correctness of DLAs. Guo et al.
[20] propose an approach that allows for the hardware-software
co-design of DLAs on FPGAs. This approach only
validates the DLA at the final layers of the software and
hardware. Its limitation is that it does not account for
layer-by-layer verification of the output of the layers.
Kästner et al. [39] propose a tool flow approach for the
hardware-software co-design of DLAs on FPGAs. The co-design
is validated at the final layers of the software and
hardware. This tool flow approach does not consider the
verification of the layer-by-layer outputs to ensure
implementation correctness on the hardware. The approach
also does not provide a means of debugging in case of errors.
Mu et al. [27] propose a collaborative framework
to optimize the deployment of DLAs on FPGAs. This
approach validates the correctness of the deployment of the
DLA only at the final layers of the hardware deployment.
Its limitation is that it does not account for the software
implementation and, at the hardware level, it does not
provide layer-by-layer verification of the DLA.
Xiang et al. [25] propose a software simulation-based
approach to verify the correctness of a DLA. The verification
approach is limited to the layer-by-layer output and the
accuracy of the final prediction. It does not provide a means
of verifying the implementation correctness of the mapping
of DLAs onto hardware.
Dwarakanath et al. [26] propose a software-based approach
to verify the correctness of image classifiers. The
verification approach covers the layer-by-layer output
and the accuracy of the final prediction only. It does not
consider how the verification could be applied to the
mapping of DLAs onto FPGA boards.
Choi et al. [41] introduce a stochastic functional verification
method using synthetic datasets. This method verifies
the layer-by-layer output and the accuracy of the deep
learning model. The approach is not scalable when trying
to achieve successful mapping of DLAs onto FPGA boards.
Hao et al. [28] propose a co-design methodology that
simultaneously generates a software design model and
synthesizable C code for the hardware design. This
approach only validates the design based on the prediction
accuracy of the model.
Table 7 shows that, of the approaches compared, only our
proposed method covers all six types of verification listed.
8 CONCLUSIONS
This work proposes a 2-Level 3-Way methodology for the
hardware-software co-verification of DLAs, from the deep
learning software framework to the HLS design of the DLA
and finally to the deployment of the DLA on the FPGA board.
This methodology is used to test the hardware implementation
correctness of 2 DLAs (LeNet and a Caffe-inspired Cifar-10
network) on the PYNQ FPGA board. To the best of the authors'
knowledge, this is the first methodology that performs
layer-by-layer co-verification for the mapping of DLAs
across the 3 paradigms (software, design and hardware
level). The methodology can help to achieve successful
implementation and mapping of DLAs onto FPGAs during the
design phase and can help in the cross-paradigm debugging
process. We proposed a new metric for cross-paradigm
co-verification, called the similarity score, which serves as
a measure of the degree of correctness of the implementation
of each layer. The similarity score also helps to identify
layers that need debugging. Our implementation results from
Caffe software to Vivado HLS design and finally to Xilinx's
PYNQ FPGA show similarity scores of 99% for LeNet and the
Caffe-inspired Cifar-10 network in the design stage. A range
of similarity scores from 65% to 84% is obtained in the
deployment stage due to truncation of the bit-width of the
LeNet DLA so that it can fit on the PYNQ FPGA board. This
indicates the successful mapping of the DLA onto the PYNQ
FPGA board.
REFERENCES
[1] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
FPGA-based accelerator design for deep convolutional neural
networks,” in Proceedings of the 2015 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp.
161–170.
[2] T. A. Odetola, O. Ogheneuriri, and S. R. Hasan, “A scalable
multilabel classification to deploy deep learning architectures for
edge devices,” arXiv preprint arXiv:1911.02098, 2019.
[3] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, “DLAU: A
scalable deep learning accelerator unit on FPGA,” IEEE Transac-
tions on Computer-Aided Design of Integrated Circuits and Systems,
vol. 36, no. 3, pp. 513–517, 2017.
[4] M. Baza, N. Lasla, M. Mahmoud, and M. Abdallah, “B-ride: Ride
sharing with privacy-preservation, trust and fair payment atop
public blockchain,” arXiv preprint arXiv:1906.09968, 2019.
[5] M. Baza, M. Nabil, N. Lasla, K. Fidan, M. Mahmoud, and M. Ab-
dallah, “Blockchain-based firmware update scheme tailored for
autonomous vehicles,” Proc. of the IEEE Wireless Communications
and Networking Conference (WCNC), Marrakech, Morocco, April 2019.
TABLE 7: Verification Approach Comparison With Other Works.
Approach                            Verification Level            [20]  [39]  [27]  [25]  [26]  [41]  [28]  2L-3W
Software Verification               Layer-by-Layer Verification   x     x     x     √     √     √     x     √
                                    Final Layer Verification      √     √     x     √     √     √     √     √
Hardware Verification               Layer-by-Layer Verification   x     x     x     x     x     x     x     √
                                    Final Layer Verification      √     √     √     x     x     x     √     √
Hardware-Software Co-Verification   Layer-by-Layer Verification   x     x     x     x     x     x     x     √
                                    Final Layer Verification      x     x     x     x     x     x     x     √
[6] W. Al Amiri, M. Baza, M. Mahmoud, K. Banawan, W. Alasmary,
and K. Akkaya, “Privacy-preserving smart parking system using
blockchain and private information retrieval,” Proc. of the IEEE
International Conference on Smart Applications, Communications and
Networking (SmartNets 2019), 2020.
[7] M. Baza, M. Nabil, M. Ismail, M. Mahmoud, E. Serpedin, and
M. Rahman, “Blockchain-based charging coordination mechanism
for smart grid energy storage units,” Proc. Of IEEE International
Conference on Blockchain, Atlanta, USA, July, 2019.
[8] W. Al Amiri, M. Baza, K. Banawan, M. Mahmoud, W. Alasmary,
and K. Akkaya, “Towards secure smart parking system using
blockchain technology,” Proc. of 17th IEEE Annual Consumer Com-
munications & Networking Conference (CCNC), Las vegas, USA, 2020.
[9] M. Baza, M. Pazos-Revilla, M. Nabil, A. Sherif, M. Mahmoud,
and W. Alasmary, “Privacy-preserving and collusion-resistant
charging coordination schemes for smart grid,” arXiv preprint
arXiv:1905.04666, 2019.
[10] M. Baza, M. Nabil, N. Bewermeier, K. Fidan, M. Mahmoud, and
M. Abdallah, “Detecting sybil attacks using proofs of work and
location in vanets,” arXiv preprint arXiv:1904.05845, 2019.
[11] M. Baza, M. Mahmoud, G. Srivastava, W. Alasmary, and M. Younis,
“A light blockchain-powered privacy-preserving organization
scheme for ride sharing services,” Proc. of the IEEE 91st Vehicular
Technology Conference (VTC-Spring), Antwerp, Belgium, May 2020.
[12] M. Baza, A. Salazar, M. Mahmoud, M. Abdallah, and K. Akkaya,
“On sharing models instead of data using mimic learning for
smart health applications,” Proc. of the IEEE International Conference
on Informatics, IoT, and Enabling Technologies (ICIoT’20) , Doha, Qatar,
Feb. 2020.
[13] A. Shafee, M. Baza, D. A. Talbert, M. M. Fouda, M. Nabil, and
M. Mahmoud, “Mimic learning to generate a shareable network
intrusion detection model,” Proc. of the IEEE Consumer Communi-
cations & Networking Conference,Las Vegas, USA, 2020.
[14] M. Baza, M. M. Fouda, A. S. T. Eldien, and H. A. Mansour, “An
efficient distributed approach for key management in microgrids,”
Proc. of the Computer Engineering Conference (ICENCO), Egypt, pp.
19–24, 2015.
[15] M. Baza, M. Fouda, M. Nabil, A. S. Tag, H. Mansour, and M. Mah-
moud, “Blockchain-based distributed key management approach
tailored for smart grid,” in Combating Security Challenges in the Age
of Big Data. Springer, 2019.
[16] M. Baza, J. Baxter, N. Lasla, M. Mahmoud, M. Abdallah, and
M. Younis, “Incentivized and secure blockchain-based firmware
update and dissemination for autonomous vehicles,” in Connected
and Autonomous Vehicles in Smart Cities. CRC press, 2020.
[17] T. A. Odetola, H. R. Mohammed, and S. R. Hasan, “A stealthy
hardware trojan exploiting the architectural vulnerability of
deep learning architectures: Input interception attack (iia),” arXiv
preprint arXiv:1911.00783, 2019.
[18] M. Bacis, G. Natale, E. Del Sozzo, and M. D. Santambrogio, “A
pipelined and scalable dataflow implementation of convolutional
neural networks on fpga,” in 2017 IEEE International Parallel and
Distributed Processing Symposium Workshops (IPDPSW). IEEE,
2017, pp. 90–97.
[19] M. T. Hailesellasie and S. R. Hasan, “MulNet: A Flexible CNN Pro-
cessor with Higher Resource Utilization Efficiency for Constrained
Devices,” IEEE Access, 2019.
[20] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang,
and H. Yang, “Angel-Eye: A complete design flow for mapping
CNN onto embedded FPGA,” IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35–47,
2017.
[21] J. Park and W. Sung, “FPGA based implementation of deep neural
networks using on-chip memory only,” in Acoustics, Speech and
Signal Processing (ICASSP), 2016 IEEE International Conference on.
IEEE, 2016, pp. 1011–1015.
[22] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-
net: Imagenet classification using binary convolutional neural
networks,” in European Conference on Computer Vision. Springer,
2016, pp. 525–542.
[23] X. Zhang, A. Ramachandran, C. Zhuge, D. He, W. Zuo, Z. Cheng,
K. Rupnow, and D. Chen, “Machine learning on fpgas to face the
iot revolution,” in Proceedings of the 36th International Conference on
Computer-Aided Design. IEEE Press, 2017, pp. 819–826.
[24] L.-T. Wang, Y.-W. Chang, and K.-T. T. Cheng, Electronic design
automation: synthesis, verification, and test. Morgan Kaufmann,
2009.
[25] W. Xiang, H.-D. Tran, and T. T. Johnson, “Output reachable set
estimation and verification for multilayer neural networks,” IEEE
transactions on neural networks and learning systems, vol. 29, no. 11,
pp. 5777–5783, 2018.
[26] A. Dwarakanath, M. Ahuja, S. Sikand, R. M. Rao, R. Bose,
N. Dubash, and S. Podder, “Identifying implementation bugs
in machine learning based image classifiers using metamorphic
testing,” in Proceedings of the 27th ACM SIGSOFT International
Symposium on Software Testing and Analysis. ACM, 2018, pp. 118–
128.
[27] J. Mu, W. Zhang, H. Liang, and S. Sinha, “A Collaborative Frame-
work for FPGA-based CNN Design Modeling and Optimization,”
in 2018 28th International Conference on Field Programmable Logic and
Applications (FPL). IEEE, 2018, pp. 139–1397.
[28] C. Hao, X. Zhang, Y. Li, S. Huang, J. Xiong, K. Rupnow, W.-m.
Hwu, and D. Chen, “FPGA/DNN Co-Design: An Efficient Design
Methodology for IoT Intelligence on the Edge,” arXiv preprint
arXiv:1904.04421, 2019.
[29] H. Park, C. Lee, H. Lee, Y. Yoo, Y. Park, I. Kim, and K. Yi, “Op-
timizing DCNN FPGA accelerator design for handwritten hangul
character recognition: work-in-progress,” in Proceedings of the 2017
International Conference on Compilers, Architectures and Synthesis for
Embedded Systems Companion. ACM, 2017, p. 11.
[30] D. O’Loughlin, A. Coffey, F. Callaly, D. Lyons, and F. Morgan,
“Xilinx vivado high level synthesis: Case studies,” 2014.
[31] G. Lacey, G. W. Taylor, and S. Areibi, “Deep learning on FPGAs:
Past, present, and future,” arXiv preprint arXiv:1602.04283, 2016.
[32] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture
for fast feature embedding,” in Proceedings of the 22nd ACM inter-
national conference on Multimedia. ACM, 2014, pp. 675–678.
[33] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for effi-
cient dnns,” in Advances In Neural Information Processing Systems,
2016, pp. 1379–1387.
[34] B. Janßen, T. Wingender, and M. Hübner, “Hardware Accelerator
Framework Approach for Dynamic Partial Reconfigurable
Overlays on Xilinx PYNQ,” in INFORMATIK 2017, M. Eibl and
M. Gaedke, Eds. Gesellschaft für Informatik, Bonn, 2017, pp.
481–492.
[35] B. Janßen, P. Zimprich, and M. Hübner, “A dynamic partial recon-
figurable overlay concept for pynq,” in 2017 27th International Con-
ference on Field Programmable Logic and Applications (FPL). IEEE,
2017, pp. 1–4.
[36] Xilinx, “Python productivity for Zynq (Pynq) Documentation
Release 2.2,” https://buildmedia.readthedocs.org/media/pdf/
pynq/latest/pynq.pdf, 2019.
[37] J. Johnson, “Using the AXI DMA in Vivado,” http://www.
fpgadeveloper.com/2014/08/using-the-axi-dma-in-vivado.html,
2014.
[38] Xilinx, “AXI DMA Controller,” https://www.xilinx.com/
products/intellectual-property/axi_dma.html, 2019.
[39] F. Kästner, B. Janßen, F. Kautz, M. Hübner, and G. Corradi,
“Hardware/software codesign for convolutional neural networks
exploiting dynamic partial reconfiguration on pynq,” in 2018 IEEE
International Parallel and Distributed Processing Symposium Work-
shops (IPDPSW). IEEE, 2018, pp. 154–161.
[40] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and
W. J. Dally, “EIE: efficient inference engine on compressed deep
neural network,” in Computer Architecture (ISCA), 2016 ACM/IEEE
43rd Annual International Symposium on. IEEE, 2016, pp. 243–254.
[41] J. Choi, K. M. Irick, J. Hardin, W. Qiu, A. Yuille, J. Sampson, and
V. Narayanan, “Stochastic functional verification of dnn design
through progressive virtual dataset generation,” in 2018 IEEE
International Symposium on Circuits and Systems (ISCAS). IEEE,
2018, pp. 1–5.
[42] C. woo Lee, “FPGA Accelerator for CNN using Vivado HLS,”
https://github.com/changwoolee/lenet5_hls, 2018.
[43] S. Evanczuk, “Get Started with Machine Learning
Using Readily Available Hardware and Software,”
https://www.digikey.com/en/articles/techzone/2018/aug/
get-started-machine-learning-hardware-and-software, 2018.
[44] Xilinx, “Accelerating DNNs with Xilinx Alveo Accelera-
tor Cards,” https://www.xilinx.com/support/documentation/
white_papers/wp504-accel-dnns.pdf, 2018.
