Neural Network in Hardware by Si, Jiong
UNLV Theses, Dissertations, Professional Papers, and Capstones 
12-15-2019 
Neural Network in Hardware 
Jiong Si 
 Part of the Electrical and Computer Engineering Commons 
Repository Citation 
Si, Jiong, "Neural Network in Hardware" (2019). UNLV Theses, Dissertations, Professional Papers, and 
Capstones. 3845. 
http://dx.doi.org/10.34917/18608784 
This Dissertation is protected by copyright and/or related rights. It has been brought to you by Digital 
Scholarship@UNLV with permission from the rights-holder(s). You are free to use this Dissertation in any way that 
is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to 
obtain permission from the rights-holder(s) directly, unless additional rights are indicated by a Creative Commons 
license in the record and/or on the work itself. 
 
This Dissertation has been accepted for inclusion in UNLV Theses, Dissertations, Professional Papers, and 
Capstones by an authorized administrator of Digital Scholarship@UNLV. For more information, please contact 
digitalscholarship@unlv.edu. 
NEURAL NETWORKS IN HARDWARE
By
Jiong Si
Bachelor of Engineering – Automation
Chongqing University of Science and Technology
2008
Master of Engineering – Precision Instrument and Machinery
Hefei University of Technology
2011
A dissertation submitted in partial fulfillment
of the requirements for the
Doctor of Philosophy – Electrical Engineering
Department of Electrical and Computer Engineering
Howard R. Hughes College of Engineering
The Graduate College













Copyright 2019 by Jiong Si 






The Graduate College 
The University of Nevada, Las Vegas 
        
November 6, 2019
This dissertation prepared by  
Jiong Si 
entitled  
Neural Networks in Hardware 
is approved in partial fulfillment of the requirements for the degree of 
Doctor of Philosophy – Electrical Engineering 
Department of Electrical and Computer Engineering 
 
                
Sarah Harris, Ph.D.       Kathryn Hausbeck Korgan, Ph.D. 
Examination Committee Chair      Graduate College Dean 
 
Shahram Latifi, Ph.D. 
Examination Committee Member 
        
R. Jacob Baker, Ph.D. 
Examination Committee Member 
 
Evangelos Yfantis, Ph.D. 







This dissertation describes the implementation of several neural networks built on a field 
programmable gate array (FPGA) and used to recognize a handwritten digit dataset – the 
Modified National Institute of Standards and Technology (MNIST) database. A novel hardware-
friendly activation function called the dynamic ReLU (D-ReLU) function is proposed. This 
activation function can decrease chip area and power of neural networks when compared to 
traditional activation functions at no cost to prediction accuracy.  
The implementations of three neural networks on an FPGA are presented: a 2-layer fully-connected 
neural network with online training, a 3-layer fully-connected neural network with offline training, 
and two solutions of the Super-Skinny Convolutional Neural Network (SS-CNN). The 2-layer online 
training fully-connected neural network was built on an FPGA with varying data width. Reducing the 
data width from 8 to 4 bits reduces prediction accuracy by only 11%, while the FPGA area decreases 
by 41%. The 3-layer offline training fully-connected neural network was built on an FPGA with both 
the sigmoid and the proposed D-ReLU activation functions. Compared to networks that use the 
sigmoid function, networks that use the proposed D-ReLU function require 24-41% less area with no 
loss in prediction accuracy. Further reducing the data width of the 3-layer networks from 8 to 4 bits 
decreased the prediction accuracy by only 3-5% while reducing the area by 9-28%. The proposed 
sequential and parallel SS-CNN networks achieve state-of-the-art (99%) recognition accuracy but 
with fewer layers and fewer neurons than prior works such as the LeNet-5 network. Using 
parameters with 8 bits of precision, the FPGA solutions of this SS-CNN show no recognition 
accuracy loss when compared to the 32-bit floating point software solution. In addition to high 




cost Cyclone IVE FPGA. Moreover, these FPGA solutions execute up to 145× faster than the 
software solutions, despite running at a 97× to 120× lower clock rate.  
Thus, FPGA implementations of neural networks offer a high-performance, low-power 
alternative to traditional software methods, and the proposed novel D-ReLU activation function 
offers additional improvements to performance and power savings. Furthermore, the hardware 
solutions of the proposed SS-CNN provide a high-performance, hardware-friendly, and power 






I’d like to thank my advisor Dr. Sarah L. Harris for her continuous guidance, support and 
encouragement during my five years of study and work at UNLV. She shared a lot of her 
experience with me; we talked about research, life as a Ph.D. student, career, family, and play. 
She is my mentor, role model, elder sister and good friend.  
I would also like to thank one of my idols, Dr. Evangelos Yfantis from the Computer Science 
department at UNLV. I enjoyed every minute in his class “Neural Networks and Genetic 
Algorithms”. It is always inspiring and exciting talking with him. I hope one day I can fly a plane 
like him and enjoy working as much as he does. 
I also thank my other committee members, Dr. Shahram Latifi, Dr. R. Jacob Baker, and Dr. 
Justin Zhan, for serving on my committee and giving me a lot of advice on my research and 
dissertation. 
Dr. Yingtao Jiang, Dr. Grzegorz Chmaj, Ms. Jennifer Reff and other professors and staff 
members in the ECE department have also been very supportive and helpful in my journey here at 
UNLV.  
Outside of UNLV, I would like to thank my hiking friends – they are my family in Las Vegas. 





Table of Contents 
 
Abstract 
Acknowledgements 
Table of Contents 
List of Tables 
List of Figures 
Chapter 1  Introduction 
Chapter 2  Background 
2.1 Fully-connected Neural Network 
2.2 Convolutional Neural Network 
2.3 Activation Functions 
2.3.1 Sigmoid Function 
2.3.2 ReLU Function 
2.4 Background Summary 
Chapter 3  Methodology and Procedures 
3.1 MNIST Dataset 
3.2 2-layer Fully-connected Neural network 
3.2.1 Network Architecture 
3.2.2 FPGA System 
3.3 3-layer Fully-connected Neural network 
3.3.1 Network Architecture 
3.3.2 FPGA System 
3.4 Super-Skinny Convolutional Neural Network (SS-CNN) 
3.4.1 Network Design and Architecture 
3.4.2 FPGA System 
3.5 Sigmoid Function Approximation 
3.6 Dynamic ReLU Function 
Chapter 4  Results and Discussion 




4.1.1 Recognition Accuracy 
4.1.2 Performance 
4.1.3 FPGA Area 
4.2 3-layer Fully-connected Neural network on an FPGA 
4.2.1 Recognition Accuracy 
4.2.2 Performance 
4.2.3 FPGA Area 
4.3 Super-Skinny Convolutional Neural Network on an FPGA 
4.3.1 Recognition Accuracy 
4.3.2 Performance 
4.3.3 FPGA Area 
Chapter 5  Conclusion 
Appendix A Software Code of Neural Networks 
A.1 2-layer Fully-connected Neural Network in Python 
A.2 3-layer Fully-connected Neural Network in C 
A.3 SS-CNN in C 
Appendix B Hardware Code of Neural Networks 
B.1 2-layer Fully-connected Neural Network with Online Training in SystemVerilog 
File: control_unit.sv 
File: tile.sv 
File: sigmoid_plan.sv 
B.2 3-layer Fully-connected Neural Network with Offline Training in SystemVerilog 
File: tile.sv 
B.3 SS-CNN in SystemVerilog 
File: tile.sv 
Bibliography 





List of Tables 
 
Table 1. Implementation of PLAN [28] 
Table 2. Recognition accuracy of 2-layer fully-connected neuron network with online training 
Table 3. Execution time and potential power savings of 2-layer fully-connected neural network with online training 
Table 4. FPGA area of 2-layer fully-connected neural network with online training 
Table 5. FPGA area of forward propagation of 2-layer fully-connected neural network 
Table 6. Recognition accuracies of 2- & 3-layer fully-connected neural networks 
Table 7. Execution time and potential power savings of 2-layer fully-connected neural network 
Table 8. Execution time and potential power savings of 3-layer fully-connected neural network 
Table 9. FPGA areas of 2- & 3-layer fully-connected neural networks 
Table 10. FPGA areas of 2- & 3-layer fully-connected neural networks with different activation functions 
Table 11. Recognition accuracies of three neural networks 
Table 12. Execution time and potential power savings of two SS-CNN solutions on FPGA 





List of Figures 
 
Figure 1. Multilayer perceptron network 
Figure 2. Convolutional neural network (Figure adapted from [1]) 
Figure 3. Sigmoid function 
Figure 4. ReLU function 
Figure 5. Leaky ReLU/Parametric ReLU function (Figure adapted from [25]) 
Figure 6. Examples of digits in the MNIST dataset 
Figure 7. 2-layer fully-connected neural network 
Figure 8. FPGA system architecture 
Figure 9. FPGA platform 
Figure 10. Controller finite state machine (FSM) 
Figure 11. 3-layer fully-connected neural network 
Figure 12. FPGA system architecture 
Figure 13. Controller finite state machine (FSM) 
Figure 14. Executable action of test program 
Figure 15. Super-skinny convolutional neural network 
Figure 16. FPGA system architecture of Super-skinny convolutional neural network 
Figure 17. Controller finite state machine (FSM) 
Figure 18. Convolution for one kernel frame 
Figure 19. Multiplication of the image data with one convolutional kernel frame 
Figure 20. Adder tree for image and kernel frame multiplication results 
Figure 21. Convolutional layer finite state machine (FSM) with sequential feature channel 
Figure 22. Convolutional layer finite state machine (FSM) with parallel feature channel 
Figure 23. Max-Pooling layer 
Figure 24. Fully-connected layers’ finite state machine (FSM) 
Figure 25. Sigmoid function and its approximation using PLAN 
Figure 26. ReLU function with a positive threshold 
Figure 27. ReLU function with a negative threshold 
Figure 28. Software solution accuracy vs. FPGA solutions accuracy 
Figure 29. Accuracy (percentage of correct predictions) vs. area (relative to 8-bit version) for the 4-, 5-, 6-, and 8-bit fully-connected neural network FPGA designs 
Figure 30. Data width vs. FPGA area (number of FPGA logic elements) 
Figure 31. Data width vs. FPGA area of 3-layer fully-connected neural network with sigmoid 






Chapter 1  Introduction 
 
Machine learning and deep learning algorithms and their applications are becoming 
increasingly prevalent. One popular type of machine learning algorithm, the neural network, is 
inspired by biological neural networks and can “learn” to perform tasks such as speech or image 
recognition by “studying” examples, without being pre-programmed with any task-specific rules. 
Because tasks are becoming increasingly complex, neural networks are getting larger and deeper, 
and they are becoming slower to execute, more compute-intensive, and more power-hungry. 
For example, AlexNet [1] has 8 layers and 62 million parameters, VGG Net [2] has 16 
layers and 138 million parameters, and GoogLeNet [3] has 22 layers and 7 million parameters.  
Currently, machine learning algorithms typically run on CPUs/GPUs or in the cloud, but 
these platforms have shortcomings, the most notable of which are high power consumption, lack 
of privacy, and network delay. Solutions on GPUs consume large amounts of power, and the GPU 
itself is expensive. When a user’s machine learning program runs in the cloud, the data it 
uses, such as personal information, can be at risk. Network delay is also 
an issue: a user must upload a task, wait for the calculation, and then download the 
result from the cloud, which increases the delay in receiving the result and in iterating to 
improve the process.  
Hardware solutions are fast and power efficient, and they protect privacy because they can 




hardware specially designed for neural networks. For example, AlphaZero [4] is a computer 
program implemented on the Tensor Processing Unit (TPU) [5] designed by Google. A TPU is a 
custom application-specific integrated circuit (ASIC) designed to implement 
neural networks. The TPU is 15-30 times faster and 30-80 times more power efficient than 
contemporary GPUs and CPUs when running neural networks.  
On the other hand, field programmable gate arrays (FPGAs) offer similar advantages to 
ASICs, such as Google’s TPU, by providing the ability to configure hardware specific to neural 
networks, including parallel execution. This hardware-specific design on an FPGA offers 
increased performance, lower power consumption, and decreased cost compared to a CPU or 
GPU implementation. It also offers the advantage over ASIC designs of increased flexibility and 
decreased implementation time. Neural networks built on an FPGA platform offer a potential 
solution to the compute-intensive and high power demands of neural networks by providing 
algorithm-specific hardware that is easy to parallelize and pipeline architecturally and 
algorithmically, and that facilitates the quantization and compression of weights. Moreover, 
because machine learning and deep learning algorithms, architectures, and applications are 
changing and updating all the time, the flexibility of FPGA design offers an ideal alternative to 
ASIC implementations.  
My research concerns the implementation and acceleration of neural networks on an FPGA to 
achieve higher-performance, lower-power solutions that can be used in portable or offline devices. 




1. Accelerating machine learning algorithms on an FPGA using parallel computations to 
achieve high speed and low power consumption. 
2. Implementing new hardware-friendly machine learning algorithms and architectures to 
achieve small area, high speed, and low power consumption solutions. 
After describing the background work on implementing neural networks in hardware in 
Chapter 2, this dissertation presents the three neural networks developed on an FPGA platform in 
Chapter 3, along with various hardware-friendly activation functions. Chapter 4 
compares each implementation’s performance with software solutions, analyzing their prediction 
accuracy, execution time, FPGA area requirements, and power cost. Chapter 5 summarizes the 
findings and concludes that FPGA solutions of neural networks provide 
high-performance, low-cost, and lightweight hardware alternatives to the traditional CPU and 
GPU solutions. 
The work in this dissertation narrows the gap between software and hardware solutions of 
neural networks and also provides a bridge for software and hardware cooperation and integration 
in the implementation of neural networks, which results in higher performance, more flexible, and 




Chapter 2  Background 
 
Neural networks play an important role in the artificial intelligence (AI) world. In recent years, 
with the support of increasingly powerful computation resources, neural networks have become 
larger and more powerful. These networks continue to reach higher recognition accuracies, 
sometimes even surpassing human ability. For example, the deep residual learning network in [6] 
achieves a top-5 error rate of 3.57% on the ImageNet dataset [7], whereas the estimated human 
classification error is 5.1% [8]. Neural networks typically include at least these three main 
elements: fully-connected layers, convolutional layers, and activation functions. The combination of 
just these three elements produces many high-performance neural networks such as AlexNet [1], 
VGG Net [2] and GoogLeNet [3]. In this chapter, we present recent related research on neural 
networks implemented in hardware by first introducing the fundamental neural networks: the 
fully-connected neural network and the convolutional neural network (CNN). In this context, we 
describe recent work that has implemented these neural networks in hardware and we also 
describe the most common activation functions: the sigmoid and ReLU activation functions. 
Beyond traditional neural networks, deep neural networks have many layers and millions of 
parameters, which require massive computation resources. This chapter also describes popular 
neural network compression techniques which help deep neural networks decrease their 





2.1 Fully-connected Neural Network 
Fully-connected neural networks are the classical type of neural network and contain many 
neurons organized into layers. Each neuron computes a weighted sum of its inputs and passes that 
sum through an activation function; the resulting values form that layer’s outputs, which are used 
as the inputs for the next layer. The fully-connected neural network shown in Figure 1 has four 
layers: one input layer, two hidden layers, and one output layer. Each neuron in one layer is 
connected to every neuron in the next layer. The information from the input is passed from layer 
to layer, where each layer performs some computation on the information. In this way, the 
information from the input layer is processed and sent to the output layer. 
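
The layer-by-layer computation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the dissertation's implementation: the function names, the toy weights, and the choice of the sigmoid as the activation are assumptions made for the example.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, used here as the example activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def dense_layer(inputs, weights, biases, activation=sigmoid):
    """One fully-connected layer: each neuron computes a weighted sum of
    all inputs plus a bias and passes it through the activation function."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Propagate the inputs through every (weights, biases) pair in order."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

# Hypothetical toy network: 3 inputs -> 2 hidden neurons -> 1 output neuron.
toy_layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
output = forward([1.0, 0.0, -1.0], toy_layers)
```

Each entry of `toy_layers` holds one row of weights per neuron, so the number of rows sets the width of that layer and the length of the next layer's input.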
 
 
Figure 1. Multilayer perceptron network (Input Layer, Hidden Layer 1, Hidden Layer 2, Output Layer) 






Image recognition has been a popular application and benchmark of neural network 
techniques. For more than a decade, several groups have built the forward propagation portion of 
neural networks on an FPGA [9][10][11][12]. Each of these FPGA designs uses selected image 
features, instead of the whole image, to test the neural network. For example, F. Moreno, et al. [9] 
present a case study on a traffic sign recognition system whose neural network uses offline-trained 
weights and input image characteristics extracted from certain regions of interest (RoI) to predict 
the type of traffic sign. Other prior work extracts features from images of Arabic words or MNIST 
handwritten digits [10][11][12]. The system in [10] exhibits low (85%) recognition accuracy. The 
FPGA implementation in [11] loses 2.8% accuracy relative to the software solution. T. Huynh [12] 
used 16-bit precision but only achieved 91% recognition accuracy on handwritten digit 
recognition. In another paper, he used floating point arithmetic for calculations instead of fixed 
point, which cost more in hardware resources and execution time [13]. Other solutions use an 
on-chip soft-core processor and, thus, fail to take advantage of hardware acceleration using 
application-specific hardware [11] [14]. 
2.2 Convolutional Neural Network 
Convolutional neural networks (CNNs) are commonly used for analyzing images. In addition to 
fully-connected layers, they include convolutional layers. Deep CNNs are typically organized into 
alternating convolutional and max-pooling layers followed by a number of dense, fully-connected 
layers, which are made up of neurons that have learnable weights and biases. As illustrated in the 




layers: convolutional, pooling, and fully-connected layers, as shown in Figure 2 [1]. Each 3D 
volume represents an input to a layer and is transformed into a new 3D volume feeding the 
subsequent layer. The example below has five convolutional layers (represented by the pyramids), 
three max-pooling layers, and three fully-connected layers (represented by the last three 
rectangles). Convolutional layers apply a convolution operation with learnable filters to the layer 
inputs and then pass the result to the next layer. Pooling layers are inserted between successive 
convolutional layers to reduce the number of parameters and to limit overfitting. The last three 
fully-connected layers play the same role as in the fully-connected neural networks to forward the 




Figure 2. Convolutional neural network (Figure adapted from [1]) 
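
The convolution and max-pooling operations named above can be illustrated with a minimal pure-Python sketch. It assumes a single-channel image, stride 1, no padding, and non-overlapping pooling windows; the function names are hypothetical, and, as is common in CNN practice, the kernel is applied without flipping (cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and return
    the sum of element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# A 5x5 ramp image and a 2x2 diagonal-difference kernel (illustrative values).
image = [[r * 5 + c for c in range(5)] for r in range(5)]
feature_map = conv2d_valid(image, [[1, 0], [0, -1]])
pooled = max_pool(feature_map)
```

A 5x5 input with a 2x2 kernel gives a 4x4 feature map, and 2x2 pooling halves each dimension again, which shows how pooling shrinks the data passed to later layers.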
 
 
Deep CNNs have many layers and millions of parameters. For example, AlexNet [1] has 8 
layers and 62 million parameters, VGG Net [2] has 16 layers and 138 million parameters, and 
GoogLeNet [3] has 22 layers and 7 million parameters. Recent research is attempting to accelerate 
and compress deep CNNs to speed up their calculations and to decrease their computation 
resource requirements and power cost.  
Several recent papers attempt to accelerate CNNs by building systems on FPGA platforms. 
These systems offer algorithm-specific hardware and low power consumption [15][16], and they 
are easy to pipeline architecturally [15][17] and algorithmically [18][19][20]. C. Zhang, et al. [15] 
use loop tiling and transformation optimization techniques to speed up the CNN’s computation and 
then analyze the effect on performance and on memory bandwidth using the roofline model. J. Qiu, 
et al. [16] increase the external memory bandwidth by optimizing the data storage pattern in 
memory, which can increase the burst length of memory transactions. H. Li, et al. [17] use two 
ping-pong buffers in each layer, which enables data reuse and pipelined operation of different 
layers. L. Lu, et al. [18] introduce a fast Winograd convolution algorithm to reduce the arithmetic 
complexity, and they also reuse the feature map data on multiple element tiles in the output 
feature map to improve the computation speed of a CNN. C. Fong, et al. [19] represent weights 
in a power-of-2 format, which minimizes computational complexity by converting multiplications 
into shift operations. V. Akhlaghi, et al. [20] take advantage of the characteristics of the ReLU 
activation function by reordering the weights based on their signs. By doing so, this method can 
skip some computations in earlier CNN layers. These systems all achieved system speedup and 
energy reduction with little or no loss in classification accuracy. 
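
To illustrate why a power-of-2 weight format replaces multiplications with shifts, the following sketch rounds a weight to the nearest power of two and then multiplies an integer input using only a shift. It is a generic illustration under stated assumptions, not the scheme of [19]: the function names and the log-domain rounding rule are hypothetical choices made for the example.

```python
import math

def quantize_to_power_of_two(weight):
    """Round a nonzero weight to the nearest power of two (in the log domain).
    Returns (sign, exponent) so that weight ~= sign * 2**exponent."""
    if weight == 0:
        raise ValueError("zero has no power-of-two representation")
    sign = -1 if weight < 0 else 1
    exponent = round(math.log2(abs(weight)))
    return sign, exponent

def shift_multiply(x, sign, exponent):
    """Multiply integer x by sign * 2**exponent using only shifts.
    A negative exponent becomes a right shift, which floors the result."""
    shifted = x << exponent if exponent >= 0 else x >> -exponent
    return sign * shifted

# A weight of 0.26 is approximated by 2**-2, so multiplying becomes x >> 2.
sign, exponent = quantize_to_power_of_two(0.26)
product = shift_multiply(40, sign, exponent)
```

In hardware, a shift by a constant amount is just rewiring, so each such weight removes a multiplier at the cost of the rounding error introduced by the quantization.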
Several techniques have been introduced to compress bulky CNNs by decreasing the number 




connections between layers, whose weights are rounded to zero during quantization. This method 
reduced the storage of parameters by 35 times on AlexNet [1] and 49 times on VGG Net [2]. 
Sparse neural networks in [22] delete the “unimportant” neurons in the networks; this method can 
help decrease neural network sizes by 10-20%. The feature channel pruning method proposed 
in [23] deletes the “unimportant” feature channels in the network; this results in savings of two 
and five times in the computation of the VGG-16 [2] and ResNet-18 [6] networks. These methods 
compress the parameter size of neural networks without decreasing recognition 
accuracy.  
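
As a rough illustration of the pruning idea, the sketch below zeroes every weight whose magnitude falls below a threshold and reports the resulting sparsity. The threshold value and function names are assumptions for the example; the cited methods use their own criteria for deciding which connections, neurons, or channels are “unimportant”.

```python
def prune_by_magnitude(weights, threshold):
    """Set every weight whose absolute value is below the threshold to zero,
    which removes that connection from the computation."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)

# Hypothetical 2x2 weight matrix; two small weights fall below the threshold.
pruned = prune_by_magnitude([[0.5, -0.01], [0.02, -0.9]], threshold=0.05)
```

Zeroed weights need neither storage nor a multiply-accumulate operation, which is where the parameter and computation savings come from.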
2.3 Activation Functions 
Activation functions in neural networks provide the decision boundary of the neuron output 
and decide whether the neuron should be activated or not. An activation function therefore inhibits 
low inputs and accentuates high inputs. In this way, activation functions reflect neuron behavior, 
where neurons require an input above some threshold to activate. Commonly used activation 
functions include the sigmoid, ReLU, softmax, and hyperbolic tangent (tanh) functions.  
2.3.1 Sigmoid Function 
A popular activation function is the sigmoid function shown in equation (1) and Figure 3. As 
with other commonly used activation functions, such as the softmax and hyperbolic tangent 
functions, it is highly compute-intensive. Besides the high recognition accuracy obtained when 
using the sigmoid function as the activation in fully-connected neural networks, another 
consideration is that it can easily be 










Figure 3. Sigmoid function 
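
For reference, and assuming the standard definition of the logistic sigmoid, the function and its derivative are commonly written as follows. Evaluating it requires an exponential and a division, which is what makes it costly to compute directly in hardware.

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
```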
 
 
2.3.2 ReLU Function 
The Rectified Linear Unit (ReLU) has become a popular activation function in the last few 
years. As shown in equation (2) and Figure 4, it is a piecewise function. It keeps positive 
values unchanged and inhibits negative inputs by producing a zero for them. Although 
this function is relatively simple, its behavior can be problematic depending on the range of 
inputs. If most of the numbers in a batch are positive, too much redundant information is 
retained, which results in wider bit-width calculations or overflow in subsequent calculations. 




when those values are zeroed out by the ReLU function. This would prevent the network from 
learning and could even cause the network’s death. 




Figure 4. ReLU function 
 
 
To address the problem of losing too much information, a few prior papers [24] [25] suggested
modified ReLU functions, particularly the Leaky ReLU, Parametric ReLU, and Randomized
Leaky ReLU functions, as shown in Figure 5. All of these modified ReLU functions address
the negative input case. When the input is negative, instead of outputting zero, they output
the input multiplied by a small parameter. The difference among the proposed




random (Randomized Leaky ReLU). Despite these differences, that parameter would be fixed in 




Figure 5. Leaky ReLU/Parametric ReLU function (Figure adapted from [25])  
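As a minimal sketch of the functions discussed in this section, the following Python snippet implements the conventional ReLU and a leaky variant. The function names and the slope value of 0.01 are illustrative assumptions; they are not taken from [24] or [25].

```python
def relu(x):
    """Conventional ReLU: keep positive inputs, output zero for negative inputs."""
    return x if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    """Leaky ReLU: multiply negative inputs by a small slope instead of zeroing them.

    The default slope is an illustrative assumption, not a value from the text.
    """
    return x if x > 0 else slope * x


inputs = [-4.0, -1.0, 0.0, 2.0, 5.0]
print([relu(v) for v in inputs])
print([leaky_relu(v) for v in inputs])
```

The two functions differ only for negative inputs, which is the case the modified ReLU functions are designed to handle.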
 
 
2.4 Background Summary 
This chapter introduced the architectures of two different types of neural networks and
commonly used activation functions. To address the size, power consumption, and computational
intensity of neural networks, Chapter 3 introduces hardware-based neural network
architectures and their implementations on an FPGA, and how to implement hardware-friendly





Chapter 3  Methodology and Procedures 
 
This chapter describes the 2- and 3-layer fully-connected neural networks designed and built 
on an FPGA and the forward and backward propagation processes in these two networks. The 
chapter then introduces a novel neural network architecture called the Super-Skinny 
Convolutional Neural Network (SS-CNN) and its implementation on an FPGA. This SS-CNN
achieves state-of-the-art (99%) recognition accuracy with fewer layers and neurons, and therefore
requires fewer computation resources than neural networks with similar performance. The chapter
then dives into the details of the activation function, first describing the sigmoid function 
approximation method used to simplify the computation and decrease the FPGA area, and then 
introducing the proposed dynamic ReLU function, which offers high recognition accuracy at low 
computation cost when compared with the conventional ReLU function and the approximated 
sigmoid function.  
3.1 MNIST Dataset 
The MNIST dataset (the Modified National Institute of Standards and Technology database of 
handwritten digits) is used to train and test each of the neural networks proposed in this chapter. 
The MNIST dataset includes a training set of 60,000 images and their labels, and a testing set of 
10,000 images and their labels. Figure 6 shows five image examples from the MNIST dataset 
with their labels shown above each image. Each image is 28 × 28 pixels (784 total pixels), as 







Figure 6. Examples of digits in the MNIST dataset 
 
 
3.2 2-layer Fully-connected Neural Network
A 2-layer fully-connected neural network is the simplest fully-connected neural network. It is 
used to train and recognize images from the MNIST database of handwritten digits ranging from 
0 to 9. The goal of the neural network is to first train on a subset of the handwritten data and then 
to predict values (from 0 to 9) for other test images of handwritten digits.  
3.2.1 Network Architecture 
In a 2-layer fully-connected neural network, the first layer consists of the inputs. The second
layer, the output layer, sums the weighted inputs and passes each sum through the activation
function to produce the outputs, which give the prediction that the input image is one of the ten possible
digits.
The architecture of a 2-layer fully-connected neural network is shown in Figure 7. Because
each MNIST image consists of 784 pixels, the input layer consists of 784 inputs, $x_0, x_1, \ldots, x_{783}$.




weighted inputs and send their outputs through the activation function to produce the outputs of 
the neural network. With 10 possible outputs (i.e., digits 0-9), the network has 10 neurons in the 
















[Figure 7 labels: 1. Input Layer; 2. Output Layer]
 
Figure 7. 2-layer fully-connected neural network 
 
 
The weight matrix between the input layer and the output layer consists of ten weights (one for
each possible digit) for each of the 784 input pixels. So the system has 784 × 10 = 7,840 weights
($w_{0,0}, w_{0,1}, w_{0,2}, \ldots, w_{0,783}, \ldots, w_{9,0}, \ldots, w_{9,783}$), where $w_{j,i}$ connects input pixel $i$ to output neuron $j$. Each weight represents a synaptic weight (activating
or inhibiting). For example, with one bit of precision, activating would be 1 and inhibiting 0. But




precision, the weights have 256 fractional values between full activation (the maximum value) 
and complete inhibition (the minimum value). 
Each of the 10 neurons in the output layer corresponds to a given digit (0-9) and that neuron
sums the input pixels using the weights corresponding to that digit. The outputs of these
summations, called $z_0, z_1, \ldots, z_9$ (see Figure 7 and equation (3)), are then transformed through an
activation function to produce the final outputs. The final ten outputs of the neural network are
called $y_0, y_1, \ldots, y_9$ (see Figure 7 and equation (4)). The outputs indicate the likelihood that an image
is a given digit; for example, $y_0$ indicates the likelihood of the digit being 0, $y_1$ indicates the likelihood
of the digit being 1, and so on. For example, an output of $\{y_0, y_1, \ldots, y_9\}$ = {0.1, 0.2, 0.9, 0.3, 0.1,
0.2, 0.3, 0.1, 0.2, 0.1} would predict that the handwritten image was of the digit 2.
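The forward computation described above (ten weighted sums, an activation, and a choice of the largest output) can be sketched in Python as follows. This is an illustrative software model, not the FPGA implementation: the sigmoid activation, the random weight range, the index order `weights[j][i]`, and all function names are assumptions made for the example.

```python
import math
import random

NUM_PIXELS = 784   # 28 x 28 MNIST image
NUM_DIGITS = 10    # one output neuron per digit 0-9


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(pixels, weights):
    """Return the ten activated outputs for one image.

    weights[j][i] is assumed to connect input pixel i to output neuron j.
    """
    outputs = []
    for j in range(NUM_DIGITS):
        z = sum(w * x for w, x in zip(weights[j], pixels))  # weighted sum
        outputs.append(sigmoid(z))                          # activation
    return outputs


def predict(outputs):
    """The predicted digit is the index of the largest output."""
    return max(range(len(outputs)), key=lambda j: outputs[j])


random.seed(0)
weights = [[random.uniform(-0.5, 0.5) for _ in range(NUM_PIXELS)]
           for _ in range(NUM_DIGITS)]
pixels = [random.random() for _ in range(NUM_PIXELS)]  # stand-in for a real image
outputs = forward(pixels, weights)
print(predict(outputs))

# The example output vector from the text selects digit 2.
print(predict([0.1, 0.2, 0.9, 0.3, 0.1, 0.2, 0.3, 0.1, 0.2, 0.1]))
```

With untrained random weights the first prediction is arbitrary; the second call reproduces the example from the text, where the largest output is at index 2.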
3.2.2 FPGA System 
The 2-layer fully-connected neural network – including forward and backward propagation – is 
built using SystemVerilog on a Cyclone IV E FPGA. The system diagram of the neural network is
shown in Figure 8. The network is implemented using 4-, 5-, 6- and 8-bit precisions for all inputs, 



















Figure 8. FPGA system architecture 
 
 
The FPGA system consists of a UART communication module, Image/Label RAM, and a
Controller that directs the Computation Unit. The system also outputs results to a 7-segment 
display. The UART module transmits all training and test images and their labels from the PC to 
the FPGA. The system then stores these data in the Image/Label RAM. After a single image and 
its label are transferred to the FPGA, the Controller module is triggered to start either training or 
testing, depending on whether the system is in backward or forward propagation mode. The 
Computation Unit reads the weights from the Weights RAM in testing mode and both reads and 
updates the weights in training mode. At startup, the Weights RAM is initialized to hold random 
values. During both training and testing, the Computation Unit reads the weights from the 
Weights RAM to perform the weighted sum of the inputs. During training, the Computation Unit 
then also calculates the errors between the calculated outputs and the actual values and updates 
the Weights RAM with calculated delta weights. During this backward propagation process, the 
weights are updated after each training image is processed. Figure 9 shows the FPGA platform at 




label and the prediction result of the neural network, the LEDs count the number of correct 




Figure 9. FPGA platform 
 
 
3.2.2.1 System Controller 
The Controller module manages the two main processes: forward and backward propagation. 
It also displays the target and calculated outputs. Figure 10 shows the finite state machine (FSM) 




module receives and stores one image and its label into the Image/Label RAM, the UART module
asserts the start signal, thus moving the FSM to the forward (Fwd) state and triggering the forward
propagation process. After forward propagation computations are complete, if the system is in
testing mode, the FSM displays the image label (target result) and predicted results ($y_0, \ldots, y_9$) on
the 7-segment displays and then returns to the Idle state to continue processing the next test image.
However, if the system is in training mode, the FSM moves from the forward propagation state 
(Fwd) to the backward propagation state (Back) to update the weights in the Weights RAM (see 
Figure 8). After backward propagation is complete, the FSM displays the image label and training 
results on the 7-segment displays and then moves to the Idle state to continue processing the next 








































3.2.2.2 Computation Unit 
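The Computation Unit performs the calculations of the output layer. This unit also updates the
weights when the system is in training mode. After being triggered by the Controller moving to
the forward propagation (Fwd) state, the Computation Unit calculates the weighted sums of the
inputs ($z_0$ through $z_9$, see equation (3)) and then passes these results through the activation function $f$ to
produce the outputs ($y_0$ through $y_9$, see equation (4)).
\[
\begin{aligned}
z_0 &= w_{0,0}\, x_0 + w_{0,1}\, x_1 + \cdots + w_{0,783}\, x_{783} \\
&\;\;\vdots \\
z_9 &= w_{9,0}\, x_0 + w_{9,1}\, x_1 + \cdots + w_{9,783}\, x_{783}
\end{aligned}
\tag{3}
\]
\[
\begin{aligned}
y_0 &= f(z_0) \\
&\;\;\vdots \\
y_9 &= f(z_9)
\end{aligned}
\tag{4}
\]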
   
If the system is in backward propagation mode, the system still computes the outputs
($y_0$ through $y_9$) in the forward state (Fwd) but then also proceeds to the backward propagation
state (Back) to both compute the errors between the outputs and target results and update the
weights. The error calculation for each digit $j$, where $j$ is 0 to 9, is shown in equation (5). The
values of the actual results, $\text{target}_j$, are obtained from the Label RAM.
\[ e_j = \text{target}_j - y_j \tag{5} \]
We set the cost function for the training process as:  




This cost function of the errors is then used to update the weights in the Weights RAM. The
derivative of the cost function $C$ with respect to the output error $e_j$ is:
\[ \frac{\partial C}{\partial e_j} = e_j \tag{7} \]
However, the errors are first passed through the activation function and then multiplied by the
input pixels before being used to update the weights. So, we need to calculate the derivative of the
activation function $f(z_j)$:
\[ f'(z_j) = f(z_j) * \left(1 - f(z_j)\right) \tag{8} \]
The change in weights (that will be multiplied by the inputs before being added to the weights)
is then calculated as given in equation (9).
\[ \delta_j = \frac{\partial C}{\partial e_j} * f'(z_j) = e_j * f'(z_j) \tag{9} \]
This calculation can then be rewritten as:
\[ \delta_j = e_j * f(z_j) * \left(1 - f(z_j)\right) \tag{10} \]
The hardware of the 2-layer fully-connected neural network calculates the change for each
weight by multiplying the input by the calculated weight change, as shown in equation (11),
where $\Delta w_{j,i}$ is the amount to change the weight, $j$ is 0 to 9 for each of the digits, $i$ is 0 to 783
for each of the pixels in a single image, and $x_i$ are the inputs to the neural network, that is, the
value of each pixel in the training image.




The weights used to process the next image are the updated weights, $w_{j,i}^{\text{new}}$, shown in
equation (12).
\[ w_{j,i}^{\text{new}} = w_{j,i} + \Delta w_{j,i} \tag{12} \]













\[ w_{j,i}^{\text{new}} = w_{j,i} + \eta * x_i * e_j * f(z_j) * \left[1 - f(z_j)\right] \tag{14} \]
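The training step described in this section (forward pass, error calculation, and weight update after each image) can be sketched in Python as follows. This is a hedged software illustration rather than the hardware design: it assumes a sigmoid activation, a one-hot target vector, and a learning rate of 0.1, and the function name `train_step` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(pixels, target, weights, learning_rate=0.1):
    """One forward and backward pass for a single image (updates weights in place).

    target is assumed to be a list with 1.0 for the labelled digit and 0.0 elsewhere.
    weights[j][i] is assumed to connect input pixel i to output neuron j.
    """
    for j, row in enumerate(weights):
        y = sigmoid(sum(w * x for w, x in zip(row, pixels)))  # forward pass
        error = target[j] - y                                 # output error
        delta = error * y * (1.0 - y)                         # error times sigmoid derivative
        for i, x in enumerate(pixels):
            row[i] += learning_rate * x * delta               # weight update
    return weights


# Tiny example: 3 "pixels" and 2 output neurons instead of 784 and 10.
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
train_step([1.0, 0.5, 0.0], [1.0, 0.0], weights)
print(weights)
```

In the example, the weights of the neuron whose target is 1 increase and those of the other neuron decrease, while weights attached to a zero-valued pixel stay unchanged.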
3.3 3-layer Fully-connected Neural Network
The 3-layer fully-connected neural network we designed has higher recognition accuracy than 
the 2-layer network described in Section 3.2. The additional layer in the 3-layer fully-connected 
neural network includes more trainable parameters. This added complexity enables increased 
recognition accuracy, but it also increases the circuit complexity and size. For this reason, we 
train the network in software and only implement the forward propagation path in hardware on an 
FPGA. 
3.3.1 Network Architecture 
The architecture of the 3-layer fully-connected neural network is shown in Figure 11. It is 




addition to input and output layers, the system includes one hidden layer, which contains 128 
neurons.  
Because each MNIST image consists of 784 pixels, the input layer has 784 inputs, $x_0, x_1, \ldots, x_{783}$. The second layer, also called the hidden layer, contains 128 calculation nodes, or
neurons, that sum the weighted inputs and send their outputs through the activation function to the
output layer. Again, the 10 neurons in the output layer sum the weighted inputs from the hidden
layer and then pass them through the activation function to produce the outputs. With 10 possible
outputs (i.e., digits 0-9), the network has 10 neurons in the output layer, as shown in Figure 11. In
addition to the weights between the fully-connected layers, biases $b^{(1)}$ and $b^{(2)}$ are also
added for each non-output layer. As shown in equations (15) and (16), in the calculation of the
following layer, weights control how much any given layer input affects the output, and






























Figure 11. 3-layer fully-connected neural network 
 
 
3.3.2 FPGA System 
The FPGA system architecture of the 3-layer fully-connected neural network is similar to the 
architecture of the 2-layer network introduced in section 3.2. As shown in Figure 12, weights – 
obtained through training in software – are saved in the Weights RAM upon initialization of the 
system. After the system begins, it receives testing images and their labels from a PC through the 
UART. At the same time, the system reads the weights from the Weights RAM. After all the input 
data is loaded into memory on the FPGA, the data is passed through the layers and the predicted 
value is calculated. After each image is processed, the predicted digit and the image’s label are 


















Figure 12. FPGA system architecture 
 
 
3.3.2.1 System Controller 
The Controller module manages the dataflow during forward propagation. Its main module
is the finite state machine (FSM) defined by the state transition diagram in Figure 13. In the Idle
state, the FSM waits for the start signal to assert. After the UART module receives and stores one
image and its label into the Image/Label RAM, the UART module asserts the start signal, thus
moving the FSM to the forward (Fwd) state and triggering the forward propagation process. After
forward propagation computations are complete, the FSM displays the image label and predicted
results ($y_0, y_1, \ldots, y_9$) on the 7-segment displays, and then returns to the Idle state to continue




















Figure 13. Controller finite state machine (FSM) 
 
 
3.3.2.2 Computation Unit 
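Because the multiplication of layer inputs and weights, summation, and activation happen in
each pair of fully-connected layers, the 3-layer fully-connected neural network’s computation unit
requires more calculations than the 2-layer network. The 3-layer fully-connected neural network
has 128 neurons in the hidden layer, and one bias ($b^{(1)}$, $b^{(2)}$) in each non-output layer. The
computation unit calculates the weighted sums of the inputs ($z^{(1)}_0$ through $z^{(1)}_{127}$, see equation (15))
using the weights $w^{(1)}_{0,0}$ through $w^{(1)}_{127,783}$, and then passes these results through the activation
function $f$ to produce the hidden layer ($y^{(1)}_0$ through $y^{(1)}_{127}$, see equation (16)).
\[
\begin{aligned}
z^{(1)}_0 &= w^{(1)}_{0,0}\, x_0 + w^{(1)}_{0,1}\, x_1 + \cdots + w^{(1)}_{0,783}\, x_{783} + b^{(1)} \\
&\;\;\vdots \\
z^{(1)}_{127} &= w^{(1)}_{127,0}\, x_0 + w^{(1)}_{127,1}\, x_1 + \cdots + w^{(1)}_{127,783}\, x_{783} + b^{(1)}
\end{aligned}
\tag{15}
\]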










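\[
\begin{aligned}
y^{(1)}_0 &= f\left(z^{(1)}_0\right) \\
&\;\;\vdots \\
y^{(1)}_{127} &= f\left(z^{(1)}_{127}\right)
\end{aligned}
\tag{16}
\]
These results, $y^{(1)}_0$ through $y^{(1)}_{127}$, are then passed to the output layer and multiplied with
another batch of weights ($w^{(2)}_{0,0}$ through $w^{(2)}_{9,127}$). Their sums ($z^{(2)}_0$ through $z^{(2)}_9$, see equation (17))
are then passed through the activation function to produce the output layer ($y^{(2)}_0$ through $y^{(2)}_9$, see
equation (18)).
\[
\begin{aligned}
z^{(2)}_0 &= w^{(2)}_{0,0}\, y^{(1)}_0 + w^{(2)}_{0,1}\, y^{(1)}_1 + \cdots + w^{(2)}_{0,127}\, y^{(1)}_{127} + b^{(2)} \\
&\;\;\vdots \\
z^{(2)}_9 &= w^{(2)}_{9,0}\, y^{(1)}_0 + w^{(2)}_{9,1}\, y^{(1)}_1 + \cdots + w^{(2)}_{9,127}\, y^{(1)}_{127} + b^{(2)}
\end{aligned}
\tag{17}
\]
\[
\begin{aligned}
y^{(2)}_0 &= f\left(z^{(2)}_0\right) \\
&\;\;\vdots \\
y^{(2)}_9 &= f\left(z^{(2)}_9\right)
\end{aligned}
\tag{18}
\]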
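The forward pass described by equations (15) to (18) can be sketched in Python as follows. This is an illustrative software model rather than the FPGA implementation. It assumes a sigmoid activation and a single shared bias per layer (following the statement that there is one bias in each non-output layer); the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, bias):
    """Weighted sum plus a shared bias, followed by the activation.

    weights[n][k] is assumed to connect input k to neuron n.
    """
    return [sigmoid(sum(w * v for w, v in zip(row, inputs)) + bias)
            for row in weights]


def forward_3layer(pixels, w_hidden, b_hidden, w_output, b_output):
    hidden = dense_layer(pixels, w_hidden, b_hidden)   # 128 values in the network described here
    return dense_layer(hidden, w_output, b_output)     # 10 values, one per digit


# Tiny example: 2 inputs, 3 hidden neurons, 2 outputs.
w_hidden = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
w_output = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
out = forward_3layer([1.0, 0.0], w_hidden, 0.0, w_output, 0.0)
print(out)
```

In the example, all hidden neurons output 0.5, so the two output sums are +1.5 and -1.5 and the two activated outputs are mirror images of each other around 0.5.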
For backward propagation, the error calculation for each digit $j$ (where $j$ is 0 to 9) is shown in
equation (19). The values of the actual results, $\text{target}_j$, are obtained from the Label RAM (as
shown in Figure 12).
\[ e^{(2)}_j = \text{target}_j - y^{(2)}_j \tag{19} \]
With the same theory introduced in section 3.2, the change of the hidden layer weights,
$\delta^{(2)}_j$, is then calculated as given in equations (20) and (21).




\[ \delta^{(2)}_j = e^{(2)}_j * f\left(z^{(2)}_j\right) * \left(1 - f\left(z^{(2)}_j\right)\right) \tag{21} \]
Having calculated the layer weight changes and using the learning rate $\eta$, the new bias of the
hidden layer, $b^{(2)}_{\text{new}}$, is calculated using equation (22).
\[ b^{(2)}_{\text{new}} = b^{(2)} + \eta * \delta^{(2)}_j \tag{22} \]
According to the chain rule, to back-propagate $\delta^{(2)}_j$ to the hidden layer weights, we need to
multiply by the derivative of equation (17) with respect to those weights. In equation (17) the hidden layer weights
are multiplied by the hidden layer values; therefore, we need to multiply the weight changes by the
hidden layer values $y^{(1)}_m$ (the inputs to these weights) in order to get $\Delta w^{(2)}_{j,m}$ (see equation (23)).
\[ \Delta w^{(2)}_{j,m} = \delta^{(2)}_j * y^{(1)}_m \tag{23} \]
Then we can produce the new hidden layer weights by adding the product of the weight deltas
and the learning rate (see equation (24)).
\[ w^{(2)}_{j,m,\text{new}} = w^{(2)}_{j,m} + \eta * \Delta w^{(2)}_{j,m} \tag{24} \]
Similar to the error calculation in the output layer, we must also calculate the errors for the 
hidden layer neurons. But unlike the output layer, we cannot calculate these errors directly 
because we do not have a target, so we back-propagate them from the output layer. This is done 
by taking the errors from the output neurons and running them back through the weights to 
produce the hidden layer errors (see equation (25), where m is 0 to 127 for each of the hidden 
neurons). 




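Having obtained the weight changes $\delta^{(1)}_m$, we calculate the new bias $b^{(1)}_{\text{new}}$ and weights
$w^{(1)}_{m,i,\text{new}}$ for the input layer using the same method described for the hidden layer (see equations
(26) to (28), where $i$ is 0 to 783 for each of the input pixels).
\[ b^{(1)}_{\text{new}} = b^{(1)} + \eta * \delta^{(1)}_m \tag{26} \]
\[ \Delta w^{(1)}_{m,i} = \delta^{(1)}_m * x_i \tag{27} \]
\[ w^{(1)}_{m,i,\text{new}} = w^{(1)}_{m,i} + \eta * \Delta w^{(1)}_{m,i} \tag{28} \]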
These biases and weights are updated each time the system processes a new training image. 
3.4 Super-Skinny Convolutional Neural Network (SS-CNN) 
This section describes our design and FPGA implementation of a novel CNN we call a Super-
Skinny Convolutional Neural Network (SS-CNN). This SS-CNN achieves state-of-the-art (99%)
recognition accuracy with fewer layers and fewer neurons than recent networks
that recognize the MNIST dataset [26], for example the LeNet-5 network [27]. Furthermore, it has
an additional compelling feature when compared with recent networks: it does not use FPGA-
unfriendly techniques such as normalization. Thus, it can be readily mapped onto hardware such 
as an FPGA. 
3.4.1 Network Design and Architecture 
The SS-CNN, designed for recognizing the MNIST dataset, results in state-of-the-art 
performance in terms of recognition accuracy, circuit area, and power. Beyond input and output 
layers, it has only three additional layers: one convolutional layer, one Max-Pooling layer, and 




MB of memory after being quantized to 8-bit precision in the FPGA implementation. The 
inference portion of the SS-CNN fits easily on the low-cost Cyclone IV FPGA platform. Despite 
its minimal design and FPGA area, it has high recognition accuracy on handwritten digits in the 
MNIST dataset, as described in detail later in this section. Because of the FPGA implementation’s 
low power, low computation demands, and low memory requirements, this system can be readily 
integrated into various portable devices.  
Designing such a CNN is not a random task, but requires exploratory analysis using
mathematical, statistical, and image processing knowledge. The autocorrelation function of
neighboring pixels is a function of the distance between the two pixels and attenuates as the
distance increases. We explored convolutional kernels of different sizes, up to 28×28 pixels. The
recognition accuracy of the neural network remained the same for convolutions
with 5×5 to 8×8 pixel kernels, but it decreased for kernels larger than 8×8 or smaller than 5×5
pixels. Thus, we chose to use a 5×5 pixel kernel in the
convolutions. After computing the principal components of 5×5 segments of a large group of
images, using stride one, we concluded that six principal components account for almost
100% of the total variance. For this reason, six feature maps are used at the convolutional layer.
While we explored using seven and eight feature maps, there was no difference in recognition 
accuracy when compared to using six feature maps. For the Max-Pooling layer, a weight of one 
and an offset of zero are used, because there is no increase in recognition accuracy when using the 
weights and offsets computed during the learning stage. Following the Max-Pooling layer, a fully-




determined in a stepwise approach by starting with 35 neurons and gradually increasing the 
neurons to explore the effect on performance. The recognition accuracy reached its maximum at 
45 neurons; using more than 45 neurons did not improve the performance. Thus, the hidden layer 
is designed with 45 neurons. During training, after 12 epochs, the probability of correct 
classification for the training data was 99.92% and that of the test data was 98.68%. At the end of 
epoch 15 the probability of correct classification for the training data was 99.99% and that of the 
test data was 98.76%. As the number of epochs increased, the probability of correct classification 
for the training data approached one and that of the test data was 99.90%. A feedforward final test 
program with a GUI was written that implemented the architecture described above









Figure 14. Executable action of test program 
  
 
 Figure 15  shows the overall architecture of the SS-CNN. The network processes 28×28 pixel 
grayscale images of handwritten digits (see Figure 6) from which six convolutional feature maps 
are derived. To derive a feature map, consider a 5×5 pixel segment of the input image, with stride 
one and offset zero. Thus, for one convolutional feature map we need 26 (i.e., 5×5+1) parameters 
and for all six feature maps we need 156 (i.e., 26×6). Each of the six feature maps has 576 (i.e., 
24×24) components. Each feature map is then reduced to a 12×12 Max-Pooling map by taking the
maximum of each 2×2 non-overlapping segment in the feature map. Going from a feature map to the
pooling layer, the weight is fixed to one and the offset is zero. All of the 864 neurons of the Max-
Pooling layer are connected to the 45 neurons of the fully-connected layer plus one offset, for a




45 neurons plus one offset for a total of (45+1)×10=460 coefficients. The total number of 











Figure 15. Super-skinny convolutional neural network 
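The layer sizes and coefficient counts quoted above can be checked with a short Python calculation. The figures for the convolutional layer (156), the Max-Pooling layer (864), and the output layer (460) are stated in the text; the hidden-layer count and the total are derived here by applying the same "inputs plus one offset" pattern and should be read as a consistency check, not as numbers quoted from the source.

```python
KERNEL = 5         # 5x5 convolution kernel
FEATURE_MAPS = 6
IMAGE = 28         # 28x28 input image
POOL = 2           # 2x2 non-overlapping max pooling
HIDDEN = 45        # neurons in the fully-connected hidden layer
OUTPUTS = 10       # one output neuron per digit

conv_size = IMAGE - KERNEL + 1                          # stride one, no padding
pool_size = conv_size // POOL
conv_params = (KERNEL * KERNEL + 1) * FEATURE_MAPS      # kernel weights plus one offset per map
pooled_values = FEATURE_MAPS * pool_size * pool_size    # inputs to the hidden layer
hidden_params = (pooled_values + 1) * HIDDEN            # derived by the same pattern (assumption)
output_params = (HIDDEN + 1) * OUTPUTS
total_params = conv_params + hidden_params + output_params

print(conv_size, pool_size)
print(conv_params, pooled_values, hidden_params, output_params, total_params)
```

Under these assumptions the fully-connected hidden layer dominates the coefficient count, which is why the number of hidden neurons matters most for memory use.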
 
 
All weights are initialized with uniformly distributed independent random variables in the 
interval [-0.5, 0.5]. The neural network is trained with 60,000 training images for as many epochs
as needed to maximize the probability of correct classification without overfitting. The output is
sensitive to the choice of learning rate and the activation function. The training probability of
correct classification was 99.75% while the probability of correct classification of the test data
was 98.95%, which is comparable to neural network architectures with many more layers that
require more time for training and testing than this architecture. For example, with similar
recognition accuracy on the MNIST dataset, the LeNet-5 network [27] requires 6 hidden layers and more
than 60,000 parameters. The weights obtained after training the CNN were used by a feedforward




of the software version of the feedforward neural network used to classify 10,000 MNIST test 
images. 
3.4.2 FPGA System 
This section describes the SS-CNN network designed, built, and tested using SystemVerilog
on a Cyclone IV E FPGA. The FPGA system diagram of the SS-CNN network is shown in Figure
16. The network was implemented using parameters with 8-bit precision for all inputs, outputs,
and trained weights. Eight-bit parameters retain enough information to preserve accuracy; with further
decreases in parameter precision, the prediction accuracy of the neural network decreases.
The CNN hardware system (see Figure 16) consists of external memories, on-chip registers, 
and calculation units on an FPGA. The external memories, the Image/Label RAM and Weights 
RAM, store the image/label data and trained weights. The on-chip registers, the Conv Result Reg, 
store the calculation results of the convolutional layer. The system controller, the Controller, 
manages the dataflow of the whole system. The computation units on the FPGA system, 
Convolutional Layer, Pooling Layer, and fully-connected layers, perform the calculations of the 























Figure 16. FPGA system architecture of Super-skinny convolutional neural network 
 
 
3.4.2.1 System Controller 
The Controller module manages the dataflow during forward propagation. Its main module is 
the finite state machine (FSM) defined by the state transition diagram in Figure 17. In the Idle 
state, the FSM waits for the Start_conv signal to assert in order to start the convolution operation. 
After the image data is read from external memory, the Start_conv signal is asserted, which 
moves the FSM to the forward (Fwd) state and triggers the forward propagation process. After 
forward propagation computations are complete, the FSM displays the image label and predicted 
















Figure 17. Controller finite state machine (FSM) 
 
 
3.4.2.2 Convolutional Layer 
The Convolutional Layer unit performs feature extraction from the input data images by 
performing convolution operations using six 5×5 kernel frames. As shown in Figure 18, one 5×5 
pixel kernel frame convolves with 5×5 pixels of image data to generate one pixel in the 
convolutional layer. The convolution kernel has six frames, and the stride of each frame is one. 
No padding is added to the original images, so we get six 24×24 frames in the convolutional layer 














Figure 18. Convolution for one kernel frame 
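The convolution of one kernel frame with the image can be sketched in Python as follows. This is a plain software illustration of a stride-one convolution without padding, not the hardware datapath; the function name and the all-ones test data are assumptions made for the example, and the activation function applied afterwards in the hardware is omitted.

```python
def convolve_valid(image, kernel, bias=0.0):
    """Slide a square kernel over a square image with stride one and no padding."""
    k = len(kernel)
    out_size = len(image) - k + 1
    result = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            acc = bias
            for kr in range(k):
                for kc in range(k):
                    acc += image[r + kr][c + kc] * kernel[kr][kc]  # one of the 25 products
            row.append(acc)
        result.append(row)
    return result


image = [[1.0] * 28 for _ in range(28)]   # stand-in for a 28x28 MNIST image
kernel = [[1.0] * 5 for _ in range(5)]    # stand-in for one 5x5 kernel frame
feature_map = convolve_valid(image, kernel)
print(len(feature_map), len(feature_map[0]), feature_map[0][0])
```

A 28×28 image and a 5×5 kernel give a 24×24 feature map, matching the frame size stated in the text; repeating the call with six different kernels would give the six feature maps.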
 
 
The basic calculation cell in the convolutional layer is the 5×5 multiplication net, as shown in 
Figure 19. Each clock cycle, the 5×5 multiplication net calculates the multiplication of 5×5 pixels 














To complete the convolution operation, the system accumulates the 25 (i.e., 5×5) 
multiplication results and then uses an adder tree to add them together as shown in Figure 20. The 
variables M00 … M04, M10 … M14, M20 … M24, M30 … M34, M40 … M44 represent the 25 
multiplication results from the 5×5 multiplication net as shown in Figure 19. The accumulation 














Figure 20. Adder tree for image and kernel frame multiplication results 
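The idea of summing the 25 products with an adder tree can be sketched in Python as follows. The exact tree structure used in Figure 20 is not recoverable from the text, so this sketch assumes a simple pairwise tree in which an unpaired value passes through to the next level; the function name is hypothetical.

```python
def adder_tree(values):
    """Sum values by pairwise addition, one tree level per pass.

    Returns the total and the number of tree levels used.
    """
    level = list(values)
    depth = 0
    while len(level) > 1:
        next_level = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                 # an unpaired value passes through unchanged
            next_level.append(level[-1])
        level = next_level
        depth += 1
    return level[0], depth


products = list(range(25))                 # stand-ins for M00 ... M44
total, depth = adder_tree(products)
print(total, depth)
```

For 25 inputs this pairwise scheme needs five levels (25, 13, 7, 4, 2, 1 values per level), compared with 24 additions in sequence for a single accumulator.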
 
 
The convolution kernel consists of six distinct 5×5 convolution frames as discussed in section 
3.4.1. Two separate FPGA solutions are provided; one solution processes the six feature channels 
sequentially, and the other processes them in parallel. 
Figure 21 shows the calculation steps of the convolutional layer when the feature channels run 




above, the system adds a Bias to the accumulated data and then passes the data through the 
sigmoid activation function. At this point the system goes to the Byte Ready state which indicates 
that one convolution data value in the convolutional layer has finished calculating. After 
completing this computation, the FSM goes back to Idle state to continue calculating the next data 
value in the convolutional layer. We repeat this process 3,456 (i.e., 6×24×24) times to process all 
















Figure 21. Convolutional layer finite state machine (FSM) with sequential feature channel 
 
 
When the six feature channels are processed in parallel, the convolutional layer hardware on 
the FPGA is replicated six times. As shown in Figure 22, the FSM of the
convolutional layer is the same, but each of the six parallel FSMs only repeats 576 (i.e., 24×24)



















Figure 22. Convolutional layer finite state machine (FSM) with parallel feature channel 
 
 
After the calculations of the convolutional layer are complete, the Max-Pooling layer is 
processed in the following clock cycle, as shown in Figure 23. The Max-Pooling layer picks the 
maximum value in non-overlapping 2×2 boxes in the six 24×24 frames in the convolutional layer. 
Thus, the Max-Pooling layer results in six 12×12 frames, as shown in Figure 15 and Figure 23. At 
the same time, the Start_fc signal in Figure 21 and Figure 22 is asserted to start the calculations in 













Figure 23. Max-Pooling layer 
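The Max-Pooling operation described above can be sketched in Python as follows. This is an illustrative software version for a single frame; the function name and the test frame are assumptions made for the example.

```python
def max_pool_2x2(frame):
    """Take the maximum of each non-overlapping 2x2 block of a square frame."""
    size = len(frame) // 2
    return [[max(frame[2 * r][2 * c], frame[2 * r][2 * c + 1],
                 frame[2 * r + 1][2 * c], frame[2 * r + 1][2 * c + 1])
             for c in range(size)]
            for r in range(size)]


# A 24x24 frame whose values increase from left to right and top to bottom.
frame = [[r * 24 + c for c in range(24)] for r in range(24)]
pooled = max_pool_2x2(frame)
print(len(pooled), len(pooled[0]), pooled[0][0])
```

Each 24×24 convolutional frame is reduced to a 12×12 frame, so six frames give the 864 values that feed the fully-connected hidden layer.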
 
 
3.4.2.3 Fully-connected Layers 
The SS-CNN has one fully-connected hidden layer and one fully-connected output layer. The 
inputs of the fully-connected hidden layer are the six 12×12 frames from the Max-Pooling layer. 
The two operations of multiplication and accumulation run sequentially for each neuron in the 
hidden layer. Thus, calculating the result in one neuron of the fully-connected hidden layer 
requires 864 (i.e., 6×12×12) clock cycles, which is the size of its input data – or, in other words, 
the size of the Max-Pooling layer output. The results of the 45 neurons in the fully-connected 
hidden layer are calculated in parallel, so the calculation of the fully-connected hidden layer costs 
864 cycles plus the time to add a bias and to pass the data through the sigmoid function. The input 
of the fully-connected output layer is the 45 neuron outputs from the hidden layer. The output 
layer performs the same calculations on its inputs as the hidden layer. Thus, the calculation of the 
fully-connected output layer requires 45 clock cycles plus the time to add a bias and to pass the 
data through the sigmoid function. Figure 24 shows the calculation process of the two fully-




Figure 21 or Figure 22, the calculations of the fully-connected hidden layer are triggered. After 
864 (i.e., 6×12×12) operations of multiplication and addition, the FSM moves to the calculation of 
the fully-connected output layer. Again, after repeating 45 times, the forward propagation 
calculations are complete. The FP_done signal is then asserted, which informs the system 
controller in Figure 17 that the system is ready to process the next image, and it also tells the 



















Figure 24. Fully-connected layers’ finite state machine (FSM)  
 
 
3.5 Sigmoid Function Approximation  
The sigmoid function (see Equation (1)) requires exponential and division operations, which 




simplify the computation and, thus, decrease the area, execution time, and power consumption 
required by the sigmoid function, the FPGA systems we designed approximate the sigmoid 
function using a piecewise linear approximation of a nonlinear function (PLAN) [28]. As shown 
in Figure 25, the piecewise linear function (represented by the blue line) closely approximates the 
original sigmoid function (represented by the red dotted line). The piecewise linear approximation 
offers hardware-friendly calculations that can be readily implemented on an FPGA. Table 1
shows the output $y$ as a function of the input $x$ for several input ranges in the implementation
of PLAN. Thus, the PLAN approximation replaces the exponential and division operations in the original
sigmoid function with shift and addition operations.
 
 
   







Table 1. Implementation of PLAN [28]
Output: y = F(x)          Input Condition
y = 32                    x ≥ 160
y = (x >> 5) (a) + 27     76 ≤ x < 160
y = (x >> 3) (b) + 20     32 ≤ x < 76
y = (x >> 2) (c) + 16     0 ≤ x < 32
y = 1 - y                 x < 0
a. right shift 5 bits  b. right shift 3 bits  c. right shift 2 bits
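The piecewise rules in Table 1 can be sketched in a few lines of Python. This is an illustrative software model, not the SystemVerilog used on the FPGA. It assumes integer inputs and outputs scaled by 32 (so that 32 represents 1.0, as the first row of the table suggests), uses right shifts as stated in the table footnotes, and writes the negative branch as 32 - y evaluated on |x|. The function name `plan_sigmoid` is hypothetical.

```python
import math

SCALE = 32  # assumed fixed-point scale: 32 represents 1.0


def plan_sigmoid(x):
    """Piecewise linear sigmoid approximation on integers scaled by 32."""
    ax = abs(x)
    if ax >= 160:
        y = 32
    elif ax >= 76:
        y = (ax >> 5) + 27
    elif ax >= 32:
        y = (ax >> 3) + 20
    else:
        y = (ax >> 2) + 16
    return y if x >= 0 else SCALE - y      # symmetry for negative inputs


def max_error(lo=-256, hi=256):
    """Largest absolute difference from the exact sigmoid over an integer input range."""
    return max(abs(plan_sigmoid(x) / SCALE - 1.0 / (1.0 + math.exp(-x / SCALE)))
               for x in range(lo, hi + 1))


print(plan_sigmoid(0), plan_sigmoid(32), plan_sigmoid(-32), plan_sigmoid(200))
print(max_error())
```

Only comparisons, shifts, and additions are needed, which is what makes the approximation attractive for an FPGA. The truncation in the integer shifts adds a small error on top of the piecewise linear fit.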
 
 
While the sigmoid and ReLU functions are two of the most common activation functions, they 
have limitations in terms of demand on hardware resources and performance. The sigmoid 
activation function – even when approximated – demands a large amount of FPGA area and the 
ReLU function can potentially limit performance in terms of prediction accuracy. The next 
section introduces a novel activation function, the D-ReLU function, which overcomes both of these
limitations. 
3.6 Dynamic ReLU Function 
This section introduces a novel activation function, the dynamic ReLU (D-ReLU) activation 
function, and contrasts it with the most common activation functions used in neural networks: 
ReLU, leaky ReLU, parametric ReLU (P-ReLU), and Randomized Leaky ReLU functions. While 
the sigmoid activation function offers high accuracy, it is also computationally expensive. In 




accuracy, the modified leaky ReLU, P-ReLU, and Randomized Leaky ReLU only fix the negative 
input case, but the proposed D-ReLU function offers high accuracy at low computational cost. 
To alleviate the disadvantages of the conventional ReLU and modified ReLU functions mentioned 
in section 2.3.2, a modified ReLU function, D-ReLU, is proposed. Two types of D-ReLU 
functions are proposed: (1) middle D-ReLU, where the threshold is the middle value of the 
numbers in each layer input; and (2) mean D-ReLU, where the threshold is the average of the 
numbers in each layer input. For example, suppose the activation function’s inputs are 
-5, 0, 2, 5 and 8. The threshold would be 2 for the middle D-ReLU. The mean D-ReLU, on the 
other hand, uses the mean value of all of the inputs, (-5 + 0 + 2 + 5 + 8)/5 = 2, instead of the 
middle value of the inputs.  
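
The two thresholds can be computed as in the following minimal sketch. It assumes that the "middle value" is the median of the inputs, which agrees with the example inputs -5, 0, 2, 5 and 8 but is not stated explicitly in the text. The function names are hypothetical.

```python
import numpy as np


def middle_threshold(layer_inputs):
    """Middle D-ReLU threshold (assumption: the middle value is the median)."""
    return float(np.median(layer_inputs))


def mean_threshold(layer_inputs):
    """Mean D-ReLU threshold: the average of the layer inputs."""
    return float(np.mean(layer_inputs))


layer_inputs = np.array([-5, 0, 2, 5, 8])
print(middle_threshold(layer_inputs))  # threshold for the middle D-ReLU
print(mean_threshold(layer_inputs))    # threshold for the mean D-ReLU
```

For this particular input set both variants give the same threshold; in general they differ whenever the inputs are skewed.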
By using either of these proposed D-ReLU functions, the system can discard redundant 
information and preserve useful information dynamically, based on the pixel values of each 
image. D-ReLU functions with positive and negative thresholds, as calculated above, are shown 
in Figure 26 and Figure 27. During forward propagation, the threshold of the D-ReLU function 
changes dynamically. Before activating the outputs of the hidden layer and of the output layer, the 
D-ReLU function calculates the range of activation function inputs and sets the middle or mean 
value of the inputs as the current threshold. Each layer shares the same threshold at any given 
time, and this threshold changes when the network processes a new image – during training, 











Figure 27. ReLU function with a negative threshold 
 
 
While the calculations of backward propagation in the D-ReLU function are similar to those 
described for the sigmoid function in Section 3.3.2, in many cases the calculations are simpler. 
Because the D-ReLU function is a piecewise function, we need to find the gradient of its two 
segments. As shown in Figure 26 and Figure 27, when the input is smaller than the threshold, the 




equal to or larger than the threshold, the slope of the function is 1, i.e., the gradient is 1. In this 
way, the backward propagation calculations of the network change from those in equations (20) 
and (21) for the sigmoid function to equations (29) and (30) for the D-ReLU function. 
$$ f'(x) = \begin{cases} 0, & x < \text{threshold} \\ 1, & x \ge \text{threshold} \end{cases} \tag{29} $$
 ( ) = 0,               < ℎ ℎ  
( ) = ( ),               ≥ ℎ ℎ  (30) 
And the ( )  from equation (25) changes to equation (31). 
 ( ) = 0,               < ℎ ℎ  
( ) = ( ) ∗ ,( ) ,     ≥ ℎ ℎ  (31) 
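
The effect of the 0/1 derivative on the backward pass can be sketched in Python as follows. The snippet assumes that a layer's delta is its error multiplied element-wise by the activation derivative, as in the sigmoid case, and simply substitutes the D-ReLU derivative. The array names and sample values are illustrative only.

```python
import numpy as np


def d_relu_derivative(x, threshold):
    """Derivative of D-ReLU: 0 below the threshold, 1 at or above it."""
    return (x >= threshold).astype(x.dtype)


# Illustrative activation inputs and errors for one layer (made-up values).
pre_activation = np.array([-5.0, 0.0, 2.0, 5.0, 8.0])
error = np.array([0.3, -0.1, 0.2, -0.4, 0.5])

threshold = np.mean(pre_activation)  # mean D-ReLU threshold for this layer
delta = error * d_relu_derivative(pre_activation, threshold)
print(delta)
```

Because the derivative is a mask, the multiplication reduces to either passing the error through unchanged or zeroing it.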
As shown above, in the backward propagation calculations using the D-ReLU function, instead 
of multiplying with the more complex sigmoid derivative, we multiply with the derivative of the 
D-ReLU, which is either 0 or 1. Thus, using the D-ReLU simplifies the calculations and reduces 





Chapter 4  Results and Discussion 
 
In chapter 3, we designed, built, and tested three neural networks in hardware on an FPGA and 
developed a novel activation function. We also explored the design space of varying the precision 
of parameters in the neural network and the resulting performance, area, and power implications. 
The FPGA implementations of the three neural networks result in lower execution time and lower 
power requirements when compared with software solutions without a loss in recognition 
accuracy. Moreover, by comparing different bit widths of the 2-layer and 3-layer fully-connected 
neural networks on an FPGA, we explore the accuracy/area tradeoff of changing precision. 
Furthermore, by applying the D-ReLU function, we can further improve these networks’ performance 
on the FPGA. Finally, each of the two FPGA solutions of the SS-CNN fits on a low-cost Cyclone IVE 
FPGA with state-of-the-art (99%) recognition accuracy. 
4.1 2-layer Fully-connected Neural Network on an FPGA 
To explore the 2-layer fully-connected neural network’s performance, accuracy loss and 
potential hardware resource saving on FPGA, this network was implemented using 8-, 6-, 5- and 
4-bit precision for all inputs, outputs, and weights. This section also compares the accuracy and 
execution time of the software solution with the proposed hardware solutions.  
4.1.1 Recognition Accuracy 
The recognition accuracy of the 8-, 6-, 5- and 4-bit hardware designs varies from 73-89% as 




depending on the number of training images. Figure 28 summarizes these results. The horizontal 
axis shows the number of training images (up to 55,000), and the vertical axis represents the 
recognition accuracy, the percentage of correct predictions using 10,000 test images. The 8-bit 
integer FPGA solution’s accuracy reaches (and slightly exceeds) that of the 32-bit floating point 
software solution when using the maximum number of training images (55,000). Moreover, it 
converges twice as fast: the 8-bit integer FPGA solution requires only 20,000 training images 
to achieve the highest recognition accuracy, while the 32-bit floating point software solution 










Further reducing the bits of precision results in only 6-11% drops in accuracy (when using 
55,000 training images), as shown in Figure 28 and summarized in Table 2. Compared with the 8-
bit hardware solution, the 6-bit solution’s accuracy drops by only 6%, and the 5- and 4-bit 
solutions have accuracies that are only 9-11% lower than the 8-bit solution. Thus, reducing the 
precision by 50% from 8 bits to 4 bits results in an accuracy decrease of only 9%. Table 2 
indicates accuracy is not affected when decreasing precision from 32-bit floating point to 8-bit 




Table 2. Recognition accuracy of 2-layer fully-connected neural network with online training 
Implementation     Recognition Accuracy 
32-bit software    89% 
8-bit FPGA         89% 
6-bit FPGA         83% 
5-bit FPGA         78% 









With the same recognition accuracy, the 8-bit hardware solution offers potential power savings 
over the software solution, as measured by execution time and summarized in Table 3. The clock 
frequency of the software solution is the CPU frequency (3.6 GHz), and the clock frequency of 
the FPGA designs (25 MHz) is the highest frequency at which the designs can run on the Cyclone 
IVE FPGA. The execution time measured includes both training and testing time when using 
55,000 training images and 10,000 test images. Execution time of both the software and hardware 
designs is almost identical:  3.7 seconds in software and 3.8 seconds for each FPGA solution. 
 
 
Table 3. Execution time and potential power savings of 2-layer fully-connected neural network with 
online training 
Implementation   Clock Frequency   Execution Time   Potential Power Savings 
Software         3.6 GHz           3.7 seconds      1× 
Hardware         25 MHz            3.8 seconds d    140× 
d. includes calculation time for forward and backward propagation (but not UART transfer) 
 
 
While the FPGA hardware designs complete the computations in essentially the same time as 
the software implementation, the clock frequency of the software model is 144 times higher than 
that of the FPGA designs. Thus, to compare solutions running at the same frequency, the hardware 




software solution. On the other hand, keeping the lower frequency of the hardware design 
produces an approximate 140× decrease in power consumption for training and testing 
calculations, using the relationship that power is proportional to operating frequency. Thus, the 
FPGA hardware designs offer higher performance or lower power alternatives to the software 
implementation. 
4.1.3 FPGA Area 
The FPGA area for the hardware design of the 2-layer fully-connected neural network using 4, 
5, 6, and 8 bits of precision requires 20-34k logic elements (LEs). Table 4 summarizes the area 
requirements of each hardware design as well as the percent use of the FPGA’s total LEs. As 
expected, lower bit width solutions require fewer logic elements, with the 4-bit solution using 
about 40% less area than the 8-bit solution. 
 
 
Table 4. FPGA area of 2-layer fully-connected neural network with online training 
FPGA Solution   Logic Elements (% use e) 
8-bit width     34 k (29%) 
6-bit width     26 k (23%) 
5-bit width     22 k (19%) 
4-bit width     20 k (17%) 
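
The relative area savings implied by Table 4 can be checked with a short calculation. The sketch below uses the rounded logic-element counts from the table (in thousands), so the results are approximate. The dictionary and function names are hypothetical.

```python
# Logic elements (in thousands) per data width, taken from Table 4.
logic_elements_k = {8: 34, 6: 26, 5: 22, 4: 20}


def area_saving_percent(bits, baseline_bits=8):
    """Percentage of logic elements saved relative to the baseline width."""
    base = logic_elements_k[baseline_bits]
    return 100.0 * (base - logic_elements_k[bits]) / base


for bits in (6, 5, 4):
    print(f"{bits}-bit: {area_saving_percent(bits):.0f}% fewer LEs than 8-bit")
```

All savings here are expressed relative to the 8-bit design, so the step from one width to the next is the difference between two of these cumulative figures.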






Figure 29 shows the area usage (relative to the 8-bit version) versus recognition accuracy for 
varying bit width solutions when using 55,000 training images and 10,000 test images. The 6-bit 
solution uses 24% fewer logic elements than the 8-bit solution with only a 6% accuracy drop 
(from 89% to 83%), as shown in Figure 29. The 5-bit solution saves another 11% of the logic 
elements compared with the 6-bit solution but only has a 5% accuracy drop (from 83% to 78%). 
The 4-bit design requires 6% fewer logic elements than the 5-bit design while maintaining 
accuracy similar to that design. Figure 29 shows the actual accuracy decrease with decreasing 
precision as well as a trend line. As the trend line shows, from 8-bit precision to 4-bit precision, 
the accuracy drops 11% (from 89% to 78%), and the area decreases by 41%. So area decreases at 




Figure 29. Accuracy (percentage of correct predictions) vs. area (relative to 8-bit version) for the 4-, 






Figure 30 shows data width versus area and a trend line for each FPGA design of the 2-layer 
fully-connected neural network. As depicted in the figure, the trend line shows that area grows at 




Figure 30. Data width vs. FPGA area (number of FPGA logic elements) 
 
 
However, online training is not always necessary. For example, for image 
recognition on a mobile phone, the network can be trained on a server or in the cloud, and the 
mobile phone needs only to perform the testing function, thus reducing the hardware and power 




connected neural network when working on different data widths after removing the training 
hardware. On average, including only the testing hardware requires 20% less hardware than 
including both testing and training hardware. Additional advantages of offline training are 
described in section 4.2. 
 
 
Table 5. FPGA area of forward propagation of 2-layer fully-connected neural network 
FPGA Solution   Logic Elements (% use f) 
8-bit           27 k (24%) 
6-bit           21 k (18%) 
5-bit           17 k (15%) 
4-bit           14 k (12%) 
f. Intel’s Cyclone IVE FPGA 
 
 
4.2 3-layer Fully-connected Neural Network on an FPGA 
The 3-layer fully-connected neural network with sigmoid activation function has higher 
recognition accuracy than the 2-layer network introduced in section 4.1. The recognition accuracy 
increases by 6.5% by adding one more layer, adjusting the biases associated with each non-output 
layer, and adjusting the learning rate. This section also shows that, compared with the sigmoid 
function, the D-ReLU function has lower execution time in software, requires less 




4.2.1 Recognition Accuracy 
This section compares the recognition accuracy of 2- and 3-layer fully-connected neural 
networks using sigmoid or D-ReLU activation functions in both software and hardware. Note that 
the designs are all trained in software and then tested in either software or hardware. The software 
designs use 32-bit floating point calculations, and the FPGA implementations use 8-bit fixed-
point calculations. The input image data and parameters are preprocessed as 8-bit integers before 
being sent to the FPGA.  
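
One common way to perform this kind of preprocessing is symmetric linear quantization, sketched below. Whether the preprocessing used here applies exactly this scaling is an assumption; the helper name `quantize_symmetric` and the sample weights are hypothetical.

```python
import numpy as np


def quantize_symmetric(values, bits=8):
    """Map floats to signed integers of the given width (hypothetical helper).

    Returns the integer array and the scale needed to recover approximate
    floating-point values (value ~= integer * scale).
    """
    values = np.asarray(values, dtype=np.float64)
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    peak = np.max(np.abs(values))
    scale = peak / qmax if peak > 0 else 1.0   # avoid dividing by zero
    q = np.clip(np.round(values / scale), -qmax, qmax).astype(np.int32)
    return q, scale


weights = np.array([-1.0, 0.0, 0.25, 1.0])     # made-up example parameters
q8, scale8 = quantize_symmetric(weights, bits=8)
q4, scale4 = quantize_symmetric(weights, bits=4)
print(q8, q4)
```

Lowering `bits` coarsens the grid of representable values, which is the mechanism behind the accuracy/area tradeoff explored for the 6-, 5- and 4-bit designs.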
Table 6 lists the prediction accuracies of 2- and 3-layer fully-connected neural networks with 
different activation functions – sigmoid function and D-ReLU function. As shown in Table 6, the 
accuracy of 3-layer fully-connected neural network with sigmoid function achieves 95.5%, which 
is 6.5% higher than the 2-layer fully-connected neural network reported in section 4.1. 
 
 
Table 6. Recognition accuracies of 2- & 3-layer fully-connected neural networks  
Implementation                   2-layer Fully-connected Network   3-layer Fully-connected Network 
w/ sigmoid in software           90.2%                             95.5% 
w/ middle D-ReLU in software     90.0%                             92.9% 
w/ mean D-ReLU in software       90.0%                             95.8% 
w/ sigmoid in hardware           90.3%                             95.6% 
w/ middle D-ReLU in hardware     89.6%                             92.7% 






The software versions of the three 2-layer fully-connected neural networks in Table 6 achieve 
similar accuracies (~90%) regardless of the type of activation function. The 
3-layer fully-connected neural network using the middle D-ReLU function achieves 92.9% 
prediction accuracy, only 2.6% less than the 95.5% of the network using the sigmoid activation 
function. The accuracy of the 3-layer fully-connected neural network with the mean D-ReLU 
activation function is even slightly higher than that of the network with the sigmoid activation function. 
When implemented on an FPGA, the prediction accuracies of the 2- and 3-layer fully-
connected neural networks are not affected by the data and parameter preprocessing. The 2-layer 
fully-connected neural network has the same prediction accuracy (~90%) whether implemented in 
software or built using an FPGA. The 3-layer fully-connected neural networks achieve similar 
prediction accuracies for both hardware and software implementations: ~95% when using the 
sigmoid activation function, and ~93% and ~96% when using the D-ReLU activation function. 
4.2.2 Performance 
Here we compare the execution time and power consumption of the 2- and 3-layer fully-
connected neural networks across activation functions – sigmoid vs. D-ReLU, and across 
platforms – software vs. hardware.  
First, we consider only the software implementations of each network. As expected, networks 




activation function. As discussed in section 3.6, the calculations of the D-ReLU function are 
simpler than the sigmoid function, so it requires less computation, and thus less execution time. 
As shown in Table 7 and Table 8, compared with the systems using the sigmoid activation 
function, the 2-layer fully-connected neural network with the D-ReLU function executes 14% 
faster, and the 3-layer fully-connected neural network with the D-ReLU function is 57% faster. 
Note that the execution time counts the forward propagation time only because training is 
completed in software for all systems. 
 
 
Table 7. Execution time and potential power savings of 2-layer fully-connected neural network  
Implementation            Clock Frequency   Execution Time   Potential Power Savings 
w/ sigmoid in software    3.6 GHz           2.9 seconds      1.0× 
w/ D-ReLU in software     3.6 GHz           2.5 seconds      1.2× 
w/ sigmoid in hardware    250 MHz           0.02 seconds     2,088× 











Table 8. Execution time and potential power savings of 3-layer fully-connected neural network  
Implementation            Clock Frequency   Execution Time   Potential Power Savings 
w/ sigmoid in software    3.6 GHz           9.9 seconds      1.0× 
w/ D-ReLU in software     3.6 GHz           4.3 seconds      2.3× 
w/ sigmoid in hardware    60 MHz            0.15 seconds     3,960× 
w/ D-ReLU in hardware     60 MHz            0.15 seconds     1,740× 
 
 
When built in hardware on an FPGA, the activation function used does not affect execution 
time, but all four fully-connected neural networks show potential for lower power consumption in 
the FPGA designs over the software implementations due to faster execution time and lower 
system frequency. As shown in Table 7 and Table 8, the execution time of the 2-layer fully-
connected neural network is 145 and 125 times less respectively, when using sigmoid and D-
ReLU functions, than their software solutions. Similarly, the execution time of the 3-layer fully-
connected neural network is 66 and 29 times less respectively, when using sigmoid and D-ReLU 
functions, than their software solutions. Moreover, the 2-layer fully-connected neural networks’ 
FPGA clock frequency is 14.4 times lower than the software frequency, and the 3-layer fully-
connected neural networks’ FPGA clock frequency is 60 times lower than the software 
frequency. Using the relationship that power is proportional to operating frequency, hardware 




effects of both decreased execution time and decreased clock frequency of the FPGA designs over 
the software implementations, these four fully-connected neural networks offer the potential of 
over a 1,700× decrease in power consumption when working on an FPGA. Thus, the FPGA 
hardware designs offer higher performance or lower power alternatives when compared to 
software implementations. 
 
4.2.3 FPGA Area 
The D-ReLU activation function requires less area and fewer computation cycles than the 
sigmoid activation function when built on an FPGA. In order to highlight the contribution of two 
kinds of activation functions, only the FPGA area of each network’s computation unit is 
compared. As shown in Table 9, 2- and 3-layer fully-connected neural networks using the D-
ReLU activation function use 41% and 24% less area respectively, compared with networks using 
the sigmoid activation function. Moreover, the D-ReLU function uses two fewer clock cycles to 











Table 9. FPGA areas of 2- & 3-layer fully-connected neural networks  
FPGA Solution   Logic Elements g 
                2-layer fully-connected network   3-layer fully-connected network 
w/ sigmoid      308                               564 
w/ D-ReLU       183                               428 
g. Intel’s Cyclone IVE FPGA 
 
 
By reducing the data width from 8 bits to 6, 5, and 4 bits as in section 4.1, we can also get a 
similar trend of trading off recognition accuracy for FPGA area. As shown in Table 10, if we 
reduce the precision from 8 to 4 bits for the two types of 3-layer neural networks, the recognition 
accuracy of the networks drops by only 2.7% and 4.8% respectively. At the same time, their 












Table 10. FPGA areas of 2- & 3-layer fully-connected neural networks with different activation 
functions 
FPGA Solution                Bit Precision   Logic Elements h   Recognition Accuracy 
3-layer network w/ sigmoid   8-bit           564                95.6% 
                             6-bit           495                95.1% 
                             5-bit           459                94.7% 
                             4-bit           404                92.9% 
3-layer network w/ D-ReLU    8-bit           428                92.7% 
                             6-bit           416                90.3% 
                             5-bit           407                88.6% 
                             4-bit           386                87.9% 
h. Intel’s Cyclone IVE FPGA 
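
A trend line of area versus data width can be obtained from the Table 10 values with an ordinary least-squares fit, as sketched below. This shows how such a line could be computed and is not necessarily the fitting procedure behind the figures. The variable and function names are hypothetical.

```python
import numpy as np

# Data widths and logic-element counts taken from Table 10.
bits = np.array([8, 6, 5, 4])
area_sigmoid = np.array([564, 495, 459, 404])
area_d_relu = np.array([428, 416, 407, 386])


def area_slope(widths, area):
    """Slope of the least-squares line, in logic elements per bit of precision."""
    slope, _intercept = np.polyfit(widths, area, 1)
    return slope


print(f"sigmoid: {area_slope(bits, area_sigmoid):.1f} LEs per bit")
print(f"D-ReLU:  {area_slope(bits, area_d_relu):.1f} LEs per bit")
```

Dividing a slope by a reference area (for example, the 4-bit or the 8-bit design) converts it to a percentage growth per bit; the resulting percentage depends on which reference is chosen.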
 
 
Figure 31 and Figure 32 show the data width versus area and a trend line for 3-layer fully-
connected neural network implemented on an FPGA. As depicted by the trend lines, the FPGA 
area of the 3-layer network with sigmoid function grows by ~9% per bit of precision, and the 


















4.3 Super-Skinny Convolutional Neural Network on an FPGA 
The experimental results described in this section show that the proposed SS-CNN produces 
high recognition accuracy on handwritten digits. Its implementation on an FPGA results in lower 
execution time and lower clock frequency without loss in recognition accuracy, which also 
indicates potential power saving. Moreover, the two FPGA solutions can each fit on a low-cost 
Cyclone IVE FPGA. 
4.3.1 Recognition Accuracy 
The recognition accuracy of the 8-bit integer hardware design is 98.8% as compared to the 32-
bit floating point software implementation that achieves an accuracy of 99.0%. The FPGA 
hardware solutions exhibit negligible recognition accuracy loss. 
As shown in Table 11, when compared with the 2- and 3-layer fully-connected neural 
networks we designed in [29][30], the recognition accuracies of handwritten digits using the SS-












Table 11. Recognition accuracies of three neural networks 
Implementation                                        Recognition Accuracy 
SS-CNN in software                                    99.0% 
SS-CNN in hardware                                    98.8% 
3-layer fully-connected neural network in software    95.5% 
3-layer fully-connected neural network in hardware    95.6% 
2-layer fully-connected neural network in software    90.3% 




The performance differences between software and hardware solutions, as measured by clock 
frequency and execution time, are summarized in Table 12. The clock frequency of the software 
solution is determined by the CPU frequency (3.6 GHz), and the clock frequencies of the FPGA 
designs are the highest frequencies at which the designs can run on the Cyclone IVE FPGA. The 









Table 12. Execution time and potential power savings of two SS-CNN solutions on FPGA 
Implementation                     Clock Frequency   Execution Time   Potential Power Savings 
Software solution                  3.6 GHz           6.5 seconds      1× 
Sequential solution in hardware    37 MHz            9.4 seconds      67× 
Parallel solutions in hardware     30 MHz            2.2 seconds      355× 
 
 
As shown in Table 12, the FPGA solution with sequential feature channel processing runs with 
a 37 MHz clock frequency, which is 97 times lower than that of the software solution. Even 
though its execution time is longer, it still offers a potential power saving of 67× 
(i.e., (3.6 GHz/37 MHz)×(6.5 s/9.4 s)) when compared to the software solution. The potential 
power saving of the parallel feature channel processing solution is much higher. This solution 
has both a lower clock frequency and a lower execution time, which results in a potential power 
saving of 355× (i.e., (3.6 GHz/30 MHz)×(6.5 s/2.2 s)) when compared to the software solution. 
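
The potential power savings quoted for the two hardware solutions follow from the product of the clock-frequency ratio and the execution-time ratio. The helper below restates that arithmetic in Python; the function name is hypothetical, and the model rests on the stated assumption that power is proportional to operating frequency.

```python
def potential_power_savings(f_sw_hz, f_hw_hz, t_sw_s, t_hw_s):
    """Frequency ratio times execution-time ratio (software over hardware)."""
    return (f_sw_hz / f_hw_hz) * (t_sw_s / t_hw_s)


# Values from Table 12: software at 3.6 GHz and 6.5 s.
sequential = potential_power_savings(3.6e9, 37e6, 6.5, 9.4)
parallel = potential_power_savings(3.6e9, 30e6, 6.5, 2.2)
print(f"sequential: {sequential:.0f}x, parallel: {parallel:.0f}x")
```

The same function applies to the fully-connected networks: for example, 3.6 GHz versus 25 MHz with 3.7 s versus 3.8 s gives roughly 140×.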
4.3.3 FPGA Area 
Both the sequential and parallel FPGA solutions fit on Intel’s low-cost Cyclone IVE FPGA. 
The SS-CNN hardware design with sequential feature channel processing requires 20k logic 
elements (LEs) on the Cyclone IVE FPGA, which is only 17% of the total LEs available on that 
chip. The parallel feature channel processing solution requires 98k logic elements, which is 86% 




parallelized – that is, the Max-Pooling and fully-connected layers are the same in both hardware 
solutions – so the FPGA area requirement of the parallel feature channel processing solution is 
less than six times that of the sequential feature channel processing solution.  
 
 
Table 13. FPGA areas of two SS-CNN solutions 
FPGA Solution         Logic Elements (% use i) 
Sequential solution   20k (17%) 
Parallel solution     98k (86%) 





Chapter 5  Conclusion 
 
In this research, we designed, built, and tested three neural networks on an FPGA. We also 
explored the FPGA area/precision tradeoff and introduced a novel activation function, the D-
ReLU function. We analyzed each design in terms of execution time, prediction accuracy, area, 
and power consumption. The proposed FPGA solutions result in lower execution times, fewer 
computation resources, and lower power consumption than comparable software solutions at little 
or no cost to prediction accuracy. 
The FPGA hardware design of the 2-layer fully-connected neural network with online training 
presented in section 3.2 offers a high-performance, low power alternative to traditional software 
methods. The 8-bit hardware design of the 2-layer neural network performs with similar execution 
time (3.8 seconds) and recognition accuracy (89%) as the 32-bit software solution running at a 
clock speed 144 times greater than the hardware design (3.6 GHz vs. 25 MHz). This difference in 
clock frequency indicates that the hardware solution offers either lower power consumption or 
potentially increased performance of 144 times, at no cost to recognition accuracy, as compared to 
the software solution. Furthermore, a reduction in precision from 32 to 8 bits results in no 
decrease in recognition accuracy. Additional reductions in precision below 8 bits result in only 
small reductions in recognition accuracy (4% recognition accuracy reduction per bit of reduced 




decrease that falls off more quickly than the decrease in recognition accuracy (4% decrease in 
area per percent decrease in recognition accuracy).  
The D-ReLU activation function proposed in section 3.6 offers a more flexible and accurate 
algorithm than the traditional ReLU function. It also results in a faster, more power-efficient 
design when compared to the software implementation without incurring loss in recognition 
accuracy. As depicted in section 4.2, compared with networks using the sigmoid activation function, 
the 2- and 3-layer fully-connected neural networks using the D-ReLU activation function are 14% 
and 57% faster during the testing phase and use 41% and 24% less FPGA area. Moreover, because 
they operate at a lower clock frequency and require less execution time, the FPGA solutions of 
the 2- and 3-layer fully-connected neural networks offer a low power alternative to traditional 
software methods. These fully-connected neural networks implemented on an FPGA offer the 
potential of being over 1,700× more power efficient than comparable software solutions. 
The CNN architecture presented in section 3.4 offers a high-performance, compact neural 
network with state-of-the-art (99%) recognition accuracy, compared with other recent neural 
networks applied to handwritten digit recognition. The two FPGA solutions of the proposed SS-
CNN also offer high-performance and low power alternatives to the software solution. The 8-bit 
hardware designs achieve recognition accuracy (~99%) that is similar to the 32-bit floating point 
software solution. Moreover, the two proposed hardware designs on the FPGA indicate 67 to 355 




Furthermore, all the hardware solutions of these neural networks can be readily ported to 
other FPGA platforms. Moreover, the FPGA chip used in this system is extremely low-cost 
compared with modern FPGA chips. For example, Intel Arria 10 FPGAs contain over 1 
million logic elements (LEs), but the FPGA chip used in this dissertation contains only 0.1 
million LEs and can still fit all the neural networks introduced in Chapter 3. 
In all, FPGA solutions of neural networks provide fast, power efficient, low-cost and portable 
alternatives to the software solutions. Moreover, FPGA solutions offer flexible, customizable, and 




Appendix A Software Code of Neural Networks 
A.1 2-layer Fully-connected Neural Network in Python 
import sys 
import numpy as np 
from tensorflow.examples.tutorials.mnist import input_data 
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True) 
 
# sigmoid function 
def sigmoid(x): 
    return 1/(1+np.exp(-x)) 
 
y_test_result = np.zeros((10000,10)) 
error_cnt = 0 
 
# input images 
X_train = mnist.train.images 
X_test  = mnist.test.images 
 
# input labels 
y_train = mnist.train.labels 
y_test  = mnist.test.labels 
 
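np.random.seed(1) 
# weights initialised uniformly in [-0.25, 0.25) 
W = (2*np.random.random((784,10)) - 1)*0.25 
# online training: one weight update per training image 
for i in range(1): 
    l0 = np.array([X_train[i]])             # input layer, shape (1, 784) 
    l1 = sigmoid(np.dot(l0,W))              # output layer, shape (1, 10) 
    l1_error = np.array([y_train[i]]) - l1  # target minus prediction 
    slope = l1*(1-l1)                       # sigmoid derivative at the output 
    l1_change = l1_error * slope            # output delta 
    l1_delta = np.dot(l0.T, l1_change)      # weight gradient, shape (784, 10) 
    W += l1_delta*0.05                      # learning rate 0.05 
# testing: forward propagation only 
for z in range(1): 
    l0_test = np.array([X_test[z]]) 
    y_test_result[z] = sigmoid(np.dot(l0_test,W)) 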
    if np.argmax(y_test_result[z])==np.argmax(y_test[z]): 












void rewind(FILE *f); 
 
int ReverseInt(int i) 
{ 
 unsigned char ch1, ch2, ch3, ch4; 
 ch1 = i & 255; 
 ch2 = (i >> 8) & 255; 
 ch3 = (i >> 16) & 255; 
 ch4 = (i >> 24) & 255; 
 return((int)ch1 << 24) + ((int)ch2 << 16) + ((int)ch3 << 8) + ch4; 
} 
 
float sigmoid(float x) 
{ 
 return 1 / (1 + exp(-x)); 
} 
 
float sigmoid_deri(float y) 
{ 
 return y * (1 - y); 
} 
 
int max_index(float a[10]) 
{ 
 float max = 0.0; 
 int index = 0; 
 for (int c = 0; c < 10; c++) 
 { 
  if (a[c] > max) 
  { 
   index = c; 
   max = a[c]; 









 int a1, a2, num1, num2, high, width, count_train, count_test; 
 unsigned char image_train[784], image_test[784], label_train, label_test; 
 unsigned char t[10]; 
 float e, w0[784][40], u0[40][10], b0[40], c0[10], image[784]; 
 float s[40], y[40], r[10], z[10], error[10], E; 
 float outError[10]; 
 float tempErrorSum; 
 float hiddenErrorSum[40]; 
 FILE *fp_image_train = fopen("../../train-images.idx3-ubyte", "rb"); 
 FILE *fp_label_train = fopen("../../train-labels.idx1-ubyte", "rb"); 
 FILE *fp_image_test = fopen("../../t10k-images.idx3-ubyte", "rb"); 
 FILE *fp_label_test = fopen("../../t10k-labels.idx1-ubyte", "rb"); 
 FILE *fp = fopen("data_mnist_sigmoid.dat", "wb"); 
 srand(1); 
 for (int i = 0; i < 784; i++) 
  for (int j = 0; j < 40; j++) 
   w0[i][j] = rand() % 10000 / 10000.0 - 0.5; 
 
 for (int i = 0; i < 40; i++) 
  for (int j = 0; j < 10; j++) 
   u0[i][j] = rand() % 10000 / 10000.0 - 0.5; 
 
 for (int i = 0; i < 40; i++) 
  b0[i] = rand() % 10000 / 10000.0 - 0.5; 
 
 for (int i = 0; i < 10; i++) 
  c0[i] = rand() % 10000 / 10000.0 - 0.5; 
 
 e = 0.08; 
  
 for (int epoch = 0; epoch < 100; epoch++) 
 { 
  rewind(fp_image_train); 
  rewind(fp_label_train); 
  fread(&a1, sizeof(int), 1, fp_image_train); 




  fread(&num1, sizeof(int), 1, fp_image_train); 
  num1 = ReverseInt(num1); 
  fread(&high, sizeof(int), 1, fp_image_train); 
  high = ReverseInt(high); 
  fread(&width, sizeof(int), 1, fp_image_train); 
  width = ReverseInt(width); 
  fread(&a2, sizeof(int), 1, fp_label_train); 
  a2 = ReverseInt(a2); 
  fread(&num2, sizeof(int), 1, fp_label_train); 
  num2 = ReverseInt(num2); 
 
  printf("Beginning of epoch %d\n", epoch + 1); 
  count_train = 0; 
  for (int k = 0; k < 60000; k++) 
  { 
   fread(image_train, sizeof(char), 28 * 28, fp_image_train); 
   fread(&label_train, sizeof(char), 1, fp_label_train); 
 
   for (int i = 0; i < 784; i++) 
    image[i] = image_train[i] / 255.0; 
 
   for (int j = 0; j < 10; j++) 
    if (j == label_train) t[j] = 1; 
    else t[j] = 0; 
 
   for (int i = 0; i < 40; i++) 
   { 
    s[i] = 0.0; 
    for (int j = 0; j < 784; j++) 
     s[i] = s[i] + image[j] * w0[j][i]; 
    y[i] = sigmoid(s[i] + b0[i]); 
   } 
 
   E = 0.0; // reset the accumulated error once per image, not once per output 
   for (int i = 0; i < 10; i++) 
   { 
    r[i] = 0.0; 
    for (int j = 0; j < 40; j++) 
     r[i] = r[i] + y[j] * u0[j][i]; 
    z[i] = sigmoid(r[i] + c0[i]); 




   } 
 
   for (int i = 0; i < 10; i++) 
   { 
    outError[i] = e * error[i] * sigmoid_deri(z[i]); 
    c0[i] = c0[i] - outError[i]; 
   } 
 
   for (int i = 0; i < 40; i++) 
    for (int j = 0; j < 10; j++) 
    { 
     u0[i][j] = u0[i][j] - outError[j] * y[i]; 
    } 
 
   for (int i = 0; i < 40; i++) 
   { 
    tempErrorSum = 0.0; 
    for (int j = 0; j < 10; j++) 
     tempErrorSum = tempErrorSum + outError[j] * u0[i][j]; 
    hiddenErrorSum[i] = tempErrorSum * sigmoid_deri(y[i]); 
    b0[i] = b0[i] - hiddenErrorSum[i]; 
   } 
 
   for (int i = 0; i < 784; i++) 
    for (int j = 0; j < 40; j++) 
    { 
     w0[i][j] = w0[i][j] - hiddenErrorSum[j] * image[i]; 
    } 
 
   if (max_index(z) == label_train) 
    count_train++; 
  } 
  printf("End of Epoch %d\n", epoch + 1); 
  printf("training accuracy = %3.2f%%\n", count_train / 600.0); 
 } 
 
 fread(&a1, sizeof(int), 1, fp_image_test); 
 a1 = ReverseInt(a1); 
 fread(&num1, sizeof(int), 1, fp_image_test); 
 num1 = ReverseInt(num1); 




 high = ReverseInt(high); 
 fread(&width, sizeof(int), 1, fp_image_test); 
 width = ReverseInt(width); 
 fread(&a2, sizeof(int), 1, fp_label_test); 
 a2 = ReverseInt(a2); 
 fread(&num2, sizeof(int), 1, fp_label_test); 
 num2 = ReverseInt(num2); 
 
 printf("Beginning of Testing\n"); 
 count_test = 0; 
 for (int k = 0; k < 10000; k++) 
 { 
  fread(image_test, sizeof(char), 28 * 28, fp_image_test); 
  fread(&label_test, sizeof(char), 1, fp_label_test); 
 
  for (int i = 0; i < 784; i++) 
   image[i] = image_test[i] / 255.0; 
 
  for (int j = 0; j < 10; j++) 
   if (j == label_test) t[j] = 1; 
   else t[j] = 0; 
 
  for (int i = 0; i < 40; i++) 
  { 
   s[i] = 0.0; 
   for (int j = 0; j < 784; j++) 
    s[i] = s[i] + image[j] * w0[j][i]; 
   y[i] = sigmoid(s[i] + b0[i]); 
  } 
 
  E = 0.0; // reset the accumulated error once per image, not once per output 
  for (int i = 0; i < 10; i++) 
  { 
   r[i] = 0.0; 
   for (int j = 0; j < 40; j++) 
    r[i] = r[i] + y[j] * u0[j][i]; 
   z[i] = sigmoid(r[i] + c0[i]); 
   error[i] = z[i] - t[i]; 
   E = E + (error[i] * error[i]) / 2; 
  } 




   count_test++; 
 } 
 printf("End of Testing\n"); 





















void rewind(FILE *f); 
 
float sigmoid(float x) 
{ 
 return 1 / (1 + exp(-x)); 
} 
 
float sigmoid_deri(float y) 
{ 
 return y * (1 - y); 
} 
 
void maxpool(float a, float b, float c, float d, float *max_num, int *max_index) 
{ 
 *max_num = a; 
 *max_index = 0; 
 
 if (b > *max_num) 
 { 
  *max_num = b; 
  *max_index = 1; 
 } 
 if (c > *max_num) 
 { 
  *max_num = c; 
  *max_index = 2; 
 } 
 if (d > *max_num) 
 { 
  *max_num = d; 







int max_out(float a[10]) 
{ 
 float max = 0.0; 
 int index = 0; 
 for (int c = 0; c < 10; c++) 
 { 
  if (a[c] > max) 
  { 
   index = c; 
   max = a[c]; 
  } 
 } 





 int a1, a2, num1, num2, high, width, count_train, count_test; 
 unsigned char image_train[28][28], image_test[28][28], label_train, label_test; 
 unsigned char t[10]; 
 int pool_index[6][12][12]; 
 float e, image[28][28], error[10]; 
 float w00[6][5][5], w20[6][12][12][45], w30[45][10]; 
 float w0[6][5][5], w2[6][12][12][45], w3[45][10]; 
 float b0[6], b2[45], b3[10]; 
 float s0[6][24][24], y0[6][24][24], s1[6][12][12], y1[6][12][12], s2[45], y2[45], s3[10], z[10]; 
 float outError[10]; 
 float temp3ErrorSum, temp2ErrorSum, temp1ErrorSum, tan_temp, temp0ErrorSum; 
 float hidden3ErrorSum[45], hidden2ErrorSum[6][12][12], hidden1ErrorSum[6][24][24]; 
 FILE *fp_image_train = fopen("../../train-images.idx3-ubyte", "rb"); 
 FILE *fp_label_train = fopen("../../train-labels.idx1-ubyte", "rb"); 
 FILE *fp_image_test = fopen("../../t10k-images.idx3-ubyte", "rb"); 
 FILE *fp_label_test = fopen("../../t10k-labels.idx1-ubyte", "rb"); 
 FILE *fp = fopen("CNN_MNIST_sigmoid.dat", "wb"); 
 
 for (int i = 0; i < 6; i++) 
  for (int j = 0; j < 5; j++) 
   for (int k = 0; k < 5; k++) 
    w00[i][j][k] = rand() % 10000 / 10000.0 - 0.5; 
 
 for (int i = 0; i < 6; i++) 
  for (int m = 0; m < 12; m++) 
   for (int n = 0; n < 12; n++) 
    for (int j = 0; j < 45; j++) 
     w20[i][m][n][j] = rand() % 10000 / 10000.0 - 0.5; 
 
 for (int i = 0; i < 45; i++) 
  for (int j = 0; j < 10; j++) 
   w30[i][j] = rand() % 10000 / 10000.0 - 0.5; 
 
 for (int i = 0; i < 6; i++) 
  b0[i] = rand() % 10000 / 10000.0 - 0.5; 
 
 for (int i = 0; i < 45; i++) 
  b2[i] = rand() % 10000 / 10000.0 - 0.5; 
 
 for (int i = 0; i < 10; i++) 
  b3[i] = rand() % 10000 / 10000.0 - 0.5; 
 
 e = 0.1; 
 
 for (int epoch = 0; epoch < 100; epoch++) 
 { 
  rewind(fp_image_train); 
  rewind(fp_label_train); 
  fread(&a1, sizeof(int), 1, fp_image_train); 
  fread(&num1, sizeof(int), 1, fp_image_train); 
  fread(&high, sizeof(int), 1, fp_image_train); 
  fread(&width, sizeof(int), 1, fp_image_train); 
  fread(&a2, sizeof(int), 1, fp_label_train); 
  fread(&num2, sizeof(int), 1, fp_label_train); 
 
  printf("Beginning of epoch %d\n", epoch + 1); 
  count_train = 0; 
 
  for (int k = 0; k < 60000; k++) 
  { 
   fread(image_train, sizeof(char), 28 * 28, fp_image_train); 
   fread(&label_train, sizeof(char), 1, fp_label_train); 
   for (int i = 0; i < 28; i++) 
    for (int j = 0; j < 28; j++) 
     image[i][j] = image_train[i][j] / 255.0; 
 
   for (int j = 0; j < 10; j++) 
    if (j == label_train) t[j] = 1; 
    else t[j] = 0; 
 
   //hidden layer 1 -- conv 
   for (int i = 0; i < 6; i++) 
    for (int m = 0; m < 24; m++) 
     for (int n = 0; n < 24; n++) 
     { 
      s0[i][m][n] = 0.0; 
      for (int j = 0; j < 5; j++) 
       for (int k = 0; k < 5; k++) 
        s0[i][m][n] = s0[i][m][n] + image[j + m][k + n] * w00[i][j][k]; 
      y0[i][m][n] = sigmoid(s0[i][m][n] + b0[i]); 
     } 
 
   //hidden layer 2 -- pool 
   for (int i = 0; i < 6; i++) 
    for (int m = 0; m < 12; m++) 
     for (int n = 0; n < 12; n++) 
     { 
      maxpool(y0[i][2 * m][2 * n], y0[i][2 * m][2 * n + 1], 
              y0[i][2 * m + 1][2 * n], y0[i][2 * m + 1][2 * n + 1], 
              &s1[i][m][n], &pool_index[i][m][n]); 
      y1[i][m][n] = s1[i][m][n]; 
     } 
 
   //hidden layer 3 -- FC 
   for (int k = 0; k < 45; k++) 
   { 
    s2[k] = 0.0; 
    for (int i = 0; i < 6; i++) 
     for (int m = 0; m < 12; m++) 
      for (int n = 0; n < 12; n++) 
       s2[k] = s2[k] + y1[i][m][n] * w20[i][m][n][k]; 
    y2[k] = sigmoid(s2[k] + b2[k]); 
   } 
 
   //output layer 
   for (int i = 0; i < 10; i++) 
   { 
    s3[i] = 0.0; 
    for (int j = 0; j < 45; j++) 
     s3[i] = s3[i] + y2[j] * w30[j][i]; 
    z[i] = sigmoid(s3[i] + b3[i]); 
    error[i] = z[i] - t[i]; 
   } 
 
   //back propagation 
   //output layer 
   for (int i = 0; i < 10; i++) 
   { 
    outError[i] = e * error[i] * sigmoid_deri(z[i]); 
    b3[i] = b3[i] - outError[i]; 
   } 
 
   for (int i = 0; i < 45; i++) 
    for (int j = 0; j < 10; j++) 
     w3[i][j] = w30[i][j] - outError[j] * y2[i]; 
 
   //hidden layer 3 -- FC 
   for (int i = 0; i < 45; i++) 
   { 
    temp3ErrorSum = 0.0; 
    for (int j = 0; j < 10; j++) 
     temp3ErrorSum = temp3ErrorSum + outError[j] * w30[i][j]; 
    hidden3ErrorSum[i] = temp3ErrorSum * sigmoid_deri(y2[i]); 
    b2[i] = b2[i] - hidden3ErrorSum[i]; 
   } 
 
   for (int i = 0; i < 6; i++) 
    for (int m = 0; m < 12; m++) 
     for (int n = 0; n < 12; n++) 
      for (int j = 0; j < 45; j++) 
       w2[i][m][n][j] = w20[i][m][n][j] - hidden3ErrorSum[j] * y1[i][m][n]; 
 
   //hidden layer 2 -- pool 
   for (int i = 0; i < 6; i++) 
   { 
    temp1ErrorSum = 0.0; 
    for (int m = 0; m < 12; m++) 
     for (int n = 0; n < 12; n++) 
     { 
      temp2ErrorSum = 0.0; 
      for (int j = 0; j < 45; j++) 
       temp2ErrorSum = temp2ErrorSum + hidden3ErrorSum[j] * w20[i][m][n][j]; 
      hidden2ErrorSum[i][m][n] = temp2ErrorSum * sigmoid_deri(y1[i][m][n]); 
      temp1ErrorSum = temp1ErrorSum + hidden2ErrorSum[i][m][n]; 
     } 
    b0[i] = b0[i] - temp1ErrorSum; 
   } 
 
   //hidden layer 1 -- conv 
   for (int i = 0; i < 6; i++) 
    for (int j = 0; j < 5; j++) 
     for (int k = 0; k < 5; k++) 
     { 
      temp0ErrorSum = 0.0; 
      for (int m = 0; m < 12; m++) 
       for (int n = 0; n < 12; n++) 
       { 
        int i3 = pool_index[i][m][n] / 2; 
        int j3 = pool_index[i][m][n] % 2; 
        temp0ErrorSum = temp0ErrorSum + hidden2ErrorSum[i][m][n] * 
         image[2 * m + j + i3][2 * n + k + j3]; 
       } 
      w00[i][j][k] = w00[i][j][k] - temp0ErrorSum; 
     } 
 
   for (int i = 0; i < 6; i++) 
    for (int m = 0; m < 12; m++) 
     for (int n = 0; n < 12; n++) 
      for (int j = 0; j < 45; j++) 
       w20[i][m][n][j] = w2[i][m][n][j]; 
 
   for (int i = 0; i < 45; i++) 
    for (int j = 0; j < 10; j++) 
     w30[i][j] = w3[i][j]; 
 
   if (max_out(z) == label_train) 
    count_train++; 
  } 
  printf("End of Epoch %d\n", epoch + 1); 
  printf("training accuracy = %3.2f%%\n", count_train / 600.0); 
 } 
 
 fread(&a1, sizeof(int), 1, fp_image_test); 
 fread(&num1, sizeof(int), 1, fp_image_test); 
 fread(&high, sizeof(int), 1, fp_image_test); 
 fread(&width, sizeof(int), 1, fp_image_test); 
 fread(&a2, sizeof(int), 1, fp_label_test); 
 fread(&num2, sizeof(int), 1, fp_label_test); 
 
 printf("Beginning of Testing\n"); 
 count_test = 0; 
 for (int k = 0; k < 10000; k++) 
 { 
  fread(image_test, sizeof(char), 28 * 28, fp_image_test); 
  fread(&label_test, sizeof(char), 1, fp_label_test); 
 
  for (int i = 0; i < 28; i++) 
   for (int j = 0; j < 28; j++) 
    image[i][j] = image_test[i][j] / 255.0; 
 
  for (int j = 0; j < 10; j++) 
   if (j == label_test) t[j] = 1; 
   else t[j] = 0; 
 
  //hidden layer 1 -- conv 
  for (int i = 0; i < 6; i++) 
   for (int m = 0; m < 24; m++) 
    for (int n = 0; n < 24; n++) 
    { 
     s0[i][m][n] = 0.0; 
     for (int j = 0; j < 5; j++) 
      for (int k = 0; k < 5; k++) 
       s0[i][m][n] = s0[i][m][n] + image[j + m][k + n] * w00[i][j][k]; 
     y0[i][m][n] = sigmoid(s0[i][m][n] + b0[i]); 
    } 
 
  //hidden layer 2 -- pool 
  for (int i = 0; i < 6; i++) 
   for (int m = 0; m < 12; m++) 
    for (int n = 0; n < 12; n++) 
    { 
     maxpool(y0[i][2 * m][2 * n], y0[i][2 * m][2 * n + 1], 
             y0[i][2 * m + 1][2 * n], y0[i][2 * m + 1][2 * n + 1], 
             &s1[i][m][n], &pool_index[i][m][n]); 
     y1[i][m][n] = s1[i][m][n]; 
    } 
 
  //hidden layer 3 -- FC 
  for (int k = 0; k < 45; k++) 
  { 
   s2[k] = 0.0; 
   for (int i = 0; i < 6; i++) 
    for (int m = 0; m < 12; m++) 
     for (int n = 0; n < 12; n++) 
      s2[k] = s2[k] + y1[i][m][n] * w20[i][m][n][k]; 
   y2[k] = sigmoid(s2[k] + b2[k]); 
  } 
  //output layer 
  for (int i = 0; i < 10; i++) 
  { 
   s3[i] = 0.0; 
   for (int j = 0; j < 45; j++) 
    s3[i] = s3[i] + y2[j] * w30[j][i]; 
   z[i] = sigmoid(s3[i] + b3[i]); 
   error[i] = z[i] - t[i]; 
  } 
  if (max_out(z) == label_test) 
   count_test++; 
 } 
 printf("End of Testing\n"); 










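Every weight and bias in the listings above is initialised with the expression rand() % 10000 / 10000.0 - 0.5. The sketch below isolates that expression to make its range explicit. The function name init_weight is hypothetical. In the visible part of the listings srand is never called, so under the C standard rand starts from the same default seed on every run and the initial weights are reproducible.

```c
#include <stdlib.h>

/* Same expression as in the listings: rand() % 10000 is an integer in
   0..9999, so the result lies in [-0.5, 0.4999] in steps of 0.0001. */
float init_weight(void)
{
    return (float)(rand() % 10000 / 10000.0 - 0.5);
}
```

The values are therefore small and centred close to zero, which keeps the initial sigmoid inputs away from the saturated regions for the layer sizes used here.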
Appendix B Hardware Code of Neural Networks 
B.1 2-layer Fully-connected Neural Network with Online Training in SystemVerilog 
File: control_unit.sv 
The control unit controls the workflow of the whole system. 
module control_unit (clk, rst, start, train, test, bp_done, fp_done, train_reg, test_reg, do_fp, do_bp, show); 
   
   parameter IMG_SZ = 784; 
   parameter OUTPUT_SZ = 10;   
  
    input  logic              clk, rst, start, train, test; 
    input  logic              bp_done, fp_done; 
    output logic              train_reg, test_reg; 
//  output logic [ADDR_WIDTH-1:0] data_addr; 
    output logic             do_fp, do_bp, show; 
 
    enum logic [1:0] {idle, fwd_prop, back_prop, display} cs, ns; 
 
    logic started, clear_start; 
//  logic [ADDR_WIDTH-1:0] addr_reg; 
   
//  assign data_addr = addr_reg; 
//  assign start = test | train; 
 
    always_ff @(posedge clk, posedge rst) begin 
        if (rst) begin 
            cs <= idle; 
            started <= 1'b0; 
            train_reg <= 1'b0; 
            test_reg <= 1'b0; 
        end 
        else begin 
            cs <= ns; 
            // Buffer the start and train pulse 
            train_reg <= (train) ? 1'b1 : ((test) ? 1'b0 : train_reg); 
            test_reg <= (test) ? 1'b1 : ((train) ? 1'b0 : test_reg); 
            started <= (start) ? 1'b1 : ((clear_start) ? 1'b0 : started); 
 end 
    end  
 
    // Next state and output logic 
   always_comb begin 
    do_fp = 0; 
    show = 0; 
    do_bp = 0; 
    clear_start = 0; 
        case (cs)  
            idle: begin 
   ns = (started) ? fwd_prop : idle; 
            end 
            fwd_prop: begin 
                do_fp = (fp_done) ? 0 : 1; 
    ns = (fp_done) ? ((train | train_reg) ? back_prop : display) : fwd_prop; 
                clear_start = (test_reg & fp_done) ? 1 : 0; 
            end 
            back_prop: begin 
                do_bp = (bp_done) ? 0 : 1; 
                ns = (bp_done) ? display : back_prop; 
                clear_start = (bp_done) ? 1 : 0; 
            end 
       display: begin 
   show = 1; 
   ns = idle; 
        end 
    endcase 
    end 
endmodule 

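The next-state logic of the control unit above can be modelled in software as a four-state machine: idle, forward propagation, back propagation (training only) and display. The following C sketch mirrors the case statement of the SystemVerilog source. The type and function names are assumptions made for illustration, and the parameter training stands for the expression (train | train_reg).

```c
typedef enum { CS_IDLE, CS_FWD_PROP, CS_BACK_PROP, CS_DISPLAY } ctrl_state;

/* One evaluation of the next-state logic of the control unit. */
ctrl_state ctrl_next(ctrl_state cs, int started, int training,
                     int fp_done, int bp_done)
{
    switch (cs) {
    case CS_IDLE:
        return started ? CS_FWD_PROP : CS_IDLE;
    case CS_FWD_PROP:
        if (!fp_done)
            return CS_FWD_PROP;                 /* keep do_fp asserted */
        return training ? CS_BACK_PROP : CS_DISPLAY;
    case CS_BACK_PROP:
        return bp_done ? CS_DISPLAY : CS_BACK_PROP;
    case CS_DISPLAY:
        return CS_IDLE;                         /* show lasts one cycle */
    }
    return CS_IDLE;
}
```

A test image therefore takes the path idle, forward propagation, display, while a training image inserts back propagation before display.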
The tile module performs the forward- or backward-propagation calculations and updates the weights in the Weights RAM. 
module tile (clk, rst, do_bp, train_reg, do_fp, test_reg, fp_done, bp_done, get_weights0, image,  
   weights0, result, label, weight_delta, save_weight, enable_update, save_update); 
    
   parameter IMG_SZ = 784; 
   parameter OUTPUT_SZ = 10;   
    
   // Inputs 
   input logic clk, rst; 
   input logic do_bp, train_reg, do_fp, test_reg; 
   input logic [IMG_SZ-1:0] [7:0] image; 
   input logic [OUTPUT_SZ-1:0] [7:0]  weights0; 
   input logic [OUTPUT_SZ-1:0] [7:0] label; 
    
   // Outputs 
   output logic    fp_done; 
   output logic    bp_done; 
   output logic    get_weights0; 
   output logic [OUTPUT_SZ-1:0] [7:0] result; 
   output logic [OUTPUT_SZ-1:0] [7:0]  weight_delta; 
   output logic     save_weight, enable_update, save_update; 
    
   logic      clear, enable_delta; 
   logic [OUTPUT_SZ-1:0]   [7:0]   acc_lay0; 
    
   logic [$clog2(IMG_SZ):0]      lay0f_idx; 
   logic [$clog2(IMG_SZ):0]      lay0b_idx; 
   logic      enable_fp; 
  
   logic [OUTPUT_SZ-1:0] [23:0]   weight_change; 
   logic [OUTPUT_SZ-1:0] [15:0]   slope; 
   
   logic [OUTPUT_SZ-1:0] [7:0]   error; 
   logic [OUTPUT_SZ-1:0] [7:0]   ideal; 
 
   enum  logic [2:0] {S_FPIDLE, S_GET_WEIGHTS, S_FPROP_LAYER1, S_FPDONE, S_WAIT} fpcs, fpns; 
   enum  logic [2:0] {S_BPIDLE, S_EN_UPDATE, S_GET_DATA, S_BPROP_CALCDELTA, S_BPDONE} bpcs, bpns; 
 
   always_ff @(posedge clk, posedge rst) 
     if (rst) fpcs <= S_FPIDLE; 
     else fpcs <= fpns; 
 
   always_ff @(posedge clk, posedge rst) 
     if (rst) bpcs <= S_BPIDLE; 
     else bpcs <= bpns; 
 
   integer      j; 
   always_comb begin 
      ideal = {10{8'h0}}; 
 
      for (j = 0; j < OUTPUT_SZ; j++) begin 
  if (label[j] == 1)  
    ideal[j] = 1 << 5; //1 
      end 
       
      for (j = 0; j < OUTPUT_SZ; j++) 
  error[j] = ideal[j] - result[j]; 
   end 
    
   genvar i, m; 
   // Generate forward propagation neurons 
   generate 
      for (i = 0; i < OUTPUT_SZ; i++) begin: forward 
         neuron_f neuron_f (.clk(clk), .rst(rst), .clear(clear), .en(enable_fp), .weight(weights0[i]), 
   .data(image[783-lay0f_idx]), .accum(acc_lay0[i])); 
         sigmoid_plan sigmoid_f(.clk(clk), .rst(rst), .enable(enable_fp), .in(acc_lay0[i]), .out(result[i])); 
      end  
   endgenerate 
 
   // Generate backwards propagation neurons / multipliers 
   generate 
  for (i = 0; i < OUTPUT_SZ; i++) begin: backward  
   multiplier8 mult1(({8{enable_delta}} & result[i]), (8'h20-result[i]), slope[i]); 
   multiplier16 mult2(({8{enable_delta}} & error[i]), slope[i], weight_change[i]); 





  end 
   endgenerate 
 
  always_comb begin 
      fp_done = 1'b0; 
      clear = 1'b0; 
      enable_fp = 1'b0; 
     save_weight = 1'b0; 
      get_weights0 = 1'b0; 
 
      case (fpcs) 
 S_FPIDLE: begin 
    clear = do_fp || do_bp; 
    get_weights0 = do_fp || do_bp; 
    fpns = (do_fp || do_bp) ? S_GET_WEIGHTS : S_FPIDLE; 
 end 
  
 S_GET_WEIGHTS: begin 
    save_weight = 1; 
    fpns = S_FPROP_LAYER1;   
 end 
 
 S_FPROP_LAYER1: begin 
    enable_fp = 1'b1; 
    fpns = lay0f_idx < IMG_SZ-1 ? S_FPROP_LAYER1 : S_FPDONE;   
 end 
   
 S_FPDONE: begin 
    fp_done = 1'b1; 
    fpns = (~train_reg) ? S_FPIDLE : bp_done ? S_FPIDLE : S_WAIT;  
 end 
  
 S_WAIT: begin 
  fpns = bp_done ? S_FPIDLE : S_WAIT; 
 end 
  
 default: fpns = S_FPIDLE;   
      endcase 
   end  
 
   always_comb begin 
      bp_done = 1'b0; 
      enable_delta = 1'b0; 
      enable_update = 1'b0; 
 save_update = 1'b0; 
  
      case (bpcs) 
 S_BPIDLE: begin 
  bpns = do_bp ? S_EN_UPDATE : S_BPIDLE; 
 end 
 
 S_EN_UPDATE: begin 
       enable_update = 1'b1; 
  bpns = S_GET_DATA;   
 end 
  
 S_GET_DATA: begin 
  save_update = 1'b1; 
  bpns = S_BPROP_CALCDELTA;   
 end 
  
 S_BPROP_CALCDELTA: begin 
    enable_delta = 1; 
    bpns = lay0b_idx < IMG_SZ-1 ? S_BPROP_CALCDELTA : S_BPDONE; 
 end 
  
 S_BPDONE: begin 
    bp_done = 1'b1; 
    bpns = S_BPIDLE; 
 end 
 
 default: bpns = S_BPIDLE;   
      endcase 
   end 
 
   always_ff @(posedge clk, posedge rst) begin 
      if (rst) lay0f_idx <= '0; 
      else begin 
  case (fpcs) 
    S_FPIDLE: begin 
       lay0f_idx <= '0; 
    end 
     
    S_FPROP_LAYER1: begin 
       lay0f_idx <= (lay0f_idx==IMG_SZ-1)? 0 : (lay0f_idx + 1); 
    end 
    default: ;  
  endcase 
      end 
   end 
    
   always_ff @(posedge clk, posedge rst) begin 
      if (rst) begin  
   lay0b_idx <= '0; 
      end 
      else begin 
  case (bpcs) 
    S_BPIDLE: begin 
       lay0b_idx <= '0; 
    end 
 
  S_BPROP_CALCDELTA: begin 
   lay0b_idx <= (lay0b_idx==IMG_SZ-1)? 0 : (lay0b_idx + 1); 
  end 
 
    default: ;  
  endcase 
      end 
   end 
endmodule 

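The backward generate block of the tile module above forms the sigmoid slope as result * (8'h20 - result) and then multiplies it by the error. With 5 fractional bits, 8'h20 represents 1.0, so each multiplication adds fractional bits to the product. The C sketch below illustrates that arithmetic. It assumes that multiplier8 and multiplier16, which are not shown in this listing, compute plain integer products and that the error is treated as a signed value; the enable_delta gating is left out. All names are hypothetical.

```c
#include <stdint.h>

#define Q5_ONE 32   /* 1.0 with 5 fractional bits, i.e. 8'h20 */

/* result * (1 - result): two factors with 5 fractional bits each give a
   product with 10 fractional bits. */
int32_t q5_slope(int32_t result)
{
    return result * (Q5_ONE - result);
}

/* error * slope: 5 plus 10 fractional bits give 15 fractional bits. */
int32_t q5_weight_change(int32_t ideal, int32_t result)
{
    int32_t error = ideal - result;
    return error * q5_slope(result);
}
```

For result = 16 (0.5) the slope is 256, which is 0.25 with 10 fractional bits. With ideal = 32 (1.0) the error is 16 (0.5) and the product is 4096, which is 0.125 with 15 fractional bits.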
sigmoid_plan builds an approximation of the sigmoid function using the PLAN method [28]. 
`define FIXED_5       (5<<5) 
`define FIXED_1       (1<<5) 
`define FIXED_2_375   8'h4C//32'h26000 
`define FIXED_0_84375 8'h1B//32'h0D800 
`define FIXED_0_03125 8'h01//32'h00800 
`define FIXED_0_125   8'h04//32'h02000 
`define FIXED_0_625   8'h14//32'h0A000 
`define FIXED_0_25    8'h08//32'h04000 
`define FIXED_0_5     8'h10//32'h08000 
 
module sigmoid_plan(input logic clk,  
  input logic rst, 
               input logic enable, 
  input logic [7:0]  in, 
               output logic [7:0] out); 
 
   function logic [7:0] piecewise_sig_stage1(logic [7:0] in); 
      if(in>=`FIXED_5)  
 piecewise_sig_stage1 = `FIXED_1; 
      else if ((in >= `FIXED_2_375) && (in < `FIXED_5))  
 piecewise_sig_stage1 = (in>>5); 
      else if ((in >= `FIXED_1) && (in < `FIXED_2_375)) 
 piecewise_sig_stage1 = (in>>3); 
      else  
 piecewise_sig_stage1 = (in>>2); 
   endfunction 
 
   function logic [7:0] piecewise_sig_stage2(logic [7:0] in, logic [7:0] temp); 
      if(in>=`FIXED_5) 
 piecewise_sig_stage2 = `FIXED_1; 
      else if ((in >= `FIXED_2_375) && (in < `FIXED_5))  
 piecewise_sig_stage2 = temp + `FIXED_0_84375; 
      else if ((in >= `FIXED_1) && (in < `FIXED_2_375)) 
 piecewise_sig_stage2 = temp + `FIXED_0_625; 
      else  
 piecewise_sig_stage2 = temp + `FIXED_0_5; 
   endfunction 
    
   logic [7:0]         temp1, temp2, result; 
   logic          sign;   
   always_ff @(posedge clk, posedge rst) begin 
      if (rst) begin 
         temp1 <= 8'b0; 
         temp2 <= 8'b0; 
         result <= 8'b0; 
         sign <= 1'b0; 
      end else if (enable) begin 
         temp1 <= in[7] ? piecewise_sig_stage1(~in+1) : piecewise_sig_stage1(in); 
         temp2 <= in[7] ? ~in+1 : in; 
         sign <= in[7]; 
         result <= sign ? `FIXED_1 - piecewise_sig_stage2(temp2, temp1) : piecewise_sig_stage2(temp2, temp1); 
      end 
   end 
 
   assign out = result; 
endmodule 

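The two stages of sigmoid_plan above implement the PLAN piecewise-linear approximation with shifts and constant additions only: the magnitude of the input is shifted right by 2, 3 or 5 bits depending on its range, a constant offset is added, and the result is mirrored for negative inputs. The following C sketch is a software reference model of that arithmetic, using the same constants with 5 fractional bits. The function name plan_sigmoid_q5 is an assumption made for illustration, and the sketch ignores the 8-bit width and the pipeline registers of the hardware module.

```c
#define FIXED_ONE      32    /* 1.0     */
#define FIXED_5        160   /* 5.0     */
#define FIXED_2_375    76    /* 2.375   */
#define FIXED_0_84375  27    /* 0.84375 */
#define FIXED_0_625    20    /* 0.625   */
#define FIXED_0_5      16    /* 0.5     */

/* PLAN sigmoid on a value with 5 fractional bits; the result uses the
   same format. */
int plan_sigmoid_q5(int x)
{
    int mag = x < 0 ? -x : x;   /* work on |x|, as the module does */
    int y;

    if (mag >= FIXED_5)
        y = FIXED_ONE;                      /* y = 1                    */
    else if (mag >= FIXED_2_375)
        y = (mag >> 5) + FIXED_0_84375;     /* y = 0.03125|x| + 0.84375 */
    else if (mag >= FIXED_ONE)
        y = (mag >> 3) + FIXED_0_625;       /* y = 0.125|x| + 0.625     */
    else
        y = (mag >> 2) + FIXED_0_5;         /* y = 0.25|x| + 0.5        */

    return x < 0 ? FIXED_ONE - y : y;       /* sigmoid(-x) = 1 - sigmoid(x) */
}
```

For example, plan_sigmoid_q5(32) returns 24, which represents 0.75, against an exact sigmoid value of about 0.731 at x = 1.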
B.2 3-layer Fully-connected Neural Network with Offline Training in SystemVerilog 
File: tile.sv 
module tile (clk, rst, do_fp, test_reg, fp_done, image, label, result); 
    
   parameter IMG_SZ = 784; 
   parameter HIDDEN_SZ = 128;  
   parameter OUTPUT_SZ = 10;   
    
   // Inputs 
   input logic clk, rst; 
   input logic do_fp, test_reg; 
   input logic [IMG_SZ-1:0] [7:0] image; 
   input logic [OUTPUT_SZ-1:0] [7:0] label; 
    
   // Outputs 
   output logic    fp_done; 
   output logic [OUTPUT_SZ-1:0] [7:0] result; 
   logic [OUTPUT_SZ-1:0] [10:0] result_temp; 
   
   logic      clear; 
   logic [HIDDEN_SZ-1:0]   [10:0]   acc_lay0; 
   logic [OUTPUT_SZ-1:0]   [10:0]   acc_lay1; 
    
   logic [$clog2(IMG_SZ):0]    lay0f_idx; 
   logic [$clog2(HIDDEN_SZ):0]    lay1f_idx; 
   logic      enable_fp1, enable_fp2; 
 
   logic [HIDDEN_SZ-1:0] [7:0]  weights0; 
   logic [OUTPUT_SZ-1:0] [7:0]  weights1; 
   logic [HIDDEN_SZ-1:0] [10:0]  hidden_temp; 
   logic [HIDDEN_SZ-1:0] [7:0]  hidden; 
   logic [HIDDEN_SZ-1:0] [7:0] b0; 
   logic [OUTPUT_SZ-1:0] [7:0]  b1; 
 
   enum  logic [2:0] {S_FPIDLE, S_FPROP_LAYER1, S_FPROP_LAYER2, S_FPROP_EN2, S_FPROP_EN2_2, S_FPDONE} fpcs, fpns; 
 
   always_ff @(posedge clk, posedge rst) 
     if (rst) fpcs <= S_FPIDLE; 
     else fpcs <= fpns; 
    
   genvar i, j; 
   // Generate forward propagation neurons 
   generate 
      for (i = 0; i < HIDDEN_SZ; i++) begin: L2 
         neuron_f0 neuron_f0 (.clk(clk), .rst(rst), .clear(clear), .en(enable_fp1), .weight(weights0[i]), 
   .data(image[IMG_SZ-1-lay0f_idx]), .accum(acc_lay0[i])); 
   assign hidden_temp[i] = acc_lay0[i] + {b0[i][7],b0[i][7],b0[i][7],b0[i]}; 
         sigmoid_plan sigmoid_f0(.clk(clk), .rst(rst), .enable(enable_fp1), .in(hidden_temp[i]), .out(hidden[i])); 
     end  
   for (j = 0; j < OUTPUT_SZ; j++) begin: L3 
         neuron_f1 neuron_f1 (.clk(clk), .rst(rst), .clear(clear), .en(enable_fp2), .weight(weights1[j]), 
   .data(hidden[HIDDEN_SZ-1-lay1f_idx]), .accum(acc_lay1[j])); 
   assign result_temp[j] = acc_lay1[j] + {b1[j][7],b1[j][7],b1[j][7],b1[j]};  
         sigmoid_plan sigmoid_f1(.clk(clk), .rst(rst), .enable(enable_fp2), .in(result_temp[j]), .out(result[j])); 
   end 
  endgenerate 
 
  always_comb begin 
      fp_done = 1'b0; 
      clear = 1'b0; 
      enable_fp1 = 1'b0; 
      enable_fp2 = 1'b0; 
 
      case (fpcs) 
 S_FPIDLE: begin 
    clear = do_fp; 
    fpns = (do_fp) ? S_FPROP_LAYER1 : S_FPIDLE; 
 end 
 
 S_FPROP_LAYER1: begin 
    enable_fp1 = 1'b1; 
    fpns = lay0f_idx < IMG_SZ-1 ? S_FPROP_LAYER1 : S_FPROP_LAYER2;   
 end 
 
 S_FPROP_LAYER2: begin 
    enable_fp2 = 1'b1; 
    fpns = lay1f_idx < HIDDEN_SZ-1 ? S_FPROP_LAYER2 : S_FPROP_EN2;   
 end 
 
 S_FPROP_EN2: begin 
    enable_fp2 = 1'b1; 
    fpns = S_FPROP_EN2_2;   
 end  
 
 S_FPROP_EN2_2: begin 
    enable_fp2 = 1'b1; 
    fpns = S_FPDONE;   
 end  
 
 S_FPDONE: begin 
    fp_done = 1'b1; 
    fpns = S_FPIDLE;  
 end 
 
 default: fpns = S_FPIDLE;   
      endcase 
   end  
 
   always_ff @(posedge clk, posedge rst) begin 
      if (rst) begin 
  lay0f_idx <= '0; 
  lay1f_idx <= '0; end 
      else begin 
  case (fpcs) 
    S_FPIDLE: begin 
       lay0f_idx <= '0; 
       lay1f_idx <= '0; 
    end 
     
    S_FPROP_LAYER1: begin 
       lay0f_idx <= (lay0f_idx==IMG_SZ-1)? 0 : (lay0f_idx + 1); 
    end 
     
    S_FPROP_LAYER2: begin 
       lay1f_idx <= (lay1f_idx==HIDDEN_SZ-1)? 0 : (lay1f_idx + 1); 
    end 
 
    default: ; 
  endcase 
      end 
   end 
 weights_b0 weights_b0(.read_addr(0), .clk(clk), .q(b0)); 
 weights_b1 weights_b1(.read_addr(0), .clk(clk), .q(b1)); 
 weights_W0 weights_W0(.read_addr(lay0f_idx), .clk(clk), .q(weights0)); 
 weights_W1 weights_W1(.read_addr(lay1f_idx), .clk(clk), .q(weights1)); 
endmodule 

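In the tile module above, the weight memory weights_W0 is read at address lay0f_idx and returns one weight per hidden neuron, so in every clock cycle all generated neurons consume the same input value in parallel. The layer is thus evaluated input by input rather than neuron by neuron, as in the C code of Appendix A. The sketch below illustrates that ordering with toy sizes. The function name and the sizes N_IN and N_OUT are assumptions made for illustration, and plain integers replace the fixed-point accumulators of the hardware.

```c
#define N_IN   4   /* toy sizes; the listing uses 784 inputs and 128 neurons */
#define N_OUT  3

/* Input-major evaluation: at step idx every neuron accumulates the same
   input value multiplied by its own weight. */
void layer_accumulate(const int in[N_IN], int w[N_IN][N_OUT], int acc[N_OUT])
{
    for (int n = 0; n < N_OUT; n++)
        acc[n] = 0;                        /* corresponds to the clear pulse */

    for (int idx = 0; idx < N_IN; idx++)   /* one iteration per clock cycle  */
        for (int n = 0; n < N_OUT; n++)    /* performed in parallel hardware */
            acc[n] += in[idx] * w[idx][n];
}
```

The result is the same as with the neuron-major loops of the software version; only the order of the multiply-accumulate operations changes, which is what allows one memory word to feed every neuron at once.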
B.3 SS-CNN in SystemVerilog 
File: tile.sv 
module tile (clk, rst, do_fp, test_reg, fp_done, result); 
    
   parameter IMG_SZ = 784; 
   parameter IMG_WID = 28; 
   parameter KERNEL_SZ = 6;  
   parameter KERNEL_WID = 5;  
   parameter OUTPUT_SZ = 10;   
   parameter FC_SZ = 45;  
    
   // Inputs 
   input logic clk, rst; 
   input logic do_fp, test_reg; 
    logic [IMG_WID-1:0][IMG_WID-1:0][7:0] image; 
    logic [OUTPUT_SZ-1:0][7:0] label; 
    logic [KERNEL_SZ-1:0][7:0] weights0; 
    logic [FC_SZ-1:0][7:0] weights2; 
    logic [OUTPUT_SZ-1:0] [7:0] weights3; 
    logic [KERNEL_SZ-1:0][7:0] b0; 
    logic [FC_SZ-1:0][7:0] b2; 
    logic [OUTPUT_SZ-1:0][7:0] b3; 
   
   // Outputs 
   output logic fp_done; 
   output logic [OUTPUT_SZ-1:0][7:0] result; 
   
   logic [23:0][23:0][31:0] s00, s00_temp, s01, s01_temp, s02, s02_temp, s03, s03_temp, s04, s04_temp, s05, 
s05_temp; 
   logic [KERNEL_SZ-1:0][23:0][23:0][7:0] y0; 
   logic [KERNEL_SZ-1:0][11:0][11:0][7:0] y1; 
   logic [FC_SZ-1:0][31:0] s2, s2_temp; 
   logic [FC_SZ-1:0][7:0] y2; 
   logic [OUTPUT_SZ-1:0][31:0] s3, s3_temp; 
 
   logic clear, enable_conv0, enable_conv0_b, enable_conv0_s, enable_fc, enable_fc_b, enable_fc_s, enable_out, 
enable_out_b, enable_out_s; 
 integer j, k, a, b, r; 





   enum  logic [3:0] {S_FPIDLE, S_FPROP_CONV0, S_FPROP_CONV0_b, S_FPROP_CONV0_s, 
S_FPROP_CONV0_s2, S_FPROP_FC, S_FPROP_FC_b, S_FPROP_FC_s, S_FPROP_FC_s2, S_FPROP_OUT, 
S_FPROP_OUT_b, S_FPROP_OUT_s, S_FPROP_OUT_s2, S_FPDONE} fpcs, fpns; 
 
 always_ff @(posedge clk, posedge rst) 
     if (rst) fpcs <= S_FPIDLE; 
     else fpcs <= fpns; 
 
   always_comb begin 
      fp_done = 1'b0; 
      clear = 1'b0; 
      enable_conv0 = 1'b0; 
      enable_conv0_b = 1'b0; 
      enable_conv0_s = 1'b0; 
      enable_fc = 1'b0; 
      enable_fc_b = 1'b0; 
      enable_fc_s = 1'b0; 
 enable_out = 1'b0; 
 enable_out_b = 1'b0; 
 enable_out_s = 1'b0; 
   
      case (fpcs) 
 S_FPIDLE: begin 
    clear = do_fp; 
    fpns = (do_fp) ? S_FPROP_CONV0 : S_FPIDLE; 
 end 
 
 S_FPROP_CONV0: begin 
    enable_conv0 = 1'b1; 
    fpns = (j == KERNEL_WID-1) && (k == KERNEL_WID-1) ? S_FPROP_CONV0_b : S_FPROP_CONV0;   
 end 
 
 S_FPROP_CONV0_b: begin 
    enable_conv0_b = 1'b1; 
    fpns = S_FPROP_CONV0_s;   
 end  
  
 S_FPROP_CONV0_s: begin 
    enable_conv0_s = 1'b1; 
    fpns = S_FPROP_CONV0_s2;   
 end   
 
 S_FPROP_CONV0_s2: begin 
    enable_conv0_s = 1'b1; 
    fpns = S_FPROP_FC;   
 end    
  
 S_FPROP_FC: begin 
    enable_fc = 1'b1; 
    fpns = (r == 5) && (a == 11) && (b == 11) ? S_FPROP_FC_b : S_FPROP_FC;   
 end 
  
 S_FPROP_FC_b: begin 
    enable_fc_b = 1'b1; 
    fpns = S_FPROP_FC_s;   
 end   
 
 S_FPROP_FC_s: begin 
    enable_fc_s = 1'b1; 
    fpns = S_FPROP_FC_s2;   
 end  
 
 S_FPROP_FC_s2: begin 
    enable_fc_s = 1'b1; 
    fpns = S_FPROP_OUT;   
 end    
  
 S_FPROP_OUT: begin 
    enable_out = 1'b1; 
    fpns = out_idx < FC_SZ - 1 ? S_FPROP_OUT : S_FPROP_OUT_b;   
 end  
  
 S_FPROP_OUT_b: begin 
    enable_out_b = 1'b1; 
    fpns = S_FPROP_OUT_s;   
 end 
 
 S_FPROP_OUT_s: begin 
    enable_out_s = 1'b1; 
    fpns = S_FPROP_OUT_s2;   
 end  
  
 S_FPROP_OUT_s2: begin 
    enable_out_s = 1'b1; 
    fpns = S_FPDONE;   
 end   
 
 S_FPDONE: begin 
    fp_done = 1'b1; 
    fpns = S_FPIDLE;  
 end 
 
 default: fpns = S_FPIDLE;   
      endcase 
   end  
 
   always_ff @(posedge clk, posedge rst) begin 
      if (rst) begin 
   j <= '0; 
   k <= '0;  
   a <= '0; 
   b <= '0;  
   r <= '0;  
   out_idx <= '0; end 
      else begin 
  case (fpcs) 
    S_FPIDLE: begin 
   j <= '0; 
   k <= '0;  
   a <= '0; 
   b <= '0;  
   r <= '0;  
   out_idx <= '0;  
    end 
     
    S_FPROP_CONV0: begin 
       k <= (k < KERNEL_WID-1) ? (k + 1) : 0; 
       j <= ((j < KERNEL_WID-1) && (k==KERNEL_WID-1)) ? (j + 1) : j; 
    end 
 
    S_FPROP_FC: begin 
       b <= (b < 11) ? (b + 1) : 0; 
       a <= ((a < 11) && (b==11)) ? (a + 1) : (((a==11) && (b==11)) ? 0 : a); 
       r <= ((r < 5) && (a==11) && (b==11)) ? (r + 1) : r; 
    end 
     
    S_FPROP_OUT: begin 
       out_idx <= (out_idx==FC_SZ-1)? 0 : (out_idx + 1); 
    end 
 
    default: ; 
  endcase 
      end 
   end 
   
 genvar m, n; 
 generate  //conv layer 
  for (m = 0; m < 24; m++) begin: CONV_R 
   for (n = 0; n < 24; n++) begin: CONV_C   
    neuron_conv0 neuron_conv0 (.clk(clk), .rst(rst), .clear(clear), .en(enable_conv0), 
.weight(weights0[0]), .data(image[27-(m+j)][27-(n+k)]), .accum(s05_temp[m][n])); 
    adder conv_adder0 (.clk(clk), .rst(rst), .clear(clear), .en(enable_conv0_b), 
.acc({s05_temp[m][n][31], s05_temp[m][n][31:1]}), .bias({b0[0], 2'b0}), .sum(s05[m][n])); 
    sigmoid sigmoid_conv0(.clk(clk), .rst(rst), .enable(enable_conv0_s), .in(s05[m][n]), 
.out(y0[5][m][n])); 
 
    neuron_conv0 neuron_conv1 (.clk(clk), .rst(rst), .clear(clear), .en(enable_conv0), 
.weight(weights0[1]), .data(image[27-(m+j)][27-(n+k)]), .accum(s04_temp[m][n])); 
    adder conv_adder1 (.clk(clk), .rst(rst), .clear(clear), .en(enable_conv0_b), 
.acc({s04_temp[m][n][31], s04_temp[m][n][31:1]}), .bias({b0[1], 2'b0}), .sum(s04[m][n])); 
    sigmoid sigmoid_conv1(.clk(clk), .rst(rst), .enable(enable_conv0_s), .in(s04[m][n]), 
.out(y0[4][m][n])); 
 
    neuron_conv0 neuron_conv2 (.clk(clk), .rst(rst), .clear(clear), .en(enable_conv0), 
.weight(weights0[2]), .data(image[27-(m+j)][27-(n+k)]), .accum(s03_temp[m][n]));     
    adder conv_adder2 (.clk(clk), .rst(rst), .clear(clear), .en(enable_conv0_b), 
.acc({s03_temp[m][n][31], s03_temp[m][n][31:1]}), .bias({b0[2], 2'b0}), .sum(s03[m][n])); 
    sigmoid sigmoid_conv2(.clk(clk), .rst(rst), .enable(enable_conv0_s), .in(s03[m][n]), 
.out(y0[3][m][n])); 
 
    neuron_conv0 neuron_conv3 (.clk(clk), .rst(rst), .clear(clear), .en(enable_conv0), 
.weight(weights0[3]), .data(image[27-(m+j)][27-(n+k)]), .accum(s02_temp[m][n])); 
    adder conv_adder3 (.clk(clk), .rst(rst), .clear(clear), .en(enable_conv0_b), 
.acc({s02_temp[m][n][31], s02_temp[m][n][31:1]}), .bias({b0[3], 2'b0}), .sum(s02[m][n])); 
    sigmoid sigmoid_conv3(.clk(clk), .rst(rst), .enable(enable_conv0_s), .in(s02[m][n]), 
.out(y0[2][m][n])); 
 
    neuron_conv0 neuron_conv4 (.clk(clk), .rst(rst), .clear(clear), .en(enable_conv0), 
.weight(weights0[4]), .data(image[27-(m+j)][27-(n+k)]), .accum(s01_temp[m][n])); 
    adder conv_adder4 (.clk(clk), .rst(rst), .clear(clear), .en(enable_conv0_b), 
.acc({s01_temp[m][n][31], s01_temp[m][n][31:1]}), .bias({b0[4], 2'b0}), .sum(s01[m][n])); 
    sigmoid sigmoid_conv4(.clk(clk), .rst(rst), .enable(enable_conv0_s), .in(s01[m][n]), 
.out(y0[1][m][n])); 
 
    neuron_conv0 neuron_conv5 (.clk(clk), .rst(rst), .clear(clear), .en(enable_conv0), 
.weight(weights0[5]), .data(image[27-(m+j)][27-(n+k)]), .accum(s00_temp[m][n])); 
    adder conv_adder5 (.clk(clk), .rst(rst), .clear(clear), .en(enable_conv0_b), 
.acc({s00_temp[m][n][31], s00_temp[m][n][31:1]}), .bias({b0[5], 2'b0}), .sum(s00[m][n])); 
    sigmoid sigmoid_conv5(.clk(clk), .rst(rst), .enable(enable_conv0_s), .in(s00[m][n]), 
.out(y0[0][m][n])); 
   end 
  end 
 endgenerate 
   
 genvar i, p, q; 
 generate  //pooling layer 
  for (i = 0; i < 6; i++) begin: POOL_I 
   for (p = 0; p < 12; p++) begin: POOL_R 
    for (q = 0; q < 12; q++) begin: POOL_C      
     pooling pooling 
(.y0({y0[i][p+p][q+q],y0[i][p+p][q+q+1],y0[i][p+p+1][q+q],y0[i][p+p+1][q+q+1]}), .y1(y1[i][p][q])); 
    end 
   end 
  end 
 endgenerate 
  
   genvar f, t; 
   generate    
      for (f = 0; f < FC_SZ; f++) begin: FC  //FC hidden layer 
  neuron_conv0 neuron_fc0 (.clk(clk), .rst(rst), .clear(clear), .en(enable_fc), .weight(weights2[f]), 
.data(y1[r][a][b]), .accum(s2_temp[f])); 
        adder fc_adder0 (.clk(clk), .rst(rst), .clear(clear), .en(enable_fc_b), 




       sigmoid sigmoid_fc(.clk(clk), .rst(rst), .enable(enable_fc_s), .in(s2[f]), .out(y2[f])); 
  end 
 
   for (t = 0; t < OUTPUT_SZ; t++) begin: OUT  //FC output layer 
          neuron_conv0 neuron_out (.clk(clk), .rst(rst), .clear(clear), .en(enable_out), .weight(weights3[t]), 
.data(y2[FC_SZ-1-out_idx]), .accum(s3_temp[t])); 
  adder fc_adder1 (.clk(clk), .rst(rst), .clear(clear), .en(enable_out_b), 
.acc({s3_temp[t][31],s3_temp[t][31],s3_temp[t][31],s3_temp[t][31:3]}), .bias({b3[t], 2'b0}), .sum(s3[t])); 
         sigmoid sigmoid_out(.clk(clk), .rst(rst), .enable(enable_out_s), .in(s3[t]), .out(result[t])); 
   end 
  endgenerate 
   
  logic [4:0] addr_W0; 
  logic [9:0] addr_W2; 
  assign addr_W0 = enable_conv0 ? (5*j+k) : 0; 
  assign addr_W2 = enable_fc ? (r*144+12*a+b) : 0;   
   
 weights_W0 weights_W0 (.a(addr_W0), .rd(weights0)); 
 weights_W2 weights_W2 (.a(addr_W2), .rd(weights2)); 
 weights_W3 weights_W3 (.a(out_idx), .rd(weights3)); 
 weights_b0 weights_b0 (.a(0), .rd(b0)); 
 weights_b2 weights_b2 (.a(0), .rd(b2)); 
 weights_b3 weights_b3 (.a(0), .rd(b3)); 
 image_mem_test image_mem_test0 (.a(0), .rd(image)); 
endmodule 

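During the fully-connected stage of the SS-CNN tile above, the counters b, a and r walk through the 6 x 12 x 12 pooled feature maps, and addr_W2 = r*144 + 12*a + b selects the matching word of the weight memory. The C sketch below models one clock edge of those counters and the address calculation, to show that the address advances by one on every cycle. The function names are assumptions made for illustration. All right-hand sides use the values from before the clock edge, as non-blocking assignments do.

```c
/* One clock edge of the b, a and r counters of the FC stage. */
void fc_counters_step(int *r, int *a, int *b)
{
    int r0 = *r, a0 = *a, b0 = *b;   /* values before the clock edge */

    *b = (b0 < 11) ? b0 + 1 : 0;
    *a = (a0 < 11 && b0 == 11) ? a0 + 1
       : ((a0 == 11 && b0 == 11) ? 0 : a0);
    *r = (r0 < 5 && a0 == 11 && b0 == 11) ? r0 + 1 : r0;
}

/* Flattened address into the 6 x 12 x 12 pooled feature maps. */
int addr_w2(int r, int a, int b)
{
    return r * 144 + 12 * a + b;
}
```

Starting from r = a = b = 0, the address runs through 0, 1, ..., 863, one value per cycle, so the stage needs 6 * 12 * 12 = 864 cycles. The last address is reached when r == 5, a == 11 and b == 11, which is the exit condition of the S_FPROP_FC state.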
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional 
Neural Networks,” Adv. Neural Inf. Process. Syst., pp. 1–9, 2012. 
[2] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image 
Recognition,” Int. Conf. Learn. Represent., pp. 1–14, 2015. 
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. 
Rabinovich, “Going Deeper with Convolutions,” 2014. 
[4] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, 
M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. 
Hassabis, “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 
354–359, 2017. 
[5] N. P. Jouppi, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. 
Dean, B. Gelb, C. Young, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, 
D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, N. Patil, A. Jaffey, A. Jaworski, A. Kaplan, H. 
Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Patterson, D. Le, C. 
Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. 
Nagarajan, G. Agrawal, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, 
A. Phelps, J. Ross, M. Ross, A. Salek, R. Bajwa, E. Samadiani, C. Severn, G. Sizikov, M. 




Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, D. H. Yoon, S. Bhatia, and N. Boden, “In-
Datacenter Performance Analysis of a Tensor Processing Unit,” Proc. 44th Annu. Int. Symp. 
Comput. Archit. - ISCA ’17, pp. 1–12, 2017. 
[6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” Arxiv.Org, 
vol. 7, no. 3, pp. 171–180, 2015. 
[7] http://www.image-net.org/,  Last accessed November 10, 2019. 
[8] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, 
M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” 
Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015. 
[9] F. Moreno, J. Alarcon, R. Salvador, and T. Riesgo, “FPGA implementation of an image 
recognition system based on Tiny Neural networks and on-line reconfiguration,” Ind. Electron. 
2008. IECON 2008. 34th Annu. Conf. IEEE, pp. 2445–2452, 2008. 
[10] E. Bouvett, O. Casha, I. Grech, M. Cutajar, E. Gatt, and J. Micallef, “An FPGA embedded system 
architecture for handwritten symbol recognition,” Proc. Mediterr. Electrotech. Conf. - MELECON, 
pp. 653–656, 2012. 
[11] A. Suyyagh and G. Abandah, “FPGA Parallel Recognition Engine for Handwritten Arabic Words,” 
J. Signal Process. Syst., vol. 78, no. 2, pp. 163–170, 2013. 
[12] V. Tay and T. Fpga, “Design of Artificial Neural Network Architecture for Handwritten Digit 
Recognition on FPGA,” no. November, 2016. 
[13] T. V. Huynh, “Design space exploration for a single-FPGA handwritten digit recognition system,” 




[14] L. B. Saldanha and C. Bobda, “An embedded system for handwritten digit recognition,” J. Syst. 
Archit., vol. 61, no. 10, pp. 693–699, 2015. 
[15] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based Accelerator 
Design for Deep Convolutional Neural Networks,” Proc. 2015 ACM/SIGDA Int. Symp. Field-
Programmable Gate Arrays - FPGA ’15, pp. 161–170, 2015. 
[16] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. 
Yang, “Going Deeper with Embedded FPGA Platform for Convolutional Neural Network,” 2016 
ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, pp. 26–35, 2016. 
[17] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A High Performance FPGA-based 
Accelerator for Large-Scale Convolutional Neural Networks,” 2016 26th Int. Conf. Field Program. 
Log. Appl. (FPL), 2016. 
[18] L. Lu, Y. Liang, Q. Xiao, and S. Yan, “Evaluating fast algorithms for convolutional neural 
networks on FPGAs,” Proc. - IEEE 25th Annu. Int. Symp. Field-Programmable Cust. Comput. 
Mach. FCCM 2017, pp. 101–108, 2017. 
[19] C. Fung, B. Fong, J. Mu, W. Zhang, and H. Kong, “A Cost-Effective CNN Accelerator Design 
with Configurable PU on FPGA,” ISVLSI 2019. 
[20] V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, R. K. Gupta, and H. Esmaeilzadeh, “SnaPEA: 
Predictive Early Activation for Reducing Computation in Deep Convolutional Neural Networks,” 
ISCA 2018, no. 1, 2018. 
[21] S. Han, H. Mao, and W. J. Dally, “Deep Compression: Compressing Deep Neural Networks with 
[22] J. Frankle and M. Carbin, “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural 
Networks,” Mar. 2018. 
[23] X. Gao, Y. Zhao, Ł. Dudziak, R. Mullins, and C. Xu, “Dynamic Channel Pruning: Feature 
Boosting and Suppression,” Oct. 2018. 
[24] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier Nonlinearities Improve Neural Network 
Acoustic Models,” Proc. 30th Int. Conf. Mach. Learn., vol. 28, p. 6, 2013. 
[25] K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level 
Performance on ImageNet Classification,” arXiv:1502.01852 [cs.CV], 2015. 
[26] http://yann.lecun.com/exdb/mnist/, Last accessed November 10, 2019. 
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document 
Recognition,” Proc. IEEE, 1998. 
[28] H. Amin, K. M. Curtis, and B. R. Hayes-Gill, “Piecewise linear approximation applied to 
nonlinear function of a neural network,” IEE Proc. - Circuits, Devices Syst., vol. 144, no. 6, pp. 
313–317, 1997. 
[29] J. Si and S. L. Harris, “Handwritten Digit Recognition System on an FPGA,” CCWC 2018, pp. 
402–407, 2018. 
[30] J. Si, S. L. Harris, and E. Yfantis, “A Dynamic ReLU on Neural Network,” 2018 IEEE 13th 
Dallas Circuits Syst. Conf., pp. 1–6, 2018. 
[31] J. Si, E. Yfantis, and S. L. Harris, “A SS-CNN on an FPGA for Handwritten Digit Recognition,” 

Jiong Si, Ph.D., Email: sijiong2014@gmail.com 
Education 
 University of Nevada, Las Vegas 
 Ph.D. degree  12/2019  Electrical Engineering 
 Hefei University of Technology 
 Master’s degree  05/2011 Precision Instrument & Machinery  
 Chongqing University of Science and Technology 
 Bachelor’s degree  07/2008  Automation 
Work Experience 
 Imatrex, Inc. FPGA Design Engineer  06/2019 – Present 
 UNLV  Teaching Assistant  05/2017 – 06/2019 
 UNLV  Research Assistant  01/2015 – 05/2017 
 MediaTek, Inc. ASIC Design Engineer  03/2011 – 01/2015  
Publications 
[1] Jiong Si, Evangelos Yfantis, Sarah Harris, 2019, A SS-CNN on an FPGA for Handwritten Digit 
Recognition, IEEE UEMCON 2019 
[2] Jiong Si, Sarah Harris, Evangelos Yfantis, 2018, A Dynamic ReLU on Neural Network, IEEE DCAS 
2018 
[3] Jiong Si, Sarah Harris, 2018, Handwritten Digit Recognition System on an FPGA, IEEE CCWC 2018 
