The Study and Development of Hardware Oriented Dropout Algorithm for Efficient FPGA
Implementation
by Yeoh Yoeng Jye







KYUSHU INSTITUTE OF TECHNOLOGY
The Study and Development of
Hardware Oriented Dropout Algorithm
For Efficient FPGA Implementation
YEOH YOENG JYE
17899032
A thesis submitted in partial fulfillment for the
degree of PhD of Engineering
TAMUKOH - LABORATORY
DEPARTMENT OF LIFE SCIENCE AND SYSTEMS ENGINEERING
GRADUATE SCHOOL OF LIFE SCIENCE AND SYSTEMS ENGINEERING




I, Yeoh Yoeng Jye, declare that this thesis titled, ‘The Study and Development of Hardware
Oriented Dropout Algorithm For Efficient FPGA Implementation’ and the work
presented in it are my own. I confirm that:
• This work was done wholly or mainly while in candidature for a PhD degree at
Kyushu Institute of Technology.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have made




KYUSHU INSTITUTE OF TECHNOLOGY
Abstract
DEPARTMENT OF LIFE SCIENCE AND SYSTEMS ENGINEERING
GRADUATE SCHOOL OF LIFE SCIENCE AND SYSTEMS ENGINEERING
KYUSHU INSTITUTE OF TECHNOLOGY
PhD of Engineering
YEOH YOENG JYE 17899032
Research and development of Deep Neural Networks (DNNs) has attracted attention because
DNNs have demonstrated promising performance in numerous fields such as robotics,
medicine, automotive and manufacturing. Recent DNNs are becoming deeper, larger and
more complex in order to adapt to different tasks with higher accuracy. Training neural
networks becomes increasingly time-, resource- and power-intensive as the number of
parameters increases. Various studies have applied DNNs to embedded systems and portable
devices such as service robots, mobile phones and autonomous vehicles, which widens the
range of DNN applications. However, computation speed and power consumption are
concerns because of the complex computation involved. Training DNNs in embedded systems
is difficult to achieve without a compromise among accuracy, speed and power. Field
programmable gate arrays (FPGAs) are suitable devices for embedded systems because of
their parallel processing and low power consumption. However, general algorithms designed
for software implementation are not well suited to FPGAs owing to differences in
architecture, which reduces speed and increases the resources required. Therefore, modified
algorithms are required for efficient implementation on FPGAs.
In this thesis, a hardware-oriented dropout algorithm is proposed for efficient FPGA
implementation. Dropout is a regularization technique that is commonly used in DNNs to
overcome overfitting, the problem in which a DNN is over-trained and too well adapted to
the training data, resulting in a performance drop on unseen data. By randomly dropping
neurons during the training phase, dropout omits feature detectors and prevents complex
co-adaptations between neurons. In the general dropout method, random numbers generated
by random number generators (RNGs) are compared with the dropout ratio to determine
whether each neuron is activated or deactivated. However, RNGs and comparators are
resource-consuming in an FPGA, and the implementation of RNGs in FPGAs is itself a deep
and broad topic. Instead, the proposed algorithm attempts to eliminate the need for RNGs
and comparators, reducing complexity and achieving an effect equivalent to dropout with
minimal resources and at high speed.
The proposed method was verified through two approaches: software verification and
hardware verification. In software verification, the performance of the proposed dropout
was analyzed with multiple pairs of neural networks and datasets to confirm its robustness.
In hardware verification, the resource consumption and speed of the proposed dropout were
compared with those of the general dropout to show its effectiveness in FPGA
implementation.
Keywords: Deep Neural Networks (DNNs), Dropout Algorithm, FPGA Implementation.
Figure 1: Graphical abstract
Acknowledgements
I would like to thank those who lent their support and expertise to me in completing
my PhD thesis "The Study and Development of Hardware Oriented Dropout Algorithm
For Efficient FPGA Implementation" over the past months. First, I thank my supervisor,
Prof. Hakaru Tamukoh, for allowing me to pursue my PhD degree in his laboratory and for
supporting me whenever I faced difficulties. His kind guidance allowed me to focus on my
research work and complete my PhD thesis.
I would also like to thank the experts who were involved in the validation survey for
this research project: Morie-Lab, which is led by Prof. Morie. Without his passionate
participation and input, the validation survey could not have been successfully conducted.
Finally, I must express my very profound gratitude to my parents for providing
me with unfailing support and continuous encouragement throughout my years of study
and through the process of researching and writing this thesis. This accomplishment






Contents
Declaration of Authorship
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
1.1 Background of Neural Networks
1.1.1 Multi-Layer Perceptron (MLP)
1.1.2 Convolutional Neural Network (CNN)
1.1.3 Recurrent Neural Network (RNN)
1.2 Dropout
1.3 Field Programmable Gate Array (FPGA)
1.4 Problem Statements
1.5 Objectives
1.6 Research Method
1.6.1 Designing
1.6.2 Software Simulation
1.6.3 Hardware Synthesis
1.6.4 Analysis and Evaluation
1.6.5 Correction and Improvements
1.7 Outline of thesis
1.7.1 Chapter 1 - Introduction
1.7.2 Chapter 2 - Literature Review
1.7.3 Chapter 3 - Methodology
1.7.4 Chapter 4 - Simulation and Implementation
1.7.5 Chapter 5 - Experiment Results and Discussion
1.7.6 Chapter 6 - Conclusion and Future Works
2 Literature Review
2.1 CNN
2.1.1 Convolutional Layer
2.1.2 Pooling Layer
2.1.3 Fully-connected Layer
2.1.4 Activation Function
2.2 RNN
2.3 Overfitting
2.4 Dropout
2.4.1 Extended research of dropout
2.4.2 Hardware-oriented Dropout
3 Methodology
3.1 Overview of Proposed Method
3.1.1 Algorithm
3.2 Software Simulation
3.2.1 Microsoft Visual Studio
3.2.2 Chainer
3.3 Hardware Synthesis
3.3.1 ISE Design Suite
3.3.2 Vivado HLS
4 Simulation and Implementation
4.1 Architecture of Neural Networks
4.1.1 MLP
4.1.2 CNN - LeNet
4.1.3 Deep CNN - GoogLeNet
4.1.4 RNN Language Model
4.2 Dataset
4.2.1 MNIST
4.2.2 CIFAR-10
4.2.3 @home Dataset
4.2.4 Penn Treebank (PTB)
5 Experiment Results and Discussion
5.1 Software Verification
5.1.1 MLP - MNIST
5.1.2 LeNet - CIFAR10
5.1.3 GoogLeNet - @home Dataset
5.1.4 RNNLM - PTB
5.1.5 Randomness Effect Analysis
5.1.5.1 Vary in Initialization
5.1.5.2 Vary in Rotate Bit
5.2 Hardware Synthesis
5.2.1 Generation of 8-bit Dropout Mask
5.2.2 Generation of 64-bit dropout mask
5.2.3 Application to neuron layer
5.2.4 Application in MLP
5.3 Discussion
5.3.1 The Proposed Method
5.3.2 Pseudo RNG
5.3.3 Other Hardware-oriented Implementation
6 Conclusion and Future Works
6.1 Conclusion
6.2 Future Works
6.2.1 Addition of Random Effects
6.2.2 Initialization without RNG
6.3 Application of Proposed Method
6.3.1 Transfer Learning




List of Figures
1 Graphical abstract
1.1 Simple Illustration of MLP with 1 Hidden Layer
1.2 FPGA Architecture
1.3 The overview of research objectives
2.1 General CNN Architecture
2.2 Example of Convolution Computation
2.3 Example of Max Pooling and Average Pooling
2.4 Examples of Activation Function [24]
2.5 Illustration of RNN Language Model (RNNLM)
2.6 Illustration of Underfitting and Overfitting Problem
2.7 Illustration of Dropout in Neural Network
3.1 Design and Task Flow
3.2 Dropout Implementation
3.3 Block Diagram of General Dropout in Hardware Implementation
3.4 Block Diagram of Comparator Block in Serial and in Parallel
3.5 Block Diagram of Proposed Dropout
3.6 Illustration of Proposed Method
3.7 Comparison between general dropout algorithm and proposed algorithm
3.8 Illustration of Proposed Method with Rotation Only and with Rotation and Split
4.1 MLP implemented
4.2 Architecture of LeNet
4.3 Inception Module with Dimension Reduction
4.4 Architecture of GoogLeNet
4.5 Architecture of RNNLM
4.6 Example Image of MNIST [9]
4.7 Example Image of CIFAR-10 [42]
4.8 POS tagset in PTB [41]
4.9 Example of bracketed text PTB [41]
5.1 Average Accuracy of MLP Trained with MNIST Dataset
5.2 Recognition Accuracy of LeNet Trained with CIFAR-10 Dataset (SGD)
5.3 Recognition Accuracy of LeNet Trained with CIFAR-10 Dataset (ADAM)
5.4 Recognition Accuracy of GoogLeNet Trained with @home Dataset
5.5 Comparison of recurrent neural network language model
5.6 Comparison of Initialization by Varying from Random Number with Period 2 to 10
5.7 Comparison of Initialization by Varying from Random Number with Period 12 to 20
5.8 Comparison of Various Rotate Bit
5.9 Timing Diagram - General Method in Serial (8-bit)
5.10 Timing Diagram - General Method in Parallel (8-bit)
5.11 Timing Diagram - Proposed Method (8-bit)
5.12 Timing Diagram - General Method in Serial (64-bit)
5.13 Timing Diagram - Proposed Method (64-bit)
5.14 Latency Comparison
5.15 Flip-flop (FF) Resources Comparison
5.16 Look-up Table (LUT) Resources Comparison
5.17 Digital Signal Processor (DSP) Resources Comparison
6.1 The MLP that applied in motion planning network
List of Tables
1.1 Comparison of CPU, GPU and FPGA
3.1 Comparison between General Method and Proposed Method
4.1 Example Image of 15 Classes Object of @home Dataset [43, 44]
5.1 Comparison of test perplexity between different methods
5.2 Train and Test Accuracy at Epoch 30
5.3 Comparison of Resources Consumed for Dropout Implementation
5.4 Comparison of field programmable gate array resources consumed for 64-bits dropout masks
5.5 The power and energy for dropout mask with 100 neurons at 100MHz
5.6 Latency and resources comparison in MLP: 784-100-10.




Deep neural networks (DNNs) have demonstrated promising performance in numerous
applications, such as data mining, automation, and natural language processing [1–4].
Examples include GoogLeNet, which succeeded in image recognition in the ImageNet
Large Scale Visual Recognition Challenge (ILSVRC) [5, 6], and AlphaGo, which defeated
the top human player in the game of Go [7]. Advances in computer technology and
networks have led to breakthroughs in DNNs, which have become a popular topic among
researchers, and the number of studies on deep learning increases yearly.
1.1 Background of Neural Networks
Neural networks are computation systems that are inspired by biological brains and
realized through mathematical modelling. Various types and architectures of neural
networks, such as Convolutional Neural Networks, have been introduced and achieve
outstanding performance in many applications.
1.1.1 Multi-Layer Perceptron (MLP)
MLP is a feed-forward artificial neural network in which each input neuron is fully
connected to the neurons of the hidden layer, which extract features and map them to the
output layer (as shown in Figure 1.1) [8]. MLP is a standard supervised-learning model in
which teaching signals are required at the output layer to calculate the cost function, and
the error is back-propagated to minimize it [8]. Each neuron computes the sum of its inputs,
x, weighted by the corresponding connection weights, W, adds a bias, b, and passes the
result through an activation function, f (see Eq. (1.1)). Activation functions such as
sigmoid, tanh (hyperbolic tangent) and ReLU (Rectified Linear Unit) are commonly used to
introduce non-linearity to the network.
Figure 1.1: Simple Illustration of MLP with 1 Hidden Layer
$$ y = f\left(\sum W \cdot x + b\right) \tag{1.1} $$
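As a minimal sketch of Eq. (1.1), the following plain-Python example computes the output of a single neuron and of a small layer. The function names (`neuron_output`, `dense_layer`) and the example weights are illustrative assumptions and are not taken from the thesis.

```python
import math

def relu(z):
    """Rectified Linear Unit: passes positive values, clamps negatives to zero."""
    return max(0.0, z)

def sigmoid(z):
    """Logistic sigmoid, squashing z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, b, f=relu):
    """Eq. (1.1): weighted sum of the inputs plus bias, passed through f."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)

def dense_layer(x, W, b, f=relu):
    """Apply Eq. (1.1) to every neuron; W holds one weight row per neuron."""
    return [neuron_output(x, w_row, b_j, f) for w_row, b_j in zip(W, b)]

# Illustrative values only (hypothetical, chosen for easy hand-checking).
x = [1.0, 2.0, 3.0]
W = [[0.5, -1.0, 1.0], [-0.5, 0.0, -0.5]]
b = [0.5, 1.0]
hidden = dense_layer(x, W, b)
```

With these values the first neuron has a positive pre-activation and passes through ReLU unchanged, while the second has a negative pre-activation and is clamped to zero.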
1.1.2 Convolutional Neural Network (CNN)
CNN is another type of feed-forward neural network, introduced in 1998 [9]. Its
convolutional and pooling layers allow the network to learn more meaningful and invariant
features, which gives CNN outstanding performance in image recognition; in 2012, it
significantly reduced the error rate compared with other models in the ImageNet
competition [5]. CNN then became a subject of intense study, and in the following years
deeper CNNs won the ImageNet competition with better performance [6].
1.1.3 Recurrent Neural Network (RNN)
RNN is a type of neural network that makes use of sequential information in predicting
its output [10]. In inference, an RNN does not depend solely on the current input but also
uses information from the previous state when making a decision. This resembles human
decision-making, where a case is judged not only on the present but also on past
experience. RNN is powerful when dealing with time-series data such as video, audio and
language [10].
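The dependence on the previous state can be sketched with a simple recurrent update. The example assumes an Elman-style cell with a tanh activation, h_t = tanh(W_xh x_t + W_hh h_(t-1) + b_h); this particular form, and the names `rnn_step` and `run_sequence`, are assumptions for illustration and are not specified in the text.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: the new state mixes the current input and the previous state."""
    h_t = []
    for j in range(len(b_h)):
        z = b_h[j]
        z += sum(W_xh[j][i] * x_t[i] for i in range(len(x_t)))
        z += sum(W_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(z))
    return h_t

def run_sequence(xs, h0, W_xh, W_hh, b_h):
    """Feed a sequence through the cell and collect the state after every step."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

# The same input is presented twice; the second state differs because it
# also depends on the state left behind by the first step.
states = run_sequence([[1.0], [1.0]], [0.0], [[1.0]], [[0.5]], [0.0])
```

Setting `W_hh` to zero removes the recurrence, in which case identical inputs always give identical states, as in a feed-forward network.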
Figure 1.2: FPGA Architecture
1.2 Dropout
Dropout is a common regularization technique that is widely used in neural networks. It is
not restricted to feed-forward neural networks such as MLP and CNN, but can also be
applied to graphical models such as the restricted Boltzmann machine (RBM) [5, 11, 12].
When a network model is overfitted to the training data and learns features that are too
specific to the input, the model is unable to predict new incoming test data, resulting in
low performance in the inference phase [11]. Randomly dropping neurons while training the
network prevents co-adaptations between neurons, which generalizes the network and
overcomes the overfitting problem [11].
1.3 Field Programmable Gate Array (FPGA)
An FPGA is a circuit device that consists of logic blocks and interconnections (Figure 1.2)
[13, 14]. Each logic block is composed of registers and logic gates that form look-up
tables (LUTs), and it is reprogrammable. The interconnections are essentially bundles of
wires controlled by switches (transistors) and are reconfigurable as well. Some FPGAs also
have small built-in Random Access Memory (RAM) and Digital Signal Processing (DSP)
blocks for complex computation.
FPGA offers high performance and energy efficiency in many applications, such as image
processing, owing to its parallel architecture [15–17]. One of the highlights of FPGA is
that it can achieve performance as fast as a Graphics Processing Unit (GPU) at a lower
clock frequency and with lower energy consumption, which makes it more suitable for
embedded systems [15–17]. The power consumption of a GPU is high and it lacks mobility,
so a GPU is not suitable for embedded systems. FPGA meets the criteria for embedded
systems: high mobility, low power consumption and small size [18]. However, general
algorithms must be modified to suit FPGA in order to fully utilize its parallelism. Limited
resources are another drawback of FPGA, and it is usually difficult to implement a general
algorithm without modification [18, 19]. Thus, algorithms that consume fewer resources are
desired. Table 1.1 summarizes the comparison of performance among CPU, GPU and FPGA.
Table 1.1: Comparison of CPU, GPU and FPGA
                   CPU (software)   GPU (software)       FPGA (hardware)
Power              High (>50 W)     Very high (>100 W)   Low (<20 W)
Clock Frequency    High (GHz)       High (GHz)           Low (MHz)
Processing Speed   Medium           Very fast            Very fast
Parallelism        Sequential       Parallel             Parallel
Development Time   Short            Medium               Long
1.4 Problem Statements
This research is motivated by the following problems:
1. Random Number Generators (RNGs) require large computational power and
resources, which are constrained by the memory of the FPGA.
2. A floating-point comparator is required, which consumes large resources in an FPGA.
3. The parallelism of FPGA allows high-speed processing but increases the resources required.
Figure 1.3: The overview of research objectives
1.5 Objectives
The objective of this thesis is to propose a hardware-oriented dropout algorithm that
allows efficient implementation of trainable DNNs in FPGA, enhancing AI in embedded
systems, as illustrated in Figure 1.3. Based on the problem statements in Section 1.4,
this research aims to implement the dropout algorithm in FPGA with an effective and
simplified method that does not require RNGs or comparators, so that more resources can
be saved.
1.6 Research Method
1.6.1 Designing
The research begins with the design stage, after studying various related research in
detail. The problem statements and objectives are identified, hypotheses are made and the
proposed method is modelled. The effectiveness and difficulties are estimated in this
phase.
1.6.2 Software Simulation
In this stage, software simulations are carried out to observe and verify the proposed
method. Programs are written for the experiments.
1.6.3 Hardware Synthesis
After verification by software simulation, hardware synthesis is carried out to prove the
hypothesis and show the effectiveness of the proposed method.
1.6.4 Analysis and Evaluation
The results observed are analyzed and evaluated in this stage.
1.6.5 Correction and Improvements
Corrections and minor debugging are carried out for further improvement if the results are
below expectation.
1.7 Outline of thesis
1.7.1 Chapter 1 - Introduction
Introduces the basic concepts and background of neural networks, the dropout technique and
FPGA. The problem statements and objectives of this research are also included in this
chapter, as well as the outline of the thesis.
1.7.2 Chapter 2 - Literature Review
Explains and reviews the important knowledge related to this research.
1.7.3 Chapter 3 - Methodology
Contains the concept of the proposed method, the design flow and the platforms on which
the experiments are carried out.
1.7.4 Chapter 4 - Simulation and Implementation
Explains in detail the experiments carried out, the architectures of the neural networks
and the datasets used.
1.7.5 Chapter 5 - Experiment Results and Discussion
Analyzes and discusses the experimental results obtained. Software verification is
performed by comparing the performance of neural networks with the general method and
with the proposed method. Hardware verification is performed by observing and comparing
the resource utilization and processing clock cycles in hardware synthesis.
1.7.6 Chapter 6 - Conclusion and Future Works
Concludes this thesis, with suggestions for future works and some possible applications




Recently, many CNN models, such as GoogLeNet [6], ResNet [20] and VGGNet [21], have been
introduced for various applications. In general, the basic architecture of a CNN consists
of multiple convolutional layers, pooling layers and a fully-connected layer, as
illustrated in Figure 2.1. The convolutional layer extracts and learns the meaningful
features of the inputs. The pooling layer reduces the resolution of the input data to a
smaller scale. The fully-connected layer flattens the input data into a single-dimension
neural network.
Figure 2.1: General CNN Architecture
2.1.1 Convolutional Layer
In this layer, the input data is convolved with multiple filters (also known as kernels),
as shown in Figure 2.2. The activation function is then applied to the output to introduce
non-linearity. Each filter is applied to the input by sliding from the top left corner to
the bottom right corner. The number of filters determines the number of output channels.
While convolving, the number of pixels by which the filter slides at each step is known as
the stride. The higher the stride, the smaller the output resolution. To maintain the
output resolution, zero padding can be applied. Zero padding is a technique that adds
extra rows and columns of zeros around the border of the input. With zero padding, the
input becomes larger, so after convolving, the output keeps the same resolution as the
input before zero padding. The filters usually overlap (the stride is smaller than the
kernel size) and the weights are shared spatially [22]. Figure 2.2 illustrates an example
of convolution with a kernel size of 3 and without zero padding.
Figure 2.2: Example of Convolution Computation
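The relationship between kernel size, stride, zero padding and output resolution can be sketched as follows. The example uses the standard output-size formula, floor((n + 2·pad − k) / stride) + 1, and, as is common in DNN frameworks, slides the kernel without flipping it. The function names and the 5×5 test input are illustrative assumptions.

```python
def conv_output_size(n, k, stride=1, pad=0):
    """Output resolution along one axis for input size n and kernel size k."""
    return (n + 2 * pad - k) // stride + 1

def zero_pad(image, pad):
    """Surround a 2-D list with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0.0] * width for _ in range(pad)]
    for row in image:
        padded.append([0.0] * pad + list(row) + [0.0] * pad)
    padded.extend([0.0] * width for _ in range(pad))
    return padded

def conv2d(image, kernel, stride=1, pad=0):
    """Slide a square kernel over the (optionally padded) image."""
    x = zero_pad(image, pad)
    k = len(kernel)
    out_h = conv_output_size(len(image), k, stride, pad)
    out_w = conv_output_size(len(image[0]), k, stride, pad)
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += x[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

image = [[1.0] * 5 for _ in range(5)]       # hypothetical 5x5 input of ones
kernel = [[1.0] * 3 for _ in range(3)]      # 3x3 kernel of ones
valid = conv2d(image, kernel)               # no padding: 3x3 output
same = conv2d(image, kernel, pad=1)         # zero padding of 1: 5x5 output
strided = conv2d(image, kernel, stride=2)   # stride 2: 2x2 output
```

Without padding the 5×5 input shrinks to 3×3, a padding of one keeps the 5×5 resolution, and a stride of two reduces the output to 2×2.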
2.1.2 Pooling Layer
The pooling layer mainly reduces the resolution of the data, generally by half. It is also
known as the sub-sampling layer. There are various pooling methods, but average pooling
and max pooling are generally used. Max pooling simply passes on the maximum value of the
data within the pooling filter, whereas average pooling outputs the average of the data
within the pooling filter. The pooling layer is applied to the neural network to learn the
invariances of features [22]. In general, the pooling filters do not overlap (the stride
is equal to the filter size). Figure 2.3 shows an example of max pooling and average
pooling with a pooling filter size of 2 and a stride of 2.
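Max pooling and average pooling with a filter size of 2 and a stride of 2 can be sketched as below. The function name `pool2d` and the 4×4 feature map are hypothetical and do not reproduce the values of Figure 2.3.

```python
def pool2d(data, size=2, stride=2, mode="max"):
    """Reduce a 2-D list by taking the max or the mean of each pooling window."""
    out = []
    for r in range(0, len(data) - size + 1, stride):
        row = []
        for c in range(0, len(data[0]) - size + 1, stride):
            window = [data[r + i][c + j] for i in range(size) for j in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        out.append(row)
    return out

# Hypothetical 4x4 feature map; the 2x2 windows do not overlap.
feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [4, 2, 3, 1],
    [8, 6, 7, 5],
]
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
```

Both variants halve the resolution from 4×4 to 2×2; only the value kept from each window differs.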
Figure 2.3: Example of Max Pooling and Average Pooling
2.1.3 Fully-connected Layer
The fully-connected layer flattens the high-dimensional data into a single dimension and
connects it to the output layer. Generally, an MLP is used to perform classification from
the features extracted by the preceding convolutional layers. This is the final layer of
the CNN, which maps all the learnt features to the output.
2.1.4 Activation Function
The activation function is important in neural networks because it introduces
non-linearities to the network, leading to complex analytical and computational properties
[23]. It is differentiable (ReLU everywhere except at zero), and saturating functions such
as sigmoid and tanh limit the output to a certain range. The activation function is
applied in the convolutional layers and the fully-connected layer.
Many types of activation function are available for neural networks, such as sigmoid,
tanh (hyperbolic tangent), ReLU (Rectified Linear Unit) and softplus (Figure 2.4) [24].
Sigmoid and tanh (Figure 2.4 (left)) both suffer from near-zero gradients when the input
is a large positive or large negative number. For ReLU and softplus (Figure 2.4 (right)),
the gradient becomes zero (approaches zero for softplus) only when the input is negative,
while for positive inputs the function is linear (almost linear for softplus). ReLU has
the advantage of allowing the network to easily obtain sparse representations [24], and it
is easy to differentiate. Only a subset of neurons is active and the computation is linear
[24]. Thus, in this thesis, ReLU is applied by default to all CNNs and MLPs throughout the
experiments.
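The gradient behaviour described above can be checked numerically with the closed-form derivatives of the four functions. This is a small sketch; the function names are illustrative, and the derivative of ReLU at exactly zero is taken as 0 by convention.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    """Derivative of sigmoid: s(z) * (1 - s(z)); vanishes for large |z|."""
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    """Derivative of tanh: 1 - tanh(z)^2; vanishes for large |z|."""
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def d_relu(z):
    """Derivative of ReLU: 1 for positive input, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def softplus(z):
    return math.log1p(math.exp(z))

def d_softplus(z):
    """Derivative of softplus is the sigmoid function."""
    return sigmoid(z)

# Gradients at a large negative and a large positive input.
gradients = {
    name: (d(-10.0), d(10.0))
    for name, d in [("sigmoid", d_sigmoid), ("tanh", d_tanh),
                    ("relu", d_relu), ("softplus", d_softplus)]
}
```

In `gradients`, sigmoid and tanh are close to zero at both ends, whereas ReLU and softplus are (close to) zero only on the negative side and (close to) one on the positive side.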
Figure 2.4: Examples of Activation Function [24]
Figure 2.5: Illustration of RNN Language Model (RNNLM)
2.2 RNN
Unlike a feed-forward neural network, an RNN has interconnections between layers that
allow it to receive input from the current state as well as the previous state [10].
Figure 2.5 illustrates the application of an RNN to a language model [25]. The RNN
language model (RNNLM) is a network model used for predicting the next word in a sentence
[10]. For output y1, the only input is x1 from the current state, whereas for output y2,
the prediction is based on the input x2 and the previous information h1. However, as the
temporal dependencies become longer, the gradients tend to vanish or explode [26]. Various
RNN architectures, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU),
have been developed to overcome this problem [26]. Changes such as modifying the
architecture, adding special memory cells and introducing forget gates allow RNNs to
identify long dependencies, learn to forget/memorize






















Figure 2.7: Illustration of Dropout in Neural Network
2.4 Dropout
Dropout is a regularization technique that randomly drops neuron units in a neural
network during the training phase, temporarily removing them from forward propagation with
a certain probability p (usually set to 0.5), as shown in Figure 2.7 and Eqs. (2.2) and
(2.3) [11, 28]. In Eq. (2.2), the mask is generated independently from a Bernoulli
distribution and forms a binary vector. The generated mask is multiplied with the input
vector x, as shown in Eq. (2.3). Here, l indicates the current layer. With p = 0.5, the
network is thinned by half by dropout, and half of the parameters are omitted [11].
Omitting half of the feature detectors prevents complex co-adaptations between neurons,
forcing them to learn more robust features and mitigating the overfitting problem
[5, 11, 28].
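$$ \mathrm{mask}^{(l)} \sim \mathrm{Bernoulli}(p) \tag{2.2} $$
$$ y_j^{(l+1)} = f\left\{ w_j^{(l+1)} \left( \mathrm{mask}^{(l)} * x^{(l)} \right) + b_j^{(l+1)} \right\} \tag{2.3} $$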
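The general dropout described by Eqs. (2.2) and (2.3) can be sketched in software as one random number and one comparison per neuron, followed by an element-wise multiplication. The names `bernoulli_mask`, `apply_mask` and `keep_prob` are assumptions for illustration; with the usual p = 0.5 the keep and drop probabilities coincide, so the sketch does not depend on which of the two p denotes.

```python
import random

def bernoulli_mask(n, keep_prob, rng):
    """Binary mask: each entry is 1 with probability keep_prob.

    One random number and one comparison are needed per neuron, which is
    the RNG-plus-comparator cost of the general method.
    """
    return [1 if rng.random() < keep_prob else 0 for _ in range(n)]

def apply_mask(x, mask):
    """Element-wise product of the mask and the layer input."""
    return [m * xi for m, xi in zip(mask, x)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
x = [0.5, 1.5, -2.0, 3.0, 0.25, 1.0, -1.0, 2.0]  # hypothetical layer input
mask = bernoulli_mask(len(x), 0.5, rng)
dropped = apply_mask(x, mask)
```

The masked vector `dropped` would then be fed to the weighted sum and activation of the next layer in place of `x`.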
2.4.1 Extended research of dropout
Dropout has inspired multiple studies on regularization techniques that improve efficiency
and performance.
Wang and Manning [29] proposed fast dropout training, which samples from a Gaussian
approximation instead of performing Monte Carlo optimization, for higher speed and greater
stability.
Wan et al. [30] proposed DropConnect, an extension of dropout. Instead of dropping out
neurons, it drops the connections (weights) between neurons, so that the network becomes
sparsely connected.
Ba and Frey [31] proposed another extension of dropout, known as standout. Standout is an
adaptive dropout model: instead of setting a fixed dropout ratio, the optimal ratio for a
given neuron is determined based on the results from the previous layer. The ratio is
lowered when several hidden neurons are found to be highly correlated, and vice versa.
Zeiler and Fergus [32] introduced stochastic pooling, which was inspired by dropout. The
pooled activations are picked randomly according to a multinomial distribution. Instead of
always capturing the maximum-valued neuron (max pooling), stochastic pooling selects the
activation based on probabilities, where neurons with higher values have higher
probabilities, and vice versa.
In this thesis, a simpler and faster dropout for hardware implementation is proposed.
2.4.2 Hardware-oriented Dropout
Several studies have also implemented dropout on an FPGA [33, 34]. Su et al. implemented a restricted Boltzmann machine (RBM) with dropout on an FPGA. Instead of comparing random numbers with a dropout ratio, only HSd random numbers were generated, and the dropout ratio was given by HSd/HS, where HS was the size of the neuron layer and HSd was the number of neurons to be dropped [33]. The random numbers were used as column indices into the weight matrix, addressing the selected weights stored in external memory [33]. The drawbacks of this approach are the serial operation of the RNG and the cost of transferring weights between the RNG, the external memory, and the on-chip RAM.
Sawaguchi et al. proposed slightly-slacked dropout to alleviate the transfer cost of the method of Su et al. and to accelerate training [34]. A neural central controller was introduced to manage the Neuron Group (NG) and Neuron Combination (NC) information of slightly-slacked dropout [34]. Four consecutive neurons formed a group (NG) with a certain dropout rate, and the overall dropout rate was approximated across all groups [34]. The dropout rate of each NG could only be set to 1, 0.5, or 0; when it was set to 0.5, two neurons were chosen according to one of four fixed patterns, selected by a 2-bit field (NG Info) [34]. A 2-bit NC Info field was used to assign one of four combinations of the dropout rates of one, two, or three NGs [34].
However, these approaches have limitations, such as accuracy degradation, transfer costs, and problems arising from operating the dropout technique externally, between software and hardware [33, 34]. This thesis proposes an alternative approach that fully enables the application of dropout in hardware with parallel processing, in order to address the problem of transfer costs. In addition, resources are reduced by eliminating the RNG. The proposed method is compared with the conventional dropout method as a baseline.
Chapter 3
Methodology
This chapter gives an overview of the proposed method. The design and task flow is shown in Figure 3.1.
As shown in Figure 3.1, this thesis work began by defining the algorithm and the mathematical model to be implemented. The definition followed from the problem statements and objectives.
The design phase came next. The proposed method was modeled and designed by writing program code. It was first implemented in software simulation to verify that the design satisfied the definition. A small neural network (MLP) was implemented to observe the effectiveness and workability of the proposed method. The results were observed and analyzed. The definition, model, and code were rechecked whenever the results were not satisfactory.
Next, the program code was rewritten and modified for deeper neural networks and larger datasets, for further verification. In this stage, the same procedures were carried out as for the small neural network. The network model and the program code were rechecked whenever the results did not meet the objectives, according to the definition made in the previous stage.
After the software simulation, the thesis work proceeded to hardware simulation. The software simulation served mainly to observe the effect of the proposed method compared with the original work, whereas the hardware simulation served to observe the hardware resource consumption in a real application. The hardware description code was written and simulated to check its functionality. The code was then synthesized into output netlists, and the design summary was examined.
All the results obtained are analyzed and discussed in Chapter 5.
Figure 3.1: Design and Task Flow
3.1 Overview of Proposed Method
3.1.1 Algorithm
To apply the dropout technique, a dropout mask (a vector of binary values with a certain ratio) is required. The dropout mask is multiplied with the neurons before their values are passed to the next layer. The 0's in the mask drop the neurons, whereas the 1's allow the neurons to pass their values to the next layer. The dropout mask changes for every forward pass, that is, for each input. Figure 3.2 shows how the dropout technique is applied in general.
In software implementations, the dropout mask is generally generated by Eq. (3.1), where i varies from 0 to N, the number of neurons.
$$\mathrm{mask}[i] = \begin{cases} 1, & \text{random number} \geq \text{dropout ratio}, \\ 0, & \text{otherwise}. \end{cases} \tag{3.1}$$
Figure 3.2: Dropout Implementation
A random number is generated by an RNG with a uniform distribution from 0 to 1. If the random number is greater than or equal to the dropout ratio (usually set to 0.5 for hidden layers and 0.2 for the input layer [11]), mask[i] is set to 1; otherwise, it is set to 0. Because true random numbers are very costly and slow to generate, hardware implementations generally replace them with pseudorandom numbers, which are sufficient for most applications [35].
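The serial mask generation of Eq. (3.1) can be sketched in a few lines of Python. The function name serial_dropout_mask is hypothetical, and Python's pseudorandom generator stands in for the RNG; the point of the sketch is that one random draw and one comparison are needed per neuron.

```python
import random


def serial_dropout_mask(n, dropout_ratio, rng):
    """Generate an n-bit dropout mask, one RNG draw and comparison per bit."""
    mask = []
    for _ in range(n):               # serial loop over all neurons
        u = rng.random()             # pseudorandom number, uniform in [0, 1)
        mask.append(1 if u >= dropout_ratio else 0)
    return mask


mask = serial_dropout_mask(16, 0.5, random.Random(0))
```

The loop body is what a hardware implementation either repeats over many clock cycles or replicates as many RNG and comparator pairs.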
To calculate Eq. (3.1) on an FPGA, a hardware RNG and a comparator must be implemented. However, the biggest problem for hardware implementation is that RNGs in an FPGA consume large amounts of resources [35]. Another issue is parallelism. Eq. (3.1) is repeated for all i to generate the whole dropout mask. This loop is poorly suited to FPGA implementation, as it is processed serially and slows down the calculation. Multiple RNGs are required for parallel processing, which massively increases FPGA resource consumption. Figure 3.3 presents a block diagram of general dropout in a hardware implementation with serial processing. The input neurons and weight parameters are multiplied and summed by multiply-accumulate units while, at the same time, the dropout mask is generated by comparing the random number with the dropout ratio. A looping block is required, as only one bit is generated per comparison. The dropout mask is connected to the enable input of the D-latch, which controls the activation of the neurons. The looping block (green) can be eliminated by introducing multiple pairs of RNG and comparator blocks that operate in parallel; however, resource consumption then increases significantly, as illustrated in Figure 3.4.
Figure 3.3: Block Diagram of General Dropout in Hardware Implementation
Figure 3.4: Block Diagram of Comparator Block in Serial and in Parallel
Thus, to apply the dropout method efficiently in hardware, we propose an alternative approach. The proposed method eliminates the RNG and comparator; instead, a predefined mask is used together with a control block and a reconfiguration block. The proposed method not only saves resources by eliminating the RNG and comparator, but also enables parallel processing, allowing the dropout mask to be regenerated in a single clock cycle. In general, dropout drops neurons with true randomness, whereas the proposed method drops neurons with pseudorandomness, which is considered sufficient for most applications even though the randomness has a pattern and is predictable [35]. We hypothesize that true randomness is not essential in dropout, and that dropout can perform well even with pseudorandomness.
Figure 3.5: Block Diagram of Proposed Dropout
A predefined mask is generated and saved in memory for initialization. Each time a dropout mask is generated, the predefined mask is loaded into the reconfiguration block, which reorders its bits to form a new dropout mask. A simple, parallel operation is executed in the reconfiguration block. For simplicity, the experiments in this study applied rotation in the reconfiguration block, with the rotate-bit parameter r in the control block controlling the bit rotation in the reconfiguration block. The dropout mask generation is described by Eq. (3.2).
mask[0 : n] = {mask[r : n],mask[0 : r − 1]} (3.2)
Because only the bits of the mask are rotated, the distribution of 0's and 1's is maintained and the operation remains simple. Instead of an RNG and comparator, the reconfiguration block and control block are used, consuming fewer resources while processing at high speed. Figure 3.5 presents a block diagram of the proposed method, while Figure 3.6 presents a simple illustration of the rotation in the reconfiguration block. A new dropout mask can be generated by rotating the bits of the dropout mask in parallel. The figure illustrates that no looping process is required in the proposed method. In Figure 3.7, the algorithms for general dropout and the proposed method are compared. As the figure demonstrates, the serial loop of general dropout is inefficient in hardware, as the clock speed is low and the number of clock cycles required is proportional to the number of neurons. In contrast, the proposed method is simple and operates in parallel in a single clock cycle with fewer resources.
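The rotation of Eq. (3.2) can be illustrated in software as follows. This is only a sketch of the bit reordering: the function name rotate_mask is hypothetical, the mask is modeled as a Python list, and the inclusive ranges mask[r : n] and mask[0 : r − 1] of Eq. (3.2) are mapped to Python slices. In hardware the same reordering is a fixed rewiring of bits rather than a sequence of operations.

```python
def rotate_mask(mask, r):
    """Return {mask[r:n], mask[0:r-1]}: the mask rotated by r positions."""
    r %= len(mask)                   # rotating by the mask length is a no-op
    return mask[r:] + mask[:r]


mask = [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]
new_mask = rotate_mask(mask, 3)
```

Because the bits are only reordered, the number of 1's, and therefore the effective dropout ratio, is the same before and after the rotation.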
Figure 3.6: Illustration of Proposed Method
Figure 3.7: Comparison between general dropout algorithm and proposed algorithm
To show that the proposed method has a certain level of randomness, runs tests are performed within each generated mask and across a number of mask generations. The runs test is a statistical test that determines whether a two-valued data sequence is random [36]. A null hypothesis is stated first, along with the alternative hypothesis (the opposite of the null hypothesis). The runs test determines whether the null hypothesis is rejected (so that the alternative hypothesis may be valid) or is not rejected (insufficient evidence to reject it) [36]. In this case, the null hypothesis is that the observations are generated randomly, and the alternative hypothesis is that the observations are not generated randomly [36]. The runs test requires three parameters: the numbers of occurrences of the two values, n0 and n1, and the number of runs (a run being a sequence of consecutive identical values). For a small-sample runs test (fewer than 20 samples), the upper and lower bounds are obtained from n0 and n1 using the runs test table, and if the number of runs lies between the bounds, the null hypothesis is not rejected [36]. For a large-sample runs test (more than 20 samples), the mean, standard deviation, and Z-score are calculated, and the probability P(Z) is read from the Z-score table [36]. If P(Z) is less than the chosen significance level, the null hypothesis can be rejected [36].
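For the large-sample case, the quantities involved can be sketched as below. The sketch assumes the standard normal approximation of the runs test, with mean 2·n0·n1/n + 1 and variance 2·n0·n1·(2·n0·n1 − n)/(n²·(n − 1)) for n = n0 + n1; the function names are hypothetical, and the two-sided probability is computed with math.erfc instead of being read from a Z-score table.

```python
import math


def count_runs(seq):
    """Number of runs: maximal blocks of consecutive identical values."""
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def runs_test_z(seq):
    """Large-sample runs test for a sequence of 0's and 1's.

    Assumes both values occur at least once in seq.
    """
    n1 = sum(1 for s in seq if s == 1)
    n0 = len(seq) - n1
    n = n0 + n1
    runs = count_runs(seq)
    mean = 2.0 * n0 * n1 / n + 1.0
    var = 2.0 * n0 * n1 * (2.0 * n0 * n1 - n) / (n * n * (n - 1.0))
    z = (runs - mean) / math.sqrt(var)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return runs, z, p_two_sided


# A strictly alternating mask has far more runs than a random one would.
runs, z, p = runs_test_z([1, 0] * 20)
```

If p falls below the chosen significance level, the null hypothesis of randomness is rejected, as it is for the alternating example.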
For each mask generated by the proposed method, the numbers of 0's and 1's remain unchanged, so n0 and n1 are constant. The number of runs either remains unchanged or varies by ±1. Thus, if the predefined mask is randomly distributed, the null hypothesis that the observations are randomly generated is not rejected for any mask generated by the proposed method.
As a further example, a 10-bit mask is generated 10 times using the proposed method, and the observations are checked with the runs test. The mask is initially set to 1001101100 and is then rotated to generate a new mask 10 times, with each mask treated as a binary random number. The binary numbers are converted to decimal values; for example, 1001101100 in binary is equivalent to 620 in decimal. This yields a sequence of 10 samples. Each sample greater than the median is labeled 'A', and each sample less than the median is labeled 'B', so that a small-sample runs test can be performed. The test fails to reject the null hypothesis.
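The construction of this 10-sample example can be reproduced with the short sketch below. The rotation direction (left) and the rotate bit r = 1 are assumptions, since the text does not state them, and the helper names are hypothetical. The resulting counts of 'A', 'B', and runs would then be looked up in a small-sample runs test table.

```python
def rotate_bits(value, width, r):
    """Rotate a width-bit integer left by r positions."""
    r %= width
    full = (1 << width) - 1
    return ((value << r) | (value >> (width - r))) & full


def median_labels(values):
    """Label each sample 'A' (above the median) or 'B' (below) and count runs."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    else:
        median = ordered[mid]
    labels = ["A" if v > median else "B" for v in values if v != median]
    runs = 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return labels, runs


mask = 0b1001101100                  # 620 in decimal
values = []
for _ in range(10):
    values.append(mask)
    mask = rotate_bits(mask, 10, 1)  # assumed: rotate left by one bit

labels, runs = median_labels(values)
```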
The runs test alone is insufficient evidence of the randomness of the proposed method, but it also does not show that the proposed method is not random. The purpose of the proposed method is not to create random numbers as such, but to generate dropout masks. Therefore, the hypothesis is made that the proposed method provides sufficient randomness for dropout mask generation, and this hypothesis is verified and evaluated through various experiments.
To further increase the randomness, a split operation or an XOR operation can be introduced, by splitting the mask into several portions before performing the rotation, or by XORing with the previous bits. A split parameter s can be introduced to determine the number of portions into which the mask is split. Figure 3.8 shows the difference between the proposed method with rotation only and with rotation and split. The parameter r can also be set to a complex sequence or function to increase the randomness, but this would consume more resources, contrary to the objectives. Thus, in this thesis, only the parameter r is introduced; it is defined as a simple ascending or descending sequence and is reset after a certain number of iterations. The effect of the parameter r on the proposed method is evaluated and discussed in Section 5.1.5.
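One possible reading of the split operation is sketched below. The text does not spell out the operation in detail, so the sketch assumes that the mask is split into s equal portions and that each portion is rotated by r independently; the function names are hypothetical.

```python
def rotate(bits, r):
    """Rotate a list of bits by r positions."""
    r %= len(bits)
    return bits[r:] + bits[:r]


def split_rotate(mask, s, r):
    """Split mask into s equal portions and rotate each portion by r."""
    if len(mask) % s != 0:
        raise ValueError("mask length must be divisible by s")
    size = len(mask) // s
    out = []
    for k in range(s):
        out.extend(rotate(mask[k * size:(k + 1) * size], r))
    return out


mask = [1, 0, 0, 0, 0, 1, 0, 0]
rotated_only = rotate(mask, 1)
rotated_split = split_rotate(mask, 2, 1)
```

Under this assumption, the split variant produces a different bit order from plain rotation while still preserving the number of 1's in the mask.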
Figure 3.8: Illustration of Proposed Method with Rotation Only and with Rotation and Split
The general method, with either serial or parallel processing, is compared with the proposed method in various aspects, including algorithm and performance. The comparison is summarized in Table 3.1. The performance of the proposed method is estimated based on the hypothesis and expectation, and is evaluated in the following chapters.
Table 3.1: Comparison between General Method and Proposed Method
Comparison           General dropout      General dropout       Proposed method
                     method in serial     method in parallel    (expectation)
Processing           Serial               Parallel              Parallel
Looping              Yes                  No                    No
Resources required   High                 Very high             Low
Speed                Slow                 Fast                  Fast
Randomness           High                 High                  Low (predictable)
3.2 Software Simulation
As mentioned earlier in this chapter, the software simulation serves to observe the effect of the proposed method and to compare its performance with that of the general method. Two platforms were used: Microsoft Visual Studio and Chainer. The experiments were simulated on a computer with an Intel Core i7-6700K CPU and a Tesla K8 GPU (Graphics Processing Unit).
3.2.1 Microsoft Visual Studio
Microsoft Visual Studio is an integrated development environment (IDE) from Microsoft. On this platform, the MLP was written in the C programming language, built, and simulated. The program ran on the CPU (Central Processing Unit) only. This platform offers a high degree of programming freedom, and the neural networks can be modified freely as desired. It was therefore used initially, in the design phase. However, writing a deep neural network from scratch is relatively difficult.
3.2.2 Chainer
Chainer is a powerful and flexible framework for neural networks, introduced in 2015 [37]. It supports CUDA and multi-GPU execution, and makes implementing neural networks in the Python programming language flexible and easy [37]. The computational graph is constructed with the define-by-run concept, so no memory management is required [37]. However, the program is fixed within the training loop and cannot be modified [37]. Because of this limitation, there is relatively little freedom to design one's own algorithm. The proposed method was therefore modified to suit the Chainer platform.
I used the Chainer platform to implement the CNNs with the proposed method by writing a Python script for a new function. The initial mask of each batch is generated with an RNG, and the subsequent masks within the batch are generated with the proposed method. The CNNs were implemented and run on the GPU to shorten the training time.
3.3 Hardware Synthesis
3.3.1 ISE Design Suite
Hardware simulation and synthesis were performed to obtain information on resource consumption in a real application. ISE (Integrated Synthesis Environment) Design Suite is a software tool from Xilinx [38]. It allows developers to synthesize their designs, perform timing analysis, and examine RTL (Register-Transfer Level) diagrams [38]. The target FPGA device can also be configured when building the project. The proposed method was written in Verilog HDL (Hardware Description Language) and synthesized for hardware implementation. Timing diagrams were used to confirm that the output was as desired. The resources consumed and other detailed information were generated and reported in the design summary.
3.3.2 Vivado HLS
Vivado High-Level Synthesis (HLS) is a tool that enables users to program in software languages such as C, C++, and SystemC, and to synthesize the program into a hardware description language (VHDL, Verilog HDL), accelerating design implementation and verification [39]. Functional simulation can be performed in Vivado HLS, which provides a faster platform for designing systems [39]. In this thesis, the experiments implementing larger networks were carried out with Vivado HLS to speed up development.
Chapter 4
Simulation and Implementation
This chapter introduces the simulation and implementation of the proposed method.
4.1 Architecture of Neural Networks
In this section, the architectures of the neural networks used in the experiments are introduced: the MLP, LeNet, GoogLeNet, and RNNLM.
4.1.1 MLP
For the MLP, since there is no exact method for determining the hyper-parameters, namely the number of hidden layers and the number of hidden units, this thesis uses two hidden layers by default, with 500 units in the first hidden layer and 200 units in the second. Because determining the relationship between these hyper-parameters is not an objective of this thesis, they are fixed and do not change throughout the experiments. The input layer has 784 units and the output layer has 10 units, as the MLP is trained with the MNIST dataset, which has 28 by 28 pixel input images and 10 output classes. In short, the overall architecture is 784-500-200-10, as shown in Figure 4.1.
In the training phase of the MLP, mini-batch Stochastic Gradient Descent (SGD) is used as the update and optimization method [23, 40]. The network updates its weights after each batch is processed, and the batch size is set to 100. The cost function is softmax regression [23]. The experiments were carried out five times to obtain average results.
Figure 4.1: MLP implemented
For the proposed method, the rotate bit r was initialized to one. During the training phase, r was increased by one each time the input batch changed, and was reset to one after reaching rmax. The experiment was run five times to observe the robustness of the network, with rmax set to 14, 18, 22, 26, and 30, one value per run.
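The schedule of the rotate bit described above can be sketched as a small generator. The name rotate_bit_schedule is hypothetical, and the sketch assumes that r takes the value rmax once before wrapping back to one.

```python
from itertools import islice


def rotate_bit_schedule(r_max, start=1):
    """Yield r = start, start + 1, ..., r_max, then wrap back to start."""
    r = start
    while True:
        yield r
        r = start if r >= r_max else r + 1


# Rotate bits for the first 16 batches with rmax = 14.
first_batches = list(islice(rotate_bit_schedule(r_max=14), 16))
```

In a training loop, one value would be drawn from the generator per input batch and passed to the mask rotation.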
4.1.2 CNN - LeNet
The CNN used in this research is the classic LeNet architecture, shown in Figure 4.2. In the convolution layers, the filter size is fixed at 5 by 5; the first convolution layer consists of 24 feature maps and the second of 72 feature maps. The stride is set to 1, without zero-padding. In the pooling layers, 2 by 2 max pooling is performed with a stride of 2, so the input size of the layer is halved. After the two repetitions of convolution and pooling layers, the network is fully connected to an MLP with an architecture of 1800-1000-10 neurons.
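The 1800 input units of the fully connected part can be checked with a short calculation, assuming a 32 by 32 input image as in the CIFAR-10 dataset used to train LeNet in this thesis. The helper names conv_out and pool_out are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window, stride):
    """Output width of a pooling layer along one spatial dimension."""
    return (size - window) // stride + 1


size = 32                        # assumed CIFAR-10 input width
size = conv_out(size, 5)         # first 5x5 convolution, no padding
size = pool_out(size, 2, 2)      # first 2x2 max pooling, stride 2
size = conv_out(size, 5)         # second 5x5 convolution
size = pool_out(size, 2, 2)      # second 2x2 max pooling
flattened = 72 * size * size     # 72 feature maps in the second conv layer
```

The spatial size shrinks from 32 to 28, 14, 10, and finally 5, so the flattened feature vector has 72 × 5 × 5 = 1800 elements, matching the 1800-1000-10 architecture.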
In the training phase of LeNet, mini-batch SGD and the mini-batch Adam algorithm were used as the optimization methods [41]. As before, the batch size is set to 100 and the cost function is softmax regression.
In the proposed method, the rotate bit r was initially set to 8 and decreased by one for each input batch, and was reset to 10 when it reached one.
Figure 4.2: Architecture of LeNet
4.1.3 Deep CNN - GoogLeNet
For the deep CNN, I implemented GoogLeNet. GoogLeNet is similar to a classic CNN, consisting of convolutional layers, pooling layers, softmax layers, and normalization layers. The difference is that GoogLeNet introduces inception modules, which consist of multiple convolutional layers with different kernels in the same layer (Figure 4.3) [6]. The inception modules allow the network to find the optimal local sparse structure. 1 by 1 convolution modules are also used within the inception modules for dimension reduction [6]. The outputs of the convolution modules are then concatenated.
Figure 4.3: Inception Module with Dimension Reduction
Figure 4.4: Architecture of GoogLeNet
The architecture of the network implemented in this thesis is the same as in the original paper, except that the output layer is reduced from 1000 to 15 units to train on the @home dataset [6]. Because GoogLeNet is a very deep neural network and training it is time-consuming, the experiment was run only once, with Adam optimization, which in the LeNet results showed better performance by bringing the learning process to a steady state faster. The setting of the rotate bit r is the same as for training LeNet. Figure 4.4 shows the architecture of GoogLeNet, with the inception modules marked by red circles.
Figure 4.5: Architecture of RNNLM
4.1.4 RNN Language Model
The RNNLM was implemented and trained with the Penn Treebank (PTB) dataset, which contains 10,000 vocabulary words in sequence and is commonly used in natural language processing [42]. Each word, identified by its ID, was input as a one-hot vector and multiplied with an embedding matrix before being fed to the RNN. The RNNLM used in this experiment consisted of two hidden layers with long short-term memory (LSTM), and the dropout layer was applied before the LSTM layer. The RNNLM was composed of five layers (10,000-650-650-650-10,000), as illustrated in Figure 4.5.
Figure 4.6: Example Image of MNIST [9]
4.2 Dataset
Four different datasets were used for the implementation: MNIST, CIFAR-10, the @home dataset, and Penn Treebank (PTB), as follows.
4.2.1 MNIST
The MNIST dataset contains images of handwritten digits [9]. Each image is 28 by 28 pixels in 8-bit grayscale. To normalize the data, each pixel value is divided by 255. The MNIST dataset contains a total of 60000 training images and 10000 test images across 10 classes. Examples from the MNIST dataset are shown in Figure 4.6.
Figure 4.7: Example Image of CIFAR-10 [42]
4.2.2 CIFAR-10
The CIFAR-10 dataset contains a total of 60000 RGB color images, comprising 50000 training images and 10000 test images [43]. Each image is 32 by 32 pixels, with 8 bits per color channel. CIFAR-10 contains 10 classes of objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Figure 4.7 shows example images from the CIFAR-10 dataset. Compared with MNIST, CIFAR-10 is relatively difficult to learn, as the images contain more information.
Table 4.1: Example Image of 15 Classes Object of @home Dataset [43, 44]
Item Categories    Examples
PET bottle         Iced-tea, Cafe-Au Lait, Green tea
Snack              Potato stick, Potato chips, Chocolate cookies
Fruit juice
4.2.3 @home Dataset
The @home dataset is a dataset designed in-house for the home service robot at our university. It is mainly intended to let the home service robot recognize daily products so that it can help humans carry those items [44, 45]. There are 15 object classes in total, each consisting of 2000 training images and 700 test images. All images are color images. The image sizes are not constant, but the images are resized to 224 by 224 pixels when training GoogLeNet. The examples and classes of the image data are tabulated in Table 4.1.
Figure 4.8: POS tagset in PTB [41]
4.2.4 Penn Treebank (PTB)
The Penn Treebank (PTB) dataset is a corpus consisting of millions of words of text and sentences [42]. PTB is sequential data that is commonly used in language modeling. Figure 4.8 shows the part-of-speech (POS) tagset in PTB, and Figure 4.9 shows an example of bracketed text in PTB [42]. In the experiments, 10000 vocabulary words in sequence from the PTB dataset are used. The word IDs are input as one-hot vectors and multiplied with an embedding matrix before being fed to the RNN, as illustrated in Figure 4.5.
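The effect of multiplying a one-hot vector with an embedding matrix can be seen in a toy example. The sketch below uses a vocabulary of 4 words and an embedding size of 3 instead of the 10000 and 650 used in the experiments; the values and the helper names are hypothetical.

```python
def one_hot(index, size):
    """One-hot vector with a 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(one_hot_vec, embedding):
    """Vector-matrix product of a one-hot row vector and the embedding matrix."""
    dim = len(embedding[0])
    return [sum(v * row[j] for v, row in zip(one_hot_vec, embedding))
            for j in range(dim)]


embedding = [
    [0.1, 0.2, 0.3],   # word ID 0
    [0.4, 0.5, 0.6],   # word ID 1
    [0.7, 0.8, 0.9],   # word ID 2
    [1.0, 1.1, 1.2],   # word ID 3
]
vector = embed(one_hot(2, 4), embedding)
```

The product simply selects the row of the embedding matrix that corresponds to the word ID, which is why frameworks usually implement the embedding as a table lookup.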




This chapter presents and discusses the experimental results. The experiments followed two approaches: software verification and hardware synthesis. In software verification, the proposed method was examined by comparing the effect of dropout on different neural network architectures with their corresponding datasets. In hardware synthesis, the resource utilisation and processing speed were compared to examine the effectiveness of the proposed method in hardware implementation. The experiments are as follows:
• Software Verification
– MLP - MNIST
– LeNet - CIFAR10
– GoogLeNet - @home dataset
– RNNLM - PTB
– Randomness effect analysis
• Hardware Synthesis
– generation of 8-bit dropout mask
– generation of 64-bit dropout mask
– application to neuron layer
– application to MLP
Figure 5.1: Average Accuracy of MLP Trained with MNIST Dataset
5.1 Software Verification
In software verification, the proposed method is implemented on the CPU and GPU and compared with the general method, using different neural network architectures and datasets. The experiments that follow it are a simple illustration of hardware implementation: instead of implementing and training a whole neural network, only the dropout mask is generated, and the comparison is based on the resources consumed. The effect of the parameter r is evaluated in the last part of this section.
5.1.1 MLP - MNIST
In this experiment, the MLP was trained with the MNIST dataset. Figure 5.1 shows the average results of five trials.
In Figure 5.1, the black lines represent the MLP trained without dropout, the blue lines represent the MLP trained with the general dropout method, and the red lines represent the MLP trained with the proposed method; the dotted lines represent the corresponding test accuracy.
Figure 5.1 shows an overfitting problem when training without dropout. A sudden drop in test accuracy (to ∼65%) was observed during inference, while the training accuracy remained high (∼95%). This gap indicates that the MLP overfitted to the training data: it adapted very well to the training data but failed to predict the new test data, causing the drop in test accuracy. The overfitting problem was solved when the dropout technique was applied.
Both the general dropout method and the proposed method were able to solve this problem, closing the gap between training accuracy and test accuracy and achieving over 90% recognition accuracy. The results show that the proposed method worked well in this experiment, achieving an effect similar to that of the general dropout method. Even though the proposed method dropped neurons not randomly but in a certain sequence, it successfully generalized the network so that it did not overfit to the training data.
Figure 5.2: Recognition Accuracy of LeNet Trained with CIFAR-10 Dataset (SGD)
Figure 5.3: Recognition Accuracy of LeNet Trained with CIFAR-10 Dataset (ADAM)
5.1.2 LeNet - CIFAR10
In this experiment, LeNet, a shallow CNN, is trained with the CIFAR-10 dataset and implemented on the Chainer platform. As in Section 5.1.1, the training accuracy and test accuracy of the CNN trained without dropout, with the general dropout method, and with the proposed method are observed. The experiment was carried out twice, once with SGD optimization (Figure 5.2) and once with Adam optimization (Figure 5.3).
In both Figures 5.2 and 5.3, the black lines represent LeNet trained without dropout, the blue lines represent LeNet trained with the general dropout method, and the red lines represent LeNet trained with the proposed method; the dotted lines represent the corresponding test accuracy.
Figure 5.2 shows that learning is slow and still in progress even after 50 epochs. However, the test accuracy saturates at around 70% regardless of the method. The result shows that the network with the general dropout method learns slowly, whereas the performance with the proposed method is similar to that of the network without dropout.
Figure 5.3 shows a slight drop in training accuracy for both networks that applied dropout (99.2% without dropout, 96.7% with general dropout, and 98.4% with the proposed method), yet at the inference stage the accuracy of all approaches remains the same, levelling off at around 70%.
Overfitting is thus still present even after applying the dropout technique. A possible reason is that dropout was applied only to the last MLP layer, so LeNet learned overly specific features and overfitted in the convolutional layers during training. Nevertheless, the figures suggest that the proposed method is applicable and has an effect similar to that of the general method. To further verify this statement, another experiment was carried out, as described in the next section.
5.1.3 GoogLeNet - @home Dataset
In this experiment, GoogLeNet, a deep CNN, is implemented on the Chainer platform and trained with the @home dataset, a dataset used for a home service robot. All images are resized to 224 x 224 pixels before being fed into GoogLeNet. Dropout is applied at the last layer. The general method and the proposed method are compared in Figure 5.4.
In Figure 5.4, the blue lines represent GoogLeNet trained with the general dropout method and the red lines represent GoogLeNet trained with the proposed method; the dotted lines represent the corresponding test accuracy.
As the accuracy graph in the figure shows, GoogLeNet was well trained with both the general method and the proposed method, achieving almost 100% accuracy in both the training and inference phases. This verifies that the proposed method is applicable and able to achieve an effect similar to that of the general method. The effectiveness of the resource saving is therefore evaluated in the following section.
Figure 5.4: Recognition Accuracy of GoogLeNet Trained with @home Dataset
Figure 5.5: Comparison of recurrent neural network language model
5.1.4 RNNLM - PTB
In this section, the experiments are carried out with the RNNLM trained on the PTB dataset. Figure 5.5 compares the perplexity. When dropout was not applied (black lines), there was an obvious gap between the training and test perplexity: the test perplexity increased while the training perplexity remained low. When dropout was applied to the RNNLM, the gap between training perplexity and test perplexity was reduced. Table 5.1 also shows that the test perplexity without dropout reached over 750, in contrast to the approaches using dropout, whose perplexity values were only approximately 89. The results of the proposed method were thus nearly identical to those of the general dropout method.





5.1.5 Randomness Effect Analysis
As mentioned in Section 2.4, in the original work on the dropout technique [11], neurons are dropped randomly while the neural network is trained. The randomness of the proposed method, however, depends on the initialization and the rotate bit. The mask can be predicted, as it is a sequence controlled by the rotate bit. Therefore, to observe the effect further, additional experiments were carried out in which the initial dropout mask and the rotate bit r were varied.
5.1.5.1 Vary in Initialization
In this section, the experiment was carried out with a 4-layer MLP (784-200-100-10)
trained on the MNIST dataset. The experimental setup was the same as in Section 5.1.1,
but with smaller hidden layers to shorten the experiment time. The initialization of
the predefined mask for the proposed method is now changed by varying the random number
period from 2 to 20 in increments of 2. The random number period refers to the number
of generations after which the same random number is obtained again. For example, a
random number with period 2 is the binary sequence "0101...01"; in the proposed method,
the generated mask then repeats every 2 generations, provided that the rotate bit r is
set to 1. The experimental results are shown in Figures 5.6 and 5.7, and the accuracy
at epoch 30 for all approaches is tabulated in Table 5.2.
Figure 5.6: Comparison of Initialization by Varying from Random Number with
Period 2 to 10
Figure 5.7: Comparison of Initialization by Varying from Random Number with
Period 12 to 20
The results show that overfitting still occurs when the proposed method is applied with
a short random period. When the period is larger than 10, the performance improves and
the overfitting is largely mitigated. As the period of the random number increases,
more distinct mask patterns can be generated, and the performance approaches that of
the conventional method. This experiment varied the random period only up to 20;
although the performance was not as good as that of the conventional method, it
improved as the random period increased and came close to that of random
initialization. The experiment also shows the importance of the dropout mask
initialization for the proposed method. In actual applications, however, the dropout
mask usually has hundreds of bits or more, so the random period is assumed to be large
enough for the proposed method. Also, in this experiment, the rotate bit r was fixed at
1, whereas varying the rotate bit r may change the random period as well. For
illustration, when a random number with period 8 is used with r set to 2, the same mask
repeats every 4th generation, which halves the random period. When r is instead set to
an ascending sequence, the mask repeats for the first time at the 11th generation and
for the second time at the 24th generation, which means that the random period is
extended and becomes irregular.
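The relationship between a constant rotate bit and the repetition period can be
illustrated with a short sketch. It assumes that each new mask is obtained by a circular
rotation of the previous mask by r bits; the helper names `rotate_left` and
`repeat_period` and the example bit patterns are hypothetical and are not taken from the
thesis implementation.

```python
def rotate_left(mask, r):
    """Circularly rotate a list of mask bits left by r positions."""
    r %= len(mask)
    return mask[r:] + mask[:r]

def repeat_period(initial_mask, r):
    """Count generations until the initial mask reappears (constant rotate bit r)."""
    mask = rotate_left(initial_mask, r)
    generations = 1
    while mask != initial_mask:
        mask = rotate_left(mask, r)
        generations += 1
    return generations

# Period-2 pattern "0101...01" on an 8-bit mask (1 = keep, 0 = drop is an assumption).
period2_mask = [0, 1] * 4
# An 8-bit pattern that only repeats after a full 8-bit rotation.
period8_mask = [1, 0, 0, 1, 0, 1, 1, 1]

p2_r1 = repeat_period(period2_mask, 1)  # repeats every 2 generations
p8_r1 = repeat_period(period8_mask, 1)  # repeats every 8 generations
p8_r2 = repeat_period(period8_mask, 2)  # rotating by 2 halves the period to 4
```

With a constant r, the period is the number of rotations needed to return to the
starting alignment, which is why r = 2 halves the period of a period-8 mask.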
Table 5.2: Train and Test Accuracy at Epoch 30
Method                Mask initialization     Train (%)   Test (%)
Without Dropout       -                       97.4        74
Conventional Method   -                       93.6        95.2
Proposed Method       random initialization   96.1        92.4
Proposed Method       period 2                98.3        75.2
Proposed Method       period 4                98.1        81.1
Proposed Method       period 6                97.6        88.6
Proposed Method       period 8                97.7        81.2
Proposed Method       period 10               98          79.6
Proposed Method       period 12               97.7        88.6
Proposed Method       period 14               97.8        91.4
Proposed Method       period 16               97.5        89.4
Proposed Method       period 18               97.7        92.3
Proposed Method       period 20               97.8        90.1
5.1.5.2 Vary in Rotate Bit
The results in Section 5.1.3 showed that the proposed method and the general method
achieve a similar effect. Additional experiments were carried out to observe the effect
of the rotate bit on the proposed method. In the previous experiments, the rotate bit r
was set to a sequence. To observe its effect, r is instead set to a constant so that a
comparison can be made. The experiment uses the same environment settings as in
Section 5.1.3, with the rotate bit r fixed to the constant 1. The experiment is then
repeated with r changed to 2, 4, 8, 16, and 32. The results are plotted in Figure 5.8.
From Figure 5.8, it can be concluded that the effect of the rotate bit r is
insignificant, as almost the same results are achieved in all cases except r = 1. Thus,
true randomness may not be necessary in the dropout method and can be replaced by
pseudo-randomness. Although the results show that the rotate bit is not an important
factor in the proposed method, this may differ for other neural networks or datasets
not included in this work. Therefore, control of the rotate bit and the initialization
method, as well as additional random features (discussed in Section 6.2), can be added
to make the method more robust across cases.
Figure 5.8: Comparison of Various Rotate Bit
Figure 5.9: Timing Diagram - General Method in Serial (8-bit)
5.2 Hardware Synthesis
5.2.1 Generation of 8-bit Dropout Mask
To verify the effectiveness of the proposed method, I implemented two conventional
methods and the proposed method on a Xilinx Virtex-6 FPGA. All methods were described
in the Verilog hardware description language, and the Xilinx ISE Design Suite was used
for logic synthesis. To simplify the experiments, I generated an 8-bit dropout mask,
and for the RNG used in the conventional methods, I used an 8-bit LFSR (linear feedback
shift register) as a reference. The timing diagrams were observed to verify the
functionality (Figures 5.9, 5.10, and 5.11), and the design summary information is
tabulated in Table 5.3 to show the resources consumed.
Here, rnum represents the random number generated by the RNG, and omask is the output
dropout mask. Figure 5.9 shows only one rnum signal, indicating that only one RNG is
used. However, the dropout mask omask is generated only once every eight clock cycles,
which introduces a delay and slows the implementation. Serial processing does not fully
utilize the advantages of an FPGA. Moreover, this result is for an 8-bit dropout mask;
in an actual implementation, the number of neurons can be in the hundreds, thousands,
or more, and with this method the number of clock cycles required increases in
proportion to the number of neurons.
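The serial scheme can be modelled in software as follows. This is only an illustrative
sketch, not the Verilog used in the experiments: the LFSR taps (bits 8, 6, 5, 4), the
seed, and the comparator threshold of 128 (a dropout ratio of 0.5) are assumptions
chosen for the example, and the names `lfsr8_step`, `serial_mask`, and `lfsr_period` are
hypothetical.

```python
def lfsr8_step(state):
    """Advance an 8-bit Fibonacci LFSR by one step (taps 8, 6, 5, 4; assumed)."""
    bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
    return ((state >> 1) | (bit << 7)) & 0xFF

def serial_mask(seed, n_bits=8, threshold=128):
    """Build a mask one bit per clock cycle from a single RNG and comparator.

    Returns (mask_bits, final_state, cycles_used).
    """
    state, mask = seed, []
    for _ in range(n_bits):          # one loop iteration models one clock cycle
        state = lfsr8_step(state)    # rnum for this cycle
        mask.append(1 if state >= threshold else 0)  # comparator output
    return mask, state, n_bits

def lfsr_period(seed=0x01):
    """Count steps until the LFSR state returns to the (nonzero) seed."""
    state, steps = lfsr8_step(seed), 1
    while state != seed:
        state = lfsr8_step(state)
        steps += 1
    return steps

omask, _, cycles = serial_mask(seed=0xA5)   # 8 cycles for an 8-bit mask
```

A parallel version would instantiate one such generator and comparator per mask bit,
trading the eight cycles for eight RNGs.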
In Figure 5.10, the dropout mask omask is generated every clock cycle, which is fast.
However, to allow this parallel processing, eight RNGs are required, as shown by the
rnum signals in the figure. This significantly increases the resources required,
because with this method the number of RNGs is directly proportional to the number of
neurons. Although the dropout mask can be generated very quickly, the resource cost is
not affordable.
Comparing the previous two figures with Figure 5.11, it is clear that the proposed
method achieves the same speed as the parallel method without using any RNGs. The
dropout mask is generated within a single clock cycle using only a small amount of
resources. Furthermore, the time required and the resources consumed by the proposed
method are not proportional to the number of neurons; that is, regardless of the number
of neurons, the time required and the resources utilized remain about the same.
Figure 5.10: Timing Diagram - General Method in Parallel (8-bit)
Figure 5.11: Timing Diagram - Proposed Method (8-bit)
Table 5.3: Comparison of Resources Consumed for Dropout Implementation
Logic Utilization                          General dropout    General dropout      Proposed
                                           method in serial   method in parallel   method
Number of slice registers                  32                 76                   8
Number of slice LUTs                       44                 80                   7
Number of fully used LUT-FF pairs          27                 72                   0
Clock cycles required to generate a mask   8                  1                    1
RNGs required                              1                  8                    N/A
Table 5.3 shows the design summary and the resource utilization. As mentioned above,
the proposed method requires far fewer registers and LUTs (look-up tables) than the
ordinary RNG-based methods. The serial implementation used fewer resources than the
parallel one but required more clock cycles, losing the parallelism that is the
advantage of an FPGA. The parallel implementation of the conventional dropout method
required only a single clock cycle to generate the mask, but its resource consumption
increased significantly. The proposed method achieved parallel processing, generating a
dropout mask in a single clock cycle with few resources through a simple operation.
Figure 5.12: Timing Diagram - General Method in Serial (64-bit)
Figure 5.13: Timing Diagram - Proposed Method (64-bit)
5.2.2 Generation of 64-bit dropout mask
In this section, we extend the experiments of Section 5.2.1 to a 64-bit mask to further
observe how the results change. The environment settings are the same as in Section
5.2.1. The timing diagrams of the general method in serial and the proposed method are
shown in Figures 5.12 and 5.13, respectively.
In Figure 5.12, since the dropout mask is generated serially, 128 ns is consumed for
one mask (267 ns - 139 ns). The clock period is set to 2 ns, so 64 clock cycles are
required to complete the mask generation, because rnum must be generated 64 times. This
again verifies that the general method processed in serial is inefficient and slow in
an FPGA implementation.
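The cycle count follows directly from the timestamps read off the timing diagram. A
minimal check of this arithmetic, with variable names chosen here for illustration:

```python
# Timestamps read from Figure 5.12 (ns) and the simulated clock period (ns).
mask_start_ns = 139
mask_end_ns = 267
clock_period_ns = 2

elapsed_ns = mask_end_ns - mask_start_ns          # time to produce one 64-bit mask
cycles_per_mask = elapsed_ns // clock_period_ns   # one rnum (one mask bit) per cycle

print(f"{elapsed_ns} ns -> {cycles_per_mask} clock cycles")
```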
In Figure 5.13, similar to the result in Section 5.2.1, the proposed method took only
one clock cycle to complete the mask generation, without requiring any RNGs, even
though the mask size was increased from 8 bits to 64 bits. This indicates that the
proposed method can be implemented efficiently and that its processing speed is
independent of the size of the dropout mask.
Table 5.4: Comparison of field programmable gate array resources consumed for 64-bit
dropout masks
Logic Utilization                          General dropout    General dropout      Proposed
                                           method in serial   method in parallel   method
Number of slice registers                  149                588                  70
Number of slice LUTs                       190                640                  64
Number of fully used LUT-FF pairs          141                576                  64
Clock cycles required to generate a mask   64                 1                    1
RNGs required                              1                  64                   N/A
Table 5.4 tabulates the resource utilization for 64-bit mask generation with the
general dropout in serial, the general dropout in parallel, and the proposed method.
The serial implementation requires fewer resources than the parallel implementation,
but more clock cycles. As in Table 5.3, the proposed dropout is the most efficient of
the three approaches, as it uses the fewest resources and the fewest clock cycles.
Comparing Tables 5.3 and 5.4 also shows that the proposed method can still generate a
mask in a single clock cycle, like the general dropout in parallel, with a significant
reduction in resources.
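To make the comparison concrete, the figures from Tables 5.3 and 5.4 can be expressed
as ratios. The short script below does only this arithmetic on the tabulated values;
the dictionary layout and the function name `relative_to` are chosen here for
illustration.

```python
# (slice registers, slice LUTs) copied from Tables 5.3 and 5.4.
usage = {
    8:  {"serial": (32, 44),   "parallel": (76, 80),   "proposed": (8, 7)},
    64: {"serial": (149, 190), "parallel": (588, 640), "proposed": (70, 64)},
}

def relative_to(baseline, method, bits):
    """Return (register ratio, LUT ratio) of `method` relative to `baseline`."""
    m_reg, m_lut = usage[bits][method]
    b_reg, b_lut = usage[bits][baseline]
    return m_reg / b_reg, m_lut / b_lut

for bits in usage:
    reg, lut = relative_to("parallel", "proposed", bits)
    print(f"{bits}-bit mask: proposed uses {reg:.1%} of the registers "
          f"and {lut:.1%} of the LUTs of the parallel method")
```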
5.2.3 Application to neuron layer
In this section, we observe the resources consumed when the different dropout
approaches are applied in a neural network layer and compare them as the number of
neurons in the layer increases. The experiments started with 100 neurons and added 100
neurons at a time up to 1000 neurons. The designs were synthesized from High-Level
Synthesis (HLS) code in Vivado HLS 2018, targeting the Zynq UltraScale+ ZCU102 FPGA
board as a reference. The comparison results in terms of latency, flip-flops (FFs),
look-up tables (LUTs), and digital signal processors (DSPs) are shown in Figures
5.14 - 5.17.
In Figures 5.14 - 5.17, three dropout approaches are analyzed: the conventional dropout
in serial connection (black), the conventional dropout in parallel connection (blue),
and the proposed dropout (red). For the latency comparison (Figure 5.14), as expected
from the conclusions of the previous sections, the serial dropout shows the highest
latency, whereas the parallel dropout is very fast. The proposed dropout also achieves
high-speed processing, with slightly higher latency than the parallel dropout. In terms
of FF consumption (Figure 5.15), the serial dropout consumes very few resources, below
1000 FFs on average. For the proposed method, the FF consumption increases in
proportion to the number of neurons, as resources are needed for the configuration
block. The parallel dropout, however, shows a huge jump in FF consumption (right
y-axis), which is not affordable for an FPGA implementation. A similar result can be
observed in Figure 5.16: the serial dropout consumes the fewest LUTs, whereas the
parallel dropout comes at an extremely high cost. Lastly, the DSP consumption is shown
in Figure 5.17. The serial dropout and the proposed dropout use only three DSPs each,
whereas the parallel dropout uses hundreds of DSPs, increasing with the number of
neurons. In summary, the serial dropout consumes relatively few resources but has
higher latency. The parallel dropout achieves a very low latency, which can be
insignificant in the overall view of the neural network, yet its cost is extremely high
and exceeds the total resources available in the FPGA, which is unrealistic. The
proposed dropout can be considered a compromise between the serial dropout and the
parallel dropout, allowing low-latency processing at an acceptable level of resource
consumption.
Figure 5.14: Latency Comparison
Figure 5.15: Flip-flop (FF) Resources Comparison
Figure 5.16: Look-up Table (LUT) Resources Comparison
Figure 5.17: Digital Signal Processor (DSP) Resources Comparison
Note that the results in this section may appear to slightly contradict those in
Sections 5.2.1 and 5.2.2. This may be because, when the dropout mask is attached to a
neuron layer, the resources for the input and output neurons are counted as well,
rather than those for the mask generation only. For the serial dropout, the inputs and
outputs can also be connected serially during processing, whereas for the proposed
dropout, the input and output layers must be in parallel, which increases the resource
count. In addition, the results in this section were generated from HLS code, and the
automatically generated HDL code may not be fully optimized, causing undesired
additional resources. Further optimization, such as pipelining, may be applied for
improvement.
By importing the HLS project into Vivado, the implementation can be executed and the
total on-chip power can be obtained. The total on-chip power is the power consumed
internally within the FPGA, which is the sum of the static power (device static power:
power from transistor leakage; design static power: power for the design configuration)
and the dynamic power (switching power, the average power due to the internal activity
of the design) [46].
Table 5.5: The power and energy for dropout mask with 100 neurons at 100 MHz
                          Serial dropout   Parallel dropout   Proposed dropout
Total on-chip power (W)   0.73             2.029              0.678
Latency (clock cycles)    1402             39                 107
Energy (µJ)               10.2346          0.79131            0.72546
We tabulate the total on-chip power and calculate the energy required to generate the
dropout mask for 100 neurons in Table 5.5. The latency is taken from the results in
Figure 5.14. The ZCU102 board targeted in this experiment provides multiple clocks; the
energy in Table 5.5 is computed with the clock frequency set to 100 MHz, and it may
vary with different frequencies. The table shows that although the power required by
the serial dropout is small, its energy is larger because more clock cycles are
required, making it energy inefficient. For the parallel dropout, the energy required
is similar to that of the proposed method, but as the number of neurons increases, more
resources are required, which increases the power, so a higher energy is expected. The
proposed dropout is the most energy efficient of the three approaches. Moreover, based
on Figures 5.14 - 5.17, the proposed method requires significantly fewer resources than
the parallel dropout while keeping the latency low, indicating that the proposed method
would be more energy efficient than both other approaches as the number of neurons
increases.
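The energy values in Table 5.5 are consistent with multiplying the total on-chip power
by the execution time, where the execution time is the latency in clock cycles divided
by the clock frequency. The sketch below reproduces that arithmetic; the formula is
inferred from the tabulated numbers rather than stated explicitly in the text, and the
function name `energy_uj` is hypothetical.

```python
CLOCK_HZ = 100e6  # 100 MHz, as used for Table 5.5

def energy_uj(power_w, latency_cycles, clock_hz=CLOCK_HZ):
    """Energy in microjoules = power (W) * latency (cycles) / frequency (Hz) * 1e6."""
    return power_w * latency_cycles / clock_hz * 1e6

# (total on-chip power in W, latency in clock cycles) from Table 5.5
designs = {
    "serial":   (0.73, 1402),
    "parallel": (2.029, 39),
    "proposed": (0.678, 107),
}

energies = {name: energy_uj(p, n) for name, (p, n) in designs.items()}
for name, e in energies.items():
    print(f"{name:8s} {e:.5f} uJ")
```

Because the energy scales with both power and cycle count, the serial dropout's low
power does not compensate for its much longer latency.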
5.2.4 Application in MLP
A further experiment was carried out to observe the effect of dropout on the resources
consumed when it is applied in a neural network. We ran the experiment with an MLP of
structure 784-100-10 for each of the dropout approaches. Only forward propagation was
run in this experiment because of limitations and constraints. Floating-point
operations are used throughout the synthesis. Further optimization techniques, such as
fixed-point operations, pipeline processing, memory allocation, and others, may be
applied during implementation to increase effectiveness. However, since this thesis
focuses on dropout and the effectiveness of the proposed method, we do not apply such
optimizations to the feed-forward MLP in this case.
As Table 5.6 shows, the proposed method reduces the latency by around 100k clock cycles
compared with the serial dropout, which is similar to the performance of the parallel
dropout. In terms of resources, the number of FFs used by the proposed method is higher
than that of the serial dropout, but the number of LUTs is lower, so the two are
balanced overall. The parallel dropout achieves the lowest latency, but its FF and LUT
consumption is extremely high and exceeds the total amount available on the FPGA board,
which makes implementation impossible. The parallel dropout also uses almost all of the
available DSPs, whereas the other approaches consume fewer than 10 DSPs.
Table 5.6: Latency and resources comparison in MLP: 784-100-10.
Values marked with * (shown in red in the original) exceed the amount available in the FPGA.
          Available amount   Serial dropout   Parallel dropout   Proposed dropout
Latency   -                  2502607          2389957            2397637
FF        548160             80291            24951168*          115628
LUT       274080             190362           4393285*           161449
BRAM      1824               787              788                789
DSP       2520               5                2483               8
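The budget check and the latency differences in Table 5.6 can be reproduced with a few
lines. The data structure and the function name `over_budget` below are illustrative
only; the numbers are copied from the table.

```python
available = {"FF": 548160, "LUT": 274080, "BRAM": 1824, "DSP": 2520}

designs = {
    "serial":   {"latency": 2502607, "FF": 80291,    "LUT": 190362,  "BRAM": 787, "DSP": 5},
    "parallel": {"latency": 2389957, "FF": 24951168, "LUT": 4393285, "BRAM": 788, "DSP": 2483},
    "proposed": {"latency": 2397637, "FF": 115628,   "LUT": 161449,  "BRAM": 789, "DSP": 8},
}

def over_budget(design):
    """Return the resources whose usage exceeds what the device provides."""
    return [res for res, limit in available.items() if design[res] > limit]

for name, d in designs.items():
    saved = designs["serial"]["latency"] - d["latency"]
    print(f"{name:8s} over budget: {over_budget(d) or 'none'}, "
          f"cycles saved vs serial: {saved}")
```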
5.3 Discussion
5.3.1 The Proposed Method
We compared the general dropout and the proposed dropout in terms of processing, time,
randomness, generation of the dropout mask, resources required, and suitable devices,
as tabulated in Table 5.7. The general dropout is a serial processing algorithm,
whereas the proposed dropout is designed for parallel processing. Thus, the proposed
dropout is faster than the general dropout, and its speed is independent of the number
of neurons (the size of the dropout mask). However, the proposed method has lower
randomness because it generates dropout masks from an initial predefined mask. In terms
of resources, the general dropout requires RNGs and comparators, so its resource
consumption is high, whereas the proposed dropout requires only a very small amount of
resources. For software execution, such as on a CPU, the general dropout is more
suitable, whereas for hardware devices such as FPGAs, the proposed dropout has
advantages and is efficient to implement.
Table 5.7: Summary of general dropout and proposed dropout
                       General dropout                Proposed dropout
Processing             Serial looping                 Parallel
Time                   Slow                           Fast
                       (dependent on # of neurons)    (independent of # of neurons)
Randomness             High                           Normal
Dropout mask           Regenerated for each layer     Predefined
Resources required     Very high (for parallel),      Low
                       high (for serial)
Suitable application   Software                       Hardware
5.3.2 Pseudo RNG
The proposed method realizes the dropout technique for neural networks in a simpler
way, producing a pseudo-random dropout mask with minimal resources. Much research has
addressed RNGs for hardware implementation, covering both pseudo-random number
generators (PRNGs) and true random number generators (TRNGs), as reviewed in [47]. Each
approach has its advantages and limitations. Undoubtedly, implementing dropout with
these approaches would increase the efficiency. However, the problems of the
conventional dropout remain unsolved, namely the looping process in the serial
implementation and the duplication of RNG and comparator blocks in the parallel
implementation (discussed in Chapter 3). Applying the random number from the RNG
directly to the neuron layer, without generating a dropout mask, seems to be a possible
alternative. Nevertheless, the number of bits and their distribution would be a
concern, as the dropout ratio is difficult to fix and control in that situation.
Instead, the proposed method provides a simpler solution that does not need an RNG to
regenerate the dropout mask. The dropout ratio is fixed by the predefined mask and can
be controlled by resetting the predefined mask.
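The point about the dropout ratio can be illustrated with a small sketch. It assumes
that the proposed method derives each mask by circularly rotating the predefined mask;
the function names, the 8-neuron layer, and the use of Python's `random` module as a
stand-in RNG are all hypothetical.

```python
import random

def rotated_masks(predefined, r, count):
    """Yield `count` masks, each the previous one rotated left by r bits."""
    mask = list(predefined)
    for _ in range(count):
        shift = r % len(mask)
        mask = mask[shift:] + mask[:shift]
        yield mask

def direct_random_mask(n_bits, keep_prob, rng):
    """Mask taken directly from an RNG: the number of kept neurons varies."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n_bits)]

predefined = [1, 1, 0, 1, 0, 0, 1, 0]          # 4 of 8 neurons kept (ratio 0.5)
kept_rotated = [sum(m) for m in rotated_masks(predefined, r=3, count=20)]

rng = random.Random(0)                         # fixed seed for repeatability
kept_direct = [sum(direct_random_mask(8, 0.5, rng)) for _ in range(20)]
```

Every entry of `kept_rotated` equals 4 because rotation only permutes the bits, whereas
the counts in `kept_direct` depend on the random draws.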
5.3.3 Other Hardware-oriented Implementation
As mentioned in Section 2.4.2, other studies have also addressed hardware
implementation, such as [33] and [34]. In the implementation in [33], the authors use
external memory to address the weights selected after dropout. Rather than optimizing
the dropout algorithm, that paper focuses on the proposed model with dropout [33]. The
paper also mentions that the implementation of dropout in a restricted Boltzmann
machine (RBM) has poor scalability of DSP efficiency because of the transfer time and
linearity. We expect that our proposed method, which can operate in parallel and
without external memory, can overcome this problem and increase the efficiency. In
[34], the authors aim to minimize the transfer cost of dropout between hardware and
software. According to that paper, 35026 LUTs, 49784 registers, and 112 DSPs were used
for the implementation, with around 5% accuracy degradation [34]. In contrast, we
propose an alternative approach that implements the dropout algorithm fully in
hardware, so that no transfer cost is incurred. Although the LUTs and registers
consumed are higher than in [34], the number of DSPs is significantly reduced.
Therefore, it offers an alternative trade-off among transfer cost, DSPs, and other
factors.
Chapter 6
Conclusion and Future Works
6.1 Conclusion
In conclusion, this thesis presented an alternative approach to the dropout technique
with a view to making it effective for hardware implementation. Implementing neural
networks on FPGAs is difficult because of the differences in architecture and the
resource constraints; thus, general methods are often unsuitable or inapplicable. The
dropout technique usually requires an RNG to randomly drop neurons while the neural
network is trained. However, RNGs consume a large amount of FPGA resources. The
proposed method eliminates the use of RNGs, which enables parallel processing, and can
be implemented effectively on FPGAs.
The performance of the proposed method was verified with multiple pairs of neural
networks and datasets. The resource utilization of the proposed method and the general
method was observed through hardware synthesis. The proposed method consumes
significantly fewer resources when the dropout technique is implemented on FPGAs. The
contributions of this thesis can be summarized as follows:
• An alternative and effective approach to the dropout technique for hardware
implementation was proposed.
• The effect of the proposed method was observed and verified with multiple pairs of
neural networks and datasets.
• Resource utilization was compared through hardware synthesis using the ISE Design
Suite and Vivado HLS.
6.2 Future Works
Based on the experiments and verification carried out, this work can be extended in
several directions as follows:
• Addition of Random Effects
• Initialization without RNG
• Implementation of Proposed Method into FPGA
6.2.1 Addition of Random Effects
The proposed method uses an extra parameter, the rotate bit, to control the number of
bits by which the mask is rotated for each new mask created during forward propagation.
Although the results were verified with three different neural network structures and
three corresponding datasets in this thesis, it is possible that the randomness of the
dropout technique plays an important role and that its absence causes a neural network
to stop learning. Although the results appear good enough, there are many types of
datasets and neural networks, so the randomness effect may lead to different results.
Thus, adding more randomness, for example a split parameter that splits the dropout
mask into several divisions, may improve the robustness of the proposed method.
6.2.2 Initialization without RNG
The proposed method is implemented with a pre-mask that is initially generated by an
RNG; the mask is then rotated to create further masks during the subsequent training of
the neural network. Although the pre-mask can be generated in software and loaded into
the FPGA for later use, other approaches could be used to initialize the mask for each
training run in order to eliminate the use of an RNG entirely.
6.3 Application of Proposed Method
Fully implementing DNNs on an FPGA is currently impractical, especially the training
process, because of the resource constraints and the high computation cost. However,
running only inference on an FPGA, or accelerating the computation of a neural network
by implementing part of it on an FPGA, is efficient with hardware-oriented approaches
such as BinaryConnect, XNOR-Net, and others [48, 49]. The proposed method also aims at
efficiently implementing neural networks on FPGAs for embedded systems. Here, I suggest
two possible applications in which the proposed method can be applied efficiently.
6.3.1 Transfer Learning
Transfer learning is a neural network approach that allows knowledge to be transferred
from a pre-trained model to an untrained model [50]. It enables the neural model to
learn a new task faster on the basis of the transferred knowledge [50]. The trained
weights are imported into a new neural model, and only the last layer or the last few
layers are retrained with new data, instead of training the whole neural network.
Implementing a neural network on an FPGA using transfer learning eliminates most of the
complex back-propagation computation, since only forward propagation and the part of
back propagation in the retrained layers are needed. This makes training a neural
network on an FPGA easier. The proposed method can be applied to the retrained layers
to further increase the efficiency of the FPGA implementation.
6.3.2 Motion Planning Network
The motion planning network is an application of neural networks to the Human Support
Robot (HSR) [51]. In the motion planning network, the model is trained to learn the
objects and obstacles and to predict a motion path along which the robot can reach the
object without colliding with the obstacles [51]. A combination of two neural networks
is used. The features of the obstacles were extracted by an autoencoder and used to
generate the training data for the robot [51]. A 10-layer MLP was implemented, and
dropout was applied throughout the layers [51]. The structure of the MLP used in the
motion planning network is shown in Figure 6.1. Dropout was applied not only in
training mode but also in inference mode, where it improved the accuracy [51]. The
proposed method can be applied in this application and would enable an FPGA
implementation to further improve the robot's performance.
Figure 6.1: The MLP that applied in motion planning network
Bibliography
[1] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural
networks, 61:85–117, 2015.
[2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature,
521(7553):436–444, 2015.
[3] Bilal Jan, Haleem Farman, Murad Khan, Muhammad Imran, Ihtesham Ul Islam,
Awais Ahmad, Shaukat Ali, and Gwanggil Jeon. Deep learning in big data analytics:
A comparative study. Computers & Electrical Engineering, 2017.
[4] Ehsan Fathi and Babak Maleki Shoja. Deep neural networks for natural language
processing. Computational Analysis and Understanding of Natural Languages:
Principles, Methods and Applications, 38:229, 2018.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In Advances in neural information pro-
cessing systems, pages 1097–1105, 2012.
[6] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going
deeper with convolutions. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 1–9, 2015.
[7] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George
Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,
Marc Lanctot, et al. Mastering the game of Go with deep neural networks and
tree search. Nature, 529(7587):484–489, 2016.
[8] Matt W Gardner and SR Dorling. Artificial neural networks (the multilayer per-
ceptron)—a review of applications in the atmospheric sciences. Atmospheric envi-
ronment, 32(14):2627–2636, 1998.
[9] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based
learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–
2324, 1998.
[10] LR Medsker and LC Jain. Recurrent neural networks. Design and Applications, 5,
2001.
[11] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.
Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[12] Haibing Wu and Xiaodong Gu. Towards dropout training for convolutional neural
networks. Neural Networks, 71:1–10, 2015.
[13] Eduardo Sanchez. Field programmable gate array (fpga) circuits. Towards Evolvable
Hardware, pages 1–18, 1996.
[14] Stephen D Brown, Robert J Francis, Jonathan Rose, and Zvonko G Vranesic. Field-
programmable gate arrays, volume 180. Springer Science & Business Media, 2012.
[15] Shuichi Asano, Tsutomu Maruyama, and Yoshiki Yamaguchi. Performance com-
parison of fpga, gpu and cpu in image processing. In Field programmable logic and
applications, 2009. fpl 2009. international conference on, pages 126–131. IEEE,
2009.
[16] Jeremy Fowers, Greg Brown, Patrick Cooke, and Greg Stitt. A performance and
energy comparison of fpgas, gpus, and multicores for sliding-window applications. In
Proceedings of the ACM/SIGDA international symposium on Field Programmable
Gate Arrays, pages 47–56. ACM, 2012.
[17] Ben Cope, Peter YK Cheung, Wayne Luk, and Lee Howes. Performance comparison
of graphics processors to reconfigurable logic: A case study. IEEE Transactions on
computers, 59(4):433–448, 2010.
[18] Amos R Omondi and Jagath Chandana Rajapakse. FPGA implementations of
neural networks, volume 365. Springer, 2006.
[19] Pingfan Meng, Matthew Jacobsen, and Ryan Kastner. Fpga-gpu-cpu heteroge-
nous architecture for real-time cardiac physiological optical mapping. In Field-
Programmable Technology (FPT), 2012 International Conference on, pages 37–42.
IEEE, 2012.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 770–778, 2016.
[21] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[22] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press,
2016.
[23] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.
[24] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural
networks. In Proceedings of the Fourteenth International Conference on Artificial
Intelligence and Statistics, pages 315–323, 2011.
[25] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev
Khudanpur. Recurrent neural network based language model. In Eleventh Annual
Conference of the International Speech Communication Association, 2010.
[26] Apeksha Shewalkar, Deepika Nyavanandi, and Simone A Ludwig. Performance
evaluation of deep neural networks applied to speech recognition: RNN, LSTM and
GRU. Journal of Artificial Intelligence and Soft Computing Research, 9(4):235–245,
2019.
[27] Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. Perplexity - a
measure of the difficulty of speech recognition tasks. The Journal of the Acoustical
Society of America, 62(S1):S63–S63, 1977.
[28] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R
Salakhutdinov. Improving neural networks by preventing co-adaptation of feature
detectors. arXiv preprint arXiv:1207.0580, 2012.
[29] Sida Wang and Christopher Manning. Fast dropout training. In Proceedings of the
30th International Conference on Machine Learning (ICML-13), pages 118–126,
2013.
[30] Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization
of neural networks using DropConnect. In Proceedings of the 30th International
Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.
[31] Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks.
In Advances in Neural Information Processing Systems, pages 3084–3092, 2013.
[32] Matthew D Zeiler and Rob Fergus. Stochastic pooling for regularization of deep
convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.
[33] Jiang Su, David B Thomas, and Peter YK Cheung. Increasing network size and
training throughput of FPGA restricted Boltzmann machines using dropout. In 2016
IEEE 24th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM), pages 48–51. IEEE, 2016.
[34] Sota Sawaguchi and Hiroaki Nishi. Slightly-slacked dropout for improving neural
network learning on FPGA. ICT Express, 4(2):75–80, 2018.
[35] Vishakha V Bonde and AD Kale. Design and implementation of a random number
generator on FPGA. International Journal of Science and Research, 4(5):203–208,
2015.
[36] JV Bradley. Chapter 12, the runs test. Distribution-free statistical concepts. Prentice
Hall Inc., Englewood Cliffs, 1968.
[37] Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-
generation open source framework for deep learning. In Proceedings of workshop on
machine learning systems (LearningSys) in the twenty-ninth annual conference on
neural information processing systems (NIPS), volume 5, 2015.
[38] Xilinx. ISE Design Suite, 2008.
[39] Tom Feist. Vivado design suite. White Paper, 5:30, 2012.
[40] Léon Bottou. Stochastic learning. In Advanced lectures on machine learning, pages
146–168. Springer, 2004.
[41] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980, 2014.
[42] Ann Taylor, Mitchell Marcus, and Beatrice Santorini. The Penn Treebank: an
overview. In Treebanks, pages 5–22. Springer, 2003.
[43] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from
tiny images. 2009.
[44] RoboCup@Home. http://www.robocupathome.org/.
[45] @home dataset. https://github.com/hibikino-musashi-athome/rcj2016_object_image_dataset/.
[46] Xilinx. Vivado Design Suite user guide, 2014.
[47] Mohammed Bakiri, Christophe Guyeux, Jean-François Couchot, and
Abdelkrim Kamel Oudjida. Survey on hardware implementation of random number
generators on FPGA: Theory and experimental analyses. Computer Science Review,
27:135–153, 2018.
[48] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect:
Training deep neural networks with binary weights during propagations. In
Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
[49] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net:
ImageNet classification using binary convolutional neural networks. In
European Conference on Computer Vision, pages 525–542. Springer, 2016.
[50] Lisa Torrey and Jude Shavlik. Transfer learning. In Handbook of research on
machine learning applications and trends: algorithms, methods, and techniques,
pages 242–264. IGI Global, 2010.
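[51] Takashi Yamamoto and Keisuke Takeshita. Proposal of fast whole-body trajectory
planning using deep learning (in Japanese). In The 38th Annual Conference of the
Robotics Society of Japan, 2020.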
Publication List
• Y.J. Yeoh, T. Morie and H. Tamukoh, “An Efficient Hardware-Oriented Dropout
Algorithm,” Neurocomputing, Volume 427, pages 191-200, 28 Feb 2021.
• Y.J. Yeoh and H. Tamukoh, “Alternative Dropout for Hardware Implementation in
Recurrent Neural Networks,” 2018 International Workshop on Smart Info-Media
Systems in Asia (SISA2018), RS-13, 2018. (SISA Best Student Paper Award)
• Y.J. Yeoh, T. Morie and H. Tamukoh, “A Hardware-Oriented Dropout Algorithm
for Efficient FPGA Implementation,” In International Conference on Neural
Information Processing, pages 821-829. Springer, Cham, 2017.
• Y. Aratani, Y.J. Yeoh, A. Suzuki, D. Shuto, T. Morie and H. Tamukoh,
“Multi-Valued Quantization of Convolutional Neural Networks for Efficient FPGA
Implementation,” 5th International Symposium on Applied Engineering and Sciences
(SAES2017), 2017.
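• Y.J. Yeoh, T. Morie and H. Tamukoh, “A Simplified Dropout Algorithm Requiring No
Random Number Generator” (in Japanese), 27th Annual Conference of the Japanese
Neural Network Society (JNNS2017), P-80, p. 100, 2017.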
• Y. Aratani, Y.J. Yeoh, A. Suzuki, D. Shuto, T. Morie, and H. Tamukoh,
“Multi-Valued Quantization Neural Networks toward Hardware Implementation,” Proc.
of the 2017 International Conference on Artificial Life and Robotics (ICAROB2017),
p. 58, 2017.
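• Y. Aratani, Y.J. Yeoh, A. Suzuki, D. Shuto, T. Morie and H. Tamukoh,
“Multi-Valued Quantization of Connection Weights in Convolutional Neural Networks”
(in Japanese), SOFT Kyushu Chapter Conference, 2016.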
• K. Suzuki, A. Koya, Y.J. Yeoh, T. Morie, and H. Tamukoh, “Image Recogni-
tion System For Home Service Robots Using Binarized Convolutional Neural Net-
works,” Proc. of the 15th Kyutech-Postech Joint Workshop on Neuroinformatics,
pp. 40-41, 2016.
