Optimization of FPGA Based Neural Network Processor by Sun, Ivan Teh Fu
Optimization of FPGA Based Neural Network Processor
by
Ivan Teh Fu Sun
Dissertation submitted in partial fulfillment of
the requirements for the
Bachelor of Engineering (Hons)










CERTIFICATION OF APPROVAL
Optimization of FPGA Based Neural Network Processor
by
Ivan Teh Fu Sun
A project dissertation submitted to the
Electrical & Electronics Engineering Programme
Universiti Teknologi PETRONAS
in partial fulfillment of the requirement for the
BACHELOR OF ENGINEERING (Hons)
(ELECTRICAL & ELECTRONICS ENGINEERING)
Approved by,





This is to certify that I am responsible for the work submitted in this project, that the
original work is my own except as specified in the references and
acknowledgements, and that the original work contained herein has not been
undertaken or done by unspecified sources or persons.
ABSTRACT
Neural information processing is an emerging field that provides an alternative
form of computation for demanding tasks, such as pattern recognition problems,
which are usually reserved for human attention. Neural network computation is
sought after where the classification of input data is difficult to work out using
equations or sets of rules.
Technological advances in integrated circuits such as Field Programmable Gate
Array (FPGA) systems have made it easier to develop and implement hardware
devices based on these neural network architectures. The motivation in hardware
implementation of neural networks is its fast processing speed and suitability in
parallel and pipelined processing.
The project revolves around the design of an optimized neural network processor.
The processor design is based on the feedforward network architecture type with
BackPropagation trained weights for the Exclusive-OR non-linear problem.
Among the highlights of the project is the improvement in neural network
architecture through reconfigurable and recursive computation of a single hidden
layer for multiple layer applications. Improvements in processor organization were
also made, which enable the design to process in parallel with similar processors.
Other improvements include design considerations to reduce the amount of logic
required for implementation without much sacrifice of processing speed.
ACKNOWLEDGEMENT
I would like to thank Mr. Noohul Basheer bin Zain Ali for his guidance and support,
without which this project would not have been completed. I would also like to thank Dr.
Varun Jeoti for his advice and all the staff in the Electrical and Electronics
Engineering faculty for their time and technical expertise.
Special thanks to Professor Mark J. Embrechts from Rensselaer Polytechnic Institute,
Troy, New York for the use of his MetaNeural™ software which has been an
integral part of the project.
Last but not least, to my family and friends who had been a tremendous source of








1.1 Background of Study
1.2 Problem Statement
1.3 Objectives and Scope of Study
CHAPTER 2: THEORY
2.1 Artificial Neural Network
2.2 Classification of Artificial Neural Network Models
2.3 Field Programmable Gate Array (FPGA)
2.4 Optimizing the Design












CHAPTER 3: METHODOLOGY
3.1 Project Methodology
3.2 Equipment and Software Applications Used
3.3 Hardware Design Flow
3.4 Functional Simulation Using Testbenches
CHAPTER 4: RESULTS AND DISCUSSION
4.1 Architecture Optimization
4.2 Number Convention
4.3 RFNNA Simulator





CONCLUSION AND FUTURE IMPROVEMENT
5.1 Conclusion













Figure 2.1 A single neuron model
Figure 2.2 A 2-3-3-1 Neural Network Architecture
Figure 2.3 Computations within a Neuron
Figure 2.4 Module Boundary Selections and Register Assignment
Figure 3.1 ANN System Process Model for FPGA Implementation
Figure 3.2 Screenshot of MetaNeural™ Main Interface
Figure 3.3 Screenshot of MetaNeural™ Network Setup Interface
Figure 3.4 Design Flow for Verilog based Register Transfer Logic
Figure 3.5 Screenshot of Simulated Input Signals and Output Registers
Figure 4.1 Reconfigurable Feedforward Neural Network Architecture and
Execution Phase Diagram for XOR Problem
Figure 4.2 32-Bit Floating Point Format
Figure 4.3 Graphical Plot of a Sigmoid Function
Figure 4.4 Description of Neural Network Computation
Figure 4.5 Screenshot of RFNNA Simulator v1.3
Figure 4.6 Layout of 3 Neuron RFNNA Processor HDL Modules
Figure 4.7 Simulation Result for Input Module
Figure 4.8 Mealy Machine for Input Module Counter
Figure 4.9 Simulation Result for Bias and Weight ROM Module
Figure 4.10 Mealy Machine for Bias and Weight Counter
Figure 4.11 Flowchart for Unsigned Binary Multiplication
Figure 4.12 Simulation Result for Neuron Module
Figure 4.13 Simulation Result for Neuron Output Multiplexer Module
Figure 4.14 Simulation Result for Neuron Representation Converter Module
Figure 4.15 Mathematical Analysis on Sigmoid LUT Values
Figure 4.16 Simulation Result for Activation Function LUT Module
Figure 4.17 Simulation Result for Output Threshold Module
Figure 4.18 Simulation Result for RFNNA Processor





1.1 Background of Study
Neural networks are still considered a relatively new area of research, with
development picking up speed only in the 1990s even though the first mathematical model
of the biological neuron was presented by McCulloch and Pitts in the 1940s. Neural
networks are not confined solely to attempts at replicating the human brain; they have
wide-reaching applicability, and their concepts are incorporated into
applications such as optical character recognition, machine health monitoring and
stock market forecasting (Callan, 1999, p.2).
Neural networks can be implemented using software simulations and as fast
hardware devices. The former platform has been more often used for the
development of neural networks because it is cheaper and is more flexible for
research purposes. However, with the rapid advances in Field Programmable Gate
Array (FPGA) technology, the option to implement hardware based neural networks
now seems more appealing due to its customizability, relatively faster processing
speed and more importantly the infrastructure to tap into the neural network's
intrinsic property of parallel processing.
1.2 Problem Statement
The focus of most engineering and scientific groups on neural networks is to produce
working models of neural networks into hardware designs. These implementations
are usually made up of neural processing units known as neurons which take up
considerable amounts of logic gates to implement. Thus the complexity of a basic
neuron design affects very much the capacity of neurons which can be fitted into a
fixed amount of silicon real estate. On the other hand, not much work has been put
into increasing the performance and processing speed of these neural network
processors. The successful addressing of these areas would pave the way for cheaper
and faster neural network processors.
1.3 Objectives and Scope of Study
The project undertaken will be focused on the optimization of FPGA based neural
network chips in terms of logic gate numbers, processing speed, and performance.
The project will be basically a study of neural networks and implementation of logic
design onto FPGA.
The targeted result of this project would be an FPGA implementation of a neural
network architecture which is suited for the supervised learning paradigm. The
connectivity of the feedforward neural network architecture would be of a multiple
layered type.
The end product would be an Artificial Neural Network FPGA implementation with
at least 2 inputs and 1 output. The architecture should be able to implement 2 or
more neuron hidden layers. Components inherent to a neuron such as multipliers,
adders and activation functions will be part of the overall design. The ultimate goal
of the project is to produce an optimized and functioning neural network processor
which is able to solve linear and nonlinear problems. The benchmark of the project
would be utilising the end product to solve the nonlinear Exclusive-OR (XOR)
problem.
Due to time constraints and design considerations, the supervised training phase of
the neural network would be performed separately using the MetaNeural™ software.
The resultant weights and bias values for each connection between neurons would
then be directly programmed into the FPGA design.
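The idea of fixing offline-trained weights and biases into a feedforward design can be sketched in software as follows. This is a minimal illustration only: the 2-2-1 topology, the weight and bias values, the function names and the 0.5 output threshold are all assumptions chosen by hand so that the network reproduces XOR, and they are not the values produced by MetaNeural™.

```python
import math

def sigmoid(x):
    """Logistic activation, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical constants standing in for offline-trained values.
# Each row holds (weight for input 1, weight for input 2, bias).
HIDDEN = [(20.0, 20.0, -10.0),    # behaves like OR
          (-20.0, -20.0, 30.0)]   # behaves like NAND
OUTPUT = (20.0, 20.0, -30.0)      # behaves like AND of the two hidden outputs

def forward(x1, x2):
    """Evaluate the fixed 2-2-1 network and threshold the output at 0.5."""
    h = [sigmoid(w1 * x1 + w2 * x2 + b) for w1, w2, b in HIDDEN]
    y = sigmoid(OUTPUT[0] * h[0] + OUTPUT[1] * h[1] + OUTPUT[2])
    return 1 if y >= 0.5 else 0

# Print the truth table produced by the fixed network.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, forward(a, b))
```

Because the constants never change after training, a hardware version can hold them in read-only storage, which is the motivation for training separately.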
CHAPTER 2
LITERATURE REVIEW AND THEORY
2.1 Artificial Neural Network
Common to all neural network architectures are the simple processing units known as
"neurons", as described in Figure 2.1. Neural networks are made of these
simple neurons, with different architectures dictating how a collection of neurons is
interconnected as well as how calculations are made to adjust the weighted inputs










A typical neural network architecture is made up of an input layer, one or more hidden
layers, and an output layer. The input layer merely acts as a buffer to external inputs,
whereas the output layer functions as a last stage hidden layer with the output
buffered and passed on to external outputs. The more interesting section of the neural
network is the hidden layer(s) in which the neurons reside. The number of
input and output nodes for their respective layers depends on the application for
which the neural network is designed. For example, an application which
recognizes letters of the alphabet using a 10 x 10 pixelizer (total 100 pixels) would require a
neural network which has 100 inputs and 26 outputs. However, there is no
convention by which the number of hidden layers to be used for a particular
application can be directly derived. To add to the difficulty, the number of
neurons per hidden layer is also arbitrary. Thus the best number of hidden
layers and of neurons for each layer is largely decided by trial and error.
Figure 2.2 below is a typical neural network architecture used for applications which
require only 2 inputs and one output; the number of hidden layers and neurons are
chosen arbitrarily. Simple problems such as 2 input AND, OR or XOR can be solved
using this network configuration.
[Figure labels: Input Layer, Hidden Layers #1 and #2, Output Layer]
Figure 2.2: A 2-3-3-1 Neural Network Architecture
The connections between two neurons or nodes from adjacent layers have weights,
which are multiplied by the output of the neuron/node that feeds the neuron of
the subsequent layer. The coloured nodes represent storage for bias values which are
predetermined like the weights of each connection. Each neuron for each hidden
layer has its own unique bias value.
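To give a feel for how much storage a network such as the 2-3-3-1 example needs, the sketch below counts the connection weights and bias values of a fully connected feedforward network. The function name is hypothetical, and the count assumes that every neuron outside the input layer, including the output neuron, carries one bias.

```python
def count_parameters(layers):
    """Return (weights, biases) for a fully connected feedforward network.

    `layers` lists the node count per layer, input layer first.
    """
    # One weight per connection between adjacent layers.
    weights = sum(a * b for a, b in zip(layers, layers[1:]))
    # Assumption: one bias per neuron; input nodes are buffers and have none.
    biases = sum(layers[1:])
    return weights, biases

print(count_parameters([2, 3, 3, 1]))  # (18, 7)
```

For the 2-3-3-1 case this gives 2x3 + 3x3 + 3x1 = 18 weights and 3 + 3 + 1 = 7 biases.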
The combination of these weights and bias values allows a trained neural network
which is subjected to a set of inputs to provide a correct categorization of them as an
output. The processing of these weights in a neuron is divided into 2 stages:
Summation of weighted inputs and the mapping of the summation output to an
activation function. Firstly, all the weighted inputs from the preceding layer are
summed up together with a predetermined bias value. The result from this phase is
then normalised using an activation function such as the identity function, binary
threshold function or the sigmoid function. The normalised result is then fed into the
subsequent hidden layer where the cycle continues until the output layer is reached.
Figure 2.3 below illustrates the summation and activation function computations of a
neuron.
Figure 2.3: Computations within a Neuron
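The two stages described above, summation followed by activation, can be written out as a short sketch. The function name and the choice of a zero threshold for the binary threshold function are assumptions made for illustration.

```python
import math

def neuron(inputs, weights, bias, activation="sigmoid"):
    """Sum the weighted inputs with the bias, then apply the activation."""
    # Stage 1: summation of weighted inputs plus the predetermined bias.
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Stage 2: map the summation result through the activation function.
    if activation == "identity":
        return s
    if activation == "threshold":
        return 1.0 if s >= 0.0 else 0.0   # assumed threshold at zero
    if activation == "sigmoid":
        return 1.0 / (1.0 + math.exp(-s))
    raise ValueError(f"unknown activation: {activation}")

# Weighted sum is 0.5*1 + (-0.5)*0 - 0.5 = 0, so the sigmoid output is 0.5.
print(neuron([1.0, 0.0], [0.5, -0.5], -0.5))
```

The output of one such call becomes an input to each neuron of the subsequent layer.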
An important attribute of a neural network system based on weights is its capability
to learn and generalize variations in a set of inputs (Picton, 1994, p.4). For
example, in character recognition applications, the same character 'A' can be written
in a multitude of ways; however, these variations would produce the same output
when presented to a successfully trained and tested neural network.
2.2 Classification of Artificial Neural Network Models
Neural networks are organized and classified according to 4 attributes. These are:
learning paradigm, network architecture, network connectivity and learning
algorithm.
2.2.4 Learning Paradigm
The learning paradigm concerns the type of learning or training that a neural network is
subjected to: supervised, unsupervised, or a hybrid of both methods. The
learning paradigm determines which type of learning algorithm can be used
when training the neural network.
2.2.5 Network Architecture
Generally there are two types of network architecture, feedforward and recurrent.
These architecture types are descriptions of how the interconnections between
neurons in a neural network are made. The connection type for the neural network in
Figure 2.2 is of the feedforward type.
2.2.6 Network Connectivity
Not to be confused with network architecture, network connectivity describes how
neurons are positioned within a neural network. For example, single layer
connectivity describes a neural network with only one hidden layer. Other
connectivity types are Self Organizing Map, Multilayer and Hopfield. Connections
between hidden layers and neurons are decided by the network architecture.
2.2.7 Learning Algorithm
The learning algorithm deals with training the particular neural network chosen.
There are many types of learning algorithm such as BackPropagation, Associative
Memory, Madaline and more. The application of a particular learning algorithm
depends on its suitability with the three parameters mentioned earlier of a particular
neural network. For example, BackPropagation learning algorithm can only be
applied to a neural network which is based on supervised learning paradigm and has
feedforward architecture.
The taxonomy of Artificial Neural Networks is shown in APPENDIX A (reproduced from
[4], pg 6). A table showing the similarities and differences between the Von Neumann
computer model and neural networks is available in APPENDIX B.
2.3 Field Programmable Gate Array (FPGA)
FPGA technology is given preference over other options such as VLSI (Very Large
Scale Integrated) circuits, ASICs (Application Specific Integrated Circuits) and
MPGAs (Mask Programmed Gate Arrays) for the implementation of neural
networks due to its flexibility to be reconfigured while providing all the advantages
inherent to hardware devices such as low sensitivity to electric noise and
temperature, memory for weight storage, processing speed and parallel processing.
Besides technical advantages, implementations of experimental projects using
FPGAs have lower non-recurring engineering costs as well as faster development
and implementation processes. With its reprogrammability, any design
defects on an FPGA can be easily corrected and tested, thus shortening the
time-to-market. Studies on computer architectures provide well documented optimization
techniques which can be incorporated into the neural network hardware design.
FPGA designs are specified using a Hardware Description Language (HDL).
Among the more popular are VHDL and Verilog. Both HDLs are similarly
powerful, and the choice between them is a matter of preference. There are 3 different levels
at which a circuit design can be specified: behavioural, dataflow and structural.
The behavioural style of coding is at the highest level in terms of similarity with natural
language, whereas the most specific is the structural style. Coding structurally
makes the design easier to synthesize and also provides more control over the physical
assignment of the circuit design. A design can be specified using one or more
coding styles.
Hardware design using FPGAs is usually performed using both combinational and
sequential logic. These designs are specified using any of the three hardware
description methods: structural, dataflow and behavioural. For the design of
complex circuits, the dataflow and behavioural styles are preferred because they free the
designer from having to fully specify the connections and logic gates to be used.
The dataflow method is used more often to design
combinational circuits. The keyword "assign" in Verilog, used in the dataflow
method, denotes a combinational logic assignment. On the other hand, the behavioural
design method uses the "always" block for the design of sequential logic.
Conditions such as "posedge" and "negedge" specified on certain signals
dictate when the procedural block is executed or triggered.
The Verilog hardware description language (HDL) is a very powerful simulation
language. However, only about 10% of its instructions are synthesizable [3].
Synthesis refers to the successful conversion of code into an actual hardware
logic implementation. Thus the hardware designer has to be aware of which
instructions are synthesis-friendly, as well as of design rules whose violation
prevents the design from being implemented in hardware. Instructions that
are not synthesizable are those which deal with timing parameters such as "time,
wait, initial, delay" and others such as "fork, join, defparam, UDP".
There are also other restrictions such as those concerning signals which are either
wires or registers. For example, it is illegal to connect two registers together inside a
module, which would be a common mistake for new designers who are usually
comfortable with software programming style of liberally using variables.
2.4 Optimizing the Design
To produce an optimized design, there are two things to consider: first, the
algorithm of the design architecture, and second, the optimization of the
implementation itself. This section is dedicated to the second kind of
optimization. Optimization of logic implementation requires knowledge of
how FPGAs work. Xilinx FPGAs are made of arrays of Configurable Logic Blocks,
each of which normally houses two 4-input LUTs feeding a pair of flip flops [2]. To produce
an optimized implementation, the designer has to constantly think of how the code
will be implemented in hardware given the amount and type of available resources.
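A 4-input LUT can be pictured as a 16-entry truth table addressed by its four inputs, so any Boolean function of four signals occupies one LUT regardless of how complicated its expression looks. The sketch below models this in software; the function names are hypothetical and the bit ordering (input `a` as the least significant address bit) is an assumption.

```python
def make_lut4(func):
    """Precompute a 16-entry truth table for a 4-input Boolean function."""
    table = []
    for address in range(16):
        bits = [(address >> i) & 1 for i in range(4)]  # bits[0] is input a
        table.append(func(*bits) & 1)
    return table

def lut4_read(table, a, b, c, d):
    """Look up the output for inputs a..d (a is the least significant bit)."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]

# Two different functions, each stored in one 16-entry table.
xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
and4 = make_lut4(lambda a, b, c, d: a & b & c & d)

print(lut4_read(xor4, 1, 0, 0, 0), lut4_read(and4, 1, 1, 1, 1))  # 1 1
```

Thinking in these terms helps when estimating resource usage: a function of five or more signals needs more than one such table.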
Figure 2.4: Suggested Module Boundary Selection and Register Assignment
Hardware designs are usually segregated into several modules which are then linked
to each other and controlled by a control unit implemented as a state machine. Each
module is best designed to have its module boundaries as specified in Figure 2.4,
which basically means that all outputs are to be buffered using registers. With this
standardization, synchronizing flip flops on inputs are not necessary.
The synthesizer optimizes combinational logic within modules. Therefore it is
advisable to group the combinational logic of related input and output signals
together in the same module, so that redundancy, which is harder to detect
when the logic is spread over separate modules, can be removed. In hardware design,
sequential logic or state machines are preferably synchronous. This is to reduce
propagation delay as well as chances of timing problems occurring.
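The effect of registering every module output can be illustrated with a small software model. This is only a behavioural sketch under stated assumptions: the class and function names are hypothetical, each call to `tick` stands for one rising clock edge, and every module samples the outputs that its neighbours held before that edge.

```python
class RegisteredModule:
    """Combinational logic followed by an output register."""

    def __init__(self, logic, reset_value=0):
        self.logic = logic      # pure function of the inputs
        self.q = reset_value    # registered output seen by other modules

    def clock(self, *inputs):
        """Model one rising clock edge: capture the logic result."""
        self.q = self.logic(*inputs)
        return self.q

def tick(modules_and_inputs):
    """Advance all modules by one clock edge using pre-edge outputs."""
    sampled = [(m, get_inputs()) for m, get_inputs in modules_and_inputs]
    for m, inputs in sampled:
        m.clock(*inputs)

stage1 = RegisteredModule(lambda a, b: a * b)   # multiplier stage
stage2 = RegisteredModule(lambda p: p + 1)      # consumes stage1's register
trace = []
for a, b in [(3, 4), (5, 6)]:
    tick([(stage1, lambda a=a, b=b: (a, b)),
          (stage2, lambda: (stage1.q,))])
    trace.append((stage1.q, stage2.q))
print(trace)  # [(12, 1), (30, 13)]
```

The second stage always sees the value the first stage registered on the previous edge, which is why no extra synchronizing flip flops are needed on its inputs.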
2.5 Completing the design
The next steps of the design process after the completion of the RTL (Register
Transfer Logic) code are synthesis and the "place and route" process.
The synthesis process generates a netlist, which is also a Verilog file,
that represents the higher level Verilog code presented to it in a lower level
structural/gate level format. The information provided to the synthesis tool includes
the FPGA target chip model, which is used to invoke the appropriate library file
containing all physical and timing parameter information and the implementation of
Verilog constructs into gates.
From synthesis, the designer is able to know how fast the design can run,
which depends on many factors such as the longest combinatorial
propagation delay. For FPGAs, problems usually encountered in ASIC design, such as
clock skew and signal strength, are not a concern, as these are already
accounted for by the FPGA library file, which appropriately assigns the clock and signal
buffers available in the FPGA chip itself. Thus the synthesis process is
rather straightforward for FPGA based designs, with the design constraint usually
dependent on the achievable processing speed.
After successful synthesis, the netlist file is passed to the place and route
portion of the application to simulate the physical implementation of the design. This
process ensures that the design does not violate the boundary parameters of the target
FPGA chip. It is also possible that manual routing may be required to produce an
optimized design. The output of this process is a gate.v file, which is also a Verilog
file that can be fed back into the RTL simulator for simulation. The process of
RTL coding, synthesis, and place and route is iterative and is repeated until the
specifications of the design are met.
CHAPTER 3
METHODOLOGY AND PROJECT WORK
3.1 Project Methodology
The project involves the completion of the following stages: literature research,
specification of neural network architecture design, Field Programmable Gate Array
(FPGA) training and experimentation, FPGA implementation and training of neural
network, and optimization of FPGA design. The project design approach would be















Figure 3.1: ANN System Process Model for FPGA Implementation
3.1.1 Literature Research
Early stages of the literature review were performed to enhance and widen the
author's knowledge of the different types of neural network architecture. Later on,
research on FPGA technology and a study of Hardware Description Languages
(HDL) followed.
3.1.2 Specification of Neural Network Design
The next stage of the project was to specify the characteristics of the neural network
architecture to be designed and implemented on the FPGA. The specification of the
neural network architecture is important as it serves as a guide and reference
for later design stages. Examples of the specifications to be included are the number
of inputs and outputs, algorithms and architecture for logic blocks, functional
specification of the control logic as well as the expected results of the design.
3.1.3 FPGA Training and Experimentation
FPGA training and experimentation would be conducted in parallel with the
preceding stage to get familiarized with the devices and related HDL programming
software. Part of the experiments or exercises on FPGA would be to design modules
such as multipliers and fast adders and also to learn to write testbenches that will be
used to validate the functionality of the design. Aldec's Active HDL software would
be used to develop and simulate the functionality of the HDL codes.
3.1.4 FPGA Implementation and Testing of Neural Network
RTL implementation of the neural network, followed by optimization, was performed
after sufficient skill was gained. Specified modules would be developed separately and
tested. These modules would then be integrated into a complete design. Testing at
the Register Transfer Level (RTL) would be performed to ensure that the design
performs reliably. The supervised learning portion of the design, using the back propagation
algorithm, will be performed offline using a shareware program known
as MetaNeural™.
3.2 Equipment and Software Applications Used
3.2.1 Xilinx Virtex-II XC2V1000 Reference Board
Based on research on previous implementations of neural networks on FPGAs, the
designs documented usually take up a sizeable portion of the target FPGA chip used.
Some of the designs are even implemented using multiple FPGA chips. It therefore
seems appropriate that the largest FPGA, in terms of gate count, available in our
labs be selected for the purpose of this project. Due to the experimental nature of the
project, a larger FPGA would easily allow for variations of neural networks to
be designed and implemented. Restrictions on the design would then be minimal.
More features of the Virtex FPGA chip are discussed below.
"The highest performance designs are tailored for the target FPGA" as mentioned by
Coffman (pg 221 [3]). FPGA vendors such as Altera and Xilinx build their devices
differently from each other, and the products within each vendor's line may also vary
among themselves, differing in pin counts, routing density, logic block construct,
availability of RAM blocks and more. Therefore it is important for an
HDL FPGA designer to find out about the limitations and advantages of the devices
they are dealing with.
The Virtex-II FPGA series is the latest device family offered by Xilinx. It is built using
0.15 micron lithography and 8-layer metal technology. Unlike conventional
FPGA devices, the Virtex series is of a hardwired version. A hardwired FPGA
version improves on the performance of conventional FPGAs as well as contributing
towards a smaller silicon die footprint.
The FPGA chip that is used for the project is the Xilinx Virtex II XC2V1000-4FG256C.
This chip has 256 pins of which 172 are usable as input/output pins.
Like all Xilinx devices, the Virtex II chips are made up of an array of Configurable
Logic Blocks (CLBs). Each of these CLBs contains function generators, carry logic,
arithmetic logic gates, wide function multiplexers and storage elements which will
be appropriately interconnected according to the design. Unique to the Virtex board are
built-in resources for Look-Up Tables (LUTs) of up to 8 inputs, which are useful for
implementing mathematical functions such as a sigmoid function. The chip used also
has 40 blocks of 18 kbit Select RAM resources which can be used to store
connection weights for the neural network. Another feature built into the device is 18
by 18 bit multipliers. Usage of these built-in multipliers would considerably speed up
the design as compared to manually customized ones.
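As a rough illustration of how a sigmoid function could be held in a look-up table with 8 address inputs, the sketch below builds a 256-entry table. The input range of -8 to 8, the 8-bit unsigned output code and the function names are assumptions made for this example and do not describe the number format used in the actual design.

```python
import math

ADDRESS_BITS = 8              # matches the 8-input LUT mentioned above
X_MIN, X_MAX = -8.0, 8.0      # assumed input range covered by the table
OUT_MAX = 255                 # assumed 8-bit unsigned output code

def build_sigmoid_lut():
    """Return 2**ADDRESS_BITS quantized samples of the sigmoid function."""
    n = 1 << ADDRESS_BITS
    step = (X_MAX - X_MIN) / n
    lut = []
    for address in range(n):
        x = X_MIN + address * step
        y = 1.0 / (1.0 + math.exp(-x))
        lut.append(round(y * OUT_MAX))
    return lut

def lookup(lut, x):
    """Clamp x into range, convert it to an address, and read the table."""
    n = len(lut)
    address = int((x - X_MIN) / (X_MAX - X_MIN) * n)
    address = max(0, min(n - 1, address))   # saturate outside the range
    return lut[address] / OUT_MAX

lut = build_sigmoid_lut()
print(len(lut), lookup(lut, -100.0), lookup(lut, 100.0))
```

A table of this kind trades accuracy for speed: the activation costs a single memory read, and the quantization error is bounded by the table resolution.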
3.2.2 MetaNeural™
MetaNeural is a shareware program originally developed by Professor Mark J.
Embrechts of Rensselaer Polytechnic Institute, New York (USA) in 1988 as a
demonstration package for a lecture course. The application has since evolved and
been given a user interface as seen in Figure 3.2. The application is used to train
feedforward neural network architectures using the BackPropagation learning
algorithm. The program is able to work with network architectures which have up to
3 hidden layers while allowing an arbitrary number of neurons per hidden layer.
The list below summarizes the core features and user customization allowed by the
MetaNeural™ application. Items listed are illustrated in Figure 3.3.
• Specification of neural network architecture up to 3 hidden layers.
• Specification of number of training epochs.
• Error threshold to stop training
• Training rate which affects the amount of weight adjustment for each training
cycle.
• Selection of activation function type
• Easy text format for training pattern and test pattern input files.
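The role of the settings listed above (number of epochs, error threshold, training rate and activation function) can be seen in a generic backpropagation loop. The sketch below is not MetaNeural™ code and does not reproduce its interface; it is a minimal batch gradient descent example for a 2-input network with one hidden layer, and all names, default values and the fixed random seed are assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

PATTERNS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR

def train(hidden=2, max_epochs=5000, error_threshold=0.01,
          learning_rate=0.25, seed=1):
    """Train a 2-hidden-1 sigmoid network; return (weights, error history)."""
    rng = random.Random(seed)
    # w_h[j] = [weight x1, weight x2, bias]; w_o = [hidden weights..., bias]
    w_h = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    history = []
    for _ in range(max_epochs):
        g_h = [[0.0] * 3 for _ in range(hidden)]
        g_o = [0.0] * (hidden + 1)
        error = 0.0
        for (x1, x2), target in PATTERNS:
            h = [sigmoid(w[0] * x1 + w[1] * x2 + w[2]) for w in w_h]
            y = sigmoid(sum(w * v for w, v in zip(w_o, h)) + w_o[-1])
            error += 0.5 * (y - target) ** 2
            d_o = (y - target) * y * (1 - y)              # output delta
            for j in range(hidden):
                d_h = d_o * w_o[j] * h[j] * (1 - h[j])    # hidden delta
                g_h[j][0] += d_h * x1
                g_h[j][1] += d_h * x2
                g_h[j][2] += d_h
                g_o[j] += d_o * h[j]
            g_o[-1] += d_o
        history.append(error)
        if error < error_threshold:       # stop once the error is low enough
            break
        for j in range(hidden):
            for k in range(3):
                w_h[j][k] -= learning_rate * g_h[j][k]
        for k in range(hidden + 1):
            w_o[k] -= learning_rate * g_o[k]
    return (w_h, w_o), history
```

Training stops at whichever comes first, the error threshold or the epoch limit; whether the threshold is reached within the limit depends on the initial weights and the training rate.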
The MetaNeural™ software application will be used in this project to supply the
values for connection weights and neuron biases to be designed into the FPGA
implementation of the neural network. The neural network for an intended
application can be trained given a set of training patterns, which for this case would be

















Figure 3.2: Screenshot ofMetaNeural™ Main Interface
[Screenshot fields: number of input nodes, number of output nodes, number of hidden
layers, number of nodes in hidden layers 1 to 3, epoch length, number of iterations,
learning rate, momentum, activation function]
Figure 3.3: Screenshot of MetaNeural™ Network Setup Interface
3.3 Hardware Design Flow
The hardware implementation of the RFNNA processor is divided into several tasks.
The approach taken as seen in Figure 3.4 is to first define the modules that will be










Figure 3.4: Design Flow for Verilog based Register Transfer Logic
The second step would be to design a control unit which is basically a state machine.
The state machine would provide control signals to all modules so that processing is
performed whenever it is intended. The block diagram editor provided with the
Aldec Active HDL compiler allows the designer to have an overview of the entire
design. This is also the place where the inputs and outputs of modules and state
machines are connected to each other.
The process of functional simulation and verification takes place in intermediate
module design stages and also for structural testing when the design is complete. The
verification process is performed by applying simulators either by using readily
available tools or by using a testbench. A testbench provides the designer with more
control regarding the stimulus which is subjected to the module unit under test.
Stimuli are essentially generated input signals inclusive of the clock which triggers
the module. The output of the module and intermediate net and register status are
checked for any discrepancies.
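The testbench idea, applying stimulus cycle by cycle and checking the outputs for discrepancies, can be sketched in software as follows. The multiply-accumulate module, its register name and the helper functions are hypothetical stand-ins chosen for illustration; the project's actual testbenches are written in Verilog.

```python
def mac_model(samples):
    """Reference model: running sum of products, one result per clock."""
    total, outputs = 0, []
    for x, w in samples:
        total += x * w
        outputs.append(total)
    return outputs

class MacUnderTest:
    """Stand-in for the module under test (hypothetical)."""

    def __init__(self):
        self.reg_sum = 0

    def clock(self, x, w):
        """One clock edge: accumulate the product into the register."""
        self.reg_sum += x * w
        return self.reg_sum

def run_testbench(dut, stimulus, expected):
    """Clock the stimulus through `dut` and collect any mismatches."""
    mismatches = []
    for cycle, ((x, w), want) in enumerate(zip(stimulus, expected)):
        got = dut.clock(x, w)
        if got != want:
            mismatches.append((cycle, got, want))
    return mismatches

stimulus = [(1, 2), (3, 4), (5, 6)]
print(run_testbench(MacUnderTest(), stimulus, mac_model(stimulus)))  # []
```

An empty mismatch list means the module agreed with the reference model on every cycle; a non-empty list pinpoints the first cycle to inspect.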
3.4 Functional Simulation Using Testbenches
Testbenches for this project are written using Verilog; however they are not
subjected to instruction and optimization constraints as imposed for synthesizable
modules. Timing parameters such as delays are applicable.
Modules are usually tested by asserting inputs and clocks going into the module. The
assertions of inputs are chosen to reflect actual operating as well as timing
conditions.
[Waveform signals: clk, out_summation, reg_multiplicationBuffer, reg_summationBuffer]
Figure 3.5: Screenshot of Simulated Input Signals and Output Registers
The simulation signals are as shown in Figure 3.5, all status of inputs, outputs and




4 RESULTS AND DISCUSSION
Development of the optimized neural network processor is divided into two equally
important parts. The initial part was to comprehensively describe an optimized
architecture as well as specifications of the neural network that is to be implemented
into hardware, whereas the second part was to design the neural network processor
modules and optimize the organization of the neural network processor which is
based on the architecture specified in the first part.
Below is a description of the subtopics that will be discussed in this chapter:
Research and analysis performed on existing neural network implementation into
Field Programmable Gate Array (FPGA) and arithmetic computations in binary
circuits led to the discovery of an innovative hardware implementation architecture.
This implementation is termed and referred to from here onwards as the
Reconfigurable Feedforward Neural Network Architecture (RFNNA). More
discussion on this architecture is available under the heading Architecture
Optimization (4.1). A unique numbering convention to optimize the information
processing of a neural network has also been worked out with more discussion
available under Number Convention (4.2). A Visual Basic application was
produced to help model and analyze the implementation of both findings. A
description of the application is shown in the subtopic RFNNA Simulator (4.3).
With a comprehensive specification of the desired neural network, the project
proceeded with the actual hardware design of the neural network processor. Details
of the modules and how they are optimized in terms of organization are available
under Neural Network Processor Modules (4.4) and Multiplier Bus (4.5).
4.1 Architecture Optimization
4.1.1 Neural Network Architecture and Learning Algorithms
There are two types of neural network architectures: feedforward and recurrent
[13]. The former is less complex in its connections between layers of neurons, and
it has no connections between neurons within the same layer. The recurrent
architecture, on the other hand, requires connections between adjacent layers as
well as connections within a layer.
From an implementation standpoint, the feedforward architecture is more appealing
because of the considerable number of logic gates saved by not implementing these
extra connections. It may be argued that, since a neural network application may
perform better on one type of architecture than another, simplicity of
implementation alone cannot decide which architecture to use. However, the back
propagation supervised learning algorithm, which is widely used in most
applications, works on the feedforward neural network architecture. Given a choice
of architecture for a general purpose neural network FPGA implementation, the
feedforward architecture is therefore selected for its simpler implementation and
wide applicability.
It was therefore decided that the FPGA design in this project would be based on the
feedforward architecture. Since the training portion of the neural network function
is performed separately, the architecture allows all learning algorithms that fall
under the feedforward neural network architecture branch (refer to APPENDIX A).
Besides the back propagation algorithm mentioned above, other learning algorithms
such as the Adaptive Linear Network (Adaline), Multiple Adaptive Linear Networks
(Madaline) and Perceptron can be used.
4.1.2 RFNNA
Following the decision to use the feedforward neural network architecture as the
basis for the design, more thought was given to minimizing the logic gate area
needed to implement it. Other works [8][9] on FPGA implementation of neural
networks mention time multiplexing the resources for connections between neurons.
This, however, slows down the processing speed of the design.
Instead of time multiplexing the resources within a layer by reusing the same neuron
module for each neuron in a hidden layer, the resources of a whole hidden layer can
be time multiplexed, so that a fully parallel implementation of one neuron layer is
reused by subsequent hidden layers. This avoids the loss of processing speed within
a layer. The ability to pipeline the initialization of weights and other resources
while other stages are processing makes the design more efficient.
Taking the idea further by adding controls that determine the number of hidden layer
iterations and the connectivity patterns yields a neural network with a multiple
connection architecture. This architecture is given the name
Reconfigurable Feedforward Neural Network Architecture (RFNNA).
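The idea of reusing one fully parallel layer for every hidden and output layer can be illustrated with a short behavioural sketch in Python. It is a software model only, not the hardware design: the function names are introduced here for illustration, floating point arithmetic is used for brevity, and the hand-picked XOR weights are hypothetical rather than the trained values used in the project.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def shared_layer(inputs, weights, biases):
    """One pass through the single physical layer: every neuron computes
    sigmoid(bias + sum(weight * input))."""
    return [sigmoid(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]

def rfnna_forward(inputs, layer_params):
    """Reuse the same layer once per entry in layer_params.

    layer_params is a list of (weights, biases) pairs, one per hidden or
    output layer. Its length plays the role of the iteration count that the
    control logic would set.
    """
    values = list(inputs)
    for weights, biases in layer_params:
        values = shared_layer(values, weights, biases)
    return values

# Hypothetical hand-picked 2-2-1 XOR weights (assumption, for illustration).
XOR_PARAMS = [
    ([[20.0, 20.0], [-20.0, -20.0]], [-10.0, 30.0]),  # hidden layer: OR, NAND
    ([[20.0, 20.0]], [-30.0]),                         # output layer: AND
]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(rfnna_forward([a, b], XOR_PARAMS)[0]))
```

Only the weight and bias sets change between iterations, which is why loading the next layer's weights can overlap with the current layer's summation and activation steps.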
4.1.3 RFNNA for XOR Problem
The RFNNA example shown in Figure 4.1 is developed to be able to solve the
nonlinear XOR problem. The circuit is designed to have only two inputs and one
output. The hidden layers have a maximum number of 2 neurons. The number of
hidden layers and neurons for each layer to be used is arbitrary and selectable. The
resources are designed to be fully utilized at each layer where the same resources are
used recursively for each hidden and final output layer computation.
Pipelining is used to update the weight values for the next hidden layer's multiplier
during initialization, as well as while the summation and activation function mapping
for the current layer are being performed. Not included in the diagram is the control
logic, which controls the uploading of weights and the activation of the logic
switches. The logic switches connect and disconnect accordingly to ensure proper
network connectivity between registers. The control logic also determines the number
of computation iterations, which in turn depends on the number of hidden layers
selected. This logic switch shows the connection path for the output layer.
Figure 4.1: Reconfigurable Feedforward Neural Network Architecture and Execution
Phase Diagram for XOR Problem
Of particular interest is the way the multiplier constants are updated. As seen in
Figure 4.1, the weights for the subsequent hidden layer, which are stored in memory,
are fetched and stored as the multiplier constants while the current layer is still
executing its summation and activation function transformation. This pipelined method
narrows the processing speed gap between this architecture and a fully cascaded
parallel implementation.
4.2 Number Convention
A neural network computes 3 arithmetic operations, as shown in Figure 2.3. These are
multiplication, summation and conversion of inputs using an activation function.
The generated weights of a trained neural network are not integers; they are decimal
(fractional) numbers. Because logic circuits compute arithmetic functions in binary,
these decimal values have to be expressed in a numbering convention that the circuit
can manipulate. The usual binary representations of such numbers are the
sign-magnitude representation, the twos complement representation and the floating
point representation.
4.2.1 Twos Complement and Floating Point Representation
Twos complement is a number representation for signed integers. It is identical to
the sign magnitude representation for positive values but differs for negative
values. Unlike sign magnitude, twos complement does not set aside the MSB purely as
a sign flag; the sign is encoded in the value itself, because the MSB carries a
negative weight (equation 4.3). The main advantage of twos complement over sign
magnitude is that it simplifies mathematical operations such as addition and
subtraction: no extra logic is required to test the polarity of the sign bit. The
twos complement and sign-magnitude numbering conventions are described as follows.
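Sign Magnitude
$$N = \sum_{i=0}^{n-2} 2^{i} a_{i} \quad \text{if } a_{n-1} = 0$$ (equation 4.1)
$$N = -\sum_{i=0}^{n-2} 2^{i} a_{i} \quad \text{if } a_{n-1} = 1$$ (equation 4.2)
Twos Complement
$$N = -2^{n-1} a_{n-1} + \sum_{i=0}^{n-2} 2^{i} a_{i}$$ (equation 4.3)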
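Equations 4.1 to 4.3 can be checked with a small Python sketch. The function names and the 8 bit example words are chosen here for illustration; bit strings are written MSB first, so the first character is a_{n-1}.

```python
def sign_magnitude_value(bits):
    """Equations 4.1 and 4.2: the MSB is the sign, the rest is the magnitude."""
    n = len(bits)
    magnitude = sum(int(bits[n - 1 - i]) << i for i in range(n - 1))
    return -magnitude if bits[0] == "1" else magnitude

def twos_complement_value(bits):
    """Equation 4.3: the MSB carries weight -2^(n-1), other bits are positive."""
    n = len(bits)
    low = sum(int(bits[n - 1 - i]) << i for i in range(n - 1))
    return -(int(bits[0]) << (n - 1)) + low

print(sign_magnitude_value("10000011"))   # -3
print(twos_complement_value("11111101"))  # -3
print(twos_complement_value("01100100"))  # 100

# Twos complement addition needs no sign test: add and drop the carry.
total = (0b11111101 + 0b00000101) & 0xFF  # (-3) + 5
print(twos_complement_value(format(total, "08b")))  # 2
```

The last two lines show the advantage stated above: the same unsigned adder produces the correct signed result once the carry out of the MSB is discarded.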
Floating point numbers are used when the numbers to be represented are spread
across a wide range. The format is the binary equivalent of decimal scientific
notation. A floating point number is divided into 3 sections (Figure 4.2): the sign
bit, the exponent and the significand (or mantissa).
Sign (1 bit) | Biased Exponent (8 bits) | Significand (23 bits)
E.g.: 0 10010011 10100010000000000000000 = 1.6328125 x 2^20
      1 10010011 10100010000000000000000 = -1.6328125 x 2^20
Figure 4.2: 32-Bit Floating Point Format [7]
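The example word in Figure 4.2 can be decoded field by field with a short Python sketch. It assumes the IEEE 754 single precision layout with an exponent bias of 127 and handles normalized numbers only; the function name is introduced here for illustration.

```python
import struct

def decode_float32(bits):
    """Split a 32-character bit string into sign, exponent and significand."""
    sign = int(bits[0])
    exponent = int(bits[1:9], 2) - 127          # remove the bias of 127
    significand = 1 + int(bits[9:], 2) / 2**23  # implicit leading 1
    return sign, exponent, significand

word = "0" + "10010011" + "10100010000000000000000"
sign, exponent, significand = decode_float32(word)
print(sign, exponent, significand)

# Cross-check against the platform's own float32 decoding.
value = struct.unpack(">f", int(word, 2).to_bytes(4, "big"))[0]
assert value == (-1) ** sign * significand * 2 ** exponent
```

The three separate fields are what make floating point hardware costly: a multiplier must process the significands and the exponents with different logic.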
Mathematical operations on floating point numbers are generally more complex and
require substantially more logic to implement than twos complement operations.
Taking multiplication as an example, the entire operation needed for a twos
complement multiplication corresponds to only the significand part of a floating
point multiplication. Extra logic is then required to manipulate the exponent bits
to reflect the result of multiplying the significands of the two floating point
numbers.
4.2.2 Fixed Point with Fractional Component
Based on these findings, the floating point representation is best avoided.
Besides being complex to implement, its significand cannot be used directly to
address the LUT in which the activation function is implemented. Storing each
number would also use much more memory.
Although the computation within a neural network contains decimal numbers which
may seem more suited for floating point implementation, usage of twos complement
or sign magnitude is possible due to the numbering range which is always limited




Figure 4.3: Graphical Plot of a Sigmoid Function
From equation 4.4, the sigmoid function involves division and the natural
exponential function. Direct mathematical implementation of this function in
hardware would be difficult because of the number of steps involved when the
function is broken down into fundamental operations such as add, subtract, divide
and multiply. Thus a Look-Up-Table (LUT) is used to plot out the function with the





This small and confined range of operation eases the use of twos complement
representation. Besides representing integers, twos complement representation can
also be used to represent fractions.
The size of the fraction represented by one binary step is determined by the
required accuracy of the computation as well as hardware implementation
constraints. As mentioned in Chapter 3.2, the maximum implementable LUT size for the
Virtex 2 FPGA chips is 8 address bits. This gives 2^8 = 256 memory words, each
8 bits wide. From observation of Figure 4.3, the sigmoid function provides distinct
values for x between -6.4 and 6.4. Values outside this range are clipped to either
1 or 0. With a sigmoid LUT and 8 bit addressing, the resolution over the range would
be 6.4/256 = 0.025. An extra bit is used to store the sign information.
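The addressing scheme described above can be sketched in Python as follows. The function name and constants are introduced here for illustration, and the sketch assumes that out-of-range arguments simply saturate at the last address.

```python
LUT_BITS = 8
LUT_SIZE = 2 ** LUT_BITS      # 256 words
X_MAX = 6.4                   # sigmoid treated as saturated beyond +/-6.4

def lut_address(x):
    """Return (sign_bit, address) for a sigmoid argument x."""
    sign = 1 if x < 0 else 0
    address = int(abs(x) * LUT_SIZE / X_MAX)
    return sign, min(address, LUT_SIZE - 1)   # clip out-of-range arguments

print(X_MAX / LUT_SIZE)      # resolution per address step
print(lut_address(3.2))      # (0, 128)
print(lut_address(-3.2))     # (1, 128)
print(lut_address(10.0))     # (0, 255)
```

Only the magnitude is used as the address; the sign is carried separately, which halves the number of LUT entries needed for a symmetric function such as the sigmoid.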
This addressing step size of the LUT is not to be confused with the twos
complement fraction size. The twos complement fraction size depends on the
multiplier and multiplicand sizes used for multiplication. Figure 4.4 below describes the











Figure 4.4: Description of Neural Network Computation
The multiplier represents the weight of the connection, whereas the multiplicand is
the value of the presented input. The value of the multiplicand never exceeds 1₁₀ and
is represented by an 8 bit wide word, whereas the multiplier is represented by an
n-bit wide word depending on the value of the weight. Notice that the multiplicand
has no sign bit as it is always positive. Multiplication of the multiplier and
multiplicand yields an output of (8 + n) bits plus 1 sign bit. This output is then
summed with the other similar products and the bias value in the neuron before being
fed to the sigmoid function. Bits 9 to 16 of the summation result are used to address
the sigmoid LUT. The value stored at that address is the sigmoid function value for
the argument that the address represents. To fully address the 256 items of the
sigmoid LUT, bits 9 to 16 of the summation result must span the decimal values 0 to
6.528 (a value chosen instead of 6.4 for a more straightforward correlation). For
example, to match the value of 6.528 the summation result would be
1111111100000000₂ or 65280₁₀. From this, the twos complement fraction size can be
derived.
Max value = 65280₁₀
Range width = 6.528₁₀
Fraction Size (N) = (6.528/65280)^0.5 = 0.01
E.g.:
Using twos complement fractional representation
Multiplicand = 1₁₀ = 1/0.01 = 100 steps = 01100100₂
Multiplier = 6.528₁₀ = 6.528/0.01 = 653 steps = 1010001101₂




With bits 9 to 16 all set to 1, the highest entry of the sigmoid LUT is accessed and
a corresponding result of 1 is provided. For binary values exceeding
1111111100000000₂, the output is automatically set to 1, whereas for binary
values below 0000000011111111₂ the output is set to 0. The weights and neuron
biases generated by the MetaNeural™ program and the sigmoid outputs are converted
using this representation.
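The worked example above can be reproduced with a few lines of Python. The helper names are introduced here for illustration; the 0.01 fraction size and the use of bits 9 to 16 (counted from the LSB, starting at 1) come from the text.

```python
FRACTION = 0.01   # value of one step, from (6.528 / 65280) ** 0.5

def to_steps(value):
    """Quantize a decimal value to an integer number of 0.01 steps."""
    return round(value / FRACTION)

def lut_address_from_sum(total):
    """Bits 9 to 16 (1-indexed from the LSB) of the summation result."""
    return (total >> 8) & 0xFF

multiplicand = to_steps(1.0)      # 100 steps
multiplier = to_steps(6.528)      # 653 steps
product = multiplicand * multiplier
print(format(multiplicand, "08b"), format(multiplier, "b"))
print(product, format(product, "016b"), lut_address_from_sum(product))
```

Multiplying two values that each use 0.01 steps gives a product in 0.0001 steps, so the product 65300 stands for about 6.53 and its upper byte selects the last LUT entry.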
4.3 RFNNA Simulator
This application was initially written by the author to help simulate the neural
network for the XOR problem using the RFNNA architecture. Neural networks with up
to 3 hidden layers and a maximum of 3 neurons per layer can be simulated using this
application. The program reads its weights from the WGT weight file generated by the
MetaNeural™ application. The simulator simulates neural network operation by showing
all internal mathematical computations and results for a chosen network. Additional
features, such as an accuracy adjustor for setting the number of decimal places used
in mathematical operations, can be used to analyze how well a network performs and
the margin of error caused by rounding and truncating the numbers.
Figure 4.5 below shows the software simulating the first hidden layer of a 2-3-2-1
architecture for the XOR problem. The operation manual for the application is
included in APPENDIX C. More screenshots are available in APPENDIX D.
Figure 4.5: Screenshot of RFNNA Simulator v1.3
4.4 Neural Network Processor Modules
The RFNNA processor was designed using the modular method, as mentioned in
Chapter 3 Methodology. Figure 4.6 shows a basic block diagram of the hardware
implementation of the RFNNA processor. A more detailed and accurate block
diagram is available in APPENDIX E. All modules were individually designed based
on the specifications mentioned earlier and validated using testbenches. However,
some adjustments were made to better suit the architecture for hardware
implementation. The adjustments made to a module are discussed under that module's
subtopic.
The final validation was performed after the control unit of the processor was
completed. The RTL simulation of the RFNNA processor for a 2-3-3-3-1 network was
successful. The discussion in this section is divided according to the modules in
the hardware implementation, as shown in APPENDIX E.
Figure 4.6: Layout of 3 Neuron RFNNA Processor HDL Modules
There are a total of 11 unique modules within the neural network processor designed.
The modules are as follows (labels in brackets refer to the actual module name in the
design schematic):
1) Input Module (mux1)
2) Input Module Counter (mux1_cnt)
3) Bias and Weight ROM Module (values)
4) Bias and Weight Counter (values_cnt)
5) Neuron Module (neuron)
6) Neuron Output Multiplexer Module (mux2)
7) Neuron Output Multiplexer Counter (mux2_cnt)
8) Number Representation Converter Module (interface)
9) Activation Function LUT Module (activ_function)
10) Output Threshold Module (threshold)
11) Control Unit (Fub2)
The simulated 2-3-3-3-1 RFNNA processor is able to process a given set of XOR
inputs in 265 to 270 clock cycles depending on the combination of inputs provided.
This means that if the processor is running at a conservative speed of 10 MHz, the
number of XOR computations possible per second would be around 37000 to 37700.
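The throughput figures follow directly from the clock frequency and the cycle counts. A minimal Python check, with a helper name introduced here for illustration:

```python
def xor_rate(clock_hz, cycles_per_xor):
    """XOR evaluations per second for a given clock and cycle count."""
    return clock_hz / cycles_per_xor

CLOCK_HZ = 10_000_000   # the conservative 10 MHz clock from the text

for cycles in (265, 270):
    print(cycles, "cycles:", round(xor_rate(CLOCK_HZ, cycles)), "per second,",
          round(cycles / CLOCK_HZ * 1e6, 1), "us each")
```

The rate scales linearly with the clock, so the same function gives the expected throughput for any other clock frequency at which the design meets timing.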
4.4.1 Input Module (mux1)
The Input Module serves as a buffer and multiplexer for external inputs as well as
for outputs from the activation function. External inputs are first converted into
their equivalent values for arithmetic computation. For example, if the input is
logic 1, the equivalent value stored in the buffer (mult_reg) for multiplier values
is 01100100₂. An all 0 word is stored if the input is logic 0.
The Input Module accepts external inputs and stores their assigned values into the
multiplier buffers concurrently. However, the same buffers are written to
sequentially when data is passed from the activation function. No conversion is
required for data from the activation function. The control signals provided to the
input sel_reg[1:0] address the buffer to which the incoming data is to be written.
Besides receiving and storing data, the Input Module is also required to broadcast
the values in its buffers onto the multiplier bus (refer to section 4.5 Multiplier
Bus). Again, sel_reg[1:0] is used to select which buffer is being broadcast. To
switch between broadcast and write mode, a control signal from the control unit to
the input sel_inout is used. However, sel_inout is ignored when sel_reg is 2'b00,
which is when the external inputs are being read, converted and stored into the
buffers. Figure 4.7 shows the waveform of the module being simulated.
[Waveform annotations: the input number determines the mult_reg that is selected; data is broadcast from the buffer pointed to by sel_reg; data from external inputs is written to the buffers only when sel_reg = 2'b00.]
Figure 4.7: Simulation Result for Input Module
The Verilog code for the Input Module is available in APPENDIX G. The testbench
for the simulation in Figure 4.7 is available in APPENDIX H.
4.4.2 Input Module Counter (mux1_cnt)
The Input Module Counter provides the control signals to the Input Module's
sel_reg[1:0] to select between the multiplier buffers. The counter counts from 0 to
3 and then loops back to 1. Whenever there is a reset, the counter goes back to 0.
Figure 4.8: Mealy Machine for Input Module Counter
The sel input signal comes from the Control Unit of the neural network processor.
The reset signal is provided by the external reset. The Verilog code for the Input
Module Counter is available in APPENDIX I.
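The counting sequence can be modelled behaviourally in Python. This is a sketch, not the Verilog in APPENDIX I: the class and method names are introduced here, and it assumes that the counter advances once per sel pulse.

```python
class InputModuleCounter:
    """Counts 0, 1, 2, 3 and then wraps to 1; reset returns it to 0."""

    def __init__(self):
        self.count = 0

    def reset(self):
        self.count = 0

    def pulse_sel(self):
        # Assumed behaviour: one step per sel pulse from the Control Unit.
        self.count = 1 if self.count == 3 else self.count + 1
        return self.count

counter = InputModuleCounter()
print([counter.pulse_sel() for _ in range(7)])  # [1, 2, 3, 1, 2, 3, 1]
counter.reset()
print(counter.count)                             # 0
```

Wrapping to 1 rather than 0 is consistent with the Input Module description, where sel_reg = 2'b00 is reserved for reading the external inputs and is therefore only reached through a reset.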
4.4.3 Bias and Weight ROM Module (values)
There are three Bias and Weight ROM modules in the designed processor. Each
module is similar to the other, differing only in the weight and bias values they carry.
Each module is dedicated to one neuron and stores bias and weight values for three
hidden layers and one output layer.
As in the Input Module, weight values are passed sequentially to the Neuron
Module. Each module has an internal counter which tells the module which weight
value is to be passed on. The external signal to the module's layer[1:0] input
selects the layer whose bias and weight values are used. The bias values stored in
the modules are in twos complement format, while weight values are stored as sign
magnitude integers. The fraction sizes for the weight and bias values are different:
0.01 for weight values and 0.0001 for bias values. The bias uses the finer fraction
because it is added to products of two 0.01-step values, and such products have a
step of 0.01 x 0.01 = 0.0001.
[Waveform annotations: bias values remain the same for one hidden layer; the counter selects the weight value to be transmitted; each select pulse increments the module's counter.]
Figure 4.9: Simulation Result for Bias and Weight ROM Module
The Verilog code for the Bias and Weight ROM Module is available in APPENDIX
J. The testbench for the simulation in Figure 4.9 is available in APPENDIX K.
4.4.4 Bias and Weight Counter (values_cnt)
The Bias and Weight Counter drives the layer[1:0] input of the Bias and Weight ROM
Module. This counter keeps track of which hidden layer or output layer the processor
is in. Figure 4.10 shows the state machine for the counter.
Figure 4.10: Mealy Machine for Bias and Weight Counter
The counter is incremented by the sel signal from the Control Unit of the neural
network processor. The reset signal is provided by the external reset. The Verilog
code for the Bias and Weight Counter is available in APPENDIX L.
4.4.5 Neuron Module (neuron)
Instead of just using the "*" arithmetic operator to perform multiplication, as in
the line below, the multiplication process needs finer control so that control bits
can be incorporated into it.
add_out <= multiplier_in * multiplicand_in;
The control bits are important so that feedback, such as when the multiplier,
multiplicand and bias values have been recorded or when the multiplication and
summation have completed, can be provided back to the control unit. The control unit
then determines the appropriate action for the next operation in the neuron module
for the hidden layer iteration.
Multiplication for the neuron module uses the Add Shift Right (ASR) algorithm.
This algorithm is suited for unsigned binary multiplication which is the type of data













Figure 4.11: Flowchart for Unsigned Binary Multiplication
While the multiplication algorithm operates on unsigned binary integers, the register
mult_sign stores the sign bit of the multiplicand. The sign bit acts as a flag
indicating whether the multiplication result needs to be complemented before it is
summed with the value stored in add_out.
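The add and shift right scheme for unsigned operands, followed by the sign-dependent accumulation, can be sketched in Python. This is a behavioural model of the textbook algorithm rather than the Verilog in APPENDIX M: the register names a, q and m, the function names and the default 10 bit width are assumptions made here for illustration.

```python
def add_shift_right_multiply(multiplier, multiplicand, n):
    """Unsigned n-bit multiply; returns the 2n-bit product held in A:Q."""
    mask = (1 << n) - 1
    a, q, m = 0, multiplier & mask, multiplicand & mask
    for _ in range(n):
        carry = 0
        if q & 1:                       # LSB of Q set: add M into A
            total = a + m
            carry, a = total >> n, total & mask
        # Shift the combined carry:A:Q register right by one bit.
        q = (q >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (carry << (n - 1))
    return (a << n) | q

def signed_accumulate(acc, multiplier, magnitude, sign_bit, n=10):
    """Add or subtract the unsigned product depending on the stored sign."""
    product = add_shift_right_multiply(multiplier, magnitude, n)
    return acc - product if sign_bit else acc + product

print(add_shift_right_multiply(0b1101, 0b1011, 4))   # 143
print(signed_accumulate(10000, 100, 653, sign_bit=1))
```

Because one bit is processed per loop iteration, an n-bit multiplication takes n steps, which gives the control unit well-defined points at which to report status.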
[Waveform annotations: start signifies the start of a hidden layer and takes in the values for multiplier, multiplicand and bias; add_out shows the loading of the bias value, then "bias value - multiplication result"; status_out signifies that values have been recorded and that the arithmetic operation has completed.]
Figure 4.12: Simulation Result for Neuron Module
Figure 4.12 shows the waveform of all internal operations of the Neuron Module.
There is only one multiplier per neuron. The multiplier and multiplicand values,
provided by the Input Module and the Bias and Weight ROM Module respectively, are
therefore multiplied and summed sequentially with the bias value, which is stored
directly in the add_out register. The loading of values at the start of a hidden
layer differs from the loading of values thereafter. Bias values are loaded into the
Neuron Module only once for every hidden layer, whereas multiplier and multiplicand
values are loaded at every iteration. To differentiate between the start of a hidden
layer iteration and a normal iteration, two different stimulus signals are provided:
start and sel. The status_out register signals when values have been loaded into the
module and when the arithmetic operations for a particular iteration have completed.
The Verilog code for the Neuron Module is available in APPENDIX M. The
testbench for the simulation in Figure 4.12 is available in APPENDIX N.
4.4.6 Neuron Output Multiplexer Module (mux2)
The purpose of the Neuron Output Multiplexer Module is to multiplex between the
outputs of the neurons such that only one output is passed through the activation
function. There is only one activation function in the processor and all neurons
have to share it. This is because the implementation of an 8 bit wide LUT requires a
lot of logic gates, so it is not feasible to have a dedicated activation function
LUT for each neuron.
[Waveform annotations: selection of output occurs when reset = 0; outputs from the neurons are selected one at a time.]
Figure 4.13: Simulation Result for Neuron Output Multiplexer Module
The module does not buffer the outputs from the Neuron Modules. It reads in all the
values simultaneously and passes on only the one chosen by the sel[1:0] input.
The Verilog code for the Neuron Output Multiplexer Module is available in
APPENDIX O. The testbench for the simulation in Figure 4.13 is available in
APPENDIX P.
4.4.7 Neuron Output Multiplexer Counter (mux2_cnt)
The Neuron Output Multiplexer Counter's design and function are the same as those of
the Bias and Weight Counter. The output of the counter, however, is used to drive
the Neuron Output Multiplexer Module's sel[1:0] input.
The counter is incremented by the sel signal from the Control Unit of the neural
network processor. The reset signal is provided by the external reset. The Verilog
code for the Neuron Output Multiplexer Counter is available in APPENDIX Q.
4.4.8 Number Representation Converter Module (interface)
The purpose of the interface is to convert the twos complement result from the
neuron module back to an unsigned binary integer. This is because a negative twos
complement result cannot be used with the activation function unless it is modified.
The interface is also required because it relieves each neuron module of the extra
logic needed to perform the conversion. A centralized interface reduces the amount
of logic used, because the activation function is accessed by only one neuron at a
time. The output of the interface is an 8 bit wide word with a separate sign bit and
a status flag bit.
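A possible behavioural model of the conversion is sketched below in Python. Several details are assumptions rather than facts from the design files: the 21 bit input width, the use of a bitwise complement for negative values, and the two test words, which are this editor's reading of the Figure 4.14 waveform and may be mis-transcribed.

```python
WIDTH = 21   # assumed width of the neuron sum, inferred from the waveforms

def convert(add_in):
    """Return (sign_bit, 8-bit magnitude) for the activation function."""
    sign = (add_in >> (WIDTH - 1)) & 1
    if sign:
        # Assumed: bitwise complement, as the waveform annotation suggests.
        add_in = ~add_in & ((1 << WIDTH) - 1)
    return sign, (add_in >> 8) & 0xFF   # bits 9 to 16

print(convert(0x040800))  # (0, 8)
print(convert(0x140800))  # (1, 247), i.e. 0xF7
```

A bitwise complement differs from a true twos complement negation by one LSB. Since the lowest 8 bits are discarded before addressing the LUT, that difference would only matter when the lower byte is all zeros.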
The Verilog code for the Number Representation Converter Module is available in
APPENDIX R. The testbench for the simulation in Figure 4.14 is available in
APPENDIX S.
[Waveform annotations: the sel input signal triggers the module; add_in carries input values (positive, then negative) from the Neuron Output Multiplexer Module; status_out notifies when an output is ready to be read; logic 1 on sum_sign_out signifies that the output value has been complemented; bits 9 to 16 are copied directly for positive values, while negative values are complemented first before being copied.]
Figure 4.14: Simulation Result for Number Representation Converter Module
4.4.9 Activation Function LUT Module (activ_function)
The LUT, which is implemented and declared as a ROM block in the FPGA device itself,
contains many redundant entries. The LUT is addressed by an 8 bit input, giving 256
entries. Using mathematical analysis, it is possible to reduce the ROM usage from
256 x 8 bit word entries to 68 x 8 bit word entries. This is because only 50 unique
values are accessed in the LUT, ranging from the decimal equivalent of 50 to 99.
All input combinations are accounted for with the help of mathematical analysis as
shown below. Several of the combinations can be grouped together for an entry thus


















Figure 4.15: Mathematical Analysis on Sigmoid LUT Values
Inputs in each colored band in Figure 4.15 are grouped together to address the same
LUT entry.
The inputs from the Number Representation Converter Module are used to address
the activation function LUT as well as to note the sign of the input argument. As
mentioned before, the LUT only holds values from 50 to 99, corresponding to the
sigmoid range of 0.50 to 0.99. These values are only valid for positive arguments.
If the sign of the argument is negative, the LUT value is subtracted from 100 to
produce the correct answer.
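The LUT contents and the handling of negative arguments can be sketched in Python. The table below is generated rather than copied from APPENDIX T, so the address step of 6.528/255 and the truncation to integers are assumptions; under them the table holds exactly the 50 values from 50 to 99 described above.

```python
import math

STEP = 6.528 / 255   # assumed argument increment per LUT address

# Positive half of the sigmoid, stored as integers 50..99 (0.50 to 0.99).
LUT = [min(99, int(100 / (1 + math.exp(-addr * STEP)))) for addr in range(256)]

def activation(address, sign_bit):
    """Look up the sigmoid value; mirror it for negative arguments."""
    value = LUT[address]
    return 100 - value if sign_bit else value  # sigmoid(-x) = 1 - sigmoid(x)

print(len(set(LUT)), min(LUT), max(LUT))
print(activation(100, 0), activation(100, 1))
```

The subtraction from 100 relies on the symmetry sigmoid(-x) = 1 - sigmoid(x), which is why only the positive half of the function needs to be stored.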
[Waveform annotations: for a negative sign bit, function_out is the manipulated equivalent (100 - LUT value); for a positive sign bit, function_out is the value from the LUT.]
Figure 4.16: Simulation Result for Activation Function LUT Module
The Verilog code for the Activation Function LUT Module is available in
APPENDIX T. The testbench for the simulation in Figure 4.16 is available in
APPENDIX U.
4.4.10 Output Threshold Module (threshold)
The Output Threshold Module categorizes the output as logic 1, logic 0 or
indeterminate. Output from the Activation Function LUT Module is continuously
processed to provide the classification. For logic 1, the output from the activation
function must be in the range of 0.9 to 1.0. For logic 0, the output must be in the
range of 0.0 to 0.1. Any other value produces a high impedance output.
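The classification rule can be written as a small Python function. It assumes that the activation output arrives as an integer from 0 to 100 (so 0.9 corresponds to 90) and that both range boundaries are inclusive; the function name and the "Z" marker for high impedance are introduced here for illustration.

```python
def threshold(function_out):
    """Classify an activation output given as an integer 0..100."""
    if function_out >= 90:
        return "1"
    if function_out <= 10:
        return "0"
    return "Z"   # high impedance: indeterminate state

print([threshold(v) for v in (99, 92, 50, 8, 1)])  # ['1', '1', 'Z', '0', '0']
```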
[Waveform annotations: function_in carries inputs from the Activation Function; the output shows logic 1, the indeterminate state and logic 0 in turn.]
Figure 4.17: Simulation Result for Output Threshold Module
The Verilog code for the Output Threshold Module is available in APPENDIX V.
The testbench for the simulation in Figure 4.17 is available in APPENDIX W.
4.4.11 Control Unit (Fub2)
The Control Unit for this neural network processor has 10 distinct inputs and 10
distinct outputs. The Control Unit guides the rest of the modules to function as
intended at the right time. The sequence of operations is as stated in APPENDIX
F. The Control Unit was designed directly in the finite state machine editor and
converted to Verilog.
The Control Unit provides control signals to all modules except for the Number
Representation Converter Module and the Activation Function LUT Module. The















[Waveform annotations: correct output on Out1; the Ready signal shows processor ready, processor busy, then processor ready.]
Figure 4.18: Simulation Result for RFNNA Processor
Figure 4.18 shows the simulation of the whole neural network processor, with the
modules discussed earlier working as one entity. The Ready flag notifies the user
when the processor is available or when processing is complete.
The Verilog code for the Control Unit is available in APPENDIX X. The testbench
for the simulation in Figure 4.18 is available in APPENDIX Y.
4.5 Multiplier Bus
An additional improvement to the original RFNNA architecture would be the
inclusion of the multiplier bus for transmission of inputs for multiplication with
connection weights. The bus works by time multiplexing its resources for the







Figure 4.19: RFNNA Processors Paralleled
This implementation of the multiplier bus greatly reduces the number of multipliers
required. In the previous architecture, the number of multipliers required for an N
neuron architecture would be N^2 in total, whereas the number required now is only
N. The reduction is from quadratic to linear in N, which saves a significant amount
of logic resources.
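The effect of the bus on the multiplier count can be illustrated with a Python sketch. It models one input value being broadcast per bus cycle while every neuron multiplies it by its own weight and accumulates the result; the function names and the integer example values are introduced here for illustration.

```python
def layer_via_bus(inputs, weights, biases):
    """Each bus cycle broadcasts one input; every neuron multiplies it by its
    own weight and accumulates, so N neurons need only N multipliers."""
    sums = list(biases)                       # accumulators preloaded with bias
    for cycle, value in enumerate(inputs):    # one broadcast per cycle
        for neuron in range(len(sums)):       # in hardware these run in parallel
            sums[neuron] += weights[neuron][cycle] * value
    return sums

def multiplier_count(n, use_bus):
    """Multipliers needed for an n-neuron layer with n inputs per neuron."""
    return n if use_bus else n * n

print(layer_via_bus([1, 2, 3], [[1, 0, 0], [0, 1, 0], [1, 1, 1]], [10, 20, 30]))
print(multiplier_count(3, use_bus=False), multiplier_count(3, use_bus=True))
```

The saving in multipliers is paid for in time: a layer with N inputs now needs N bus cycles instead of one parallel step.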
The implementation of the multiplier bus also opens up the possibility of
paralleling similar RFNNA processors as in Figure 4.19, thus increasing the
number of inputs, outputs and neurons. This ability would allow neural network
processing to be applied to more applications, as the number of inputs and outputs
is no longer a limitation.
5 CONCLUSION AND FUTURE IMPROVEMENT
5.1 Conclusion
The project can be rated as successful with the completion of the RTL design for the
neural network processor for the XOR problem. The objective of optimizing the
implementation has been pursued on two fronts: architecture and organization.
Through research and analysis, the RFNNA architecture and the optimized
numbering convention have been specified. The RFNNA architecture is a space
efficient implementation of a neural network in hardware, where all hidden layers
use the same logic resources without sacrificing implementation speed. The twos
complement fixed point fraction, on the other hand, increases processing speed by
simplifying the computation of binary values and the addressing of the sigmoid
activation function's LUT. In the process of analyzing the proposed architecture, an
application was also developed to help simulate and justify neural networks based on
the RFNNA architecture. Organization improvements through the implementation of
the multiplier bus enable the processor to process in parallel. The decision to use
a single activation function ROM module helps reduce implementation size, as does
the recursive use of a single multiplier within each neuron.
All of the mentioned findings and implementation have contributed towards
achieving the objectives of the project in which an optimized FPGA hardware




The project provides a strong foundation on which more sophisticated neural network
processors can be built. The FeedForward architecture utilising the
BackPropagation learning algorithm caters for a variety of applications in which fast
processing is a must. With this motivation in hand, it is viable for processors to be
designed catering for these needs.
The RFNNA design and the hardware organization design methodologies described in
this report should provide any beginner in neural network hardware design with much
useful analysis. However, there are still areas which can be improved on in terms of
design and implementation. The author would suggest that future designs have a
general purpose neural network processor in mind.
This general purpose neural network processor ideally can be used for any
applications and can be parallel processed with similar processors so that the number
of inputs,outputs and neurons per layerwill not be a constraint. The processor would
only require the weight and bias values to be reprogrammed. These values can be
stored on external memory so that hardware reprogramming is not required. The
possibility of an ASIC implementation would be more plausible then. The processor
would also have external inputs which controls the number of hidden layers it can
process. For this, the designer must do away with a hardwired implementation of the
control unit.
With a general purpose neural network processor, implementations in ASIC
technology would provide faster processing and at lower prices, opening up the
possibility of neural network processing to a multitude of applications.
REFERENCES
[1] Callan R. 1999, Essence ofNeural Networks, Hertfordshire, Prentice Hall
Europe.
[2] Bose N.K., Liang P. 1996, Neural Network Fundamentals with Graphs,
Algorithms and Applications, McGraw-Hill, Singapore.
[3] Cofftnan, K. 1999, Real World FPGA Design with Verilog, Prentice Hall,
New Jersey.
[4] Picton, Phil 1994, Introduction to Neural Networks, MacMillan Press LTD,
London.
[5] Sundararajan N., Saratchandran P. 1998, Parallel Architectures for Artificial
Neural Networks, IEEE Computer Society, Los Alamitos, California.
[6] Vaughn B., Jonathan R., Alexander M. 1999, Architecture and CAD for
Deep-Submicron FPGA's, Kluwer Academic Publishers, Massachusetts.
[7] William Stailings 2003, Computer Organization and Architecture, Prentice
Hall, New Jersey.
[8] J. Zhu, GJ. Milne and B.K. Gunther, Towards An FPGA Based
Reconfigurable Computing Environment for Neural Network
Implementations, in Proceedings of IEE Conference on Artificial Neural
Networks, 1999: p. 661-666
[9] Eldredge, J.G. and B.L Hutchings, Design Methodologies for Partially
Reconfigured Systems, in Proceedings of IEEE Symposium on FPGAs for
Custom Computing Machines, 1994: p. 78-84.
43
[10] Brown, S. and Vranesic, Z. 2003, Fundamentals of Digital Logic with
Verilog Design, McGraw Hill.
[11] Ciletti, M. 2002, AdvancedDigital Design with the VerilogHDL, Prentice
Hall, New Jersey.
[12] Szabo T., Feher B. and Horvath G. 1998, Neural Network Implementation
Using Distributed Arithmetic, in Proceedings of the 2nd International
Conference on Knowledge Based Intelligent Electronic Systems
[13] Haykin S. 1999, Neural Networks - AComprehensive Foundation, Prentice
Hall New Jersey, p 23
[14] Gschwind, M. V. Salapura. and O. Maischberger, Space Efficient Neural
Network Implementation, in Proceedings of the 2nd ACM Workshop on Field
Programmable Gate Arrays, 1994. p. 23-28
APPENDIX A













































Similarities and Differences between Neural Net and Von Neumann Computer.

Neural Net | Von Neumann Computer
1. Trained (learning by example) by adjusting the connection strengths, thresholds and structure | Programmed with instructions (if-then analysis based on logic)
2. Memory and processing elements are collocated | Memory and processing separate
3. Parallel (discrete and continuous), and asynchronous | Sequential or serial, digital, synchronous (with a clock)
4. May be fault tolerant because of distributed representation and large scale redundancy | Not fault tolerant
5. Self organization during learning | Software dependent
6. Knowledge stored is adaptable; | Knowledge stored in an addressed




Operation of the RFNNA Simulator v1.3 Software Application
The RFNNA Simulator works in 2 simulation modes:
a) Direct Simulation
- In this mode, the user obtains instant results for the XOR problem for any given
input when the simulation button is pressed.
b) Controlled Simulation
- An extra command button "Next Sequence" appears. In this mode the simulation pauses after
computation is performed on every hidden layer. To proceed to the next hidden layer, press the
"Next Sequence" command button.
NOTICE:
Pattern files can be manually generated, examples of neural network architecture and corresponding




The wgt files are generated directly by the metaNeural software. However, to use them with
this software, the user has to follow the appropriate naming convention.
Example:
W2131.wgt
W = All file names must start with "W"
2 = Indicates the number of inputs
1 = Indicates the number of hidden layers
3 = Indicates the number of neurons for the 1st hidden layer, and so on
1 = Indicates the number of outputs
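The naming convention above can be decoded mechanically. The following Python sketch
is illustrative only and is not part of the RFNNA Simulator: the function name is
hypothetical, and it assumes that every field is a single decimal digit, as in the
example, and that the extension is ".wgt".

```python
def parse_wgt_name(filename):
    """Decode a weight file name such as 'W2131.wgt'.

    Assumed layout (one digit per field): 'W', number of inputs,
    number of hidden layers, neurons in each hidden layer, number of outputs.
    """
    stem, dot, ext = filename.rpartition(".")
    if not dot or ext.lower() != "wgt":
        raise ValueError("expected a .wgt file name")
    if not stem.startswith("W"):
        raise ValueError('file name must start with "W"')
    fields = stem[1:]
    if not fields.isdigit() or len(fields) < 3:
        raise ValueError("expected digits after the leading W")
    digits = [int(ch) for ch in fields]
    n_inputs, n_hidden = digits[0], digits[1]
    # One digit per hidden layer, followed by exactly one digit for the outputs.
    if len(digits) != n_hidden + 3:
        raise ValueError("digit count does not match the number of hidden layers")
    return {
        "inputs": n_inputs,
        "hidden_layers": digits[2:2 + n_hidden],
        "outputs": digits[-1],
    }
```

Under this reading, a 2-3-2-1 network would be stored as W22321.wgt: two inputs, two
hidden layers of three and two neurons, and one output.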
The WGT files will be automatically retrieved from c:\METACTRIA. Please ensure that the wgt files




Screenshots of RFNNA Simulator Performing XOR operation for 2-3-2-1
Neural Network Architecture.




















































// Title : mux_1.v
// Author : Ivan Teh Fu Sun




// Description : Interfaces between inputs/LUT and neurons
//
//











always @ (posedge clk)
begin































// stores output from activation function into respective buffers sequentially
else
begin




if (sel_reg == 2'b10)
begin
mult_reg2 <= function_in;
end
if (sel_reg == 2'b11)
begin







Testbench for Input Module















































Input Module Counter (mux1_cnt)
mux1_cnt.v
Ivan Teh Fu Sun
University Teknologi Petronas
//
// Description : provides selection for mux_1 module
//
//













2'b00 : mux1_sel <= 2'b01;
2'b01 : mux1_sel <= 2'b10;
2'b10 : mux1_sel <= 2'b11;


























Ivan Teh Fu Sun
University Teknologi Petronas
Description : ROM block for bias and multiplicand values
Bias values are in twos complement
Multiplicand values are in unsigned binary
integer. MSB of multiplicand is the sign bit
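The multiplicand format described above (a sign bit followed by an unsigned magnitude)
can be checked against the decimal values annotated in the comments of the listing
that follows. The Python sketch below is an illustration only: the function name is
hypothetical, and the scale factor of 100 is an inference from those comments (for
example the words commented as -2.40 and 3.83), not a value stated in the report.

```python
def decode_multiplicand(word, scale=100):
    """Decode an 11-bit sign-magnitude word into a decimal weight.

    Bit 10 is the sign bit (1 means negative); bits 9..0 hold the
    unsigned magnitude. The scale of 100 is an assumption inferred
    from the annotated values in the ROM listing.
    """
    sign = (word >> 10) & 1
    magnitude = word & 0x3FF  # lower 10 bits
    value = magnitude / scale
    return -value if sign else value
```

Decoding the five annotated words with this scale reproduces their comments exactly,
which supports, but does not prove, the assumed scale.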




































cnt <= cnt + 1;
end
if (cnt == 2'b01)
begin
multiplicand_out <= 11'b00010000010;










if (cnt == 2'b00)
begin
multiplicand_out <= 11'b10100110010;








cnt <= cnt + 1;
end
if (cnt == 2'b01)
begin
multiplicand_out <= 11'b10011110000; //-2.40
cnt <= cnt + 1;
end
if (cnt == 2'b10)
begin








multiplicand_out <= 11'b00101111111; //3.83
bias_out <= 21'b000000000011101110011; //0.19
cnt <= cnt + 1;
end
if (cnt == 2'b01)
begin
multiplicand_out <= 11'b10101111110; //-3.82










if (cnt == 2'b00)
begin
multiplicand_out <= 11'b01001001101; //5.89
bias_out <= 21'b111111111100001101010; //-0.18
cnt <= cnt + 1;
end
if (cnt == 2'b01)
begin
multiplicand_out <= 11'b00001110111; //1.19
cnt <= cnt + 1;
end
if (cnt == 2'b10)
begin








Testbench for Bias and Weight ROM Module







wire [10:0] multiplicand_out;









































sel = 1'b1;
#76000




sel = 1'b0;
#95000
sel = 1'b1;
#96000
sel = 1'b0;
#100000






















Bias and Weight Counter (values_cnt)
values_cnt.v
Ivan Teh Fu Sun
University Teknologi Petronas
//
// Description : provides selection for neuron_values module
//
//
module values_cnt (sel, reset, values_sel);
input sel;
input reset;
output [1:0] values_sel;
reg [1:0] values_sel;
always
begin









































Ivan Teh Fu Sun
University Teknologi Petronas
Description : Neuron block containing multiplication and summation blocks
Multiplication is using the AddShiftRight (ASR)
algorithm for unsigned binary.
Value of bias is in twos complement
Value of multiplier and multiplicand is in unsigned binary
integer.
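The description above names the AddShiftRight (ASR) algorithm for unsigned
multiplication. The Python sketch below shows the general technique under stated
assumptions and is not a transcription of the neuron2 module: the function name and
the q_bits parameter are hypothetical, the combined register mirrors the role of
AQ_reg (accumulator in the upper bits, multiplier in the lower bits), and hardware
register widths and sign handling are not modelled.

```python
def add_shift_right_multiply(multiplier, multiplicand, q_bits=8):
    """Unsigned multiplication by repeated add and shift right.

    The combined register holds the accumulator A in its upper bits and
    the multiplier Q in its lower q_bits. On each step, if the least
    significant bit of Q is 1, the multiplicand is added to A; the whole
    register is then shifted right by one bit.
    """
    if not 0 <= multiplier < (1 << q_bits):
        raise ValueError("multiplier does not fit in q_bits")
    if multiplicand < 0:
        raise ValueError("multiplicand must be unsigned")
    aq = multiplier  # A starts at zero, Q holds the multiplier
    for _ in range(q_bits):
        if aq & 1:                       # Q0 == 1: add M into A
            aq += multiplicand << q_bits
        aq >>= 1                         # shift A and Q right together
    return aq                            # the product now fills the register
```

After q_bits steps every multiplier bit has been consumed, so the register holds the
full product. One adder is reused on every step, which is why this approach suits a
neuron that has only a single multiplier.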
module neuron2 (start, sel, clk, multiplier_in, multiplicand_in, bias_in,




input [7:0] multiplier_in;
input [10:0] multiplicand_in;
input [20:0] bias_in;
output [20:0] add_out;
output status_out;
reg status_out;
reg [20:0] add_out;
reg [20:0] AQ_reg;




//no sign bit, multiplier value is always positive
//output in twos complement
//Overflow bit included
//1 means negative









Mult_sign <= multiplicand_in[10];
//XOR the sign bit, 1 means -ve
add_out <= bias_in[20:0];
status_out <= 1'b1;







AQ_reg[7:0] <= multiplier_in[7:0];
M_reg <= multiplicand_in[9:0];
cnt <= 4'b0;
Mult_sign <= multiplicand_in[10];
status_out <= 1'b1;




















if(add_flag) //When Q = 1
begin









AQ_reg[18:8] <= AQ_reg[18:8] + M_reg;
add_flag <= 1'b1;
// status bit notifies that shift needs to take place











else if(cnt == 4'b1000)
begin
if (Mult_sign)
//only multiplicand value can be negative
begin
//thus change in sign depends solely on multiplicand's sign
if (AQ_reg != 0)
begin
AQ_reg <= {3'b111,~AQ_reg[17:0]};
//twos complement inversion, +1 not required
end
end






Testbench for Neuron Module






















multiplier_in = 9'b100000100; //Q
























Neuron Output Multiplexer Module (mux2)
mux_2.v
Ivan Teh Fu Sun
University Teknologi Petronas
//
// Description : Interface between neuron_mult_sum2 and activation_function's
// interface.
// Selects between the outputs of neurons in hidden layer to be
// presented to the neuron_AF_interface
//
//
module mux2 (add_in1, add_in2, add_in3, sel, clk, reset, add_out, status_out);












































Testbench for Neuron Output Multiplexer Module
























































Neuron Output Multiplexer Counter (mux2_cnt)
values_cnt.v
Ivan Teh Fu Sun
University Teknologi Petronas
//
// Description : provides selection for neuron_values module
//
//




reg [1:0] values_sel;







2'b00 : values_sel <= 2'b01;
2'b01 : values_sel <= 2'b10;
2'b10 : values_sel <= 2'b11;
















Ivan Teh Fu Sun
University Teknologi Petronas
//
// Description : Interface between neuron_mult_sum2 and activation_function.
// Converts 23 bit twos complement number into signed integer.
// Selects bits 9 to 16 and sign bit for output
//
//














sum_out <= add_in[15:8];
sum_sign_out <= 0;















Testbench for Number Representation Converter Module












































Author : Ivan Teh Fu Sun
University : University Teknologi Petronas
Description : ROM block for sigmoid activation function
module activ_function (summation_in, sel, clk, sum_sign_in, function_out,
status_out);
input [7:0] summation_in;













8'b0000000x : function_out
8'b0000001x : function_out




8'b0000101x : function_out
8'b00001100 : function_out
8'b00001101 : function_out
8'b0000111x : function_out
8'b0001000x : function_out
8'b0001001x : function_out








8'b0010000x : function_out
8'b0010001x : function_out
8'b0010010x : function_out












8'b0011111x : function_out
















































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Testbench for Activation Function LUT Module

















































Ivan Teh Fu Sun
University Teknologi Petronas
Description : Provides threshold values for the RFNNA output
module threshold (function_in, clk, sel, out, ready);
















8'b01011001 : out <= 1'b1;
8'b0101101x : out <= 1'b
8'b010111xx : out <= 1'b1;
8'b011xxxxx : out <= 1'b1;
8'b0000100x : out <= 1'b
8'b000010x0 : out <= 1'b0;
8'b00000xxx : out <= 1'b0;









Testbench for Output Threshold Module




































`timescale 1ns / 1ps
module Fub2 (AF_stat, clk, mux1_cnt, mux1_sel_cnt, mux1_sel_inout, mux2_cnt,
mux2_rst, mux2_sel_cnt, nl_stat, neuron_sel, neuron_start, p_ready, p_start, reset,









































// BINARY ENCODED state machine: Sreg0









































// NextState logic (combinatorial)
//
always @ (p_start or vl_cnt2 or nl_stat or mux2_cnt or vl_cnt or AF_stat or
threshold_ready or mux1_sel_inout or mux1_sel_cnt or vl_sel_cnt or values_rst or





















next_mux1_sel_inout = 1'b0;
next_mux1_sel_cnt






















next_mux1_sel_cnt = 1'b0;
next_vl_sel_cnt = 1'b0;
next_values_sel = 1'b0;
next_neuron_start = 1'b1;




next_neuron_start = 1'b0;
next_neuron_sel = 1'b0;
next_values_sel = 1'b1;
next_mux1_sel_cnt = 1'b1;



































if (vl_cnt2 == 2'b11)
NextState_Sreg0 <= `S16;










if (mux2_cnt == 2'b10)
NextState_Sreg0 <= `S11;
else if (vl_cnt == 2'b11)
NextState_Sreg0 <= `S12;
else if (mux2_cnt != 2'b10)





NextState_Sreg0 <= `S5;
next_mux1_sel_cnt = 1'b1;
if (AF_stat)







































// Current State Logic (sequential)
//








// Registered outputs logic
always @ (posedge clk or posedge reset)
begin : Sreg0_RegOutput
if (reset)
begin
mux1_sel_inout <= 1'b0;
mux1_sel_cnt <= 1'b0;
vl_sel_cnt <= 1'b0;
values_rst <= 1'b1;
values_sel <= 1'b0;
neuron_sel <= 1'b0;
neuron_start <= 1'b0;
mux2_sel_cnt <= 1'b0;
mux2_rst <= 1'b1;
end










mux2_sel_cnt <= next_mux2_sel_cnt;
mux2_rst <= next_mux2_rst;






Testbench for RFNNA Processor
`timescale 1ps / 1ps
module TB_RFNNA_processor;
reg In1;
reg In2;
reg reset;
reg Start;
reg clk;
wire Out1;
wire Ready;
RFNNA_Processor UUT (In1, In2, Start, clk, reset, Out1, Ready);
// Stimulus: hold reset, apply the inputs, release reset, then assert Start
initial
begin
    clk = 0;
    reset = 1;
    #10000
    In1 = 1'b0;
    In2 = 1'b1;
    #30000
    reset = 0;
    #40000
    Start = 1'b1;
    #2750000
    reset = 1;
    #2800000
    Start = 1'b1;
    reset = 0;
    if (!Ready)
        Start = 1'b0;
end
// Free-running clock: toggles every 5000 ps
always
begin
    #5000
    clk = !clk;
end
endmodule
