Binarna nevronska mreža na programirljivem vezju FPGA by LANGERHOLC, KLARA
University of Ljubljana
Faculty of Computer and Information Science
Klara Langerholc
A binary neural network on an FPGA
BACHELOR’S THESIS
UNIVERSITY STUDY PROGRAMME
UNDERGRADUATE PROGRAMMES
COMPUTER AND INFORMATION SCIENCE
Mentor: izr. prof. dr. Uroš Lotrič
Co-mentor: doc. dr. Anton Biasizzo
Co-mentor: prof. dr. Dong Seog Han
Ljubljana, 2020


Copyright. The results of this thesis are the intellectual property of the author and of the home faculty of the University of Ljubljana. Publication and use of the results of the thesis require the written consent of the author, the faculty and the mentor.
The text was typeset with LaTeX.

Candidate: Klara Langerholc
Title: Binarna nevronska mreža na programirljivem vezju FPGA (A binary neural network on an FPGA)
Thesis type: Bachelor's thesis in the first-cycle university study programme Computer and Information Science
Mentor: izr. prof. dr. Uroš Lotrič
Co-mentor: doc. dr. Anton Biasizzo
Co-mentor: prof. dr. Dong Seog Han
Description:
Deep neural networks are becoming more and more important in our lives, and there is increasing interest in hardware specialized for the efficient execution of neural network models. In the thesis, prepare a Verilog design of a selected convolutional neural network with binary operands and synthesize it for an FPGA. In the design, focus on the parallel nature of the neural network. Test the designed system on a real dataset, check that it behaves correctly, and measure its main characteristics.

Contents
Abstract
Summary (Povzetek)
Extended summary (Razširjeni povzetek)
1 Introduction
2 Artificial neural networks
2.1 Fully-connected neural network
2.2 Convolutional neural network
2.3 Binary neural network
3 Field Programmable Gate Arrays
3.1 ZedBoard platform
4 Neural networks on FPGAs
5 Hardware implementation
5.1 Top module
5.2 Convolutional layer module
5.3 Max-pooling layer module
5.4 Fully-connected layer module
5.5 Last layer module
6 Results
6.1 Training
6.2 Classification
6.3 Implemented network
6.4 Resource utilization
6.5 Synthesis results
7 Conclusion
Bibliography
Abstract
Title: A binary neural network on an FPGA
Author: Klara Langerholc
In recent years, the performance of convolutional neural networks has been
increasing rapidly. But higher performance brings higher computational and
memory costs. Research has shown that good accuracy can be achieved even
when operands are constrained to only one or two bits. The purpose of this
work is to implement a binary neural network with operands constrained to
one bit on a field-programmable gate array. The computations in binary neu-
ral networks are mostly binary, while the weights require very little memory,
making them ideal for hardware implementation. The implemented network was tested on the MIO-TCD dataset, and the implementation focused mainly on resource consumption and speed.
Keywords: binary neural network, FPGA, Verilog, ZedBoard.

Summary (Povzetek)
Title: Binarna nevronska mreža na programirljivem vezju FPGA
Author: Klara Langerholc
In recent years, the capabilities of convolutional neural networks have grown steadily. With capability, however, the complexity of the networks also grows, and with it the consumption of computational resources and of memory for storing the weights. Research has shown that for many problems neural networks perform satisfactorily even with only one-bit or two-bit operands. In this work we implemented a binary convolutional network on a reprogrammable FPGA. The operations in binary networks are mostly binary, and storing the weights takes little space, which makes such networks well suited to hardware implementation. We tested the implemented network on the MIO-TCD image dataset, and during implementation we paid particular attention to resource consumption and execution speed.
Keywords: binary neural network, FPGA, Verilog, ZedBoard.

Extended summary (Razširjeni povzetek)
In recent years, convolutional neural networks have achieved excellent results in image recognition, speech recognition and data analysis. The networks are increasingly complex and contain many parameters, and computing their output requires many operations. It has recently been shown that convolutional neural networks can classify accurately even with only one-bit or two-bit operands. This presents an opportunity for developing networks on reprogrammable FPGAs.
Convolutional neural networks consist of several different layers, and the input to such a network consists of data arranged as spatially correlated structures, such as images with width, height and depth. The most common layers in a convolutional neural network are the convolutional layer, the pooling layer and the fully connected layer.
The convolutional layer convolves spatially adjacent inputs with filters of matching dimensions. Each filter looks for some visual feature of the inputs. The filter slides across the input matrix and is convolved with regions of the input. The convolution is computed as a dot product between the filter and the inputs, so each convolution yields one scalar, which is the input to the activation function. To preserve the input dimensions at the output of the layer and to prevent loss of data, zeros are added around the edges of the input matrix.
The pooling layer is often used between successive convolutional layers. Its purpose is to reduce the width and height of the input matrix while preserving its depth. It splits the input matrix into smaller parts and then reduces each part to a single number with a pooling operation. A common pooling operation is the maximum.
The fully connected layer connects all inputs to all outputs. Each output is computed as the sum of the products of the inputs and their corresponding weights.
Binary neural networks can be regarded as an extremely quantized version of convolutional neural networks. In a binary neural network, weights and activations take only two values, -1 and 1. Floating-point numbers can be converted to these two values according to their sign: values less than or equal to 0 are mapped to -1, and values greater than 0 are mapped to 1. It often turns out to be more convenient, however, to represent -1 as bit 0 and 1 as bit 1.
Binary convolutional neural networks consist of the same layers as ordinary convolutional neural networks. Because the numbers are represented with bits 0 and 1, bitwise operations can replace multiplication and addition when an input is convolved with a filter. The convolution can be carried out using only XNOR operations and counting the ones in a vector. First, XNOR is applied between the bits of an input region and the bits of the filter. Then the ones in the resulting vector are counted. The count is compared with a predetermined threshold, so each convolution yields one bit.
Because precision is very important when training any neural network, binary neural networks are also trained in floating point.
Field programmable gate arrays (FPGAs) are semiconductor devices that can be electrically programmed to implement any digital circuit. They consist of arrays of programmable logic cells, programmable interconnections and a set of programmable input and output cells on the edges of the device. In addition, they have a rich set of embedded components.
In this thesis we used the ZedBoard development board. The ZedBoard is not just an FPGA but a programmable system on chip (SoC), which means that it combines a processing system with programmable logic.
Convolutional neural networks demand a great deal of computing power and large amounts of memory. Such demands are a computational challenge for general-purpose sequential processors. When neural networks are implemented in hardware, their execution can be accelerated considerably, because many of their operations can run in parallel. FPGA platforms can be an excellent choice for embedded systems that require low power consumption. They have limitations, however, such as a limited amount of memory and logic cells, so the implementation must be planned carefully. Binary convolutional networks require little memory, mostly perform only simple binary operations and allow every convolution to be executed in parallel. These properties make them suitable for implementation on FPGAs.
In this thesis we implemented, on the ZedBoard development board, a binary neural network with four convolutional layers, four pooling layers and two fully connected layers. The network was first trained on the MIO-TCD image dataset [14], which has four classes of images: cars, pedestrians, cyclists and background. The dataset contains about 10,000 examples per class. Training was carried out in Python with the Theano framework. Before training, the images were resized and binarized to 32 × 32 bits. The resulting binary weights and the computed thresholds were stored in the internal memory of the chip.
The FPGA implementation is written in Verilog in the Vivado development environment. We prepared several interconnected modules. Each module represents one layer of the binary network, and a top module combines them and takes care of the correct connections between them.
When a 32 × 32 image is classified, one row of the image, that is 32 bits, is passed to the implemented system every eight clock periods. The module representing the first convolutional layer collects three such rows and starts processing them in four stages. In the first stage, all regions of size 3 × 3 are extracted. In the second stage, XNOR operations are performed between the regions and two filters. The first convolutional layer has 16 filters of size 3 × 3. In the third stage, the ones in each result of the second stage are counted. The fourth stage then only compares the results of the third stage with the thresholds.
These four stages are executed in sequence for every three rows, which are stored in registers on arrival at the module. The operations within each stage are executed as much in parallel as possible. Because the number of logic cells in the chosen device is limited, the convolutions with all filters for a given input cannot be executed in parallel, so only the convolutions with two filters are performed at a time. Since the first convolutional layer has 16 filters, the procedure must be repeated eight times for the three collected rows. Once the convolutions with all filters are done, the module passes the results to the pooling module.
The pooling layer module is implemented with OR operations, fully in parallel. The module only has to wait until it receives two rows from the preceding convolutional module, since it pools regions of size 2 × 2 into one bit.
The remaining convolutional and pooling layers are implemented in a similar way, except that their inputs have more dimensions and the convolutions are performed with different numbers of filters.
The fully connected layer module is also composed of three stages, which are executed one after another for each input. In the first stage, XNOR operations are performed between the inputs and the weights; in the second stage, the ones in the results of the first stage are counted; and in the third stage, the counts are compared with the thresholds. The operations in each stage are executed in parallel, with all weights at once.
The output of the binary convolutional neural network is a 28-bit vector representing four 7-bit numbers, one per class. The number with the largest value identifies the class into which the network has classified the input image.
We simulated the implementation in Vivado and compared the results with the Python implementation, confirming that the two produce the same results. The classification accuracy of the implemented binary network is 64.7 %. The network that performs convolutions with two filters at a time occupies practically all (98 %) of the logic slices on the chosen platform. If the convolution is computed with only one filter at a time, resource consumption drops considerably (37 %), but the execution time increases.
The implemented binary network needs 589 clock cycles to receive all input rows and compute the corresponding class. The computed maximum clock frequency is 93.48 MHz. This allows 158,730 images to be classified per second.
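The throughput figure can be checked with a short calculation. The sketch below assumes that images are processed one after another with no overlap, and it is only an inference (not stated in the text) that the reported 158,730 images per second comes from rounding the latency to 6.3 µs before inverting it.

```python
# Figures quoted in the summary.
cycles_per_image = 589
max_clock_hz = 93.48e6

# Time needed to classify one image at the maximum clock frequency.
latency_s = cycles_per_image / max_clock_hz
latency_us = latency_s * 1e6  # roughly 6.30 microseconds

# Throughput from the unrounded latency (assumes no overlap between images).
images_per_second = 1.0 / latency_s

# Throughput if the latency is first rounded to one decimal place (assumption).
images_per_second_rounded = 1.0 / (round(latency_us, 1) * 1e-6)
```

Under these assumptions the unrounded value is slightly lower than the rounded one, which matches the reported figure.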
The implementation leaves room for improvement and further research. Binary networks of various sizes could be prepared for chips of various sizes, and the implementation itself could be optimized through better design and by making use of the clock periods in which some parts of the circuit merely wait for the next data.

Chapter 1
Introduction
Convolutional neural networks (CNNs) have in recent years achieved strong results in fields such as image recognition, speech recognition and data analysis, in some cases exceeding human accuracy in object recognition [9]. However, modern CNNs may contain millions of floating-point parameters and require billions of floating-point operations to recognize a single image. For instance, VGG-16 from the 2014 ImageNet Large Scale Visual Recognition Competition (ILSVRC) [1] requires 552 MB of parameters and 30.8 billion floating-point operations per image [19]. It has been shown that convolutional neural networks can classify accurately even with only one-bit or two-bit weights and activations. This presents an opportunity for developing low-precision neural networks on field programmable gate arrays (FPGAs), which have limited on-chip memory but cope well with binary operations. Binarized neural networks (BNNs), proposed by Courbariaux et al. [7], are especially suited to hardware implementation, since they can be implemented almost entirely with binary operations.
In this thesis, convolutional neural networks, and binary neural networks in particular, are presented. FPGAs and their suitability for binary neural network inference are described. Then, a binary neural network with four convolutional layers, four max-pooling layers and two fully-connected layers is trained on a CPU. The inference algorithm is written in Verilog and synthesized for the ZedBoard FPGA platform.
In chapter 2, artificial neural networks are presented. We describe the layers of a convolutional neural network and focus on its highly quantized variant, the binary convolutional neural network. Chapter 3 presents field programmable gate arrays and the ZedBoard platform in particular. Chapter 4 discusses why it makes sense to implement binary neural networks on FPGAs. Chapter 5 describes our implementation of a binary neural network on an FPGA, with a detailed description of each implemented module. Chapter 6 presents the results, and chapter 7 concludes the thesis.
Chapter 2
Artificial neural networks
An artificial neural network [15] is a computational model loosely inspired by biological neurons. It consists of artificial neurons, which are grouped in layers.
2.1 Fully-connected neural network
Fully connected neural networks are a type of artificial neural network in which every neuron in one layer is connected to every neuron in the next layer.
As shown in Figure 2.1, a neuron receives its inputs, computes the sum of the products of the inputs and their corresponding weights, and adds a bias. The resulting value is then passed through an activation function, which sets the output of the neuron [16]. There are several activation functions; the best known are the sigmoid function, the hyperbolic tangent function and the rectified linear unit (ReLU).
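The computation of a single neuron can be sketched in a few lines of Python. The input values, weights and bias below are illustrative only, and the function names are chosen for this example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Illustrative values, not taken from the thesis.
x = [1.0, 0.5, -1.0]
w = [0.5, -0.4, 0.1]
b = 0.1

out_relu = neuron(x, w, b, relu)
out_tanh = neuron(x, w, b, math.tanh)
out_sigmoid = neuron(x, w, b, sigmoid)
```

The same weighted sum feeds all three activations; only the final non-linearity differs.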
A generic fully-connected neural network (Figure 2.2) consists of an input layer, an output layer and a number of hidden layers. We feed our data into the input layer. Each neuron in the input layer performs the computations described above and produces an output, which is an input to the first hidden layer. The hidden layer performs the same computations and feeds its outputs to the next hidden layer. This continues until we reach the output layer. This process is called forward propagation.
Figure 2.1: Artificial neuron.
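Forward propagation can be sketched as a loop over layers, where each layer is a list of weight rows (one per neuron) and a list of biases. The tiny network below is hypothetical, and ReLU is applied to every layer purely to keep the example short.

```python
def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)

def dense_layer(inputs, weights, biases, activation):
    """Compute the outputs of all neurons in one fully-connected layer."""
    return [activation(sum(x * w for x, w in zip(inputs, row)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Feed the outputs of each layer into the next one."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases, relu)
    return activations

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, 2.0]], [0.5]),
]
output = forward([2.0, 1.0], layers)
```

The hidden layer produces 0.5 and 2.0, and the output neuron combines them into a single value.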
When a neural network is being trained, the weights are first initialized randomly. Forward propagation is performed and the loss is calculated with a loss function, such as the mean squared error. The loss indicates the error of the calculated output compared to the ground truth. Weights and biases need to be changed so that the loss is minimized. The gradient of the loss with respect to the weights and biases gives the direction and rate at which they should be changed. The process of computing this gradient for the entire network is called backpropagation [24]. In backpropagation, the gradient for every weight and bias is calculated using the chain rule, going from the output layer to the input layer. The gradient is computed repeatedly and the parameters are updated accordingly. When building a neural network, it is important to determine the right hyperparameters, such as the learning rate and the complexity of the model. The learning rate determines the step size of the parameter update along the direction of the negative gradient. It has to be chosen carefully: if it is too small, the network converges slowly, and if it is too large, the coarse steps may overshoot the minimum and leave the loss high.
Figure 2.2: Fully-connected artificial neural network with an input layer, hidden layers and an output layer.
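The effect of the learning rate can be illustrated on the smallest possible model, a single linear neuron trained with the mean squared error. This is a minimal sketch with made-up data and plain gradient descent; it is not the training setup used in the thesis.

```python
def mse_loss(w, b, data):
    """Mean squared error of the model y = w * x + b."""
    return sum((w * x + b - t) ** 2 for x, t in data) / len(data)

def gradients(w, b, data):
    """Gradient of the mean squared error with respect to w and b."""
    n = len(data)
    dw = sum(2 * (w * x + b - t) * x for x, t in data) / n
    db = sum(2 * (w * x + b - t) for x, t in data) / n
    return dw, db

def train(data, learning_rate, steps):
    """Plain gradient descent starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = gradients(w, b, data)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

# Made-up data generated by t = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w_good, b_good = train(data, learning_rate=0.05, steps=500)
w_bad, b_bad = train(data, learning_rate=0.3, steps=20)

loss_good = mse_loss(w_good, b_good, data)
loss_bad = mse_loss(w_bad, b_bad, data)
```

With the smaller step the parameters approach w = 2 and b = 1, while the larger step overshoots on every update and the loss grows instead of shrinking.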
2.2 Convolutional neural network
The convolutional neural network [12] works on the same principle as the
fully-connected neural network. The main difference is that a convolutional
neural network takes advantage of the fact that the input consists of data
which can be arranged in the form of spatially correlated structures such as
images with width, height and depth. This makes a convolutional network more complex than a regular neural network, because every layer becomes a multi-dimensional array instead of a single column of neurons. A convolutional neural network can be seen in Figure 2.3. It consists of different types of layers, which are explained in the following sections.
Figure 2.3: A convolutional neural network.
2.2.1 Convolutional layer
The convolutional layer is the core building block of the convolutional network. Neurons in this layer are not connected to every neuron in the previous layer. Instead, every neuron is connected to a local region of the previous layer. There are two important concepts: feature maps and kernel filters. Feature maps are the output data from the previous layer. When they enter a new layer, all the feature maps together are called the input activation of that layer. Kernel filters are trainable arrays that are very small along the width and height but extend along the full depth of the input activation, as seen in Figure 2.4. Each kernel filter looks for a different visual feature, so the number of kernel filters determines both the number of features that a convolutional layer extracts and the depth of the output activation.
In Figure 2.4, a 6 × 6 zero-padded input activation with depth 3 is convolved with two 3 × 3 filters of depth 3. The first channel of the input (there are three channels, because depth = 3) is convolved with the first slice of the filter, the second channel with the second slice and the third channel with the third slice. With stride 1, this produces three 4 × 4 maps. These maps are then added together, which results in one 4 × 4 feature map per filter. The output is therefore of size 4 × 4 with depth 2, since there are two filters.
Figure 2.4: Multidimensional convolution.
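The shapes in this example can be reproduced with a small pure-Python sketch. The input of ones and the two filters are made up for illustration, and the helper names `zero_pad` and `convolve` are hypothetical.

```python
def zero_pad(channel, pad):
    """Surround one 2-D channel with `pad` rows and columns of zeros."""
    width = len(channel[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in channel]
    padded += [[0] * width for _ in range(pad)]
    return padded

def convolve(channels, kernel, stride=1):
    """Convolve padded channels with one filter (one k x k slice per channel).

    The per-channel results are summed, so one filter yields one 2-D map.
    As is usual in CNNs, the filter is not flipped (cross-correlation).
    """
    k = len(kernel[0])
    rows = (len(channels[0]) - k) // stride + 1
    cols = (len(channels[0][0]) - k) // stride + 1
    return [[sum(channels[c][i * stride + di][j * stride + dj] * kernel[c][di][dj]
                 for c in range(len(channels))
                 for di in range(k)
                 for dj in range(k))
             for j in range(cols)]
            for i in range(rows)]

# Illustrative data: a 4 x 4 input of ones with depth 3, padded to 6 x 6.
image = [[[1] * 4 for _ in range(4)] for _ in range(3)]
padded = [zero_pad(channel, 1) for channel in image]

ones_filter = [[[1] * 3 for _ in range(3)] for _ in range(3)]
centre_filter = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]] for _ in range(3)]

# One feature map per filter: output depth 2, each map 4 x 4.
output = [convolve(padded, f) for f in (ones_filter, centre_filter)]
```

The filter of ones sums 27 values in the interior but only 12 non-zero values in a corner, which shows how zero padding affects the border positions.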
During forward propagation, every kernel filter slides across the entire input activation in steps of a predetermined size called the stride. Convolution between the kernel filter elements (also referred to as weights) and the corresponding local region of the input activation is performed at every spatial position, producing a two-dimensional feature map (Figure 2.5). Convolution can be represented as a matrix multiplication, which essentially computes the dot product of each row of matrix A with each column of matrix B. Computing the dot product translates to multiply-accumulate (MAC) operations. As a filter slides across the image, it is activated when it is convolved with a certain type of image feature, such as a line, curve, edge, corner or a certain combination of colours. Every kernel filter tries to identify a different feature in the given input and stores the result in one 2-D activation map. N filters generate N 2-D feature maps. These N feature maps are joined along the depth to make one 3-D output activation map, which is then used as input for the next layer.
Figure 2.5: Convolution between an area of the input and a kernel filter. The same filter is convolved with each possible area of the input, and a scalar is produced for each convolution. The output is a two-dimensional map of all the produced scalars.
A non-linear activation function is applied after almost every convolutional and fully connected layer in most neural networks. As mentioned before, there are many different activation functions. The rectified linear unit has been shown to give better results than other non-linear functions, because it requires less computation time and can also improve performance [8]. The function returns 0 for any negative input and returns any positive input unchanged.
Before the non-linear function is applied, batch normalization is often used. It normalizes each input channel across a mini-batch. Instead of using the entire dataset to normalize activations, we use mini-batches, as each mini-batch produces estimates of the mean and variance of each activation. The size of a mini-batch is a parameter that needs to be determined before training. Batch normalization is used to speed up the training of convolutional neural networks and to reduce the sensitivity to network initialization. First, the activations of each input channel are normalized by subtracting the mini-batch mean (µ) and dividing by the mini-batch standard deviation (σ). Then, the normalized values are scaled by a learnable scale factor γ and shifted by a learnable offset β.
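The two steps of batch normalization can be written out for a single channel as follows. This is a minimal sketch: the small constant `eps` added to the variance for numerical stability is a common convention and an assumption here, and the batch values are made up.

```python
import math

def batch_norm(values, gamma, beta, eps=1e-5):
    """Normalize one channel over a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance + eps)  # eps avoids division by zero (assumption)
    return [gamma * (v - mean) / std + beta for v in values]

# Made-up activations of one channel across a mini-batch of four samples.
batch = [1.0, 2.0, 3.0, 4.0]

normalized = batch_norm(batch, gamma=1.0, beta=0.0)
scaled_shifted = batch_norm(batch, gamma=2.0, beta=0.5)
```

With γ = 1 and β = 0 the result has mean close to 0 and variance close to 1; other values of γ and β move the distribution to whatever scale and offset training finds useful.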
In order to keep information during the convolutions with filters, we use zero padding on the input activations, as seen in Figure 2.4. Without zero padding, the edges of the inputs would not be convolved with filters as many times as the data in the center of the matrices, so we would lose information around the edges. With a stride of 1 and a suitable amount of padding (one row or column of zeros on each side for a 3 × 3 filter), the output width and height are equal to the input width and height.
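The relation between input size, filter size, padding and stride follows the standard output-size formula, sketched below. The function name is hypothetical; the 32-pixel width is the image size used later in the thesis.

```python
def conv_output_size(input_size, kernel_size, padding, stride):
    """Width (or height) of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

same_size = conv_output_size(32, kernel_size=3, padding=1, stride=1)
shrunk = conv_output_size(32, kernel_size=3, padding=0, stride=1)
```

With one ring of zeros a 3 × 3 filter preserves the 32 × 32 size, whereas without padding each convolution removes one pixel from every border.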
2.2.2 Pooling layer
A pooling layer is often used between successive convolutional layers. Its purpose is to reduce the spatial dimensions of the input activation while keeping its depth. There are different kinds of pooling layers; the most used are the average pooling layer and the max-pooling layer.
A window of predetermined constant size (most commonly 2 × 2) is moved with a given stride (most commonly 2) over each 2-D feature map in the input activation independently, and, in the case of a max-pooling layer, the maximum within that window is taken. The reduction in spatial dimensions reduces the amount of data passed to the following layers, which lowers the memory requirements and the computation cost. Additionally, the pooling layer helps to control overfitting by providing distortion and scale invariance.
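Max-pooling with a 2 × 2 window and stride 2 can be sketched as follows; the feature map values are made up for illustration.

```python
def max_pool(feature_map, window=2, stride=2):
    """Take the maximum of each window of a 2-D feature map."""
    rows = (len(feature_map) - window) // stride + 1
    cols = (len(feature_map[0]) - window) // stride + 1
    return [[max(feature_map[i * stride + di][j * stride + dj]
                 for di in range(window)
                 for dj in range(window))
             for j in range(cols)]
            for i in range(rows)]

# Made-up 4 x 4 feature map.
feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]

pooled = max_pool(feature_map)  # 2 x 2 result: [[4, 2], [2, 8]]
```

Each 2 × 2 block collapses to one value, so the width and height are halved while the number of feature maps stays the same.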
2.2.3 Fully connected layer
A fully connected layer in a convolutional neural network is the same as a hidden layer in a regular fully-connected neural network. Each neuron has connections to all the neurons in the previous layer. A fully connected layer does not take into account the spatial properties of images, so the inputs coming from a convolutional or pooling layer are first flattened into a one-dimensional array. Because of that, convolutional layers cannot follow a fully connected layer. Generally, a stack of fully connected layers follows the convolutional and pooling layers, ending with a final layer that determines the outputs, just like in a fully-connected neural network.
Figure 2.6: Max-pooling layer.
2.3 Binary neural network
Binary or binarized neural networks [7] can be considered an extremely quantized version of convolutional neural networks. In a binary neural network, weights and activations take only two values, -1 and 1. To transform floating-point weights and activations into these two binary values, several binarization methods have been proposed. One of them is the deterministic binarization method, which is a simple sign function. The binary weight can be expressed as
\[
w_b = \operatorname{sign}(w) =
\begin{cases}
-1 & \text{if } w \le 0, \\
+1 & \text{otherwise,}
\end{cases}
\]
where w is the real-valued, full-precision weight.
Another way to binarize the floating-point weights is stochastic binarization, which is more precise than deterministic binarization but requires random bits to be generated while quantizing, and therefore more computation and memory. Deterministic binarization is thus more commonly used, especially in hardware implementations, and the results are comparable in both cases.
Representing weights and inputs with -1 and 1 is convenient in only partially binarized neural networks, for example when only the weights are binarized and the inputs are not. In the fully binarized neural networks described here, it is more convenient to represent the weights and inputs with 0 and 1. Therefore we map -1 to 0 and 1 to 1. This approach requires only one bit to store each weight or activation and also enables simple bitwise computations.
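Both binarization methods and the mapping to bits can be sketched as follows. The deterministic rule follows the sign function given above. The stochastic rule uses the hard-sigmoid probability commonly described in the BNN literature; the text here does not give its formula, so treat that part as an assumption. All function names are hypothetical.

```python
import random

def binarize_deterministic(w):
    """Sign function: -1 for w <= 0, +1 otherwise."""
    return -1 if w <= 0 else 1

def hard_sigmoid(x):
    """Clip (x + 1) / 2 to the interval [0, 1] (assumed probability model)."""
    return min(1.0, max(0.0, (x + 1.0) / 2.0))

def binarize_stochastic(w, rng):
    """Return +1 with probability hard_sigmoid(w), otherwise -1."""
    return 1 if rng.random() < hard_sigmoid(w) else -1

def to_bit(b):
    """Map -1 to bit 0 and +1 to bit 1."""
    return (b + 1) // 2

# Made-up full-precision weights.
weights = [0.7, -0.2, 0.0, 1.3]

signs = [binarize_deterministic(w) for w in weights]  # [1, -1, -1, 1]
bits = [to_bit(s) for s in signs]                     # [1, 0, 0, 1]

rng = random.Random(0)
sampled = [binarize_stochastic(w, rng) for w in weights]
```

The stochastic variant needs a random number for every weight, which is the extra cost mentioned above; the deterministic variant is a single comparison.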
Binary convolutional neural networks are essentially a simplified version of convolutional neural networks and therefore consist of the same layers. They allow, however, a few computational optimizations, which follow from the fact that the weights and inputs are binarized.
During forward propagation, the weights and inputs are binarized at each layer. Because all the input activations and weights are binarized, multiply-accumulate operations in the convolutional and in the fully-connected layers can be performed by XNOR and population count (popcount) operations, as shown in Figure 2.7. With -1 encoded as bit 0, the XNOR of two bits is 1 exactly when the product of the two corresponding values is +1. The popcount operation simply counts the number of set bits (number of ones) in a vector. This reduces the computational and memory requirements.
In the convolutional layer, we first perform XNOR operations between the
filter bits and the activation area bits. Then, the number of set bits is counted
in the resulting bits. This repeats for all the possible areas (with the same
size as filter) of the input activation to form the final matrix. Afterwards,
the values in the resulting matrix are compared to a certain threshold and
are binarized (Figure 2.7).
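The equivalence between the XNOR-popcount computation and the ordinary dot product over -1 and +1 can be checked with a short sketch. For n bits and popcount p, the dot product equals 2p − n, because each matching bit contributes +1 and each differing bit contributes -1. The bit patterns and helper names below are made up for illustration.

```python
def xnor_popcount(a_bits, w_bits, n):
    """XNOR two n-bit integers and count the ones in the result."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ w_bits) & mask
    return bin(xnor).count("1")

def to_signed(bits, n):
    """Expand an n-bit integer into a list of -1/+1 values (bit 0 -> -1)."""
    return [1 if (bits >> i) & 1 else -1 for i in range(n)]

n = 9  # one 3 x 3 window
activation_bits = 0b101100111
filter_bits = 0b100101011

popcount = xnor_popcount(activation_bits, filter_bits, n)
dot = sum(a * w for a, w in zip(to_signed(activation_bits, n),
                                to_signed(filter_bits, n)))

# Sign activation without batch normalization: dot > 0 is the same as
# popcount > n // 2.
output_bit = 1 if popcount > n // 2 else 0
```

Here six of the nine bit positions match, so the popcount is 6 and the dot product is 2 · 6 − 9 = 3.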
The usual activation function for a binary neural network is the sign function. By counting the set bits of the XNOR result (bitcount) and comparing that number to a trainable threshold, we can implement the sign activation efficiently [17]. This makes the computation much simpler.
Figure 2.7: Convolution with XNOR and popcount operations.
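One common way to obtain such a threshold is to fold batch normalization and the sign function into a single integer comparison. The sketch below assumes a positive scale factor γ and the sign convention used above (zero maps to -1); whether the thresholds in this work are derived exactly this way is an assumption. With x = 2p − n, the condition γ(x − µ)/σ + β > 0 is equivalent to p > (n + µ − βσ/γ)/2.

```python
import math

def fold_threshold(n, mu, sigma, gamma, beta):
    """Smallest popcount for which batch norm followed by sign gives +1.

    Assumes gamma > 0 and sigma > 0; for negative gamma the comparison flips.
    """
    if gamma <= 0 or sigma <= 0:
        raise ValueError("this sketch only covers gamma > 0 and sigma > 0")
    tau = (n + mu - beta * sigma / gamma) / 2.0
    return math.floor(tau) + 1  # p > tau is the same as p >= floor(tau) + 1

def reference_bit(p, n, mu, sigma, gamma, beta):
    """Unfolded computation: dot product, batch norm, then sign as a bit."""
    x = 2 * p - n
    return 1 if gamma * (x - mu) / sigma + beta > 0 else 0

# Made-up batch normalization parameters for a 9-input neuron.
n, mu, sigma, gamma, beta = 9, 0.5, 2.0, 1.5, -0.3
threshold = fold_threshold(n, mu, sigma, gamma, beta)
folded_bits = [1 if p >= threshold else 0 for p in range(n + 1)]
```

At inference time only the integer comparison remains, so no floating-point arithmetic is needed in the circuit.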
Pooling layers are inserted at regular intervals in binary neural networks to reduce the spatial dimensions of the activations, and they work in the same way as in non-binarized convolutional neural networks.
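When activations are stored as bits 0 and 1, the maximum of a window is 1 whenever any bit in the window is 1, so max-pooling reduces to a logical OR. A minimal sketch with a made-up bitmap:

```python
def binary_max_pool(bitmap):
    """2 x 2 max-pooling with stride 2 on a 0/1 map, implemented with OR."""
    return [[bitmap[i][j] | bitmap[i][j + 1] | bitmap[i + 1][j] | bitmap[i + 1][j + 1]
             for j in range(0, len(bitmap[0]) - 1, 2)]
            for i in range(0, len(bitmap) - 1, 2)]

# Made-up 4 x 4 binary feature map.
bitmap = [[0, 0, 1, 0],
          [0, 0, 0, 0],
          [1, 1, 0, 0],
          [1, 0, 0, 0]]

pooled_bits = binary_max_pool(bitmap)  # [[0, 1], [1, 0]]
```

In hardware each output bit is therefore a single four-input OR gate, and all windows can be evaluated at the same time.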
Once forward propagation is completed and the loss is computed, the gradient with respect to every parameter is calculated by moving backwards through the binary neural network. In backpropagation, precision is very important if we want a well-trained network. Therefore real-valued gradients are calculated and stored, and real-valued weights are used during the backward pass.
As already explained, in the process of backpropagation the network parameters are updated using the derivatives of the loss. Normally, the activation functions are easily differentiable. The problem in binary neural networks is that the derivative of the sign function is zero almost everywhere, which makes it incompatible with backpropagation: when the gradients of the loss with respect to the weights are computed using the chain rule, the sign function makes them zero. We use a straight-through estimator [7] to cope with this problem. The straight-through estimator backpropagates through the sign function by treating it as the identity function, so the incoming gradient is passed through unchanged, and it cancels the gradient when the magnitude of the input exceeds 1. The estimate of the gradient with respect to an activation sum z can be expressed as
\[
g_z =
\begin{cases}
g_y & \text{if } |z| \le 1, \\
0 & \text{otherwise,}
\end{cases}
\]
where g_y is the gradient of the loss function at the output of the activation function and z is the activation sum, which is the input to the activation function y = sign(z).
From the gradient g_z, the gradients of the loss function with respect to the weights are easily obtained.
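The forward and backward behaviour of the sign activation with a straight-through estimator can be sketched as two small functions. The names are hypothetical and the values are made up.

```python
def sign(z):
    """Forward pass: -1 for z <= 0, +1 otherwise."""
    return -1 if z <= 0 else 1

def ste_backward(z, grad_output):
    """Backward pass: pass the gradient through when |z| <= 1, else cancel it."""
    return grad_output if abs(z) <= 1 else 0.0

# Made-up activation sums and an incoming gradient g_y.
activation_sums = [0.3, -0.8, 1.0, -2.0, 1.5]
g_y = 0.5

outputs = [sign(z) for z in activation_sums]
g_z = [ste_backward(z, g_y) for z in activation_sums]
```

Gradients survive only for activation sums inside the interval from -1 to 1, which keeps training possible even though the true derivative of the sign function is zero almost everywhere.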
Chapter 3
Field Programmable Gate Arrays
Field programmable gate arrays (FPGAs) are semiconductor devices that can be electrically programmed to implement any digital circuit [6]. An FPGA consists of arrays of programmable logic cells, called configurable logic blocks (CLBs), programmable interconnections and a set of programmable input and output cells on the edges of the device. In addition, FPGAs have a rich set of embedded components such as digital signal processing (DSP) blocks, used to perform arithmetic-intensive operations, block RAMs (BRAMs), look-up tables (LUTs), flip-flops (FFs), clock management units, high-speed I/O links, and others.
Their reconfigurability distinguishes them from application-specific integrated circuits (ASICs), which are custom built for a specific design. Like ASICs, FPGA designs are described using hardware description languages (HDLs) such as Verilog and VHDL. Once the design is described in an HDL, it is compiled, synthesized and implemented to create a configuration file, also known as a bitstream file. The bitstream contains information about how the different components of the FPGA should be wired. Once the bitstream is downloaded to the FPGA, the device is configured to run the design until it is powered off. There are two main FPGA vendors, Xilinx and Intel FPGA (Altera), which together hold around 90% of the FPGA market [23]. Xilinx, Intel and other companies have their own tools for the entire FPGA design process, such as Xilinx's Vivado and ISE. With high-level synthesis, an FPGA can also be programmed directly in a high-level language such as C++, C, SystemC, OpenCL or Java. The appeal of reprogrammable circuits is that they combine the speed of hardware with the flexibility of software.
Figure 3.1: Basic structure of an FPGA [2].
Since their introduction, FPGAs have been a popular choice for ASIC
engineers to test various aspects of a design. The main reason behind their
success is their reconfigurability. If the design turns out to be faulty, it
can be corrected just by changing the HDL code and downloading a new
bitstream onto the FPGA. The price of this reconfigurability is a lower
clock speed than in ASICs, because an ASIC is optimized for one specific
design. Power consumption in FPGA designs is also higher than in custom
ASICs. In addition, not every design can be implemented on every FPGA, since
FPGAs have limited resources. Overall, FPGAs are preferred for lower speed
designs and for lower quantity production than ASIC designs [5]. In general,
the amount of exploitable parallelism is the key factor in determining the
suitability of an application for FPGAs. FPGAs can only outperform modern
processors by exploiting huge amounts of parallelism.
FPGAs have the advantage of maximizing performance per watt of power
consumption, reducing costs for large scale operations. This makes them an
excellent choice for accelerators for battery operated devices and in cloud
services on large servers. FPGAs have recently been widely used for deep
learning acceleration because of their flexibility and their ability to
support highly parallel architectures, resulting in high execution speeds.
The support for software-level programming models, such as the Open
Computing Language (OpenCL) standard, in FPGA tools has made them more
attractive for use in deep learning.
3.1 ZedBoard platform
The ZedBoard [4] is an evaluation and development board based on the Xilinx
Zynq-7000 All Programmable System-on-a-Chip (SoC). The device on the
ZedBoard is therefore not just an FPGA, but a programmable SoC, which
combines a processing system (PS) with programmable logic (PL). In
particular, it combines a dual-core ARM Cortex-A9 processing system with
85,000 Series-7 programmable logic cells. The board contains interfaces and
supporting functions to enable a wide range of applications.
The processing system mainly includes on-chip memory, external memory
interfaces, an 8-channel DMA controller and a variety of I/O peripheral
interfaces. The programmable logic mainly consists of configurable logic
blocks (CLBs), block RAMs (BRAMs), digital signal processing (DSP) blocks,
programmable I/O blocks, PCIe blocks, high-speed transceivers and
analog-to-digital converters (ADCs). Designs are most easily developed in
the Xilinx Vivado [3] environment for rapid development of hardware and
software designs.
Figure 3.2: ZedBoard [4].
Chapter 4
Neural networks on FPGAs
Convolutional neural networks require very high computational power and
large amounts of memory [22]. Such requirements are a computational
challenge for general-purpose processors, since they execute most of the
operations required by CNNs sequentially, which leads to low efficiency.
Parallel processing, on the other hand, can provide tremendous speedups.
When implemented in hardware, neural networks can take full advantage of
their inherent parallelism and run orders of magnitude faster than software
simulations. Therefore, heterogeneous computing platforms, such as GPUs,
FPGAs and ASICs, are widely used to accelerate CNN tasks.
In practice, CNNs are trained off-line using backpropagation. The trained
CNNs are then used to perform recognition tasks using the feed-forward
process. Therefore, the speed of the feed-forward process is what matters.
GPUs are the most widely used hardware accelerators for improving both the
training and the classification processes in CNNs [21]. This is due to their
high memory bandwidth and throughput, as they are highly efficient in
floating-point matrix-based operations. However, GPU accelerators consume a
large amount of power. Such power consumption can be tolerated in
high-performance servers, but for embedded systems such as mobile devices
and robots, which are mostly powered by batteries, low power consumption
becomes essential. Recently, FPGAs have become an option for hardware
acceleration because of their low energy consumption and high computational
power, which gives them a higher power efficiency [18]. FPGA and ASIC
hardware accelerators have relatively limited memory, I/O bandwidth and
computing resources compared to GPU-based accelerators. However, they can
achieve at least moderate performance with lower power consumption.
Without careful planning of deep learning networks and maximal resource
sharing, the desired implementation may not fit on an FPGA due to its
limited logic resources. A design implemented on an FPGA must limit the
amount of memory it uses during execution. It also needs a suitable
architecture in terms of depth, layers and number of weights, since all of
these determine the amount of memory and resources the design needs.
Convolutional neural networks have a very useful property, that is, the
same filter weights are used for multiple convolutions. This makes them more
appealing for hardware implementations. In addition, due to the structure of
artificial neural networks, computations for each node in a layer are generally
independent of all other nodes. In terms of convolutional layers, this means
that we can perform multiple convolutions in parallel. Artificial neural net-
works are inherently parallel, which makes them a great choice for hardware
accelerators.
The weights used in an artificial neural network require a considerable
amount of storage, which is not always available on-chip on an FPGA. Using
floating-point precision increases the memory usage on the FPGA, so it is a
good idea to use fixed-point precision when implementing on an FPGA. Using a
highly quantized neural network, such as a binary neural network,
drastically reduces memory requirements. Not only do binary neural networks
consume less memory, they also reduce the complexity of calculations, since
their dominant computations are binary logic operations. This reduces the
usage of computational resources. Even the activation function, which would
normally use a lot of resources, can be implemented as a simple threshold
comparison in a binary neural network [17]. Binary neural networks are
therefore very well suited for implementation in hardware, even though they
come with a small decrease in accuracy.
Chapter 5
Hardware implementation
The binary convolutional neural network is written in Verilog and designed in
a fully modular way. There are five main modules, described in the following
subsections. In Figures 5.1, 5.2 and 5.3, the connections between the modules
can be seen.
Figure 5.1: Neural network implementation (part 1).
Figure 5.2: Neural network implementation (part 2).
Figure 5.3: Neural network implementation (part 3).
The N × N input matrix to be classified is sent to the network row by row
(one row has N bits). The first convolutional layer buffers three rows,
performs calculations on them and sends the result to the following
max-pooling layer. The max-pooling layer buffers two rows, performs
computations and sends the result to the next convolutional layer. This
continues until we reach the last layer. The output of the last layer
consists of bits representing a value for each class of the dataset used.
The class with the highest value is the classification result. The whole
process is pipelined: while a max-pooling layer module is performing
computations on the buffered rows, the preceding convolutional layer module
is already performing computations on the next rows.
All the computations in the max-pooling layer modules are performed in
parallel. Once the max-pooling layer has buffered its inputs, it calculates
all the outputs at the same time and needs only one clock cycle for that.
The computations in the fully-connected layer are split into three stages.
All the computations in the first stage are executed in parallel and take
one clock cycle. Then, the second stage computations are executed all at
once on the data coming from the first stage. In this clock cycle, new data
is already being processed in the first stage. Therefore, it takes three
clock cycles for the buffered data to go through all stages and for the
output to be produced, but the next output is available already in the next
clock cycle. With the three-stage pipelined design of the fully-connected
layer, the latency is three clock cycles, while maximal throughput is
achieved.
The computations in the convolutional layer modules are performed similarly
to those in the fully-connected layer module, except that there are four
stages and each buffered input goes through all of them M/2 times (M is the
number of filters in this layer), because convolutions are performed with
only two filters at the same time. This means that it takes four clock
cycles to convolve a buffered input with two filters. The stages are
pipelined, so the result of the convolution with the next two filters is
available in the next clock cycle. It therefore takes M/2 + 3 clock cycles
to convolve the buffered input with all the filters. If the next buffered
input is ready three clock cycles before the previous one finishes
computing, it only takes M/2 additional clock cycles to convolve it with all
filters. In this case, the new input enters stage one while the old one is
still in stage two, being convolved with the last two filters. Every time a
convolution with two filters is completed, the result is written to the
correct bit positions in the output register, which forms a valid bit
sequence once the convolutions with all filters have been performed. Before
that, the output of the convolutional layer module is not valid. A part of
the convolutional layer module schematic is shown in Figures 5.4 and 5.5,
where the dataflow between the modules of the individual stages is visible.
All the other elements, such as multiplexers and registers, are part of the
logic that controls the dataflow between the modules. Figure 5.6 shows a
part of the schematic of the module that performs the computations in the
second stage of the convolutional layer module. It can be seen that all the
XNOR operations in one stage are executed in parallel.
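The cycle counts described above can be checked with a few lines of Python. This is only a sketch of the arithmetic, not a model of the Verilog design; the function names are hypothetical, and the defaults (four stages, two filters per clock cycle) are taken from the description above.

```python
def first_output_cycles(num_filters, filters_per_cycle=2, stages=4):
    """Clock cycles needed to convolve one buffered input with all filters."""
    passes = num_filters // filters_per_cycle   # M/2 passes through the pipeline
    return passes + (stages - 1)                # plus 3 cycles to fill the pipeline


def next_output_cycles(num_filters, filters_per_cycle=2):
    """Additional cycles per input when the next input arrives in time."""
    return num_filters // filters_per_cycle


print(first_output_cycles(16))  # 11
print(next_output_cycles(16))   # 8
```

For a layer with M = 16 filters, the first valid output therefore appears after 11 clock cycles and every further output after 8 additional clock cycles.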
Figure 5.4: Schematic of a convolutional layer module (part 1).
Figure 5.5: Schematic of a convolutional layer module (part 2).
Figure 5.6: Schematic of a module, which performs computations in the
second stage of the convolutional layer module. It performs parallel XNOR
operations between two filters and parts of the input activation.
5.1 Top module
The basic function of the top module is to ensure the correct dataflow
between the other main modules. The input to the top module consists of N
activation bits, which represent one row of an N × N input matrix. At every
rising clock edge, the top module writes the received row of N activation
bits to a register, which is read by the first convolutional layer module.
The output of this layer is then written to a register, which is read by the
first max-pooling layer. The top module ensures this dataflow. It also reads
the required weights and thresholds from memory and sends them to the
correct modules. It ensures that the modules are connected to each other
correctly and that they receive the correct data on time. When the final
output comes from the last layer module, the top module writes it to the
output register.
In short, the top module receives the input data, transfers it to other
modules and outputs the final result. Essentially, all other modules are a
part of the top module. Figures 5.1, 5.2 and 5.3 together form the top
module.
5.2 Convolutional layer module
The convolutional layer module receives D × N activation bits, where N is
the input activation width and D is the depth of the input activation (in
the first convolutional layer, D is 1).
First, K rows of D × N bits are buffered, where K is the kernel filter
size. In the beginning, the first row in the buffer consists of all zeroes,
to enable zero padding. When the first row arrives in the module, one zero
bit is added to the beginning of the row and one to the end. This row is
then added to the buffer. Every time a new row comes in, we add zeroes to it
and put it in the buffer, until there are K rows (including the first row
made of zeroes) in it. Then we start computing. When the next row comes in,
we put it in the buffer and discard the oldest row, so that the buffer
always holds K rows. We count the rows that come into the buffer, and when
the N-th row has come in, we put a row of all zeroes after it. This is how
2 × 2 zero padding with stride 2 of the input image is performed.
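The buffering scheme can be sketched in Python as follows. This is a software illustration, not the Verilog module; it assumes K = 3 and a single input channel (D = 1), and the function name is hypothetical.

```python
from collections import deque


def padded_row_windows(rows, k=3):
    """Yield the K buffered rows each time the buffer is full.

    Assumptions for this sketch: k = 3, depth D = 1, rows is a list of
    equally long lists of bits.
    """
    n = len(rows[0])
    zero_row = [0] * (n + 2)
    buffer = deque([zero_row], maxlen=k)       # buffer starts with a zero row
    for row in rows + [None]:                  # None marks the final zero row
        padded = zero_row if row is None else [0] + list(row) + [0]
        buffer.append(padded)                  # the oldest row is discarded
        if len(buffer) == k:
            yield [list(r) for r in buffer]


image = [[1, 1, 1, 1] for _ in range(4)]       # hypothetical 4 x 4 input
windows = list(padded_row_windows(image))
print(len(windows))    # 4
print(windows[0])      # [[0, 0, 0, 0, 0, 0], [0, 1, 1, 1, 1, 0], [0, 1, 1, 1, 1, 0]]
```

One window is produced per input row, so the number of output rows equals the number of input rows.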
Convolutional operations are performed on the buffered rows in four stages.
All the computations inside a stage are performed in parallel, but
convolutions are done with only two filters at the same time. Therefore, all
the stages need to execute M/2 times for each K × D × N buffered input,
where M is the number of filters in this layer. This means that the bits in
the input buffer change only every M/2 clock cycles. In each clock cycle,
one result of a convolution with two filters is obtained (N × 2 bits), and
only after M/2 clock cycles is the final output of the convolutional layer
for a given buffered input obtained (N × M bits).
For instance, consider a 32 × 32 image to be classified, which is sent into
the top module row by row. One row is sent to the first convolutional layer
every 8 clock periods. The first convolutional layer first buffers 3 rows
and begins performing operations. The calculations take 8 clock periods,
because there are 16 filters and 2 filters are convolved at the same time,
plus 3 clock periods for propagating the rows through the four stages, which
is 11 clock periods all together. After 8 clock periods, the next row is
received and added to the buffer in a first-in-first-out manner. Then the
calculations are performed on the new data in the buffer. When the new data
enters the first stage, the previous data is still in the second stage and
needs 3 more clock periods to come out of the fourth stage. This means that
a valid output of the convolutional layer is obtained every 8 clock periods,
3 periods after a new input comes in. The next layer (max-pooling) needs to
take this into account.
Each stage is implemented in its own module. At a rising clock edge, the
convolutional layer module saves the buffered bits to a register, which is
read by the first stage module.
The first stage receives K × D × (N + 2) buffered bits and extracts all the
possible spatially correlated K × K blocks of the padded input activations.
There are N × D such blocks. The K × K × N × D resulting bits are saved in a
register, which is read by the second stage module.
In the second stage module, XNOR operations are performed between the N
extracted blocks of K × K × D bits and 2 filters of K × K × D bits,
resulting in K × K × D × N × 2 bits.
Figure 5.8 shows the Verilog code for this module. At a rising clock edge,
if the module is enabled (the signal start comes from the first stage module
when it produces a new output), XNOR operations between the blocks of the
buffered inputs and two filters are performed. The correlated blocks come
from the preceding first stage module. There are two for loops. The inner
loop goes through all the blocks in the input and the outer one goes through
the two filters.
The threshold bits are also an input to this module. Register f stores the
index of the threshold that belongs to the filters currently being
convolved. It is incremented every time the input blocks are convolved with
a filter. When it reaches the total number of filters, it is reset to 0. At
this point, the next buffered input data from the convolutional layer module
is already in the first stage module and will arrive in this module in the
next clock cycle to be convolved with the first two filters, with indexes 0
and 1.
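The first two stages can be illustrated in Python. This is a behavioral sketch, not the Verilog code from Figure 5.8; it assumes a single input channel (D = 1), K = 3, and that the bits 1 and 0 encode the values +1 and -1. All names are hypothetical.

```python
def extract_blocks(window, k=3):
    """Stage 1: slide a K x K block over K zero-padded rows (D = 1 assumed)."""
    n = len(window[0]) - (k - 1)               # number of blocks = row width N
    return [[window[r][c + dc] for r in range(k) for dc in range(k)]
            for c in range(n)]


def xnor_stage(blocks, filters):
    """Stage 2: XNOR every block with every filter (1 where the bits agree)."""
    return [[[1 - (b ^ w) for b, w in zip(block, flt)]   # inner loop: blocks
             for block in blocks]
            for flt in filters]                           # outer loop: filters


# Hypothetical 3 padded rows of a 3-bit wide input and two 9-bit filters.
window = [[0, 0, 0, 0, 0],
          [0, 1, 0, 1, 0],
          [0, 1, 1, 1, 0]]
filters = [[1] * 9, [0] * 9]

blocks = extract_blocks(window)
result = xnor_stage(blocks, filters)
print(blocks[1])      # [0, 0, 0, 1, 0, 1, 1, 1, 1]
print(result[1][1])   # [1, 1, 1, 0, 1, 0, 0, 0, 0]
```

In hardware all these XNOR operations are evaluated at once, whereas the loops here only enumerate them.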
Figure 5.7: Convolutional layer implementation in the hardware design.
Figure 5.8: A module, performing XNOR operations with two filters in Ver-
ilog.
5.2.1 Population count
After performing XNOR operations, the third stage counts the number of ones
in each D × K × K vector, resulting in log2(D × K × K) bits (rounded up).
The counting of ones in a binary vector is also called population count
(popcount). It is easy to implement by simply adding all the bits together.
The 0 bits do not change the final sum, so the result is the number of ones.
This, however, requires a lot of additions: for D × K × K bits we would need
D × K × K − 1 adders.
If we want to reduce the number of adders, it is a good idea to design
popcount in another way. In this implementation, popcount is implemented
with LUTs, using the compressor tree method [11, 13].
Using a case statement in Verilog, every combination of 9 input bits is
mapped to a predetermined number of ones in this combination. For 9 bits,
there are 512 combinations and they have been manually written in the code
(Figure 5.9).
Figure 5.9: A part of popcount implementation for a 9-bit vector. We want
to count the number of set bits in a 9-bit vector in. Variable o is the number
of set bits for a particular in.
When computing the number of ones in a 9-bit array, we therefore only need
one lookup operation to look up a 4-bit value (if all the 9 bits are ones,
the value is 1001, hence a 4-bit value).
When computing the number of ones in an array that has more than 9 bits, we
split the bits into P parts of 9 bits. For each such part, we look up the
number of ones, so we end up with P values. If the number of bits in the
array is not a multiple of 9, we simply append zeroes to it until it is
(Figure 5.10). This does not change the final result. We could add these P
values and get the correct result, but for an array with a lot of bits this
would still require a lot of adders. One proposed solution is the following:
1. We take the first (most significant) bit of each of the P 4-bit values,
group them in 9-bit sequences and look up the number of ones in the
sequences. We shift the looked-up values by 3 to the left, essentially
multiplying them by 8.
2. We take the second bit of each of the P 4-bit values, perform lookups
and shift the values by 2 to the left, multiplying them by 4.
3. We do the same with the third bits and shift to the left by 1,
multiplying them by 2.
4. We do the same with the fourth (least significant) bits, but do not
shift.
5. We add the looked-up and shifted values and get the final result.
This way the implementation uses about a third fewer LUTs than adder trees
would [13]. An example in Verilog is shown in Figure 5.10.
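The five steps above can be sketched in Python as follows. This is a software illustration of the lookup scheme, not the Verilog code from Figures 5.9 and 5.10; the table is generated instead of written out by hand, and all names are hypothetical.

```python
# 512-entry lookup table: number of ones in every 9-bit combination.
POPCOUNT9 = [bin(i).count("1") for i in range(512)]


def lookup_counts(bits):
    """Split bits into 9-bit parts (zero-padded) and look up each count."""
    padded = list(bits) + [0] * (-len(bits) % 9)
    counts = []
    for i in range(0, len(padded), 9):
        index = 0
        for b in padded[i:i + 9]:
            index = (index << 1) | b           # pack 9 bits into a table index
        counts.append(POPCOUNT9[index])
    return counts


def popcount(bits):
    """Count ones using lookups on the input and on each of the 4 bit planes."""
    partial = lookup_counts(bits)              # P values of 4 bits each
    total = 0
    for shift in (3, 2, 1, 0):                 # steps 1 to 4
        plane = [(v >> shift) & 1 for v in partial]
        total += sum(lookup_counts(plane)) << shift   # step 5: add shifted values
    return total


vector = [1 if i % 3 == 0 else 0 for i in range(144)]   # hypothetical 144-bit input
print(popcount(vector))   # 48
```

For a 144-bit vector this needs 16 lookups on the input and 2 lookups per bit plane, and only the 8 looked-up and shifted values have to be added at the end.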
Figure 5.10: Popcount implementation for a vector with more than 9 bits.
In this example, the vector has 144 bits.
5.2.2 Threshold comparisons
After performing population count, we are left with 2×N × log2(D ×K ×
K) bits. The last stage in a convolutional layer module is to compare the
log2(D ×K ×K)-bit values to log2(D ×K ×K)-bit thresholds, resulting in
a (2×N)-bit output.
In hardware inference, we can simply perform a comparison between each
threshold and each popcount result to obtain a binarized result (Figure 5.11).
Figure 5.11: Threshold comparisons in Verilog
5.3 Max-pooling layer module
The max-pooling layer buffers two N × M outputs of a convolutional layer
module, which arrive every M/2 clock periods. This time, the buffering is
not performed in a first-in-first-out manner. One pair of rows is saved in a
register, and then the next pair, which completely overwrites the first
pair. This means that no operations can be performed for one clock period,
while the module is waiting for the second row. All the operations that are
performed when the second row arrives are performed in parallel and
therefore take only one clock period.
The max-pooling layer reduces the buffered 2 × N × M bits to (N/2) × M bits.
The 2 × N × M bits are divided into M × (N/2) blocks of 2 × 2 bits. Since
everything is binarized, the maximum in each block is 1 if there is at least
one 1 in the block. This is easily implemented using OR operations
(Figure 5.12).
Because the max-pooling layer takes two rows and transforms them into one,
its output changes only on every second input. Therefore, if the input
arrives every M/2 clock periods, the output is produced every M clock
periods.
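Binary max-pooling with OR operations can be sketched in Python for one channel. This is an illustration only, not the Verilog code from Figure 5.12; the function name is hypothetical and an even row width is assumed.

```python
def max_pool_rows(row_a, row_b):
    """2 x 2 binary max-pooling of two buffered rows of one channel.

    The maximum of four bits is their logical OR (even row width assumed).
    """
    return [row_a[i] | row_a[i + 1] | row_b[i] | row_b[i + 1]
            for i in range(0, len(row_a), 2)]


print(max_pool_rows([0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1]))   # [0, 1, 1]
```

Two rows of N bits are reduced to one row of N/2 bits, which is why the output changes only on every second input.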
Figure 5.12: Max-pooling in Verilog
5.4 Fully-connected layer module
The fully-connected layer module receives M × (N/2) activation bits. Because
all the inputs are required for the calculation of the outputs in the
fully-connected layer, the module needs to buffer all the valid outputs of
the previous max-pooling layer (which are produced every M clock periods)
for the current input matrix to form an M × (N/2) × (N/2) array. This takes
N/2 clock periods.
The fully-connected module performs XNOR operations between the
M × (N/2) × (N/2) input bits and M × (N/2) × (N/2) weight bits F times,
where F is the number of output neurons. This results in
M × (N/2) × (N/2) × F bits.
We count the ones in each of the F produced arrays as previously described.
This results in log2(M × (N/2) × (N/2)) × F bits. We compare these F values
to F thresholds and obtain F output bits.
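The three steps of the fully-connected layer (XNOR, population count and threshold comparison) can be sketched in Python. This is an illustration under assumptions, not the code from Figure 5.13: the names are hypothetical, a plain sum stands in for the lookup-based popcount, and the comparison is assumed to be "greater than or equal".

```python
def fully_connected(inputs, weights, thresholds):
    """Binary fully-connected layer: one output bit per neuron.

    Assumption for this sketch: an output is 1 when popcount >= threshold.
    """
    outputs = []
    for neuron_weights, threshold in zip(weights, thresholds):
        agree = [1 - (x ^ w) for x, w in zip(inputs, neuron_weights)]  # XNOR
        count = sum(agree)                                             # popcount
        outputs.append(1 if count >= threshold else 0)                 # compare
    return outputs


# Hypothetical layer with 4 inputs and F = 2 output neurons.
inputs = [1, 0, 1, 1]
weights = [[1, 0, 1, 1],
           [0, 1, 0, 0]]
thresholds = [3, 1]
print(fully_connected(inputs, weights, thresholds))   # [1, 0]
```

In the hardware design the F neurons are evaluated in parallel, whereas the loop here processes them one after another.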
The fully-connected layer buffers N/2 inputs in the same way as the
max-pooling layer module does. This means that it also produces a valid
output after (N/2) × M clock periods, plus 3 clock periods, one for each of
the three computation stages.
Figure 5.13 shows a part of the Verilog code for the fully-connected layer
module. As we can see, the three stages are not implemented in separate
modules, because the computations with all the weights are performed at the
same time.
5.5 Last layer module
The last layer module is implemented in the same way as the fully-connected
module, except that the results of the population count are not compared to
thresholds. Since the preceding fully-connected layer module produces only
one result, the last layer does not need to buffer its inputs first.
It transforms F × O bits into O × log2 F bits, where O is the number of
final output classes.
Figure 5.13: A part of the code in a fully-connected layer module.
Chapter 6
Results
A binary convolutional neural network, inspired by VGG-net [20], was
implemented on the ZedBoard [4]. The network was first trained, which
produced the weights and thresholds. These were then saved on-chip to
perform inference in hardware. Simulations of the implemented network were
run and the results were compared to a software implementation to check the
network’s validity. Then, the network was synthesized for the ZedBoard and
the final performance results were obtained.
6.1 Training
A subset of the MIOvision Traffic Camera Dataset (MIO-TCD) [14] is used for
training and testing. It has a training set of approximately 10,000 samples
for each of four classes: cars, pedestrians, cyclists and background. The
images from the dataset are resized and binarized to fit 32 × 32 bits. No
other preprocessing techniques were used. Figure 6.1 shows an image from
the MIO-TCD dataset. It belongs to the class background.
The training algorithm is a modified version of the algorithm used in [7].
It is written in Python with the Theano framework and executed on a CPU. The
main modification is that the first layer uses completely binarized inputs
and weights, whereas the algorithm in [7] uses full-precision inputs for the
first layer.
The loss is minimized with the ADAM optimization algorithm [10], using an
exponentially decaying learning rate. The training also includes batch
normalization.
Trained batch normalization parameters µ, β, σ and γ (described in sec-
tion 2.2.1) are extracted for each output dimension M in convolutional layers
and each output neuron F in the fully-connected layer. Thresholds are then
computed as
\[
\mathit{thresh} = \mu - b - \frac{\beta \sigma}{\gamma}
\]
for each M or F.
After the training, all the required binarized weights and integer thresh-
olds are written into Verilog code to be saved on-chip.
The classification accuracy against the test set (containing approximately
1500 images per class) of the implemented network is 64.7%. This could be
improved with better training, but it was not the focus of this thesis. The
hardware implementation is tested against the software implementation and
produces the same results.
6.2 Classification
The binary neural network, implemented in Verilog, has been tested against
a software implementation to ensure correct results.
The testing image in Figure 6.1 is first resized and binarized (Figure 6.2).
When we feed the image bits row by row into our inference design in Verilog
and run a behavioral simulation, we obtain the waveform shown in Figure 6.3.
As we can see, the final result of the network is 54710ab in hexadecimal
representation. This translates to 0101010 0011100 0100001 0101011 in binary
representation, since there are four classes and each class is represented
with seven bits. In decimal representation, this translates to the numbers
42, 28, 33 and 43, representing the classes cars, pedestrians, cyclists and
background, in this order.
We can conclude that the classification result is the class background, with
a confidence of 43, followed very closely by the class cars, with a
confidence of 42. The highest score that can be represented with seven bits
is 1111111 in binary or 127 in decimal representation.
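The decoding of the 28-bit output word can be reproduced with a short Python sketch. It only illustrates the bit arithmetic described above; the function name is hypothetical, and the class order is the one given in the text.

```python
CLASSES = ["cars", "pedestrians", "cyclists", "background"]


def decode_scores(word, num_classes=4, bits_per_class=7):
    """Split the output word into one 7-bit score per class, first class first."""
    mask = (1 << bits_per_class) - 1
    return [(word >> (bits_per_class * (num_classes - 1 - i))) & mask
            for i in range(num_classes)]


scores = decode_scores(0x54710AB)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, CLASSES[best])   # [42, 28, 33, 43] background
```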
Figure 6.1: An image from the MIO-TCD dataset.
Figure 6.2: A resized and binarized image from the MIO-TCD dataset.
Figure 6.3: Waveform of the test-bench.
6.3 Implemented network
The implemented binary network has four convolutional layers and two
fully-connected layers. A max-pooling layer follows every convolutional
layer (Figure 6.4). The first convolutional layer convolves with 16 filters,
the second with 32, the third with 48 and the fourth with 64. All filters
have width and height equal to 3. The sizes and timing of the input and
output activations for each layer can be seen in Table 6.2, and the number
of clock periods required for each layer can be seen in Table 6.3.
For example, in the first convolutional layer, when an input is buffered
and ready, computations are performed in stages as shown in Table 6.1. It
takes 11 clock periods to calculate the first output. As seen in Table 6.2,
a new buffered input comes in every 8 clock periods and goes to stage 1.
At this time, the previous input is still in stage 2, getting convolved with
the last two filters. The next output will be ready 8 clock periods after the
first one (at clock cycle 19 in the table, while the first one was ready at
clock cycle 11). Therefore, we can conclude that the very first valid output
of the convolutional layer 1 will be available after 11 clock cycles and every
next output will be available after 8 additional clock cycles, which is seen in
Table 6.2.
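The staged schedule of the first convolutional layer can also be reproduced with a small Python model. This is a sketch of the timing only, assuming that a new buffered input enters stage 1 every 8 clock periods starting at clock period 1; the function name is hypothetical.

```python
def stage_of(clock, pair, input_interval=8, stages=4):
    """Return the stage (1 to 4) holding filter pair `pair` at `clock`, or None.

    Clock periods and pairs are 1-based: pair 1 stands for filters 1 and 2.
    """
    for start in range(1, clock + 1, input_interval):   # entry clock of each input
        stage = clock - start - (pair - 1) + 1
        if 1 <= stage <= stages:
            return stage
    return None


# Row of the schedule at clock period 9: a new input enters stage 1
# while the last filter pairs of the previous input are still in flight.
print([stage_of(9, pair) for pair in range(1, 9)])
# [1, None, None, None, None, 4, 3, 2]

print(stage_of(11, 8), stage_of(19, 8))   # 4 4
```

The last filter pair leaves stage 4 at clock periods 11 and 19, which matches the first output after 11 clock cycles and every further output after 8 additional clock cycles.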
In some convolutional layers, however, the module performs all the con-
volutions with all filters, before a new input comes to stage 1 and therefore
needs to wait for the next input without doing anything. For example, in
convolutional layer 3, it takes 27 clock periods to produce the result of one
buffered input, but a new input is received only every 32 clock periods, which
              Filter number
Clock period  1, 2    3, 4    5, 6    7, 8    9, 10   11, 12  13, 14  15, 16
1             st1     -       -       -       -       -       -       -
2             st2     st1     -       -       -       -       -       -
3             st3     st2     st1     -       -       -       -       -
4             st4     st3     st2     st1     -       -       -       -
5             -       st4     st3     st2     st1     -       -       -
6             -       -       st4     st3     st2     st1     -       -
7             -       -       -       st4     st3     st2     st1     -
8             -       -       -       -       st4     st3     st2     st1
9             st1     -       -       -       -       st4     st3     st2
10            st2     st1     -       -       -       -       st4     st3
11            st3     st2     st1     -       -       -       -       st4
12            st4     st3     st2     st1     -       -       -       -
13            -       st4     st3     st2     st1     -       -       -
14            -       -       st4     st3     st2     st1     -       -
15            -       -       -       st4     st3     st2     st1     -
16            -       -       -       -       st4     st3     st2     st1
17            st1     -       -       -       -       st4     st3     st2
18            st2     st1     -       -       -       -       st4     st3
19            st3     st2     st1     -       -       -       -       st4
Table 6.1: Staged computations in convolutional layer module 1. We denote
stage X as stX; a dash marks a filter pair that is not being processed in
that clock period.
Figure 6.4: The architecture of the implemented binary neural network.
Network layer     Input activation             Output activation
Convolutional 1   1 × 32 = 32 every 8 cc       32 × 16 = 512 every 8 cc
Pooling 1         32 × 16 = 512 every 8 cc     16 × 16 = 256 every 16 cc
Convolutional 2   16 × 16 = 256 every 16 cc    32 × 16 = 512 every 16 cc
Pooling 2         32 × 16 = 512 every 16 cc    32 × 8 = 256 every 32 cc
Convolutional 3   32 × 8 = 256 every 32 cc     48 × 8 = 384 every 32 cc
Pooling 3         48 × 8 = 384 every 32 cc     48 × 4 = 192 every 64 cc
Convolutional 4   48 × 4 = 192 every 64 cc     64 × 4 = 256 every 64 cc
Pooling 4         64 × 4 = 256 every 64 cc     64 × 2 = 128 every 128 cc
Fully-connected   64 × 2 = 128 every 128 cc    64 every 256 cc
Last              64 every 256 cc              4 × 7 = 28 every 256 cc
Table 6.2: Input and output activation sizes in bits and their delays in clock
cycles (cc) for each layer in the network.
Network layer     Number of clock cycles required
Convolutional 1   16/2 + 3 = 11
Pooling 1         1 + 1 = 2
Convolutional 2   32/2 + 3 = 19
Pooling 2         1 + 1 = 2
Convolutional 3   48/2 + 3 = 27
Pooling 3         1 + 1 = 2
Convolutional 4   64/2 + 3 = 35
Pooling 4         1 + 1 = 2
Fully-connected   1 + 3 = 4
Last              2
Table 6.3: Number of clock cycles required for each layer in the network.
means that the module is not doing anything for 5 clock periods. The output
value changes after every 32 clock periods (5 for the new input to come, plus
27 for the new value to be computed).
When taking into account also all the clock periods when the modules
are buffering the inputs, the calculation of the class of a 32× 32 input image
takes 589 clock cycles.
6.4 Resource utilization
Table 6.4 shows the number of parameters required by this network
architecture. All together, the neural network requires 64,768 binary
parameters. The binary neural network therefore needs 7.9 KB to store its
parameters.
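The total can be verified by adding up the kernel and threshold sizes listed in Table 6.4. The following Python sketch only repeats this arithmetic; the dictionary layout is an assumption made for the example.

```python
# (kernel bits, threshold bits) per layer, as listed in Table 6.4.
LAYERS = {
    "Convolutional 1": (16 * 1 * 3 * 3, 16 * 4),
    "Convolutional 2": (32 * 16 * 3 * 3, 32 * 8),
    "Convolutional 3": (48 * 32 * 3 * 3, 48 * 9),
    "Convolutional 4": (64 * 48 * 3 * 3, 64 * 9),
    "Fully-connected": (64 * 2 * 2 * 64, 64 * 9),
    "Last": (64 * 4, 0),
}

total_bits = sum(kernel + threshold for kernel, threshold in LAYERS.values())
print(total_bits)              # 64768
print(total_bits / 8 / 1024)   # 7.90625
```

Dividing the 64,768 bits by 8 gives 8,096 bytes, which is the 7.9 KB stated above when 1 KB is taken as 1,024 bytes.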
Table 6.5 shows the number of parallel XNOR and popcount operations
required for each implemented network layer module. All these operations
execute M/2 times in each convolutional layer module sequentially for each
buffered input.
Network layer     Kernel filters size           Threshold size   Sum
Convolutional 1   16 × 1 × 3 × 3 = 144          16 × 4 = 64      208
Pooling 1         0                             0                0
Convolutional 2   32 × 16 × 3 × 3 = 4,608       32 × 8 = 256     4,864
Pooling 2         0                             0                0
Convolutional 3   48 × 32 × 3 × 3 = 13,824      48 × 9 = 432     14,256
Pooling 3         0                             0                0
Convolutional 4   64 × 48 × 3 × 3 = 27,648      64 × 9 = 576     28,224
Pooling 4         0                             0                0
Fully-connected   64 × 2 × 2 × 64 = 16,384      64 × 9 = 576     16,960
Last              64 × 4 = 256                  0                256
Table 6.4: Kernel filters and thresholds sizes in bits for each layer in the
network.
Network layer     XNOR operations                 Popcount operations
Convolutional 1   2 × 32 × 1 × 3 × 3 = 576        2 × 32 = 64, 9-bit
Pooling 1         /                               /
Convolutional 2   2 × 16 × 16 × 3 × 3 = 4,608     2 × 16 = 32, 144-bit
Pooling 2         /                               /
Convolutional 3   2 × 8 × 32 × 3 × 3 = 4,608      2 × 8 = 16, 288-bit
Pooling 3         /                               /
Convolutional 4   2 × 4 × 48 × 3 × 3 = 3,456      2 × 4 = 8, 432-bit
Pooling 4         /                               /
Fully-connected   64 × 2 × 2 × 64 = 16,384        64, 256-bit
Last              64 × 4 = 256                    4, 256-bit
Table 6.5: Number of XNOR and popcount operations required for each
layer.
6.5 Synthesis results
The performance of the proposed system architecture is evaluated on the
ZedBoard platform. The Xilinx Vivado 2018.3 environment is used for
synthesis and implementation of the design. The design mainly uses
registers and LUTs.
The slice logic utilization of the synthesized design is shown in Table 6.6.
The selected FPGA barely has enough resources to implement this design.
If the number of filters convolved in parallel in the convolutional layers
is reduced from two to one, the slice LUT utilization drops from 98% to 37%.
However, each convolutional layer module would then need about twice as
many clock periods to perform its calculations.
After implementation, the slice LUT utilization decreases slightly, from
98.69% to 95.92%, while the other resources remain unchanged (Table 6.7).
Site type          Used      Available    Utilized [%]
Slice LUTs         52,505    53,200       98.69
Slice registers    8,021     106,400      7.54
F7 Muxes           7,646     26,600       28.74
F8 Muxes           1,901     13,300       14.29
Table 6.6: Resource utilization of the synthesized design.
Site type          Used      Available    Utilized [%]
Slice LUTs         51,028    53,200       95.92
Slice registers    8,021     106,400      7.54
F7 Muxes           7,646     26,600       28.74
F8 Muxes           1,901     13,300       14.29
Table 6.7: Resource utilization of the implemented design.
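The utilization percentages in Tables 6.6 and 6.7 are simply the used
resources divided by the available ones. The minimal Python sketch below
recomputes the slice LUT figures and the number of LUTs left after
implementation. The variable and function names are chosen for this
example only.

```python
# (used, available) slice LUTs taken from Tables 6.6 and 6.7
slice_luts_synth = (52_505, 53_200)
slice_luts_impl = (51_028, 53_200)

def utilization(used, available):
    """Utilization in percent, rounded to two decimals as in the tables."""
    return round(100 * used / available, 2)

print(utilization(*slice_luts_synth))  # 98.69
print(utilization(*slice_luts_impl))   # 95.92

# Slice LUTs still free after implementation
free_luts = slice_luts_impl[1] - slice_luts_impl[0]
print(free_luts)  # 2172
```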
The maximum calculated clock frequency of the implemented design is
93.48 MHz. The design has a total latency of 589 clock cycles, or 6.3 µs.
The maximum throughput it can reach is 158,730 images per second at a
resolution of 32 × 32.
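The latency and throughput figures can be related with the following
Python sketch. It assumes that images are processed one at a time, so that
the throughput is the reciprocal of the latency. Under that assumption, the
reported 158,730 images per second corresponds to the rounded latency of
6.3 µs, while the unrounded latency gives a slightly lower value.

```python
clock_hz = 93.48e6      # maximum clock frequency after implementation
latency_cycles = 589    # clock cycles needed to classify one 32 x 32 image

latency_us = latency_cycles / clock_hz * 1e6
print(round(latency_us, 2))  # 6.3

# Assumption: one image at a time, so throughput = 1 / latency
throughput_exact = clock_hz / latency_cycles
throughput_from_rounded = 1 / 6.3e-6
print(round(throughput_exact))         # 158710
print(round(throughput_from_rounded))  # 158730
```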
Chapter 7
Conclusion
In this thesis, a binary neural network has been implemented on an FPGA.
We presented binary neural networks and their different types of layers,
described the basic structure of FPGAs, and discussed why binary neural
networks are suitable for implementation on FPGAs.
A binary neural network with four convolutional layers, four max-pooling
layers and two fully-connected layers has been trained in Python. The
MIO-TCD dataset [14], with about 10,000 samples for each of the four
classes, was used for training. The trained binary weights and thresholds
have been stored on-chip on the ZedBoard FPGA. An inference algorithm has
then been written in Verilog and synthesized in Vivado. The simulation
results were compared against a software implementation to verify that the
hardware implementation works correctly.
Since the implemented network is quite large, fitting the entire network
on the selected chip with its limited resources was a challenge. To achieve
this, we used pipelining at multiple stages while keeping as much
parallelism as the FPGA could still accommodate.
The maximum calculated clock frequency of the implemented design is
93.48 MHz. The design has a total latency of 589 clock cycles, or 6.3 µs.
The maximum throughput it can reach is 158,730 images per second at a
resolution of 32 × 32.
Much could be improved in this implementation. Firstly, the accuracy
could be improved by advanced training and preprocessing techniques and by
changing the first convolutional layer. In this implementation, the first
layer receives binary inputs of depth 1. If it instead received, for
example, 8-bit activations of depth 3 (representing the green, blue and red
channels), with the inputs of every following layer still binarized, the
accuracy could improve considerably. Such a network would, of course, no
longer be a completely binary neural network.
If we had an FPGA with more resources on it, the binary neural network
implementation could be designed in a more parallel way. But even this
implementation could be optimized by exploiting the time when layers wait
for new data.
Bibliography
[1] ILSVRC. Available:
http://www.image-net.org/challenges/LSVRC/. [Accessed: 11. 9. 2020].
[2] Structure-of-field-programmable-gate-array-fpga. Available:
https://iq.opengenus.org/structure-of-field-programmable-gate-array-fpga/.
[Accessed: 24. 8. 2020].
[3] Vivado. Available:
https://www.xilinx.com/products/design-tools/vivado.html.
[Accessed: 24. 8. 2020].
[4] Zedboard. Available: zedboard.org. [Accessed: 24. 8. 2020].
[5] Differences-between-fpga-and-asics. Available:
https://numato.com/blog/differences-between-fpga-and-asics/, 2018.
[Accessed: 24. 8. 2020].
[6] How-does-an-fpga-work. Available:
https://learn.sparkfun.com/tutorials/how-does-an-fpga-work/all, 2018.
[Accessed: 24. 8. 2020].
[7] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and
Yoshua Bengio. Binarized neural networks: Training deep neural networks
with weights and activations constrained to +1 or -1. arXiv preprint
arXiv:1602.02830, 2016. URL https://arxiv.org/abs/1602.02830.
[8] George E. Dahl, Tara N. Sainath, and Geoffrey E. Hinton. Improving
deep neural networks for LVCSR using rectified linear units and dropout.
In ICASSP, pages 8609–8613. IEEE, 2013. URL
http://dblp.uni-trier.de/db/conf/icassp/icassp2013.html#DahlSH13.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. pages 770–778, 06 2016. doi:
10.1109/CVPR.2016.90.
[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic
optimization. CoRR, abs/1412.6980, 2014. URL
http://dblp.uni-trier.de/db/journals/corr/corr1412.html#KingmaB14.
[11] M. Kumm and P. Zipf. Pipelined compressor tree optimization using
integer linear programming. In 2014 24th International Conference on
Field Programmable Logic and Applications (FPL), pages 1–8, 2014.
[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
Gradient-based learning applied to document recognition. In Proceedings
of the IEEE, pages 2278–2324, 1998.
[13] Shuang Liang, Shouyi Yin, Leibo Liu, Wayne Luk, and Shaojun Wei.
FP-BNN: Binarized neural network on FPGA. Neurocomputing, 275, 10
2017. doi: 10.1016/j.neucom.2017.09.046.
[14] Zhiming Luo, Frederic B.-Charron, Carl Lemaire, Janusz Konrad,
Shaozi Li, Akshaya Mishra, Andrew Achkar, Justin Eichel, and Pierre-Marc
Jodoin. MIO-TCD: A new benchmark dataset for vehicle classification
and localization. IEEE Transactions on Image Processing, PP:
1–1, 06 2018. doi: 10.1109/TIP.2018.2848705.
[15] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas
immanent in nervous activity. The bulletin of mathematical biophysics,
5(4):115–133, 1943.
[16] Tadej Murovič and Andrej Trost. Massively parallel combinational
binary neural networks for edge processing. Elektrotehniški
Vestnik/Electrotechnical Review, 86:47–53, 01 2019.
[17] M. Rusci, L. Cavigelli, and L. Benini. Design automation for binarized
neural networks: A quantum leap opportunity? In 2018 IEEE International
Symposium on Circuits and Systems (ISCAS), pages 1–5, 2018.
[18] A. Shawahna, S. M. Sait, and A. El-Maleh. FPGA-based accelerators of
deep learning networks for learning and classification: A review. IEEE
Access, 7:7823–7859, 2019.
[19] Karen Simonyan and Andrew Zisserman. Very deep convolutional
networks for large-scale image recognition. arXiv 1409.1556, 09 2014.
[20] Karen Simonyan and Andrew Zisserman. Very deep convolutional
networks for large-scale image recognition. In Yoshua Bengio and Yann
LeCun, editors, 3rd International Conference on Learning
Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference
Track Proceedings, 2015. URL http://arxiv.org/abs/1409.1556.
[21] V. Sze, Y. Chen, T. Yang, and J. S. Emer. Efficient processing of deep
neural networks: A tutorial and survey. Proceedings of the IEEE, 105
(12):2295–2329, December 2017. ISSN 1558-2256. doi:
10.1109/JPROC.2017.2761740. URL http://arxiv.org/abs/1703.09039.
[22] Vivienne Sze, Yu-Hsin Chen, Joel Emer, Amr Suleiman, and Zhengdong
Zhang. Hardware for machine learning: Challenges and opportunities.
pages 1–8, 04 2017. doi: 10.1109/CICC.2017.7993626.
[23] Electronic Theses and 2004-2019 Dissertations. Exploring FPGA
implementation for binarized neural network inference. IEEE, 2018. URL
https://stars.library.ucf.edu/etd/6205.
[24] Paul J Werbos. Backpropagation through time: what it does and how
to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
