An FPGA Accelerated Method for Training Feed-forward Neural Networks
  Using Alternating Direction Method of Multipliers and LSMR by Foumani, Seyedeh Niusha Alavi et al.
Imperial College London
Department of Computing
An FPGA Accelerated Method for Training
Feed-forward Neural Networks Using Alternating
Direction Method of Multipliers and LSMR
Author: Seyedeh Niusha Alavi Foumani
Supervisor: Professor Wayne Luk
Submitted in partial fulfilment of the requirements for the MSc degree in Advanced
Computing of Imperial College London
September 2020
arXiv:2009.02784v1 [cs.LG] 6 Sep 2020
Abstract
In this project, we designed, implemented, deployed and tested a novel FPGA-accelerated
algorithm for neural network training. The algorithm itself was developed in an
independent study option. The training method is based on the Alternating Direction
Method of Multipliers (ADMM), which has strong parallel characteristics, and it employs
LSMR to avoid procedures such as matrix inversion that are problematic in hardware
designs. As an intermediate stage, we fully implemented the ADMM-LSMR method in C for
feed-forward neural networks with a flexible number of layers and hidden size. We
demonstrated that the method can operate with fixed-point arithmetic without
compromising accuracy. Next, we devised an FPGA-accelerated version of the algorithm
using the Intel FPGA SDK for OpenCL and performed extensive optimisation, followed by
successful deployment of the program on an Intel Arria 10 GX FPGA. The FPGA-accelerated
program showed up to a 6-fold speed-up compared to the equivalent CPU implementation
while achieving promising accuracy.
Keywords: Feed-forward Neural Network, Alternating Direction Method of Multipliers
(ADMM), LSMR, FPGA, OpenCL, Fixed-point
Acknowledgements
I would like to thank the following people:
• My supervisor, Prof Wayne Luk, for all his valuable help and advice and for
giving me the opportunity to get involved in an exciting area of research.
• Dr Ce Guo for all of his guidance and constructive feedback. He introduced the
fascinating world of hardware for neural networks to me and patiently taught
me a lot throughout the course of this project.
• My family and my fiancé for their unconditional love and support.

Contents
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Outline
2 Background
2.1 Artificial Neural Networks
2.1.1 Feed-forward Neural Networks
2.1.2 Gradient-Based Methods
2.2 Alternating Direction Method of Multipliers
2.3 LSMR
2.4 ADMM for Neural Networks
2.5 Hardware for Neural Networks
2.5.1 Motivation
2.5.2 Approaches
2.6 Field-programmable Gate Arrays
2.7 FPGA for Neural Networks
2.7.1 Precision Reduction
2.7.2 Sparsification
2.7.3 Compression
2.8 Fixed-point Arithmetic
2.8.1 Rounding Methods
2.9 Development Stack
2.9.1 OpenCL
2.9.2 Intel® FPGA SDK for OpenCL
2.9.3 Intel AOC Compiler Report
2.9.4 Intel® FPGA Devcloud
3 Software Design
3.1 C Implementation
3.1.1 Motivation
3.1.2 Code Structure
3.1.3 Bottleneck Analysis
3.2 Implementation of Fixed-point Arithmetic
3.2.1 Motivation
3.2.2 Challenges
3.2.3 Fixed-point Arithmetic Details
3.3 Using Fixed-point LSMR in ADMM
4 Hardware Design
4.1 Hardware-accelerated ADMM-LSMR
4.1.1 Motivation
4.1.2 Challenges
4.1.3 OpenCL for FPGA Implementation
4.2 Optimisations of FPGA Implementation
5 Experimental Results
5.1 Comparing Floating-point vs Fixed-point
5.1.1 IRIS
5.1.2 HIGGS
5.2 Comparing CPU Implementation and FPGA Implementation: Accuracy
5.2.1 IRIS
5.2.2 HIGGS
5.3 Comparing CPU Implementation and FPGA Implementation: Time
5.3.1 HIGGS
5.4 Run-time Relation to Network Complexity
5.4.1 HIGGS
6 Conclusion and Future Work
6.1 Technical Achievements
6.2 Results and Observations
6.3 Future Work

Chapter 1
Introduction
In this project, we have designed and implemented a hardware-accelerated neural
network training algorithm. This project is, in fact, a continuation of an independent
study option, in which a hardware-friendly approach for training neural networks using
ADMM and LSMR was introduced [1]. In this work, we present an implementation
of the ADMM-LSMR algorithm, which is accelerated with FPGA using OpenCL. To
the best of our knowledge, this is the first hardware-accelerated implementation of
an ADMM-based training method that uses LSMR to avoid matrix inversion. This
implementation takes advantage of parallel characteristics of ADMM and LSMR and
uses fixed-point arithmetic to suit hardware design restrictions.
1.1 Motivation
Machine learning and in particular neural networks have shown promising performance
in many domains both in academia and industry. The models are getting more and
more complex, and the available amount of data is rapidly growing. As a result, more
sophisticated challenges are emerging in this field, and many techniques have been em-
ployed to make training algorithms more efficient and keep up with the data volume
and complexity. One way to address some of these challenges is to fortify the training
platforms by employing hardware acceleration or even designing custom hardware.
Several approaches can be taken in order to use hardware acceleration in neural
network training algorithms. Gradient-based methods, which are a commonly used
category of algorithms for training neural networks, have many characteristics that
complicate the use of hardware acceleration. In [1], the ADMM-LSMR method is
described as an alternative to gradient-based methods alongside a high-level Python
implementation as proof of concept. This method is, in fact, orthogonal to common
approaches as it is focused on the algorithm itself to be suitable for hardware design
rather than implementing a hardware-accelerated variant of a conventional training
algorithm. However, a method that is theoretically suitable for hardware design and has
a Python implementation is still far from a practical and deployable hardware design.
1.2 Objectives
In this project, we improved the method proposed in [1] by applying some common
techniques, producing an even more hardware-friendly variant of the algorithm. We then
took multiple steps to optimise the hardware design and finally achieved a fully
operational deployment on an FPGA card using OpenCL.
The key components and achievements of this project are as follows:
• Full low-level C implementation of the ADMM-LSMR method.
• Implementation of fixed-point arithmetic with four different rounding methods.
• Implementation of 16-bit and 32-bit fixed-point variants of the LSMR method to be
more hardware efficient.
• Primary design and implementation of both the CPU (host) side and the device (FPGA)
side of an OpenCL program, and applying detailed amendments to emulate the
program successfully.
• Deployment of the design on an Intel Arria® 10 GX FPGA on the Intel DevCloud
stack, which required another set of design amendments to achieve a successful
deployment.
• Applying multiple stages of optimisation to increase speed and maximise FPGA
board utilisation.
• Conducting experiments to assess the accuracy and efficiency of the final
hardware-accelerated program.
1.3 Outline
In the background chapter of this report, we first briefly describe neural networks,
the ADMM optimisation method and the LSMR algorithm. Next, we explore common approaches
to hardware acceleration of neural networks, followed by a description of FPGAs and
their structure. Then, the use of FPGAs to accelerate neural network training is
explored. Finally, an overview of the employed technologies, including OpenCL and the
Intel FPGA development stack, is provided.
In chapter 3, software and algorithmic aspects of the implementation are described.
First, the ADMM-LSMR algorithm [1] is explained, and it is demonstrated how and
why the method is a perfect candidate for hardware acceleration. Later, details of the
low-level implementation of the method, fixed-point arithmetic and their combination
are described.
In chapter 4, the route taken to convert the C implementation to an OpenCL acceler-
ated program and finally a fully working FPGA deployment is explained. A separate
section is dedicated to applied optimisation techniques and their outcome.
Chapter 5 contains the results of experiments conducted to assess the accuracy and
time efficiency of the implemented method.
Finally, chapter 6 closes the report with the conclusion and potential areas for
improvement.
Chapter 2
Background
2.1 Artificial Neural Networks
The main goal of a neural network is to find a function f that best approximates
some target function f*. This goal is achieved through a process called training, where
the network learns a set of parameters from the input data. Using the learned
parameters to predict the output for new data is called inference.
Two key components of a neural network are neurons and layers. A layer is a collection
of neurons, and the network is composed by connecting these layers. Different
approaches to connecting layers lead to different types of neural networks, such as
feed-forward neural networks, convolutional neural networks and recurrent neural
networks. Each of these networks is suitable for a specific set of applications [2].
\min_{W} \ell(f(x_0, W), y)    (2.1)
The optimisation problem of a neural network can be written as equation 2.1, where \ell
is a loss function and W is the learnable parameter. As mentioned above, the goal is to
learn W such that the difference between the output of f given the input x_0 and the
actual output y is minimised.
2.1.1 Feed-forward Neural Networks
In a feed-forward neural network, all of the neurons of a layer are connected to all of
the neurons of the next layer, as shown in figure 2.1. This type of neural network
is called feed-forward because information always flows forward through it. It has been
shown that a feed-forward neural network with one hidden layer can approximate
any continuous function, but the required hidden size may be so large that learning
becomes infeasible [2]. Feed-forward neural networks are suitable for unstructured
data, for example when the data is not time-dependent or sequential.
Figure 2.1: A simple feed-forward neural network
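The following statements hold for a three-layer feed-forward neural network:
Input data: x_0 \in \mathbb{R}^{D \times N}
W_1 \in \mathbb{R}^{HS \times D}
z_1 = W_1 x_0, z_1 \in \mathbb{R}^{HS \times N}
Input of hidden layer: x_1 = h_1(z_1) \in \mathbb{R}^{HS \times N}
W_2 \in \mathbb{R}^{HS \times HS}
z_2 = W_2 x_1, z_2 \in \mathbb{R}^{HS \times N}
Input of last layer: x_2 = h_2(z_2) \in \mathbb{R}^{HS \times N}
W_3 \in \mathbb{R}^{OS \times HS}
Output: z_3 = W_3 x_2, z_3 \in \mathbb{R}^{OS \times N}
Here N is the number of training samples, D the number of features, HS the number of
neurons in each hidden layer and OS the dimension of the output; h_l is the activation
function of layer l. This notation is used throughout this report.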
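The dimensions listed above can be checked with a small forward pass. The following is a minimal sketch and is not taken from the thesis code: the toy sizes, the weight values and the choice of ReLU for h_1 and h_2 are assumptions made purely for illustration.

```python
# Illustrative sketch only: forward pass of a three-layer feed-forward network,
# with matrices stored as lists of rows (column j holds sample j).

def matmul(a, b):
    """Multiply an (r x k) matrix by a (k x c) matrix."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def relu(m):
    """Element-wise ReLU, an assumed choice for the activations h_1 and h_2."""
    return [[max(v, 0.0) for v in row] for row in m]

def shape(m):
    """Return (rows, columns) of a matrix."""
    return (len(m), len(m[0]))

def forward(x0, w1, w2, w3):
    z1 = matmul(w1, x0)   # HS x N
    x1 = relu(z1)         # HS x N
    z2 = matmul(w2, x1)   # HS x N
    x2 = relu(z2)         # HS x N
    z3 = matmul(w3, x2)   # OS x N
    return z1, x1, z2, x2, z3

# Assumed toy sizes: D = 4 features, N = 5 samples, HS = 3, OS = 2.
D, N, HS, OS = 4, 5, 3, 2
x0 = [[0.1 * (i + j) for j in range(N)] for i in range(D)]
w1 = [[0.5 if (i + j) % 2 == 0 else -0.5 for j in range(D)] for i in range(HS)]
w2 = [[0.25 * (i - j) for j in range(HS)] for i in range(HS)]
w3 = [[1.0 for _ in range(HS)] for _ in range(OS)]

z1, x1, z2, x2, z3 = forward(x0, w1, w2, w3)
print(shape(z1), shape(x2), shape(z3))
```

Each column of x_0 is one training sample, so every intermediate matrix keeps N columns while the number of rows follows the layer width.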
2.1.2 Gradient-Based Methods
Gradient-based methods [3] are the most common optimisers which are used alongside
back-propagation [4] to solve the optimisation problem of neural networks. These
iterative algorithms use the first-order derivative of the objective function to move
towards the optimal solution.
Many variants of gradient-based methods have been proposed, such as SGD [5] [6],
AdaDelta [7], AdaGrad [8], Nadam and Adam [9], the last of which is the most popular
and most commonly used. In general, it is observed that these methods suffer from
several limitations, such as:
Vanishing and Exploding Gradient
This problem which occurs as a result of repeated matrix multiplication, is known as
one of the fundamental limitations of the gradient-based methods. Multiplying small
values of gradient several times results in a very small value that can slow down or even
stop the training process. This defect is called vanishing gradient. On the other hand,
exploding gradient happens as a consequence of multiplying big values of gradient
multiple times. This can make the learning process extremely unstable. In Recurrent
Neural Networks, this problem is even more severe [10]. Methods proposed to reduce its
effect include changing the architecture to Long Short Term Memory networks [11],
using Rectified Linear Units [12] as the activation function, which helps with
vanishing gradients, and clipping gradients to mitigate exploding gradients.
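The effect of repeated multiplication can be illustrated numerically. The sketch below is a simplification, not part of the thesis: it replaces the per-layer Jacobians with a single assumed scalar factor, and the clipping limit is an arbitrary illustrative value.

```python
# Illustrative sketch: repeated multiplication of a scalar "gradient factor".
# Real back-propagation multiplies matrices; a scalar is an assumed simplification.

def repeated_product(factor, depth):
    """Product of `depth` identical per-layer gradient factors."""
    g = 1.0
    for _ in range(depth):
        g *= factor
    return g

def clip(g, limit):
    """Clip a gradient value to the interval [-limit, limit]."""
    return max(-limit, min(limit, g))

vanishing = repeated_product(0.5, 50)   # shrinks towards zero
exploding = repeated_product(1.5, 50)   # grows without bound
clipped = clip(exploding, 5.0)          # clipping bounds the update size

print(vanishing, exploding, clipped)
```

With factors below one the product collapses towards zero, and with factors above one it grows rapidly; clipping only bounds the magnitude and does not restore the lost information in the vanishing case.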
Sequential Dependency
Gradient-based methods are sequential by nature. In these methods, the gradient
computation of a batch can only start after the computation of the previous batch has
been completed, and weight updates take place only when the gradients of a batch are
available. This characteristic makes gradient-based methods poor candidates for
parallel implementation. The issue is even more critical in FPGA implementations of
neural networks, where parallelisation involves hardware pipelines and an algorithm
with fewer dependencies that can be pipelined is desired.
Converging to Local Minima or Saddle Points
Generally, the optimisation problem of neural networks is non-convex. Therefore,
converging to local minima or saddle points is another concern. Also, it has been
discussed that in higher dimensions, saddle points can cause more crucial issues [13].
Sensitivity to Ill-conditioning
Another common issue in training neural networks using gradient-based methods is
ill-conditioning. When the Hessian of the objective function is ill-conditioned, it can
drastically affect the convergence rate of gradient-based methods and slow down the
training process [14].
2.2 Alternating Direction Method of Multipliers
Recently, the Alternating Direction Method of Multipliers (ADMM) [15] has been used as
an optimisation method in a wide variety of applications [16], including machine
learning and neural networks [17]. This powerful iterative optimisation method breaks
the objective function down into smaller pieces that can be solved more easily [18].
This method can be applied in parallel, and this characteristic makes it a suitable
replacement for gradient-based methods in the optimisation problem of neural
networks. The inherent parallelism of ADMM also makes it a good option for hardware
implementation. However, as discussed in [1], this method includes matrix
inversion, which is not a hardware-friendly operation. ADMM is a combination of dual
decomposition and the method of multipliers. We have explored the mathematical details
of these methods in [1].
2.3 LSMR
LSMR [19] is an iterative method that solves linear systems and least-squares problems
such as 2.2, where A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^{m} and
x \in \mathbb{R}^{n}. This method is based on the Golub-Kahan bidiagonalization
process [20], which is shown in algorithm 1. Other iterative least-squares solvers
based on the Golub-Kahan bidiagonalization process also exist [21] [22]. The
difference between these methods is usually their early stopping criterion. These
algorithms are sequential in principle, but they can be parallelised for problems
where X and B are matrices, since each column can be processed separately. This
characteristic is desirable for hardware implementation since it allows pipelining
and parallelisation.
Ax = b    (2.2)
\min_{x} \|Ax - b\|_2
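Algorithm 1: Golub-Kahan Bidiagonalization Process [22]
Input: A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^{m}
\beta_1 \leftarrow \|b\|_2
u_1 \leftarrow b / \beta_1
\alpha_1 \leftarrow \|A^T u_1\|_2
v_1 \leftarrow A^T u_1 / \alpha_1
for k = 1, 2, ... do
    \beta_{k+1} \leftarrow \|A v_k - \alpha_k u_k\|_2
    u_{k+1} \leftarrow (A v_k - \alpha_k u_k) / \beta_{k+1}
    \alpha_{k+1} \leftarrow \|A^T u_{k+1} - \beta_{k+1} v_k\|_2
    v_{k+1} \leftarrow (A^T u_{k+1} - \beta_{k+1} v_k) / \alpha_{k+1}
end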
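Algorithm 1 can be sketched directly in a few lines. The code below is an illustrative transcription and not the C or OpenCL implementation of this project: the helper names, the fixed iteration count and the test matrix are assumptions, and breakdown (a zero norm) is deliberately left unhandled.

```python
import math

def matvec(a, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def rmatvec(a, y):
    """Compute A^T y without forming the transpose explicitly."""
    return [sum(a[i][j] * y[i] for i in range(len(a))) for j in range(len(a[0]))]

def norm2(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))

def golub_kahan(a, b, steps):
    """Run `steps` iterations of Algorithm 1 and return (alphas, betas, us, vs).

    No breakdown handling: a zero norm raises ZeroDivisionError.
    """
    beta = norm2(b)
    u = [t / beta for t in b]
    w = rmatvec(a, u)
    alpha = norm2(w)
    v = [t / alpha for t in w]
    alphas, betas, us, vs = [alpha], [beta], [u], [v]
    for _ in range(steps):
        w = [p - alpha * q for p, q in zip(matvec(a, v), u)]   # A v_k - alpha_k u_k
        beta = norm2(w)
        u = [t / beta for t in w]
        w = [p - beta * q for p, q in zip(rmatvec(a, u), v)]   # A^T u_{k+1} - beta_{k+1} v_k
        alpha = norm2(w)
        v = [t / alpha for t in w]
        alphas.append(alpha)
        betas.append(beta)
        us.append(u)
        vs.append(v)
    return alphas, betas, us, vs

# Assumed toy problem: a 3 x 2 matrix and a right-hand side of length 3.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [1.0, 0.0, 1.0]
alphas, betas, us, vs = golub_kahan(A, b, steps=1)
print(alphas, betas)
```

Only matrix-vector products with A and A^T, vector norms and scalings are needed, which is why no matrix inversion appears. When the right-hand side is a matrix, each of its columns can be passed through the same routine independently.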
2.4 ADMM for Neural Networks
The ADMM-LSMR method implemented in [1] and in this work is based on an ADMM-based
training method proposed in "Training Neural Networks Without Gradients:
A Scalable ADMM Approach" [23]. Here, that work is briefly discussed using the
notation provided in section 2.1.1.
To utilise ADMM for solving the optimisation problem of neural networks, the key
idea proposed in [23] is to use a variable called the pre-activation z_l for each
layer l. This enables us to decouple the weights from the activation function and
rewrite the optimisation problem of an L-layer neural network as follows:
\min_{W_l, x_l, z_l} \ell(z_L, y)    (2.3)
subject to z_l = W_l x_{l-1}, for l = 1, 2, ..., L
           x_l = h_l(z_l), for l = 1, 2, ..., L-1
The augmented Lagrangian of 2.3 can be written as:
\ell(z_L, y) + \beta_L \|z_L - W_L x_{L-1}\|_2^2    (2.4)
+ \sum_{l=1}^{L-1} [ \gamma_l \|x_l - h_l(z_l)\|_2^2 + \beta_l \|z_l - W_l x_{l-1}\|_2^2 ]
+ \sum_{l=1}^{L-1} [ \lambda_l^T (z_l - W_l x_{l-1}) + \delta_l^T (x_l - h_l(z_l)) ]
+ \lambda_L^T (z_L - W_L x_{L-1}) + \delta_L^T (x_L - h_L(z_L))
where \lambda_l and \delta_l are vectors of Lagrangian multipliers and \gamma_l and
\beta_l are penalty parameters. The method proposed in [23] uses just one Lagrangian
multiplier vector, since the authors observed that classic ADMM, where each constraint
has its own Lagrangian vector, is unstable. This results in 2.5, where the only
Lagrangian multiplier vector is \lambda:
\ell(z_L, y) + \beta_L \|z_L - W_L x_{L-1}\|_2^2    (2.5)
+ \sum_{l=1}^{L-1} [ \gamma_l \|x_l - h_l(z_l)\|_2^2 + \beta_l \|z_l - W_l x_{l-1}\|_2^2 ]
+ \lambda^T (z_L - W_L x_{L-1})
Pseudo-code of their proposed method is provided in algorithm 2. In this method,
variables are updated one at a time while the others are fixed. Minimisation steps of
this algorithm are explained in the following sections.
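Algorithm 2: ADMM for Neural Networks [23]
while not converged do
    for l = 1, 2, ..., L-1 do
        W_l \leftarrow z_l x_{l-1}^{\dagger}
        x_l \leftarrow (\gamma_l I + \beta_{l+1} W_{l+1}^T W_{l+1})^{-1} (\gamma_l h_l(z_l) + \beta_{l+1} W_{l+1}^T z_{l+1})
        z_l \leftarrow \arg\min_{z} \gamma_l \|x_l - h_l(z_l)\|_2^2 + \beta_l \|z_l - W_l x_{l-1}\|_2^2
    end
    W_L \leftarrow z_L x_{L-1}^{\dagger}
    z_L \leftarrow \arg\min_{z} \ell(z_L, y) + \beta_L \|z_L - W_L x_{L-1}\|_2^2 + \lambda^T (z_L - W_L x_{L-1})
    \lambda \leftarrow \lambda + \beta_L (z_L - W_L x_{L-1})
end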
Weight Update
The solution of minimising 2.5 with respect to W_l can be written as 2.6, where
x_{l-1}^{\dagger} is the pseudo-inverse of the matrix x_{l-1}.
W_l \leftarrow z_l x_{l-1}^{\dagger}    (2.6)
Activation Update
x_l is updated using equation 2.7 in each step. Details of how this equation
is derived are discussed in [1].
x_l \leftarrow (\gamma_l I + \beta_{l+1} W_{l+1}^T W_{l+1})^{-1} (\gamma_l h_l(z_l) + \beta_{l+1} W_{l+1}^T z_{l+1})    (2.7)
Output Update
The new value of z_l is calculated by solving the optimisation problem 2.8. This
problem is non-convex and non-quadratic because of the activation function h.
However, it can be solved easily in closed form when h is piece-wise linear, since the
activation function works element-wise on its inputs.
\arg\min_{z} \gamma_l \|x_l - h_l(z_l)\|_2^2 + \beta_l \|z_l - W_l x_{l-1}\|_2^2    (2.8)
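To make the closed-form remark concrete, the sketch below solves problem 2.8 for a single entry under the assumption that h is ReLU, one common piece-wise linear activation. It is an illustration under that assumption, not the implementation used in this project; the names x, m (standing for the corresponding entry of W_l x_{l-1}), gamma and beta are hypothetical, and both penalty parameters are assumed positive.

```python
def relu(z):
    return max(z, 0.0)

def objective(z, x, m, gamma, beta):
    """Per-entry objective of 2.8: gamma*(x - h(z))^2 + beta*(z - m)^2, with h = ReLU."""
    return gamma * (x - relu(z)) ** 2 + beta * (z - m) ** 2

def z_update_entry(x, m, gamma, beta):
    """Minimise the per-entry objective by comparing the two linear pieces of ReLU.

    On z >= 0 the objective is a convex quadratic, so its constrained minimiser is
    the unconstrained one clipped at zero. On z <= 0 only the second term depends
    on z, so the minimiser is m clipped at zero from above.
    """
    z_pos = max((gamma * x + beta * m) / (gamma + beta), 0.0)
    z_neg = min(m, 0.0)
    return min((z_pos, z_neg), key=lambda z: objective(z, x, m, gamma, beta))

# Assumed example values for one entry.
z_star = z_update_entry(x=1.0, m=-0.5, gamma=10.0, beta=1.0)
print(z_star)
```

Because the activation acts element-wise, the same scalar routine can be applied to every entry of z_l independently, which suits parallel hardware.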
Lagrangian Multiplier Update
The Lagrangian multiplier is updated using the following equation:
\lambda \leftarrow \lambda + \beta_L (z_L - W_L x_{L-1})    (2.9)
2.5 Hardware for Neural Networks
As the amount of available data and also the complexity of neural networks are increas-
ing, computation and storage cost of these models are growing rapidly. In some cases,
these requirements have made the use of large neural networks impossible, especially
in applications where low power consumption or small latency is critical. As a result,
choosing and designing efficient computing platforms for neural network applications
is becoming more critical than ever [24] [25].
Training or inference of neural networks on general-purpose CPUs with a von Neumann
architecture is inefficient, since a large number of multiply-accumulate (MAC)
operations are involved in these processes. CPUs offer neither high performance in this
area nor low power consumption, and are not well suited to either cloud or mobile
applications of neural networks. In addition, the breakdown of Dennard scaling, the
failure to keep increasing the clock frequency, and the low rate of data transfer
between CPU and memory, known as the von Neumann bottleneck, have made custom hardware
architectures for neural networks more attractive.
GPUs have a higher arithmetic density than CPUs. As a result, neural networks are
nowadays usually trained on GPUs, which have very high power consumption. The need for
a low-power and efficient platform for training neural networks has led to a
significant increase in research on custom hardware architectures for neural networks
in recent decades [26] [27] [28].
2.5.1 Motivation
One of the main reasons for interest in the use of custom hardware is exploiting the in-
herent parallelism of neural networks. Also, as we mentioned, von Neumann bottleneck
and breakdown of Dennard scaling result in limitations for CPUs and highlight the
need for more efficient hardware platforms. Computations of neural network models
usually take place in the cloud, but, there are some applications where local embed-
ded processing is preferred because of privacy. In these cases, small footprint and low
power consumption become more important. These two factors are also critical in
wearable or implantable medical devices. Other important motivations are increasing
the speed of computation and decreasing the latency, which is always desirable and
also is critical in applications like autonomous vehicles and robotics. It has been ob-
served that with custom chips, we can achieve much faster neural networks compared
to von Neumann architectures [25] [29].
Considering the above, the main motivations behind the growing interest in hardware
implementations of neural networks can be summarised as:
• Parallelism
• CPU limitations
• Low power consumption
• Small footprint
• High speed
2.5.2 Approaches
Key factors that should be considered when implementing hardware-based neural net-
works are the following [25]:
• Accuracy: A common measure of how well a neural network performs. The accuracy of
neural networks should not be compromised by the hardware implementation.
• Power consumption: The energy consumed by the platform. Data movement
usually costs more energy than computation.
• Throughput/latency: How much data can be processed at a time and how fast
the network can respond to queries. Latency is more important in inference.
• Cost: This is determined by many factors. If only a small number of chips is needed,
an ASIC design costs much more than an FPGA. The complexity of the circuit
and the amount of on-chip memory required also have a direct impact on the
cost.
Various techniques have been developed to maintain a trade-off between these factors
while making neural networks more hardware-compatible. These techniques usually
aim to reduce data movement, computation and required storage on the chip, while
maintaining the accuracy.
Hardware implementation of neural networks can be split into three major categories
[29] [30]:
• Analog: ASIC and FPAA (Field Programmable Analog Arrays).
• Digital: ASIC and FPGA.
• Mixed analog/digital systems.
Analog ASIC designs are fast and dense with low power consumption, but they are
expensive and lack flexibility. In general, analog designs are noisier than digital
designs, and they may suffer from limited precision and robustness as well as
data storage problems. A further problem with FPAAs is that there are currently very
few FPAA manufacturers, and their on-chip resources, which are critical in neural
network implementations, are far fewer than those of FPGAs. Digital ASIC designs
provide more accuracy and robustness than analog ASICs, but they are also very
expensive and inflexible, with a very time-consuming and challenging development
process. There is also ongoing research on mixed analog/digital circuits for neural
networks. These systems have the overhead of ADC and DAC conversion [25] [30].
FPGAs are usually less performant than ASIC designs in terms of area, power
and speed. On the other hand, they have a faster design process and are less expensive.
These reconfigurable platforms also benefit from increased processing density (greater
performance per unit of silicon area) compared to general-purpose processors, and they
can have a better cost-performance ratio than both ASICs and general-purpose
processors. In addition, being reconfigurable, FPGAs are flexible and can be
reprogrammed on demand for different neural networks [31] [32] [33].
2.6 Field-programmable Gate Arrays
A field-programmable gate array (FPGA) is a semiconductor integrated circuit (IC)
made of small computation units, usually called logic blocks, connected
together by programmable interconnections. FPGAs can implement a practically unlimited
variety of functions, as they can be configured over and over again. Configuring an
FPGA means programming its logic blocks so that their outputs become specific
functions of their inputs. Because of this capability, we are able to build custom
data-paths and program the dataflow directly into the hardware.
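The idea that a logic block is configured to compute a chosen function of its inputs can be mimicked in software with a lookup table. The sketch below is a toy model under stated assumptions, not a description of any particular FPGA: the function names are hypothetical, and a real device also involves flip flops, multiplexers and routing.

```python
# Toy model of a k-input lookup table (LUT): "configuring" it means filling a
# truth table, after which evaluation is a single indexed read.

def make_lut(func, num_inputs):
    """Build the truth table (the configuration bits) for a k-input Boolean function."""
    table = []
    for index in range(2 ** num_inputs):
        bits = [(index >> i) & 1 for i in range(num_inputs)]
        table.append(func(*bits) & 1)
    return table

def lut_eval(table, *bits):
    """Evaluate a configured LUT: the input bits form the address into the table."""
    index = sum(bit << i for i, bit in enumerate(bits))
    return table[index]

# Two 3-input LUTs configured as the sum and carry outputs of a full adder.
sum_lut = make_lut(lambda a, b, c: a ^ b ^ c, 3)
carry_lut = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)

print(lut_eval(sum_lut, 1, 1, 0), lut_eval(carry_lut, 1, 1, 0))
```

Reconfiguring the device corresponds to writing different truth tables into the same structure, which is why the same hardware can implement different data-paths.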
A simple view of an FPGA architecture is shown in figure 2.2. Interconnections have
the task of connecting the logic blocks and making the flow of signals inside the chip
possible. A logic block usually consists of lookup tables (LUT), flip flops (FF) and
multiplexers. The primary structure of an FPGA is a two-dimensional array of logic
blocks, interconnections and I/O blocks. Modern FPGAs usually also include on-die
processors, RAM blocks, digital signal processors (DSP) and embedded multipliers.
Figure 2.2: FPGA Architecture [34]
The main advantages of FPGAs are as follows:
• They are programmable, and their functionality can be changed by downloading
a new configuration file into the device. FPGAs can be considered as platforms
that can implement a custom instruction set for a target application, whereas
multiple instructions must be combined to perform the same operations on
CPUs, DSPs or GPUs.
• One of the main advantages of FPGAs is their support for pipelined
implementations. In FPGAs, parallelism is not necessarily achieved by replication
of compute units. The pipelining approach allows parallelism while maximising
hardware utilisation [34].
• Besides being re-programmable, the FPGA design process is considerably faster and
easier than ASIC design. The development cost is also significantly lower
compared to ASIC.
Figure 2.3: FPGA design flow
These features have made FPGAs very popular over the past decades and they are
employed in a wide range of applications like speech recognition, image processing,
video compression, ASIC prototyping and medical applications.
The FPGA design flow is shown in figure 2.3. Each step is explored in the following
paragraphs.
Design entry is performed using a schematic or a hardware description language
(HDL). With a schematic, the designer has to design the low-level hardware. As a
result, this technique can only be used in small projects, while HDLs such as Verilog
and VHDL can be used for more complex systems and make the design process faster.
More recently, this step can also be done in higher-level programming languages such as
C, letting C-to-FPGA compilers translate the C code into HDL. Such translation is
performed when developing FPGA-accelerated programs using OpenCL. With
higher-level languages, the designer has less control over the FPGA resources and may
not be able to utilise all of the available hardware resources compared to HDL designs,
but the design process is less time-consuming.
In the synthesis step, the design is translated into a circuit using a netlist. A netlist
lists the required logic elements and interconnections. First, a syntax check is applied
and then an optimisation process is performed in order to eliminate redundant logic
and reduce the size of the design. Next, the details of the design are planned, and the
timing properties are estimated.
The layout of the design is determined in the implementation step. In this stage, the
design is mapped into logic blocks of the FPGA, and then the IO blocks and logic
blocks are connected.
In the last step, the mapped and routed design is loaded into the FPGA using a
generated bitstream file.
In order to test the design, at the end of each step, a simulation can be performed.
Behavioural simulation is performed before synthesis to check the functionality of the
design. Functional simulation or post-synthesis simulation is a netlist level simulation
which ignores the timing. In timing simulation, wiring and delays are also taken into
account. This simulation usually is more time consuming but is the most accurate one
[35] [36] [37].
2.7 FPGA for Neural Networks
FPGAs have a parallel architecture which makes them suitable for massive convolution,
MAC, and other essential matrix operations in training neural networks [38]. They
also benefit from flexibility in design and short development process like software,
while having the performance closer to ASIC designs [33].
An FPGA-based neural network system consists of two parts: CPU part (host) and
FPGA part (device). These two parts are usually connected with PCIe connections.
FPGAs generally have on-chip SRAM (Static Random Access Memory), which is
usually not large enough to store the neural network parameters, so off-chip memory
has to be used. The performance of the system is usually bounded by the bandwidth and
power consumption of this external memory. An abstract structure of a typical FPGA
implementation of a neural network is illustrated in figure 2.4. The role of the host is
to monitor the FPGA and issue commands to it. The FPGA usually has a controller
which is responsible for communicating with the host and also issuing signals for other
modules in FPGA. This controller can be a finite state machine or a decoder [24].
Figure 2.4: A typical FPGA-based Neural Network [24]
In general, two main factors limit the performance of an FPGA-based neural network:
‚ On-chip resources.
‚ Off-chip memory bandwidth.
The main ideas proposed to make neural networks more suitable for FPGA
implementation fall into three categories [24] [25]:
‚ Precision reduction
‚ Sparsification
‚ Compression
In the following sections, precision reduction techniques are first explained in detail,
as this approach is employed in our implementation. Next, an overview of the other
two approaches is given. There are also other techniques that do not fall into these
categories. The common challenge in all the existing approaches is to find a good
trade-off between accuracy on the one hand and hardware speed and energy on the other.
For example, by reducing the size of each compute unit, we can place multiple replicas
of the unit on the FPGA and increase the throughput. One possible way to reduce the
compute unit size is to use fixed-point arithmetic, at the cost of some precision. It is
also worth mentioning that implementation details such as the data access pattern can
affect the efficiency of hardware utilisation.
2.7.1 Precision Reduction
To meet the computational requirements of training neural networks, domain-specific
accelerators with densely packed floating-point arithmetic units are being utilised.
One of the factors that limit the speed of training neural networks is the arithmetic
density of hardware platforms. As a result, there is a fair amount of ongoing research
on replacing floating-point arithmetic with fixed-point arithmetic, or on using fewer
bits, both of which increase the performance density [39].
Narrow-arithmetic techniques are widely used in hardware customised for neural
networks and are not specific to FPGA implementations. For example, the NVIDIA Tesla
V100 GPU [40] takes advantage of 16-bit/32-bit mixed-precision arithmetic, and Google
TPU v2 and v3 [41] use bfloat16, a 16-bit floating-point representation that has been
tailored for training neural networks and performs better than the IEEE half-precision
representation in neural network applications.
One of the main issues of implementing neural networks on FPGAs is selecting the
best numerical precision. Single- and double-precision floating-point representations
reduce the quantisation error because of their high precision, at the cost of a
significant amount of FPGA resources. For example, FP32 (single-precision
floating-point) provides 24 bits of mantissa precision (23 stored bits plus an implicit
leading bit), which is far more precision than is needed for our purpose. In [39],
ResNet 20 [42] was trained on CIFAR-10 using floating-point representations with
varying mantissa and exponent widths in order to observe the validation error. The
observations can be summarised as follows:
1. Convergence without precision loss, using an 8-bit mantissa.
2. Convergence with a small precision loss, using a 4-bit mantissa.
3. Divergence, using a 2-bit mantissa.
The authors also mention that the exponent width could not be narrowed, as it has a
significant impact on the representable range. FP16 (half-precision floating-point) is
denser than FP32 but still needs more hardware than fixed-point. In this representation,
11 bits of mantissa precision are available (10 stored bits plus an implicit leading
bit) and 5 bits are used for the exponent. FP16 suffers from a narrow representable
range. Fixed-point numbers, on the other hand, have lower precision and a narrower
range, which increases the quantisation error, but they require fewer FPGA
resources [39].
In [43], it is reported that training with fixed-point or half-precision floating-point
arithmetic gives mixed results because of the limited representable range. This creates
an important trade-off between hardware resources and precision. The precision affects
the accuracy of the neural network and also the speed of its convergence, but higher
precision requires more hardware. The challenge is to find a balance between the
required precision and the hardware resources.
Since there is a limited amount of resources available on FPGAs, and in order to make
efficient use of them, we aim to find the minimum viable precision and minimum viable
range. This is equivalent to finding the maximum amount of quantisation error that
can be tolerated without affecting the accuracy drastically. By using fewer bits for
neural network computations, we can also reduce the bandwidth requirement, which
is one of the main issues of the FPGA-based implementation of neural networks.
To summarise, using fewer bits and a simpler representation has the following
advantages, which make the FPGA implementation of neural networks more feasible:
‚ Lower memory requirement
‚ Lower computation cost
‚ Lower hardware requirement
‚ Lower bandwidth requirement
Considering these advantages, standard floating-point numbers are usually not the
best choice, and more area-efficient numeric representations such as 16-bit or 32-bit
fixed-point are used instead. In order to use low-precision computation, we have to
quantise the weights and activations of the neural network [44]. One of the simplest
techniques is to represent each parameter by the nearest fixed-point number. This
method suffers from overflow and underflow, because the range of floating-point values
is highly dynamic and easily exceeds the range representable with fixed-point. It has
been found that the range of the parameters of a neural network (weights and
activations) is limited within a single layer but differs between layers [45]. It is
also possible to use more bits for the first and last layers and to utilise ternary or
binary representations for the hidden layers [46].
Low-precision computation is widely used in the inference part of neural networks to
make the run-time faster, and they can usually reach 32-bit floating-point accuracy
[47] [48] [49] [50] [51] [52] [53] [54]. Even binary and ternary representations have been
used for inference [55] [56] [57].
In contrast, using low-precision computation in training neural networks usually
has an evident negative effect on accuracy. This is mainly due to the nature of
back-propagation and gradient-based methods [58] [59] [60] [61] [62] [63] [64] [65] [66].
For example, in stochastic gradient descent (SGD), which is a common optimiser used in
neural networks, many small noisy steps take place for each parameter update. Keeping
track of these small steps, and thus making SGD work at all, requires high precision
[67] [55]. One of the widely used methods to tackle this problem is to use high
precision for gradient accumulation and lower precision for the other parts of the
learning process [62] [64] [67] [68]. Gradient accumulators are updated frequently
during training, and storing them in low precision adversely affects the accuracy [69].
Training neural networks with end-to-end low precision has been done in [70], [71] and
[72], using different methods to restrict the range of activations and to select
quantisation points. It is also worth mentioning that in [71], the parameters of the
first and last layers of the networks are not quantised.
In general, we can conclude that while research on low-precision training reports many
benefits, it also reports a reduction in accuracy [73].
2.7.2 Sparsification
It is possible to reduce the number of MAC operations by removing some weights
of the network. There are different approaches to achieve this, including removing
the weights with a small absolute value [50] [74] or the weights with minimal impact on
the output. Another approach is to set the value of some weights to zero in order to
remove them. These techniques are beneficial for both computation cost and required
storage [24] [25].
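As a minimal sketch of the first approach (removing weights with a small absolute value), the following C function sets every weight whose magnitude falls below a threshold to zero. The function name `prune_by_magnitude` and the `threshold` parameter are illustrative assumptions; the sketch is not taken from [50] or [74], and practical schemes may combine this step with retraining.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: zeroes every non-zero weight with |w| < threshold
   and returns how many weights were removed. */
size_t prune_by_magnitude(double *weights, size_t count, double threshold)
{
    size_t pruned = 0;
    for (size_t i = 0; i < count; ++i) {
        if (weights[i] != 0.0 && fabs(weights[i]) < threshold) {
            weights[i] = 0.0;
            ++pruned;
        }
    }
    return pruned;
}
```

The zeroed weights only save computation and storage if the surrounding code skips them or stores the matrix in a sparse format, which this sketch leaves out.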
2.7.3 Compression
As we mentioned before, data movement and storage are two critical issues in imple-
menting FPGA-based neural networks. Using compression techniques helps with both
of these issues. In order to compress parameters of a neural network, both lossless
and lossy compression can be utilised. In some cases, codes are assigned to values
and a translation table should be used [50] [75] [76] [24] [25]. Using low-precision for
inference of neural networks can also be considered as a type of compression. In [77],
they proposed a greedy algorithm to encode the parameters of networks considering
the platform’s memory and the accuracy required for the given task. Also, in [76],
they used a hash function to compress and reduce the size of a trained model.
2.8 Fixed-point Arithmetic
As previously stated, fixed-point data types are widely used in FPGA implementation
of neural networks. In general, fixed-point numbers can be utilised whenever perfor-
mance is more critical than precision.
Two parameters are associated with a fixed-point data type definition:
‚ Bit width of representation. We refer to this as word length (WL).
‚ Number of fractional bits which determines the position of the binary point. We
refer to this as fraction length (FL).
We will use the notation fixed<WL, FL> in this report. We can also calculate the integer
length (IL), the representable range (RR) and the smallest representable positive
number ($\epsilon$) as follows:
\[ IL = WL - FL \tag{2.10} \]
\[ RR = [-2^{IL-1},\; 2^{IL-1} - 2^{-FL}] \tag{2.11} \]
\[ \epsilon = 2^{-FL} \tag{2.12} \]
The value $\epsilon$ is a crucial parameter of a fixed-point data type and is used
frequently in fixed-point arithmetic. It is important to highlight that, in a computer,
a bit pattern can represent different values in different number systems. When working
with fixed-point numbers, we should consider both the representation and the value of
a number. A given fixed<WL, FL> has FL fraction bits in its representation, and to
interpret the value of a given bit pattern we have to do the following:
1. Calculate the value of the bit pattern in two’s complement format.
2. Divide the calculated value by $2^{FL}$ (or multiply it by $\epsilon$).
Consider the following examples with the fixed<16, 10> data type:
$\epsilon$ representation: 0000000000000001 (2.13)
$\epsilon$ value: $2^{-10}$ (2.14)
representation: 0101110010001001 (2.15)
value: $23689 \cdot 2^{-10} = 23.1337890625$
representation: 1001000110100010 (2.16)
value: $-28254 \cdot 2^{-10} = -27.591796875$
It is also worth mentioning that the two’s complement representation can be considered
a special case of the fixed-point representation with $FL = 0$.
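The two interpretation steps can be sketched in C as follows. The helper name `fixed16_value` is an assumption made for this illustration and is not part of the project code; it decodes a 16-bit pattern as a two's complement integer and then scales it by $2^{-FL}$.

```c
#include <stdint.h>

/* Hypothetical helper: value of a 16-bit pattern read as fixed<16, fl>. */
static double fixed16_value(uint16_t bits, int fl)
{
    /* Step 1: two's complement decode, written out explicitly so that it
       does not rely on implementation-defined integer conversions. */
    int32_t raw = (bits & 0x8000u) ? (int32_t)bits - 65536 : (int32_t)bits;
    /* Step 2: divide by 2^fl, which is the same as multiplying by epsilon. */
    return (double)raw / (double)(1 << fl);
}
```

For the bit patterns of examples (2.13) to (2.16), written in hexadecimal as 0x0001, 0x5C89 and 0x91A2, the function returns $2^{-10}$, 23.1337890625 and -27.591796875 respectively when called with `fl` set to 10.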
As we mentioned earlier, although using fixed-point data types can save a considerable
amount of hardware resources and computation, the precision and representable range
of a floating-point data type with equivalent word length is significantly higher.
Data Type                                            Representable Range                            Smallest Positive Representable Value
32-bit floating-point (IEEE 754 single precision)    $[-3.4 \times 10^{38}, +3.4 \times 10^{38}]$    $1.18 \times 10^{-38}$
fixed<32, 18>                                        $[-2^{13}, 2^{13} - 2^{-18}]$                   $2^{-18}$
Table 2.1: Comparing representable range and smallest positive representable number
in floating-point and fixed-point data types
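As a worked instance of equations (2.10) to (2.12), the fixed-point row of Table 2.1 and the fixed<16, 10> type used in the earlier examples can be evaluated numerically. The decimal values below are rounded and are given only for illustration.

```latex
% fixed<32, 18>
IL = 32 - 18 = 14, \qquad
RR = [-2^{13},\; 2^{13} - 2^{-18}] \approx [-8192,\; 8191.999996], \qquad
\epsilon = 2^{-18} \approx 3.81 \times 10^{-6}

% fixed<16, 10>
IL = 16 - 10 = 6, \qquad
RR = [-2^{5},\; 2^{5} - 2^{-10}] = [-32,\; 31.9990234375], \qquad
\epsilon = 2^{-10} = 0.0009765625
```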
2.8.1 Rounding Methods
As fixed-point data types have lower precision than floating-point data types, a
rounding procedure is required when converting from floating-point to fixed-point.
Different rounding methods can be used for this conversion. The methods implemented in
this project are the following:
‚ Downward rounding: Rounds to the nearest representable number which is
smaller than the input.
‚ Upward rounding: Rounds to the nearest representable number which is bigger
than the input.
‚ Nearest rounding: Rounds to the nearest representable number to the input.
‚ Stochastic rounding: Rounds to the nearest representable number greater or less
than the input based on a probability calculated proportionally to the distance.
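Given a number $x$ and a fixed-point representation fixed<WL, FL>, we define
$\lfloor x \rfloor$ as the largest integer multiple of $\epsilon = 2^{-FL}$ less than or
equal to $x$. The above methods can be mathematically defined as:
\[ \mathrm{downwardRound}(x) = \lfloor x \rfloor \tag{2.17} \]
\[ \mathrm{upwardRound}(x) = \lfloor x \rfloor + \epsilon \tag{2.18} \]
\[ \mathrm{nearestRound}(x) =
\begin{cases}
\lfloor x \rfloor & \text{if } \dfrac{x - \lfloor x \rfloor}{\epsilon} < 0.5 \\
\lfloor x \rfloor + \epsilon & \text{otherwise}
\end{cases} \tag{2.19} \]
\[ \mathrm{stochasticRound}(x) =
\begin{cases}
\lfloor x \rfloor & \text{with probability } 1 - \dfrac{x - \lfloor x \rfloor}{\epsilon} \\
\lfloor x \rfloor + \epsilon & \text{with probability } \dfrac{x - \lfloor x \rfloor}{\epsilon}
\end{cases} \tag{2.20} \]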
Fixed-point quantisation with stochastic rounding has shown promising results in
training neural networks with gradient-based methods. However, it has the overhead
of pseudo-random number generator compared to nearest rounding [44] [78].
2.9 Development Stack
2.9.1 OpenCL
OpenCL (Open Computing Language) is a free standard for parallel-programming
across heterogeneous processing platforms including CPUs, GPUs, DSPs, FPGAs and
other processors or hardware accelerators. This framework is used to write portable
yet efficient programmes.
The OpenCL specification [79] uses four models to describe OpenCL concepts:
‚ Platform Model
‚ Memory Model
‚ Execution Model
‚ Programming Model
Figure 2.5: OpenCL platform model
Platform Model
The platform model contains a host connected to one or more devices, as shown in
figure 2.5. Each OpenCL device consists of one or more compute units. To perform
computations on the compute units of a device, the relevant commands have to be
submitted from the host.
Execution Model
OpenCL programmes execute in two parts:
1. Kernels that execute on devices (CPUs, GPUs, FPGAs and DSPs)
2. Host program that executes on the host (Usually a general-purpose CPU)
The unit of concurrent execution in the OpenCL standard is the work-item. Each
work-item executes the kernel body. A single iteration of a loop is usually mapped to a
work-item. Work-items are divided into work-groups. The host program manages the
execution of kernels by defining and controlling a context. The context includes the
following:
‚ Devices: A list of available devices.
‚ Kernels: Functions that run on devices.
‚ Program Objects: The executables of the kernels.
‚ Memory Objects: Memory objects that are visible to the host and the devices.
The host creates one or more command-queues for each device and enqueues different
commands to them to manage the execution of kernels. Different queues run inde-
pendently and concurrently. Commands can be kernel execution commands, memory
commands or synchronisation commands. A command queue can accept all the com-
mand types and schedules them. Commands in a command queue can execute in-order
or out-of-order relative to each other. Whenever a kernel execution or a memory com-
mand is submitted to a queue, an event is created. These events can be used by the
host program and other commands. Using these events, execution of different com-
mands and their dependencies on each other can be orchestrated and synchronisation
points of host program and kernels are managed.
Memory Model
Four different memory regions are defined in OpenCL:
1. Global Memory: All work-items in all work-groups have read/write access to
this memory. Host can access global memory.
2. Local Memory: This memory is local to a work-group and all work-items within
a work-group have read/write access to this memory.
3. Constant Memory: A subset of global memory that does not change during the
execution of a kernel.
4. Private Memory: This memory is private to a work-item and is not visible to
other work-items.
Programming Model
OpenCL supports data-parallel and task-parallel programming models. With data
parallelism, the same sequence of instructions is applied to different elements of
memory. With task parallelism, a kernel is executed using a single work-item, and
multiple tasks can then be enqueued to achieve parallelism.
2.9.2 Intel® FPGA SDK for OpenCL
Intel FPGA SDK for OpenCL [80] provides a compiler and a set of tools for building
and running OpenCL programmes on Intel FPGA products. Two main components
of applications implemented using Intel FPGA SDK for OpenCL are:
‚ Bitstream for programming FPGA
‚ Host program for managing the application flow and FPGA
Figure 2.6: Diagram of the Intel FPGA SDK for OpenCL programming model [80]
This software development kit includes two compilers: a C++ compiler and the AOC
compiler. AOC is a specialised offline compiler that compiles the C code written for the
FPGA (the OpenCL kernel) to generate either an emulator executable or a hardware
programming image. The regular C++ compiler generates the executable that runs on the
host. As shown in figure 2.6, the offline compiler first compiles the OpenCL kernels to
an FPGA image file with the .aocx extension. This image file is then used by the host to
program the FPGA. The C++ compiler on the host side compiles the host program
and links it to the run-time libraries of Intel FPGA SDK for OpenCL. The host
application, which has the task of programming the FPGA with the hardware image and
executing it, is then run on the host.
As we explained, in order to program an Intel FPGA the following components should
work jointly:
‚ The host compiler
‚ The host application
‚ The offline compiler
‚ The OpenCL kernel(s)
‚ The custom platform
When a kernel is compiled, a custom dataflow circuit is generated, and a compute unit
is built from different pre-optimised components, including load/store units, arithmetic
units and flow control units. These components are connected together according to
the dataflow that is implied by the kernel(s). One of the main advantages of
Intel FPGA SDK for OpenCL is that it enables the use of FPGAs without requiring
programs written in hardware description languages such as VHDL or Verilog.
Another important feature of Intel FPGA SDK for OpenCL is the detailed report it
generates in HTML format, which contains plenty of information about resource
and area usage alongside performance bottlenecks. We will discuss the information
provided in this report in more detail in section 2.9.3.
2.9.3 Intel AOC Compiler Report
Intel recommends using single work-item kernels, which are called tasks, instead of
NDRange kernels. When a kernel is written as a task, the Intel FPGA SDK for
OpenCL compiler is able to apply pipeline parallelism heavily to the iterations of the
loops in order to achieve high throughput.
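The difference between the two kernel styles can be illustrated with a vector addition. The sketch below is written so that it compiles as plain C: the macros `kernel` and `global` and the function `get_global_id` are stand-ins defined here for emulation only, whereas in a real OpenCL C source file they are provided by the language. The kernel names and `emulate_ndrange` are hypothetical and do not come from this project.

```c
#include <stddef.h>

/* Emulation stand-ins; these definitions would not appear in a .cl file. */
#define kernel
#define global
static size_t emulated_global_id;
static size_t get_global_id(unsigned int dim)
{
    (void)dim; /* only one dimension is emulated */
    return emulated_global_id;
}

/* NDRange style: every work-item adds one pair of elements. */
kernel void vec_add_ndrange(global const float *a, global const float *b,
                            global float *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}

/* Single work-item (task) style: one work-item runs the whole loop, which
   gives the offline compiler a loop whose iterations it can pipeline. */
kernel void vec_add_task(global const float *a, global const float *b,
                         global float *c, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Host-side emulation of launching the NDRange kernel with n work-items. */
void emulate_ndrange(const float *a, const float *b, float *c, size_t n)
{
    for (emulated_global_id = 0; emulated_global_id < n; ++emulated_global_id)
        vec_add_ndrange(a, b, c);
}
```

Both versions compute the same result; they differ in where the iteration space lives. In the NDRange version it is defined by the launch configuration, while in the task version it is an explicit loop inside the kernel.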
The kernel’s report.html file provides analytical data about the kernel, including
memory and area usage as well as pipeline information. The report consists of a summary
section and three categories of information:
1. Summary
2. Throughput Analysis
3. Area Analysis
4. System Viewers
We discuss each of these sections in more detail in the following.
Summary Report
Provides an overview of the design, including compile information such as target FPGA
family, device and board and AOC version, basic information about kernels including
the number of their compute units and their resource usage and achieved fmax of the
design.
Throughput Report
The information in this section includes the fmax, bottleneck summary, loop analysis
and latency estimation which aids the developer with optimising the kernel. This
section is divided into two parts:
‚ Loop Analysis: Provides useful information for all of the loops, including
whether the loops are pipelined or not, and their II (initiation interval). This
information helps the developer to maximise the throughput of the designed kernel.
Pipelined loops make efficient use of the hardware by keeping more resources occupied
and by processing several chunks of data concurrently.
The initiation interval (II) is an important parameter in loop pipelining. When a
loop is pipelined, the next iteration begins before the previous one has finished.
II is the number of clock cycles that must pass before a new iteration can be
launched, that is, the number of clock cycles needed to resolve the
dependencies between iterations of the loop. Smaller values of II are therefore
desirable. In the best case, II is equal to one, which means that one
loop iteration is launched every clock cycle.
‚ Fmax Report: The scheduled fmax of all the blocks is provided in this section.
The maximum frequency at which the output of registers is updated is called
the fmax. The duration of the clock cycle is limited by the physical propaga-
tion delay of signals between two successive registers which is a function of the
complexity of logic of the path. The path with the highest delay limits the speed
of the entire circuit and is called the critical path. The fmax is calculated as
the inverse of the critical path delay. A high value of fmax means higher per-
formance when there is no other bottleneck and hence is desirable. The AOC
compiler tries to optimise the design to achieve the highest possible fmax. When
the value of desired fmax and II are not specified in the design, the compiler uses
a heuristic to achieve the best fmax/II trade-off.
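The interaction between II and fmax can be made concrete with a commonly used first-order model of a pipelined loop. The model and the symbols $N$ (number of iterations), $L$ (latency of one iteration in clock cycles) and $t_{\text{critical}}$ (critical path delay) are assumptions introduced for this illustration; they are not quantities quoted from the compiler report.

```latex
T_{\text{cycles}} \approx L + II \cdot (N - 1), \qquad
f_{\max} = \frac{1}{t_{\text{critical}}}, \qquad
t_{\text{loop}} \approx \frac{T_{\text{cycles}}}{f_{\max}}

% Hypothetical example with N = 1000 and L = 20:
II = 1: \; T_{\text{cycles}} \approx 20 + 1 \cdot 999 = 1019
\qquad
II = 4: \; T_{\text{cycles}} \approx 20 + 4 \cdot 999 = 4016
```

Under this model, a design with a lower II can outperform one with a higher fmax, and vice versa, which is the fmax/II trade-off mentioned above.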
Area Report
This report provides details about the resource utilisation of the kernels, which helps
with optimising the kernel to be more area efficient. The resource usage information
is available in three levels of hierarchy:
‚ System area: Resources that are utilised by all kernels, including global inter-
connects and board interface.
‚ Kernel area: Resources that are utilised by each of the kernels of the design,
including kernel dispatch logic.
‚ Block area: Resources that are utilised by each of the blocks inside the kernels.
Each Block is usually a branch-free section of the code like a loop body.
System Viewers
This section presents a graphical representation of the generated hardware. This
section has three parts:
‚ Graph viewer: Provides graphical report that includes information about sizes
of loads and stores, latency and stalls.
‚ Kernel memory viewer: Demonstrates the AOC compiler interpretation of the
data movements of the kernels.
‚ Schedule viewer: Illustrates the scheduled cycle and latency of a group of in-
structions in the design.
2.9.4 Intel® FPGA Devcloud
Intel FPGA Devcloud [81] is an Intel-hosted cloud service which provides Intel Xeon
processors and FPGA acceleration cards for developers to devise and test their
designs. This cloud infrastructure allows users to experiment with the functionality of
their designs on high-end FPGA accelerator cards. Access to this service is
granted upon request. Users benefit from remote access to Intel servers which are
equipped with:
‚ Latest Intel FPGA programmable acceleration cards like Intel Stratix 10 and
Intel Arria 10 devices
‚ Intel Core processors 6th to 8th generation
‚ Intel optimised frameworks and libraries
‚ Software tools needed for FPGA design, development and workload testing, in-
cluding Intel FPGA SDK for OpenCL.
Chapter 3
Software Design
As we mentioned in the previous sections, there are some limitations in gradient-based
methods that encouraged us to look for a better solution to solve the optimisation
problem of neural networks. The proposed method in [1] which is based on [23],
applies ADMM [18] for training feed-forward neural networks in order to make this
process more feasible for hardware implementation. This method implements a simple
version of LSMR [19] as an iterative least-squares method to avoid performing matrix
inversion. As it has been discussed in [1], the key characteristics of this method are
the following:
‚ Since the method does not use gradient-based optimisers, their sequential
dependency is avoided, and the method is parallel by nature. To be more specific,
lines 3 and 4 in algorithm 3 (the weight update and activation update procedures)
can run in parallel since they do not depend on each other.
‚ The proposed method had the potential of being combined with fixed-point arith-
metic, and since back-propagation is not used in this method, it was expected
that the accuracy would not be severely affected.
‚ By using an iterative least-squares method, there is no need to perform matrix
inversion, which is the only obstacle for hardware-implementation of ADMM-
based training method. In this implementation, LSMR is used to avoid matrix
inversion and pseudo-inversion.
‚ LSMR is perfectly suitable for pipeline parallelism as it is an iterative method.
This can be observed in algorithm 4. It also allows independent computation of
each column of the result.
The pseudo-code of ADMM-LSMR method for training feed-forward neural networks
and the implemented LSMR can be seen in the algorithms 3 and 4 respectively.
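Algorithm 3: ADMM-LSMR for Neural Networks [1]
1  while not converged do
2      for $l = 1, 2, \ldots, L-1$ do
3          $W_l \leftarrow \text{weight update}(z_l, x_{l-1})$
4          $x_l \leftarrow \text{activation update}(W_{l+1}, z_{l+1}, z_l, \beta, \gamma)$
5          $z_l \leftarrow \arg\min_{z} \gamma_l \|x_l - h_l(z)\|_2^2 + \beta_l \|z - W_l x_{l-1}\|_2^2$
6      end
7      $W_L \leftarrow \text{weight update}(z_L, x_{L-1})$
8      $z_L \leftarrow \arg\min_{z} \ell(z, y) + \beta_L \|z - W_L x_{L-1}\|_2^2 + \lambda^T (z - W_L x_{L-1})$
9      $\lambda \leftarrow \lambda + \beta_L (z_L - W_L x_{L-1})$
10 end
11 Function weight update
       Input: $z_l \in \mathbb{R}^{m \times n}$, $x_{l-1} \in \mathbb{R}^{p \times n}$
       Output: $W_l \in \mathbb{R}^{m \times p}$
12     for $i = 1, 2, \ldots, m$ do
13         $W_l^T[:, i] \leftarrow \text{LSMR}(x_{l-1}^T, z_l^T[:, i])$
14     end
15 end
16 Function activation update
       Input: $W_{l+1} \in \mathbb{R}^{m \times n}$, $z_{l+1} \in \mathbb{R}^{m \times p}$, $z_l \in \mathbb{R}^{n \times p}$, $\beta$, $\gamma$
       Output: $x_l \in \mathbb{R}^{n \times p}$
17     $\mathit{part1} \leftarrow \gamma_l I + \beta_{l+1} W_{l+1}^T W_{l+1}$
18     $\mathit{part2} \leftarrow \gamma_l h_l(z_l) + \beta_{l+1} W_{l+1}^T z_{l+1}$
19     for $i = 1, 2, \ldots, p$ do
20         $x_l[:, i] \leftarrow \text{LSMR}(\mathit{part1}, \mathit{part2}[:, i])$
21     end
22 end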
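Algorithm 4: LSMR
Function LSMR
    Input: $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$
    Output: $x \in \mathbb{R}^{n}$
    $\beta_1 \leftarrow \|b\|_2$, $u_1 \leftarrow b / \beta_1$
    $\alpha_1 \leftarrow \|A^T u_1\|_2$, $v_1 \leftarrow A^T u_1 / \alpha_1$
    $\bar{\zeta}_1 \leftarrow \alpha_1 \beta_1$, $\bar{\alpha}_1 \leftarrow \alpha_1$
    $\rho_1 \leftarrow \bar{\rho}_1 \leftarrow \bar{c}_1 \leftarrow 1$, $\bar{s}_1 \leftarrow \zeta_1 \leftarrow 0$
    $h_1 \leftarrow v_1$
    $x_1 \leftarrow \bar{h}_1 \leftarrow 0$
    for $k = 1, 2, \ldots, \min(m, n)$ do
        $\beta_{k+1} \leftarrow \|A v_k - \alpha_k u_k\|_2$
        $u_{k+1} \leftarrow (A v_k - \alpha_k u_k) / \beta_{k+1}$
        $\alpha_{k+1} \leftarrow \|A^T u_{k+1} - \beta_{k+1} v_k\|_2$
        $v_{k+1} \leftarrow (A^T u_{k+1} - \beta_{k+1} v_k) / \alpha_{k+1}$
        $c_{k+1}, s_{k+1}, \rho_{k+1} \leftarrow \text{sym}(\bar{\alpha}_k, \beta_{k+1})$
        $\bar{\alpha}_{k+1} \leftarrow c_{k+1} \alpha_{k+1}$
        $\bar{c}_{k+1}, \bar{s}_{k+1}, \bar{\rho}_{k+1} \leftarrow \text{sym}(\bar{c}_k \rho_{k+1}, s_{k+1} \alpha_{k+1})$
        $\zeta_{k+1} \leftarrow \bar{c}_{k+1} \bar{\zeta}_k$, $\bar{\zeta}_{k+1} \leftarrow -\bar{s}_{k+1} \bar{\zeta}_k$
        $\bar{h}_{k+1} \leftarrow -\dfrac{\bar{s}_k \rho_{k+1} \rho_{k+1}}{\rho_k \bar{\rho}_k} \bar{h}_k + h_k$
        $x_{k+1} \leftarrow \dfrac{\zeta_{k+1}}{\rho_{k+1} \bar{\rho}_{k+1}} \bar{h}_{k+1} + x_k$
        $h_{k+1} \leftarrow -\dfrac{s_{k+1} \alpha_{k+1}}{\rho_{k+1}} h_k + v_{k+1}$
    end
end
Function sym
    Input: $a$, $b$
    Output: $c$, $s$, $r$
    if $\text{abs}(b) > \text{abs}(a)$ then
        $\tau \leftarrow a / b$
        $s \leftarrow \text{sign}(b) / \sqrt{1 + \tau^2}$
        $c \leftarrow s \tau$, $r \leftarrow b / s$
    end
    else
        $\tau \leftarrow b / a$
        $c \leftarrow \text{sign}(a) / \sqrt{1 + \tau^2}$
        $s \leftarrow c \tau$, $r \leftarrow a / c$
    end
end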
3.1 C Implementation
As the first step towards a hardware implementation, a low-level C implementation was
required. Each of the sub-procedures of the method (LSMR, weight update,
activation update, output update and lagrangian update) has been implemented
and tested individually, and its output has been compared against the corresponding
Python module implemented in [1] as a sanity check.
3.1.1 Motivation
A Python version of the proposed method had been implemented in [1]. This
implementation relied heavily on the NumPy library for matrix operations, so the
underlying details were out of the control of the developer. First of all, a C
implementation is required for an OpenCL accelerated program, as the host program can
only be written in C or C++. Secondly, the parts which were to be accelerated also had
to be implemented in C, because a high-level Python implementation cannot be
used as a reference to measure the speed-up of the accelerated version. Finally, the
device kernel programming language is derived from the C language, so a C
implementation, which is easier to test and modify, can be converted to an
OpenCL kernel with minimal effort.
3.1.2 Code Structure
This program takes the number of hidden layers in the network and the number of
neurons in each layer as input.
In order to implement algorithms 3 and 4 in C, we implemented a data structure for
storing matrices and also functions to perform primary matrix operations. In this
implementation, we used double data type to store the data of matrices. We defined
the matrix data type as a struct:
    typedef struct
    {
        int rows;     /* number of rows */
        int cols;     /* number of columns */
        double *data; /* elements, stored row-wise */
    } matrix;
In this struct, we store the number of rows and columns of the matrix and a pointer
to where elements of the matrix are stored row-wise in the memory. Numerous ma-
trix operations were required to be implemented for this matrix datatype. There was
no particular challenge associated with this implementation apart from dealing with
low-level concepts of C language and avoiding memory leaks.
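To illustrate how such a struct can be used, the following sketch shows allocation, release, row-wise element access and a matrix product. The function names `matrix_create`, `matrix_free`, `matrix_at` and `matrix_multiply` are assumptions made for this illustration and are not necessarily those of the actual implementation.

```c
#include <stdlib.h>

typedef struct
{
    int rows;
    int cols;
    double *data; /* rows * cols elements, stored row-wise */
} matrix;

/* Allocates a zero-filled rows x cols matrix. On failure, data is NULL. */
matrix matrix_create(int rows, int cols)
{
    matrix m = { rows, cols, NULL };
    m.data = calloc((size_t)rows * (size_t)cols, sizeof *m.data);
    return m;
}

/* Releases the element storage; calling it twice on a matrix is harmless. */
void matrix_free(matrix *m)
{
    free(m->data);
    m->data = NULL;
    m->rows = m->cols = 0;
}

/* Pointer to element (i, j): row i starts at offset i * cols. */
double *matrix_at(const matrix *m, int i, int j)
{
    return &m->data[(size_t)i * (size_t)m->cols + (size_t)j];
}

/* Computes c = a * b into a pre-allocated c.
   Returns 0 on success and -1 if the shapes do not match. */
int matrix_multiply(const matrix *a, const matrix *b, matrix *c)
{
    if (a->cols != b->rows || c->rows != a->rows || c->cols != b->cols)
        return -1;
    for (int i = 0; i < a->rows; ++i) {
        for (int j = 0; j < b->cols; ++j) {
            double sum = 0.0;
            for (int k = 0; k < a->cols; ++k)
                sum += *matrix_at(a, i, k) * *matrix_at(b, k, j);
            *matrix_at(c, i, j) = sum;
        }
    }
    return 0;
}
```

Resetting the pointer in `matrix_free` and pairing every `matrix_create` with exactly one release is one simple discipline for avoiding the memory leaks mentioned above.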
3.1.3 Bottleneck Analysis
Using the C implementation, we measured the execution time of each procedure in a
single training iteration of a 4-layer neural network with a hidden size of 28 on a
subset of the HIGGS data set [82], and of a 3-layer neural network with a hidden size
of 8 on the IRIS data set [83]. These measurements were performed for 3000 iterations.
The percentage of execution time associated with each of the procedures is reported in
tables 3.1 and 3.2 and figure 3.1. As is evident from the results, the most
time-consuming sub-procedures are activation update and weight update. Considering
algorithm 3, we concluded that these procedures are considerably more time-consuming
because of their many LSMR calls. Therefore, we chose the LSMR function as the primary
target for hardware acceleration.
activation update    weight update    output update    lagrangian update
54.35 %              39.46 %          6.12 %           0.05 %
Table 3.1: Percentage of the execution time of different procedures in one iteration on
HIGGS
activation update    weight update    output update    lagrangian update
68.06 %              29.98 %          4.69 %           0.21 %
Table 3.2: Percentage of execution time of different procedures in one iteration on IRIS
[Figure 3.1, two pie charts. Procedures Time Percentage on HIGGS dataset: weight update
40%, activation update 54%, output update 6%. Procedures Time Percentage on IRIS
dataset: weight update 27%, activation update 68%, output update 5%.]
Figure 3.1: Pie chart of execution time of different procedures in one iteration
3.2 Implementation of Fixed-point Arithmetic
Fixed-point arithmetic is widely used in FPGA implementation of neural networks,
for both inference and training and is considerably faster and more efficient compared
to floating-point. In the following sections, we discuss the motivations, challenges and
the code structure of our implementation.
3.2.1 Motivation
Since fixed-point arithmetic is composed of simpler data types and operations than
floating-point arithmetic, it is widely used for general speed-up optimisations and
especially for hardware designs, because it requires less silicon area. Nevertheless,
depending on the algorithm, the narrower range and lower precision may adversely
affect accuracy.
In stochastic gradient descent, which is the basic form of gradient-based methods, the
parameter space is explored using small and noisy steps. Such exploration
demands relatively high precision during the updates of the SGD algorithm. By
observing the implemented low-precision algorithms [84] and also considering the
theoretical upper bound on the performance of low-precision SGD [85], we can conclude
that the precision-accuracy trade-off has limited the performance of current training
algorithms.
As ADMM-LSMR does not involve gradient calculation and, being parallel, avoids
heavily sequential processes in which the required precision accumulates from step to
step, we conclude that our method would be much less vulnerable to fixed-point errors
than SGD.
Also, it has been proposed that noise in neural networks acts as a form of regularisation
and can help the model to generalise better. This concept has been assessed in previous
works such as Dropout [86] [87], DropConnect [88] and BinaryConnect [67].
Considering the above and the fact that fixed-point arithmetic requires fewer hardware
resources and is faster, we deduced that employing this technique would probably
result in an overall improvement in the hardware-accelerated ADMM-LSMR algorithm. We
implemented both 16bit and 32bit fixed-point with four different rounding methods.
3.2.2 Challenges
The challenges in the process of developing the fixed-point arithmetic and embedding
it in matrix operations can be summarised as:
• Numerous edge cases to be considered
• Keeping track of fraction bits in chained operations
• Boundary checks
• Overflow checks and precision of temporary containers
• Challenging bit-wise operations and generalisation to support flexible change of
fractional bits
• Difficult testing procedure
3.2.3 Fixed-point Arithmetic Details
First, we defined our fixed-point data type and associated arithmetic functions. Also,
we implemented a set of conversion functions for each rounding method. Next, we im-
plemented a data structure for storing matrices with our fixed-point data type. Finally,
we implemented the fixed-point version of all the relevant matrix functions. In the fol-
lowing sections, details of selected functions are explained. We will only demonstrate
the detail for stochastic rounding method as it was the most complicated version. We
also only discuss the implementation of 32bit fixed-point. 16bit implementation is
similar to the 32bit version with minor differences.
Fixed-point Data Type
Arithmetic on fixed-point numbers is almost identical to integer arithmetic with some
adjustments. We used the types int32_t and int16_t to define the 32bit and 16bit
fixed-point data types.
#define WL 32 // Word Length
#define Fixed int32_t
#define FL 18 // Fraction Length
#define IL (WL - FL) // Integer Length (parenthesised so it expands safely)
#define Epsilon pow(2, -FL)
#define Ubound_value (float)(pow(2, IL - 1) - pow(2, -FL))
#define Lbound_value (float)(-(1 << (IL - 1)))
#define Ubound (Fixed)((1LL << (WL - 1)) - 1)
#define Lbound (Fixed)(-(1LL << (WL - 1))) // 1LL avoids overflowing a 32bit int
#define ONE_F (Fixed)(1 << FL)
#define MINUS_ONE_F ((Fixed)(((1 << FL) - 1) ^ 0xFFFFFFFF))
We also defined important constants such as WL, FL and IL. These constants define the
fixed-point data type fixed⟨WL, FL⟩. Another important constant is $\epsilon$. As we
mentioned in section 2.8, $\epsilon = 2^{-FL}$ is the smallest positive number that can
be represented given a fixed-point data type.
As previously stated in section 2.8, we should consider both the representation and
the value when working with fixed-point numbers. Ubound and Lbound define the
representation of the boundaries in our data type. Since our representation is the same
as two's complement, Ubound is represented by setting all bits except the leftmost bit
to 1, and Lbound is represented by setting the leftmost bit to 1 and the others to
zero. Ubound_value and Lbound_value store the values of the boundaries as floats. The
representations of 1 and −1 are also stored in ONE_F and MINUS_ONE_F.
Considering fixed⟨32, 18⟩, which is defined above, these constants can be written as
below:
Ubound = 01111111111111111111111111111111 (3.1)
Lbound = 10000000000000000000000000000000 (3.2)
ONE_F = 00000000000001000000000000000000 (3.3)
MINUS_ONE_F = 11111111111111000000000000000000 (3.4)
To convert a given float number to fixed-point, first, a boundary check is required. This
boundary check takes place in convert function and either saturation to boundaries is
applied, or a specified rounding function is called.
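$$\mathrm{convert}(x) = \begin{cases} \text{Ubound} & \text{if } x \ge \text{Ubound\_value} \\ \text{Lbound} & \text{if } x \le \text{Lbound\_value} \\ \mathrm{round\_f}(x) & \text{otherwise} \end{cases} \tag{3.5}$$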
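The boundary check of equation 3.5 can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the project's code: it assumes the fixed⟨32, 18⟩ format, uses round-to-nearest as the rounding function (the stochastic variant would need a random source), and the names UBOUND_VALUE, LBOUND_VALUE and round_nearest are hypothetical.

```c
#include <stdint.h>

#define FL 18                      /* fraction length of fixed<32, 18> */
typedef int32_t Fixed;

#define UBOUND ((Fixed)INT32_MAX)  /* representation of the upper boundary */
#define LBOUND ((Fixed)INT32_MIN)  /* representation of the lower boundary */

/* Real values of the two boundaries: 2^13 - 2^-18 and -2^13. */
#define UBOUND_VALUE ((double)INT32_MAX / (double)(1 << FL))
#define LBOUND_VALUE ((double)INT32_MIN / (double)(1 << FL))

/* Hypothetical rounding helper: round to nearest, ties away from zero.
 * Only called for in-range inputs, so the cast cannot overflow. */
static Fixed round_nearest(double x)
{
    double scaled = x * (double)(1 << FL);
    return (Fixed)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Saturate out-of-range inputs, round everything else. */
static Fixed convert(double x)
{
    if (x >= UBOUND_VALUE) return UBOUND;
    if (x <= LBOUND_VALUE) return LBOUND;
    return round_nearest(x);
}
```

With this sketch, convert(1.0) yields the same bit pattern as ONE_F, and any input beyond roughly ±8192 saturates to a boundary.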
Two essential functions were implemented to cast 64bit fixed-point numbers to our
defined data type: one for inputs with FL bits of fraction and one for inputs with
2 * FL bits of fraction. These two functions are frequently used in the fixed-point
arithmetic. The pseudo-code of these functions can be seen in algorithm 5.
The function cast_f64_simple performs an overflow check, and then the input is either
saturated to the boundaries or its leftmost 32 bits are discarded.
The other function, cast_f64, is more complicated. As can be seen in algorithm 5,
a boundary check is performed first. To perform this check, we cast Lbound and
Ubound to int64_t, so that they become 64 bits wide, and then shift them left by FL
bits to align them with the input. After the boundary check, if the input is in
range, we change it to fit in 32 bits. Since the input number has 2 * FL bits of fraction,
the rightmost FL bits cannot be represented in our target data type. We extract these
bits using SHIFT and AND operations and store them in diff. We can write:
$$x - \lfloor x \rfloor = \mathrm{diff} \cdot \epsilon^{2} \tag{3.6}$$
where $\lfloor x \rfloor$ is defined as the largest integer multiple of $\epsilon$ less than or equal to $x$. We
use $\mathrm{diff} \cdot \epsilon = \frac{x - \lfloor x \rfloor}{\epsilon}$ to compute the rounded version of the input using formula 2.20.
Three functions were also implemented to perform the primary operations on our defined
fixed-point data type. The pseudo-code of these functions is provided in algorithm
6.
To add two 32bit fixed-point numbers, we simply cast each of them to 64 bits and then
add them to avoid overflow. In the end, the cast_f64_simple function is called to fit
the result in 32 bits.
Algorithm 5: Casting 64bit fixed-point to 32bit fixed-point
Function cast_f64_simple
    Input: int64_t x
    Output: Fixed out
    if x ≤ (int64_t) Lbound then
        out ← Lbound
    else if x ≥ (int64_t) Ubound then
        out ← Ubound
    else
        out ← (Fixed) x
    end
end
Function cast_f64
    Input: int64_t x
    Output: Fixed out
    if x ≤ ((int64_t) Lbound) ≪ FL then
        out ← Lbound
    else if x ≥ ((int64_t) Ubound) ≪ FL then
        out ← Ubound
    else
        diff ← (x & (int64_t)((1 ≪ FL) − 1))
        prob ← 1 − diff * ε
        if random ≤ prob then
            out ← (Fixed)(x ≫ FL)
        else
            out ← ((Fixed)(x ≫ FL) + 1)
        end
    end
end
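A possible C rendering of the two casts is sketched below. It is an illustration under assumptions rather than the original implementation: the random sample r is passed in by the caller (uniform in [0, 1)) so that the function is deterministic to test, and the boundaries are aligned by multiplication instead of a left shift because shifting negative values is undefined in C.

```c
#include <stdint.h>

#define FL 18
typedef int32_t Fixed;

#define UBOUND ((Fixed)INT32_MAX)
#define LBOUND ((Fixed)INT32_MIN)

/* Input already has FL fraction bits: saturate or narrow to 32 bits. */
static Fixed cast_f64_simple(int64_t x)
{
    if (x <= (int64_t)LBOUND) return LBOUND;
    if (x >= (int64_t)UBOUND) return UBOUND;
    return (Fixed)x;
}

/* Input has 2*FL fraction bits: saturate, then round stochastically.
 * r is an assumed caller-supplied sample, uniform in [0, 1). */
static Fixed cast_f64(int64_t x, double r)
{
    const int64_t scale = (int64_t)1 << FL;

    if (x <= (int64_t)LBOUND * scale) return LBOUND;
    if (x >= (int64_t)UBOUND * scale) return UBOUND;

    int64_t diff = x & (scale - 1);              /* the FL bits that are dropped */
    double prob = 1.0 - (double)diff / (double)scale;  /* P(round down) = 1 - diff * eps */
    int64_t floor_part = (x - diff) / scale;     /* exact: x - diff is a multiple of scale */

    return (Fixed)(r <= prob ? floor_part : floor_part + 1);
}
```

Because x is strictly below the aligned upper boundary at the rounding step, floor_part + 1 never exceeds Ubound, so no second saturation is needed.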
Algorithm 6: Primary operations on fixed-point data type
Function add_f
    Input: Fixed a, Fixed b
    Output: Fixed out
    int64_t temp ← (int64_t) a + (int64_t) b
    cast_f64_simple(temp, out)
end
Function multiply_f
    Input: Fixed a, Fixed b
    Output: Fixed out
    int64_t temp ← (int64_t) a * (int64_t) b
    cast_f64(temp, out)
end
Function divide_f
    Input: Fixed a, Fixed b
    Output: Fixed out
    int64_t temp ← (((int64_t) a) ≪ FL) / ((int64_t) b)
    cast_f64_simple(temp, out)
end
By multiplying two fixed-point numbers, a and b, with FL bits of fraction, we have:
$$\text{representation: } a \cdot b = \mathrm{temp} \tag{3.7}$$
$$\text{value: } a\,\epsilon \cdot b\,\epsilon = \mathrm{temp} \cdot \epsilon^{2}$$
So the result has 2 * FL bits of fraction and we call the cast_f64 function to perform
the fraction adjustment.
In fixed-point by fixed-point division, assuming both numbers have FL bits of
fraction, the dividend should be shifted left by FL bits. Consider the division
a/b = result. The value of the result should be:
$$\text{result value: } \frac{a}{b} \tag{3.8}$$
Since the value of a representation in our data type is calculated as representation $\cdot\,\epsilon$,
the representation of the result in our data type should be:
$$\text{result representation: } \frac{a}{b \cdot \epsilon} = \frac{a \cdot \epsilon^{-1}}{b} \tag{3.9}$$
To achieve this, we divide a (the dividend) by $\epsilon$ using a SHIFT operation, that is, a
left shift by FL bits.
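The three primary operations can be sketched in C as shown below. This is a hedged illustration, not the thesis code: it assumes fixed⟨32, 18⟩, replaces stochastic rounding with round-to-nearest so the results are deterministic, and the helpers saturate and drop_fraction as well as the divide-by-zero policy are assumptions made for the example.

```c
#include <stdint.h>

#define FL 18
typedef int32_t Fixed;
#define ONE_F ((Fixed)1 << FL)

/* Clamp a 64-bit value with FL fraction bits to the 32-bit range. */
static Fixed saturate(int64_t x)
{
    if (x <= INT32_MIN) return INT32_MIN;
    if (x >= INT32_MAX) return INT32_MAX;
    return (Fixed)x;
}

/* Go from 2*FL to FL fraction bits with round-to-nearest. */
static int64_t drop_fraction(int64_t x)
{
    const int64_t scale = (int64_t)1 << FL;
    int64_t t = x + scale / 2;
    int64_t q = t / scale;
    if (t % scale < 0) q -= 1;   /* C division truncates; turn it into a floor */
    return q;
}

static Fixed add_f(Fixed a, Fixed b)
{
    return saturate((int64_t)a + (int64_t)b);
}

static Fixed multiply_f(Fixed a, Fixed b)
{
    /* The product of two representations carries 2*FL fraction bits. */
    return saturate(drop_fraction((int64_t)a * (int64_t)b));
}

static Fixed divide_f(Fixed a, Fixed b)
{
    if (b == 0)                  /* assumed policy: saturate by sign of a */
        return a >= 0 ? INT32_MAX : INT32_MIN;
    /* Pre-scaling the dividend by 2^FL keeps FL fraction bits in the quotient. */
    return saturate(((int64_t)a * ((int64_t)1 << FL)) / (int64_t)b);
}
```

For example, multiplying the representations of 1.5 and 2.0 gives the representation of 3.0, and dividing 3.0 by 2.0 gives 1.5.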
Fixed-point Matrix
In order to use our implemented fixed-point data type in matrix operations, we defined
a new struct:
typedef struct
{
    int rows;
    int cols;
    Fixed *data;
} fmatrix;
This struct is the same as the other matrix struct that we defined in section 3.1, except
that the type of the stored data is Fixed. All the associated matrix operations were also
implemented for fmatrix. Most of these functions are almost identical to their equivalent
versions for double data type matrices, with the primary operations (add, multiply,
divide) being replaced with their fixed-point variants.
On the other hand, some functions, including matMul, dot and norm, required more
modification. The pseudo-code of these functions is provided in algorithms 7, 8 and
9 respectively.
Algorithm 7: Matrix multiplication of fmatrix
Function matMul_f
    Input: fmatrix mat1 ∈ ℝ^(m×n), fmatrix mat2 ∈ ℝ^(n×p)
    Output: fmatrix prod
    for col = 1, 2, ..., p do
        for row = 1, 2, ..., m do
            int64_t sum ← 0
            for k = 1, 2, ..., n do
                int64_t temp ← (int64_t) mat1[row][k] * (int64_t) mat2[k][col]
                sum ← temp + sum
                overflow check and saturation of sum
            end
            Fixed result
            cast_f64(sum, result)
            prod[row][col] ← result
        end
    end
end
In all of these functions, a MAC (Multiply and Accumulate) operation is performed.
To apply this critical operation in fixed-point with minimal error, the following
approach was used:
• We store the result of each multiplication, which is a fixed-point number with
2 * FL bits of fraction and 2 * IL bits of integer, in an int64_t. The conversion of
these results to our Fixed data type is delayed to the next step, but an overflow
check is applied after each addition.
• We convert the sum of all the results to the Fixed data type using the cast_f64
function, since the sum is a 64bit fixed-point number with 2 * FL bits of fraction.
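The two steps above can be sketched for a dot product as follows. This is a minimal sketch under assumptions: plain arrays stand in for fmatrix, the overflow check is written as an explicit saturating 64-bit addition (the helper name add_saturating is hypothetical), and the final fraction adjustment uses round-to-nearest rather than stochastic rounding.

```c
#include <stddef.h>
#include <stdint.h>

#define FL 18
typedef int32_t Fixed;

/* 64-bit addition that saturates instead of wrapping around. */
static int64_t add_saturating(int64_t a, int64_t b)
{
    if (b > 0 && a > INT64_MAX - b) return INT64_MAX;
    if (b < 0 && a < INT64_MIN - b) return INT64_MIN;
    return a + b;
}

static Fixed dot_f(const Fixed *v1, const Fixed *v2, size_t len)
{
    const int64_t scale = (int64_t)1 << FL;
    int64_t sum = 0;                                   /* 2*FL fraction bits */

    for (size_t k = 0; k < len; ++k) {
        int64_t temp = (int64_t)v1[k] * (int64_t)v2[k]; /* no rounding here */
        sum = add_saturating(sum, temp);                /* overflow check */
    }

    /* One fraction adjustment for the whole sum. */
    if (sum >= (int64_t)INT32_MAX * scale) return INT32_MAX;
    if (sum <= (int64_t)INT32_MIN * scale) return INT32_MIN;
    int64_t t = sum + scale / 2;
    int64_t q = t / scale;
    if (t % scale < 0) q -= 1;                          /* floor division */
    return (Fixed)q;
}
```

Deferring the conversion means that only one rounding error is introduced per dot product instead of one per multiplication.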
Algorithm 8: Dot product of fmatrix
Function dot_f
    Input: fmatrix v1 ∈ ℝ^(1×m), fmatrix v2 ∈ ℝ^(m×1)
    Output: Fixed prod
    int64_t sum ← 0
    for k = 1, 2, ..., m do
        int64_t temp ← (int64_t) v1[0][k] * (int64_t) v2[k][0]
        sum ← temp + sum
        overflow check and saturation of sum
    end
    cast_f64(sum, prod)
end
Algorithm 9: L2-norm of fmatrix
Function norm_f
    Input: fmatrix v ∈ ℝ^(m×1)
    Output: Fixed n
    int64_t sum ← 0
    for row = 1, 2, ..., m do
        int64_t temp ← (int64_t) v[row][0] * (int64_t) v[row][0]
        sum ← temp + sum
        overflow check and saturation of sum
    end
    n ← integer sqrt of sum
end
A function to perform an integer square root was needed in the norm function. We used
the algorithm proposed in [89] with minor modifications for this purpose.
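To illustrate why an integer square root is sufficient here, the sketch below uses a generic bit-by-bit square root. It is not the algorithm of [89], whose details are not reproduced in this section, and the function names are hypothetical. The key observation is that the sum of squares carries 2 * FL fraction bits, so its integer square root carries exactly FL fraction bits and needs no further shift.

```c
#include <stddef.h>
#include <stdint.h>

#define FL 18
typedef int32_t Fixed;

/* Generic bit-by-bit integer square root: returns floor(sqrt(n)). */
static uint64_t isqrt64(uint64_t n)
{
    uint64_t root = 0;
    uint64_t bit = (uint64_t)1 << 62;       /* largest power of four in range */

    while (bit > n) bit >>= 2;
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;
}

/* L2-norm of a fixed-point vector stored in a plain array (assumption). */
static Fixed norm_f(const Fixed *v, size_t len)
{
    uint64_t sum = 0;                        /* 2*FL fraction bits */
    for (size_t i = 0; i < len; ++i) {
        int64_t x = (int64_t)v[i];
        uint64_t sq = (uint64_t)(x * x);     /* at most 2^62, never negative */
        sum = (sum > UINT64_MAX - sq) ? UINT64_MAX : sum + sq;
    }
    uint64_t root = isqrt64(sum);            /* FL fraction bits */
    return root > (uint64_t)INT32_MAX ? INT32_MAX : (Fixed)root;
}
```

For the vector (3.0, 4.0) the sum of squares is 25 * 2^36, whose integer square root is 5 * 2^18, the representation of 5.0.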
3.3 Using Fixed-point LSMR in ADMM
As we discussed in section 3.1.3, the LSMR function in weight_update and
activation_update was the most time-consuming part of the C implementation. In order to
speed up the training process, we aimed to make the LSMR function faster by running
it on hardware and taking advantage of both task and pipeline parallelism.
As previously stated, fixed-point arithmetic is considerably more efficient than
floating-point. Therefore, in order to make the hardware implementation more feasible,
fixed-point arithmetic was used in the LSMR module. The fixed-point version
of LSMR works with fixed-point matrices and performs matrix operations using the
corresponding functions explained in section 3.2.
Also, some experiments were performed to check whether the ADMM-LSMR algorithm works
with low precision using different rounding methods. The results of these experiments
can be found in section 5.1.
In summary, we observed that:
• We were able to achieve near-float accuracy using the 32bit fixed-point
implementation of LSMR.
• The proposed ADMM-LSMR method failed to converge using the 16bit fixed-point
implementation of LSMR.
Chapter 4
Hardware Design
4.1 Hardware-accelerated ADMM-LSMR
To achieve our final goal, which was a hardware implementation of the proposed method,
we used the Intel FPGA SDK for OpenCL. A Programmable Acceleration Card with an Intel
Arria® 10 GX FPGA was used for our design.
In this implementation we were able to run the weight_update and activation_update
procedures in parallel and speed up our training process. To the best of our knowledge,
this is the first hardware implementation of ADMM, which also uses LSMR, for training
neural networks.
4.1.1 Motivation
The proposed ADMM-LSMR method in [1] is a hardware-friendly approach for train-
ing neural networks. From the early stages of this work, our goal was to use hardware
acceleration to take advantage of the inherent parallelism of this method. After im-
plementing the fixed-point version of the method and observing its promising results,
our next step was to develop an OpenCL program to perform FPGA emulation and
finally run on our target FPGA card.
In the following sections, the latest version of the implementation, which is the product
of extensive optimisation, is explained. The details of these optimisation stages are
discussed in section 4.2.
4.1.2 Challenges
• Extreme system requirements of the development tools. Development had to be
performed fully remotely on department workstations or Intel Devcloud
• Unstable emulator
• Lack of informative error/crash reports
• Long hardware compilation time
• Getting access to Intel FPGA boards and technical difficulties in working with
Intel servers
• Minor but undocumented differences between the emulator and the physical FPGA
4.1.3 OpenCL for FPGA Implementation
As we mentioned in section 2.9.1, an OpenCL accelerated program has two parts:
1. A C++ program to run on the host. This program is compiled using g++.
2. An OpenCL program, including kernels, to run on the device, which is compiled
using the Intel AOC compiler.
In the following sections, an overview of the program flow and more details of the host
and device sections are given.
Figure 4.1: Gantt chart of execution time of one iteration of ADMM-LSMR on a
hidden layer with hidden size of 28 on HIGGS data set
Figure 4.2: Activity diagram of one iteration of ADMM-LSMR on a single layer
Program Flow
A high-level overview of the program can be described as:
1. Initialise the OpenCL run-time and resources.
2. Prepare the inputs and set up the network architecture.
3. Perform ADMM-LSMR until convergence:
(a) For each layer:
i. Perform weight_update and activation_update, with the LSMR call
commands being sent to the FPGA.
ii. Wait for the results of stage i and perform output_update if not in the last layer.
(b) Apply the last output_update for the last layer, followed by lagrangian_update.
4. Save the results and clean up the resources.
The hardware-accelerated part takes place on stage 3.a. An activity diagram of this
process is shown in figure 4.2. Also, a Gantt chart illustrating this stage on a single
hidden layer of size 28 on HIGGS data set is provided in figure 4.1.
Device Program
As shown in pseudo-code 3, in both the activation_update and weight_update
procedures, the LSMR function is called inside a for loop. This is because the
original LSMR method solves least-squares problems like 2.2, and its output has
the following form:
$$x = A^{-1} b, \qquad A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^{m \times 1},\ x \in \mathbb{R}^{n \times 1} \tag{4.1}$$
Therefore, in order to solve a problem like 4.2, the LSMR function should be called
p times. These p LSMR calls are independent of each other, which makes them an ideal
candidate to be implemented in hardware and parallelised.
$$X = A^{-1} B, \qquad A \in \mathbb{R}^{m \times n},\ B \in \mathbb{R}^{m \times p},\ X \in \mathbb{R}^{n \times p} \tag{4.2}$$
Our implemented LSMR kernel takes 3 matrices as input to solve a problem of the form
4.2. The Intel AOC compiler applies pipeline parallelism when it translates the kernel
to a bitstream for the FPGA.
__kernel void lsmr(const int m, const int n, const int p,
                   __global const Fixed *restrict base_input,
                   __global Fixed *restrict base_At,
                   __global Fixed *restrict base_output,
                   __local Fixed *restrict base_u,
                   __local Fixed *restrict base_v,
                   __local Fixed *restrict base_h,
                   __local Fixed *restrict base_hbar,
                   const int offset)
As dynamic allocation is not allowed inside the kernel, all of the allocations take place
in the host and OpenCL memory objects are passed to the kernel. These memory objects
appear as pointers inside the kernel. The memory objects in the header of the kernel
are:
• __global const Fixed *restrict base_input: Pointer to the start of a global memory
that stores the inputs A and B. We refer to this as the input buffer on the host side.
• __global Fixed *restrict base_At: Pointer to the start of a global memory for storing
$A^T$, a matrix that is needed in the internal computations of the kernel. We
refer to this as the internal buffer on the host side.
• __global Fixed *restrict base_output: Pointer to the start of a global memory for
storing the output X. We refer to this as the output buffer on the host side.
• __local Fixed *restrict base_u, __local Fixed *restrict base_v, __local Fixed *restrict
base_h, __local Fixed *restrict base_hbar: Pointers to the start of local memories
for storing internal matrices needed for kernel computations.
In addition to these memory objects, the kernel takes four integers. m, n and p
determine the dimensions of the input and output matrices, and offset selects the part
of the input and output that a compute unit works on.
In order to utilise most of the available FPGA resources, after the extensive
optimisations we were able to fit four compute units of the LSMR kernel in the target
FPGA. In every iteration of training, two of these compute units are used for the
activation_update LSMR, and the other two perform the weight_update LSMR. As we
mentioned earlier, the columns of the LSMR output can be computed independently of
each other. Therefore, in order to split the workload between the compute units, we
made each compute unit responsible for computing half of the columns of the output.
We used an offset (a const int) to tell each compute unit which part of the input it
should use and which part of the output it should write to.
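The column split can be illustrated with a small host-side helper. This is only a sketch of the idea: the struct col_range and the function columns_for_unit are hypothetical names, and how the real offset is encoded (in columns or in elements) is not specified in this section.

```c
/* Range of output columns assigned to one compute unit. */
typedef struct {
    int first_col;   /* index of the first column handled by the unit */
    int num_cols;    /* number of consecutive columns it handles */
} col_range;

/* Split p independent columns as evenly as possible between `units`
 * compute units; earlier units take the remainder columns. */
static col_range columns_for_unit(int p, int unit, int units)
{
    int base = p / units;
    int extra = p % units;
    col_range r;

    r.first_col = unit * base + (unit < extra ? unit : extra);
    r.num_cols = base + (unit < extra ? 1 : 0);
    return r;
}
```

With two compute units and p = 28 columns, each unit receives 14 columns and the second unit starts at column 14; for an odd p the first unit takes the extra column.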
Another important matter to mention is that this implementation uses round to near-
est method for fixed-point numbers. All of the inputs are converted to fixed-point in
the host using the specified rounding method in that program. But, the used rounding
method inside the kernel is nearest rounding.
Since the inputs of LSMR were memory objects, we had to define another version of
fmatrix in the kernel program. All of the matrix operation functions had to be modi-
fied consequently. Most of these functions are not used explicitly in the final version
since we had to make the matrix operations inline to optimise the kernel performance
and resource utilisation. The final implementation is quite complex, tangled and hard
to read as a result of hardware optimisations.
Host Program
The host program has three main tasks:
1. Configure the OpenCL runtime and initialise the OpenCL objects.
2. Perform the skeleton of the algorithm.
3. Orchestrate the acceleration and device commands.
Configuration and Initialisation
We defined a manager for OpenCL to perform all of the initial configurations of the
OpenCL and also to keep track of important objects including platform, device, con-
text, program and command queues. As we mentioned earlier, we have four compute
units that we want to work in parallel. Therefore it was required to define four different
command queues.
typedef struct
{
    cl_platform_id platform;
    scoped_array<cl_device_id> device;
    cl_context context;
    cl_command_queue queue0;
    cl_command_queue queue1;
    cl_command_queue queue2;
    cl_command_queue queue3;
    cl_program program;
} opencl_manager;
We also defined a function to perform initialisation, init_opencl, and another one to
release allocated objects, cleanup_opencl. These two functions are invoked at the start
and end of the host program respectively.
In the init_opencl function we perform the following:
1. Get the OpenCL platform.
2. Query the available OpenCL devices and pick the first one.
3. Create the context.
4. Create the program and build it.
5. Create the command queues.
Each LSMR kernel invocation requires a set of parameters and OpenCL objects. There-
fore the following struct was defined. This struct includes dimensions of the matrices,
all the input, output and other buffers that the kernel requires, and event objects used
to define dependencies of commands and synchronisation with the host.
typedef struct
{
    int m;
    int n;
    int p;
    cl_mem input_buf;
    cl_mem internal_buf1;
    cl_mem internal_buf2;
    cl_mem output_buf;
    cl_kernel kernel1;
    cl_kernel kernel2;
    cl_event kernel_event[2];
    cl_event finish_event[2];
} lsmr_module;
In the host program we defined two dynamic arrays of lsmr_module: weight_update_lsmr
and activation_update_lsmr. We constructed one lsmr_module for each call of
weight_update and activation_update in one training iteration. As the dimensions of the
inputs and output of the function are the same in different iterations, these modules
are reused across the iterations of the training. Therefore, the weight_update_lsmr and
activation_update_lsmr arrays consist of layers and layers − 1 lsmr_modules respectively.
We also defined a function to initialise lsmr_modules with the appropriate parameters.
The important parameters are m, n and p, which determine the dimensions
of the inputs and output of the LSMR.
This function performs the following:
1. Create input buffer, output buffer and internal buffer with the appropriate size.
2. Create the kernel and set its arguments.
Device Invocation
The host program is responsible for managing the device kernel calls. This process
includes setting kernel arguments, uploading the inputs, invoking the kernel and down-
loading the results. It also manages dependencies and synchronisation of upload, in-
vocation, download and in general host-device synchronisation.
In our implementation, a function called run_lsmr_opencl is responsible for performing
the OpenCL commands given an initialised lsmr_module object. This function is called
from the weight_update and activation_update functions and performs the following
actions:
1. clEnqueueWriteBuffer to upload input matrix A.
2. clEnqueueWriteBuffer to upload input matrix B, we upload this matrix column
wise.
3. clEnqueueWriteBuffer to fill the output buffer with zero.
4. clEnqueueTask to invoke the kernel for computing first half of the result.
5. clEnqueueTask to invoke the kernel for computing second half of the result.
6. clEnqueueReadBuffer to download the output X from the output buffer.
It is worth mentioning that for each clEnqueueTask, we had to use a different queue
so that the kernels would be able to run in parallel. Also, OpenCL events are used to
ensure that the kernels start after the upload is finished and, likewise, that the
download starts when the kernels are completed.
In the main training loop, after calling weight_update and activation_update,
a function called post_process is invoked, where we wait for the results of the two
enqueued kernels and perform the further required operations.
As previously stated, we use four LSMR compute units in parallel and each of them is
pipelined internally. As a result, the two most time-consuming parts of the training
process execute in a parallel and pipelined fashion, which leads to a noticeable speed-up.
4.2 Optimisations of FPGA Implementation
In this section, we describe the applied optimisation steps and their results on timing
and resource usage. These optimisations were critical to maximise the performance
and utilisation of the available resources. Our goal was to achieve the maximum
frequency of our target device (240 MHz), and II equal to one for most parts of the
design.
In each step, we have provided two tables for showing some of the non-optimised
blocks of code and critical issues based on II and fmax values. A separate table is also
provided to show the resource usage of each design.
Version 1
Our first OpenCL implementation of the LSMR kernel was similar to its C implementation
with minor modifications. This version took as input a matrix $A \in \mathbb{R}^{m \times n}$
and a vector $b \in \mathbb{R}^{m \times 1}$ and produced an output of the form $x \in \mathbb{R}^{n \times 1}$. Some critical
deficiencies of this design are shown in tables 4.1 and 4.2. One of the main issues of this
implementation was the memory dependency between the load and store operations,
which caused a high value of II in different sections of the code.
Location in source code   II     Details
Computing norm            ~172   Data dependency. Load from global memory.
Summation of matrices     ~258   Memory dependency. Load and then store to global memory.
Transposing matrix        ~257   Memory dependency. Load and then store to global memory.
Table 4.1: Non-optimised blocks of design based on II. Version 1.
Location in source code            Scheduled fmax (MHz)   Details
Computing Integer SQRT             98.3                   Loop feedback
Performing Matrix Multiplication   175.0                  Loop feedback
Computing norm                     135.0                  Loop feedback
Table 4.2: Non-optimised blocks of design based on fmax. Version 1.
                  ALUTs   FFs   RAMs   MLABs   DSPs
Board Interface   8%      8%    6%     0%      0%
Kernel System     28%     18%   36%    3%      33%
Total             36%     26%   42%    3%      33%
Table 4.3: Estimated resource usage of system. Version 1.
Version 2
In this version, we changed the LSMR kernel to work on inputs of the form $A \in \mathbb{R}^{m \times n}$
and matrix $B \in \mathbb{R}^{m \times p}$ and to produce an output of the form $X \in \mathbb{R}^{n \times p}$. We also used
local memory for two of the internal vectors, which were accessed more frequently than
the others, to reduce the size of the internal buffer and the number of loads and stores
from the global memory. This modification was done to address one of the issues of
the former version. As a general rule, if possible, it is better to copy parts of memory
that are accessed more than once into local memory.
Location in source code                        II     Details
Computing norm of a vector in local memory     ~41    Data dependency. Load from local memory.
Computing norm of a vector in global memory    ~156   Data dependency. Load from global memory.
Summation of two vectors in global memory      ~214   Memory dependency. Load and then store to global memory.
Table 4.4: Non-optimised blocks of design based on II. Version 2.
                  ALUTs   FFs   RAMs   MLABs   DSPs
Board Interface   8%      8%    6%     0%      0%
Kernel System     32%     20%   68%    5%      33%
Total             40%     28%   74%    5%      33%
Table 4.5: Estimated resource usage of system. Version 2.
The scheduled fmax value did not change significantly in this step.
By comparing tables 4.3 and 4.5 we can observe that the only major difference in
resource usage between these two versions is the RAM usage, even though the version
2 design can perform the same operation as the version 1 but on multiple columns of
input using a for loop. This comparison demonstrates that the AOC compiler does
not replicate the hardware for each column, and tries to apply pipeline
parallelism.
The high amount of RAM usage in version 2 is resolved in the next versions.
Version 3
In previous versions, in most of the nested loops, the outer loop was not pipelined.
This was because the compiler was not able to recognise that number of iterations of
the inner loop are the same for different iterations of the outer loop. In this version,
we guided the compiler to pipeline these outer loops by using constant type copies of
the variables for specifying the number of loop iterations.
Also, in previous versions there was a compiler warning about not using the restrict
keyword for the pointers to memories in the kernel signature. This keyword was used in
this version, which helps the compiler with some cache optimisations and also restricts
the effect of pointer aliasing.
Another technique used in this step was coalescing nested loops manually wherever
possible. In coalescing a nested loop, we transform it into a single loop without
changing its functionality. This technique reduces the loop overhead and also latency,
and as a result, reduces the kernel resource usage.
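Manual loop coalescing can be illustrated with a row-sum example. This is a generic sketch in plain C with doubles, not an excerpt from the kernel; the function names are hypothetical and cols is assumed to be positive. The coalesced version has a single loop with a trip count fixed before the loop starts, and it tracks the row and column indices by hand.

```c
/* Nested version: one outer loop over rows, one inner loop over columns. */
static void row_sums_nested(const double *a, int rows, int cols, double *out)
{
    for (int i = 0; i < rows; ++i) {
        out[i] = 0.0;
        for (int j = 0; j < cols; ++j)
            out[i] += a[i * cols + j];
    }
}

/* Coalesced version: a single loop with the same functionality. */
static void row_sums_coalesced(const double *a, int rows, int cols, double *out)
{
    const int total = rows * cols;   /* constant trip count for the whole loop */
    int i = 0;                       /* current row */
    int j = 0;                       /* current column */

    for (int idx = 0; idx < total; ++idx) {
        if (j == 0) out[i] = 0.0;    /* first element of a new row */
        out[i] += a[idx];
        if (++j == cols) {           /* wrap to the next row */
            j = 0;
            ++i;
        }
    }
}
```

Both functions visit the elements in the same order, so they produce identical results; only the loop structure differs.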
Fusing adjacent loops is also applied manually in this step. This technique also reduces
the loop overhead and therefore the area usage. The main effect of this technique is
running the adjacent loops concurrently as they are considered a single loop and this
increases the performance. In this version, we managed to use one nested loop instead
of three, for calculating 4.3 as well as calculating 4.4 which correspond to lines 10
and 12 of algorithm 4. This was achieved by fusing the following loops: matrix-vector
multiplication nested loop, vector-scalar multiplication loop and subtracting vectors
loop for both calculations. This also helped with precision and reduced the amount of
internal buffer needed by eliminating some internal vectors.
$$u_{k+1} \leftarrow (A v_k - \alpha_k u_k) \tag{4.3}$$
$$v_{k+1} \leftarrow (A^T u_{k+1} - \beta_{k+1} v_k) \tag{4.4}$$
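The fusion used for 4.3 can be sketched as follows. This is a generic illustration in plain C with doubles instead of the kernel's fixed-point type, and the function names and the row-major layout of A are assumptions. The unfused version needs two temporary vectors, while the fused version computes each element of the result in one pass.

```c
/* Unfused: matrix-vector product, scalar-vector product and subtraction
 * as three separate loops with two temporary vectors (av and au). */
static void update_u_unfused(const double *A, const double *v, const double *u,
                             double alpha, int m, int n,
                             double *av, double *au, double *u_next)
{
    for (int i = 0; i < m; ++i) {
        av[i] = 0.0;
        for (int j = 0; j < n; ++j)
            av[i] += A[i * n + j] * v[j];
    }
    for (int i = 0; i < m; ++i)
        au[i] = alpha * u[i];
    for (int i = 0; i < m; ++i)
        u_next[i] = av[i] - au[i];
}

/* Fused: one nested loop, no temporary vectors. */
static void update_u_fused(const double *A, const double *v, const double *u,
                           double alpha, int m, int n, double *u_next)
{
    for (int i = 0; i < m; ++i) {
        double acc = 0.0;
        for (int j = 0; j < n; ++j)
            acc += A[i * n + j] * v[j];
        u_next[i] = acc - alpha * u[i];   /* subtraction folded into the row loop */
    }
}
```

The fused form removes the intermediate vectors, which matches the reduction in internal buffer size described above.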
Again the scheduled fmax did not change significantly in this step.
The II value changed slightly for computing the norm of a vector in global memory and
for the summation of vectors.
There was a considerable change in resource usage, especially RAM usage.
This reduction in resource usage is due to eliminating the overhead of some loops.
Location in source code                        II     Details
Computing norm of a vector in local memory     ~41    Data dependency. Load from local memory.
Computing norm of a vector in global memory    ~150   Data dependency. Load from global memory.
Summation of two vectors in global memory      ~196   Memory dependency. Load and then store to global memory.
Table 4.6: Non-optimised blocks of design based on II. Version 3.
                  ALUTs   FFs   RAMs   MLABs   DSPs
Board Interface   8%      8%    6%     0%      0%
Kernel System     27%     17%   50%    5%      28%
Total             35%     25%   56%    5%      28%
Table 4.7: Estimated resource usage of system. Version 3.
Version 4
In this step, more loop fusing was performed. We fused the loop for computing the
norm of a vector with its adjacent loop, which itself was the result of a loop fusion in
the previous step. With this modification, we were able to perform 4.5 together with 4.6
in one nested loop, and 4.7 together with 4.8 in another nested loop. 4.5 and 4.7
correspond to lines 10 and 12 of algorithm 4.
$$u_{k+1} \leftarrow (A v_k - \alpha_k u_k) \tag{4.5}$$
$$\beta_{k+1} \leftarrow \lVert A v_k - \alpha_k u_k \rVert_2 \tag{4.6}$$
$$v_{k+1} \leftarrow (A^T u_{k+1} - \beta_{k+1} v_k) \tag{4.7}$$
$$\alpha_{k+1} \leftarrow \lVert A^T u_{k+1} - \beta_{k+1} v_k \rVert_2 \tag{4.8}$$
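Also, we fused the loops for computing 4.9, 4.10 and 4.11 (corresponding to lines 17,
18 and 19 of algorithm 4) and used just one loop to perform all of them.
$$h_{k+1} \leftarrow -\frac{s_k \cdot \rho_{k+1} \cdot \rho_{k+1}}{\rho_k \cdot \rho_k}\, h_k + h_k \tag{4.9}$$
$$x_{k+1} \leftarrow \frac{\zeta_{k+1}}{\rho_{k+1} \cdot \rho_{k+1}}\, h_{k+1} + x_k \tag{4.10}$$
$$h_{k+1} \leftarrow -\frac{s_{k+1} \cdot \alpha_{k+1}}{\rho_{k+1}}\, h_k + v \tag{4.11}$$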
These loop fusions again helped with reducing resource usage. The only factor that
changed considerably in this step of optimisation was resource usage:
                  ALUTs   FFs   RAMs   MLABs   DSPs
Board Interface   8%      8%    6%     0%      0%
Kernel System     25%     15%   37%    4%      28%
Total             33%     22%   44%    4%      28%
Table 4.8: Estimated resource usage of system. Version 4.
Version 5
In this version, we used one of the OpenCL built-in integer functions, add_sat. Using this function eliminated a large number of conditional statements inside the main loop that were related to overflow-check logic. This change led to a significant reduction in resource usage.
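To illustrate the kind of overflow-check logic that a saturating add replaces, the following plain C function clamps a 32-bit sum by hand. The function name is hypothetical and the thesis's own checks are not shown in this section, so this is only a sketch of the general pattern.

```c
#include <stdint.h>

/* Saturating 32-bit addition written with explicit overflow checks.
 * The result is clamped to INT32_MAX or INT32_MIN instead of wrapping. */
int32_t sat_add_checked(int32_t a, int32_t b)
{
    if (b > 0 && a > INT32_MAX - b) {
        return INT32_MAX;                    /* positive overflow */
    }
    if (b < 0 && a < INT32_MIN - b) {
        return INT32_MIN;                    /* negative overflow */
    }
    return a + b;                            /* safe: no overflow possible */
}
```

In an OpenCL kernel the whole body reduces to a single `add_sat(a, b)` call, which removes these branches from the source of the loop.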
We also reduced the value of II significantly. As can be seen in Table 4.9, the only block of code with an II greater than 1 was the square root block. The II of the integer square root was 4 in all the other versions, but it was not mentioned in the earlier tables because those versions contained more critical blocks. Again, the scheduled fmax did not change significantly.
| Location in source code | II | Details |
|---|---|---|
| Integer SQRT | 4 | Data dependency. |
Table 4.9: Non-optimised blocks of design based on II. Version 5.
| | ALUTs | FFs | RAMs | MLABs | DSPs |
|---|---|---|---|---|---|
| Board Interface | 8% | 8% | 6% | 0% | 0% |
| Kernel System | 11% | 7% | 20% | 3% | 18% |
| Total | 19% | 15% | 26% | 3% | 18% |
Table 4.10: Estimated resource usage of system. Version 5.
Version 6
To solve the problem of the II value of the integer square root block, we used the OpenCL built-in function for computing the floating-point square root. To use this function, we had to convert the input from fixed-point to float and convert the output back to fixed-point. With this change, we were able to reduce the II value to 1, as the Intel implementation of the built-in square root function is highly optimised. The cost was a small increase in RAM usage (less than one percent), which is negligible. This also solved the problem of low fmax in the integer square root block, and we were able to increase its fmax from 98 MHz (as shown in Table 4.2) to 240 MHz.
In this version, all of the blocks of code showed an II value of 1, ~1 or ≥1 in the report. We also used prefetch load in this version for reading from the global memory.
| | ALUTs | FFs | RAMs | MLABs | DSPs |
|---|---|---|---|---|---|
| Board Interface | 8% | 8% | 6% | 0% | 0% |
| Kernel System | 11% | 7% | 20% | 3% | 18% |
| Total | 19% | 15% | 26% | 3% | 18% |
Table 4.11: Estimated resource usage of system. Version 6.
Version 7
In this version, we replaced add_sat with a plain add and observed that this alteration did not lead to any loss in accuracy. It also solved the problem of low fmax values: all of the code blocks now showed a scheduled fmax of 240 MHz, which is the maximum achievable frequency on our target board. The area usage did not change significantly.
With this version, optimisation was completed. In the next versions, we focused on achieving maximum hardware utilisation.
Version 8
At this stage, we used two LSMR compute units to run the LSMR solves of the activation update and the weight update in parallel. Table 4.12 shows that there were still unused hardware resources.
| | ALUTs | FFs | RAMs | MLABs | DSPs |
|---|---|---|---|---|---|
| Board Interface | 8% | 8% | 6% | 0% | 0% |
| Kernel System | 23% | 14% | 40% | 6% | 37% |
| Total | 31% | 22% | 46% | 6% | 37% |
Table 4.12: Estimated resource usage of system. Version 8.
Version 9
To take advantage of the remaining hardware resources, we increased the number of compute units of the LSMR kernel to 4 on our FPGA. Consequently, the host code was modified to split the LSMR workload of each of the activation update and the weight update between two compute units. The LSMR kernel was also modified to work on only a segment of the problem for this purpose.
This modification led to about a 2X speed up compared to the previous version, as we doubled the task parallelism.
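One way the host could divide a workload into per-compute-unit segments is sketched below. The thesis does not state along which dimension the LSMR problem is split, so this example simply assumes a range of independent work items divided into contiguous, nearly equal segments; the type and function names are hypothetical.

```c
#include <stddef.h>

/* A contiguous slice of the work: first item index and number of items. */
typedef struct {
    size_t start;
    size_t count;
} segment_t;

/* Returns segment `index` (0-based) when `total` items are split into
 * `parts` contiguous segments whose sizes differ by at most one. */
segment_t split_segment(size_t total, size_t parts, size_t index)
{
    size_t base = total / parts;             /* minimum items per segment */
    size_t extra = total % parts;            /* first `extra` segments get one more */
    segment_t s;
    s.start = index * base + (index < extra ? index : extra);
    s.count = base + (index < extra ? 1 : 0);
    return s;
}
```

With two compute units per update, the host would call this with `parts` equal to 2 and pass each segment's `start` and `count` to one kernel instance.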
It is also worth mentioning that the board interface resource usage, which had been the same in all of the previous steps, increased in this step.
| | ALUTs | FFs | RAMs | MLABs | DSPs |
|---|---|---|---|---|---|
| Board Interface | 16% | 10% | 15% | 0% | 0% |
| Kernel System | 46% | 27% | 81% | 11% | 75% |
| Total | 62% | 37% | 96% | 11% | 75% |
Table 4.13: Estimated resource usage of system. Version 9.
Chapter 5
Experimental Results
In this project, two data sets were used for experiments: IRIS [83], which is a small data set, and a subset of HIGGS, which is a bigger and more complex data set. The test accuracy of the ADMM-LSMR method was assessed against two state-of-the-art gradient-based methods in [1], where it was observed that the ADMM-LSMR algorithm is able to achieve better accuracy than SGD and Adam on the HIGGS and IRIS data sets. In this work, first, we compared the test accuracy of the implemented method when using fixed-point arithmetic in LSMR against the original floating-point implementation. Second, we compared the execution time of each training iteration of the C implementation and the hardware-accelerated FPGA implementation. Finally, we studied the impact of increasing the hidden size on the execution time of each iteration in both the CPU and the FPGA accelerated implementations.
The key observations can be summarised as follows:
• We were able to achieve near floating-point accuracy, with less than one percent penalty, using fixed-point LSMR with the nearest rounding method.
• The nearest rounding method was the best choice when using the fixed-point version of LSMR in ADMM. This was an important observation, as stochastic rounding is reported to give the best accuracy among rounding methods in gradient-based training [44].
• We were able to achieve up to 6X speed up when using the hardware-accelerated FPGA implementation compared to the C implementation.
• The speed up gained from the hardware-accelerated implementation grows as the hidden size of the network increases.
The experiments were conducted on two different machines. For CPU runs, one of the custom computing lab workstations (cccad5) was used, and for FPGA accelerated runs Intel DevCloud nodes were employed. The specifications of these platforms were as follows:
cccad5 workstation
CPU model name: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
CPU cache size: 25344 KB
CPU cores: 18
Memory: 768GB
Intel Devcloud node
CPU model name: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
CPU cache size: 28160 KB
CPU cores: 20
Memory: 18 GB
FPGA Board : Intel® Programmable Acceleration Card with Intel Arria® 10
GX FPGA
5.1 Comparing Floating-point vs Fixed-point
In this section, we compare the accuracy of the training algorithm using fixed-point and floating-point arithmetic. The results were derived from 500 runs of each training algorithm.
5.1.1 IRIS
3 layer network with hidden size equal to 8
In this section, a 3 layer network with hidden size equal to 8 is trained on the IRIS data set using the ADMM-LSMR algorithm. We trained this model using fixed-point LSMR with four different rounding methods and using floating-point LSMR.
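The four rounding methods can be illustrated with a small C helper that narrows a wide fixed-point intermediate (for example the 64-bit product of two 32-bit fixed-point values) back to the working precision. This is a sketch under assumptions: the Q16.16 layout, the names, the interpretation of upward and downward rounding as rounding towards positive and negative infinity, and the way the random value is supplied are all choices made for this example, not details taken from the thesis.

```c
#include <stdint.h>

#define FRAC_BITS 16   /* assumed number of fractional bits to discard */

typedef enum { ROUND_DOWN, ROUND_UP, ROUND_NEAREST, ROUND_STOCHASTIC } round_mode_t;

/* Drops FRAC_BITS low bits from `wide` using the given rounding mode.
 * `rnd` is a uniformly random value and is only used by ROUND_STOCHASTIC.
 * Assumes >> on a negative value is an arithmetic shift, which is
 * implementation-defined in C but typical. Overflow of the 32-bit
 * result is not handled in this sketch. */
int32_t round_fixed(int64_t wide, round_mode_t mode, uint32_t rnd)
{
    const int64_t one = (int64_t)1 << FRAC_BITS;
    const int64_t mask = one - 1;
    int64_t floor_q = wide >> FRAC_BITS;     /* rounds towards minus infinity */
    int64_t frac = wide & mask;              /* discarded bits, 0 <= frac < one */

    switch (mode) {
    case ROUND_DOWN:
        return (int32_t)floor_q;
    case ROUND_UP:
        return (int32_t)(floor_q + (frac != 0));
    case ROUND_NEAREST:
        return (int32_t)(floor_q + (frac >= one / 2));
    case ROUND_STOCHASTIC:
        /* rounds up with probability frac / one */
        return (int32_t)(floor_q + (((int64_t)rnd & mask) < frac));
    }
    return (int32_t)floor_q;
}
```

Nearest rounding needs only a comparison against a constant, whereas stochastic rounding needs a fresh random value for every narrowed result.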
We observed that the average accuracy achieved with nearest rounding is better than with the other rounding methods. The accuracy achieved with this method also has less variance than with the other rounding methods.
| | Mean | STDV |
|---|---|---|
| Floating-point | 83.7% | 1.6% |
| Fixed-point with nearest rounding | 82.8% | 2.0% |
| Fixed-point with stochastic rounding | 82.3% | 3.1% |
| Fixed-point with upward rounding | 81.1% | 5.3% |
| Fixed-point with downward rounding | 80.8% | 5.2% |
Table 5.1: Comparing accuracy of using floating-point vs fixed-point with different rounding methods on IRIS
It is worth mentioning that, in contrast to conventional training methods such as SGD, which perform better with the stochastic rounding method in fixed-point arithmetic, we did not observe a significant difference when using this rounding method.
We also observed less than 1% penalty in accuracy when using fixed-point with the nearest rounding method compared to floating-point.
5.1.2 HIGGS
In this section, we trained three neural networks with hidden sizes of 8, 14 and 28 on a subset of the HIGGS data set using the ADMM-LSMR algorithm. We trained these models using fixed-point LSMR with four different rounding methods and also using floating-point LSMR.
3 layer network with hidden size equal to 8
| | Mean | STDV |
|---|---|---|
| Floating-point | 63.6% | 0.1% |
| Fixed-point with nearest rounding | 62.8% | 0.2% |
| Fixed-point with stochastic rounding | 62.6% | 0.2% |
| Fixed-point with upward rounding | 62.6% | 0.1% |
| Fixed-point with downward rounding | 62.6% | 0.2% |
Table 5.2: Comparing accuracy of using floating-point vs fixed-point with different rounding methods on HIGGS
3 layer network with hidden size equal to 14
| | Mean | STDV |
|---|---|---|
| Floating-point | 63.6% | 0.1% |
| Fixed-point with nearest rounding | 62.7% | 0.2% |
| Fixed-point with stochastic rounding | 62.7% | 0.1% |
| Fixed-point with upward rounding | 62.6% | 0.1% |
| Fixed-point with downward rounding | 62.7% | 0.2% |
Table 5.3: Comparing accuracy of using floating-point vs fixed-point with different rounding methods on HIGGS
3 layer network with hidden size equal to 28
| | Mean | STDV |
|---|---|---|
| Floating-point | 61.3% | 0.1% |
| Fixed-point with nearest rounding | 62.8% | 0.2% |
| Fixed-point with stochastic rounding | 62.8% | 0.1% |
| Fixed-point with upward rounding | 62.7% | 0.2% |
| Fixed-point with downward rounding | 62.7% | 0.2% |
Table 5.4: Comparing accuracy of using floating-point vs fixed-point with different rounding methods on HIGGS
We observed that the average accuracy achieved with nearest rounding is better than with the other rounding methods.
It is worth mentioning that, in contrast to conventional training methods such as SGD, which perform better with the stochastic rounding method in fixed-point arithmetic, we did not observe a significant difference when using this rounding method.
We also observed less than 1% penalty in accuracy when using fixed-point with the nearest rounding method compared to floating-point.
Additionally, as the hidden size of the network increases, the floating-point implementation suffers a minor loss in accuracy, while such behaviour is not observed in the fixed-point LSMR implementation. This may be because fixed-point arithmetic injects noise into the neural network, which delays overfitting and helps the network to generalise better.
5.2 Comparing CPU Implementation and FPGA
Implementation: Accuracy
In this section, we compare the accuracy of the FPGA implementation with that of the CPU implementation. The results are from running each training algorithm 500 times.
As expected, the FPGA implementation was able to achieve the same accuracy as the C implementation. This set of experiments was done to assess the correctness of the FPGA implementation.
5.2.1 IRIS
3 layer network with hidden size equal to 8
| | Mean | STDV |
|---|---|---|
| CPU implementation | 82.8% | 2.0% |
| FPGA implementation | 82.7% | 2.4% |
Table 5.5: Comparing accuracy of C implementation vs FPGA implementation
5.2.2 HIGGS
3 layer network with hidden size equal to 8
| | Mean | STDV |
|---|---|---|
| CPU implementation | 62.8% | 0.2% |
| FPGA implementation | 62.6% | 0.2% |
Table 5.6: Comparing accuracy of C implementation vs FPGA implementation
3 layer network with hidden size equal to 14
| | Mean | STDV |
|---|---|---|
| CPU implementation | 62.7% | 0.2% |
| FPGA implementation | 62.7% | 0.1% |
Table 5.7: Comparing accuracy of C implementation vs FPGA implementation
3 layer network with hidden size equal to 28
| | Mean | STDV |
|---|---|---|
| CPU implementation | 62.8% | 0.2% |
| FPGA implementation | 62.8% | 0.2% |
Table 5.8: Comparing accuracy of C implementation vs FPGA implementation
5.3 Comparing CPU Implementation and FPGA
Implementation: Time
In this section, we compare the execution time of each loop iteration of the implemented ADMM-LSMR algorithm on CPU and FPGA. The execution time of 2500 iterations of each algorithm was measured to produce these results. A subset of the HIGGS data set was used for these experiments.
As Tables 5.9 to 5.12 show, we were able to achieve up to 6 times speed up, depending on the architecture of the network.
5.3.1 HIGGS
3 layer network with hidden size equal to 8
| | Mean | STDV | Speed up (CPU / FPGA) |
|---|---|---|---|
| CPU Implementation | 589.8 ms | 13.3 ms | |
| FPGA Implementation | 143.7 ms | 0.4 ms | 4.1 |
Table 5.9: Comparing execution time of C implementation vs FPGA implementation
3 layer network with hidden size equal to 14
| | Mean | STDV | Speed up (CPU / FPGA) |
|---|---|---|---|
| CPU Implementation | 1391.4 ms | 24.0 ms | |
| FPGA Implementation | 277.7 ms | 0.9 ms | 5 |
Table 5.10: Comparing execution time of C implementation vs FPGA implementation
3 layer network with hidden size equal to 28
| | Mean | STDV | Speed up (CPU / FPGA) |
|---|---|---|---|
| CPU Implementation | 5523.7 ms | 93.1 ms | |
| FPGA Implementation | 931.1 ms | 2.3 ms | 5.9 |
Table 5.11: Comparing execution time of C implementation vs FPGA implementation
4 layer network with hidden size equal to 28
| | Mean | STDV | Speed up (CPU / FPGA) |
|---|---|---|---|
| CPU Implementation | 8437.2 ms | 156.2 ms | |
| FPGA Implementation | 1387.2 ms | 3.5 ms | 6 |
Table 5.12: Comparing execution time of C implementation vs FPGA implementation
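The reported speed up values appear consistent with the ratio of the mean CPU time to the mean FPGA time per iteration. As a worked check, assuming that definition and using only the means reported in Tables 5.9 to 5.12:

```latex
\begin{align*}
\text{speed up} &= \frac{\bar{t}_{\mathrm{CPU}}}{\bar{t}_{\mathrm{FPGA}}} \\
\text{3 layers, hidden size 8:}  \quad & 589.8 / 143.7 \approx 4.10 \\
\text{3 layers, hidden size 14:} \quad & 1391.4 / 277.7 \approx 5.01 \\
\text{3 layers, hidden size 28:} \quad & 5523.7 / 931.1 \approx 5.93 \\
\text{4 layers, hidden size 28:} \quad & 8437.2 / 1387.2 \approx 6.08
\end{align*}
```

The ratio grows with network size, in line with the observation that the acceleration is more effective on larger networks.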
5.4 Run-time Relation to Network Complexity
5.4.1 HIGGS
In this section, we investigated the relation of the execution time of each training
iteration with hidden size in a 3 layer neural network on a subset of HIGGS.
[Figure: execution time (ms) against hidden size for the FPGA and CPU implementations]
Figure 5.1: Correlation of execution time to hidden size of network
We observe that while the FPGA implementation is consistently faster, the run time of both implementations grows non-linearly as the hidden size increases. The CPU implementation is also more sensitive to changes in hidden size, so the gap between the execution times of the two implementations grows with hidden size.
Chapter 6
Conclusion and Future Work
The ADMM-LSMR method was introduced for the first time in [1], alongside a Python implementation. Floating-point operations are known to cause performance issues in hardware designs, so we had to replace most of the arithmetic with custom fixed-point arithmetic. Therefore, the feasibility of a hardware design that maintains accuracy while being deployable on a reasonable FPGA board was not clear at the beginning. Not only were we able to achieve such a design, but we also applied several stages of optimisation to improve the initial algorithm, maximise parallelism and hardware utilisation, and achieve a noticeable speed up compared to the equivalent CPU implementation.
Based on the experimental results, the FPGA accelerated program was able to match the accuracy of the original implementation, with less than 1% loss, using fixed-point LSMR with nearest rounding. Additionally, the implementation has shown up to 6 times speed up depending on the network size and architecture, and the acceleration is more effective on networks with a larger hidden size.
6.1 Technical Achievements
The achievements of this project can be summarised as follows:
• C implementation: The ADMM-LSMR training method was fully implemented in C for the first time. This was a necessity for the OpenCL implementation and enabled us to perform a bottleneck analysis and identify LSMR as the target for acceleration.
• Fixed-point implementation: As a known technique for hardware implementations, fixed-point arithmetic with various rounding methods was implemented. Both 16-bit and 32-bit formats were used, with flexibility in setting the precision. These variants were employed in the LSMR module.
• OpenCL implementation: The C implementation was converted to an OpenCL accelerated program composed of a CPU (host) program and device (FPGA) kernels. This was achieved through several structural changes, both to adopt the OpenCL models and to follow the Intel FPGA SDK for OpenCL guidelines, and led to successful training using the emulated FPGA.
• FPGA deployment: By getting access to the Intel DevCloud environment, we were able to deploy and test the program on an actual FPGA board. This step demanded more design changes and primary optimisations, as the emulation is not exactly equivalent to real hardware and hardware capacity was added to the constraints.
• Optimisations: Multiple stages of optimisation were applied to the primary design. We were able to both speed up the design and reduce the utilised hardware resources in each iteration, and finally fit multiple duplicates of the design on the target board to maximise utilisation and consequently the speed up.
6.2 Results and Observations
The key observations of this work can be summarised as follows:
• The accuracy of the ADMM-LSMR method was assessed in [1], and it was observed that this method is able to achieve higher accuracy than SGD and Adam, which are two commonly used gradient-based optimisers. In this work, the accuracy of the implementation was assessed continually during development, both to check the correctness of the implementation and, more importantly, to verify the feasibility of applied techniques such as the variants of fixed-point arithmetic. We were able to maintain the accuracy of the original ADMM-LSMR method with less than 1% penalty while using the nearest rounding method on the HIGGS and IRIS data sets.
• After reaching an acceptable design with regard to the hardware utilisation and performance reports, the performance of the program was measured in several ways. In general, we were able to demonstrate up to 6 times speed up compared to the CPU implementation. We also assessed the impact of the architecture on acceleration by increasing the hidden size, and observed that the acceleration is more effective on larger networks.
• We observed that the nearest rounding method is more effective in the ADMM-LSMR method. This observation was unexpected, as stochastic rounding is reported to be more effective when fixed-point arithmetic is used with gradient-based methods. Nearest rounding is simpler than stochastic rounding, as it does not have the overhead of a pseudo-random number generator and requires fewer resources. Considering this fact and our observation, the nearest rounding method was used in the final implementation.
6.3 Future Work
Some areas can be improved and many ideas can be employed to extend this work, such as:
• Designing and using a 16-bit iterative least-squares solver.
• Using other iterative least-squares solvers such as LSLQ.
• Performing more hardware optimisations to potentially increase the speed up.
• Using more than one device.
• Employing a full or partial HDL implementation to maximise hardware utilisation and efficiency.
• Assessing the method on other neural network architectures.
Bibliography
[1] S. N. A. Foumani, “An Analysis of Alternating Direction Method of Multipliers for Feed-forward Neural Networks.” Independent Study Option, 2020.
[2] I. J. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016. http://www.deeplearningbook.org.
[3] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[5] L. Bottou, “Stochastic gradient learning in neural networks,” Proceedings of Neuro-Nîmes, vol. 91, no. 8, p. 12, 1991.
[6] L. Bottou, “Stochastic gradient descent tricks,” in Neural networks: Tricks of the trade, pp. 421–436, Springer, 2012.
[7] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
[8] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of machine learning research, vol. 12, no. 7, 2011.
[9] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[10] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994.
[11] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[12] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in ICML, 2010.
[13] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in neural information processing systems, pp. 2933–2941, 2014.
[14] E. K. Chong, “Chong and Zak, SH: An Introduction to Optimization,” 1996.
[15] D. Gabay and B. Mercier, “A dual algorithm for the solution of nonlinear variational problems via finite element approximation,” Computers & mathematics with applications, vol. 2, no. 1, pp. 17–40, 1976.
[16] F. Lin, M. Fardad, and M. R. Jovanović, “Design of optimal sparse feedback gains via the alternating direction method of multipliers,” IEEE Transactions on Automatic Control, vol. 58, no. 9, pp. 2426–2431, 2013.
[17] F. Kiaee, C. Gagné, and M. Abbasi, “Alternating direction method of multipliers for sparse convolutional neural networks,” arXiv preprint arXiv:1611.01590, 2016.
[18] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine learning, vol. 3, no. 1, pp. 1–122, 2011.
[19] D. C.-L. Fong and M. Saunders, “LSMR: An iterative algorithm for sparse least-squares problems,” SIAM Journal on Scientific Computing, vol. 33, no. 5, pp. 2950–2971, 2011.
[20] G. Golub and W. Kahan, “Calculating the singular values and pseudo-inverse of a matrix,” Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, vol. 2, no. 2, pp. 205–224, 1965.
[21] C. C. Paige and M. A. Saunders, “LSQR: An algorithm for sparse linear equations and sparse least squares,” ACM Transactions on Mathematical Software (TOMS), vol. 8, no. 1, pp. 43–71, 1982.
[22] R. Estrin, D. Orban, and M. A. Saunders, “LSLQ: An iterative method for linear least-squares with an error minimization property,” SIAM Journal on Matrix Analysis and Applications, vol. 40, no. 1, pp. 254–275, 2019.
[23] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein, “Training neural networks without gradients: A scalable ADMM approach,” in International conference on machine learning, pp. 2722–2731, 2016.
[24] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, “A survey of FPGA-based neural network accelerator,” arXiv preprint arXiv:1712.08934, 2017.
[25] V. Sze, Y.-H. Chen, J. Emer, A. Suleiman, and Z. Zhang, “Hardware for machine learning: Challenges and opportunities,” in 2017 IEEE Custom Integrated Circuits Conference (CICC), pp. 1–8, IEEE, 2017.
[26] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini, P. Akselrod, and S. Talay, “Large-scale FPGA-based convolutional networks,” Scaling up Machine Learning: Parallel and Distributed Approaches, pp. 399–419, 2011.
[27] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., “Dadiannao: A machine-learning supercomputer,” in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622, IEEE, 2014.
[28] S. K. Esser, R. Appuswamy, P. Merolla, J. V. Arthur, and D. S. Modha, “Backpropagation for energy-efficient neuromorphic computing,” in Advances in neural information processing systems, pp. 1117–1125, 2015.
[29] C. D. Schuman, T. E. Potok, R. M. Patton, J. D. Birdwell, M. E. Dean, G. S. Rose, and J. S. Plank, “A survey of neuromorphic computing and neural networks in hardware,” arXiv preprint arXiv:1705.06963, 2017.
[30] B. Girau, “FPNA: concepts and properties,” in FPGA Implementations of Neural Networks, pp. 63–101, Springer, 2006.
[31] A. R. Omondi, J. C. Rajapakse, and M. Bajger, “FPGA neurocomputers,” in FPGA Implementations of Neural Networks, pp. 1–36, Springer, 2006.
[32] J. Liu and C. Wang, “A Survey of Neuromorphic Engineering–Biological Nervous Systems Realized on Silicon,” in 2009 IEEE Circuits and Systems International Conference on Testing and Diagnosis, pp. 1–4, IEEE, 2009.
[33] M. Moussa, S. Areibi, and K. Nichols, “On the arithmetic precision for implementing back-propagation networks on FPGA: a case study,” in FPGA Implementations of Neural Networks, pp. 37–61, Springer, 2006.
[34] Intel, Intel® FPGA SDK for OpenCL™ Pro Edition Best Practices Guide. Intel, 2020.
[35] Xilinx, What is an FPGA? Xilinx, 2020. Accessed June 2, 2020.
[36] Intel, What is an FPGA? Intel, 2020. Accessed June 2, 2020.
[37] HardwareBee, The Ultimate Guide to FPGA Design Flow, 2019. Accessed June 2, 2020.
[38] Y. Hao, “A General Neural Network Hardware Architecture on FPGA,” arXiv preprint arXiv:1711.05860, 2017.
[39] M. Drumond, L. Tao, M. Jaggi, and B. Falsafi, “Training DNNs with hybrid block floating point,” in Advances in Neural Information Processing Systems, pp. 453–463, 2018.
[40] T. NVIDIA, “V100 GPU architecture. the world’s most advanced data center GPU. Version WP-08608-001 v1. 1,” NVIDIA. Aug, p. 108, 2017.
[41] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12, 2017.
[42] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
[43] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al., “Mixed precision training,” arXiv preprint arXiv:1710.03740, 2017.
[44] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in International Conference on Machine Learning, pp. 1737–1746, 2015.
[45] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 26–35, 2016.
[46] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, and D. Chen, “Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA,” in 2018 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 163–1636, IEEE, 2018.
[47] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, “Finn: A framework for fast, scalable binarized neural network inference,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 65–74, 2017.
[48] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 367–379, 2016.
[49] M. Ghasemzadeh, M. Samragh, and F. Koushanfar, “ReBNet: Residual binarized neural network,” in 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 57–64, IEEE, 2018.
[50] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
[51] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv preprint arXiv:1806.08342, 2018.
[52] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, “Pact: Parameterized clipping activation for quantized neural networks,” arXiv preprint arXiv:1805.06085, 2018.
[53] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless CNNs with low-precision weights,” arXiv preprint arXiv:1702.03044, 2017.
[54] S. Anwar, K. Hwang, and W. Sung, “Fixed point optimization of deep convolutional neural networks for object recognition,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1131–1135, IEEE, 2015.
[55] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
[56] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv preprint arXiv:1605.04711, 2016.
[57] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv preprint arXiv:1612.01064, 2016.
[58] S. Fox, J. Faraone, D. Boland, K. Vissers, and P. H. Leong, “Training deep neural networks in low-precision with high accuracy using FPGAs,” in 2019 International Conference on Field-Programmable Technology (ICFPT), pp. 1–9, IEEE, 2019.
[59] S. Siddhartha, S. Wilton, D. Boland, B. Flower, P. Blackmore, and P. Leong, “Simultaneous inference and training using on-FPGA weight perturbation techniques,” in 2018 International Conference on Field-Programmable Technology (FPT), pp. 306–309, IEEE, 2018.
[60] Z. Liu, Y. Dou, J. Jiang, Q. Wang, and P. Chow, “An FPGA-based processor for training convolutional neural networks,” in 2017 International Conference on Field Programmable Technology (ICFPT), pp. 207–210, IEEE, 2017.
[61] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
[62] N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan, “Training deep neural networks with 8-bit floating point numbers,” in Advances in neural information processing systems, pp. 7675–7684, 2018.
[63] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, et al., “Ese: Efficient speech recognition engine with sparse LSTM on FPGA,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84, 2017.
[64] S. Wu, G. Li, F. Chen, and L. Shi, “Training and inference with integers in deep
neural networks,” arXiv preprint arXiv:1802.04680, 2018. pages 19, 20
[65] M. Courbariaux, Y. Bengio, and J.-P. David, “Training deep neural networks with
low precision multiplications,” arXiv preprint arXiv:1412.7024, 2014. pages 19
[66] R. Banner, I. Hubara, E. Hoffer, and D. Soudry, “Scalable methods for 8-bit train-
ing of neural networks,” in Advances in neural information processing systems,
pp. 5145–5153, 2018. pages 19
[67] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep
neural networks with binary weights during propagations,” in Advances in neural
information processing systems, pp. 3123–3131, 2015. pages 20, 36
[68] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet
classification using binary convolutional neural networks,” in European conference
on computer vision, pp. 525–542, Springer, 2016. pages 20
[69] G. Yang, T. Zhang, P. Kirichenko, J. Bai, A. G. Wilson, and C. De Sa,
“Swalp: Stochastic weight averaging in low-precision training,” arXiv preprint
arXiv:1904.11943, 2019. pages 20
[70] H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang, “ZipML: Training
linear models with end-to-end low precision, and a little bit of deep learning,” in
International Conference on Machine Learning, pp. 4035–4043, 2017. pages 20
[71] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training
low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv
preprint arXiv:1606.06160, 2016. pages 20
[72] U. Köster, T. Webb, X. Wang, M. Nassar, A. K. Bansal, W. Constable, O. Elibol,
S. Gray, S. Hall, L. Hornof, et al., “Flexpoint: An adaptive numerical format for
efficient training of deep neural networks,” in Advances in neural information
processing systems, pp. 1742–1752, 2017. pages 20
[73] C. De Sa, M. Leszczynski, J. Zhang, A. Marzoev, C. R. Aberger, K. Olukotun, and
C. Ré, “High-accuracy low-precision training,” arXiv preprint arXiv:1803.03383,
2018. pages 20
[74] H. Nakahara, Y. Sada, M. Shimoda, K. Sayama, A. Jinguji, and S. Sato, “FPGA-
Based Training Accelerator Utilizing Sparseness of Convolutional Neural Net-
work,” in 2019 29th International Conference on Field Programmable Logic and
Applications (FPL), pp. 180–186, IEEE, 2019. pages 20
[75] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE:
efficient inference engine on compressed deep neural network,” ACM SIGARCH
Computer Architecture News, vol. 44, no. 3, pp. 243–254, 2016. pages 20
[76] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, “Compressing neural
networks with the hashing trick,” in International conference on machine learning,
pp. 2285–2294, 2015. pages 20
[77] M. Samragh, M. Ghasemzadeh, and F. Koushanfar, “Customizing neural net-
works for efficient FPGA implementation,” in 2017 IEEE 25th Annual Interna-
tional Symposium on Field-Programmable Custom Computing Machines (FCCM),
pp. 85–92, IEEE, 2017. pages 20
[78] H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein, “Training quantized
nets: A deeper understanding,” in Advances in Neural Information Processing
Systems, pp. 5811–5821, 2017. pages 23
[79] Khronos OpenCL Working Group, “The OpenCL Specification, Version 1.2. Document
revision 19,” URL http://www.khronos.org/registry/cl/specs/opencl-1.0, vol. 29,
2012. pages 23
[80] Intel, Intel® FPGA SDK for OpenCL™ Pro Edition Programming Guide. Intel,
2020. pages 25, 26
[81] Intel, “FPGA Acceleration with Intel® DevCloud,” 2020. pages 29
[82] P. Baldi, P. Sadowski, and D. Whiteson, “Searching for Exotic Particles in High-energy
Physics with Deep Learning,” 2014. pages 34
[83] R. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of
Eugenics, vol. 7, part II, pp. 179–188, 1936; also in Contributions to Mathematical
Statistics, 1950. pages 34, 59
[84] C. De Sa, M. Feldman, C. Ré, and K. Olukotun, “Understanding and optimizing
asynchronous low-precision stochastic gradient descent,” in Proceedings of the
44th Annual International Symposium on Computer Architecture, pp. 561–574,
2017. pages 35
[85] C. M. De Sa, C. Zhang, K. Olukotun, and C. Ré, “Taming the wild: A unified
analysis of hogwild-style algorithms,” in Advances in neural information process-
ing systems, pp. 2674–2682, 2015. pages 35
[86] N. Srivastava, “Improving neural networks with dropout,” University of Toronto,
vol. 182, no. 566, p. 7, 2013. pages 36
[87] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
“Dropout: a simple way to prevent neural networks from overfitting,” The journal
of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014. pages 36
[88] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus, “Regularization of neural
networks using dropconnect,” in International conference on machine learning,
pp. 1058–1066, 2013. pages 36
[89] T. Muntsinger, “integer square root,” 2012. pages 42
