University of Texas at El Paso

ScholarWorks@UTEP
Open Access Theses & Dissertations
2021-08-01

Hardware for Quantized Mixed-Precision Deep Neural Networks
Andres Rios
University of Texas at El Paso

Recommended Citation
Rios, Andres, "Hardware for Quantized Mixed-Precision Deep Neural Networks" (2021). Open Access
Theses & Dissertations. 3333.
https://scholarworks.utep.edu/open_etd/3333

HARDWARE FOR QUANTIZED MIXED-PRECISION DEEP NEURAL NETWORKS

ANDRES RIOS
MASTER'S PROGRAM IN COMPUTER ENGINEERING

APPROVED:

Patricia Nava, Ph.D., Chair

Michael McGarry, Ph.D.

Sai Mounika Errapotu, Ph.D.

Julio Urenda, Ph.D.

Stephen L. Crites, Jr., Ph.D.
Dean of the Graduate School

Copyright ©

by
Andres Rios
2021

Dedication
To Mom, Dad, Zeke, Marcos, and Ayleen.

HARDWARE FOR QUANTIZED MIXED-PRECISION DEEP NEURAL NETWORKS
by
ANDRES RIOS, B.S.E.E

THESIS

Presented to the Faculty of the Graduate School of
The University of Texas at El Paso
in Partial Fulfillment
of the Requirements
for the Degree of

MASTER OF SCIENCE

Department of Electrical and Computer Engineering
THE UNIVERSITY OF TEXAS AT EL PASO
August 2021

Acknowledgments
I want to thank Dr. Patricia Nava for her continued support and mentorship throughout the past
two years; I would not be where I am if not for her. Dr. Nava inspired and granted me the freedom
to pursue topics that interest me. Taking me in as an undergraduate, she has poured endless
amounts of time into me through our weekly meetings, manuscript revisions, and letters of
recommendation. Dr. Nava is genuinely invested in her students' success and gives them a platform
to reach their highest potential. For that, I am incredibly grateful to her, and I am proud to call her
my mentor.
I would also like to thank the Pathways to Success in Graduate Engineering (PASSE) program for
supporting me financially from the end of my undergraduate degree to the completion of my
master's degree. This program allowed me to fully invest myself in my classes and research without
being burdened by monetary struggles.
Furthermore, I thank UTEP and the College of Engineering for allowing me to pursue a graduate
degree concurrently with my undergraduate studies. Through the Fast Track program, I was able
to earn a Master of Science in Computer Engineering in under a year.
Finally, I would like to thank my family and friends for their love and support throughout this
process. They encouraged me to pursue a graduate degree and to continue following my passions.
I deeply cherish their unwavering support.

Abstract
Recently, there has been a push to perform deep learning (DL) computations on the edge rather
than the cloud due to latency, network connectivity, energy consumption, and privacy issues.
However, state-of-the-art deep neural networks (DNNs) require vast amounts of computational
power, data, and energy—resources that are limited on edge devices. This limitation has brought
the need to design domain-specific architectures (DSAs) that implement DL-specific hardware
optimizations. Traditionally, DNNs have run on 32-bit floating-point numbers; however, a body of
research has shown that DNNs are surprisingly robust and do not require all 32 bits. Instead, using
quantization, networks can run on extremely low bit widths (1-8 bits) with fair accuracy. This
suggests that edge devices can handle low-bit-width DNNs, saving computations and energy at
the cost of some accuracy. In addition to DNNs being run on low bit widths, it has also been shown
that not all layers within a network require the same precision. Therefore, a further optimization
is to use per-layer mixed-precision quantization rather than uniform quantization. This
thesis conducts a comparative study on the effects of mixed-precision quantization using
"simulated quantization" in software. Furthermore, a mixed-precision multiplier that can be
configured at run time is designed to support mixed-precision quantized DNNs in hardware, and
a comparative study is performed against a full-precision implementation.

Table of Contents
Acknowledgments
Abstract
Table of Contents
List of Tables
List of Figures
1 Introduction
1.1 Organization
2 Background
2.1 History of Neural Networks
2.1.1 The Biological Neuron
2.1.2 The Artificial Neuron
2.1.3 The XOR Problem
2.1.4 Multilayer Perceptron
2.1.5 State of AI
2.2 Artificial Intelligence
2.2.1 Deep Learning
2.2.2 Learning Strategies
2.3 Fundamentals of Deep Learning
2.3.1 Feedforward
2.3.2 Backpropagation
2.4 Advanced Neural Networks
2.4.1 Convolutional Neural Networks (CNNs)
2.4.2 Recurrent Neural Networks (RNNs)
2.4.3 Spike Neural Networks (SNNs)
2.5 Hardware for Deep Learning
2.5.1 Computer Metrics
2.5.2 Processors
2.6 Deep Learning on the Edge
2.6.1 Challenges of edge-to-cloud models
2.6.2 The Case for Edge Intelligence
2.6.3 Applications of Edge Intelligence
3 Discussion of the Problem
3.1 Network Sparsity and Pruning
3.2 Precision Reduction
3.2.1 Number Representations
3.2.2 Quantization
3.2.3 Mixed-precision Quantization
3.3 Variable Bit Width Multipliers
3.4 Proposed Work
4 Discussion of the Work & Results
4.1 Effects of Extreme Mixed-precision Quantization on MNIST and CIFAR-10 Datasets
4.1.1 Mixed-precision Quantization Methodology for MNIST
4.1.2 MNIST Dataset Results and Comparative Study
4.1.3 Mixed-precision Quantization Methodology for CIFAR-10
4.1.4 CIFAR-10 Dataset Results and Comparative Study
4.2 Mixed-precision Architecture Proof of Concept on XOR Problem
4.2.1 Single-precision Model in PyTorch
4.2.2 Single-precision Architecture Design
4.2.3 Mixed-precision Model in PyTorch
4.2.4 Mixed-precision Architecture Design
4.2.5 Single-precision and Mixed-precision XOR Problem Comparative Study Results
5 Conclusions & Future Work
5.1 Conclusion
5.2 Future Work
References
Appendix A Bit Brick Verilog Code
Appendix B Fusion Unit Multiplier Verilog Code
Curriculum Vita

List of Tables
Table 2.1 AlexNet layers.
Table 2.2 Total MAC operations of popular models during inference.
Table 4.1 Comparison between single-precision and mixed-precision models on MNIST.
Table 4.2 Number of weights and activations per network.
Table 4.3 Comparison between single-precision and mixed-precision models on CIFAR-10.
Table 4.4 Comparison of single-precision and mixed-precision models on the XOR Problem.

List of Figures
Figure 2.1 Diagram of a biological neuron [2].
Figure 2.2 Diagram of the perceptron.
Figure 2.3 (a) Linearly separable problem (b) XOR Problem.
Figure 2.4 Multilayer Perceptron.
Figure 2.5 Artificial Intelligence hierarchy.
Figure 2.6 Deep Neural Network (DNN).
Figure 2.7 Activation functions (a) ReLU (b) Sigmoid.
Figure 2.8 Gradient of a single variable function.
Figure 2.9 Convolution method.
Figure 2.10 Max pool.
Figure 2.11 RNN neuron.
Figure 2.12 Clock rate and Power for Intel x86 microprocessors over eight generations and 25 years [13].
Figure 2.13 Multiply and Accumulate (MAC) operation block diagram.
Figure 2.14 Temporal Architecture.
Figure 2.15 Spatial Architecture.
Figure 2.16 Tesla's Full Self-Driving (FSD) computer with dual FSD chips [21].
Figure 3.1 Distribution of weights in AlexNet [26].
Figure 3.2 DNN before and after pruning. Transparent connections represent below threshold weights.
Figure 3.3 Pruning process.
Figure 3.4 Distribution of weights in AlexNet after pruning [26].
Figure 3.5 IEEE Single-precision floating-point format.
Figure 3.6 8-bit signed fixed-point format.
Figure 3.7 4-bit integer quantization visualized on a timeline.
Figure 3.8 Quantization error/rescaled quantized values for 4-bit integer quantization.
Figure 3.9 Relative accuracy of quantization on three different-sized networks [31].
Figure 3.10 Energy consumptions for operations in 45nm technology [28].
Figure 3.11 Relative energy consumption for quantized networks [31].
Figure 3.12 Optimal per-layer bit widths using mixed-precision quantization on three different-sized networks at 100% and 99% accuracy [32].
Figure 3.13 Data-gated conventional MAC in 8x8-bit (left), 4x4-bit (middle), and 2x8-bit (right) configurations [33].
Figure 3.14 Bit Brick (BB) diagram.
Figure 3.15 Fusion Unit (FU) multiplier.
Figure 3.16 Fusion Unit in four configurations containing different sized Partial Fusion Units (PFU).
Figure 4.1 LeNet-5 architecture.
Figure 4.2 LeNet-5 Convolution 2 weight histogram.
Figure 4.3 LeNet-5 8-bit quantized Convolution 2 weight histogram.
Figure 4.4 LeNet-5 Convolution 2 activation histogram.
Figure 4.5 LeNet-5 modified quantization architecture.
Figure 4.6 LeNet-5 8-bit quantized Convolution 2 activation histogram.
Figure 4.7 Optimal bit width, per layer, found on LeNet-5 while maintaining 95% of the baseline accuracy.
Figure 4.8 CIFAR-10 architectures.
Figure 4.9 Optimal bit width found on CIFAR-10 while maintaining 95% of the baseline accuracy.
Figure 4.10 Single-precision XOR Problem model.
Figure 4.11 Single-precision systolic array. Circles indicate inputs. Image adapted from [38].
Figure 4.12 Single-precision Processing Engine (PE) schematic.
Figure 4.13 Single-precision Processing Engine (PE) testbench simulation.
Figure 4.14 Single-precision XOR Problem schematic.
Figure 4.15 Single-precision XOR Problem testbench simulation.
Figure 4.16 Mixed-precision XOR Problem model.
Figure 4.17 Optimal bit width found on XOR Model while maintaining 100% of the baseline accuracy.
Figure 4.18 Bit Brick (BB) schematic.
Figure 4.19 Fusion Unit (FU) multiplier block diagram.
Figure 4.20 Fusion Unit Processing Engine (FUPE) schematic.
Figure 4.21 Fusion Unit Processing Engine (FUPE) testbench simulation.
Figure 4.22 Quantization unit schematic.
Figure 4.23 Mixed-precision XOR Problem schematic.
Figure 4.24 Mixed-precision XOR Problem testbench simulation.

1 Introduction
The explosion of algorithmic advancements in Deep Learning (DL) in the past decade has made
Artificial Intelligence (AI) possible in many areas. However, as the sophistication of
DL models grows, so does the computational complexity required to process them. Computer
architects look to keep up with this demand by designing Domain Specific Architectures (DSA),
capable of exploiting optimizations unique to DL applications.
Specifically, one of the domains where DSAs would prove beneficial is edge devices. Due to the
limitations of the edge-to-cloud model seen widely today, there has been a push to process AI
computations on the edge rather than in the cloud, also known as Edge Intelligence (EI). However,
in contrast to the virtually limitless resources seen in the cloud, edge devices are highly constrained
in energy, performance, and size. Therefore, techniques that reduce model size are sought to
overcome the constraints associated with edge devices.
As it has been shown that Deep Neural Networks (DNNs) are robust to minor changes, precision
reduction is one technique used to reduce network size. This work focuses on a precision reduction
method called quantization and proposes hardware explicitly designed to support it.

1.1 Organization
The remainder of this document is organized as follows:
Chapter 2 starts by discussing the history of AI through the various waves of rises and falls
experienced by the field. A DNN is defined, and the main three learning strategies are discussed.
Specifics are given on the Feedforward and Backpropagation algorithms, and various advanced NN
architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks
(RNNs), and Spike Neural Networks (SNNs), are examined. Processor metrics including
performance, power, and area are explored, and the temporal and spatial architectures are
compared. Challenges posed by the edge-to-cloud model for DL on the edge are given, and a case
is made for moving computations to the edge. The chapter is concluded with examples of
applications where EI would provide significant improvement or make AI possible.
Chapter 3 provides explanations for two optimizations that aim to reduce model size: Pruning
and Precision Reduction. Precision Reduction is discussed in depth, describing various number
representation formats and the methods used to reduce precision: quantization and mixed-precision
quantization. Two variable bit-width multipliers capable of handling mixed-precision
operands are compared. Finally, the proposed work is presented, which involves studying the
effect of mixed-precision quantization and designing an architecture with the ability to handle
mixed-precision computations.
Chapter 4 provides the methodology used to carry out the proposed work. Mixed-precision
quantization is performed on models classifying the MNIST and CIFAR-10 datasets, and a
comparative study is conducted against their single-precision counterparts. Additionally, an
architecture capable of executing mixed-precision NNs is designed, and a comparative study is
performed against a single-precision architecture using the XOR Problem.
Chapter 5 summarizes the findings from Chapter 4 and expounds on their significance for EI.
Finally, four future work recommendations are given, providing a starting point for possible further
improvements.

2 Background
2.1 History of Neural Networks
2.1.1 The Biological Neuron

Neurons are the basic cells within our brains and are responsible for powering our nervous system.
Most neurons consist of three main components, as shown in Figure 2.1: dendrites, an axon, and a
cell body. Dendrites act as inputs and allow signals from other neurons to reach the cell. Similarly,
an axon is the cell's output; it branches and ends at the axon terminals. The point where the axon
of one cell meets with the dendrite of another is called the synapse. The cell body or soma contains
the nucleus and is electrically excitable. When the voltage across the cell membrane
changes quickly and by a significant amount, the cell produces an electrical pulse called an action
potential. Finally, this action potential is sent out through the neuron's axon. Our brains are
networks containing 86 billion of these neurons [1], and the result is our ability to perceive and
think cognitively.

Figure 2.1 Diagram of a biological neuron [2].

2.1.2 The Artificial Neuron

As part of the "first wave" of neural networks (NNs), in 1958, Frank Rosenblatt from the Cornell
Aeronautical Laboratory invented the perceptron. Modeled after the biological neuron, the
perceptron performs as a linear predictor function. It multiplies its inputs, represented by the vector
𝐱𝐧, with predetermined weights, 𝐰𝐧. It then performs a summation of all weighted inputs. The
output of the perceptron, 𝒚𝒊, is the output of the activation function 𝛟( ) applied to the summation
as shown in Equation [2.1].
\[ y_i = \phi\left( \sum_{n=0}^{i} w_n \times x_n \right) \]    [2.1]

where:
y_i is the output;
w_n is the weight;
x_n is the input; and
\phi() is the activation function.


From the perceptron shown in Figure 2.2, similarities to the biological neuron can be observed.
The inputs are the axons from other neurons, the dendrites contain the product of inputs and
weights, the function is applied in the nucleus, and the final value is sent out through the axon.

Figure 2.2 Diagram of the perceptron.
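
The weighted sum and activation of Equation [2.1] can be sketched in a few lines of Python. This is only an illustration: the threshold (step) activation, the weight values, and the convention of treating the first input as a constant bias are assumptions made for the example and are not taken from the text.

```python
def step(z):
    """Threshold activation; an assumed choice for phi, not fixed by the text."""
    return 1 if z > 0 else 0


def perceptron(inputs, weights, activation=step):
    """Apply the activation function to the sum of weight * input products."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)


# Hypothetical example: the first input is held at 1 so that its weight
# acts as a bias. These illustrative weights implement a logical AND.
weights = [-1.5, 1.0, 1.0]
and_outputs = [perceptron([1, a, b], weights) for a in (0, 1) for b in (0, 1)]
print(and_outputs)
```

With these assumed weights, the perceptron fires only when both data inputs are 1, which is a linearly separable decision of the kind discussed in the next section.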

2.1.3 The XOR Problem

Initially, the Navy implemented the perceptron to perform image processing and described it as
"the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write,
reproduce itself and be conscious of its existence" [3]. Although the Navy had high hopes for the
perceptron, a single perceptron could not recognize certain patterns because of its linear nature.
An example of this is the XOR classification problem. Figure 2.3 (a) shows an example of a
linearly separable problem. It is known to be linearly separable because a single line can separate
the two classifications. However, Figure 2.3 (b) shows the XOR Problem, which requires two lines
to separate the categories despite having only four valid data points. A single layer of Perceptrons
is unable to solve a nonlinearly separable problem such as an XOR classification. The limitation
of Perceptrons caused a decline in the field, and it was not until multiple layers of Perceptrons
were used that this type of problem could be solved successfully.

Figure 2.3 (a) Linearly separable problem (b) XOR Problem.
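
The difference between the two problems in Figure 2.3 can be illustrated with a small Python sketch. It searches a coarse grid of weights and biases for a single threshold perceptron that reproduces a truth table, and then evaluates a two-layer arrangement with hand-picked weights. The grid range, the step activation, and the OR/NAND decomposition are assumptions chosen for the example; an empty grid search only illustrates, and does not prove, that no single line separates the XOR classes.

```python
from itertools import product


def step(z):
    """Threshold activation (assumed for this illustration)."""
    return 1 if z > 0 else 0


def perceptron(x, w, b):
    """Single perceptron: step applied to the weighted sum plus a bias."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)


INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]  # linearly separable truth table
XOR = [0, 1, 1, 0]  # not linearly separable

# Hypothetical search grid: values from -4.0 to 4.0 in steps of 0.5.
GRID = [i / 2 for i in range(-8, 9)]


def separable_on_grid(targets, grid):
    """Return True if some (w1, w2, b) on the grid reproduces the targets."""
    for w1, w2, b in product(grid, repeat=3):
        if all(perceptron(x, (w1, w2), b) == t for x, t in zip(INPUTS, targets)):
            return True
    return False


def two_layer_xor(x):
    """Two hidden perceptrons (OR and NAND) feeding an AND output perceptron."""
    h_or = perceptron(x, (1, 1), -0.5)
    h_nand = perceptron(x, (-1, -1), 1.5)
    return perceptron((h_or, h_nand), (1, 1), -1.5)


print(separable_on_grid(AND, GRID))       # a single perceptron suffices
print(separable_on_grid(XOR, GRID))       # no grid point works
print([two_layer_xor(x) for x in INPUTS])
```

Each hidden perceptron draws one of the two lines mentioned above, and the output perceptron combines them, which is why adding a layer resolves the problem.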

2.1.4 Multilayer Perceptron

Figure 2.4 shows a diagram of a multilayer perceptron. Although multilayer perceptrons were presented as a
powerful upgrade that could solve nonlinearly separable problems, the issue was how to train and
adjust the weights of these networks. The "second wave" in Artificial Intelligence (AI) came with
the reinvention and popularization of the backpropagation algorithm in 1986 [4]. This algorithm
utilized gradients for training multilayer networks. Although the algorithms were available to train
layered networks, the computing power available at the time was not. Simple "toy" problems could
be solved; however, when a network was scaled up, the number of computations increased
dramatically, a difficulty known as the combinatorial explosion problem [5]. The implementation of a
neural network was not practical, which led to yet another decline in the field.

Figure 2.4 Multilayer Perceptron.

2.1.5 State of AI

From 1993 to 2006, not much attention was given to AI. Nonetheless, algorithmic
development for NNs made progress. One of the most notable of these developments was the
invention of Convolutional Neural Networks (CNNs). Simultaneously, semiconductor technology
had continued its advancement, making massive improvements to computational power limits
since the 1980s. Gordon Moore's prediction in 1975 (known as Moore's Law) that the number of
transistors on a chip, and with it computers' capability, would double every two years had held true.
The wide adoption of the internet also
allowed for massive amounts of data to be generated and collected. The convergence of
sophisticated algorithms, advances in computing power, and big data acquisition gave birth to
AI's "third wave" in the late 2000s [6].
Most recently, with the decline of Moore's Law, processor performance is now forecasted to
double every 20 years. Researchers are now looking at hardware accelerators to meet the
increasing demand for the computing of NNs. GPUs [7] and Domain-Specific Architectures
(DSAs) [8] are at the forefront of this effort for their ability to exploit parallelism found naturally
within NNs.

2.2 Artificial Intelligence
AI is a broad topic, encompassing many models, techniques, and ideologies, but it can be defined
as anything perceived to have human or animal-like intelligence. A well-known example of AI
came in 1997, when the Deep Blue chess-playing computer from IBM won a six-game match
against Garry Kasparov, a world champion in chess. The system did not "learn" to play chess by
example, but instead, its human-like intelligence was hardcoded. Although a challenging game,
chess contains stringent rules which players must adhere to, allowing programmers to set
predetermined rules under which the computer functioned explicitly. This type of hard coding is
known as a knowledge-based approach to AI. However, this approach has limited capabilities when
it comes to performing tasks like humans and is considered by many not to be true intelligence
because a machine cannot learn on its own using this method. Machines can
accomplish tasks that people find difficult, such as playing chess or performing complicated math.
Ironically, they struggle to perform tasks that we find effortless, such as driving a car or recognizing
speech. The difficulty with the knowledge-based approach is that programmers typically fail
to describe our world's complexity fully.
A more effective strategy is to let computers derive their descriptions from data, also known as
Machine Learning (ML). In an ML algorithm, a computer is given a task, data, and performance
measures to see how well it completes the task. The data can be thought of as experience, and as
it acquires more experience with the task, it can iteratively adjust itself to achieve better
performance. One of the ways machines can learn from data is through biologically inspired
models like NNs.
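
The three ingredients named above (a task, data, and a performance measure) and the idea of iterative self-adjustment can be made concrete with a minimal Python sketch. The example uses the classic perceptron learning rule on the logical OR function; the choice of rule, the learning rate, the number of epochs, and the data set are assumptions made for illustration and are not prescribed by the text.

```python
def step(z):
    """Threshold activation (assumed for this illustration)."""
    return 1 if z > 0 else 0


def predict(w, b, x):
    """Output of a two-input perceptron with weights w and bias b."""
    return step(w[0] * x[0] + w[1] * x[1] + b)


def accuracy(w, b, data):
    """Performance measure: fraction of samples classified correctly."""
    return sum(predict(w, b, x) == t for x, t in data) / len(data)


def train(data, lr=1.0, epochs=20):
    """Perceptron learning rule: nudge weights by the error on each sample."""
    w, b = [0.0, 0.0], 0.0
    history = []
    for _ in range(epochs):
        for x, t in data:
            err = t - predict(w, b, x)
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
        history.append(accuracy(w, b, data))
    return w, b, history


# Task: learn the logical OR function. Data: its four input/target pairs.
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b, history = train(OR_DATA)
print(history[0], history[-1])
```

The recorded accuracy starts below 100% and reaches it after a few passes over the data, which mirrors the description of a model improving as it gains experience with the task.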
2.2.1 Deep Learning

Multilayer Perceptrons and neural networks are synonymous and describe a layered network of
Perceptrons. Every Neural Network has an input layer, at least one hidden layer, and an output
layer, as shown in Figure 2.4. Another class of networks called a Deep Neural Network (DNN)
describes a Neural Network with more than one hidden layer, hence the word 'Deep,' and is the
focus of the field of Deep Learning (DL). Multiple hidden layers are needed because many
features need to be extracted from data. For example, consider a neural network for classifying
handwritten digits. In that case, one layer could pick up on vertical edges, the next could look
for horizontal edges, another for curved edges, and so on. The network could combine all these
features in the final layer and indicate the corresponding number for any handwritten digit. Of
course, this might not be at all how a network extracts features; the exact assignment or purpose
of each neuron and layer is delegated to the DNN and its learning algorithm. Hidden layers are
hidden because they have no obligation to make sense to humans; therefore, we only need to focus
on the visible layers (input and output). Figure 2.5 shows the relationship between AI, ML, and
DL. The process of creating and using a DNN can be broken up into training and inference.
Training is when the network is expected to learn from the data presented to it. Inference is when
the network has already learned from the training data and uses its knowledge to produce an output.

Figure 2.5 Artificial Intelligence hierarchy.

2.2.2

Learning Strategies

There are three main types of learning: supervised learning, unsupervised learning, and semi-supervised learning. In a supervised learning data set, all samples are associated with a label or
target. These labels act as a reference during the training and guide the network towards its
intended outcome. After training, the labels are used to evaluate a network's performance, typically
by reserving a previously "unseen" section of the data set for testing only. In contrast, an
unsupervised learning data set contains no labeled samples. Instead, networks create their own
categories by finding patterns and relationships in the training data. In-between supervised and
unsupervised learning, we find semi-supervised learning, which contains labeled and unlabeled
samples. Typically, the network is shown a small number of labeled samples and is tasked with
correctly labeling the unlabeled data. The most common of the techniques outlined is supervised
learning, usually used for classification and regression problems, while unsupervised learning is
used for clustering data and detecting outliers.

2.3 Fundamentals of Deep Learning
2.3.1

Feedforward

The model presented for the single perceptron in Section 2.1.1 is the same one used in a DNN.
Please note that the terms perceptron and artificial neuron are used synonymously. Even though
the equation for a neuron is simple, the culmination of hundreds, thousands, or even millions of
them creates the complex system that is a DNN. To put it simply, the output of a perceptron is a
function of its weights and inputs. In a DNN, the neurons are grouped into layers, as shown in
Figure 2.6. The first layer is the input layer and contains the raw data fed into the network. The
information is then "fed forward" through the hidden layers until the massaged information finally
reaches the output layer. The hidden layers are "hidden" because a programmer cannot explicitly
tell the network what features each layer might choose to extract. The input and output layers form
the visible layers and are the interfaces to the system. The arrows going from each neuron in one
layer to each neuron in the next represent the weights. Each input or activation is multiplied by its
respective weight and fed into the next layer.

Figure 2.6 Deep Neural Network (DNN).

Since many multiplications happen between layers, it is beneficial to look at it as a matrix-vector
dot product. The input x or activation a are nx1 vectors, and weights w are mxn matrices. Equation
[2.2] shows the feedforward equation for layer l in a DNN, resulting in an mx1 vector.

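\[
\begin{aligned}
a^{l} &= \phi\left(w^{l} \cdot a^{l-1} + b^{l}\right) \\
&= \phi\left(
\begin{bmatrix}
w^{l}_{0,0} & w^{l}_{0,1} & \cdots & w^{l}_{0,N} \\
w^{l}_{1,0} & w^{l}_{1,1} & \cdots & w^{l}_{1,N} \\
\vdots & \vdots & \ddots & \vdots \\
w^{l}_{M,0} & w^{l}_{M,1} & \cdots & w^{l}_{M,N}
\end{bmatrix}
\cdot
\begin{bmatrix}
a^{l-1}_{0} \\ a^{l-1}_{1} \\ \vdots \\ a^{l-1}_{N}
\end{bmatrix}
+
\begin{bmatrix}
b^{l}_{0} \\ b^{l}_{1} \\ \vdots \\ b^{l}_{M}
\end{bmatrix}
\right) \\
&= \phi\left(
\begin{bmatrix}
w^{l}_{0,0} a^{l-1}_{0} + w^{l}_{0,1} a^{l-1}_{1} + \cdots + w^{l}_{0,N} a^{l-1}_{N} + b^{l}_{0} \\
w^{l}_{1,0} a^{l-1}_{0} + w^{l}_{1,1} a^{l-1}_{1} + \cdots + w^{l}_{1,N} a^{l-1}_{N} + b^{l}_{1} \\
\vdots \\
w^{l}_{M,0} a^{l-1}_{0} + w^{l}_{M,1} a^{l-1}_{1} + \cdots + w^{l}_{M,N} a^{l-1}_{N} + b^{l}_{M}
\end{bmatrix}
\right) \\
&=
\begin{bmatrix}
\phi\left(w^{l}_{0,0} a^{l-1}_{0} + w^{l}_{0,1} a^{l-1}_{1} + \cdots + w^{l}_{0,N} a^{l-1}_{N} + b^{l}_{0}\right) \\
\phi\left(w^{l}_{1,0} a^{l-1}_{0} + w^{l}_{1,1} a^{l-1}_{1} + \cdots + w^{l}_{1,N} a^{l-1}_{N} + b^{l}_{1}\right) \\
\vdots \\
\phi\left(w^{l}_{M,0} a^{l-1}_{0} + w^{l}_{M,1} a^{l-1}_{1} + \cdots + w^{l}_{M,N} a^{l-1}_{N} + b^{l}_{M}\right)
\end{bmatrix}
\end{aligned}
\tag{2.2}
\]

where:
a^l are the activations of layer l;
w^l are the weights in layer l;
b^l are the biases of layer l; and
ϕ( ) is the activation function.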

The activation function ϕ( ) can be chosen depending on the needs of the system. The two most
common functions, the Rectified Linear Unit (ReLU) and the Sigmoid, are shown in Figure
2.7. Not delving into specifics, the choice of an activation function can dramatically impact a
network's performance; therefore, much care should be taken when selecting one.

Figure 2.7 Activation functions (a) ReLU (b) Sigmoid.
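
The feedforward computation of Equation [2.2] for a single layer can be sketched in a few lines of Python. This is a minimal illustration rather than an optimized implementation; the function names and the example weights, biases, and activations are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math

def relu(z):
    # ReLU returns zero for negative inputs and the input itself otherwise.
    return max(0.0, z)

def sigmoid(z):
    # Sigmoid squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def feedforward_layer(weights, biases, activations, phi):
    """Compute phi(w . a + b) for one layer.

    weights: list of rows, one row of input weights per neuron
    biases: one bias per neuron
    activations: outputs of the previous layer
    phi: activation function applied element-wise
    """
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * a for w, a in zip(row, activations)) + bias
        outputs.append(phi(z))
    return outputs

# Hypothetical layer: 3 inputs feeding 2 neurons.
w = [[0.5, -1.0, 2.0],
     [1.0, 1.0, -1.0]]
b = [0.0, -0.5]
a_prev = [1.0, 2.0, 3.0]

print(feedforward_layer(w, b, a_prev, relu))  # [4.5, 0.0]
```

The second neuron's weighted sum is negative, so the ReLU clamps its output to zero, while the first neuron passes its weighted sum through unchanged.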

2.3.2

Backpropagation

As mentioned previously, the act of creating and using a DNN is broken into two parts: inferring
and learning, with the feedforward operation satisfying the former. It is necessary to provide a way
for the network to correct and learn from its presented data. In a supervised learning model, the
"supervisor" can inform the network if it is incorrect and, if so, by how much. The error or loss
can be calculated in many ways, but most methodologies involve a straightforward equation.
Equation [2.3] shows the Mean Squared Error loss function where ŷ is the value our network
predicted, y is what the prediction should have been, and N is the number of training samples used.
The loss function quantifies how much the predictions should change, with a loss of zero being
the best-case scenario.

\[
MSE(y, \hat{y}) = \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N} \tag{2.3}
\]

where:
ŷ is the predicted output;
y is the expected output; and
N is the number of training samples.
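
As a small illustration of the Mean Squared Error in Equation [2.3], the sketch below computes the loss for a list of expected and predicted outputs. The function name and the sample values are assumptions made for this example only.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between expected and predicted outputs."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    return sum(squared_errors) / len(y_true)

# A perfect prediction gives the best-case loss of zero.
print(mean_squared_error([1.0, 2.0], [1.0, 2.0]))  # 0.0

# One of two predictions is off by 1, so the loss is (1 + 0) / 2.
print(mean_squared_error([1.0, 0.0], [0.0, 0.0]))  # 0.5
```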

In a simple network, weights could be manually tuned, one by one, to achieve the desired outcome.
However, this can be a daunting and inefficient task in a more complicated network with one or
more hidden layers. The most common solution to this problem is to use the backpropagation
technique to update a network's weights. A DNN is essentially a sizeable multi-variable function.
Recall from calculus that a function's gradient ∇ points in the direction of steepest ascent, that is,
toward a local maximum. If the gradient is negated, it points in the direction of steepest descent,
toward a local minimum. Figure 2.8 shows an
example of the loss with respect to a certain weight. The black arrows represent our negative
gradient, and they can be thought of as pointing in the direction that a ball would roll if placed at
a particular point on the graph. Utilizing the gradient, the backpropagation technique "nudges"
each weight towards its optimal value.

Figure 2.8 Gradient of a single variable function.

Equation [2.4] shows the gradient descent algorithm for updating the weights where w are the
weights, η is the learning rate, and ∇L(w) is the gradient of the loss function.

\[
w = w - \eta \nabla L(w) \tag{2.4}
\]

where:
w are the weights of the network;
η is the learning rate; and
∇L(w) is the gradient of the loss function
with respect to the weights.
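
The update rule of Equation [2.4] can be illustrated on a function of a single weight. The sketch below assumes a hypothetical loss L(w) = (w - 3)^2, whose gradient is 2(w - 3) and whose minimum lies at w = 3; the loss, the starting point, and the learning rate are illustrative choices, not values taken from any network in this work.

```python
def gradient_descent(gradient, w_initial, learning_rate, steps):
    """Repeatedly move the weight against the gradient of the loss."""
    w = w_initial
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

def loss_gradient(w):
    # Gradient of the assumed loss L(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w_final = gradient_descent(loss_gradient, w_initial=0.0, learning_rate=0.1, steps=100)
print(w_final)  # very close to 3.0, the minimum of the assumed loss
```

Each step shrinks the distance to the minimum by a constant factor, which is the "nudging" behavior described above. A learning rate that is too large would overshoot the minimum instead of converging.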

2.4 Advanced Neural Networks
2.4.1

Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are used for problems involving data represented in
arrays. For example, discrete-time signals and images can be expressed in 1D and 2D arrays,
respectively. Both examples share a common property in that data points located near each other
are related. This property can be exploited by convolving an input with a window of weights. This
window of weights is called a kernel and allows for weights to be shared among the inputs. The
input is convolved by sliding the kernel over the input and computing an element-wise matrix
multiplication and summation. Convolving an input and kernel produces what is called a feature
map. The amount that the kernel slides over in each step of a convolution is called its stride and
can change the feature map's size. Figure 2.9 shows the first and last steps of a convolution with a
4x4 input, a 2x2 kernel, and a stride of 1, with the output being a 3x3 feature map.

Figure 2.9 Convolution method.
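
The sliding-window operation described above can be sketched as follows. As is customary in deep learning, the kernel is not flipped, so the operation is strictly a cross-correlation. The function name, the 4x4 input values, and the 2x2 kernel are hypothetical and were chosen so that the 3x3 feature map can be checked by hand.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image; multiply element-wise and sum at each position."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for di in range(k_h):
                for dj in range(k_w):
                    acc += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, 1]]

print(convolve2d(image, kernel, stride=1))  # [[7, 9, 11], [15, 17, 19], [23, 25, 27]]
print(convolve2d(image, kernel, stride=2))  # [[7, 11], [23, 27]]
```

With a stride of 1 the 4x4 input yields a 3x3 feature map, and increasing the stride to 2 shrinks the feature map to 2x2, which shows how the stride changes the output size.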

One of the benefits derived from convolutional networks is the savings of weight parameters.
Weights in each kernel are shared throughout the entire input. For example, given the 4x4 input,
if it is fed to a traditional layer, also called a fully connected (FC) layer, the network would need
4x4xN weights, where N is the number of neurons in the next layer. In comparison, convolution
with a 2x2 kernel only needs four weights (assuming the network in question only has one kernel).
Typically, multiple kernels are used for each channel to produce numerous feature maps, with an
image usually composed of 3 channels, red, green, and blue.
It is often necessary to downsample the feature maps to reduce the number of activations fed into
subsequent layers or better generalize an image using a lower granularity. A pooling layer
downsizes outputs by splitting the image into equally sized windows and taking either the
minimum, maximum, or average of those windows. A max pool is the most common method and
takes the maximum of a given window. An example of a max pool operation is shown in Figure
2.10.

Figure 2.10 Max pool.
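
A max pool over non-overlapping windows can be sketched as shown below. The function name, the window size of 2, and the input values are assumptions for illustration only.

```python
def max_pool2d(feature_map, window=2):
    """Split the feature map into window x window tiles and keep the maximum of each."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, rows - window + 1, window):
        pooled_row = []
        for j in range(0, cols - window + 1, window):
            tile = [feature_map[i + di][j + dj]
                    for di in range(window)
                    for dj in range(window)]
            pooled_row.append(max(tile))
        pooled.append(pooled_row)
    return pooled

feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 5, 6]]

print(max_pool2d(feature_map))  # [[6, 4], [7, 9]]
```

The 4x4 input is reduced to a 2x2 output, so the next layer receives one quarter as many activations.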

A CNN is a network that has a convolutional layer anywhere in the network. A standard
implementation of a CNN contains multiple convolutional layers towards the beginning of the
network, where a pooling layer follows some convolutional layers. Then in the final layers, the
network flattens the output arrays to one dimension and feeds them into fully connected layers
resembling our traditional DNN. Table 2.1 shows the layers of AlexNet, a popular image
recognition network, which follows the general format described by having five convolutional
layers at the beginning (some followed by a max pool) and three fully connected layers at the end.

Table 2.1 AlexNet layers.

Layer  Type             Input Shape  Output Shape  Activation
C1     Convolution      244x244x3    55x55x96      ReLU
S2     Max pool         55x55x96     27x27x96
C3     Convolution      27x27x96     27x27x256     ReLU
S4     Max pool         27x27x256    13x13x256
C5     Convolution      13x13x256    13x13x384     ReLU
C6     Convolution      13x13x384    13x13x384     ReLU
C7     Convolution      13x13x384    13x13x256     ReLU
S8     Max pool         13x13x256    6x6x256
F9     Fully Connected  9216         4096          ReLU
F10    Fully Connected  4096         4096          ReLU
F11    Fully Connected  4096         1000          ReLU

2.4.2

Recurrent Neural Networks (RNNs)

Another unique class of NNs is Recurrent Neural Networks (RNNs). RNNs help solve sequential
data problems, such as speech, text, and time-series signals, which would not work well with a
traditional DNN. Feeding sequential data into a DNN is not the best approach because a DNN fails
to observe the time relationship between data. This approach, when applied to text or speech, is
called the bag-of-words method. To illustrate the problem, consider the sentence "The dog is in
the house." and "The house is in the dog.". In the bag-of-words approach, the sentence is broken
up into words, and the number of times each word occurs in the sentence is fed into a DNN.
However, even though the DNN might recognize that the subject has something to do with a "dog"
or "house," it has no way of distinguishing the sentences' differences. It fails to see that the ordering
of words can change the meaning. RNNs solve this by feeding a neuron's output back into itself,
thereby keeping a memory of past inputs and exploiting the input data's sequential properties [9].
Figure 2.11 shows a single recurrent neuron with sequential inputs X0-X3. The unfolded version
depicts the neuron through each step in time, producing an output dependent on all past inputs.
Notice that earlier inputs contribute less to the final output as time passes.

Figure 2.11 RNN neuron.
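
The difference between the bag-of-words representation and a recurrent neuron can be illustrated with a short sketch. The recurrent neuron here is deliberately simplified: it is a linear recurrence with no activation function, and the weights of 0.5 and 1.0 are hypothetical values chosen for easy arithmetic.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count how often each word occurs, discarding word order."""
    return Counter(sentence.lower().replace(".", "").split())

def recurrent_neuron(inputs, w_h=0.5, w_x=1.0):
    """Simplified recurrence h_t = w_h * h_(t-1) + w_x * x_t (no activation function)."""
    h = 0.0
    for x in inputs:
        h = w_h * h + w_x * x
    return h

# Both sentences produce the same word counts, so a DNN fed these
# counts cannot tell them apart.
same_counts = bag_of_words("The dog is in the house.") == bag_of_words("The house is in the dog.")
print(same_counts)  # True

# The recurrent neuron produces different outputs when the order changes.
print(recurrent_neuron([1.0, 2.0]), recurrent_neuron([2.0, 1.0]))  # 2.5 2.0

# The first of four inputs is scaled by 0.5 three times before reaching the output.
print(recurrent_neuron([1.0, 0.0, 0.0, 0.0]))  # 0.125
```

The last line shows why earlier inputs contribute less over time: each additional step multiplies the stored state by the recurrent weight again.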

One of the problems that occur with RNNs is the vanishing gradient problem. The problem arises
when a gradient becomes smaller and smaller as it is backpropagated through many layers.
The gradients become so small that the earlier layers are only changed by a minuscule amount
during a weight update, making it difficult for the network to converge. RNNs suffer from this due
to the high number of successive multiplications of inputs and weights. The vanishing gradient
can be best illustrated by looking again at Figure 2.11. It can be shown that the final output h3
contains only a tiny sliver of the first input x0, and it will seem to "disappear" as we keep adding
sequential inputs. Although RNNs are suitable for short amounts of data, they struggle with
problems requiring long-term memory. Long Short-Term Memory (LSTM) is an enhancement on
RNNs as it exhibits a long-term memory and allows for fine-grained control of what it
"remembers" and "forgets" [10].
2.4.3

Spiking Neural Networks (SNNs)

Spiking Neural Networks (SNNs) are said to be the third generation of NNs and are extremely
energy efficient and closely modeled after biological models, even more so than DNNs. SNNs,
instead of using digital signals (ones and zeros), use spike communication much like the human
brain through neuromorphic computing [11]. SNNs add another dimension that has been missing
from other biologically inspired models—time. Although it can be argued that RNNs provide time
variance, the underlying mechanisms by which they operate are still in digital logic. Spikes in an
SNN are coded to represent a particular value. Different techniques can be used to encode
information, such as the spike rate in a specific time interval, the time between multiple
consecutive spikes, or the time between two successive spikes.
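
Two of the encoding schemes mentioned above can be sketched as simple decoders over a list of spike timestamps. This is only an illustration of the idea; the function names, the use of milliseconds, and the example spike train are assumptions and do not correspond to any particular SNN framework.

```python
def spike_rate(spike_times, window_start, window_end):
    """Rate coding: number of spikes inside the window divided by its duration."""
    count = sum(1 for t in spike_times if window_start <= t < window_end)
    return count / (window_end - window_start)

def inter_spike_intervals(spike_times):
    """Temporal coding: time elapsed between each pair of successive spikes."""
    return [later - earlier for earlier, later in zip(spike_times, spike_times[1:])]

# Hypothetical spike train, with timestamps in milliseconds.
spikes = [1.0, 3.0, 4.0, 8.0]

print(spike_rate(spikes, 0.0, 10.0))   # 0.4 spikes per millisecond
print(inter_spike_intervals(spikes))   # [2.0, 1.0, 4.0]
```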

The most popular method used to train an unsupervised SNN is Spike-Timing-Dependent Plasticity
(STDP), which relies on spikes occurring at the input and output of a neuron. The synaptic
strengths are adjusted based on the relative timing of spikes occurring at the neurons' input and
output. For supervised learning, backpropagation must be approximated due to the non-differentiable nature of SNNs.
An observation in the field shows the trade-off between creating a more biologically correct model
and creating one that is more computationally feasible. Models that closely resemble the biological
model tend to be more computationally intensive, while more efficient models tend to be less
biologically accurate. The most popular models find a middle ground between these two criteria
[12].

2.5 Hardware for Deep Learning
So far, this manuscript has focused mainly on AI and DL at the algorithmic level. However, it is
vital to convey the relationship between algorithms and hardware at the physical level. The
algorithmic side of DL can also be referred to as the software stack and encompasses the high-level methods and procedures used for the training and inferring of these networks. The hardware
side can be thought of as the actual underlying mechanisms that serve as a platform for these
networks. Both areas go hand in hand. As researchers continue to make progress on network
architecture and hardware designers struggle to keep up with the demand for even more
computational complexities required by these algorithms, their relationship must be known.

2.5.1

Computer Metrics

Comparing different processors can be challenging; engineers must carefully weigh factors such
as cost, performance, size, application, and flexibility. In the AI research field, the most sought-after metrics considered when evaluating hardware are performance, power, and area.
2.5.1.1

Performance

Execution time is a valuable metric of computational performance and is the primary metric used
when classifying hardware for AI. Execution time is the time it takes for a computer to complete
a task from start to finish and is dependent on three factors: instruction count (IC), cycles per
instruction (CPI), and the clock rate. IC is the number of instructions needed to execute a particular
task. CPI is dependent on the computer architecture and describes the number of clock cycles
needed to complete one instruction of the task. The clock rate is also dependent on the hardware
and is defined as how many cycles the clock oscillator completes per second. Equation [2.5] gives
the basic formula for calculating the execution time for a given task.
\[
\text{Execution Time} = \frac{\text{IC} \times \text{CPI}}{\text{Clock Rate}} \tag{2.5}
\]
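
As a quick worked example of Equation [2.5], the sketch below evaluates the execution time for a hypothetical task. The instruction count, CPI, and clock rate are made-up values chosen for round numbers.

```python
def execution_time(instruction_count, cycles_per_instruction, clock_rate_hz):
    """Execution time in seconds: total clock cycles divided by cycles per second."""
    return instruction_count * cycles_per_instruction / clock_rate_hz

# Hypothetical task: 2 billion instructions, 1.5 cycles each, on a 3 GHz clock.
print(execution_time(2_000_000_000, 1.5, 3e9))  # 1.0 second

# Halving the CPI, for example through a better architecture, halves the time.
print(execution_time(2_000_000_000, 0.75, 3e9))  # 0.5 second
```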

Another measure for performance is the throughput of a system. Throughput is the number of tasks
completed per second and is beneficial in a server environment where the system must service
many tasks. In ML, this metric is often used to show the number of multiplications and additions
that can occur per second, as they are the primary operations in a DNN. This value is usually
expressed as operations per second (OPS), floating-point operations per second (FLOPS), or
billions of multiply and accumulate per second (GMACS).

2.5.1.2

Power

Power is also an important metric to consider when rating computer systems. In mobile devices, it
is an especially critical metric to consider due to limited battery life. Similarly, it is essential in
server farms where a massive amount of energy is spent on computing and cooling.
Complementary Metal Oxide Semiconductor (CMOS) is the most used technology in integrated
circuits (ICs). CMOS's primary power consumer is dynamic energy, which is consumed by
transistors switching on and off. The power required for a system is based on three variables:
capacitive load, voltage, and frequency. The capacitive load of a transistor is mainly determined
by the number of transistors connected to its output (also called fanout). Other factors like wire
placement, size, or transistor specifications can affect the capacitive load as well. The voltage and
frequency are determined by the given design of the CPU. Equation [2.6] gives the
dynamic power consumed by a CPU [13].

\[
\text{Dynamic power} = \frac{1}{2} \times \text{Capacitive Load} \times \text{Voltage}^2 \times \text{Frequency} \tag{2.6}
\]
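
Because voltage appears squared in Equation [2.6], lowering the voltage has a larger effect on dynamic power than lowering the frequency by the same factor. The sketch below makes this concrete; the capacitive load, voltages, and frequency are hypothetical values, not measurements of any real processor.

```python
def dynamic_power(capacitive_load_farads, voltage_volts, frequency_hz):
    """Dynamic power in watts: one half times capacitive load, voltage squared, and frequency."""
    return 0.5 * capacitive_load_farads * voltage_volts ** 2 * frequency_hz

# Hypothetical operating points.
baseline = dynamic_power(1e-9, 1.0, 1e9)
half_voltage = dynamic_power(1e-9, 0.5, 1e9)
half_frequency = dynamic_power(1e-9, 1.0, 0.5e9)

print(half_voltage / baseline)    # 0.25: halving the voltage quarters the power
print(half_frequency / baseline)  # 0.5: halving the frequency halves the power
```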

Figure 2.12 shows the relationship between the clock rate and power in Intel microprocessors from
1982 to 2012. From 1982 to 2004, clock rates increased exponentially; however, it can be observed
that the power did not increase by the same rate during this time, bringing up an important property
of transistors. As transistors become smaller, they require less voltage and become faster, enabling
them to operate at higher frequencies. Along with increasing the clock speed exponentially,
engineers were able to continually scale down the voltage, offsetting the power increase
significantly. However, there is a limit to this technique. It was discovered that the voltage could
not be decreased further than 1V; otherwise, the transistors would not endure the effects of voltage
leakage. Frequency and voltage could no longer be scaled as before, leaving clock rates to settle
at around 3GHz; this became known as the power wall [14].

Figure 2.12 Clock rate and Power for Intel x86 microprocessors over eight generations and 25
years [13].

2.5.1.3

Area

In mobile and embedded devices, area is also a metric of concern. Area refers to the size of the
silicon space needed for the chip and is expressed in square millimeters (mm²) or square
micrometers (µm²). The area of a processor depends on its complexity and transistor technology.
2.5.2

Processors

Although state-of-the-art NNs have proven to be very powerful (in some cases even beating human
intelligence [15]), they come at a cost. ML demands large amounts of computational power. The
multiply and accumulate (MAC) operation is the most used and fundamental operation in NNs for
inference and training [16]. A MAC can be broken into two parts: multiplication and accumulation.
A block diagram for the MAC is shown in Figure 2.13, where weights and activations are
sequentially multiplied and then added to the previous multiplication stored in the accumulator.

Figure 2.13 Multiply and Accumulate (MAC) operation block diagram.
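
In software, the sequential multiply and accumulate behavior can be sketched as a loop that multiplies one weight and one activation per step and adds the product to a running accumulator. The function name and the input values are illustrative assumptions.

```python
def multiply_accumulate(weights, activations):
    """Multiply each weight by its activation and add the product to the accumulator."""
    accumulator = 0
    for weight, activation in zip(weights, activations):
        accumulator += weight * activation
    return accumulator

# Three MAC operations: 1*4 + 2*5 + 3*6.
print(multiply_accumulate([1, 2, 3], [4, 5, 6]))  # 32
```

One neuron with N inputs therefore requires N MAC operations, which is why the MAC count grows so quickly with network size.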

The number of MACs required for a model is an excellent metric to look at when judging the
computational complexity. Table 2.2 shows the total number of MACs needed during inferencing
for popular models, ranging from 341 thousand to 15.5 billion. These numbers are massive and
are a testament to the vast amounts of computational power ML requires.
Table 2.2 Total MAC operations of popular models during inference.

Model             Total MACs
LeNet 5 (1998)    341 K
AlexNet (2012)    723 M
VGG16 (2014)      15.5 G
GoogLeNet (2014)  1.43 G
ResNet50 (2015)   3.9 G
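
Per-layer MAC counts follow directly from the layer shapes. The sketch below uses the commonly applied counting formulas, which ignore bias additions and are stated here as an assumption: a fully connected layer needs one MAC per weight, and a convolutional layer needs one MAC per kernel weight for every output element. The fully connected shapes are those listed for AlexNet in Table 2.1, and the convolution shape is the 4x4 input with a 2x2 kernel from Figure 2.9.

```python
def fully_connected_macs(num_inputs, num_outputs):
    """One MAC per weight: every input connects to every output neuron."""
    return num_inputs * num_outputs

def convolution_macs(out_height, out_width, out_channels,
                     kernel_height, kernel_width, in_channels):
    """One MAC per kernel weight for every element of the output feature maps."""
    return (out_height * out_width * out_channels
            * kernel_height * kernel_width * in_channels)

# Fully connected layers F9, F10, and F11 of AlexNet (shapes from Table 2.1).
alexnet_fc_macs = (fully_connected_macs(9216, 4096)
                   + fully_connected_macs(4096, 4096)
                   + fully_connected_macs(4096, 1000))
print(alexnet_fc_macs)  # 58621952

# Single-channel example of Figure 2.9: 3x3 output, 2x2 kernel.
print(convolution_macs(3, 3, 1, 2, 2, 1))  # 36
```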

Although networks require large amounts of MAC operations, NNs inherently exhibit topological
and operational parallelism. Topological parallelism is seen in fully connected and convolutional
layers since the MAC operations are data-independent and can be computed concurrently.
Furthermore, since data sets contain multiple samples, it is possible to run these samples
simultaneously, also known as operational parallelism. Two types of architectures are identified
to take advantage of the natural parallelism seen in NNs, temporal and spatial.
2.5.2.1

Temporal Architectures

Vector CPUs, GPUs, and other general-purpose processors adopt the temporal architecture (shown
in Figure 2.14). An array of Arithmetic Logic Units (ALUs) are arranged to execute data in parallel
in this architecture. The ALUs operate in lockstep, with each receiving the same instruction but
operating on different data. Additionally, memory and control are centralized, and there is no
inter-ALU communication, as ALUs communicate directly with the memory and control modules. In
Flynn's taxonomy, temporal architectures fall under the Single Instruction Multiple Data (SIMD)
category. GPUs use this architecture on a large scale by having many processing cores and are
classified as Single Instruction Multiple Thread (SIMT) processors. Of the two types of
processors, CPUs are less suited to large networks, as the number of cores and ALUs is small,
limiting the FLOPS possible. Currently, GPUs dominate the DNN space due to their high number
of cores, ranging from the hundreds to the thousands [17].

Figure 2.14 Temporal Architecture.

2.5.2.2

Spatial Architectures

One of the drawbacks that CPUs and GPUs have in DL is that they are considered general-purpose
processors and are designed to fit a wide variety of application needs. Therefore, in general, DL-specific hardware optimizations are not supported by CPUs and GPUs. To circumvent this
drawback, DSAs are designed to fit a specific purpose at the cost of flexibility. DSAs are typically
implemented on Field Programmable Gate Arrays (FPGAs) or Application-Specific Integrated
Circuits (ASICs). FPGAs contain arrays of gates that can be programmed, allowing for rapid
development and prototyping of hardware designs. However, FPGAs are generally less energy-efficient than their ASIC counterparts because they lack support for application-specific
optimizations. On the other hand, ASICs cannot be changed on the fly and require that the design
go through all stages of semiconductor fabrication. The result is higher time to market and
production costs; however, they can be highly energy-efficient.

The spatial architecture is an example of a DSA. As mentioned previously, MACs consume the
majority of operations in a DL application. However, the problem comes not with the actual
multiplication or accumulation itself but with the memory access it requires. MACs require four
different types of memory accesses: weight reads, activation reads, partial sum reads, and partial
sum writes. Assuming no onboard buffers, each access must read from or write to off-chip
Dynamic Random Access Memory (DRAM). Access to DRAM is expensive in terms of energy
and consumes around two orders of magnitude more energy than a single MAC operation (multiply
and accumulate only, no memory accesses) [18]. Additionally, accessing DRAM introduces large
amounts of latency which can bottleneck performance. Therefore, it is beneficial to keep data as
close as possible to where it is being processed.
Spatial architectures improve upon temporal architectures in two ways: 1) decreasing the
number of off-chip accesses and 2) limiting data movement. Similar to how temporal architectures
arrange ALUs into an array, spatial architectures arrange Processing Engines (PEs) in an array. A
PE (shown in Figure 2.15) is composed of three components: a register (also called a Register File
(RF)), an ALU, and a controller. Additionally, PEs are given the ability to pass partial sums to
their neighbors, allowing MACs to be computed spatially rather than temporally. In contrast with
the centralized memory and control unit found in the temporal architectures, PEs in spatial
architectures contain their own memory and control logic; this limits both on-chip and off-chip
data movement prevalent in the temporal architecture.

Figure 2.15 Spatial Architecture.
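
The idea of computing a MAC spatially can be sketched with a chain of PEs, each holding one weight in its local register and passing its partial sum to its neighbor. This is a simplified software model written for illustration; the class and function names are hypothetical, and keeping the weight fixed inside each PE is one possible arrangement assumed here, not necessarily the one used by any specific accelerator.

```python
class ProcessingEngine:
    """A PE with a local register holding one weight and an ALU performing one MAC."""

    def __init__(self, weight):
        self.weight = weight  # kept in the PE's local register file

    def step(self, activation, partial_sum_in):
        # Add this PE's product to the partial sum received from its neighbor.
        return partial_sum_in + self.weight * activation

def spatial_mac(engines, activations):
    """Pass the partial sum from PE to PE instead of writing it back to memory."""
    partial_sum = 0
    for engine, activation in zip(engines, activations):
        partial_sum = engine.step(activation, partial_sum)
    return partial_sum

engines = [ProcessingEngine(w) for w in (1, 2, 3)]
print(spatial_mac(engines, [4, 5, 6]))  # 32
```

The result matches a sequential MAC over the same values, but the partial sums only ever travel between neighboring PEs, which is the source of the reduced data movement.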

2.6 Deep Learning on the Edge
2.6.1

Challenges of edge-to-cloud models

Although powerful, state-of-the-art NNs require massive amounts of energy and data. For this
reason, NNs are typically computed in server farms where resources are, in some senses, limitless.
Edge devices that utilize ML, such as Internet-of-Things (IoT) and mobile devices, put the cloud
into use by transmitting data over the network to servers, which carry out the computations and
send the result back over the network to the edge device. However, cloud computing for edge
devices produces significant latency/connectivity, network bandwidth, and privacy issues.
2.6.1.1

Latency & Connectivity

Applications where time is critical and results are expected in real-time pose a challenge for edge
devices using ML. For example, a self-driving vehicle would not be able to utilize the edge-to-cloud paradigm as transmitting data over a cellular connection would not meet the strict timing
requirements critical to the vehicle's operation. A single video frame alone would incur a latency
of more than 200 ms to be transmitted and computed end-to-end [19], far too long for a time-critical application. Furthermore, with limited cellular coverage across the globe, mobile devices
cannot guarantee present or stable internet access at all times, making them unreliable for critical
or high-risk applications.
2.6.1.2

Network Bandwidth

Not only does the edge-to-cloud model present network issues for the single user, but it also puts
a strain on an entire network for all its users. Video data is particularly of concern. A city with
12,000 people transmitting 1080p video would require uplinks of 100 Gb per second, and
proportionally a city with one million people would require 8.5 Tb per second [20]. Thus, network
access to a cloud presents a bottleneck as more devices are added to a network. Additionally, it is
energy inefficient to move data from one end of a network to the other, which can cause batteries
in mobile devices to drain more quickly due to energy consumption from communication.
2.6.1.3

Privacy

Privacy concerns also pose a threat. Deep learning applications in health care are especially at risk
as the edge-to-cloud model requires transmitting sensitive electronic health data over the internet.
This leaves users and their information vulnerable to hackers. Also, there are concerns about
companies who offer cloud services for the possibility of user data being mishandled or misused.
For these reasons, cloud computing for edge devices is not ideal for privacy.
2.6.2

The Case for Edge Intelligence

Because of the challenges outlined above, there has been a push towards decentralizing ML
computations and performing them on edge devices, called edge intelligence (EI). Currently of
interest is how to conduct training in the cloud and inference on the edge. However, moving
computations to the edge means losing virtually limitless resources utilized on server farms in the
cloud. Edge devices are constrained by energy consumption, compute, area, and memory cost
limitations. Therefore, dedicated hardware must be custom designed to leverage algorithmic and
hardware optimizations in DNNs.
2.6.3

Applications of Edge Intelligence

Applications where EI is either necessary or highly beneficial are outlined in this section. It
should be observed that all the applications described would pose problems if implemented in an
edge-to-server model; therefore, many of them implement an edge-only model or a hybrid of the
two.
2.6.3.1

Computer Vision

Image and object detection are components in DL useful for applications such as autonomous
driving and video surveillance. As highlighted in 2.6.1.1, autonomous driving requires low latency
and reliable DL computations. Tesla's full self-driving (FSD) System on a Chip (SoC) is a custom
neural network accelerator capable of processing up to 2300 frames per second, as shown in Figure
2.16. The FSD powers most of Tesla's Autopilot algorithm, enabling vehicles to drive semi-autonomously [21].

Figure 2.16 Tesla's Full Self-Driving (FSD) computer with dual FSD chips [21].

Similarly, edge devices for video surveillance and threat detection benefit from EI, as network
bandwidth becomes strained when many remote cameras stream video to a cloud. Also,
information captured by surveillance may be private, including faces, documents, and locations,
leaving video streams susceptible to intrusion. Amazon's DeepLens mitigates this by including
hardware to perform DL locally and only uploads the video of interest to the cloud, saving
bandwidth and limiting the amount of sensitive information transferred over the internet [22].
2.6.3.2

Natural Language Processing

DL in Natural Language Processing (NLP) has become popular for language understanding,
translation, and synthesis. Voice assistants like Amazon Alexa, Apple's Siri, Google Assistant, and
Microsoft's Cortana put the power of NLP into IoT and the hands of users. These voice assistants
work by using a wake-word, in Google's case "Hey Google," and immediately start sending voice
data to servers across the internet and query for a response. These wake-words are first processed
by DNNs running on onboard hardware. In Microsoft's case, researchers were able to recognize
"Hey Cortana" with just a 1 KB model using RNNs [23]. Despite impressive performance on wake-words, edge devices still need improvement on NLP due to latency from the cloud. For example,
in experiments, it was shown that a professional translator could translate 5x faster than Google
Translate could with the Pixel Buds headphones [19]. Therefore, there is a need to push the entire
model to the edge to lower the embedded devices' latency.
2.6.3.3

Health Care

EI devices enable healthcare providers to provide care to patients even outside clinical settings.
By 2022 the internet of medical things (IoMT) market is expected to be valued at $158 billion
globally [24]. Physicians or users can use edge devices in health care to track physiological health
data, enabling solutions such as chronic illness monitoring, epidemiological surveillance, elderly
and pediatric care, and health and nutrition management [25]. EI is critical to health care
applications as it provides real-time solutions with minimal latency and low energy consumption.

3 Discussion of the Problem
As discussed in the previous chapter, Edge Intelligence (EI)—bringing computations to the edge
rather than the cloud—allows for the existence or improvement of many applications in DL.
Therefore, it is necessary to build custom architectures that implement DL-specific optimizations.
This section starts with the details of two optimizations that can be used to realize DL on the edge:
network pruning and precision reduction. Both optimizations have a common goal—to reduce the
overall model size, whether on the parameter level (pruning) or the bit level (precision reduction).
Then variable bit-width multipliers are presented to support reduced precision operations in
hardware. Finally, this section is concluded with an explanation of the proposed work.

3.1 Network Sparsity and Pruning
A variety of applications show that many of the trained weights within a DNN converge to zero or
close to zero. For example, the weights of AlexNet are shown in Figure 3.1. It can be seen that a
large portion of weights fall near or at zero. Interestingly, a similar trend can be seen throughout
many trained networks. This curious feature is known as sparsity and can be seen in input data,
weights, and intermediate activations.

Figure 3.1 Distribution of weights in AlexNet [26].

An intuitive explanation for why weights exhibit sparsity is that when a network is being trained,
the essential connections are given more importance, or weight. In contrast, the less essential
connections are given less weight until, after multiple training iterations, they reach zero or near zero.
In addition, many DNNs are overparameterized and contain a large number of redundant
connections [27], whose associated weights are eventually reduced in magnitude by
backpropagation. Activations undergo a similar fate in networks where the ReLU activation
function is used. Recall that the ReLU function returns zeros for all negative inputs; therefore,
assuming that half of all pre-activation values are negative, half of all activations become zero
after passing through the ReLU. Therefore, DNNs that use the ReLU exhibit large amounts of
sparsity in their internal activations.
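
The ReLU argument above can be checked numerically. The following is a minimal sketch, assuming pre-activation values drawn from a zero-mean Gaussian (an illustrative assumption, not a property measured in this work); the helper names `relu` and `zero_fraction` are hypothetical.

```python
import random


def relu(x):
    """Return x for positive inputs and zero otherwise."""
    return x if x > 0.0 else 0.0


def zero_fraction(values):
    """Fraction of entries that are exactly zero, i.e., the sparsity."""
    return sum(1 for v in values if v == 0.0) / len(values)


random.seed(0)
# Assumed distribution: symmetric around zero, so about half the values are negative.
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]
activations = [relu(v) for v in pre_activations]
activation_sparsity = zero_fraction(activations)
print(f"fraction of zero activations: {activation_sparsity:.3f}")
```

Under this assumption, roughly half of the activations are zero, which matches the reasoning in the text.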
In the models presented in this paper thus far, the architecture of a DNN cannot be changed during
training; only the weights within the model are updated. Consequently, if a model contains
unneeded connections, the architecture cannot dynamically "cut" those connections. The outcome
is that insignificant operations are performed, and on edge devices where storage, energy, and
computational power are at a premium, this may become prohibitive.
A technique known as pruning aims to eliminate unneeded or insignificant connections by
removing all weights below a specified threshold value. Additionally, it is possible to remove
entire neurons if the neuron has no input or output connections. Figure 3.2 shows a network before
and after pruning. After pruning, a network contains significantly fewer parameters, and as a result,
considerably fewer operations are performed.

Figure 3.2 DNN before and after pruning. Transparent connections represent below threshold
weights.

Pruning is implemented by a three-step process (shown in Figure 3.3), including initial training,
pruning, and retraining. A network's sparsity is first learned from the initial training step, where
zero and near-zero weight values are discovered. Then weights that are below a particular threshold
value are "pruned." As a result of the pruning operation, the accuracy of a network will most likely
have suffered severe degradation; therefore, a final retraining step is performed to tune the
remaining weights. The pruning and retraining steps are then reiterated as needed.
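
The pruning step of this process can be sketched as follows. This is an illustration only: it assumes the threshold is applied to the weight magnitude, the weight values and the threshold of 0.05 are made up, and the retraining step is not shown. The returned mask marks which weights survive, so that a later retraining pass could keep the pruned weights at zero.

```python
def prune_by_magnitude(weights, threshold):
    """Zero every weight whose magnitude is below `threshold`.

    Returns the pruned weights and a keep-mask (True where a weight survives).
    """
    mask = [abs(w) >= threshold for w in weights]
    pruned = [w if keep else 0.0 for w, keep in zip(weights, mask)]
    return pruned, mask


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Hypothetical weights of one small layer after initial training.
weights = [0.42, -0.03, 0.007, -0.51, 0.09, -0.002, 0.30, 0.015]
pruned, mask = prune_by_magnitude(weights, threshold=0.05)
print(pruned, sparsity(pruned))
```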

Figure 3.3 Pruning process.

The distribution of weights in AlexNet after pruning is shown in Figure 3.4, where a large
portion of the weights on and around zero have been removed.

Figure 3.4 Distribution of weights in AlexNet after pruning [26].

Pruning has proven to be an effective optimization technique, reducing parameters by a factor of
9, from 61 million to 6.7 million weights in AlexNet, and by a factor of 13, from 138 million to
10.3 million, on VGG-16 [26].

3.2 Precision Reduction
Another optimization strategy involves reducing the number of bits and thus reducing the precision
with which numbers are represented. This optimization strategy has a beneficial impact on the
speed, energy consumption, and storage size of NNs. However, the consequence of reduced
precision is a loss in network accuracy; therefore, a trade-off must be made between precision
reduction and accuracy.
3.2.1 Number Representations

An essential prerequisite for understanding how precision-reduction optimizations can be used for
DNNs is how numbers are represented, including the floating-point format, the fixed-point format,
and the integer format. This section gives detailed explanations of all three.
3.2.1.1 Floating Point

Typically, DNNs perform calculations in a 32-bit floating-point format. Specifically, the Institute
of Electrical and Electronics Engineers (IEEE) 754 Single-Precision binary floating-point format
is used. Similar to how decimal numbers can be represented in scientific notation, binary numbers
can also be represented in this fashion. Observing Equation [3.1] shows how a large number is
represented in a more compact form.
$10042000000_{10} = 1.0042_{10} \times 10^{10}$    [3.1]

Analogously, we can observe a binary example shown in Equation [3.2]. Since the representation
is in base two, the decimal point is called the binary point. The number to the right of the
multiplication symbol gives the scaling factor, and the number to the left provides the precision.
The scaling factor gives the effect of moving the binary point to the left or right (in this case, left
because of the negative exponent), hence the use of the term "floating-point."
$0.10101_2 = 1.0101_2 \times 2^{-1}$    [3.2]

The IEEE Single-precision format, shown in Figure 3.5, is a 32-bit floating-point format consisting
of a sign bit, an 8-bit exponent, and a 23-bit mantissa. The decimal value of a single-precision
floating-point number is calculated using Equation [3.3].
$\text{value} = (-1)^{s} \times (1 + m) \times 2^{e - 127}$    [3.3]

where:
s is the sign;
m is the mantissa; and
e is the exponent.

Figure 3.5 IEEE Single-precision floating-point format.

The format can represent numbers almost as large as $2.0 \times 10^{38}$ or nearly as small as $2.0 \times 10^{-38}$.
Moreover, it can represent a total of 4,278,190,081 unique real values. However, as will be
shown, this amount of precision is likely unneeded.
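
As a small illustration of Equation [3.3], the sketch below splits a single-precision number into its sign, exponent, and mantissa fields and rebuilds the value from them. It uses Python's standard `struct` module; the function name `decode_float32` is hypothetical, and the reconstruction is only valid for normalized numbers (stored exponent between 1 and 254).

```python
import struct


def decode_float32(x):
    """Return (s, e, m, value) for x stored as an IEEE 754 single-precision number.

    s is the sign bit, e the stored 8-bit exponent, m the 23-bit mantissa
    expressed as a fraction in [0, 1), and value is rebuilt with Equation [3.3].
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    m = (bits & 0x7FFFFF) / 2**23
    value = (-1) ** s * (1 + m) * 2.0 ** (e - 127)
    return s, e, m, value


print(decode_float32(-6.5))     # -6.5 = -1.625 x 2^2
print(decode_float32(0.15625))  # 0.15625 = 1.25 x 2^-3
```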

3.2.1.2 Fixed-Point

An alternative number format is the fixed-point representation (an 8-bit signed version is shown
in Figure 3.6). The floating-point and fixed-point formats are similar in that a fixed-point
number can be viewed as a floating-point number whose exponent is fixed. An N-bit signed
fixed-point number contains a sign bit and an (N-1)-bit mantissa. The decimal value is calculated
using Equation [3.4].
$\text{value} = (-1)^{s} \times m \times 2^{-f}$    [3.4]

where:
s is the sign;
m is the mantissa; and
f is the number of fractional bits.

Figure 3.6 8-bit signed fixed-point format.

The mantissa is split into two parts: the integer and fractional portions. The integer portion to the
left of the binary point determines the range which can be expressed. Meanwhile, the fractional
portion to the right of the binary point determines the precision. In Figure 3.6, three bits are assigned
to the fractional portion, so the binary point is located three bits from the right. The trade-off between range and precision is
carefully considered, and the fractional and integer bit widths are chosen depending on the
application. It should also be noted that all fixed-point numbers in a system are represented with
the same integer and fractional bit widths.
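
Equation [3.4] can be illustrated with a short sketch for the 8-bit signed format with three fractional bits. The function names, the round-half-up rule, and the saturation of out-of-range inputs are assumptions made for this example; they are not specified in the text.

```python
def to_fixed(x, total_bits=8, frac_bits=3):
    """Encode x as (s, m): a sign bit and a (total_bits - 1)-bit magnitude."""
    max_m = (1 << (total_bits - 1)) - 1
    s = 1 if x < 0 else 0
    # Scale by 2^f, round half up, and saturate at the largest magnitude.
    m = min(int(abs(x) * (1 << frac_bits) + 0.5), max_m)
    return s, m


def from_fixed(s, m, frac_bits=3):
    """Decode with Equation [3.4]: value = (-1)^s * m * 2^(-f)."""
    return (-1) ** s * m * 2.0 ** (-frac_bits)


for x in (5.3, -2.7, 100.0):
    s, m = to_fixed(x)
    print(x, (s, m), from_fixed(s, m))
```

With three fractional bits the resolution is 0.125, so 5.3 is stored as 5.25, and with four integer bits the largest representable magnitude is 15.875, so 100.0 saturates.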
An advantage of the fixed-point over the floating-point representation is the amount of energy
consumed when performing operations. A 32-bit fixed-point addition consumes nine times less
energy than a 32-bit floating-point addition, and a 32-bit fixed-point multiplication consumes 1.2
times less energy than a 32-bit floating-point multiplication [28]. In addition to lower energy
consumption, fixed-point adders and multipliers need significantly less area than their
floating-point counterparts [16]. The efficiency of fixed-point numbers is seen further when the
total bit-width is reduced to 16 and 8 bits.
3.2.1.3 Integer

The integer representation is similar to the fixed-point representation without the fractional
bits. A signed N-bit integer represents integer values from $-2^{N-1}$ to $2^{N-1} - 1$, while an
unsigned implementation represents values from 0 to $2^{N} - 1$. The underlying hardware for integer and
fixed-point operations is the same; only the interpretation of bit patterns differs. Therefore, integers
see the same energy and area efficiency as the fixed-point representation.
3.2.2 Quantization

Although floating-point numbers offer a significant amount of precision, it has been shown [29]
that DNNs are surprisingly robust to small perturbations in weight and activation values. This
means that DNNs do not require the full precision offered by 32-bit floating-point numbers. In
fact, it is generally held that most networks experience negligible accuracy loss when numbers are
reduced to 8-bits.

Reducing bit widths in DNNs results in energy, memory, and computation savings. Energy is saved
from reduced bit-width operations—such as memory accesses, multiplications, and additions—
which generally consume less energy. Additionally, because of smaller bit widths, less memory is
needed to store these numbers in hardware. Finally, reduced bit widths can make DNNs
significantly less computationally complex; while the number of MACs stays the same, it is possible
they take fewer cycles to complete and thus reduce latency. The technique used to reduce the bit
widths of numbers—and thus reduce the precision—is called quantization.
Quantization is the process of taking numbers from a large set of possible values and mapping
them to a smaller set of values. It can also be thought of as a type of lossy compression algorithm.
Equation [3.5] shows a quantization function 𝑞( ) that accepts a given input and outputs the
quantized value.
$x_q = q(x)$    [3.5]

where:
x is the original value; and
$x_q$ is the quantized value.

Of course, applying quantization will most likely result in a difference between the original and
quantized values. This difference is called the quantization error and is expressed by Equation
[3.6].
$e_q = x_q - x$    [3.6]

Floating-point numbers can be reduced and represented in fixed-point or integer representation
using quantization. This paper looks specifically at integer-only quantization [30] due to its
simple, linear implementation. Non-linear quantization schemes exist; however, their
performance is burdened by associated overhead, and they are outside the scope of this work.
The integer quantization scheme can be reduced to a simple scale, round, and clip operation shown
in Equation [3.7].
$x_q = q(x) = \mathrm{clip}(\mathrm{round}(s \times x), \min, \max), \quad \forall x \in X$    [3.7]
where:
x is the original value;
s is the scale;
min is the minimum value of the output range;
max is the maximum value of the output range; and
X is the domain of original values.

First, the original value x is multiplied by a scale s. The purpose of the scale is to map the
range of X that should be preserved onto the range that can be represented. The scale
can be chosen to fit every value of X, i.e., min/max range, or only the most critical, e.g., 3-sigma
range.
Second, the scaled value 𝒔 × 𝒙 is rounded to the nearest integer since only integers can be
represented by this method. Finally, a clip function is applied to ensure that values outside the
representable range are saturated to the minimum or maximum values.
An example of integer quantization is shown in Figure 3.7, where values are quantized to 4-bits
using the min/max range (-1 to 1). A 4-bit signed integer can represent integer values from -8 to
7; therefore, a scale factor of 7 is chosen to fit the range best. As shown, the reduction of precision
from quantization introduces perturbations in some values. Although scaling has no impact on the
precision of the original number—assuming that the scaling factor is known—error is introduced
when rounding and clipping are applied. In the case of the 4-bit quantization example, quantization
will at most introduce an error of 0.5/7, or 0.0714. Figure 3.8 shows the rescaled quantized output $x_q$

(output is rescaled by dividing by the original scaling factor after quantization) and quantization
error eq for a given input x. Interestingly, some values may end up quantized to the same value.
For instance, in the 4-bit example, both −0.78 and −0.65 are quantized to the value of -5.
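
The scale, round, and clip steps of Equation [3.7] and the 4-bit example above can be reproduced with a short sketch. The function names are hypothetical, and rounding half up is an assumption (Python's built-in `round` rounds halves to the nearest even integer, so it is avoided here). The error is computed on the rescaled output, as in Figure 3.8.

```python
import math


def quantize(x, scale, q_min, q_max):
    """Equation [3.7]: scale, round to the nearest integer, then clip."""
    rounded = math.floor(scale * x + 0.5)  # round half up
    return max(q_min, min(q_max, rounded))


def quantization_error(x, scale, q_min, q_max):
    """Equation [3.6] applied to the rescaled output: x_q / s - x."""
    return quantize(x, scale, q_min, q_max) / scale - x


# 4-bit signed example from the text: inputs in [-1, 1], scale factor of 7.
SCALE, Q_MIN, Q_MAX = 7, -8, 7

print(quantize(-0.78, SCALE, Q_MIN, Q_MAX))  # -5
print(quantize(-0.65, SCALE, Q_MIN, Q_MAX))  # -5
worst = max(abs(quantization_error(i / 1000, SCALE, Q_MIN, Q_MAX))
            for i in range(-1000, 1001))
print(worst <= 0.5 / 7 + 1e-12)  # True
```

Values outside the profiled range are saturated; for example, an input of 2.0 is clipped to 7.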

Figure 3.7 4-bit integer quantization visualized on a timeline.

Figure 3.8 Quantization error/rescaled quantized values for 4-bit integer quantization.

When applied to DNNs, quantization can be implemented in two different areas: the weights and
activations. Since weight values are predetermined before inference, weight quantization is applied
offline, i.e., after training and before inference. In contrast, activation values are unknown before
inference; therefore, activation quantization must be performed in real-time. Although activation
values are unknown before inference time, the scaling factors for activation quantization are
determined based on a small subset of the training data, which should provide a reasonable
estimation of the values expected to be seen during inference. At this point, one may expect the
outcome of the network to be affected by the scaling introduced by quantization. However, as long
as each activation in the layer is scaled by the same factor, the activations keep their relative
magnitudes. Therefore, in categorical classification networks, the outcome is proportional, and the
predicted class is unchanged. The same can be said when scaling weights.
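
The claim that a common scale factor leaves the classification outcome unchanged can be illustrated with a minimal sketch. The output values and the scale of 15.0 are made up for the example, and the sketch covers scaling only; rounding after scaling can, in principle, create ties between classes whose outputs are very close.

```python
def argmax(values):
    """Index of the largest entry, i.e., the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical outputs of a 4-class network before and after scaling.
outputs = [0.8, 2.4, -1.1, 1.9]
scaled_outputs = [15.0 * v for v in outputs]

print(argmax(outputs), argmax(scaled_outputs))
```

Multiplying every output by the same positive factor preserves their ordering, so both calls select the same class.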
Quantized DNNs have shown promising results. In [31], quantization is applied to three
different-sized image recognition CNNs: a small CNN (LeNet-5), a medium-sized CNN (CifarQuick), and
a large CNN (AlexNet). The results, shown in Figure 3.9, give the accuracy of each quantized network
relative to its baseline as a function of the number of bits to which the weights and activations
are quantized. In all three networks, accuracy stays relatively close to the baseline until around 17
bits, after which accuracy degrades substantially. The findings are impressive considering the
networks achieve close to 100% relative accuracy with almost half the number of bits, or fewer, of the
baseline 32-bit network.

Figure 3.9 Relative accuracy of quantization on three different-sized networks [31].

Quantization is particularly useful in reducing a network's storage and energy footprint. When
weight quantization is applied, it reduces the network's overall storage requirement since weight
bit-widths are smaller. Furthermore, weight and activation quantization can reduce energy
consumption, as smaller multipliers and adders are required, and data movement is reduced. Figure
3.10 shows the energy consumptions for add, multiply, and read operations at certain bit-widths in
45nm technology.

Figure 3.10 Energy consumptions for operations in 45nm technology [28].

As a general rule, lower bit-width operations yield lower energy costs. Consequently, the
Multiply-Accumulate, or MAC, operation consumes significantly less energy in a quantized network,
whether from the multiplication, accumulation, or memory accesses. Figure 3.11 shows the
relative energy consumption of the networks in [31] depending on the quantization level. It can be
seen that as the precision of the networks declines, so does the energy consumption. Notably, in
the case of extreme quantization (less than eight bits), energy consumption is reduced by a
significant amount.

Figure 3.11 Relative energy consumption for quantized networks [31].

3.2.3 Mixed-precision Quantization

So far, the described method of quantization quantizes all weights and activations to the same
number of bits; this is known as uniform quantization. However, there is significant variability
between the precision required across layers and the precision needed between weights and
activations. Therefore, a more sophisticated optimization scheme proposes mixed-precision
quantization. In mixed-precision quantization, weights and activations are quantized individually
and on a per-layer basis. This optimization allows for a finer granularity of control on the
precision/accuracy trade-off and enables the bits to be reduced further on layers that may not
require as much precision as others.
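
The storage effect of per-layer bit widths can be sketched with a small calculation. The layer sizes and bit widths below are hypothetical and are not taken from any network in this work; the function names are also assumptions.

```python
def model_size_bytes(counts, bit_widths):
    """Storage for a model whose i-th layer holds counts[i] values of bit_widths[i] bits."""
    if len(counts) != len(bit_widths):
        raise ValueError("one bit width is required per layer")
    return sum(n * b for n, b in zip(counts, bit_widths)) / 8


def reduction_percent(baseline_bytes, quantized_bytes):
    """Percentage saved relative to the baseline size."""
    return 100.0 * (1.0 - quantized_bytes / baseline_bytes)


layer_weights = [1_000, 20_000, 5_000]  # hypothetical weights per layer

full = model_size_bytes(layer_weights, [32, 32, 32])   # single precision
uniform = model_size_bytes(layer_weights, [8, 8, 8])   # uniform 8-bit
mixed = model_size_bytes(layer_weights, [8, 4, 2])     # mixed precision

print(reduction_percent(full, uniform))
print(round(reduction_percent(full, mixed), 2))
```

In this made-up case, uniform 8-bit quantization saves 75% of the storage, while the mixed-precision assignment saves more because the largest layer is stored at four bits.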
[32] applies mixed-precision quantization to the networks from [31] as a further optimization.
Figure 3.12 shows the optimal quantization bits found per layer when using mixed-precision
quantization while still maintaining 99% accuracy compared to its uniformly quantized counterpart
at 100%. Data points at integer values show quantization of the weights, and other data points
show activation quantization. With only a 1% drop in accuracy, mixed-precision quantization
significantly reduced the number of quantization bits needed. In the case of LeNet-5, input data
was even able to be quantized to a single bit.

Figure 3.12 Optimal per-layer bit widths using mixed-precision quantization on three
different-sized networks at 100% and 99% accuracy [32].

3.3 Variable Bit Width Multipliers
Although mixed-precision quantization has been shown to reduce precision further than uniform
quantization, hardware must be specially designed to exploit this optimization. To take full
advantage of mixed-precision quantization, multipliers must meet the following criteria:
1) The multiplier must have the ability to handle mixed-precision multiplication, e.g., 16-bit
activations times 8-bit weights.
2) The multiplier must execute only on the least number of bits needed to compute the result;
otherwise, energy is consumed processing unneeded bits.
3) Since there are a fixed set of multiplication units in an architecture, multipliers must be
flexible enough to handle all mixed-precision cases within a network in real-time.

Currently, the only processor on the market able to handle low-bit-width quantization is Google's
Tensor Processing Unit (TPU) [8]. However, this architecture does not support mixed-precision
processing and can handle no fewer than eight bits of precision. Therefore, a custom multiplier
that meets all three criteria is needed.
A data-gated conventional MAC is proposed in [32] to support mixed-precision quantization.
Unneeded sections of the unit are turned off depending on the number of activation and weight
bits, as shown in Figure 3.13. Turning off the unnecessary sections of the multiplier eliminates
switching activity in those sections and thus reduces energy consumption. Despite the energy
savings, the proposed MAC fails to efficiently utilize the area given, as sections of the MAC may
be idle. In edge devices, this poses a problem, as area is another limiting factor constraining their
design.

Figure 3.13 Data-gated conventional MAC in 8x8-bit (left), 4x4-bit (middle), and 2x8-bit (right)
configurations [33]

The Fusion Unit (FU) multiplier, proposed in [34], circumvents the area utilization problem seen
in the data-gated conventional MAC. Small 2-bit multipliers, called Bit Bricks (BBs), are
dynamically fused in different configurations depending on the precision of weights and
activations. Instead of letting space go unused, the FU multiplier allows multiple MAC operations
to occur within a single unit.
A single BB, shown in Figure 3.14, takes a 2-bit multiplicand and multiplier along with two sign
bits indicating whether the operands are in signed or unsigned representation. The BB then
produces a signed 6-bit result. The underlying implementation is relatively small, consisting only
of five AND gates, four NAND gates, three half-adders, and three full-adders.

Figure 3.14 Bit Brick (BB) diagram.

The FU multiplier shown in Figure 3.15 contains 16 spatially arranged BBs, which can be fused
in real-time into nine different configurations. The notation for the configuration is given by 𝐣 × 𝐤
where 𝒋 represents the activation bits and 𝒌 the weight bits, each of which can take on values of
two, four, or eight. The multiplier only supports up to eight bits because it has been shown that
most networks can be quantized to eight bits with a negligible drop in accuracy, as discussed in
Section 3.2.2. Therefore in most cases, eight bits is sufficient for quantization purposes.

Figure 3.15 Fusion Unit (FU) multiplier.

Depending on the unit's configuration, the FU multiplier is divided into sections called Partial
Fusion Units (PFU). Each PFU individually computes a single multiplication; afterward, each
PFU's product is accumulated through shift and add logic, and a 32-bit result is produced. Figure
3.16 shows four cases, in which the multiplier is placed in the 𝟖 × 𝟖, 𝟐 × 𝟐, 𝟖 × 𝟒, and 𝟐 × 𝟖
configurations. Note that in the 𝟖 × 𝟖 configuration, a single MAC can be computed inside the FU
multiplier. However, as the bit widths are decreased to two bits in the 𝟐 × 𝟐 case, 16 MACs can
be performed inside the FU multiplier at a time; this is significant, as an activation requiring 16
MACs can be computed in only a single cycle as opposed to 16 cycles. Speedup is seen not only
in the 𝟐 × 𝟐 case but in every case where there is more than one PFU in a single FU multiplier.
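
The arithmetic behind fusing Bit Bricks can be sketched in software. The code below is a behavioral illustration only, not the gate-level design of [34]: it assumes two's-complement operands, treats only the most significant 2-bit digit of each operand as signed, and combines the partial products with shifts and adds. All function names are hypothetical, and the count of products per unit is derived simply from the number of 2-bit bricks each product needs.

```python
def bit_brick(a, b, a_signed, b_signed):
    """Multiply two 2-bit fields (0..3).

    A set flag means the field is read as a 2-bit two's-complement number (-2..1).
    """
    a_val = a - 4 if (a_signed and a >= 2) else a
    b_val = b - 4 if (b_signed and b >= 2) else b
    return a_val * b_val  # lies in [-6, 9], which fits a signed 6-bit result


def split_into_bricks(value, bits):
    """Return the 2-bit digits of a signed `bits`-wide integer, LSB first."""
    pattern = value & ((1 << bits) - 1)  # two's-complement bit pattern
    return [(pattern >> shift) & 0b11 for shift in range(0, bits, 2)]


def fused_multiply(activation, weight, act_bits, weight_bits):
    """Signed multiply built only from bit-brick products, shifts, and adds."""
    a_digits = split_into_bricks(activation, act_bits)
    w_digits = split_into_bricks(weight, weight_bits)
    total = 0
    for i, a in enumerate(a_digits):
        for j, w in enumerate(w_digits):
            # Only the top digit of each operand carries the sign.
            partial = bit_brick(a, w, i == len(a_digits) - 1, j == len(w_digits) - 1)
            total += partial * 4 ** (i + j)  # shift left by 2 * (i + j) bits
    return total


def macs_per_fusion_unit(act_bits, weight_bits, bricks=16):
    """Products that fit in one unit if each needs (j / 2) * (k / 2) bricks."""
    return bricks // ((act_bits // 2) * (weight_bits // 2))


print(fused_multiply(-77, 113, 8, 8), -77 * 113)
print([macs_per_fusion_unit(j, k) for j, k in [(8, 8), (2, 2), (8, 4), (2, 8)]])
```

Under these assumptions, an 8 × 8 product uses all 16 bricks, while a 2 × 2 product uses a single brick, which is consistent with the 1 and 16 MACs per unit described above.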

Figure 3.16 Fusion Unit in four configurations containing different sized Partial Fusion Units
(PFU).

The FU multiplier meets all the criteria required to take full advantage of mixed-precision
quantization. Firstly, the multiplier can handle mixed-precision multiplication in any combination
of two, four, and eight bits for activations and weights. Secondly, the multiplier computes only the
bits needed to produce a result, as the array of BBs is spatially divided to fit specific quantization
cases. If a single PFU does not consume the entire array of BBs, multiple products are produced
and accumulated. Lastly, the FU multiplier is flexible enough to handle all nine quantization cases
in real-time, as bit widths can vary throughout a network.

3.4 Proposed Work
As discussed in Section 3.2.3, [32] further reduces the precision of activations and weights within
models using mixed-precision quantization while still maintaining above 99% of the baseline
accuracy. However, it would be beneficial to further investigate the precision/accuracy trade-off
for highly constrained edge devices that require extreme quantization (less than eight bits).
Therefore a study is proposed to find the optimal bit widths when 95% of the baseline accuracy is
needed. As part of the study, a comparison will be made between the full-precision models and
mixed-precision quantized models, where statistics will be collected on weight and activation
sizes.
The methodology used will follow the integer-only mixed-precision quantization method outlined
in Section 3.2.2 and Section 3.2.3. The study will be conducted in software only. Although the
underlying computations will not actually be executed in a mixed-precision representation, the
effects of that representation can be simulated in software. Weights and activations will be quantized
to either two, four, or eight bits to follow the configurations possible with the FU multiplier.
Python is chosen as the language to implement this study as it supports a plethora of ML libraries
such as Keras, TensorFlow, Theano, and PyTorch. Specifically, this study will use PyTorch due
to its simple and flexible nature, which makes the implementation of mixed-precision quantization
possible.
For a general trend to be seen, the study will be conducted on the Modified National Institute of
Standards and Technology (MNIST) dataset [35] and the Canadian Institute for Advanced
Research 10 class (CIFAR-10) dataset [36]. The MNIST dataset contains 70,000 28 × 28
greyscale images of handwritten digits, 60,000 of which are dedicated to training, while 10,000
are for testing. Network models which can reach 99% accuracy on the test set are relatively small
and straightforward to implement. The CIFAR-10 dataset contains 60,000 32x32 colored images
of 10 classes, including airplane, automobile, bird, cat, etc. 50,000 of those images are dedicated
to training, while 10,000 are for testing. This dataset is far more complex than MNIST and thus
requires a more extensive network. A reasonable-sized model can achieve an accuracy of around
75% on the test set. Although both datasets only need relatively small models to achieve high
accuracy, they are essential to the ML community as they are easy to dissect, and findings can be
extrapolated to larger datasets.
In addition to studying the effects of extreme mixed-precision quantization, it is important to
provide a proof of concept example in hardware. Therefore, a mixed-precision architecture is
proposed. Following the design of the spatial architecture first presented in Section 2.5.2.2, the FU
multiplier will replace the full-precision multiplier. This adaptation will enable the spatial
architecture to support mixed-precision quantization.
A comparative study will be conducted between the mixed-precision architecture and its
full-precision counterpart. The study will focus on important metrics for constrained edge devices,
including weight and activation sizes, memory accesses, and the number of cycles required to
produce a result.
The data set chosen for the study will be from the XOR Problem introduced in Section 2.1.3.
Although creating a model to solve the XOR Problem is seen as a "toy" example, it provides
valuable insight; due to the nonlinearity of the problem, it can be deduced that a concept provable
on the XOR Problem will scale well to larger datasets.

Both mixed-precision and full-precision architectures will be implemented using SystemVerilog,
and the results will be collected from simulations in Xilinx's Vivado Design Suite.

4 Discussion of the Work & Results
4.1 Effects of Extreme Mixed-precision Quantization on MNIST and CIFAR-10 Datasets
4.1.1 Mixed-precision Quantization Methodology for MNIST

LeNet-5 [37]—a simple CNN containing three convolutional layers and two fully connected layers
as shown in Figure 4.1—is chosen to classify the MNIST dataset. Using the Mini-Batch Gradient
Descent (GD) optimizer, with a batch size of 4, a learning rate of 0.001, and a momentum of 0.9,
the model achieved a 99.26% accuracy on the test set when trained for five epochs. The trained,
single-precision model will serve as the baseline case for comparison with the quantized models.

Figure 4.1 LeNet-5 architecture.

Weight quantization is performed first. The weights' distribution is analyzed to find the appropriate
scales by which the weights are scaled in each layer. Figure 4.2 shows the weight distribution of
LeNet-5's second convolution layer, where values range from -0.479 to 0.509.

Figure 4.2 LeNet-5 Convolution 2 weight histogram.

Then, 8-bit weight quantization is performed on each layer individually. Since eight bits allow for
integers from -128 to 127 to be represented, 249.1 is the optimal value by which the second
convolutional layer's weights are scaled. Figure 4.3 shows the new distribution of the weights in
the convolutional layer after 8-bit quantization is applied. It should be noted that although on a
different range, the original weights and the quantized weights follow a similar distribution; the
same can be said for every layer within the quantized network. With 8-bit weight quantization
applied, the network achieves an accuracy of 99.28%, a negligible difference from the
single-precision accuracy and, in this case, a slight improvement.

Figure 4.3 LeNet-5 8-bit quantized Convolution 2 weight histogram.

Similar to how the distribution of weights is analyzed before weight quantization is performed,
activation quantization requires activations to be profiled. As described in Section 3.2.2,
activations are profiled using a small subset of the training data, which should provide a reasonable
estimate of the activation values expected during inference. Activations are profiled using a subset
of 500 training samples. Figure 4.4 shows the distribution of activations from the second
convolution layer with values ranging from 0 to 16.64. It should be noted that because of the ReLU
activation function, there are no negative activation values. Therefore, a sign bit is not required,
and with 8-bits, integers from 0 to 255 can be represented. Activation scales are found for each
layer using the profiling information.
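
One way to turn profiled ranges into scales is sketched below. It assumes that the largest observed magnitude is mapped onto the largest representable integer; the exact rule used in this study is not spelled out, and the function names are hypothetical. For the rounded weight range reported above (-0.479 to 0.509), this rule gives a scale of about 249.5, which is close to, but not identical with, the reported 249.1.

```python
def signed_weight_scale(w_min, w_max, bits):
    """Scale that maps the largest weight magnitude onto the largest positive signed code."""
    q_max = 2 ** (bits - 1) - 1
    return q_max / max(abs(w_min), abs(w_max))


def unsigned_activation_scale(a_max, bits):
    """Scale for non-negative (post-ReLU) activations, which need no sign bit."""
    q_max = 2 ** bits - 1
    return q_max / a_max


weight_scale = signed_weight_scale(-0.479, 0.509, bits=8)
activation_scale = unsigned_activation_scale(16.64, bits=8)
print(round(weight_scale, 2), round(activation_scale, 2))
```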

Figure 4.4 LeNet-5 Convolution 2 activation histogram.

Recall from Section 3.2.2 that, unlike weights, activations must be quantized in real-time.
Therefore, the current architecture of LeNet-5 must be modified to allow for activation
quantization to be performed. Quantization layers—performing scaling, rounding, and clipping—
are added between layers, as shown in Figure 4.5.

Figure 4.5 LeNet-5 modified quantization architecture.

Activations are profiled again using the modified quantization architecture. Figure 4.6 shows the
activations from the second convolution layer after quantization is performed across the network.
As expected, activations now lie in the 0 to 255 range for this layer.

Figure 4.6 LeNet-5 8-bit quantized Convolution 2 activation histogram.

When quantizing both weights and activations to eight bits, the model achieves an accuracy of
99.23%, a negligible drop in accuracy compared to the single-precision model.
Since the network has shown almost no accuracy loss when quantized to eight bits, a search is
performed to find the optimal bit widths for each layer while maintaining 95% of the baseline
accuracy. Starting in the first layer, the search is conducted by dividing the activation bits by two
until accuracy falls below the 95% threshold. Afterward, the same is done with the weight bits in
the layer. The process is then repeated sequentially for every layer in the network.
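
The search described above can be sketched as a greedy loop. The sketch makes several assumptions: `evaluate` is a hypothetical callback that returns the accuracy of the network for a given bit-width assignment, every layer is given one weight width and one activation width, and the last width that still meets the threshold is kept. The `toy_accuracy` function is a made-up stand-in for a real evaluation on the test set.

```python
def search_bit_widths(num_layers, evaluate, baseline_accuracy,
                      keep=0.95, start_bits=8, min_bits=2):
    """Greedy per-layer search: halve activation bits, then weight bits, layer by layer."""
    weight_bits = [start_bits] * num_layers
    act_bits = [start_bits] * num_layers
    threshold = keep * baseline_accuracy
    for layer in range(num_layers):
        for bits in (act_bits, weight_bits):  # activations first, then weights
            while bits[layer] > min_bits:
                previous = bits[layer]
                bits[layer] = previous // 2
                if evaluate(weight_bits, act_bits) < threshold:
                    bits[layer] = previous  # revert the halving that broke the threshold
                    break
    return weight_bits, act_bits


def toy_accuracy(weight_bits, act_bits):
    """Made-up accuracy model: each reduction below eight bits costs some accuracy."""
    penalty = sum(8 // b - 1 for b in weight_bits + act_bits)
    return 0.99 - 0.01 * penalty


found_weights, found_acts = search_bit_widths(2, toy_accuracy, baseline_accuracy=0.99)
print(found_weights, found_acts)
```

Because the loop visits layers in order, the early layers consume most of the accuracy budget in this toy run, leaving the later layer at eight bits.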
4.1.2 MNIST Dataset Results and Comparative Study

Figure 4.7 shows the optimal bit widths found for weights and activations using the described
method, where integer ticks on the x-axis represent weights, and decimal ticks represent
activations.

Figure 4.7 Optimal bit width, per layer, found on LeNet-5 while maintaining 95% of the baseline
accuracy.

A comparison is made between the single-precision and mixed-precision models using the optimal
bit widths found. Table 4.1 shows the results of the comparative study. Most notably, in the
quantized model, weight size is reduced by almost 90%, while internal activations are reduced by
92%.

Table 4.1 Comparison between single-precision and mixed-precision models on MNIST.
Parameter                              Single-precision            Mixed-precision Quantized
Accuracy                               99.25%                      95.05%
Number of Weights                      61,706                      61,706
Number of Activations (single sample)  2,574                       2,574
Weight Bit-width                       [32, 32, 32, 32, 32]        [2, 4, 4, 4, 4]
Activation Bit-width                   [32, 32, 32, 32, 32, 32]    [2, 2, 4, 4, 4, 8]
Weight Size                            246.834 KB                  25.25 KB (89.77% reduction)
Activation Size                        10.3 KB                     0.8 KB (92.21% reduction)

4.1.3 Mixed-precision Quantization Methodology for CIFAR-10

The same methodology for mixed-precision quantization used on the MNIST dataset is applied to
the CIFAR-10 dataset. However, this time, it is beneficial to analyze multiple network
architectures and observe quantization effects. Figure 4.8 shows the three architectures chosen. All
three networks cover a range of structures that will give valuable insight into the effects of
quantization on different models. Specifically, the number and ratio of convolutional layers to fully
connected layers are observed. Network 1 is very deep, containing four convolutional layers and
three fully connected layers. Network 2 contains a high number of convolutional layers and a low
number of fully connected layers, with four and one layers, respectively. Conversely, Network 3
contains a low number of convolutional layers and a high number of fully connected layers, with
one and four layers, respectively.

Figure 4.8 CIFAR-10 architectures.

Table 4.2 shows the number of weights and activations found in each of the three networks.
Because Network 2 is composed mainly of convolutional layers, its weights are "shared," and the
network therefore requires fewer weights. In contrast, Network 1 and Network 3 contain many fully
connected layers, requiring more weights.
Table 4.2 Number of weights and activations per network.
Parameter              Network 1    Network 2    Network 3
Number of Weights      6,538,080    67,936       4,474,672
Number of Activations  58,986       56,170       8,522
Total                  6,597,066    124,106      4,483,194

It is expected that more extensive networks (e.g., Network 1) contain a substantial amount of
redundant connections as compared to smaller networks. Therefore, it is hypothesized that the
larger the network is, the larger the percentage that weights and activations can be reduced by
when quantization is performed.
Each network is trained for five epochs, using the Adam optimizer and a batch size of 4. Networks
1, 2, and 3 achieve a single-precision accuracy of 80.21%, 79.3%, and 84.05, respectively.
4.1.4 CIFAR-10 Dataset Results and Comparative Study

Figure 4.9 shows the optimal bit widths for weights and activations when quantizing all three
networks while maintaining 95% of the baseline accuracy.

Figure 4.9 Optimal bit width found on CIFAR-10 while maintaining 95% of the baseline
accuracy.

A comparison is made between the single-precision and mixed-precision models using the optimal
bit widths found. Table 4.3 shows the results of the comparative study for all three networks. Most
notably, all weight and activation sizes are reduced by at least 75%. However, contrary to the
proposed hypothesis, the largest network (Network 1) shows the smallest percentage reduction in
weight size. A possible explanation for this contradiction lies in the way the optimal bit
widths are found. The first layers, which contain the fewest weights, are heavily quantized and
reduce accuracy substantially. Therefore, by the time the optimization algorithm reaches the
fully connected layers containing the most weights, the accuracy has already reached the limit,
and these layers cannot be quantized further.
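The layer-ordering effect described above can be illustrated with a minimal Python sketch of a
greedy, front-to-back bit-width search. This is only an assumption about the general shape of such
a search, not the exact procedure used in this study. The function greedy_bit_width_search, the
evaluate callback, and the toy accuracy model are all hypothetical and exist purely for
illustration.

```python
from typing import Callable, List, Sequence


def greedy_bit_width_search(
    num_layers: int,
    evaluate: Callable[[Sequence[int]], float],
    baseline_accuracy: float,
    threshold: float = 0.95,
    candidates: Sequence[int] = (2, 4, 8),
) -> List[int]:
    """Visit layers front to back and keep the smallest candidate bit width
    that still holds accuracy at or above threshold * baseline_accuracy."""
    widths = [max(candidates)] * num_layers  # start every layer at 8 bits
    floor = threshold * baseline_accuracy
    for layer in range(num_layers):
        for bits in sorted(candidates):  # try the narrowest width first
            trial = list(widths)
            trial[layer] = bits
            if evaluate(trial) >= floor:
                widths = trial
                break  # earlier layers are never revisited
    return widths


# Toy accuracy model (hypothetical): every layer narrower than 8 bits costs a
# fixed accuracy penalty, so the accuracy budget is spent in visiting order.
def toy_evaluate(widths: Sequence[int]) -> float:
    penalty = sum({8: 0.0, 4: 1.5, 2: 5.0}[bits] for bits in widths)
    return 80.0 - penalty


found = greedy_bit_width_search(4, toy_evaluate, baseline_accuracy=80.0)
print(found)
```

In this toy setting the first two layers consume the whole accuracy budget, so the later layers
stay at 8 bits regardless of how many weights they hold.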
Table 4.3 Comparison between single-precision and mixed-precision models on CIFAR-10.

Parameter                                 Network 1                  Network 2            Network 3
Single-precision Accuracy                 80.21%                     79.3%                84.05%
Number of Weights                         6,538,080                  67,936               4,474,672
Number of Activations (single sample)     58,986                     56,170               8,522
Single-precision Weight Size              26.15 MB                   271.74 KB            17.9 MB
Single-precision Activation Size          235.94 KB                  224.68 KB            34.088 KB
Optimal Weight Bit Length Found           [4, 4, 8, 4, 8, 8, 8]      [4, 8, 8, 8, 4]      [8, 4, 8, 8, 8]
(>95% of baseline)
Optimal Activation Bit Length Found       [4, 8, 8, 8, 8, 8, 8, 8]   [8, 8, 8, 8, 8, 8]   [4, 4, 8, 8, 8, 8]
(>95% of baseline)
Accuracy Using Optimal Bit Lengths        76.64%                     75.66%               80.89%
Quantized Weight Size                     6.52 MB                    55.98 KB             2.38 MB
                                          (75.05% reduction)         (79.4% reduction)    (86.72% reduction)
Quantized Activation Size                 57.45 KB                   56.17 KB             4.94 KB
                                          (75.65% reduction)         (75% reduction)      (85.51% reduction)


Impressively, Network 3 achieved the largest reductions in size: 86.72% for weights and 85.51%
for activations. Most likely, this is because the network is composed mainly of fully connected
layers, which offer high redundancy throughout the model.
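The size entries in Table 4.3 follow directly from the value counts and the bit widths. The short
Python sketch below reproduces the single-precision weight sizes and the Network 2 activation
figures, the one case where every layer shares the same width. The helper names storage_bytes and
reduction_percent are illustrative, and decimal units (1 KB = 1,000 bytes) are assumed because
they match the reported values.

```python
def storage_bytes(count: int, bits: int) -> float:
    """Bytes needed to store `count` values at `bits` bits each."""
    return count * bits / 8


def reduction_percent(original: float, quantized: float) -> float:
    """Percentage by which `quantized` is smaller than `original`."""
    return 100.0 * (1.0 - quantized / original)


# Value counts taken from Table 4.2.
weights = {"Network 1": 6_538_080, "Network 2": 67_936, "Network 3": 4_474_672}
activations = {"Network 1": 58_986, "Network 2": 56_170, "Network 3": 8_522}

for name, count in weights.items():
    print(name, "single-precision weight bytes:", storage_bytes(count, 32))

# Every Network 2 activation is quantized to 8 bits in Table 4.3.
n2_single = storage_bytes(activations["Network 2"], 32)
n2_quantized = storage_bytes(activations["Network 2"], 8)
print(n2_single, n2_quantized, reduction_percent(n2_single, n2_quantized))
```

For layers with differing bit widths, the same calculation would be summed per layer, which
requires the per-layer value counts.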

4.2 Mixed-precision Architecture Proof of Concept on XOR Problem
4.2.1 Single-precision Model in PyTorch

A single-precision XOR Problem model is created following the architecture shown in Figure 4.10,
containing an input layer, hidden layer, and output layer. The input layer contains two neurons and
a bias neuron. The inputs are fed into the hidden layer, also containing two neurons and a bias
neuron. The output is designed as a classification layer, with O0 corresponding to the zero class
and O1 corresponding to the one class. Both hidden and output layers use the ReLU activation
function. The model is trained for 5000 epochs using the Adam optimizer and achieves an accuracy
of 100%.
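To make the structure concrete, the following plain-Python sketch runs a forward pass through a
network of the same shape (two inputs, two ReLU hidden neurons, two ReLU output neurons, each
with a bias). The parameter values are hand-picked for illustration and are an assumption; they
are not the weights obtained by training the PyTorch model.

```python
def relu(value: float) -> float:
    return max(0.0, value)


# Hand-picked illustrative parameters (NOT the trained values).
# Each row is [weight for input 0, weight for input 1, bias] for one neuron.
HIDDEN = [[1.0, 1.0, 0.0],    # h0 = relu(x0 + x1)
          [1.0, 1.0, -1.0]]   # h1 = relu(x0 + x1 - 1)
OUTPUT = [[-1.0, 2.0, 1.0],   # O0 = relu(1 - h0 + 2*h1), the zero class
          [1.0, -2.0, 0.0]]   # O1 = relu(h0 - 2*h1), the one class


def layer(inputs, params):
    """Apply one fully connected layer followed by ReLU."""
    return [relu(w0 * inputs[0] + w1 * inputs[1] + bias) for w0, w1, bias in params]


def classify(x0: int, x1: int) -> int:
    outputs = layer(layer([x0, x1], HIDDEN), OUTPUT)
    return outputs.index(max(outputs))  # index of the larger output neuron


for x0, x1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x0, x1), "->", classify(x0, x1))
```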

Figure 4.10 Single-precision XOR Problem model.

4.2.2 Single-precision Architecture Design

The single-precision XOR Problem will be solved using a systolic array of Processing Elements
(PEs), as shown in Figure 4.11. The inputs are fed sequentially into the system, and each PE
computes a single MAC operation. The Input/Input Re-use arrow signifies an input to the
network or a PE forwarding its input to a corresponding PE in another neuron. When a PE
completes a MAC, it passes the partial result to another PE as indicated by the Partial Sum arrow.
The Final Neuron Output represents the final result from a particular neuron. The color of each
input and PE corresponds to the cycle in which it operates. In total, this model requires seven
cycles to produce a complete result.

Figure 4.11 Single-precision systolic array. Circles indicate inputs. Image adapted from [38].

As discussed in section 2.5.2.2, unlike the temporal architecture, which contains only an array of
ALUs, the spatial architecture contains PEs composed of registers, an ALU, and control logic.
Xilinx's Vivado Design Suite and SystemVerilog Hardware Description Language (HDL) code
are used to create the design. The schematic of the single-precision PE is shown in Figure 4.12.
Inputs are shown on the left side of the schematic. They include activationsInput (incoming
activations to the PE), partialSumInput (partial sum from a previous PE), weightsInput
(weights to be set in the PE), setWeights (enable pin used to set weights), rst (global reset pin),
clk (global clock), and validInput (flag set when valid input is present). Outputs are shown on
the right side and include activationsForward ("recycled" activations from the input),
partialSumOutput (partial sum result), and validOutput (flag set to indicate valid output).

Figure 4.12 Single-precision Processing Engine (PE) schematic.

The internals of the PE consist of four registers, a floating-point multiplication unit, and a
floating-point add unit. The weights_reg register contains the weights used in the PE; weightsInput
is sampled when the setWeights pin is enabled. The activations_reg register samples
activationsInput and "holds" the value on the activationsForward output until the next rising
edge. The floating-point multiplier, floating_point_multi, is from Xilinx's IP catalog and
produces a product from the values in activations_reg and weights_reg. The multiplier also sets a
valid flag when the product is ready to be read. The product is fed into the floating-point adder,
floating_point_add (another module from Xilinx's IP catalog), where it is accumulated with the
value from partialSumInput, and a valid flag is set. partialSumOutput_reg and validOutput_reg
"hold" the result of the adder and the valid flag on their respective outputs for a single clock
cycle.
Figure 4.13 shows the testbench simulation used to verify the single-precision PE. The unit is first
reset, and 51.99 is fed to weightsInput (yellow signal). Once the weight value is set,
partialSumInput (pink signal) and activationsInput (red signal) are set to 19.08 and -86.28,
respectively. The expected result, calculated in Equation [4.1], matches the result (purple signal)
produced by the single-precision PE.
(51.99 × −86.28) + 19.08 = −4466.6172    [4.1]
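The multiply-accumulate behavior verified by this testbench can be mirrored in a small behavioral
model. The Python sketch below is a software stand-in written for illustration, not a translation
of the SystemVerilog source. The class ProcessingElement and the helper to_float32 are
hypothetical names, and rounding to single precision after each step is an assumption about how
the floating-point IP behaves.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


class ProcessingElement:
    """Behavioral stand-in for the single-precision PE (not the HDL)."""

    def __init__(self) -> None:
        self.weight = 0.0

    def set_weight(self, weight: float) -> None:
        # Models sampling weightsInput while setWeights is asserted.
        self.weight = to_float32(weight)

    def mac(self, activation: float, partial_sum: float):
        # Multiply, then accumulate, rounding to float32 after each step.
        activation = to_float32(activation)
        product = to_float32(activation * self.weight)
        result = to_float32(product + to_float32(partial_sum))
        return activation, result  # (forwarded activation, partial sum output)


pe = ProcessingElement()
pe.set_weight(51.99)
forwarded, partial_sum_out = pe.mac(activation=-86.28, partial_sum=19.08)
print(forwarded, partial_sum_out)
```

The printed partial sum agrees with Equation [4.1] to within single-precision rounding.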

Figure 4.13 Single-precision Processing Engine (PE) testbench simulation

Figure 4.14 shows a schematic of the single-precision XOR Problem systolic array, following the
design in Figure 4.11. The system's inputs include a clock pin, reset pin, set-weight enable pin,
valid-input pin, inputs, and weights. The first row of PEs represents the neuron h0, while the
second row represents h1. input[0] is fed to PE_h_0_0 in the first cycle, and input[1] is fed to
PE_h_0_1 in the second cycle. Each PE forwards its input to the next row of PEs once the MAC
operation is completed. The biases are added to the partial sum in the last column of PEs. After
the accumulation, the activation function is applied in the ReLU units, and the results are fed
into the third row of PEs, representing neuron O0. PEs in the third row forward their input
activations to the fourth row (representing O1) once their operation is completed. Once the MAC
operations in the third and fourth rows are completed, the ReLU function is applied to the sums.
Half of the final result is produced at the result[0] output in one cycle, and in the next cycle,
the other half of the result is produced at result[1].

Figure 4.14 Single-precision XOR Problem schematic.

Figure 4.15 shows a simulation of the XOR Problem on the single-precision architecture. After an
initial reset, weights are set with values found from the trained model in PyTorch. Then the
validInput pin is set, and inputs are fed in a staggered fashion. After six cycles, a partial
result is produced, as indicated by the resultValid[0] output. After seven cycles, the first
complete result is available, as indicated by the resultValid[1] output. At this point, the
pipeline is fully loaded, and a complete result is produced every clock cycle. The design requires
ten cycles to produce results for the complete XOR data set. Ultimately, the architecture achieves
results identical to those of the model in the software environment and achieves an accuracy of
100%.
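The ten-cycle figure follows from the pipeline behavior just described: seven cycles for the
first complete result, then one further cycle for each remaining input pattern. A minimal sketch
of that arithmetic is given below; the function name total_cycles is illustrative.

```python
def total_cycles(latency: int, num_samples: int) -> int:
    """Cycles to finish `num_samples` inputs on a fully pipelined design that
    needs `latency` cycles for the first result and one cycle per result after."""
    if num_samples < 1:
        return 0
    return latency + (num_samples - 1)


# Seven-cycle latency and the four input patterns of the XOR data set.
print(total_cycles(latency=7, num_samples=4))
```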

Figure 4.15 Single-precision XOR Problem testbench simulation

4.2.3 Mixed-precision Model in PyTorch

The single-precision XOR Problem model is modified to support mixed-precision quantization.
The mixed-precision network is shown in Figure 4.16 and contains Quantize units after each layer,
which perform scaling, rounding, and clipping operations.

Figure 4.16 Mixed-precision XOR Problem model.

Using the same methodology as in section 4.1, the mixed-precision model is optimized to find the
smallest bit width per layer that maintains 100% accuracy. Figure 4.17 shows the optimal bit widths
found per layer for both activations and weights. Every weight and activation is representable in
just two bits, except for the weights in the first layer, which require four bits.

Figure 4.17 Optimal bit width found on XOR Model while maintaining 100% of the baseline
accuracy.

4.2.4 Mixed-precision Architecture Design

Multiple submodules must be designed to implement a mixed-precision architecture for the XOR
model, including the variable bit-width multiplier, the mixed-precision PE, and the quantization
unit. The design of each submodule and of the complete mixed-precision architecture is described
in this section.
Based on the Fusion Unit (FU) multiplier architecture described in section 3.3, a Bit Brick (BB)
is designed. Figure 4.18 shows the schematic of the BB, implemented with two AND gates and a
three-bit multiplier.
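The role of the two AND gates can be seen in a short behavioral model. The Python sketch below
mirrors the logic of the Verilog in Appendix A: each AND gate produces a third bit that acts as
the sign extension of its 2-bit operand, and the extended values are multiplied as signed
numbers. The function name bit_brick and the nested helper extend are illustrative; this is a
software model, not the hardware description itself.

```python
def bit_brick(x: int, y: int, sx: int, sy: int) -> int:
    """Multiply two 2-bit operands; sx and sy select signed (1) or unsigned (0)."""

    def extend(value: int, signed: int) -> int:
        value &= 0b11                  # keep only the two operand bits
        msb = (value >> 1) & 1
        # The AND gate yields the third bit: set only when the operand is
        # treated as signed and its most significant bit is 1.
        extra = signed & msb
        return value - (extra << 2)    # read the 3 bits as two's complement

    return extend(x, sx) * extend(y, sy)


print(bit_brick(0b11, 0b11, sx=0, sy=0))   # both operands unsigned
print(bit_brick(0b10, 0b11, sx=1, sy=0))   # signed times unsigned
```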

Figure 4.18 Bit Brick (BB) schematic.

Using an array of 16 BBs, the FU multiplier from Figure 3.15 is designed. Figure 4.19 shows the
block-level diagram of the FU multiplier. The composition input to the system contains an encoded
value that specifies the configuration of the unit. In the boundary composition of 2x2, the unit can
process 16 2-bit inputs (activations) and weights; therefore, inputs and weights are each 32 bits
wide. sInputs and sWeights specify the sign bit for up to 16 inputs and weights; consequently, both
are 16 bits wide. An output width of 32 bits is chosen to ensure that an overflow will not occur.
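The way several BBs combine into one wider multiplication can be sketched in software. In the
Python model below, each operand is split into 2-bit chunks, every pair of chunks is multiplied
(the job of one BB), and the partial products are shifted and summed. The names split_chunks and
fused_multiply are hypothetical, and treating only the top chunk as signed is an assumption that
follows the usual two's-complement decomposition rather than a statement about the exact wiring
of the FU.

```python
def split_chunks(value: int, bits: int, signed: bool):
    """Split a `bits`-wide operand into 2-bit chunks, least significant first.
    Only the top chunk carries the sign when the operand is signed."""
    raw = value & ((1 << bits) - 1)            # two's-complement bit pattern
    chunks = [(raw >> shift) & 0b11 for shift in range(0, bits, 2)]
    if signed and chunks[-1] & 0b10:
        chunks[-1] -= 4                        # top chunk as a signed 2-bit value
    return chunks


def fused_multiply(x: int, y: int, x_bits: int, y_bits: int, signed: bool = True) -> int:
    """Sum the shifted 2-bit by 2-bit partial products, one per bit brick."""
    total = 0
    for i, x_chunk in enumerate(split_chunks(x, x_bits, signed)):
        for j, y_chunk in enumerate(split_chunks(y, y_bits, signed)):
            total += (x_chunk * y_chunk) << (2 * (i + j))
    return total


print(fused_multiply(-3, 5, x_bits=4, y_bits=4))
```

A 4-bit by 4-bit product uses four chunk products, and an 8-bit by 8-bit product uses all
sixteen, which is why a single FU yields fewer independent results as operands widen.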

Figure 4.19 Fusion Unit (FU) multiplier block diagram.

The Fusion Unit Processing Engine (FUPE) shown in Figure 4.20 is designed much like the
single-precision PE. However, additional inputs are added, including the composition, activation
sign, and weight sign.

Figure 4.20 Fusion Unit Processing Engine (FUPE) schematic.

To test the submodule, a FUPE is reset, and the binary bit pattern 10 is loaded into the
activation and weight inputs. After the weights are set, the testbench cycles through each
possible value of composition. Figure 4.21 shows the simulation; partialSumOutput is verified, and
the unit produces correct results for each configuration.

Figure 4.21 Fusion Unit Processing Engine (FUPE) testbench simulation.

As shown in Figure 4.22, the quantization unit is designed to implement scaling, rounding, and
clipping operations. The input from activation is converted from a fixed-point representation to
floating point with a fixed_to_float unit and multiplied by scale in the floating_point_multi
multiplier. Afterward, the product is converted back into a fixed-point representation, which also
acts as a rounding operation. Lastly, a clip module clips the result based on the composition and
sign inputs.
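The three operations can be summarized in a few lines of Python. This is a minimal sketch under
stated assumptions: the function name quantize is illustrative, rounding is taken to be
round-half-up (the rounding mode of the float-to-fixed conversion is not specified here), and the
clipping limits are assumed to be the full two's-complement or unsigned range of the target bit
width.

```python
import math


def quantize(value: float, scale: float, bits: int, signed: bool) -> int:
    """Scale, round to the nearest integer, then clip to the `bits`-wide range."""
    scaled = value * scale
    rounded = math.floor(scaled + 0.5)       # round half up (assumed rounding mode)
    if signed:
        low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        low, high = 0, (1 << bits) - 1
    return max(low, min(high, rounded))


print(quantize(0.8, scale=3.0, bits=2, signed=False))    # in range after rounding
print(quantize(5.0, scale=3.0, bits=2, signed=False))    # clipped to the 2-bit maximum
print(quantize(-1.0, scale=1.5, bits=2, signed=True))    # signed 2-bit range
```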

Figure 4.22 Quantization unit schematic.

Finally, the mixed-precision XOR Problem architecture is implemented using the submodules
described above, as shown in Figure 4.23. The first row of modules represents the hidden layer,
and the second row represents the output layer. First, the quantization units quantize_0_0 and
quantize_0_1 quantize the inputs of the network. A mux, PE_mux_0, is added to "pack" the
quantized inputs into the appropriate format for the PEs. Since the activations and weights in the
first layer are 2-bit and 4-bit, respectively, each PE can simultaneously perform eight MAC
operations. Therefore, only one PE is needed per neuron. Additionally, because all MAC operations
in this case can be done in a single cycle, PEs in the same layer no longer need to operate in a
staggered fashion. As a result, PE_fusion_unit_0_0 and PE_fusion_unit_0_1 produce outputs at
the same time. ReLU units relu_0_0 and relu_0_1 are added to apply the activation function and
produce the activations for the hidden layer.
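The figure of eight simultaneous MAC operations follows from the 16 BBs in each FU: a 2-bit by
4-bit product occupies two BBs, leaving room for eight such products. The sketch below generalizes
this count; the function name multiplies_per_cycle is illustrative, and it assumes that every
product in a given cycle uses the same composition.

```python
BIT_BRICKS_PER_FU = 16   # the FU multiplier is built from 16 two-bit Bit Bricks


def multiplies_per_cycle(activation_bits: int, weight_bits: int) -> int:
    """Number of independent products one Fusion Unit can form per cycle."""
    for bits in (activation_bits, weight_bits):
        if bits not in (2, 4, 8):
            raise ValueError("supported bit widths are 2, 4, and 8")
    bricks_per_product = (activation_bits // 2) * (weight_bits // 2)
    return BIT_BRICKS_PER_FU // bricks_per_product


for act_bits, wt_bits in [(2, 2), (2, 4), (4, 4), (8, 8)]:
    print(f"{act_bits}x{wt_bits}:", multiplies_per_cycle(act_bits, wt_bits))
```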

Figure 4.23 Mixed-precision XOR Problem schematic.

The second row of modules implements the output layer in much the same way as the hidden layer.
However, this time both weights and activations are represented in two bits, and the quantization
units quantize_2_0 and quantize_2_1 are added to produce the final quantized outputs, results[0]
and results[1]. The valid signals from PE_fusion_unit_1_0 and PE_fusion_unit_1_1 are
"ANDed" to produce the resultValid signal, indicating the completion of a result.
A testbench is created to verify that the model performs as expected; the simulation is shown in
Figure 4.24. The module is reset, and weights from the trained mixed-precision model are set in
the registers within the PEs. validInput is set, and the XOR dataset is fed sequentially to the
module. The first result is produced within two clock cycles, as indicated by the resultValid
output, and subsequent results are available every cycle afterward when pipelined. The results are
checked against the mixed-precision model in the software environment, and it is verified that the
module produces the same results.

Figure 4.24 Mixed-precision XOR Problem testbench simulation.

4.2.5 Single-precision and Mixed-precision XOR Problem Comparative Study Results

Table 4.4 shows the results of the comparative study based on the design and simulations of each
network in hardware. Weight sizes include weights and biases and, in the mixed-precision network,
also account for the single-precision scaling factors. Activation sizes include inputs, inter-layer
activations, and output activations. Memory access size (read) is calculated from the weights,
scales, and inputs that need to be read from memory. Meanwhile, memory access size (write) includes
the size of the final results, which need to be written back to memory. Cycles are calculated based
on the number of clock cycles required to produce results for the entire dataset when pipelined.
Table 4.4 Comparison of single-precision and mixed-precision models on the XOR Problem.

Parameter                     Single-precision XOR    Mixed-precision XOR
Activation Bits               [32, 32, 32]            [2, 2, 2]
Weight Bits                   [32, 32]                [4, 2]
Weight Size                   48 Bytes                16.5 Bytes (66% reduction)
Activation Size               24 Bytes                1.5 Bytes (94% reduction)
Memory Access Size (Read)     112 Bytes               24.5 Bytes (56% reduction)
Memory Access Size (Write)    32 Bytes                2 Bytes (94% reduction)
Total Cycles                  10                      5 (speedup of 2)
Accuracy                      100%                    100%

Impressively, the mixed-precision network achieves more than a 50% reduction in every size
category and a speedup of 2 while still maintaining 100% accuracy.
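The weight and activation sizes in Table 4.4 can be reproduced with a short calculation. The
Python sketch below is an illustration under assumptions: the count of six parameters per layer
(four weights plus two biases) is inferred from the network description, and the use of three
single-precision scaling factors is an assumption chosen because it reproduces the reported 16.5
bytes. The helper names are illustrative.

```python
def size_bytes(count: int, bits: int) -> float:
    """Bytes needed to store `count` values at `bits` bits each."""
    return count * bits / 8


def reduction_percent(original: float, reduced: float) -> float:
    return 100.0 * (1.0 - reduced / original)


PARAMS_PER_LAYER = 6   # 2x2 weights plus 2 biases in each of the two layers
NUM_ACTIVATIONS = 6    # 2 inputs, 2 hidden activations, 2 outputs
NUM_SCALES = 3         # assumption: one 32-bit scale per quantization stage

single_weights = size_bytes(2 * PARAMS_PER_LAYER, 32)
mixed_weights = (size_bytes(PARAMS_PER_LAYER, 4)      # first layer, 4-bit
                 + size_bytes(PARAMS_PER_LAYER, 2)    # second layer, 2-bit
                 + size_bytes(NUM_SCALES, 32))        # single-precision scales
single_acts = size_bytes(NUM_ACTIVATIONS, 32)
mixed_acts = size_bytes(NUM_ACTIVATIONS, 2)

print(single_weights, mixed_weights, reduction_percent(single_weights, mixed_weights))
print(single_acts, mixed_acts, reduction_percent(single_acts, mixed_acts))
```

The computed reductions of 65.625% and 93.75% round to the 66% and 94% reported in the table.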

5 Conclusions & Future Work
5.1 Conclusion
In order to implement Deep Learning (DL) in energy- and performance-limited edge devices, this
study aims to minimize network model size. Specifically, the effects of mixed-precision
quantization using extremely small bit widths (2, 4, and 8) and the
precision/classification-accuracy trade-off are observed on two data sets.
Optimal bit widths are found per layer on LeNet-5 while maintaining 95% of the baseline accuracy
on the MNIST dataset. The mixed-precision version of LeNet-5 achieves a 90% reduction in
weight sizes and a 92% reduction in activation sizes.
For the CIFAR-10 dataset, three networks are studied, each containing a different number of layers
and a different ratio of convolutional to fully connected layers. While maintaining 95% of the
baseline accuracy, each network is quantized using the optimal bit widths, and all networks achieve
at least a 75% reduction in weight and activation sizes. It was hypothesized that larger networks
would show greater reductions in model size when quantized due to their large number of
redundant connections. However, due to limitations of the algorithm for finding optimal bit widths,
the largest model (Network 1) exhibits the smallest percentage reduction in weight size.
Single-precision and mixed-precision architectures are designed to classify the XOR Problem.
Both implementations follow the design of a spatial architecture by utilizing Processing Engines
(PEs) containing independent memories and control units. However, the mixed-precision version
contains quantization units, and its PEs make use of variable bit-width multipliers. The
mixed-precision architecture shows improvements in many areas while maintaining 100% accuracy:
weight size is reduced by 66%, activation size is reduced by 94%, memory read size is reduced by
56%, memory write size is reduced by 94%, and a speedup of 2 is found when processing the
entire dataset. The results are even more impressive when the size of the network is considered;
the XOR model is minimal, and little redundancy would be expected in its connections.
Nevertheless, the model is still able to withstand extreme quantization while
maintaining comparable accuracy.
The quantization of Deep Neural Networks (DNNs), specifically mixed-precision quantization,
allows the size of models to be significantly reduced at the cost of classification accuracy. This
optimization would fit well in edge devices with limited resources, and small accuracy losses can
be traded for faster and more energy-efficient processors.

5.2 Future Work
Although this study produced outstanding results, it only scratched the surface of what is possible
with quantization. A total of four future work recommendations are given, with three of them being
on the algorithmic development side and one being on the hardware development side:
1. Observe effects of Batch Normalization in quantized networks.
2. Observe effects of Quantization-aware training.
3. Improve optimal bit width search algorithm.
4. Build a physical implementation of the proposed architecture.
Batch normalization is a technique that normalizes the activations of each layer before they are
fed to the following layers. Although this technique is used to speed up the training of DNNs, it
has been shown to have fortuitous benefits in quantized networks [39]. This is because activations
and weights in a normalized distribution lend themselves to quantization more easily, which allows
precision to be reduced further. A study can observe the effects of extremely quantized neural
networks when batch normalization layers are added.
Compared to post-training quantization performed in this study, quantization-aware training aims
to train networks from scratch with quantization in mind [30]. The technique can improve model
accuracy, as the backpropagation algorithm is modified to reduce loss in the quantized network
explicitly. A study could compare extremely quantized networks using post-training quantization
and quantization-aware training.
As mentioned in Section 4.1.4, the algorithm used to find the optimal bit width in this study is
limited, as it fails to consider the factors that play a role when specific layers are quantized.
Therefore, a more sophisticated algorithm capable of reducing precision further is desired.
In this work, only behavioral simulations of the architectures are studied. However, to measure the
actual effects of quantization on energy and area, a physical implementation or a simulation of a
physical implementation is needed. Therefore, it is recommended that one be designed to gain
further insight.

References

[1] S. Herculano-Houzel, "The human brain in numbers: a linearly scaled-up primate brain,"
Frontiers in Human Neuroscience, vol. 3, 2009.
[2] J. L. Anastasopoulos, D. Badani, C. Lee, S. Ginosar and J. R. Williams, Artists, Political
image analysis with deep neural. [Art]. Harvard, 2017.
[3] M. Olazaran, "A Sociological Study of the Official History of the Perceptrons Controversy,"
Social Studies of Science, vol. 26, no. 3, pp. 611-659, 1996.
[4] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning representations by
back-propagating errors," Nature, vol. 323, 1986.
[5] J. McCarthy, "Artificial intelligence: A general Survey," Science Research Council, vol.
Artificial Intelligence: a paper symposium, 1973.
[6] B. Reagen, R. Adolf, P. Whatmough, G.-Y. Wei and D. Brooks, Deep Learning for Computer
Architects, Morgan & Claypool, 2017.
[7] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep
Convolutional Neural Networks," in NIPS Proceedings of the 25th International Conference
on Neural Information Processing Systems, 2012.
[8] Google, Inc., "In-Datacenter Performance Analysis of a Tensor Processing Unit,"
International Symposium on Computer Architecture (ISCA), vol. 44, 2017.

[9] C. C. Aggarwal, Neural Networks and Deep Learning, Yorktown Heights: Springer, 2018.
[10] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, 2016.
[11] W. Maass, "Networks of Spiking Neurons: The Third Generation of Neural Network
Models," Neural Networks, vol. 10, pp. 1659-1671, 1997.
[12] M. Capra, B. Bussolino, A. Marchisio, G. Masera and M. Shafique, "Hardware and Software
Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends,
Challenges, and the Road Ahead," IEEE Access, 2020.
[13] D. A. Patterson and J. L. Hennessy, Computer Organization and Design The
Hardware/Software Interface, Waltham: Elsevier, 2014.
[14] N. H. E. Weste and D. M. Harris, CMOS VLSI Design A Circuits and Systems Perspective,
Boston: Addison-Wesley, 2011.
[15] D. Silver, A. Huang, C. J. Maddison, A. Guez, J. Schrittwieser, I. Antonoglou,
V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I.
Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel and D. Hassabis, "Mastering
the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484-489,
2016.
[16] V. Sze, Y.-H. Chen, T.-J. Yang and J. Emer, "Efficient Processing of Deep Neural Networks:
A Tutorial and Survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295 -2329, 2017.

[17] M. Capra, B. Bussolino, A. Marchisio, M. Shafique, G. Masera and M. Martina, "An
Updated Survey of Efficient Hardware," Future Internet, vol. 12, no. 113, 2020.
[18] B. Moons, K. Goetschalckx, N. Van Beckelaer and M. Verhelst, "Minimum Energy
Quantized Neural Networks," in 51st Asilomar Conference on Signals, Systems, and
Computers, 2017.
[19] J. Chen and X. Ran, "Deep Learning With Edge Computing: A Review," Proceedings of the
IEEE, vol. 107, no. 8, pp. 1655 - 1674, 2019.
[20] M. Satyanarayanan, "The Emergence of Edge Computing," Computer, vol. 50, no. 1, pp. 30
- 39, 2017.
[21] E. Talpes, D. Das Sarma, G. Venkataramanan, P. Bannon, B. McGee, B. Floering, A. Jalote,
C. Hsiong, S. Arora, A. Gorti and G. S. Sachdev, "Compute Solution for Tesla's Full
Self-Driving Computer," IEEE Micro, vol. 40, no. 2, pp. 25-25, 2020.
[22] A. Golding, K. Li and R. Neuhausler, "Machine Learning in Distributed Edge Networks
Optimizing for Security and Latency in Surveillance Systems," 2018.
[23] A. Kusupati, M. Singh, K. Bhatia, A. Kumar, P. Jain and M. Varma, "FastGRNN: A Fast,
Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network," arXiv preprint,
2019.
[24] A. Meola, "IoT Healthcare in 2021: Companies, medical devices, and use cases," 2 February
2021. [Online]. Available: https://www.businessinsider.com/iot-healthcare. [Accessed 19
April 2021].

[25] S. Umar Amin and A. M. Shamim Hossain, "Edge Intelligence and Internet of Things in
Healthcare: A Survey," IEEE Access, vol. 9, pp. 45-59, 15 December 2020.
[26] S. Han, J. Pool, J. Tran and W. J. Dally, "Learning both Weights and Connections for
Efficient Neural Networks," arXiv, 8 Jun 2015.
[27] M. Denil, B. Shakibi, L. Dinh, M. Ranzato and N. De Freitas, "Predicting parameters in deep
learning," arXiv preprint arXiv:1306.0543, 2013.
[28] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," in IEEE
International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San
Francisco, 2014.
[29] P. Gysel, J. Pimentel, M. Motamedi and S. Ghiasi, "Ristretto: A Framework for Empirical
Study of Resource-Efficient Inference in Convolutional Neural Networks," IEEE
Transactions on Neural Networks and Learning Systems, vol. 29, no. 11, pp. 5784-5789,
2018.
[30] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam and D. Kalenichenko,
"Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only
Inference," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt
Lake City, 2018.
[31] B. Moons, B. De Brabandere, L. Van Gool and M. Verhelst, "Energy-efficient ConvNets
through approximate computing," in IEEE Winter Conference on Applications of Computer
Vision, Lake Placid, 2016.

[32] B. Moons, D. Bankman and M. Verhelst, Embedded Deep Learning: Algorithms,
Architectures and Circuits for Always-on Neural Network Processing, Cham: Springer,
2018.
[33] V. Camus, C. Enz and M. Verhelst, "Survey of Precision-Scalable Multiply-Accumulate
Units for Neural-Network Processing," in IEEE International Conference on Artificial
Intelligence Circuits and Systems (AICAS), Hsinchu, 2019.
[34] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra and H. Esmaeilzadeh, "Bit Fusion:
Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network,"
in ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los
Angeles, 2018.
[35] Y. LeCun, C. Cortes and C. J. Burges, "The MNIST Data Base of handwritten digits," 1998.
[36] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009.
[37] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-Based Learning Applied to
Document Recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278 - 2324, 1998.
[38] A. I. Solis, "Dedicated Hardware for Machine/Deep Learning: Domain Specific
Architectures," Open Access Theses & Dissertations, 2019.
[39] E. Sari and V. P. Nia, "Batch Normalization in Quantized Networks," arXiv, 2020.

Appendix A Bit Brick Verilog Code
`timescale 1ns / 1ps
//////////////////////////////////////////////////////////////
// Company: University of Texas at El Paso
// Engineer: Andres Rios
//
// Create Date: 10/04/2020 07:59:18 PM
// Design Name:
// Module Name: bit_brick
// Project Name:
// Target Devices:
// Tool Versions:
// Description:
//
// Dependencies:
//
// Revision:
// Revision 0.01 - File Created
// Additional Comments:
//
//////////////////////////////////////////////////////////////

module bit_brick(
    x,
    y,
    sx,
    sy,
    p
    );
    input  [1:0] x;
    input  [1:0] y;
    input        sx;
    input        sy;
    output [5:0] p;
    wire signed [2:0] x_int;
    wire signed [2:0] y_int;
    assign x_int = {(sx & x[1]), x[1:0]};
    assign y_int = {(sy & y[1]), y[1:0]};
    assign p = x_int * y_int;
endmodule

Appendix B Fusion Unit Multiplier Verilog Code
`timescale 1ns / 1ps
////////////////////////////////////////////////////////////
// Company: University of Texas at El Paso
// Engineer: Andres Rios
//
// Create Date: 10/14/2020 01:07:35 AM
// Design Name:
// Module Name: fusion_unit multiplier
// Project Name:
// Target Devices:
// Tool Versions:
// Description:
//
// Dependencies:
//
// Revision:
// Revision 0.01 - File Created
// Additional Comments:
//
////////////////////////////////////////////////////////////
`include "common_defines.sv"
module fusion_unit(
    inputs,
    weights,
    sInputs,
    sWeights,
    composition,
    outputs
    );
    input  [1:0]            inputs   [3:0][3:0];
    input                   sInputs  [3:0][3:0];
    input  [1:0]            weights  [3:0][3:0];
    input                   sWeights [3:0][3:0];
    input  [`comp_bits-1:0] composition;
    output [31:0]           outputs;

    wire signed [5:0]       products [3:0][3:0];

    /* create array of bit bricks */
    generate
        genvar x,y;
        for (x = 0; x < 4; x = x + 1) begin : bit_brick_x_gen
            for (y = 0; y < 4; y = y +1) begin : bit_brick_y_gen
                bit_brick bit_bricks (
                    .x (inputs  [x][y]),
                    .y (weights [x][y]),
                    .sx(sInputs [x][y]),
                    .sy(sWeights[x][y]),
                    .p (products[x][y])
                );
            end
        end
    endgenerate

    integer           a;
    integer           b;
    integer           brickWidth;
    integer           brickHight;
    integer           shift;
    reg signed [31:0] sum;

    always_comb begin
        sum = 0;
        case(composition)
            `_2x2: begin
                brickHight = 1;
                brickWidth = 1;
            end
            `_2x4: begin
                brickHight = 1;
                brickWidth = 2;
            end
            `_2x8: begin
                brickHight = 1;
                brickWidth = 4;
            end
            `_4x2: begin
                brickHight = 2;
                brickWidth = 1;
            end
            `_4x4: begin
                brickHight = 2;
                brickWidth = 2;
            end
            `_4x8: begin
                brickHight = 2;
                brickWidth = 4;
            end
            `_8x2: begin
                brickHight = 4;
                brickWidth = 1;
            end
            `_8x4: begin
                brickHight = 4;
                brickWidth = 2;
            end
            `_8x8: begin
                brickHight = 4;
                brickWidth = 4;
            end
        endcase
        for(a = 0; a < 4; a = a + 1) begin
            for(b = 0; b < 4; b = b + 1) begin
                shift = ((b % brickWidth) * 2)
                      + ((a % brickHight) * 2);
                sum   = sum
                      + (products[a][b] << shift);
            end
        end
    end

    assign outputs = sum;

endmodule

Curriculum Vita
Andres Rios was born on June 4, 1998, in El Paso, Texas. Graduating in June 2016 from Franklin
High School, he enrolled in The El Paso Community College (EPCC) that fall. Mr. Rios later
attended The University of Texas at El Paso (UTEP) in spring 2017 to pursue a bachelor's in
Electrical Engineering. As an undergraduate, Mr. Rios participated in the Fast Track program at
UTEP, allowing him to take graduate courses as part of his bachelor's and master's degrees. During
this time, he became interested in machine learning while pursuing research under the mentorship
of Dr. Patricia Nava in fall 2019. Mr. Rios graduated with his bachelor's degree in December 2020
and was awarded the Floyd Decker Top Senior Award by the Department of Electrical and
Computer Engineering. He officially started pursuing a master's in Computer Engineering in
January 2021.
During his four and a half years at UTEP, Mr. Rios was heavily involved in the Institute of
Electrical and Electronics Engineers (IEEE) student chapter, holding positions as Engineering
Student Leadership Council (ESLC) Representative, President, and Auxiliary officer. In addition,
he was awarded a scholarship from the Pathways to Success in Graduate Engineering (PASSE)
program, which aims to support Fast Track students financially, academically, and professionally.
He also participated in two industry internships as a Digital Design Engineer at Texas Instruments
(TI) and was the TI Student Ambassador to UTEP in his final year. After graduation, Mr. Rios
plans to join full-time as a Digital Design Engineer at TI in Dallas, Texas.


