Performance analysis of machine learning applications on rapid: a highly parallel computer architecture by Modi, Aakash Ketan
© 2017 Aakash Ketan Modi
PERFORMANCE ANALYSIS OF MACHINE LEARNING
APPLICATIONS ON RAPID: A HIGHLY PARALLEL COMPUTER
ARCHITECTURE
BY
AAKASH KETAN MODI
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2017
Urbana, Illinois
Adviser:
Associate Professor Rakesh Kumar
ABSTRACT
Over the past few years, interest in machine learning algorithms and their
application have grown rapidly. Machine learning has found extensive use in
diverse fields such as self-driving cars, speech recognition, image processing,
computer vision, molecular biology, and security. A lot of recent research involves
evaluation of machine learning applications on different architectures. In this
thesis, we evaluate the performance of six common machine learning algo-
rithms: K-Means, K-Nearest Neighbors, Linear Regression, Latent Dirichlet
Allocation, Deep Neural Network, and Radix Sort on RAPID. RAPID is a
highly parallel computer architecture developed at Oracle Labs for acceler-
ating and improving the performance of database analytic workloads. We
find that the RAPID platform performs well on the performance-per-watt
metric, i.e., it is a power-efficient architecture. Moreover, the machine learning
applications can be easily scaled to hundreds of nodes of the RAPID
architecture, thereby making it suitable for distributed machine learning ap-
plications. However, we find certain bottlenecks in the micro-architecture,
memory system and network of the RAPID architecture and propose opti-
mizations to make it a more performance efficient architecture for machine
learning applications.
To my parents, for their love and support.
ACKNOWLEDGMENTS
This thesis would not have been written without an internship at Oracle
Labs, Belmont. I first want to thank Professor Rakesh Kumar for helping
me get this opportunity and for agreeing to supervise my work for six
months. He not only made securing the internship possible, but before
that also gave me the opportunity to work as a Master's student at the PASSAT
laboratory, UIUC. I thank him for his continued kindness to me across those
two years.
Thanks to Professor Rakesh Kumar and the members of the PASSAT
laboratory, I have discovered and been amazed by the field of Computer
Architecture. I thank all the members of the lab for talking, chit-chatting
and giving me all the knowledge I have about that field.
I also warmly thank my advisers at Oracle Labs, Sungpack Hong and
Hassan Chafi, for welcoming me into their group. My advisers gave me the
opportunity to attack challenging problems. I really appreciated being
given freedom when the work was enjoyable, and support when it became
challenging. I enjoyed collaborating and sharing moments with the other
members of the OLRSDK team: thank you Jeongseob Ahn for letting me
ask you all the questions on my mind; thank you Ettore Trainiti for always
bringing the Italian smile and humor to the office (and sharing it outside
of office hours too); thank you Damein Hilloulin for giving me interesting
questions and perspectives.
My parents were a great source of remote support during that internship. Thank
you very much for putting faith in me and pushing me forward, not just for
those last six months but for my entire life.
Finally, I also really want to thank my friends - Agrim, Azin, Konik, Astha,
Kalika, Namita, Sowmya, Ishan, Yasir, Nishant and Shivani for supporting
and encouraging me throughout the course of the last 2 years.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
CHAPTER 2 SYSTEM ARCHITECTURE
2.1 Intel Haswell Platform
2.2 RAPID v1.0 Platform
CHAPTER 3 MACHINE LEARNING ALGORITHMS
3.1 K-Nearest Neighbors
3.2 K-Means
3.3 Latent Dirichlet Allocation
3.4 Deep Neural Network
3.5 Linear Regression
3.6 Radix Sort
CHAPTER 4 EVALUATION AND PERFORMANCE ANALYSIS
4.1 Single Node Performance Evaluation
4.2 Multi-Node RAPID Performance Evaluation
CHAPTER 5 POTENTIAL RAPID ARCHITECTURAL IMPROVEMENTS
5.1 Micro-Architecture
5.2 Memory System
5.3 Network
CHAPTER 6 CONCLUSION
REFERENCES
LIST OF TABLES
2.1 Oracle x5-2 System Configuration
2.2 Oracle RAPID DPU Description
4.1 Machine Learning Applications and Input Parameters
4.2 Micro-Architectural Statistics-RAPID
LIST OF FIGURES
2.1 Backend of the Intel Haswell Processor Family and All Its Execution Units
2.2 Overview of RAPID Physical System
2.3 DPU Architecture
2.4 dbCore Architecture
3.1 LDA Algorithm Overview
3.2 MNIST
3.3 Linear Regression Example
3.4 Radix Sort Example
4.1 Speedup on Intel Haswell Platform
4.2 Scalability on Intel Haswell Platform
4.3 Speedup on RAPID Platform
4.4 Scalability on RAPID Platform
4.5 Normalized Runtime of Machine Learning Applications
4.6 Normalized Performance/Watt of Machine Learning Applications
4.7 Speedup on Multi-DPU
4.8 Scalability on Multi-DPU
CHAPTER 1
INTRODUCTION
Machine learning [1], [2] is a branch of artificial intelligence [3], [4], [5], [6],
which involves the design and construction of computer applications or systems
that are able to learn based on their data inputs and/or outputs. Put simply,
a machine learning system learns by experience; that is, after
specific training, the system is able to generalize from
its exposure to a number of cases and then act appropriately on
new or unforeseen events.
The discipline of machine learning also incorporates other data analysis
disciplines, ranging from predictive analytics and data mining to pattern
recognition. Furthermore, a variety of specific algorithms are used for this
purpose that are frequently organized in taxonomies. These algorithms can
be used depending on the type of input required.
As a discipline, machine learning is not new. Initial documents and references
can be traced back to the 1950s with the work of Alan Turing [7] and
Arthur Samuel [8], followed later by that of Tom M. Mitchell [9]. The field
has undergone extensive development since that time.
Resurging interest in machine learning is due to the same factors that have
made data mining and Bayesian analysis more popular than ever. These
include developments like growing volumes and varieties of available data,
computational processing that is cheaper and more powerful, and affordable
data storage. These advances mean it is possible to quickly and automati-
cally produce models that can analyze bigger, more complex data and deliver
faster, more accurate results even on a very large scale. By building precise
models, an organization has a better chance of identifying profitable opportunities
or avoiding unknown risks. Some recent areas where machine learning has found
extensive application are Google’s self-driving cars, online recommendations
such as those from Amazon and Netflix, fraud detection, computer vision, and
molecular biology.
For machine learning applications to perform optimally, specialized hardware
customized for these applications is necessary. Many big companies
have invested time and money in developing such custom architectures,
for example Google’s Tensor Processing Unit (TPU) [10], Nvidia’s DGX-1 [11], and
Intel’s Movidius [12]. However, there is no public documentation of the underlying
architectures of these platforms. Therefore, we decided to undertake a
project to investigate computer architecture characteristics suitable for machine
learning applications.
For this investigation, we decided to evaluate and compare the performance
of two platforms, Intel Haswell [13] and Oracle’s RAPID, on a suite
of machine learning applications. These platforms lie at two end-points of the
design spectrum: Intel Haswell is a high-performance, power-hungry architecture,
and RAPID is a low-performance, power-efficient architecture. By
evaluating and comparing the performance of machine learning algorithms
on both platforms, we are in a position to propose optimizations to the
RAPID architecture and hence make machine learning applications run more
efficiently on it.
The machine learning applications we investigate are simple and drawn from
a broad spectrum, yet they are extensively used in the field. K-Means [14] and
K-Nearest Neighbors [15] find application in data mining and pattern recognition.
Linear Regression [16] is solved here with the gradient descent algorithm, which
is also the backbone of Deep Neural Network (DNN) training. Latent Dirichlet
Allocation [17] is used in text classification. Finally, we also implement a simple
DNN that classifies the MNIST handwritten digits [18] to understand its behavior.
The thesis is organized in the following way. In Chapter 2, we describe
the architectures of Intel Haswell and RAPID platforms. In Chapter 3, we
explain the different machine learning applications we have used in the study.
In Chapter 4, we analyze and evaluate the performance of these applications
on the platforms. In Chapter 5, we propose changes that need to be made
to the RAPID platform to make it better suited for machine learning appli-
cations. Finally, in Chapter 6, we provide a conclusion of this investigative
project.
CHAPTER 2
SYSTEM ARCHITECTURE
In this study, we have used two parallel architectures, Intel Haswell and
RAPID. The Intel Haswell system was used as a baseline architecture to
evaluate the performance of RAPID architecture. This performance com-
parison and evaluation assisted in proposing architectural changes to RAPID
architecture that are aimed at improving the performance of machine learning
applications on it. In this chapter, we explain the architectures of both
systems.
2.1 Intel Haswell Platform
The reference system used during this work was a bi-processor Intel Haswell
system [13]. Widely used in general-purpose servers and desktop computers,
it features general-purpose processors and an abundant main memory. We
will first go through a high-level description of the system and then a more
detailed micro-architectural description of the processors powering it.
2.1.1 System Description
During this work, the reference platform was the Intel Haswell system. The
configured system includes two Intel Xeon Haswell E5-2699v3 processors, each
with 18 physical cores. The 36 physical cores operate at a base clock
of 2.3 GHz (but can reach up to 3.6 GHz if the TurboBoost technology is
enabled) and are backed by 164 GB of DDR4 main memory.
These Out-of-Order processors have support for the HyperThreading tech-
nology which presents to the operating system twice as many logical cores,
giving an apparent set of 72 logical cores. Similar to other processors in
the Haswell family, these processors also include the AVX2 instruction set
extensions, that are SIMD operations capable of operating on independent
streams of 8 integers of 32 bits, 16 integers of 16 bits, or 32 integers of 8 bits.
The Haswell processors are conventional general purpose processors. They
therefore include support for virtual memory, have caches and support the
protected execution of user-level programs.
Table 2.1 includes a summary of the characteristics of this system.
Table 2.1: Oracle x5-2 System Configuration
Processors             Intel Xeon E5-2699v3 (2 processors)
Core Count             18 physical per processor, 36 logical
LLC Cache              45 MB, shared
L2 Cache               256 kB, private to each core
L1 Data Cache          32 kB, private to each core
L1 Instruction Cache   32 kB, private to each core
Frequency              2.3 GHz (base)
RAM                    164 GB of DDR4-2133
TDP                    145 W per processor
Figure 2.1: Backend of the Intel Haswell Processor Family and All Its
Execution Units
At the heart of the Oracle x5-2 platform are two Intel Xeon E5-2699v3
processors based on the Haswell core family. These processors are aggressive
Out-of-Order processors designed for a wide range of applications and aimed
at high single-thread performance. Figure 2.1, from [19], presents
all the different execution units available in a core of an Intel Haswell-based
processor.
The Haswell cores include numerous load/store, arithmetic and floating-
point units. To feed them, up to four x86 instructions are decoded each cycle
into micro-ops and put into the instruction decode queue until they can be
brought into the reorder buffer. The reorder buffer tracks the status of the
instructions that are being executed or have been speculatively executed until
they have been committed. The instructions that are ready for execution are
dispatched as soon as an appropriate execution port is available. In total,
the Haswell cores can dispatch at most eight micro-ops
per cycle through their ports. For our work, it is important to note that it is
possible to execute up to four scalar integer operations and two AVX2 integer
operations per cycle.
2.2 RAPID v1.0 Platform
2.2.1 System Description
The Oracle RAPID processor was originally designed to accelerate and
improve the performance efficiency of database analytics workloads. Targeting
big data warehouses, which are nowadays limited by the power they can
draw, RAPID tries to reduce the power envelope to a minimum while improving
performance. To reach these goals, RAPID includes various hardware
blocks that make it possible to implement the SQL operators efficiently. At
the same time, RAPID abandons many of the features used in conventional
desktop or server processors.
Figure 2.2 describes the complete RAPID physical system. It shows that the
RAPID hardware stack is divided into different components: the RDBMS,
interconnects (Infiniband and PCI-E), and RAPID processors called Data
Processing Units (DPUs). The DPUs are designed to improve the performance
of database analytics workloads while minimizing the power consumption of
the system. The DPUs are physically arranged into domains, and each domain
is connected to other domains via the PCI-E and Infiniband channels.
There are 12 domains in the system and each domain has 14 DPUs, which
makes the RAPID platform a highly parallel architecture.
Figure 2.2: Overview of RAPID Physical System
DPU Architecture
Figure 2.3 describes the DPU architecture. It consists of the following:
• 32 cores (called dbCores): main processing unit of RAPID
• ARM processors: mostly used for networking
• Atomic Transaction Engine (ATE): low-latency interconnect between
dbCores
• Data Movement System (DMS): smart DMA engine
• MailBox Controller (MBC): allows message passing between ARM pro-
cessors and dbCores
• Low-Level Interfaces: serial buses used for system management
Figure 2.3: DPU Architecture
dbCore Architecture
Figure 2.4 shows the dbCore architecture. Similar to other specialized
hardware, RAPID trades big power-hungry caches for smaller caches with low
associativity, does not offer cache coherency, and includes scratchpads.
In RAPID, the scratchpads are private to a core, and data is moved from
RAM into these scratchpads by programming custom DMAs. These custom
DMAs offer the traditional sequential read and write modes from or to
DRAM, as well as specific modes to accelerate data analytics workloads. We will
not detail those modes, as our implemented algorithms only required
sequential reads and writes.
Because these hardware blocks are enough to accelerate the SQL operators,
the general-purpose processing cores of RAPID are much simpler than
their Intel Haswell counterparts. The RAPID cores are in-order cores with a
shallow pipeline of six stages. Most instructions execute in one cycle.
The RAPID pipeline is dual-issue and can dispatch a load or store to the
load/store unit in parallel with an instruction executing on the ALU.
Thanks to this lean design, a DPU includes 32 cores (called dbCores),
split across four macros of eight dbCores. Inside a macro, all eight cores
share a second-level cache, but without any enforced cache coherency.
Cache coherency must be enforced in software by executing cache maintenance
instructions such as flush and invalidate, which force the write-back of
modified lines and force a read from the next level of the memory hierarchy
on the next read/write.
Figure 2.4: dbCore Architecture
The RAPID cores do not include any prefetcher or predictor to improve
the usage of the caches or avoid the penalties of branches. Instead, it is
recommended to make the most of the scratchpad and the DMS to avoid
cache misses or branching. Thanks to the DMS, which is a specialized DMA
engine with dedicated modes of operation, it is conceptually possible to
perform the necessary data fetches and writebacks explicitly. The DMS covers
the essential memory movement operations: sequential accesses from DRAM
into the scratchpad and the other way around, but also strided accesses and
gather/scatter operations. Also, the RAPID DPUs do not support virtual
memory and have a limited memory protection mechanism.
Table 2.2 sums up the configuration of one RAPID DPU. The RAPID
DPUs are placed on RAPID domain boards; each domain board
includes 14 DPUs connected through a PCIe interconnect. DPUs on different
domain boards can communicate with each other through an Infiniband
interconnect.
Table 2.2: Oracle RAPID DPU Description
Processors             Oracle RAPID DPU (2 processors)
Core Count             32 dbCores
L2 Cache               128 kB, private to each macro
L1 Data Cache          32 kB, private to each dbCore
L1 Instruction Cache   32 kB, private to each dbCore
Scratchpad             32 kB, private to each dbCore
Frequency              800 MHz
RAM                    8 GB of DDR3 (6 GB accessible to user)
TDP                    6 W
As we can see, the RAPID hardware is a very different design point compared
to the Intel processors. It is designed as a low-power processor for
energy efficiency and is provided with communication buses so that it can
scale out.
Hence, these platforms lie at two end-points of the design spectrum, one
being a high-performance, power-hungry architecture and the other a
low-performance, power-efficient architecture. By evaluating and comparing
the performance of machine learning algorithms on both platforms, we are
able to propose ideal design points for these applications from the spectrum.
CHAPTER 3
MACHINE LEARNING ALGORITHMS
3.1 K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a non-parametric, lazy machine learning algorithm
that classifies data based on its neighbors [15].
Non-parametric means that it does not make any assumptions
about the underlying data distribution. This is very useful in the real world
because most practical data does not obey the typical theoretical
assumptions (e.g., Gaussian mixtures, linear separability). It is
also a lazy algorithm, i.e., it does not use the training data points to do
any generalization. In other words, there is no explicit training phase, or it
is very minimal. This means the training phase is comparatively fast. Lack
of generalization means that KNN keeps all the training data. More exactly,
all the training data is needed during the testing phase.
Next, we explain how KNN classification works. In this case, we are given
some labeled data points for training and a new unlabeled data point for
testing. Our aim is to find the class label for the new point. The algorithm
behaves differently depending on k.
• Case 1: k = 1 or Nearest Neighbor Rule
This is the simplest scenario. Let x be the point to be labeled. Find
the point closest to x and call it y. The nearest-neighbor rule
assigns the label of y to x. This seems too simplistic and sometimes
even counterintuitive. If you feel that this procedure will result in a large
error, you are right, but there is a catch: this reasoning holds only
when the number of data points is not very large.
If the number of data points is very large, then there is a very high
chance that the labels for x and y are the same. An example might
help. Let us say you have a (potentially) biased coin. You toss it 1
million times and get heads 900,000 times. Then your
next toss will most likely come up heads. We can use a similar argument here.
Consider the following informal argument. Assume all points lie in a
D-dimensional space and the number of points is reasonably large. This
means that the density of points anywhere in the space is fairly high. In other
words, within any subspace there is an adequate number of points.
Consider a point x in the subspace which also has many neighbors,
and let y be its nearest neighbor. If x and y are sufficiently close,
then we can assume the probability that x and y belong to the same
class is high. Then, by decision theory, x is assigned the same class as y.
• Case 2: k = K or k-Nearest Neighbor Rule
This is a straightforward extension of 1NN. Basically, we
find the k nearest neighbors and take a majority vote. Typically,
k is odd when the number of classes is two. Let us say k = 5 and there
are three instances of C1 and two instances of C2. In this case, KNN
says that the new point must be labeled as C1, as it forms the majority.
We follow a similar argument when there are multiple classes.
One straightforward extension is not to give every neighbor
an equal vote. It is very common to use weighted KNN, where each
point has a weight that is typically calculated from its distance. For
example, under inverse distance weighting, each point has a weight
equal to the inverse of its distance to the point to be classified. This
means that nearer points have a larger vote than farther
points.
Accuracy may increase as k increases, but
the computation cost also increases.
KNN is surprisingly versatile, and its applications range from computer
vision and proteins to computational geometry and graphs.
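The two cases above, together with inverse distance weighting, can be sketched
in a few lines of Python. This is a minimal illustration rather than the
implementation evaluated in this thesis: the function name knn_classify, the
Euclidean distance, the small epsilon that guards against division by zero,
and the toy data are all assumptions made for the example.

```python
import math
from collections import defaultdict

def knn_classify(train, query, k=1, weighted=False):
    """Label `query` from its k nearest neighbors in `train`.

    train is a list of (point, label) pairs; points are tuples of numbers.
    """
    # Lazy learner: no training step, all the work happens at query time.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = defaultdict(float)
    for point, label in by_distance[:k]:
        if weighted:
            # Inverse distance weighting; the epsilon is an assumption that
            # avoids dividing by zero when the query equals a training point.
            votes[label] += 1.0 / (math.dist(point, query) + 1e-12)
        else:
            votes[label] += 1.0  # plain majority vote
    return max(votes, key=votes.get)

# Toy data: three points of class C1 and two of class C2.
train = [((0, 0), "C1"), ((0, 1), "C1"), ((1, 0), "C1"),
         ((5, 5), "C2"), ((5, 6), "C2")]
query = (4, 4)

print(knn_classify(train, query, k=1))                 # C2 (nearest neighbor)
print(knn_classify(train, query, k=5))                 # C1 (3 votes against 2)
print(knn_classify(train, query, k=5, weighted=True))  # C2 (closer points win)
```

With k = 5 the unweighted vote reproduces the three-against-two situation
described above, while the weighted vote favors the two much closer C2 points.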
3.2 K-Means
K-means is one of the oldest and most commonly used clustering algorithms.
It is a prototype-based clustering technique defining the prototype in terms
of a centroid which is considered to be the mean of a group of points and
is applicable to objects in a continuous n-dimensional space [14]. Clustering
is an unsupervised learning technique. The aim here is to group the data
points into clusters such that similar items are placed together in the same
cluster. Because the data is unlabeled, the algorithm receives no training
and must discover the structure on its own. Clustering is one of the
important tools in exploratory analysis.
The algorithm accepts two inputs: the data itself and k, the number of
clusters. The output is k clusters with the input data partitioned among them.
The aim of K-means (or clustering in general) is to group the items
into k clusters such that all items in the same cluster are as similar to each
other as possible, while items in different clusters are as different
as possible. We use distance measures to calculate similarity and dissimilarity.
One of the important concepts in K-means is that of the centroid. Each
cluster has a centroid. Consider it as the point that is most representative of
the cluster. Equivalently, the centroid is the point at the center of a cluster.
The K-means algorithm is as follows:
• Randomly choose k items and make them the initial centroids.
• For each point, find the nearest centroid and assign the point to the
cluster associated with the nearest centroid.
• Update the centroid of each cluster based on the items in that cluster.
Typically, the new centroid will be the average of all points in the
cluster.
• Repeat steps 2 and 3 until no point switches clusters.
K-means finds application in various topics including market segmentation,
computer vision, astronomy and agriculture.
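The four steps listed above map directly onto a short Python function. The
sketch below is illustrative only and is not the parallel implementation
evaluated later; the function name kmeans, the seed and max_iters parameters,
the Euclidean distance, and the choice to keep the old centroid when a cluster
becomes empty are assumptions made for the example.

```python
import math
import random

def kmeans(points, k, seed=0, max_iters=100):
    """Cluster `points` (tuples of numbers) into k groups."""
    rng = random.Random(seed)
    # Step 1: randomly choose k items as the initial centroids.
    centroids = rng.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iters):
        # Step 2: assign each point to the cluster of its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop as soon as no point switches clusters.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```

The assignment step is independent for every point, which is what makes the
algorithm a natural fit for data-parallel execution across many cores.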
3.3 Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a generative probabilistic model for
collections of discrete data such as text corpora [17]. LDA is a three-level
hierarchical Bayesian model, in which each item of a collection is modeled
as a finite mixture over an underlying set of topics. Each topic is, in turn,
modeled as an infinite mixture over an underlying set of topic probabilities.
The problem of modeling text corpora and other collections of discrete data
is addressed by the algorithm. The goal is to find short descriptions of the
members of a collection that enable efficient processing of large collections
while preserving the essential statistical relationships that are useful for basic
tasks such as classification, novelty detection, summarization, and similarity
and relevance judgments. However, the LDA model is not necessarily tied to
text, and it has applications to other problems involving collections of data,
including data from domains such as collaborative filtering, content-based
image retrieval and bioinformatics.
Figure 3.1: LDA Algorithm Overview
The algorithm takes two inputs: the collection of documents (the text corpus)
and the number of topics. Each document consists of a bag of words. As
shown in Figure 3.1, the model gives three outputs: how much of each topic a
document contains, which topic each word in a document belongs to, and what
probability each word gets in each topic. After the random initialization in
the first step below, the remaining steps are performed
a large number of times to reach steady-state distributions:
• Randomly assign each word in every document to one of the topics.
• For each word w in every document d and for each topic t, compute:
– p(topic t | document d)
– p(word w | topic t)
• Reassign w to a new topic, where we choose topic t with probability
proportional to p(topic t | document d) * p(word w | topic t).
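The sampling loop described by the steps above can be sketched as follows.
This is a simplified illustration in the style of a collapsed Gibbs sampler,
not the implementation evaluated in this thesis. The function name lda_sample,
the smoothing constants alpha and beta, the iteration count, and the toy
corpus are all assumptions; the smoothing terms are added so that no topic
ever receives zero probability.

```python
import random

def lda_sample(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """docs is a list of documents, each a list of word strings."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Randomly assign each word in every document to one of the topics.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables that back the two conditional probabilities.
    doc_topic = [[0] * num_topics for _ in docs]               # words of d in t
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the word out of its current topic before resampling.
                t = z[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight of topic k: p(topic k | document d) * p(word w | topic k),
                # up to a constant factor that is the same for every topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Reassign w to a topic drawn with probability proportional to its weight.
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return z, doc_topic, topic_word

docs = [["cpu", "cache", "core", "cache"],
        ["gene", "protein", "cell", "gene"],
        ["cpu", "core", "protein", "cell"]]
z, doc_topic, topic_word = lda_sample(docs, num_topics=2)
print(doc_topic)   # how much of each topic every document contains
```

The three return values correspond to the three outputs named in the text: the
per-word topic assignments, the per-document topic counts, and the per-topic
word counts.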
3.4 Deep Neural Network
Neural networks are a computational approach used in computer science
and other research disciplines, based on a large collection of neural
units (artificial neurons) that loosely mimic the way a biological brain solves
problems with large clusters of biological neurons connected by axons. Each
neural unit is connected with many others, and links can be excitatory or
inhibitory in their effect on the activation state of connected neural units.
Each individual neural unit may have a summation function which combines
the values of all its inputs together. There may be a threshold function or
limiting function on each connection and on the unit itself, such that the
signal must surpass the limit before propagating to other neurons. These
systems are self-learning and trained, rather than explicitly programmed,
and they excel in areas where the solution or feature detection is difficult to
express in a traditional computer program.
Neural networks typically consist of multiple layers or a cube design, and
the signal path traverses from front to back. Back-propagation propagates
the output error backward through the layers to adjust the weights of the
earlier neural units; it is typically used in supervised training, where the
correct result is known. More modern networks are a bit more free-flowing in
terms of stimulation and inhibition, with connections interacting in a much
more chaotic and complex fashion. Dynamic neural networks are the most advanced in
that, based on rules, they can dynamically form new connections and even
new neural units while disabling others.
The goal of the neural network is to solve problems in the same way that
the human brain would, although several neural networks are more abstract.
Modern neural network projects typically work with a few thousand to a few
million neural units and millions of connections, which is still several orders
of magnitude less complex than the human brain and closer to the computing
power of a worm.
Here, we implement a simple deep neural network to identify handwritten
digits. This algorithm works on the MNIST dataset [18]. Figure 3.2 depicts
the goal of the DNN.
Figure 3.2: MNIST
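The neural unit described in this section, a summation over weighted inputs
followed by a limiting function, and the front-to-back signal path through
layers can be sketched as below. This is only an illustration of the forward
pass; it is not the network evaluated in this thesis, and the sigmoid limiting
function, the function names, and the hand-picked weights are assumptions made
for the example. A classifier for MNIST would use the same structure with 784
inputs (28 x 28 pixels) and 10 outputs, one per digit.

```python
import math

def neuron(inputs, weights, bias):
    """One neural unit: weighted sum of the inputs, then a limiting function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid used as a smooth threshold

def layer(inputs, weight_rows, biases):
    """A layer is a list of units that all see the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

def forward(inputs, layers):
    """Propagate the signal from the first layer to the last."""
    activation = inputs
    for weight_rows, biases in layers:
        activation = layer(activation, weight_rows, biases)
    return activation

# Hypothetical 2-input network with one hidden layer of two units
# and a single output unit.
network = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.0, -0.1]),  # hidden layer
    ([[1.2, -0.7]], [0.05]),                   # output layer
]
print(forward([1.0, 0.0], network))
```

Training would then adjust the weights and biases from labeled examples, for
instance with the gradient descent procedure described in the next section.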
3.5 Linear Regression
Linear regression is an approach for modeling the relationship between a
scalar dependent variable y and one or more explanatory variables (or inde-
pendent variables) denoted X [16]. The case of one explanatory variable is
called simple linear regression. For more than one explanatory variable, the
process is called multiple linear regression.
In other words, linear regression is a way to approximate the trends of
given data with a linear expression, e.g., y = c0 + c1x1 + c2x2 + ... + cnxn.
As shown in Figure 3.3, the algorithm tries to find c0,c1,...,cn for given
x1,x2,...,xn. There are many ways to find the values of c0 to cn,
but we focus on gradient descent. Gradient descent is an iterative algorithm
that calculates the gradient of the error to minimize the error between the model
and the actual data. It starts from an initial guess (randomly generated) and
gradually improves it by calculating the gradient for a specific set of data
points until the model converges (or until a fixed number of iterations is reached).
The following summarizes the resulting LMS (least mean squares) algorithm.
Figure 3.3: Linear Regression Example
• For each data point d with the associated value y, do the following:
– Compute the dot product between c and d, i.e., h = c * d
– Vector add (y - h) * d to the new solution c’
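One pass of the update above, repeated for a fixed number of iterations, can
be sketched in Python as follows. This is a minimal illustration, not the
implementation evaluated in this thesis. The learning_rate step size is an
assumption (the summary above does not name one, but the update needs a
scaling factor to converge), as are the function names, the zero initial guess
used here for reproducibility in place of a random one, and the convention of
prepending a constant 1.0 to each data point so that c0 acts as the intercept.

```python
def lms_pass(data, c, learning_rate=0.01):
    """One pass over the data: accumulate (y - h) * d into a new solution."""
    c_new = list(c)
    for d, y in data:
        # Dot product between the current solution c and the data point d.
        h = sum(ci * di for ci, di in zip(c, d))
        # Vector add the scaled error term to the new solution.
        for j, dj in enumerate(d):
            c_new[j] += learning_rate * (y - h) * dj
    return c_new

def fit(data, num_coefficients, iterations=2000, learning_rate=0.01):
    c = [0.0] * num_coefficients  # assumption: zero start instead of random
    for _ in range(iterations):
        c = lms_pass(data, c, learning_rate)
    return c

# Points on the line y = 1 + 2x; each d is (1.0, x) so that c0 is the intercept.
data = [((1.0, float(x)), 1.0 + 2.0 * x) for x in range(5)]
c = fit(data, num_coefficients=2)
print(c)  # close to [1.0, 2.0]
```

Each data point contributes an independent dot product and vector addition,
so the inner loop can be split across cores and the partial sums combined at
the end of every pass.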
3.6 Radix Sort
Radix sort is one of the fastest sorting algorithms. It is especially fast for
large problem sizes and hence is useful in many machine learning algorithms.
Radix sort is not a comparison sort but a counting sort.
Figure 3.4: Radix Sort Example
The algorithm sorts k n-bit integer keys using buckets. Figure 3.4 illustrates
the algorithm. Each key is first figuratively dropped into one level of
buckets corresponding to the value of the rightmost digit. Each bucket preserves
the original order of the keys as the keys are dropped into the buckets.
There is a one-to-one correspondence between the buckets and the values
that can be represented by the rightmost digit. Then, the process repeats
with the next more significant digit until there are no more digits
to process. In other words:
• Take the least significant bit (or group of bits, both being examples of
radices) of each key.
• Group the keys based on that bit, but otherwise keep the original order
of keys.
• Repeat the grouping process with each more significant bit.
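The three steps above correspond to a least-significant-digit radix sort,
which can be sketched in Python as follows. This is an illustrative sketch
rather than the implementation evaluated in this thesis; the function name
radix_sort, the default 32-bit key width, and the default 8-bit radix are
assumptions, and the keys are assumed to be non-negative integers that fit in
key_bits bits.

```python
def radix_sort(keys, key_bits=32, radix_bits=8):
    """Sort non-negative integers smaller than 2**key_bits."""
    num_buckets = 1 << radix_bits
    mask = num_buckets - 1
    # Process one group of bits at a time, least significant group first.
    for shift in range(0, key_bits, radix_bits):
        buckets = [[] for _ in range(num_buckets)]
        for key in keys:
            # Appending keeps the keys in their previous relative order,
            # which is what makes each pass stable.
            buckets[(key >> shift) & mask].append(key)
        # Concatenate the buckets in order to form the input of the next pass.
        keys = [key for bucket in buckets for key in bucket]
    return keys

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```

Setting radix_bits=1 gives the single-bit variant described in the text; a
wider radix trades more buckets for fewer passes over the keys.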
CHAPTER 4
EVALUATION AND PERFORMANCE
ANALYSIS
In this chapter, we evaluate the performance of the different implementations
of our algorithms and analyze their behavior. Our evaluation is
divided into two groups of tests: in the first group, we
look at the performance characteristics of one DPU versus the Intel Haswell
platform (scalability inside a DPU, performance of the dbCores), while in the
second group, we explore the scalability of our distributed
RAPID implementation.
The goal of the first group of experiments was to provide enough insight
to be able to indicate to the hardware designers of RAPID how to possibly
improve the performance of the dbCores for machine learning applications.
The second group of experiments was dedicated to providing insight on
possible software and hardware modifications on the RAPID implementation
to improve the scalability of our workload.
4.1 Single Node Performance Evaluation
In this section, we explore the performance of our different single-node
implementations. For machine learning applications to perform well, it is
imperative that they scale to multiple cores. Hence, we begin our
evaluation by analyzing the extent to which the applications are
parallelizable on both platforms. We then compare the performance of our
RAPID implementations with that of the Haswell implementations, each at
maximal performance. We also look at the micro-architectural behavior on
RAPID (instruction stream composition, efficiency of the dual-issuing,
impact of branch mispredictions). Where possible, we compare these metrics
to those obtained on the Intel platform. The goal of this group of
experiments is to give the designers of the RAPID hardware more insight
into the impact of their choices on the performance of the algorithms
compared to the Intel Haswell platform.
4.1.1 Evaluation Protocol
In this first set of experiments, we compare the RAPID platform and the
Intel Haswell platform by analyzing the performance of the six machine
learning algorithms discussed in Chapter 3. Table 4.1 lists these
algorithms and their input parameters. We first implement a sequential
version of each algorithm on both platforms and then scale it up to 32
threads using OpenMP [20]. We then analyze the performance and scalability
of these applications individually on each platform and compare the
platforms on performance and power. Finally, we use hardware performance
counters on the RAPID platform to investigate the bottlenecks in the
architecture and propose changes to improve its performance.
We run our experiments on the following machine configurations:
• Intel Haswell: Use up to 32 Haswell cores in an X5-2 machine (arranged
as 36 physical cores in 2 sockets) with 6 GB DRAM
• RAPID: Use up to 32 dbCores in a single DPU with 6 GB DRAM
Table 4.1: Machine Learning Applications and Input Parameters
Application          Input Parameters
K-Means              1M data-points, 784 dimensions, 10 clusters
KNN                  60K data-points, 784 dimensions, 20 nearest neighbors
Linear Regression    1M data-points, 784 dimensions
LDA                  200K documents, 2.2M vocabulary, 100 topics
DNN                  1M images
Radix Sort           500M integer keys
4.1.2 Single Node Performance on Intel Haswell Platform
Figure 4.1: Speedup on Intel Haswell Platform
Figure 4.1 shows the speedups of the six applications as we scale them
from two threads to 32 threads. None of the applications scales linearly
with the number of cores. For instance, the ideal speedup at 32 threads is
32x, but for K-Means the speedup is about 23x. Of the six applications,
K-Means, KNN and DNN scale better than Linear Regression, Radix Sort and
LDA. The loss of scalability is due to several factors: Amdahl’s law [21],
load imbalance [22] and cache/memory effects. It indicates that the Intel
Haswell platform may not be the best fit for machine learning algorithms
that need hundreds of threads.
Figure 4.2 quantifies the scalability in terms of parallel efficiency
(Equation 4.1). The higher the parallel efficiency, the better the
scalability.
\[
\text{Parallel Efficiency} = \frac{\text{Observed Speedup}}{\text{Ideal Speedup}} \times 100 \tag{4.1}
\]
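As a small worked example of Equation 4.1, the sketch below computes the parallel efficiency for the K-Means data point quoted above (about 23x at 32 threads), taking the ideal speedup to be the thread count. The second helper is an addition made here for illustration: it reports the serial fraction that Amdahl's law alone would need in order to explain the observed speedup, which overstates the true serial fraction whenever load imbalance and cache/memory effects also contribute. Both function names are hypothetical.

```python
def parallel_efficiency(observed_speedup, threads):
    """Equation 4.1, with the ideal speedup taken to be the thread count."""
    return observed_speedup / threads * 100.0


def implied_serial_fraction(observed_speedup, threads):
    """Serial fraction f solving Amdahl's law S = 1 / (f + (1 - f) / N)."""
    return (threads / observed_speedup - 1.0) / (threads - 1.0)


if __name__ == "__main__":
    # K-Means on Intel Haswell: about 23x speedup at 32 threads.
    print(parallel_efficiency(23.0, 32))
    print(implied_serial_fraction(23.0, 32))
```

For this data point the efficiency is roughly 72%, and a serial fraction of a little over 1% would be enough to account for the whole loss if Amdahl's law were the only factor.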
Figure 4.2: Scalability on Intel Haswell Platform
4.1.3 Single Node Performance on RAPID Platform
Figure 4.3: Speedup on RAPID Platform
Figure 4.3 shows the speedups of the six applications when we scale them
from two dbCores to 32 dbCores of a single DPU. The applications scale
almost linearly with the number of threads, suggesting that the RAPID
architecture could be suitable for machine learning applications that need
to be scaled to hundreds of threads.
The primary reason the RAPID platform scales better than the Intel
Haswell platform is the presence of a custom DMA engine and a scratchpad
memory (described in Chapter 5) that efficiently manage cache and memory
accesses for each of the dbCores in the system.
Figure 4.4 shows that the parallel efficiency of applications on the RAPID
platform is close to 100%, reiterating that the applications scale well on
the platform, potentially to hundreds of threads.
Figure 4.4: Scalability on RAPID Platform
4.1.4 Single Node Performance Comparison
In this section, we compare the two platforms on absolute performance and
on performance-per-watt.
Figure 4.5: Normalized Runtime of Machine Learning Applications
Figure 4.5 compares the absolute performance of the six applications. We
run 32 threads of each application for this study. The execution time of
each application on Intel Haswell is normalized to 1, and the execution
time on the RAPID platform is scaled accordingly.
The Intel Haswell platform clearly outperforms the RAPID platform in five
of the six applications. This result was expected because the Haswell
cores were designed to be high-performance cores.
The study becomes interesting when we compare the applications in terms
of performance-per-watt. Figure 4.6 shows this comparison. Here again, the
numbers on Intel Haswell have been normalized to 1 and correspondingly
scaled for RAPID platform.
The RAPID platform outperforms the Intel Haswell platform for all the
applications because its architecture was specifically designed to be power
efficient. The 32 cores on the Haswell platform consume about 260 W of
power, compared to only 6 W consumed by one DPU.
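To make the relationship between the two normalized metrics concrete, the sketch below converts a normalized runtime into a normalized performance-per-watt figure using the power numbers quoted above (about 260 W for the 32 Haswell cores and 6 W for one DPU). It assumes that performance is the inverse of execution time and that these two power figures apply to every application; the function name and the example runtime of 5.0 are hypothetical and are not values read from Figures 4.5 or 4.6.

```python
HASWELL_POWER_W = 260.0  # 32 Haswell cores, as quoted in the text
RAPID_POWER_W = 6.0      # one DPU, as quoted in the text


def normalized_perf_per_watt(rapid_normalized_runtime):
    """RAPID performance-per-watt relative to Intel Haswell (Haswell = 1.0)."""
    relative_performance = 1.0 / rapid_normalized_runtime
    relative_power = RAPID_POWER_W / HASWELL_POWER_W
    return relative_performance / relative_power


if __name__ == "__main__":
    # Hypothetical case: RAPID takes 5x as long as Haswell on some application.
    print(normalized_perf_per_watt(5.0))
    # Runtime ratio at which both platforms deliver equal performance-per-watt.
    print(HASWELL_POWER_W / RAPID_POWER_W)
```

Under these assumptions, RAPID stays ahead on performance-per-watt as long as it is less than about 43 times slower than Haswell.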
Figure 4.6: Normalized Performance/Watt of Machine Learning Applications
These results suggest that it is possible to build a customized
architecture for machine learning applications by making certain
micro-architectural changes to the RAPID architecture, improving its
performance relative to Intel Haswell while remaining power-efficient. In
other words, we could improve the RAPID architecture by trading some power
for performance improvements specifically targeted at machine learning
applications. In order to do this, we need to analyze its performance at
the micro-architectural level. In the next section, we do precisely that.
4.1.5 Micro-Architectural Results
In this section, we enumerate detailed performance statistics gathered on
the RAPID and Intel Haswell platforms. To collect statistics on RAPID, we
use the internal performance counters, whereas on Intel Haswell we use the
VTune Amplifier [23] performance measurement tool.
Table 4.2 lists the results of the six applications on the RAPID platform.
We evaluate these statistics in Chapter 5, where we describe the
architectural changes that could make RAPID better suited to machine
learning applications.
Table 4.2: Micro-Architectural Statistics-RAPID
                        K-Means    KNN     LR    LDA     RS    DNN
IPC                        0.70   0.71   0.63   1.17   0.64   0.70
Cache
  L1 I$ Miss-rate (%)      0.00   0.03   0.06   0.67   0.00   0.10
  L1 D$ Miss-rate (%)      2.21   0.37   0.02   4.43   5.24  10.24
  L2 Miss-rate (%)        14.62  38.43  49.73  28.04  22.97  49.99
Instruction Mix
  Loads (%)               19.97  16.66  19.98   9.25   7.14   0.97
  Stores (%)               0.91   0.05   5.00   3.72   4.76   0.00
  ALU (%)                 68.22  66.70  65.02  75.87  83.33  98.54
  Dual-Issued (%)         10.90  16.59  10.00  11.16   4.76   0.49
Cycles Stalled
  L1 D$ Stalls (%)         3.22   0.59   3.68  19.11  25.35   3.22
  L1 I$ Stalls (%)         0.00   0.01   0.01   3.75   0.01   0.00
  Branch Stalls (%)        7.06   6.16   6.35  20.37   6.10   2.25
  Mult-Div Stalls (%)     27.87  40.48  39.80  25.42  31.90   0.00
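Chapter 5 quotes several averages over the six applications (an IPC of 0.76, an L2 miss-rate of about 34%, about 8% branch stalls and about 28% multiply/divide stalls). The short sketch below shows how those figures follow from Table 4.2 as unweighted means across the six columns. The values are copied from the table; the dictionary and variable names are introduced here for illustration only.

```python
from statistics import mean

# Column order: K-Means, KNN, LR, LDA, RS, DNN (values copied from Table 4.2).
TABLE_4_2 = {
    "IPC": [0.70, 0.71, 0.63, 1.17, 0.64, 0.70],
    "L2 Miss-rate (%)": [14.62, 38.43, 49.73, 28.04, 22.97, 49.99],
    "Branch Stalls (%)": [7.06, 6.16, 6.35, 20.37, 6.10, 2.25],
    "Mult-Div Stalls (%)": [27.87, 40.48, 39.80, 25.42, 31.90, 0.00],
}

# Unweighted mean of each metric across the six applications.
averages = {name: round(mean(values), 2) for name, values in TABLE_4_2.items()}

if __name__ == "__main__":
    for name, value in averages.items():
        print(f"{name}: {value}")
```

Note that these are simple means, so an outlier such as the LDA branch-stall figure pulls the average up noticeably.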
4.2 Multi-Node RAPID Performance Evaluation
In the previous sections, we found that the RAPID architecture is scalable
(parallel efficiency close to 100%) on a single DPU. In this section, we im-
plement the applications on multiple DPUs in order to verify if the RAPID
architecture is indeed scalable to more cores. We also investigate any soft-
ware or hardware modifications that need to be made to RAPID in order to
make it scale better.
Figure 4.7 shows the speedup of four applications as we scale from 1 DPU
to 4 DPUs. K-Means, KNN and Linear Regression scale almost linearly with
the number of DPUs. However, LDA loses scalability as we scale it to more
DPUs. This is because of the high-latency network between DPUs, over which
they must communicate data. Also, the software framework (MapReduce) that
manages this communication is not efficient enough. We discuss these
drawbacks and their proposed solutions in detail in Chapter 5.
Nevertheless, these results reiterate that the RAPID architecture is well
suited to multi-threaded machine learning applications and could be used to
run highly parallel applications.
Figure 4.7: Speedup on Multi-DPU
Figure 4.8: Scalability on Multi-DPU
CHAPTER 5
POTENTIAL RAPID ARCHITECTURAL
IMPROVEMENTS
In Chapter 4, we showed that the RAPID architecture is a good fit for
machine learning applications that need to scale to hundreds of threads.
It also compares favorably with Intel Haswell on performance-per-watt, but
it lags behind Intel Haswell in terms of raw performance. However, because
its architecture is simple, there is room for many changes to RAPID that
improve its performance while maintaining its power efficiency.
In this chapter, we propose architectural changes to the RAPID platform
that would make it more suitable for machine learning applications.
Performance can be improved by making each dbCore more efficient
(intra-DPU) as well as by improving the hardware and software responsible
for managing communication between DPUs (inter-DPU).
5.1 Micro-Architecture
5.1.1 Out of Order Execution
In order to improve the performance of each core, different solutions can
be investigated. Out of Order (OoO) execution can be used to extract more
of the available Instruction Level Parallelism and so use the available
execution units more efficiently.
During our analysis, we found that the average IPC is 0.76 on RAPID
compared to 2.4 on Intel Haswell. The most direct way to improve IPC is to
make the core OoO. We also found that the instruction stream of our
algorithms on the RAPID platform is well-balanced between memory and
arithmetic operations. However, because of the instruction scheduling
performed by the compiler, the dual-issue rate is low. One way to mitigate
the problem is to write assembly code manually, but this approach cannot be
applied broadly. Instead, it appears more beneficial in the general case to
use an Out of Order pipeline with a small instruction window.
In our case, all the applications appeared able to exploit dual-issue with
an instruction window as small as eight instructions. That would result in
a performance improvement of 1.73x, while the increase in power consumption
should stay limited. A bigger instruction window does not seem to be
required, as the only goal is to feed the load/store unit and the
arithmetic units (one general, and one multiplier/divider).
As indicated by the hardware designers of RAPID, an Out of Order pipeline
with a small instruction window seems to be good enough to improve the
dual-issue rate, while also making it possible to mitigate the cost of
cache misses in the L1. It appears to be the modification of choice for
making RAPID perform better on applications that cannot use the scratchpads
and have to resort to the caches, or on applications too complex for the
compilers.
5.1.2 Branch Predictors
The RAPID architecture does not have any branch prediction mechanism.
So, whenever a branch instruction is in the instruction pipeline, all the
instructions behind it must be stalled until the branch is resolved, and
the processor wastes those cycles without doing any useful work. In our
study, we found that on average about 8% of cycles are wasted in branch
stalls. To overcome this bottleneck, we recommend that computer
architectures targeted at machine learning include good branch predictors
[24], [25].
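This thesis does not specify which predictor RAPID should adopt. Purely as an illustration of what a simple dynamic predictor does, the sketch below models a classic scheme: a small table of two-bit saturating counters indexed by the low bits of the branch address. The class name, the table size, the initial counter state and the synthetic loop trace are all assumptions made for this example and do not describe RAPID or Haswell hardware.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low bits of the branch PC."""

    def __init__(self, table_bits=4):
        self.mask = (1 << table_bits) - 1
        # Counter values 0-1 predict not-taken, 2-3 predict taken.
        self.counters = [1] * (1 << table_bits)  # start weakly not-taken

    def predict(self, pc):
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def misprediction_rate(predictor, trace):
    """Fraction of (pc, taken) pairs in the trace that are predicted wrongly."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses / len(trace)


if __name__ == "__main__":
    # Synthetic loop branch: taken 9 times, then not taken once, repeated 10 times.
    loop_trace = ([(0x40, True)] * 9 + [(0x40, False)]) * 10
    print(misprediction_rate(TwoBitPredictor(), loop_trace))
```

On loop-heavy code such as the kernels studied here, a predictor of this kind mispredicts mainly at loop exits, so most branches would no longer stall the pipeline.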
5.1.3 Faster Multipliers and Dividers
The current multipliers and dividers in the RAPID architecture are very
simple in design and take four to nine cycles to perform a computation.
Hence, the instruction pipeline is stalled for some cycles while these
computations are performed, leading to a loss of performance. As per our
investigation, on average about 28% of cycles are lost to multiply and
divide stalls. We need either faster multipliers or pipelined multipliers
[26], [27] to compensate for these stalls.
5.1.4 Floating-Point Units
The RAPID hardware currently has only integer arithmetic logic units.
However, many machine learning applications require floating-point
arithmetic, particularly those that need to calculate probabilities and
statistics. In our study, we modified such applications to use only integer
operations. This reduces the accuracy of a single iteration, so more
iterations are needed to reach the accuracy that would have been possible
with floating-point units. The result is more computation than on the Intel
Haswell platform, and hence an increased execution time on RAPID. The
presence of floating-point units would help to bridge the performance gap
between the two architectures.
5.2 Memory System
5.2.1 Cache Coherency
The current version of the RAPID hardware does not have hardware cache
coherence. Instead, programmers must enforce coherence themselves by
issuing invalidation and flush instructions. Because of the development
costs this incurs, the RAPID hardware designers have decided, based on our
recommendations, to implement cache coherence in a later version of the
hardware. As discussed with them, the coherent caches should principally be
used to hold instructions and shared or concurrent data.
On the current hardware, when operating on shared data structures that
can be modified by several cores, the programmer needs to program the ATE
(Atomic Transaction Engine) to execute a Remote Procedure Call or issue
a hardware-accelerated operation (invalidation on another core, addition of
a value to a specified memory address, compare-and-swap, etc.) on a
specific core and wait for the return.
The ATE is supposed to be a lightweight palliative for the lack of cache
coherence. However, for each call to the ATE, descriptors must be filled
and the transaction started and executed, resulting in the loss of hundreds
of cycles. When using remote procedure calls, the receiving core is
interrupted, executes the specified code and returns, and only then can the
issuing core proceed with the received values, resulting in cycles lost on
both cores. While faster to some extent than message passing, using the
ATE has a significant cost.
In our multi-DPU implementations, we would like to be able to distribute
the data structures across DPUs. To do that, we would need to be able to
update the status of certain variables atomically, lock access to a
resource, modify it, etc. Those steps require atomic increments, spinning
until a lock is acquired, and atomic reads and writes, all using the ATE.
Because of that, the use of the ATE would not be as lightweight and
negligible as desired.
Hardware cache coherence and atomic operations should make this easier.
Ideally, they would also make all the operations carried out by the ATE in
the current version much faster.
5.2.2 Last Level Cache
In the current version of RAPID, there are two levels of cache, L1 and L2.
L1 is private to each dbCore and L2 is shared between eight dbCores. Each
L1 cache is 32 kB and direct-mapped, and each L2 is 128 kB and four-way
associative, as described in Table 2.2. During our analysis, we found that
the L2 suffered a very high miss-rate of about 34% on RAPID. Intel Haswell
suffers a similar L2 miss-rate of about 30%, but it has a large 45 MB LLC
to compensate for it. We believe that having a similar-sized LLC (shared
between all dbCores within a DPU) in RAPID would also help improve its
performance by avoiding long-latency memory accesses.
5.2.3 Bigger DRAM
Machine learning applications traditionally work on input datasets whose
sizes are a few GBs. On RAPID, each DPU has only 8 GB of memory, which is
insufficient to run real-time machine learning applications. We
circumvented this limitation in our analysis by using smaller input
datasets. Based on our investigation, machine learning applications would
typically require about 30 GB of memory per DPU. As a result of this study,
RAPID designers have decided to provide more memory in the next version of
the RAPID architecture.
5.3 Network
Even though it was designed as a scale-out architecture, the RAPID system
in its current version suffers from high latency when communicating from
one DPU to another, despite using an InfiniBand interface. This high
latency arises mostly because the dbCores of a DPU cannot program the
network interface themselves and have to signal the ARM core running the
operating system so that it sends the packets. Because of that
architecture, the latency measured from a dbCore to send a packet can range
from ten to one hundred microseconds. This is far from the sub-microsecond
figure advertised by vendors such as Mellanox (for example, at [28]).
In order to improve the latency of the network, it seems crucial that the
dbCores be able to send packets without the intervention of the ARM core.
Just as on more conventional systems using InfiniBand, being able to fill
the Queue Pairs with new Work Queue Elements from the dbCores, which are
then seen by the InfiniBand Host Channel Adapter, would be a big leap
forward for the network subsystem of RAPID.
CHAPTER 6
CONCLUSION
We set out to investigate computer architecture characteristics that would
be suitable for machine learning applications. In this process, we analyzed
and evaluated two architecture platforms, Intel’s Haswell and Oracle’s
RAPID, on six machine learning applications: K-Means, KNN, Linear
Regression, LDA, DNN and Radix Sort. While Intel Haswell is a
high-performance, power-hungry architecture, RAPID is a low-performance,
power-efficient architecture. This gave us the opportunity to study
architectures at the two end-points of the design spectrum, and thereby
propose optimizations to the RAPID architecture that converge on a suitable
design point for customized machine learning architectures.
During our study, we found that for machine learning applications to
perform well, we need architectures that are highly parallel and that also
deliver efficient per-thread performance. RAPID was very effective on the
former requirement (applications scaled across tens of RAPID cores);
however, it lagged behind Intel Haswell in per-thread performance. On the
other hand, it fared better than Intel Haswell on performance-per-watt. In
light of these observations, we believe that RAPID’s performance can be
improved at the cost of some power efficiency. Finally, we suggested
architectural changes to RAPID, namely Out of Order execution, cache
coherence, a last level cache, good branch predictors, faster multipliers
and dividers, and a low-latency network, that would make it an efficient
architecture for machine learning applications.
REFERENCES
[1] D. D. Jensen, “Knowledge evaluation: Statistical evaluations,” in Hand-
book of Data Mining and Knowledge Discovery. Oxford University
Press, Inc., 2002, pp. 475–489.
[2] “Machine Learning Overview,”
https://en.wikipedia.org/wiki/Machine_learning.
[3] S. Russell and P. Norvig, “A modern approach,” Artificial Intelligence.
Prentice-Hall, Englewood Cliffs, vol. 25, p. 27, 1995.
[4] D. L. Poole, A. K. Mackworth, and R. Goebel, Computational Intelli-
gence: A Logical Approach. Oxford University Press New York, 1998,
vol. 1.
[5] G. F. Luger, Artificial Intelligence: Structures and Strategies for Com-
plex Problem Solving. Pearson Education, 2005.
[6] M. Hutter, “Computational aspects,” Universal Artificial Intelligence:
Sequential Decisions Based on Algorithmic Probability, pp. 209–229,
2005.
[7] A. P. Saygin, I. Cicekli, and V. Akman, “Turing test: 50 years later,”
in The Turing Test. Springer, 2003, pp. 23–78.
[8] A. L. Samuel, “Some studies in machine learning using the game of
checkers,” IBM Journal of Research and Development, vol. 3, no. 3, pp.
210–229, 1959.
[9] R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Machine Learning:
An Artificial Intelligence Approach. Springer Science & Business Media,
2013.
[10] “Google’s Tensor Processing Unit,” https://drive.google.com/file/d/
0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view.
[11] “Nvidia DGX-1,” http://www.nvidia.com/object/
deep-learning-system.html.
[12] “Intel Movidius,” https://www.movidius.com/.
[13] “Intel Haswell Architecture,” http://www.intel.com/
content/dam/www/public/us/en/documents/manuals/
64-ia-32-architectures-optimization-manual.pdf.
[14] J. MacQueen, “Some methods for classification and analysis of
multivariate observations,” in Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability, Volume 1:
Statistics. Berkeley, Calif.: University of California Press, 1967.
[Online]. Available: http://projecteuclid.org/euclid.bsmsp/1200512992
pp. 281–297.
[15] N. S. Altman, “An introduction to kernel and nearest-neighbor
nonparametric regression,” The American Statistician, vol. 46, no. 3,
pp. 175–185, 1992. [Online]. Available: http://www.tandfonline.com/
doi/abs/10.1080/00031305.1992.10475879
[16] D. A. Freedman, Statistical Models: Theory and Practice. Cambridge
University Press, 2009.
[17] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,”
Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022,
2003.
[18] “MNIST Dataset,” http://yann.lecun.com/exdb/mnist/.
[19] A. L. Shimpi, “Intel Haswell Architecture,” http://www.anandtech.
com/show/6355/intels-haswell-architecture/8.
[20] “OpenMP,” http://www.openmp.org/.
[21] “Amdahl’s Law,” https://en.wikipedia.org/wiki/Amdahl's_law.
[22] “Load Balance,”
https://en.wikipedia.org/wiki/Load_balancing_(computing).
[23] “Intel VTune Amplifier,” https://software.intel.com/en-us/
intel-vtune-amplifier-xe.
[24] R. E. Kessler, “The alpha 21264 microprocessor,” IEEE Micro, vol. 19,
no. 2, pp. 24–36, 1999.
[25] J. E. Smith, “A study of branch prediction strategies,” in Proceedings of
the 8th Annual Symposium on Computer Architecture. IEEE Computer
Society Press, 1981, pp. 135–148.
[26] T. Hanyu and M. Kameyama, “A 200 MHz pipelined multiplier us-
ing 1.5 V-supply multiple-valued MOS current-mode circuits with dual-
rail source-coupled logic,” IEEE Journal of Solid-State Circuits, vol. 30,
no. 11, pp. 1239–1245, 1995.
[27] M. Hatamian and G. L. Cash, “A 70-MHz 8-bit × 8-bit parallel
pipelined multiplier in 2.5-μm CMOS,” IEEE Journal of Solid-
State Circuits, vol. 21, no. 4, pp. 505–513, 1986.
[28] Mellanox Technologies, “Infiniband Performance,”
http://www.mellanox.com/page/performance_infiniband.
