Accelerating Linear Algebra and Machine Learning Kernels on a Massively Parallel Reconfigurable Architecture by Soorishetty, Anuraag (Author) et al.
Accelerating Linear Algebra and Machine Learning Kernels on a Massively Parallel
Reconfigurable Architecture
by
Anuraag Soorishetty
A Thesis Presented in Partial Fulfillment
of the Requirements for the Degree
Master of Science
Approved November 2019 by the
Graduate Supervisory Committee:
Chaitali Chakrabarti, Chair
Hun Seok Kim
Robert LiKamWa
ARIZONA STATE UNIVERSITY
December 2019
ABSTRACT
This thesis presents efficient implementations of several linear algebra kernels, ma-
chine learning kernels and a neural network based recommender systems engine onto
a massively parallel reconfigurable architecture, Transformer. The linear algebra ker-
nels include Triangular Matrix Solver (TRSM), LU Decomposition (LUD), QR De-
composition (QRD), and Matrix Inversion. The machine learning kernels include
an LSTM (Long Short Term Memory) cell, and a GRU (gated Recurrent Unit) cell
used in recurrent neural networks. The neural network based recommender systems
engine consists of multiple kernels including fully connected layers, embedding layer,
1-D batchnorm, Adam optimizer, etc.
Transformer is a massively parallel reconfigurable multicore architecture designed
at the University of Michigan. The Transformer configuration considered here is 4
tiles and 16 General Processing Elements (GPEs) per tile. It supports a two level
cache hierarchy where the L1 and L2 caches can operate in shared (S) or private (P)
modes. The architecture was modeled using Gem5 and cycle accurate simulations
were done to evaluate the performance in terms of execution times, giga-operations
per second per Watt (GOPS/W), and giga-floating-point-operations per second per
Watt (GFLOPS/W).
This thesis shows that for linear algebra kernels, each kernel achieves high per-
formance for a certain cache mode and that this cache mode can change when the
matrix size changes. For instance, for smaller matrix sizes, L1P, L2P cache mode is
best for TRSM, while L1S, L2S is the best cache mode for LUD, and L1P, L2S is the
best for QRD. For each kernel, the optimal cache mode changes when the matrix size
is increased. For instance, for TRSM, the L1P, L2P cache mode is best for smaller
matrix sizes (N = 64, 128, 256, 512) and it changes to L1S, L2P for larger matrix sizes
(N = 1024). For machine learning kernels, L1P, L2P is the best cache mode for all
i
network parameter sizes.
Gem5 simulations show that the peak performance for TRSM, LUD, QRD and
Matrix Inverse in the 14nm node is 97.5, 59.4, 133.0 and 83.05 GFLOPS/W, respec-
tively. For LSTM and GRU, the peak performance is 44.06 and 69.3 GFLOPS/W.
The neural network based recommender system was implemented in L1S, L2S
cache mode. It includes a forward pass and a backward pass and is significantly
more complex in terms of both computational complexity and data movement. The
most computationally intensive block is the fully connected layer followed by Adam
optimizer. The overall performance of the recommender systems engine is 54.55
GFLOPS/W and 169.12 GOPS/W.
ii
ACKNOWLEDGMENTS
I would like to extend my sincere gratitude to my thesis advisor Dr. Chaitali Chakrabarti
for her continuous support and motivation throughout my work. I am thankful to my
committee members Dr. Hun Seok Kim and Dr. Robert LiKamWa for providing me
feedback and their time on my research. I would also like to thank Subhankar Pal,
Siying Feng from the University of Michigan; Jian Zhou, Yan Xiong and Srimayee
Kanagala from Arizona State University for their help throughout my work. I am
also grateful to be part of DARPA’s Software Defined Hardware (SDH) program.
Lastly, I would like to thank my family and friends for their continuous motivation
and support.
iii
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Linear Algebra: Algorithms and Architecture . . . . . . . . . . . . . . . . . . . . . 2
1.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.1 Efficient Mapping of Linear Algebra Kernels . . . . . . . . . . . . . . . 6
1.4.2 Efficient Mapping of RNN Kernels . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.3 Efficient Mapping of Recommender Systems Engine . . . . . . . . 8
1.5 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 BACKGROUND: KERNEL ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Matrix Algebra Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Triangular Matrix Solver (TRSM) . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 LU Decomposition (LUD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 QR Decomposition (QRD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.4 Matrix Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Machine Learning Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Long Short Term Memory (LSTM) cell . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Gated Recurrent Unit (GRU) cell . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Recommender Systems Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Embedding Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Rectified Linear Unit (ReLU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
iv
CHAPTER Page
2.3.3 1-D Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.4 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.5 Fully Connected Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.6 Binary Cross Entropy Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.7 Adam Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 MAPPING KERNELS ON RECONFIGURABLE MULTICORE AR-
CHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Transformer - A Reconfigurable Multicore Architecture . . . . . . . . . . . . 23
3.2 Mapping: Linear Algebra Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Triangular Matrix Solver (TRSM) . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.2 LU Decomposition (LUD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.3 QR Decomposition (QRD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.4 Matrix Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Mapping: Machine Learning Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.1 LSTM Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.2 GRU Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Mapping: Recommender Systems Engine . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 Embedding layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.2 ReLU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.3 1-D Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.4 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.5 Fully Connected Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.6 Binary Cross Entropy Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.7 Adam Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
v
CHAPTER Page
4 RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1 Transformer Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Calculation of Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Performance of Linear Algebra Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.1 Triangular Matrix Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.2 LU Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.3 QR Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3.4 Matrix Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Performance of Machine Learning Kernels . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4.1 LSTM cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4.2 GRU cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5 Performance of Recommender Systems Engine . . . . . . . . . . . . . . . . . . . . 48
5 CONCLUSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Key Take-aways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
vi
LIST OF TABLES
Table Page
4.1 Transformer Configuration Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 TRSM: Execution Times (in ms) for Different Matrix Sizes . . . . . . . . . . . 39
4.3 LUD: Execution Times (in ms) for Different Matrix Sizes . . . . . . . . . . . . . 41
4.4 Matrix Inverse: Execution Times (in ms) for Different Matrix Sizes . . . . 44
4.5 LSTM: Execution Times (in ms) for Different Parameter Sizes . . . . . . . . 46
4.6 GRU: Execution Times (in ms) for Different Parameter Sizes . . . . . . . . . 47
4.7 Performance Data for 256 Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.8 Performance Metrics for Forward Pass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.9 Performance Metrics for Backward Pass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
vii
LIST OF FIGURES
Figure Page
2.1 Triangular Matrix Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Givens Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 An LSTM Cell (Adapted from [28]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 A Basic GRU Cell [20] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Neural Network for Recommender Systems [33] . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Transformer Architecture (Adapted from [44]) . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Pseudo Code for LUD when A is of Size N ×N . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Mapping of LU for the First Three Column-Updates (Left to Right) . . . 27
3.4 Mapping of Matrix-Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5 Cycle at Which Each Sub-Diagonal Entry can be Annihilated for N=8 . 29
3.6 Mapping of Matrix-Vector Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 TRSM: Hit Rates for L1 and L2 Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 TRSM: GFLOPS/W for Different Matrix Sizes. . . . . . . . . . . . . . . . . . . . . . . 40
4.3 LUD: The Time for Which Each GPE is Active in the Two Scheduling
Schemes for v1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4 LUD: Hit Rates for L1 and L2 Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5 LUD: GFLOPS/W for Different Matrix Sizes . . . . . . . . . . . . . . . . . . . . . . . . 42
4.6 QRD: Hit Rates for L1 and L2 Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.7 QRD: GFLOPS/W for Different Matrix Sizes . . . . . . . . . . . . . . . . . . . . . . . . 44
4.8 Matrix Inverse: Hit Rates for L1 and L2 Caches . . . . . . . . . . . . . . . . . . . . . 45
4.9 Matrix Inverse: GFLOPS/W for Different Matrix Sizes . . . . . . . . . . . . . . . 45
4.10 LSTM: GFLOPS/W for Different Parameter Sizes. . . . . . . . . . . . . . . . . . . . 46
4.11 GRU: GFLOPS/W for Different Parameter Sizes . . . . . . . . . . . . . . . . . . . . . 47
4.12 Recsys: Breakdown of Execution Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
viii
Chapter 1
INTRODUCTION
Linear algebra is one of the pillars of high performance computing (HPC) and more
recently, machine learning. Since the introduction of the first microprocessor, there
has been consistent improvement in performance of acceleration of linear algebra
kernels. Today, many of these kernels have been successfully mapped to multicore
architectures for better performance.
Multi-core architectures come in many guises, that include General-Purpose Pro-
cessors (GPPs), digital signal processors (DSPs), graphic processors (GPU), etc. Cur-
rently, there is a significant rise in heterogeneous computing systems where a multi-
core CPU is paired with a GPU or other specialized accelerators to achieve higher
performance. For example, Qualcomm’s Snapdragon 855+ [1] chip has an octa-core
processor, an AI engine with a CPU and a GPU, etc. Even Apple’s A13 bionic [2]
chip has an hexa-core processor with four low power cores, a four core GPU, etc.
Highest performance is achieved by application-centric systems that are heavily
optimized for a specific set of kernels (ASICs). The main difference between GPPs
and ASIC architectures are their efficiency and programmability. GPPs are very
flexible and programmable depending on the application, on the other hand, ASICs
provide 10× - 100× higher performance comparatively with limited programmability.
While ASICs and GPPs are at two ends of the architecture spectrum, in between,
there are several architecture families with different degrees of programmability and
computation efficiency. These include Coarse-Grained Reconfigurable Architecture
(CGRAs), Field-Programmable Gate Arrays (FPGAs) and Application-Specific In-
struction Set Processors (ASIPs). So current architecture research has focused on
1
building systems that have the programmability of GPPs combined with the efficiency
of ASICs. Transformer is one such processing architecture designed by researchers at
the University of Michigan that bridges the gap between the programmability abnd
efficiency. It is a low power multicore architecture that is highly programmable. It
supports C/C++ making it as programmable as CPUs; the programming model itself
resembles a GPU with a simple Application Programming Interface (API).
In this thesis, we focus on implementing data-intensive routines in linear algebra
and machine learning on Transformer.
1.1 Linear Algebra: Algorithms and Architecture
Linear algebra is a branch of mathematics that mainly focuses on linear functions
such as a1x1 + a2x2 + ... + anxn = b, and linear equations such as (x1, ..., xn) →
a1x1 + a2x2 + ... + anxn. These are generally represented by matrices and vectors.
Linear algebra has a wide range of applications in statistics, scientific computing, etc.
All the performance-critical linear algebra routines are grouped into an umbrella
called the Basic Linear Algebra Subprograms (BLAS) [36]. These are software rou-
tines that provide standard building blocks for performing basic vector and matrix
operations. They are divided into three levels. Level 1 BLAS performs scalar- vector
and vector-vector operations, Level 2 BLAS performs matrix-vector operations, and
Level 3 BLAS performs matrix-matrix operations.
Most of the libraries that provide linear algebra kernels are built upon the BLAS
interface. Examples of BLAS libraries are AMD Core Math Library (ACML) [3], Au-
tomatically Tuned Linear Algebra Software (ATLAS) [56], OpenBLAS [58]. ATLAS is
a an open source library that automatically generates optimized implementations for
any arbitrary architecture, OpenBLAS is an open source library for many of the pop-
ular architectures. Software packages such as Armadillo, LAPACK [13], LINPACK,
2
GNU Octave, Mathematica, MATLAB, NumPy, and R rely on BLAS to provide
portability and scalability. GPU-accelerated BLAS is widely used and adopted by
TensorFlow, PyTorch, Caffe, keras, etc.
BLAS has influenced many computer vendors to come up with their own imple-
mentations, e.g., MKL (Math Kernel Library) [4] for Intel processors, CuBLAS [5] for
Nvidia based GPUs, vecLib framework for Apple processors and BLIS & libFLAME
for AMD processors, etc. All these libraries are targeted to work efficiently on their
specific architectures and lack portability.
On the other hand, there are open source implementations such as GotoBLAS
[25], MAGMA [22], BLASX [55], PLASMA [17], etc. which are targeted mainly for
multicores and GPGPUs, but lack platform support. Recent implementations with
multi GPU systems have been quite promising with BLASX, which minimizes the
data transfer overhead with global memory tile structure. A study on CPU vs GPU
implementations of linear algebra [37] concluded that CPUs can be used for high
response requirements and GPUs for high parallel requirements.
Linear algebra kernels have also been accelerated by Application-Specific Inte-
grated Circuits (ASIC) [52, 46], general-purpose graphic processing units (GPGPU)
[30, 40, 53, 42] and Field Programmable Gate Arrays (FPGA) [34, 32]. ASICs are less
programmable as they target a specific problem and try to implement it efficiently.
Systolic array based solutions have also been developed with unidirectional or bidi-
rectional data flow for linear algebra kernels [21]; their major drawback is scalability
[45]. FPGAs are more flexible but have higher power requirements.
A popular alternative to FPGAs is Coarse-Grained Reconfigurable Architectures
(CGRA) which have the advantages of shorter reconfiguration time and lower power
consumption. Examples of CGRAs are REDEFINE [12], ADRES [41], DySER [26],
and LAC [47, 48, 49], where the data-path functional units can be reconfigured based
3
on the computation requirements. These architectures achieve ASIC-level efficiencies
while supporting multiple memory access patterns. For instance, LAC [49] shows
that orders of magnitude improvement in efficiency is possible with relatively simple
customizations and fine-tuning of memory hierarchy configurations. PLASTICINE
[51] is a recently proposed architecture which achieves high performance per watt
compared to FPGAs. Here both the data-path computing units and the on-chip
scratchpad access patterns are configurable.
1.2 Machine Learning
Machine learning (ML) is the scientific study of algorithms and statistical models
that computer systems use to perform a specific task without using explicit instruc-
tions and relying on patterns and inference instead [6]. Currently, the most common
structures to perform ML are Deep Neural Networks (DNNs) and Recurrent Neural
Networks (RNNs). A large number of computations in these networks rely on lin-
ear algebra kernels. Examples include matrix-vector multiplication, matrix-matrix
multiplication, Singular Value Decomposition (SVD), least squares, etc.
The neural networks learn from very large training datasets and provide recog-
nition accuracy close to or even better than human-level perception. For higher
throughput, high performance GPUs are used to accelerate the training and infer-
ence tasks of the DNNs. Even software libraries such as Caffe, TensorFlow, Pytorch,
Keras, etc. have GPU support.
However there are many other solutions that employ ASICs, SoCs, FPGAs, etc.
NeuroCGRA [31], DCNN acclerator [54], Eyeriss [19] and CATERPILLAR [39] are
some CGRA implementations that focus on the reconfigurability aspects needed for
DNNs. NeuroCGRA allows the processing elements and the network to dynamically
morph into either conventional CGRA or a neural network, depending on the hosted
4
application. Eyeriss achieves high throughput and energy efficiency by minimizing
the data movement energy cost.
There are also FPGA based accelerators for DNNs and RNNs [57, 38, 50, 18] that
are less efficient than ASIC implementations [43]. RNNFast [29] is an accelerator for
RNNs that leverages an emerging class of non-volatile memory called domain-wall
memory (DWM). LSTM based FPGAs [59, 27] have also been developed to target
low power with high accuracy.
Now-a-days, every computer vendor has dedicated a specialized processor for han-
dling machine learning tasks. Cloud TPU [7] is the custom-designed machine learn-
ing ASIC by Google that powers many of its products, including Translate, Photos,
Search, Assistant, and Gmail. Apple has Neural Engine [8, 9] to support its machine
learning, and Qualcomm has its own AI engine [10] to accelerate AI-enabled user ex-
periences on Qualcomm Snapdragon mobile platforms, etc. Quite recently, Pytorch
also announced its TPU (Tensor Processing Unit) support.
1.3 Motivation
Linear algebra and machine learning algorithms are important data processing
kernels in scientific computing, statistics, optimization, etc. Due to their inherent
high complexity and different data access patterns, these kernels form the bottlenecks
in real-time applications.
The goal of this work is to provide accelerator-like solutions for several key linear
algebra and machine learning algorithms. We choose a reconfigurable architecture,
Transformer, which combines the programmability of a GPP with the performance
efficiency of an ASIC. It provides run-time reconfigurable memory as well as processor
to memory interconnection patterns.
5
1.4 Thesis Contributions
The main contribution of this thesis is efficient implementation of several linear
algebra and machine learning routines on a massively parallel run-time reconfigurable
architecture, Transformer. We focus on the implementation of certain key linear al-
gebra kernels, namely, Triangular Matrix Solver (TRSM), LU Decomposition (LUD),
QR Decomposition (QRD), Matrix Inversion, and machine learning kernels, such as,
an LSTM (Long Short Term Memory) cell and a GRU (Gated Recurrent Unit) cell,
on Transformer. We also map a full scale neural network used in recommendation
systems onto this architecture.
Transformer is a multi-core architecture designed at the University of Michigan,
Ann Arbor. It consists of MIMD/SPMD-style General-purpose Processing Elements
(GPEs) clustered into tiles, where each tile has its own Local Control Processor (LCP).
The GPEs store local data in L1 banks while the tiles share data via the L2 banks.
Both L1 and L2 caches can be configured into shared (S) or private (P) modes. Each
processing element is an ARM core which implements Thumb Instruction Set Archi-
tecture (ISA). The architecture is modeled using the Gem5 architectural simulator.
We evaluate the performance of the kernel algorithms with respect to execu-
tion times, giga-operations per second per Watt (GOPS/W), giga-floating-point-
operations per second per Watt (GFLOPS/W) for 14nm technology node. We also
present the L1/L2 cache hit rates.
1.4.1 Efficient Mapping of Linear Algebra Kernels
For each of the linear algebra kernels, we focus on different scheduling schemes
and different Transformer L1 and L2 cache configurations (S or P). We show that
each kernel achieves high performance for a certain cache configuration and that this
6
cache configuration can change when the matrix size changes, or even the scheduling
pattern changes (for very large sizes), thus enabling run-time reconfiguration.
For Triangular Matrix Solver (TSRM), which solves for X in AX = B, each GPE
is assigned the task of computing one or more columns of X, depending on the size
of the matrix. We see that for small matrix sizes L1P, L2P has the best performance
while for larger matrix sizes (1024× 1024), L1S, L2P does better.
LU Decomposition (LUD) of a matrix involves factorizing a square matrix into a
lower triangular matrix L and an upper triangular matrix U . Unlike TRSM, where
the computations are parallel and independent across GPEs, the initial approach of
LUD (v1) is to parallelize each column-update, that involves computing one column
of L and one row of U among all GPEs. A blocked LUD (v2) is also implemented
which is a combination of LUD v1, TRSM and GEMM (Dense General Matrix Matrix
Multiplication). Here, we see that for small matrix sizes, L1S, L2S has the best
performance while for larger matrix sizes 1024× 1024, L1S, L2P does better.
QRD decomposes a matrix into an orthogonal matrix and an upper triangular
matrix with successive applications of Givens rotation. Here, one or more columns
is assigned to each GPE. The workload distribution is uneven even among GPEs in
a tile, so we use only one tile implementation to avoid the overhead of cache flushes.
We observe that L1P, L2S performs better for matrices of sizes 64× 64 to 256× 256.
The matrix inversion is a combination of two kernels, namely, LUD and TRSM.
Since the best cache mode for LUD is different from the best cache mode for TRSM, we
utilize the runtime reconfiguration feature of Transformer. For instance, for N = 512,
LUD operates in L1S, L2P mode and TRSM operates in L1P, L2P mode. Use of
reconfiguration reduces the execution time from 32.93ms to 27.32ms.
All four matrix algebra kernels have good performance. Cycle-accurate simula-
7
tions using Gem5 show that the peak performance in the 14nm node is 97.5, 62.2,
133.0 and 84.5 GFLOPS/W for TRSM, LUD, QRD and Matrix Inverse, respectively.
1.4.2 Efficient Mapping of RNN Kernels
We focus on LSTM (Long Short Term Memory) and GRU (Gated Recurrent Unit)
cell in recurrent neural networks. We use the LSTM cell for training; the GRU cell is
only used for inference. We find that the majority of the operations here are matrix-
vector products. We distribute the workload in such a way that each GPE works on
different blocks of data. As a result, L1P, L2P has the best performance for different
parameter sizes. Cycle-accurate simulations using Gem5 show a peak performance of
44.06 and 69.3 GFLOPS/W for LSTM and GRU, respectively.
1.4.3 Efficient Mapping of Recommender Systems Engine
We also map a neural network based recommender systems engine that takes in a
list of movies that a user has liked and tries to predict a score for all other movies. It
is an autoencoder based system where the input is first encoded into lower dimensions
and then decoded back to original dimension so that the network learns something
new.
The network consists of an embedding layer, ReLU, 1-D batch normalization,
dropout and fully connected layers. The inputs are read from DRAM and are stored
back into DRAM after each layer computation. We use only one cache configuration
where both the L1 and L2 banks are shared. The computations in each layer are
mapped such that the GPE utilization is high. For example, ReLU and dropout are
mapped such that each GPE processes equal number of rows of intermediate outputs.
Our results show that the fully connected layer (backward pass) has a peak perfor-
8
mance of 201.65 GFLOPS/W, followed by batchnorm+ReLU (backward pass) with
182.83 GFLOPS/W and dropout (forward pass) with 165.57 GFLOPS/W. The exe-
cution time of the backward pass is larger compared to the forward pass. The per-
formances of the forward and the backward passes are 43.48 and 61.86, GFLOPS/W
respectively, and the overall performance is 54.55 GFLOPS/W.
1.5 Organization
The rest of the thesis is divided into the following chapters.
• Chapter 2 includes a brief introduction of linear algebra and machine learning
kernels.
• Chapter 3 describes the multicore architecture, Transformer, and mapping of
the different linear algebra and machine learning kernels onto Transformer.
• Chapter 4 summarizes performance results of all kernels. It also lists the best
cache configuration for each kernel, when applicable.
• Chapter 5 concludes the thesis and also presents extensions of this work.
9
Chapter 2
BACKGROUND: KERNEL ALGORITHMS
This chapter gives a brief introduction to the matrix algebra kernels (Section 2.1),
machine learning kernels (Section 2.2) and neural network application (Section 2.3)
under study. The matrix algebra kernels are Triangular Matrix Solver (TRSM),
LU Decomposition (LUD), QR Decomposition (QRD) and Matrix Inversion. The
Machine Learning kernels include an LSTM (Long Short-Term Memory) cell, and a
GRU (Gated Recurrent unit) cell. The neural network application is a recommender
systems engine.
2.1 Matrix Algebra Kernels
2.1.1 Triangular Matrix Solver (TRSM)
This Level-3 BLAS (Basic Linear Algebra Subprograms) routine solves a system
of linear equations of the form AX = B, where A is an upper or lower triangular
matrix, andX & B are dense matrices. An example is shown in Figure 2.1. Triangular
systems help in solving Matrix Inverse using LUD, QRD or Cholesky factorization.
Figure 2.1: Triangular Matrix Solver
10
Depending on whether A is an upper or a lower triangular matrix, this algorithm
employs backward or forward substitution. In both cases, a column of X can be
computed using matrix A and a column of B. The pseudo-code for TRSM is shown
in Algorithm 1.
Algorithm 1 Triangular Matrix Solver (Lower Triangular)
for k = 1 : N do
X1k = B1k / A11
for i = 2 : N do
s = Bik
for j = 1 : N − 1 do
s = s− Aij * Xjk
end for
Xik = s / Aii
end for
end for
Columns of X can be solved independently using columns of B but each column
has its own serial computational dependency. Hence, this can be solved by unrolling
the outermost loop.
2.1.2 LU Decomposition (LUD)
LUD involves factorizing a square matrix A into a product of a lower triangular
matrix, L and an upper triangular matrix, U , given by A = LU . LUD is used for
inverting matrices as well as for solving a system of linear equations. The pseudo-code
for LUD is shown in Algorithm 2. Here, LUD is computed by Gaussian elimination
where U overwrites A; L is stored separately.
11
There is a serial dependency within each column and across columns that limits
the data parallelism. In addition to storing L and U separately, there are other ways
where the LUD can be done in-place, that means A is overwritten by L and U by
keeping the diagonal elements as 1.
Algorithm 2 LU Decomposition
for k = 1 : n do
for j = k : n do
Ljk = Ajk / Akk
end for
for j = k + 1 : n do
for i = k + 1 : n do
Aij = Aij − Ljk * Aki
end for
end for
end for
For higher utilization, an alternative method (v2) [23] is to divide the matrix
into blocks and solve using a combination of LUD, GEMM (General Matrix - Matrix
multiplication) and TRSM calls. Dividing the matrix into four blocks makes use of
the following set of computations,A11 A12
A21 A22
 =
L11 0
L21 L22

U11 U12
0 U22

LUD: A11 = L11U11;
TRSM: A21 = L21U11;
TRSM: A12 = L11U12;
GEMM followed by LUD: A22 − L21U12 = L22U22
12
This method can be extended to support dividing the matrix into more blocks
such as 9, 16, 25 blocks and so on.
2.1.3 QR Decomposition (QRD)
QRD involves factorizing a square or a non-square matrix A into a product of an
orthogonal matrix Q and an upper triangular matrix R, given by A = QR. Here,
R overwrites A. Each element below the diagonal is annihilated from the last row
using Given’s rotation [24]. The construction of Given’s rotation matrices is shown
in Figure 2.2, where a, b are the elements of the rows, one of which is annihilated.
The entire row is then updated using the generated Givens matrix. Multiplying I
(Identity matrix) with the same rotation matrices yields Q.
Figure 2.2: Givens Rotation
The pseudo-code for QRD is shown in Algorithm 3. The annihilation starts from
the first column and proceeds to the last column. Within a column, the annihilation
is done in reverse order. The row.rot step is performed using the Givens matrix.
Algorithm 3 QR Decomposition
for j = 1 : N do
for i = m : −1 : j + 1 do
[c, s] = Givens(A(i− 1, j), A(i, j))
A(i− 1 : i, j : N) = row.rot(A(i− 1 : i, j : N), c, s)
end for
end for
13
2.1.4 Matrix Inversion
The inverse of a matrix is a matrix which when multiplied by the original matrix
results in an identity matrix,
AA−1 = I (or) A−1A = I
where A, A−1 and I are square matrices.
A singular matrix having determinant zero is not invertible and does not have
an inverse. Pseudo inverse is used instead for these cases and also for rectangular
matrices. The analytical solution for matrix inversion is given by Equation 2.1, where
adj(A) is the adjoint matrix of A and det(A) is the determinant of matrix A.
A−1 =
1
det(A)
∗ adj(A) (2.1)
There are many methods to invert a matrix involving either a LUD, QRD or
Cholesky decomposition followed by triangular solvers. Here, we use a combination
of LUD and TRSM to compute A−1. The steps are:
A = LU LU Decomposition
LY = I TRSM: Forward Substitution
UX = Y TRSM: Back Substitution
The above example is shown by using LU decomposition with two triangular
matrix solvers.
2.2 Machine Learning Kernels
2.2.1 Long Short Term Memory (LSTM) cell
LSTM is a special kind of Recurrent Neural Network (RNN) that is capable of
learning long-term dependencies. It consists of different gates that allow or reject
14
information that passes through the network. These include an input gate it, forget
gate ft, and output gate ot. In addition, an LSTM has a memory cell and a hidden
state. A basic LSTM cell [28] is shown in Figure 2.3.
Figure 2.3: An LSTM Cell (Adapted from [28])
The LSTM gates are given by
ft = σ(Wfxt + Ufht−1 + bf ) (2.2)
it = σ(Wixt + Uiht−1 + bi) (2.3)
ot = σ(Woxt + Uoht−1 + bo) (2.4)
c˜t = tanh(Wcxt + Ucht−1 + bc) (2.5)
where ft, it, ot and at are the forget gate, input gate, output gate and input activation,
respectively. σ is the sigmoid and tanh is the hyperbolic tangent function. The
internal state and hidden state are defined by
ct = ft ⊙ ct−1 + it ⊙ c˜t (2.6)
ht = ot ⊙ tanh(ct) (2.7)
15
where ⊙ represents element wise multiplication.
The backward pass of an LSTM consists of calculating the gradients of all the
gates, states and parameters. This can be divided into four phases. The first phase
computes δht and δct given by,
δht = ∆t +∆ht (2.8)
δct = δht ⊙ ot ⊙ (1− tanh2(ct)) + δct+1 ⊙ ft+1 (2.9)
where ∆t is the difference in the output and ∆ht is the output from the previous
state. The second phase computes the gradients of all the gates given by,
δc˜t = δct ⊙ it ⊙ (1− c˜2t ) (2.10)
δit = δct ⊙ at ⊙ it ⊙ (1− it) (2.11)
δft = δct ⊙ ct−1 ⊙ ft ⊙ (1− ft) (2.12)
δot = δht ⊙ tanh(ct)⊙ ot ⊙ (1− ot) (2.13)
The third phase is the calculation of the input gradient, δxt and gradient of the
previous state, ∆ht−1 given by,
δxt = W
T
i δit +W
T
f δft +W
T
o δot +W
T
c δc˜t (2.14)
∆ht−1 = UTi δit + U
T
f δft + U
T
o δot + U
T
c δc˜t (2.15)
16
The final phase is to calculate the gradients of the internal parameters. They are,
δW =
T∑
t=0
(δitxt + δftxt + δotxt + δc˜txt) (2.16)
δU =
T−1∑
t=0
(δit+1ht + δft+1ht + δot+1ht + δc˜t+1ht) (2.17)
δb =
T∑
t=0
(δit+1 + δft+1 + δot+1 + δc˜t+1) (2.18)
where T is the number of states. Finally, these parameters can be updated using
Stochastic Gradient Descent (SGD) or any other standard optimizers.
2.2.2 Gated Recurrent Unit (GRU) cell
GRU is similar to LSTM, but simpler. It consists of only two gates, an update
gate and a reset gate. It does not have an internal state [20]. The block diagram of
a GRU cell is shown in Figure 2.4.
Figure 2.4: A Basic GRU Cell [20]
The forward pass is given by the following equations,
zt = σ(Wzxt + Uzht−1 + bz) (2.19)
17
rt = σ(Wrxt + Urht−1 + br) (2.20)
c˜t = tanh(Wcxt + Uc(rt ⊙ ht−1) + bc) (2.21)
where zt, rt and at are the update gate, reset gate, and input activation, respectively.
W , U and b represent the learnable parameters. The hidden state is given by
ht = (1− zt)⊙ ht−1 + zt ⊙ c˜t (2.22)
2.3 Recommender Systems Engine
The recommender system, adapted from [33] is a neural network that takes in a list
of movies that a user has liked and tries to predict a score for all other movies. As it is
an autoencoder-based system, the input is first encoded into a hidden representation
(hidden_dim), which has a dimension less than the input dimension (emb_dim), and
then decoded back into a higher dimension (output_dim). This helps the neural
network to learn something new rather than just learning an identity mapping. The
architecture for the neural network is shown in Figure 2.5.
The network consists of an embedding layer, ReLU, 1-D batch normalization,
dropout and fully connected layers. The user information is first embedded into
embedding_dimension, then to hidden_dimension and finally to output_dimension
(same as the number of movies). During the training phase, a group of users form
a batch and after each batch’s forward and backward pass, the weights are updated
using Adam Optimizer. The output is sent through a sigmoid layer to measure the
binary cross entropy loss. Next we provide a brief description of all the layers.
18
Figure 2.5: Neural Network for Recommender Systems [33]
2.3.1 Embedding Layer
The embedding layer utilizes a weight matrix called the embedding matrix. The
number of rows in this matrix = number of movies, N and number of columns =
embedding dimension, D.
In the forward pass, the input to the embedding layer is the list of movies a user
has rated, and the output is the sum of all the D-dimensional vectors that resemble
the corresponding rated movies.
In the backward pass, the gradient of a row of the embedding matrix is the sum of
the D-dimensional vectors (of the incoming gradients). An entry in the D-dimensional
vector is nonzero if a user has rated that movie. The operation in both the forward
19
and backward passes is a D-dimensional vector addition.
2.3.2 Rectified Linear Unit (ReLU)
In the forward pass, ReLU adds a non-linearity by passing only the values which
are greater than zero, rest all are assigned zeros as shown in Equation 2.23, where
x is an input to the neuron. As it is an element wise operation, the dimensions are
unchanged.
f(x) = max(0, x) (2.23)
Similarly, in the backward pass, ReLU is computed by multiplying the incoming
gradient with the ReLU mask (comprised of 1’s and 0’s based on the values passed
initially) and sent to the previous layer.
2.3.3 1-D Batch Normalization
This layer has two weight vectors, γ and β. The length of the vectors for the first
batchnorm layer is emb_dim and the length of the vectors of the second batchnorm
layer is hidden_dim.
In the forward pass, the mean and variance are computed for each dimension
which are then used to find the normalized values, Xˆ, as shown in Equation 2.24,
where ϵ = 0.00001. Then, the normalized values of each user are multiplied with γ
followed by addition with β to get Y shown in Equation 2.25. The multiplication is
a matrix-vector product and the addition is a vector sum.
Xˆ =
X − µ√
σ2 + ϵ
(2.24)
Y = γXˆ + β (2.25)
20
The backward pass comprises of three gradient calculations which are the input
gradient, γ gradient and β gradient. The input gradient is calculated using Equation
2.26, where ⊙ indicates element-wise multiplication and B is the batch size. The γ
and β gradients are given by Equations 2.27 and 2.28, respectively.
∂J
∂X
=
γ ⊙√σ2 + ϵ
B
[
−∂J
∂γ
⊙ Xˆ +B ∂J
∂Y
− ∂J
∂β
]
(2.26)
∂J
∂γ
=
B∑
i=1
∂J
∂yi
.xˆi (2.27)
∂J
∂β
=
B∑
i=1
∂J
∂yi
(2.28)
2.3.4 Dropout
In the forward pass, each element is either passed or dropped with a dropout
probability, p given by the user. The passed values are scaled by 1 − p. Similar
to ReLU, a dropout mask (comprised of 1’s and 0’s) is generated which is used in
backward pass to multiply with the incoming gradient.
2.3.5 Fully Connected Layer
The computation in the forward pass is a matrix-vector multiplication followed
by a vector addition shown in Equation 2.29.
Y = W TX + b (2.29)
The backward pass comprises of three gradient calculations that include the input
gradient, weight gradient and bias gradient given by Equations 2.30, 2.31 and 2.32.
The input gradient is a matrix multiplication of the incoming gradient with the weight
matrix. The weight matrix gradient is a matrix multiplication of the transposed
21
weight gradient with its input (forward pass). The bias gradient is the addition of
incoming gradient across its dimensions.
∂J
∂X
=
∂J
∂Y
.W (2.30)
∂J
∂W
=
B∑
i=1
(
∂J
∂yi
)T
.xi (2.31)
∂J
∂b
=
B∑
i=1
(
∂J
∂yi
)T
(2.32)
2.3.6 Binary Cross Entropy Loss
Binary cross entropy loss is a sigmoid layer followed by cross entropy loss. The
equation for sigmoid is shown in Equation 2.33.
f(x) =
1
1 + e−x
(2.33)
The term binary is used because of two classes 0 or 1. So when using this loss,
the formulation of Cross Entropy loss for binary problems is often given by,
CE =
∑
i
[
− tilog[f(xi)]− (1− ti)log[1− f(xi)]
]
(2.34)
where f(x) is the output and t is the target. For each user, cross entropy is computed
on all movies.
2.3.7 Adam Optimizer
The Adam optimizer [35] is an extension of stochastic gradient descent used to
update the network weights in the training phase. Instead of adapting the parameter
learning rates based on the average first moment (the mean) as in RMSProp, Adam
also makes use of the average of the second moments of the gradients (the uncentered
variance).
22
Chapter 3
MAPPING KERNELS ON RECONFIGURABLE MULTICORE
ARCHITECTURE
This chapter presents a brief description of the reconfigurable multicore architecture
that was used in this study (Section 3.1), followed by procedures to map matrix alge-
bra kernels (Section 3.2), machine learning kernels (Section 3.3), and a recommender
systems engine (Section 3.4) onto this architecture.
3.1 Transformer - A Reconfigurable Multicore Architecture
Transformer is a scalable, energy-efficient, reconfigurable multicore architecture
that is being designed by Subhankar Pal and other researchers at the University of
Michigan. It has a structure similar to a recent sparse matrix multiplication acceler-
ator [44]. It consists of many in-order General-purpose Processing Elements (GPEs),
distributed on-chip cache memories, crossbars and a high-bandwidth DDR interface
through a two-level on-chip memory hierarchy consisting of reconfigurable caches and
crossbars. Fig.3.1 shows the Transformer architecture. The following is a brief de-
scription of Transformer.
General-Purpose Processing Element (GPE): A GPE is a single-issue, 4-
stage, in-order core that is optimized for energy-efficient computation. It has signifi-
cantly lesser silicon footprint and has lower power consumption than modern proces-
sors. This allows Transformer to be built with many such cores. Each GPE executes
instructions in ARM Thumb ISA. They do not support SIMD vector instructions
but have a Floating Point Unit, capable of executing single-precision floating-point
operations.
23
Figure 3.1: Transformer Architecture (Adapted from [44])
Local Control Processor (LCP): The GPEs within a tile are managed and
scheduled to work through the Local Control Processor. Each LCP has the same
micro-architecture as a GPE and executes instructions from the Thumb ISA. The
LCP has a set of hardware FIFO work queues through which 32-bit packets are
pushed to the GPEs by the LCP. The status queues inturn send data back to the
LCP. These queues are private to every GPE.
Reconfigurable Cache: The architecture has two levels of reconfigurable data
cache (R-D cache) banks - L1 and L2 (within GPEs and LCPs respectively). Every
level consists of an SRAM memory array connected to the control units. They can
operate in each of the following modes:
1. Cache: Every SRAM memory bank functions as a traditional write-back cache
with LRU replacement policy. It can be either shared or private cache.
2. Scratchpad: The SRAM memory bank can also be configured as software-
managed scratchpad.
24
Sychronization ScratchPad: Transformer has a global scratchpad memory
for synchronization. This allows for the implementation of software coherence and
standard primitives such as locks, condition variables, barriers, and semaphores. Most
algorithm implementations only need to use barriers for synchronization before and
after a kernel and the need to synchronize during execution can usually be avoided.
3.2 Mapping: Linear Algebra Kernels
The following sub sections give a brief description of the kernel implementations
on the Transformer.
3.2.1 Triangular Matrix Solver (TRSM)
The A and B matrices are initially stored in the DRAM and copied into the
L1 banks through the L2. As the columns of X can be computed in parallel in
either a forward or a backward substitution, and since there is a serial dependency
of computations within a column, each GPE is assigned the task of computing one
or more columns of X (depending on the size of the matrix). Column B and matrix
A are copied into each L1 bank if a private cache is used or a single copy is shared
among all the L1 banks if a shared cache is used. Each GPE computes the X values in
a column one-by-one as shown in Algorithm 1 and stores them in the L1 cache. After
an entire column of X is solved, the updated values are flushed to DRAM through
L2.
When the size of the matrix is very large, the computations within a column can
be partitioned into n blocks. In such a case, the L1 cache can hold only a part of A, B
and X. The X values from block (i) are used to compute the values of block (i+ 1),
and so on. This method requires flushing data into DRAM / L2 after computation
of every block.
25
3.2.2 LU Decomposition (LUD)
Unlike TRSM, where the computations on columns are done in parallel, in LUD
there is a computational dependency across columns. Assuming that each row is
assigned to a GPE, the initial approach (v1) is to parallelize each column-update,
that involves computing one column of L and one row of U per update.
Consider a matrix of size N = 64. In the first column-update, the first row is
used as a pivot to compute the multiplication coefficients for all the rows, as shown
by Loop 1 in Fig.3.2. These coefficients are then used to update the matrix A, shown
in Loop 2, by GPEs 2 through 64. The updated pivot row is the first row of U , and
the updated A values along the first column form the first column of L. Now, for
the second column-update, the second row is used as pivot and the same procedure
is followed to get the second row of U and second column of L. Here, only GPEs 3
through 64 are active. This process repeats until we reach A64,64. Figure 3.3 illustrates
Figure 3.2: Pseudo Code for LUD when A is of Size N ×N
the first three column-updates. The updated values of each column-update stay in
L1 and are flushed to DRAM by performing a cache flush after every column-update.
The main drawback of this mapping is that the rows above the pivot row are
not used in further computations and so the GPEs assigned to those rows become
idle. Thus, the number of idle GPEs increase with every column-update. In a 4-tile
Transformer design, if each tile is assigned N/4 consecutive rows, then GPEs in tile 0
26
Figure 3.3: Mapping of LU for the First Three Column-Updates (Left to Right)
become inactive after N/4 rows have been processed. Even inside a tile, the workload
distribution is uneven.
Assigning non-consecutive rows to tiles may help all the tiles remain active until
the end, but the overall GPE utilization still stays low. Instead, assigning consecutive
rows to consecutive tiles makes sure that the GPEs stop working one after the other
after they have processed their rows. This technique can be utilized to power down
tiles, thereby saving power.
For higher utilization, an alternative method as mentioned in Section 2.1.2 is to
divide the matrix into blocks and solve using a combination of LUD, GEMM (General
Matrix - Matrix multiplication) and TRSM calls. The LUD step (A11 = L11U11), is
implemented using the above method (v1). Since the LUD works on a smaller matrix,
the imbalance in the workload distribution is not as much.
Matrix-matrix multiplication is computed by Transformer in the following way.
Figure 3.4 shows how the matrices are divided into blocks of 16 for the matrices of
size 64×64. Each GPE works on 16 elements of one row of A and 16× 16 block of B.
In every tile, the assignment of GPEs is the same, so each tile maintains an individual
copy of B. For instance, GPE0 of tile0 works on the first set of 16 elements of rows
1, 5, 9, 13 of A and its assigned block of 16×16 elements of B, GPE1 of tile0 works
on the second set of 16 elements of A for the same set of rows of A (1, 5, 9, 13) to
generate the partial sums. GPE2 of tile0 works on the third set of 16 elements and
27
GPE3 of tile0 works on the fourth set of 16 elements for the same rows of A and
assigned block of B. GPEs 4, 5, 6, 7 of tile0 work on the 16 elements of rows 2, 6, 10,
14, GPEs 8, 9, 10, 11 work on the 16 elements of rows 3, 7, 11, 15 and GPEs 12, 13,
14, 15 work on the 16 elements of rows 4, 8, 12, 16. GPE0 of tile1 works on the first
set of 16 elements of rows 17, 21, 25, 29 of A and the assigned blocks of B; GPE1 of
tile1 works on the second set of 16 elements for the same set of rows. As each GPE
works on a different block of data, there is no data overlap and the partial sums are
updated in their corresponding L1 memory locations to get the final C.
Figure 3.4: Mapping of Matrix-Matrix Multiplication
3.2.3 QR Decomposition (QRD)
QRD has computational dependency along rows as well as along columns. Since
the computations across the columns are independent of each other, each column is
assigned to a GPE. The computations start by first annihilating the last value of the
first column and proceeding to annihilate elements above and to the right.
Consider scheduling of an example matrix of size 8 shown in Figure 3.5. The
numerical entries show the cycle at which each sub-diagonal entry can be annihilated.
28
The annihilation is started by GPE0 of tile 0 at cycle 1 where the c & s values are
computed using the last two rows. The same process is repeated by GPE0 for the
next row at cycle 2. At cycle 3, the GPE1 starts working on the next column starting
from the last row and GPE0 continues to work on next row.
QRD has serial computational dependency across columns as well as rows. As
a result, the data parallelism is limited. The maximum parallelism is N/2 at stage
N − 1 for N ×N matrix.
Figure 3.5: Cycle at Which Each Sub-Diagonal Entry can be Annihilated for N=8
For the case when N = 64, and there are 64 GPEs, every GPE is assigned one
column and follows the scheduling scheme shown in Figure 3.5. All GPEs in tile 1
have to wait till tile 0 has annihilated the first 16 entries of a row. In the same way
tile 2 has to wait till tile 1 has annihilated the next 16 entries (a total of 32). As
each GPE is responsible for annihilating one column of A, all the rows are updated
multiple times across all the tiles. To ensure that tiles 1 and 2 work on the updated
rows, a L1 cache flush is required after every row update. So in this mapping a total
of 48+ 32+ 16 cache flushes are required. For matrices with larger sizes, the number
of cache flushes are even higher. Each cache flush costs 0.2 - 0.4 µs and so in order
to avoid overhead of cache flushes, we chose to implement QRD using only one tile
of 16 GPEs.
29
In the single tile implementation, after the annihilation of the first 16 columns is
done, annihilation on the next set of 16 columns start i.e., each block is independent of
the other. We ensure that the serial dependency across columns is respected by storing
the status of the columns (annihilated or not) using the synchronization scratchpad.
Matrix data is read from the L1 banks and the updated row values are stored back
into the L1 banks. In the end, after the entire annihilation, the Q and R values are
read out from the L1 banks.
3.2.4 Matrix Inversion
As mentioned in Section 2.1.4, we use the LU decomposition and Triangular solvers
to find the inverse of a matrix. Both TRSM and LUD are mapped as described in
Sections 3.2.1 and 3.2.2, respectively.
3.3 Mapping: Machine Learning Kernels
3.3.1 LSTM Cell
We consider the computations in an LSTM cell for both forward and backward
pass. In all cases, the input data and weight matrices are stored in DRAM and read
into L1 banks before computation. After the computations, the results are flushed
back to DRAM through L2.
The computations in an LSTM cell are divided into phases. The forward pass
consists of two phases: the first phase is the calculation of all the 4 gates given by
Equations 2.2 to 2.5. Computations of the 4 gates are assigned to 4 tiles (1 gate
per tile). The computations in each gate are matrix-vector multiplication followed by
either a sigmoid or a hyperbolic tangent function.
Matrix-vector multiplication is computed by Transformer in the following way.
30
Figure 3.6: Mapping of Matrix-Vector Multiplication
Figure 3.6 shows how the matrices are divided into blocks of 16 for matrix of size
64×64. Each GPE works on 16 elements of row of A and 16 elements of B. For
instance, GPE0 of tile0 works on the first set of 16 elements of rows 1, 5, 9, 13 of A
and first 16 elements of B; GPE1 of tile0 works on the second set of 16 elements for
the same set of rows of A and the second set of 16 elements of B and so on. GPE2 of
tile0 works on the third set of 16 elements and GPE3 of tile0 works on the fourth set
of 16 elements for the same rows. GPEs 4, 5, 6, 7 of tile0 work on the 16 elements of
rows 2, 6, 10, 14, GPEs 8, 9, 10, 11 work on the 16 elements of rows 3, 7, 11, 15 and
GPEs 12, 13, 14, 15 work on the 16 elements of rows 4, 8, 12, 16. GPE0 of tile1 works
on the first set of 16 elements of rows 17, 21, 25, 29 of A and first 16 elements of B;
GPE1 of tile1 works on the second set of 16 elements for the same set of rows and
so on. The final C outputs are assigned a particular L1 memory location. After each
product computation by the GPEs, the partial sums stored in the memory location
are updated.
The second phase is the calculation of internal state and hidden state given by
Equations 2.6 and 2.7. The element-wise computations are divided equally among
31
GPEs.
The backward pass consists of four phases. In the first phase, we assume the
incoming gradient δht, is available. Computation of δct is distributed equally among
all GPEs. The second phase consists of calculation of gradients of all the gates. Each
tile is assigned the task of computing the gradient of a particular gate, so 1 gate
per tile. The gradient computation is again a matrix-vector multiplication which is
implemented by dividing the matrix into blocks of 16 and assigning each block to a
GPE. The third phase is the calculation of the input gradient, δxt which is calculated
using all the gates. The gate-level computations are mapped to each tile, and are
added to get the final input gradient.
The final step is the calculation of the weight gradients and bias gradients. The
mapping is again a matrix-vector multiplication divided into blocks of 16. Each set
of weight and bias gradients of a gate are assigned to one tile.
3.3.2 GRU Cell
For GRU, we only perform inference i.e., forward pass. The input data and weight
matrices are stored in DRAM and the results are flushed after computation of every
phase. Unlike LSTM, where each gate is mapped to each tile, here, the computation
of each gate, is divided across all tiles. This is again a matrix-vector multiplication
performed in blocks of 16 as shown in Figure 3.6. Two gates are computed one after
the other. Finally, the hidden state computation is divided across all GPEs.
3.4 Mapping: Recommender Systems Engine
The mapping of all layers in the neural network for the recommender system
(Figure 2.5) is discussed below.
For all layers, the weights and input data are initially stored in DRAM. They
32
are copied into L1 through L2 before computation and stored back into DRAM after
computation. So a cache flush has to be performed after the computation of every
layer.
3.4.1 Embedding layer
The embedding matrix is of size N ×D, where N is the number of movies, and D
is the embedding dimension. In our evaluation, N = 10768 and D = 800. We assume
that every user has rated at least a movie. The values of the embedding matrix are
initialized randomly using the Normal(0, 0.01) distribution. In the forward pass, each
GPE computes the D-dimensional vector of each user by adding all the D-dimensional
vectors corresponding to the movies the user has rated. For forward pass, the output
of the embedding layer is a matrix of size B × D, where B is the batch size. We
consider a batch of B = 256 users, so vector additions corresponding to 4 users are
assigned to each GPE.
Similarly, in the backward pass, the computations of 10,768 movies are distributed
to 64 GPEs. Each GPE sums the D-dimensional vectors (of the incoming gradients)
of the users that rated that movie and the parameters are updated by the Adam
optimizer to generate a new embedding matrix.
3.4.2 ReLU
Both the forward and backward pass ReLU computations are distributed across
all GPEs. The ReLU operation is performed on each element of a row of size D.
Since each GPE is assigned 4 users, it computes on 4 rows of the embedding layer
output matrix.
33
3.4.3 1-D Batch Normalization
The input and output of the first batch normalization layer (BatchNorm1 in Figure
2.5) input and output is of size B×emb_dim, and the input and output of the second
one (BatchNorm2 in Figure 2.5) is of size B×hidden_dim, where B is the batch size.
The γ’s are initialized with 1’s and the β’s with 0’s. In the forward pass, processing
is done using columns, and each GPE is assigned emb_dim/64 or hidden_dim/64
columns. Each GPE computes mean, variance, normalized values Xˆ and uses these to
compute Y (see Equation 2.25). Y is computed using matrix vector product followed
by vector addition.
In the backward pass, the input gradient (Equation 2.26) is calculated by assigning
multiple users to a GPE, and the γ (Equation 2.27) and β (Equation 2.28) gradients
are calculated by assigning multiple columns to each GPE, where each column con-
tains gradient of all users in a batch.
3.4.4 Dropout
Similar to ReLU, dropout in both the forward and backward pass are implemented
by mapping multiple rows to each GPE.
3.4.5 Fully Connected Layer
The weight matrices of the hidden layer and the output layer are initialized
randomly using Uniform(min = −scale,max = scale) distribution where scale
is 1/emb_dim and 1/hidden_dim, respectively. Their sizes are (hidden_dim x
emb_dim) and (output_dim x hidden_dim) and are initially stored in transposed
form. The bias vector of the hidden layer is initialized with zeros and that of the
output layer with bias_offset.
34
The forward pass is a matrix-matrix multiplication (mapped as described in sec-
tion 3.2.2) of the previous layer’s output with the transposed weight matrix followed
by bias addition. The matrix-matrix multiplication is distributed across all 64 GPEs.
Each GPE works on blocks of size 16 × 16. The partial sums are added after every
block’s computation.
The backward pass comprises of three gradient calculations that include the input
gradient (Equation 2.30), weight gradient (Equation 2.31) and bias gradient (Equa-
tion 2.32). For the matrix dimensions to match, the incoming gradient ∂J
∂Y
, and the
weight matrix, W have to be transposed. The transpose operation is done by copying
the rows and storing it as columns in separate memory locations. It has to be done
for both the fully connected layers. Both the matrix-matrix multiplications for input
gradient and weight matrix gradient are mapped as in forward pass i.e., computed
in blocks of 16. For the bias gradient, each GPE sums the incoming gradient
(
∂J
∂Y
)
along the columns.
3.4.6 Binary Cross Entropy Loss
Each GPE computes the loss for each user using Equation 2.34. After all the
losses are computed by all the GPEs, LCP0 averages them to get the final loss.
3.4.7 Adam Optimizer
The Adam optimizer updates the embedding matrix, γ’s, β’s, weight matrices
and biases. The embedding matrix and weight matrices are updated by assigning
multiple rows to a GPE. The γ’s, β’s, and bias vectors are updated by dividing the
computations equally among GPEs. This module was implemented by Siying Feng
from University of Michigan.
35
Chapter 4
RESULTS
We present Transformer implementation results for different linear algebra and ma-
chine learning kernels. We evaluate performance on Transformer using the following
metrics: execution time, Giga Operations Per Second (GOPS), Giga Floating-point
Operations per second (GFLOPS), GOPS/W, GFLOPS/W. We also present the L1
and L2 cache hit rates.
4.1 Transformer Configuration
Transformer is modeled using the Gem5 [15, 16] architectural simulator. The
simulations are run on an Intel Core i5-4690 CPU @ 3.50GHz × 4, 32GB RAM,
512 GB storage and Ubuntu 16.04. The architectural configuration used here has 4
tiles with 16 GPEs per tile running on 1 GHz. It has two levels of cache hierarchy,
where the L1 cache size is fixed at 4kB and the L2 cache size is fixed at 64kB. The
simulation statistics such as execution time, power, GOPS, GFLOPS and cache misses
are obtained from Gem5 statistics file. The cache configurations used here are:
1. L1 shared cache, L2 shared cache (L1S, L2S)
2. L1 shared cache, L2 private cache (L1S, L2P)
3. L1 private cache, L2 shared cache (L1P, L2S)
4. L1 private cache, L2 private cache (L1P, L2P)
All the configuration parameters have been listed in Table 4.1.
36
General Processing Element(GPE) 1-issue, 4-stage, in-order CPU @ 1.0 GHz
Number of GPEs per tile 16
Number of tiles 4
Number of L1 cache banks per tiles 16
Number of L2 cache banks 4
Size of each L1 cache bank 4 KB
Size of each L2 cache bank 64 KB
DRAM 4 GB
Work/Status queue 32 bit 4-entry FIFO buffer
Sync SPM 16 KB, 1-ported
R-DCache (per bank) CACHE mode: 4 KB, 4-way set-associative, 1-ported
Table 4.1: Transformer Configuration Parameters
4.2 Calculation of Performance Metrics
The statistics file generated by Gem5 is used for obtaining the GPE-wise L1 cache
hit rates, tile-wise L2 cache hit rates, the number of operations, total execution time,
etc. The power consumption values are obtained for 14nm technology. The power
model was developed by Siying Feng, University of Michigan. A summary of the
procedure to calculate power is as follows.
Static power is computed by summing up the static power of the individual Trans-
former components. Dynamic power is calculated by dividing the total dynamic en-
ergy of the system by the total execution time.
For ARM cores, the static and dynamic power data are derived from its online
specification (40 nm) and scaled to 14 nm. The dynamic energy is calculated by
multiplying the given dynamic power with the number of active cycles.
For re-configurable caches, the static power and energy per transaction are gener-
ated using CACTI model [14] for 14 nm node. The dynamic energy is computed by
37
multiplying energy per transaction with the total number of accesses obtained from
the input statistics file.
The performance is measured by Equations 4.1 and 4.2:
GOPS = committed_ops/(sim_secs ∗ 109); (4.1)
GFLOPS = float_ops/(sim_secs ∗ 109) (4.2)
The parameters committed_ops, float_ops and sim_secs are obtained from
gem5 stats file. Specifically, committed_ops correspond to number of operations
executed, float_ops correspond to number of floating point operations executed and
sim_secs refers to the execution time for processing.
4.3 Performance of Linear Algebra Kernels
We evaluate the performance of the different kernels for different cache mode
operations and data sizes. The performance metrics are execution time, GOPS/W
and GFLOPS/W. For kernels that are composed of different basic kernels, we utilize
the reconfiguration capability of Transformer to derive the best performance.
4.3.1 Triangular Matrix Solver
Table 4.2 shows the execution times (in ms) for different cache configurations when
operating on matrix sizes of 128× 128 to 1024× 1024. We see that for small matrix
sizes, L1P, L2P has the best performance while for larger matrix sizes (1024× 1024)
L1S, L2P does better. L1P, L2P mode does not perform as well for larger sizes since
the memory requirement is larger than what the L1 cache bank can support.
Figures 4.1(a) and 4.1(b) show the cache hit rates of L1 and L2, respectively. We
observe that L1S, L2S has high L1 hit rates but relatively poor L2 hit rate. We find
38
N L1S, L2S L1S, L2P L1P, L2S L1P, L2P
128 0.15 0.15 0.137 0.132
256 1.28 1.33 0.989 0.981
512 10.33 10.12 9.32 7.22
1024 104.55 85.4 178.61 143.7
Table 4.2: TRSM: Execution Times (in ms) for Different Matrix Sizes
that both L1S, L2P and L1P, L2P have higher L2 hit rates and perform well until
N = 512. Due to insufficient L1 cache bank support, they do not perform well for
N = 1024, where L1S, L2P has the best performance.
(a) L1 hit rates (b) L2 hit rates
Figure 4.1: TRSM: Hit Rates for L1 and L2 Caches
Fig.4.2 shows the GFLOPS/W for different matrix sizes for all the cache modes.
Here too we see that L1P, L2P does better until N = 512 and when N = 1024, L1S,
L2P has significantly better performance. Our GFLOPS/W numbers are fairly high
with 84-96 GFLOPS/W for N < 512. The GOPS/W values for N = 128, 256, 512
and 1024 are also fairly high at 240, 250, 261 and 205, respectively.
39
Figure 4.2: TRSM: GFLOPS/W for Different Matrix Sizes.
4.3.2 LU Decomposition
As mentioned in Section 3.2.2, LUD v1 can be scheduled by assigning consecutive
or non-consecutive rows to each GPE. Fig. 4.3 shows the time a GPE is active when
N=128. In the first approach (Fig. 4.3(a)), 32 consecutive rows are assigned to each
tile, and so GPE0 in tile 0 is assigned 1st and 17th rows. In the second approach
(Fig. 4.3(b)), consecutive rows are assigned to consecutive GPEs across tiles and so
GPE0 is assigned 1st and 65th rows. In both approaches, each GPE computes on two
rows. The main advantage of approach 1 is that we can shut down the tiles after they
are done computing their rows, thus saving power. This technique, on average, saves
power up to 27%.
Table 4.3 shows the execution times (in ms) of both v1 and v2 for different cache
configurations when operating on matrix sizes of 128 × 128 to 1024 × 1024. Recall
that v2 uses v1 mapping for computing LUD at the block level followed by TRSM
and GEMM on the blocks. We can clearly see that v2 outperforms v1 for all matrix
sizes. We also see that until N = 256, L1S, L2S performs well and then for N > 256,
L1S, L2P has slightly better performance. Here, L1 private cache does not have good
performance because the v1 schedule divides each column-update across tiles. So, a
40
(a) Consecutive rows assigned to GPEs
(b) Non-consecutive rows assigned to each GPE
Figure 4.3: LUD: The Time for Which Each GPE is Active in the Two Scheduling
Schemes for v1.
N L1S, L2S L1S, L2P L1P, L2S L1P, L2P Using
v1 v2 v1 v2 v1 v2 v1 v2 Reconfig.
128 0.67 0.46 0.69 0.5 0.69 0.57 0.67 0.55 0.45
256 3.58 2.06 3.44 2.1 3.52 3.78 3.64 2.9 2.01
512 25.61 12.96 25.52 12.62 27.63 40.05 24.5 19.52 11.9
1024 168.66 99.6 169.62 97.25 371.25 382.6 157.17 143.77 91.45
Table 4.3: LUD: Execution Times (in ms) for Different Matrix Sizes
shared configuration for L1 does better. The reconfiguration scheme performs better
for all matrix sizes.
41
Figures 4.4(a) and 4.4(b) show the cache hit rates of L1 and L2, respectively. As
LUD v2 is a combination of TRSM, LUD v1 and GEMM, we can see some of their
characteristics. For example, LUD v1 favours a shared L1 cache, so L1S, L2S and
L1S, L2P have high L1 hit rates. As we saw in TRSM, L1P, L2P has good L1 and
L2 rates, here also we see that L1P, L2P has high L2 hit rates.
(a) L1 hit rates (b) L2 hit rates
Figure 4.4: LUD: Hit Rates for L1 and L2 Caches
Figure 4.5: LUD: GFLOPS/W for Different Matrix Sizes
Fig. 4.5 shows the GFLOPS/W for LUD v2. There is a linear increase in GFLOPS/W
as the matrix size increases until N=512. The numbers for N = 512 and 1024 are
close to 62 GFLOPS/W. The GOPS/W for the best cache configuration for N=128,
256, 512 and 1024 are 100.38, 133.69, 163.21 and 161.32, respectively.
42
4.3.3 QR Decomposition
We evaluate the performance of QRD on matrix sizes of 64× 64, 128× 128, and
256 × 256. As we use only a single tile, there is no difference in using L2 in shared
or private mode. The execution times for L1P, L2S are 0.35 ms, 1.54 ms and 13.3 ms
for N = 64, 128 and 256, respectively. L1P, L2S works better for all the matrix sizes
as each GPE works independently on a column.
Figures 4.6(a) and 4.6(b) show the cache hit rates of L1 and L2, respectively. Both
configurations have very poor L1 hit rates, while L1P, L2S has good L2 hit rates.
(a) L1 hit rates (b) L2 hit rates
Figure 4.6: QRD: Hit Rates for L1 and L2 Caches
Fig.4.7 shows the GFLOPS/W for different matrix sizes. We see that the GFLOPS/W
is as high as 130 for N=256. The GOPS/W for N=64, 128 and 256 are 124.93, 183.7
and 182.75, respectively.
4.3.4 Matrix Inversion
The Matrix Inversion kernel is composed of three different kernels: LUD, forward
substitution and backward substitution. An analysis of execution time shows that
LUD is dominant with 42.79%, followed by forward substitution at 29.96% and back-
ward substitution at 27.75%. Table 4.4 shows the execution times (in ms) for different
cache configurations when operating on matrix sizes of 128×128 to 1024×1024. The
43
Figure 4.7: QRD: GFLOPS/W for Different Matrix Sizes
’reconfiguration’ scheme, which uses the best cache mode for each kernel, has the best
performance.
N L1S, L2S L1S, L2P L1P, L2S L1P, L2P Reconfig.
128 0.78 0.82 0.84 0.86 0.72
256 4.83 4.95 5.76 5.14 3.9
512 33.71 32.93 58.7 33.96 27.32
1024 279.66 268.06 739.82 385.22 268.06
Table 4.4: Matrix Inverse: Execution Times (in ms) for Different Matrix Sizes
Figures 4.8(a) and 4.8(b) show the cache hit rates of L1 and L2, respectively. As
inversion is a combination of LUD v2 and TRSM, we can see some of their character-
istics on both L1 and L2 hit rates. While L1P, L2S and L1P, L2P have comparatively
low L1 and high L2 rates, the major contributor here is LUD v2 which has high L1
hit rates for L1S, L2S and L1S, L2P. Even though TRSM has good L1 and L2 hit
rates for L1P, L2S and L1P, L2P, and also with low L2 rates for L1S, L2S and L1S,
L2P, they have lower execution times.
Next, we present GFLOPS/W results for the different configurations in Fig.4.9.
We see that reconfiguration helps increase GFLOPS/W. For instance, for N = 512
44
(a) L1 hit rates (b) L2 hit rates
Figure 4.8: Matrix Inverse: Hit Rates for L1 and L2 Caches
use of reconfiguration helps increase GFLOPS/W from 72.67 (L1S, L2P) to 83.05.
Since reconfiguring cache mode only costs 1 cycle, support for reconfiguration is a
winning proposition.
Figure 4.9: Matrix Inverse: GFLOPS/W for Different Matrix Sizes
4.4 Performance of Machine Learning Kernels
4.4.1 LSTM cell
The LSTM cell computations are primarily matrix-vector multiplications followed
by a sigmoid or a hyberbolic tangent function. Table 4.5 shows the execution times
45
(in ms) for different cache configurations when operating on parameter sizes of 128
to 1024, where the input size is equal to the dimension of the LSTM cell. We see
that L1P, L2P has the best performance for all parameter sizes. The main reason is
that the matrix-vector multiplication is distributed such that each GPE is assigned a
block of 16 elements in a row. As there is no overlap in data between any two GPEs,
L1P, L2P is always better.
N L1S, L2S L1S, L2P L1P, L2S L1P, L2P
128 0.39 0.22 0.18 0.14
256 1.25 0.98 0.88 0.58
512 4.6 3.26 2.64 1.88
1024 18.95 14.1 11 7.85
Table 4.5: LSTM: Execution Times (in ms) for Different Parameter Sizes
Figure 4.10: LSTM: GFLOPS/W for Different Parameter Sizes.
Fig.4.10 shows the GFLOPS/W for different parameter sizes for all the cache
modes. The GFLOPS/W increase till N = 512 and then start decreasing. Here too
we see that L1P, L2P does better. The GOPS/W values for N = 128, 256, 512 and
1024 are at 102.51, 97.31, 111.23 and 106.19, respectively.
46
4.4.2 GRU cell
Similar to LSTM, the majority of the computations in a cell is matrix-vector
multiplication. The results shown here are only for inference. We evaluate the per-
formance on parameter sizes of 128 to 1024. Table 4.6 shows the execution times for
all cache configurations. Similar to LSTM, here also L1P, L2P does better because
of the scheduling scheme used for matrix-vector multiplication.
N L1S, L2S L1S, L2P L1P, L2S L1P, L2P
128 0.08 0.05 0.05 0.04
256 0.33 0.18 0.2 0.14
512 1.29 0.68 0.66 0.43
1024 8.87 4.22 2.74 1.54
Table 4.6: GRU: Execution Times (in ms) for Different Parameter Sizes
Figure 4.11: GRU: GFLOPS/W for Different Parameter Sizes
Fig.4.11 shows the GFLOPS/W for different parameter sizes. We see that the
GFLOPS/W is as high as 69.3 for N=1024. The GOPS/W for N=128, 256, 512 and
1024 are 111.77, 131.97, 153.98 and 163.37, respectively.
47
4.5 Performance of Recommender Systems Engine
The dataset used here is the movielens 10M dataset[11] provided by DARPA. It
consists of 10 million ratings and 100,000 tag applications applied to 10,678 movies
by 69,878 users. The hyper parameters of the network are as follows:
1. Batch_size: 256; embedding dimension: 800; hidden dimension: 400; output
dimension: 10678;
2. Bias_offset: -10; dropout probability: 0.5;
3. Optimizer: ADAM optimizer; learning rate: 0.01; beta: (0.9, 0.999); weight
decay: 0; epsilon: 1e-8.
While the architectural configuration is the same as the description in Section 4.1,
here we only consider shared cache mode for both L1 and L2 banks.
The initialization part is done by one of the LCP and takes about 1.25s. It includes
1. Loading the DARPA dataset, which is stored in Numpy array file (.npy), onto
the DRAM.
2. Initializing the weight matrices for the Embedding layer (embedding matrix),
BatchNorm (γ′s and β′s), Linear layer (Weights and biases) and Adam Opti-
mizer.
Table 4.7 shows the overall results for a batch of 256 users.
Exec. Time GOPS GFLOPS GOPS/W GFLOPS/W Cache hit rates (L1, L2)
1.062s 39.02 12.59 169.12 54.55 98.45, 22.05
Table 4.7: Performance Data for 256 Users
48
Since the embedding layer (both forward and backward) would slightly vary for
each batch and the final batch would have only 246 users, we can project that the
execution time for the entire dataset would be close to 273 times the execution time
of one batch of 256 users, i.e. 289.926s. The initialization cost is thus distributed
across all 273 batches.
The breakdown of the execution time for one batch is shown in Figure 4.12. In
the forward pass, we see that the second fully connected layer (Linear 2 in Figure
2.5) is the one that takes the longest time to execute. This is because the matrix
multiplication involves dimensions of 256×400 and 400×10768. This is very high when
compared to the first fully connected layer (Linear 1 in Figure 2.5) with dimensions
256×800 and 800×400. Even in the backward pass, Linear 2 has the longest execution
time followed by loss+sigmoid. The rest of the layers take a very low proportion of
the total execution time. The execution time of the forward pass is 372.8ms and the
backward pass is 506.53ms.
(a) Forward Pass (b) Backward Pass
Figure 4.12: Recsys: Breakdown of Execution Time
The performance metrics for the forward pass is shown in Table 4.8 and for the
backward pass (with Adam Optimizer) is shown in Table 4.9. The average L1 hit
rates are above 90% and L2 hit rates are around 20%. Here, we see that the fully
49
connected layers have the greatest GOPS/W and GFLOPS/W followed by batchnorm
and dropout layers. The transpose operation has very high GOPS/W and very poor
GFLOPS/W as there are no floating point operations. The GFLOPS/W for the
forward pass is 43.48 and for the backward pass is 61.86. The GOPS/W for the
forward pass is 148.62 and for the backward pass is 180.89. Overall, the system
achieves 54.55 GFLOPS/W and 169.12 GOPS/W.
Layer Execution (ms) GOPS/W GFLOPS/W Hit rates in (L1, L2)
Embedding + ReLU1 18.62 53.70 25.21 96.82, 9.19
BatchNorm1 1.03 130.45 18.00 95.28, 38.66
Dropout1 6.14 165.57 1.62 96.00, 56.61
Linear1 + ReLU2 14.36 186.35 64.37 99.00, 18.42
BatchNorm2 0.76 94.75 12.86 95.04, 29.20
Dropout2 3.77 140.79 1.38 95.97, 57.81
Linear2 + Sigmoid 328.11 150.75 44.45 98.41, 27.77
Table 4.8: Performance Metrics for Forward Pass
Scalability: Our evaluation was done for a smaller set of data (one batch of 256
users). All other hyper parameter values were the same as given by DARPA. Here
we comment on the mapping if the number of users, number of movies, batch_size,
etc. are higher.
• Increasing the number of users doesn’t affect the computations because each
mini-batch has a fixed set of users and is independent of the other. There will
be a slight impact on the embedding layer if the number of ratings by each user
increases or decreases because the embedding layer sums up all the movies’
vectors from the embedding matrix.
50
Layer Execution (ms) GOPS/W GFLOPS/W Hit rates in (L1, L2)
Loss + Sigmoid 74.78 105.91 38.92 94.02, 26.76
Transpose2 22.28 178.36 8E-06 67.33, 5.16
Linear2 359.56 193.50 68.76 99.17, 14.30
Dropout2 0.08 61.85 30.40 91.35, 0
BatchNorm2 + ReLU2 1.99 182.83 17.24 93.03, 25.28
Transpose1 1.01 211.55 2.7E-04 71.72, 12.20
Linear1 26.19 201.65 71.54 99.18, 11.58
Dropout1 0.16 62.85 31.17 91.52, 0.012
BatchNorm1 + ReLU1 5.86 137.58 12.60 92.56, 23.70
Embedding 14.58 135.19 58 99.41, 14.86
Adam Optimizer 182.68 64.61 19.21 97.03, 31.07
Table 4.9: Performance Metrics for Backward Pass
• Increasing the number of movies impacts the output layer and the embedding
layer. The size of the output layer matrix multiplication increases along with
the rows of the embedding matrix.
• If batch size increases, the number of iterations through the network will become
less. As a result, more number of users will be processed in a single batch.
• If the sizes of other parameters increase, the major impact would be to the
fully connected layer as it increases the number of multiplications. It would
also affect the transpose. All other layers except for the batch norm have
element-wise operations, the change in their execution time wouldn’t be that
significant. For the batch norm, the dimensions for calculating the mean and
variances would increase.
51
Chapter 5
CONCLUSION
5.1 Summary
In this thesis, we presented the implementations of (i) few linear algebra kernels
including Triangular Matrix Solver (TRSM), LU Decomposition (LUD), QR Decom-
position (QRD), Matrix Inversion, (ii) machine learning kernels including an LSTM
cell, a GRU cell and (iii) a neural network based recommender systems engine onto
a massively parallel reconfigurable architecture, Transformer. We investigated the
performance of these kernels for different kernel sizes and different L1 and L2 cache
configurations (shared and private) when operating in the 14nm technology node. The
performance metrics were execution times, giga-floating-point-operations per second
per Watt (GFLOPS/W), and L1/L2 cache misses.
For the linear algebra kernels, each kernel achieves high performance for a certain
cache configuration and this cache configuration can change when the kernel sizes
change. For instance, for smaller matrix sizes, L1P, L2P was best for TRSM, L1S,
L2S was best for LUD, and L1P, L2S was best for QRD. For each kernel, the optimal
cache mode changed when the matrix size increased. For instance, for LUD, the
optimal cache mode changed from L1S, L2S to L1S, L2P for N > 512. For the
machine learning kernels, L1P, L2P was best for all kernel sizes.
The reconfigurable cache mode features of Transformer are utilized in the imple-
mentation of LUD and matrix inverse. For N = 512, implementing the LUD step
using L1S, L2P and the TRSM step using L1P, L2P resulted in an improvement of
11.83 GFLOPS/W from 72.67 (L1S, L2P) to 84.5. Cycle-accurate simulations using
52
Gem5 show that the peak performance is 97.5, 62.2, 133.0 and 84.5 GFLOPS/W for
TRSM, LUD, QRD and Matrix Inverse, respectively. For LSTM and GRU, the peak
performance is 44.06 and 69.3 GFLOPS/W, respectively.
The neural network based recommender systems was implemented in L1S, L2S
mode. It consists of multiple kernels of which one of the fully connected layers takes
88% of the time in the forward pass and 70% of the time in the backward pass. The
large computational complexity is due to operations with large weight matrix of size
(400× 10768). While this step results in 44.45 GFLOPS/W in the forward pass and
68.76 GFLOPS/W in the backward pass, there is room for improvement. Overall,
the recommender systems engine achieves a performance of 54.55 GFLOPS/W and
169.12 GOPS/W.
5.2 Key Take-aways
All the linear algebra kernels were mapped to the GPEs in a way that maximized
GPE utilization. We find that based on mapping strategy, different kernels have
different memory access patterns resulting in different favourable cache modes. For
instance, private cache is beneficial when there are no dependencies between compu-
tations and so no data overlap across GPEs. For example, TRSM has no dependency
across columns and so each column is assigned to each GPE. Also, if the L1 cache
is large enough, a copy of data can be stored in each private L1 cache bank. For
example, TRSM maintains an independent copy of A and a column of B in each L1
memory bank. This works only if the matrix sizes are small compared to the size of
the cache.
Shared cache is beneficial when the data doesn’t fit in the cache and workload is
memory bound which is the case for large matrix sizes. Shared cache is also beneficial
when every GPE needs the same copy of data for their computations and the L1 bank
53
is not large enough to hold it. For example, in LUD, parallelizing each column update
as well as sharing the data across GPEs can only be supported by use of shared cache.
5.3 Extensions
The work presented in this thesis is an initial exploration of linear algebra and
machine learning kernels on a parallel multi-core architecture.
We considered only private and shared mode operations for L1 and L2 caches. We
can investigate performance of the kernels using different combinations of cache and
scratchpad modes and even hybrid cache modes. A hybrid cache mode has half of the
memory bank configured as a cache and the other half as a scratchpad. The other
area of improvement would be to employ different scheduling schemes. For example,
a block QR approach can be implemented that uses all 4 tiles instead of just one.
For the recommendation system, we implemented all kernels in shared cache mode.
The next step would be to implement the computation intensive kernels such as fully
connected layer and Adam optimizer using multiple cache modes and picking the one
with the best performance. We project that fully connected layer would benefit from
use of L1S, L2P mode. Since Transformer implements cache reconfiguration in 1
cycle, operating different kernels of the recommendation system using different cache
modes will help improve its performance even more.
54
REFERENCES
[1] Qualcomm Snapdragon 855+, https://www.qualcomm.com/products/
snapdragon-855-plus-mobile-platform, 2019. [Online; accessed October 25,
2019].
[2] Apple A13 bionic chip, https://en.wikipedia.org/wiki/Apple_A13, 2019.
[Online; accessed October 25, 2019].
[3] AMD Math Core Library, https://developer.amd.com/wordpress/media/
2012/10/acml_userguide.pdf.pdf, 2019. [Online; accessed October 25, 2019].
[4] Intel MKL, https://software.intel.com/en-us/articles/
mkl-reference-manual, 2019. [Online; accessed October 25, 2019].
[5] CuBLAS, https://developer.nvidia.com/cublas, 2019. [Online; accessed
October 25, 2019].
[6] Machine Learning, https://en.wikipedia.org/wiki/Machine_learning,
2019. [Online; accessed October 29, 2019].
[7] Google’s Tensor Processing Unit, https://cloud.google.com/tpu/, 2019. [On-
line; accessed October 29, 2019].
[8] Apple’s A11 Bionic, https://en.wikipedia.org/wiki/Apple_A11#Neural_
Engine, 2019. [Online; accessed October 29, 2019].
[9] Apple’s A12 Bionic, https://en.wikipedia.org/wiki/Apple_A12, 2019. [On-
line; accessed October 29, 2019].
[10] Qualcomm’s AI engine, https://www.qualcomm.com/products/smartphones/
mobile-ai, 2019. [Online; accessed October 29, 2019].
[11] MovieLens 10M Dataset, http://files.grouplens.org/datasets/
movielens/ml-10m.zip, 2019. [Online; accessed October 29, 2019].
[12] M. Alle, K. Varadarajan, A. Fell, N. Joseph, S. Das, P. Biswas, J. Chetia, A. Rao,
SK. Nandy, R. Narayan, et al. Redefine: Runtime reconfigurable polymorphic
asic. ACM Transactions on Embedded Computing Systems (TECS), 9(2):11,
2009.
[13] E. Angerson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz,
S. Hammarling, J. Demmel, C. Bischof, and D. Sorensen. Lapack: A portable lin-
ear algebra library for high-performance computers. In Supercomputing ’90:Pro-
ceedings of the 1990 ACM/IEEE Conference on Supercomputing, pages 2–11,
Nov 1990.
[14] R. Balasubramonian, A. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas.
Cacti 7: New tools for interconnect exploration in innovative off-chip memories.
ACM Trans. Archit. Code Optim., 14(2):14:1–14:25, June 2017.
55
[15] N. Binkert, B. Beckmann, G. Black, S. Reinhardt, A. Saidi, A. Basu, J. Hestness,
D. Hower, T. Krishna, S. Sardashti, et al. The gem5 simulator. ACM SIGARCH
Computer Architecture News, 39(2):1–7, 2011.
[16] N. Binkert, R. Dreslinski, L. Hsu, K. Lim, A. Saidi, and S. Reinhardt. The m5
simulator: Modeling networked systems. IEEE MICRO, (4):52–60, 2006.
[17] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A class of parallel tiled linear
algebra algorithms for multicore architectures. Parallel Comput., 35(1):38–53,
January 2009.
[18] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi. A dynamically
configurable coprocessor for convolutional neural networks. In Proceedings of
the 37th Annual International Symposium on Computer Architecture, ISCA ’10,
pages 247–257, 2010.
[19] Y. Chen, T. Krishna, J. S. Emer, and V. Sze. Eyeriss: An energy-efficient
reconfigurable accelerator for deep convolutional neural networks. IEEE Journal
of Solid-State Circuits, 52(1):127–138, Jan 2017.
[20] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio. Learning phrase representations using RNN
encoder–decoder for statistical machine translation. In Proceedings of the 2014
Conference on Empirical Methods in Natural Language Processing (EMNLP),
pages 1724–1734, Doha, Qatar, October 2014. Association for Computational
Linguistics.
[21] C.Peddawad and A. Goel. Matrix-matrix multiplication using systolic array
architecture in bluespec team. 2015.
[22] J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, and I. Ya-
mazaki. Accelerating Numerical Dense Linear Algebra Calculations with GPUs,
pages 3–28. Springer International Publishing, Cham, 2014.
[23] K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense
linear algebra computations. SIAM Rev., 32(1):54–135, March 1990.
[24] W. Givens. Computation of plane unitary rotations transforming a general ma-
trix to triangular form. Journal of the Society for Industrial and Applied Math-
ematics, 6(1):26–50, 1958.
[25] K. Goto and Robert. Geijn. Anatomy of high-performance matrix multiplication.
ACM Trans. Math. Softw., 34(3):12:1–12:25, May 2008.
[26] V. Govindaraju, C. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam,
and C. Kim. Dyser: Unifying functionality and parallelism specialization for
energy-efficient computing. IEEE Micro, 32(5):38–51, 2012.
[27] Y. Guan, Z. Yuan, G. Sun, and J. Cong. Fpga-based accelerator for long short-
term memory recurrent neural networks. In 2017 22nd Asia and South Pacific
Design Automation Conference (ASP-DAC), pages 629–634, Jan 2017.
56
[28] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput.,
9(8):1735–1780, November 1997.
[29] M. Hossein, A. Bacha, L. Zhou, and R. Teodorescu. Rnnfast: An accelerator for
recurrent neural networks using domain wall memory. ArXiv, abs/1812.07609,
2018.
[30] J. Humphrey, D. Price, K. Spagnoli, A. Paolini, and E. Kelmelis. CULA: hybrid
GPU accelerated linear algebra routines. InModeling and Simulation for Defense
Systems and Applications V, volume 7705, pages 9 – 15. International Society
for Optics and Photonics, SPIE, 2010.
[31] S. M. A. H. Jafri, T. N. Gia, S. Dytckov, M. Daneshtalab, A. Hemani, J. Plosila,
and H. Tenhunen. Neurocgra: A cgra with support for neural networks. In 2014
International Conference on High Performance Computing Simulation (HPCS),
pages 506–511, July 2014.
[32] W. José, A. R. Silva, H. Neto, and M. Véstias. Analysis of matrix multiplication
on high density virtex-7 fpga. In 2013 23rd International Conference on Field
programmable Logic and Applications, pages 1–4, Sep. 2013.
[33] Oleksii K. and Boris G. Training deep autoencoders for collaborative filtering,
2017.
[34] F. A. Khan, R. A. Ashraf, Q. H. Abbasi, and A. A. Nasir. Resource efficient
parallel architectures for linear matrix algebra in real time adaptive control al-
gorithms on reconfigurable logic. In 2008 Second International Conference on
Electrical Engineering, pages 1–9, March 2008.
[35] D. Kingma and J. Ba. Adam: A method for stochastic optimization. Interna-
tional Conference on Learning Representations, 12 2014.
[36] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear
algebra subprograms for fortran usage. ACM Trans. Math. Softw., 5(3):308–323,
September 1979.
[37] F. Li, Y. Ye, Z. Tian, and X. Zhang. Cpu versus gpu: which can perform matrix
computation faster—performance comparison for basic linear algebra subpro-
grams. Neural Computing and Applications, 31(8):4353–4365, Aug 2019.
[38] S. Li, C. Wu, H. Li, B. Li, Y. Wang, and Q. Qiu. Fpga acceleration of recurrent
neural network based language model. In 2015 IEEE 23rd Annual International
Symposium on Field-Programmable Custom Computing Machines, pages 111–
118, May 2015.
[39] Y. Li and A. Pedram. Caterpillar: Coarse grain reconfigurable architecture for
accelerating the training of deep neural networks. 2017 IEEE 28th Interna-
tional Conference on Application-specific Systems, Architectures and Processors
(ASAP), pages 1–10, 2017.
57
[40] F. Magoules and A. Ahamed. Alinea: An advanced linear algebra library for
massively parallel computations on graphics processing units. The International
Journal of High Performance Computing Applications, 29(3):284–310, 2015.
[41] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins. Adres: An ar-
chitecture with tightly coupled vliw processor and coarse-grained reconfigurable
matrix. In International Conference on Field Programmable Logic and Applica-
tions, pages 61–70. Springer, 2003.
[42] R. Nath, S. Tomov, and J. Dongarra. Accelerating gpu kernels for dense linear
algebra. In High Performance Computing for Computational Science – VECPAR
2010, pages 83–92, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
[43] E. Nurvitadhi, Jaewoong Sim, D. Sheffield, A. Mishra, S. Krishnan, and D. Marr.
Accelerating recurrent neural networks in analytics servers: Comparison of fpga,
cpu, gpu, and asic. In 2016 26th International Conference on Field Programmable
Logic and Applications (FPL), pages 1–4, Aug 2016.
[44] S. Pal, J. Beaumont, D. Park, S. Amarnath, A.and Feng, C. Chakrabarti, H. Kim,
D. Blaauw, T. Mudge, and R. Dreslinski. Outerspace: An outer product based
sparse matrix multiplication accelerator. In 2018 IEEE International Symposium
on High Performance Computer Architecture (HPCA), pages 724–736. IEEE,
2018.
[45] G. Papa and J. Silc. Linear algebra in one-dimensional systolic arrays. Infor-
matica (Slovenia), 24, 06 2000.
[46] D. Patel, M. Shabany, and P. G. Gulak. A low-complexity high-speed qr de-
composition implementation for mimo receivers. In 2009 IEEE International
Symposium on Circuits and Systems, pages 33–36, May 2009.
[47] A. Pedram, A. Gerstlauer, and R. A. Van De Geijn. Floating point architecture
extensions for optimized matrix factorization. In 2013 IEEE 21st Symposium on
Computer Arithmetic, pages 49–58, 2013.
[48] A. Pedram, A. Gerstlauer, and R. A. Van De Geijn. Algorithm, architecture,
and floating-point unit codesign of a matrix factorization accelerator. IEEE
Transactions on Computers, 63(8):1854–1867, 2014.
[49] A. Pedram, S. Z. Gilani, N. S. Kim, R. v. d. Geijn, M. Schulte, and A. Gerstlauer.
A linear algebra core design for efficient level-3 blas. In 2012 IEEE 23rd Interna-
tional Conference on Application-Specific Systems, Architectures and Processors,
pages 149–152, July 2012.
[50] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal. Memory-centric accel-
erator design for convolutional neural networks. In 2013 IEEE 31st International
Conference on Computer Design (ICCD), pages 13–19, Oct 2013.
58
[51] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pe-
dram, C. Kozyrakis, and K. Olukotun. Plasticine: A reconfigurable architecture
for parallel patterns. In 2017 ACM/IEEE 44th Annual International Symposium
on Computer Architecture (ISCA), pages 389–402, 2017.
[52] S. Steffl and S. Reda. Lacore: A supercomputing-like linear algebra accelerator
for soc-based designs. In 2017 IEEE International Conference on Computer
Design (ICCD), pages 137–144, Nov 2017.
[53] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra. Dense linear algebra solvers
for multicore with gpu accelerators. In 2010 IEEE International Symposium on
Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), pages
1–8, April 2010.
[54] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei. Deep convolutional
neural network architecture with reconfigurable computation patterns. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 25(8):2220–2233,
Aug 2017.
[55] L. Wang, W. Wu, Z. Xu, J. Xiao, and Y. Yang. Blasx: A high performance
level-3 blas library for heterogeneous multi-gpu computing. In Proceedings of the
2016 International Conference on Supercomputing, ICS ’16, pages 20:1–20:11,
New York, NY, USA, 2016. ACM.
[56] R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software.
In SC ’98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing,
pages 38–38, Nov 1998.
[57] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. Optimizing fpga-
based accelerator design for deep convolutional neural networks. In Proceedings
of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays, FPGA ’15, pages 161–170, 2015.
[58] X.-Y Zhang, Q. Wang, and Y.-Q Zhang. Openblas: A high performance blas
library on loongson 3a cpu. 22:208–216, 12 2011.
[59] Y. Zhang, C. Wang, L. Gong, Y. Lu, F. Sun, C. Xu, X. Li, and X. Zhou. Im-
plementation and optimization of the accelerator based on fpga hardware for
lstm network. In 2017 IEEE International Symposium on Parallel and Dis-
tributed Processing with Applications and 2017 IEEE International Conference
on Ubiquitous Computing and Communications (ISPA/IUCC), pages 614–621,
Dec 2017.
59
