TSM2X: High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on
  GPUs by Rivera, Cody et al.
TSM2X: High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on
GPUs
Cody Riveraa,1, Jieyang Chenb,1, Nan Xiongc, Shuaiwen Leon Songd, Dingwen Taoe,∗
aThe University of Alabama, Tuscaloosa, AL 35487, USA
bOak Ridge National Laboratory, Oak Ridge, TN 37830, USA
cThe University of Tennessee, Knoxville, TN 37996, USA
dThe University of Sydney, NSW 2006, Australia
eWashington State University, Pullman, WA 99164, USA
Abstract
Linear algebra operations have been widely used in big data analytics and scientific computations. Many works have
been done on optimizing linear algebra operations on GPUs with regular-shaped input. However, few works focus
on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations do not consider fully
utilizing the memory bandwidth and computing power; therefore, they can only achieve sub-optimal performance. In
this paper, we propose two efficient algorithms—TSM2R and TSM2L—for two classes of tall-and-skinny matrix-matrix
multiplications on GPUs. Both of them focus on optimizing linear algebra operation with at least one of the input
matrices is tall-and-skinny. Specifically, TSM2R is designed for a large regular-shaped matrix multiplying a tall-and-
skinny matrix, while TSM2L is designed for a tall-and-skinny matrix multiplying a small regular-shaped matrix. We
implement our proposed algorithms and test on several modern Nvidia GPU micro-architectures. Experiments show
that, compared to the current state-of-the-art works, (1) TSM2R speeds up the computation by 1.1x∼3x and improves
the memory bandwidth utilization and computing power utilization by 8%∼47.6% and 7%∼37.3%, respectively, when
the regular-shaped matrix size is relatively large or medium; and (2) TSM2L speeds up the computation by 1.1x∼3.5x
and improve the memory bandwidth utilization by up to 55% when the regular-shaped matrix size is relatively small.
Keywords: Matrix-matrix multiplication, Tall-and-skinny matrix, GEMM, GPU, Optimization, CUDA, Performance
1. Introduction
Matrix-matrix multiplication (GEMM) has been one
of the most extensively used linear algebra operations
in big data analytics and scientific computations. Due
to many factors (such as algorithms, input data, etc.)
the size or shape of input matrices for GEMM usually
varies when it is used in different applications. For ex-
ample, many modern highly scalable scientific simula-
tion packages in the field of fluid dynamics, such as Fi-
nite Element Method (FEM) needs to compute many
GEMM with small-sized input matrix. Artificial neu-
ral networks (ANN) involve using GEMM with small
to medium input matrices. Matrix decompositions uses
GEMM with large-sized input matrices [1, 2, 3, 4].
∗Corresponding author
Email address: dingwen.tao@wsu.edu (Dingwen Tao)
1Cody Rivera and Jieyang Chen have contributed equally to this
work.
Thus, besides large-sized input, which has already been
extensively optimized during the past decades, GEMM
with small to medium sized input has also drawn much
attention to recent researchers. For instance, Dong et al.
[5] proposed MAGMA-Batched, which aims to batch
small input matrices into larger ones in order to utilize
the highly optimized implementations for large input
size on GPUs. Heinecke et al. [6] proposed to speed up
GEMM with small input using architecture and instruc-
tion level optimization on modern CPU architectures.
Although previous works have focused on optimizing
GEMM with different matrix sizes, most of them only
assume that the input matrices are regular-shaped. In
other words, the size mentioned in their works usually
refers to both dimensions of the input matrix. For exam-
ple, a small matrix means both of its width and height
are small and their magnitudes are close to each other.
When the dimensions of the input matrices have signif-
icant difference, we consider them as irregular-shaped
Preprint submitted to Journal of Parallel and Distributed Computing July 29, 2020
ar
X
iv
:2
00
2.
03
25
8v
4 
 [c
s.D
C]
  2
8 J
ul 
20
20
inputs. In particular, many irregular-shaped inputs in-
volve tall-and-skinny matrices, in which their widths are
significantly smaller then their heights. To the best of
our knowledge, GEMM with tall-and-skinny input has
not been fully studied and optimized for, but it has been
widely used in many applications [7]. For instance, re-
cent highly optimized K-means implementations [8, 9]
use GEMM as their core computation, and the input size
is mostly tall-and-skinny. This is because the number of
centroids is usually far less than the number of input
data points. Moreover, when GEMM is used for encod-
ing checksums for many algorithm-based fault tolerance
applications [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
the input involves a tall-and-skinny checksum weight
matrix.
Previous efforts made for optimizing GEMM with
regular-shaped input may not work for the non-regular
shaped input. For instance, Chen et al. [13] illus-
trates that shows that calculating GEMM with tall-and-
skinny input using vendors highly optimized linear alge-
bra library (e.g., cuBLAS [21]) is slower than disassem-
bling the tall-and-skinny input matrix into several vec-
tors and then applying matrix-vector multiplications in-
stead. However, it can be easily seen that even with this
workaround the computation is not efficient, since ele-
ments in input matrices are accessed by the GPU more
times than necessary. Although the performance can be
optimized by grouping many tall-and-skinny input ma-
trices into large ones similar to the approach proposed,
there are cases where this grouping approach is not fea-
sible. For example, tall-and-skinny input matrices may
be generated one at a time from a producer process in
users workflow. Grouping several of them into a large
matrix requires extended waiting time, which is not ap-
plicable for time-sensitive applications. On the other
hand, the memory space may limit the total number of
matrices that can fit into the memory at the same time,
if the input matrices are large (e.g., multiplication of
regular-shaped large and tall-and-skinny matrices).
In this work, we target on optimizing the computa-
tion of GEMM with tall-and-skinny input on the GPU
platform since many applications that use GEMM are
deployed on GPUs. So, our optimization greatly ben-
efits those applications. The key insight of our work
is that the computation characteristic of GEMM on
modern computing systems is not always unchanged
as we change the shape of input matrices. For exam-
ple, when the sizes of regular-shaped matrices are large
(i.e., m ' k ' n  1 for an m × k matrix multi-
plying an k × n matrix), the compute-to-load ratios of
each element in the input matrices are O(m) or O(n).
So, the regular-shaped GEMM operations are usually
compute-bound especially for large matrices. However,
when the input is tall-and-skinny (i.e., m ' k  n or
m  k ' n), the average compute-to-load ratio is re-
duced to around O(n) ≈ O(1). Moreover, when k is
very small (i.e., m  k ' n), each GPU thread would
not perform enough workload to hide latency and hence
low occupancy. Therefore, depending on the relation-
ship between m, k, and n, and the performance char-
acteristics of GPUs, the computation can be compute-
bound, memory-bound, or latency-bound. Specifically,
when (1) m ' k  n, as n gets larger, it moves to-
ward compute-bound; (2) m ' k  n, as n gets smaller,
it moves toward memory-bound; and (3) m  k ' n,
it moves toward latency-bound. To optimize GEMM
with tall-and-skinny input, it is critical to design a com-
putation algorithm that considers all compute-bound,
memory-bound, and latency-bound cases.
The main contributions of this paper include:
• We study the limitation of current state-of-the-art
GEMM implementations with tall-and-skinny in-
puts (i.e., m ' k  n or m  k ' n). With
benchmarking, we find that the under-utilization of
GPU resources is the main reason for performance
degradation when the input is tall-and-skinny.
• To handle a broad spectrum of tall-and-skinny in-
puts for GEMM on GPUs, we design two classes
of algorithms with optimizations focusing on dif-
ferent tall-and-skinny input cases: (1) TSM2R is
designed to handle a large regular-shaped matrix
multiplying a tall-and-skinny matrix (i.e., m ' k 
n); (2) TSM2L is designed to handle a tall-and-
skinny matrix multiplying a small regular-shaped
matrix (i.e., m  k ' n).
• We present a performance model for TSM2R and
compare it with our evaluation performance re-
sults. Moreover, we examine the inadequacies of
the model for TSM2L and further improve it based
on our observations.
• We carefully implement TSM2R and TSM2L us-
ing CUDA C2 and evaluate them on four genera-
tions of Nvidia GPUs including Kepler, Maxwell,
Pascal, and Volta. Experiments show that our
TSM2R and TSM2L can achieve 1.1x∼3.0x and
1.1x∼3.5x speedups, respectively, with different
tall-and-skinny inputs, compared to the state-of-
the-art GEMM library cuBLAS.
2The code is available at https://github.com/codyjrivera/
tsm2x-imp.
2
The rest of this paper is organized as follows. In Sec-
tion 2, we give a formal definition of tall-and-skinny
matrix and show some preliminary benchmark results of
the GEMM with tall-and-skinny matrix using cuBLAS.
In Section 3, we propose our detailed design of TSM2R
and TSM2L for two different kinds of tall-and-skinny in-
puts. In Section 4, we present our evaluation results. In
Section 5, we conclude the paper.
2. Background
2.1. Tall-and-Skinny Input for GEMM
In this work we restrict our scope to handle irregular-
shaped inputs that involve tall-and-skinny matrices. The
tall-and-skinny input size means that, for the two input
matrices, at least one matrix is tall-and-skinny (i.e., one
dimension is significantly smaller than the other). For
example, either (i) input matrix A with size 20480 ×
20480 and matrix B with size 20480 × 2 or (ii) input
matrix A with size 20480 × 2 and matrix B with size
2× 2 is considered as tall-and-skinny input in our work.
Tall-and-skinny matrix is typical class of matrices that
can be found in irregular-shaped inputs for GEMM. In
this paper, we focus on optimizing GEMM with (i) one
large regular input matrix and one tall-and-skinny in-
put matrix and (ii) one tall-and-skinny input matrix and
one small regular input matrix. In this paper, for the
first case, we refer matrix A as the larger input matrix
(m× k) and matrix B (k× n) as the tall-and-skinny input
matrix, where m ' k  n; for the second case, we re-
fer matrix A as the tall-and-skinny input matrix (m × k)
and matrix B (k × n) as the smaller input matrix, where
m  k ' n. We choose these input sizes and shapes
because we believe they can expose most of the chal-
lenges in all kinds of tall-and-skinny input, so the de-
sign idea and optimization techniques introduced in this
paper can be easily applied to other cases with slight
modification. Also, for simplicity purpose, we choose
to let the larger matrix in (i) and smaller matrix in (ii) to
be square-shaped in most of our experiments. Our opti-
mization can work with other non-squared input as well
with similar effects.
2.2. cuBLAS
One of the most commonly used standard linear alge-
bra libraries optimized for GPU is the cuBLAS library
developed by Nvidia. The cuBLAS is the core comput-
ing library of many big data and scientific computing
applications. For example, it is the GPU computing li-
brary for MAGMA heterogeneous linear algebra library
[22, 23, 24], cuLA library [25], and cuDNN deep learn-
ing library [26]. With Nvidia’s deep optimization, the
cuBLAS library is able to provide state-of-the-art per-
formance in many use cases. For example, with large
regular-shaped input matrix, their GEMM implementa-
tion is able to achieve near peak GPU performance [27].
However, we found that the GEMM subroutine is not
fully optimized with certain input matrix sizes [10]. For
example, with inputs that involve tall-and-skinny ma-
trices, the GEMM operation in current best implemen-
tation (cuBLAS 9.0 running on NVIDIA Tesla K40c
GPU) only uses less than 10% of the theoretical peak
memory bandwidth on average with n = 2 (as demon-
strated in Figure 7 (a)-(b)). When n = 16, the same
GEMM operation only uses less than 20% of the theo-
retical peak memory bandwidth on average (as demon-
strated in Figure 7 (g)-(h)). The resource utilization is
even lower with larger input dimensions. By compar-
ing the two input sizes, it can be seen for input with
smaller n values, the computation utilizes higher mem-
ory bandwidth (close to memory bound). On the other
hand, for input with larger n values, the computation uti-
lizes higher computing power (close to compute bound).
However, since we are unable to analyze the GEMM im-
plementation in none open-sourced cuBLAS library, it
is hard to tell their exact computation characteristic.
3. Design Methodologies
To handle the GEMM with two different classes of
tall-and-skinny inputs on GPUs described in Section
2.1, we design two efficient algorithms: TSM2R and
TSM2L. TSM2R is to handle inputs with one large-to-
medium regular-shaped matrix and one tall-and-skinny
matrix, while TSM2L is to handle inputs with one tall-
and-skinny matrix and one small regular-shaped matrix.
Note that “R” or “L” means that the tall-and skinny ma-
trix is multiplied on the right or left.
3.1. Design of TSM2R
In this section, we describe our proposed algorithm
TSM2R for GEMM with a large regular-shaped matrix
and a tall-and-skinny matrix.
3.1.1. Insight on Tall-and-Skinny Input
For regular-shaped GEMM (m × k matrix multiplies
k×n matrix), the input matrices size is O(mk+kn), while
the computing time complexity is O(mkn), so each ele-
ment in input matrices is used O(m) or O(n) times within
the entire computation process. Since loading data from
GPU off-chip DRAM (i.e., global memory) to GPU is
3
expensive and to avoid extensive data load operations,
one common optimization for this kind of problem is
minimizing the number of times each element needs
to be loaded into the GPU by using fast on-chip mem-
ory (e.g., cache, registers) to enable data reuse. As the
number of loads reduces, optimized GEMM tends to be
compute-bound. For example, current GEMM imple-
mentation in cuBLAS library can reach near bare-metal
performance on GPUs [27].
However, unlike regular-shaped GEMM, when one
matrix is tall-and-skinny (e.g., n  m, k), each element
in the input matrices is used O(n) times on average:
(m×k)×n times+(k×n)×m times
m×k+k×n ≈ O(n) times.
Depending on the size of n and target GPU peak com-
puting power and memory throughput ratio, the compu-
tation can be either compute-bound or memory-bound.
When n gets smaller, the computation tends to be
memory-bound. Otherwise, the problem tends to be
compute-bound. In either case, the problem is always
near the boundary between memory bound and compute
bound, so it is critical to design an algorithm that is op-
timized for both two cases.
3.1.2. Algorithm Design
As the core of our optimization, algorithm design
plays an important role. First, we need to consider how
to fit the workload of our TSM2R into the programming
model of CUDA (i.e., thread hierarchy). Although the
workload can be easily decomposed into many indepen-
dent smaller workloads, careful consideration on work-
load distribution is still necessary, since any unneces-
sary performance penalty can drastically cause GPU re-
source underutilization. Several factors are considered
in our design:
1. Total number of global memory accesses;
2. Efficiency on global memory throughput;
3. Utilization on global memory throughput;
4. Parallelism of overall workload;
5. On-chip memory utilization;
6. Streaming Multiprocessor (SM) utilization;
7. Optimization for compute & memory-bound cases.
To achieve good performance, there must exist enough
number of active threads in each SM of GPU to ensure
proper instruction and memory access latency hiding.
So, in our algorithm we divide the workload by assign-
ing n rows of matrix A to n different threads. Each
vector-matrix multiplication is assigned to one thread
(i.e., (A[i, :] × B)). The benefit is three-fold: 1) this en-
sures high parallelism and high SM occupancy; 2) since
Require: input matrix A (m × k) and B (k × n), output matrix
C (m × n)
1: for i = 1 to n do
2: for j = 1 to k do
3: C[global tid, i]+ = A[global tid, j] × B[ j, i]
4: end for
5: end for
Algorithm 1: Each thread’s workload with inner product.
Require: input matrix A (m × k) and B (k × n)
Require: output matrix C (m × n)
1: Reg1:n ← C[global tid, 1 : n]
2: for i = 1 to k do
3: Reg1:n+ = A[global tid, i] × B[i, 1 : n]
4: end for
5: C[global tid, 1 : n]← Reg1:n
Algorithm 2: Each thread’s workload with outer product.
the number of elements of matrix A is much higher
than matrix B, this kind of distribution ensures mini-
mum number of memory accesses in favor of matrix
A; 3) it also enables high memory access efficiency
and throughput, since all memory accesses to matrix A
are naturally coalesced (assuming matrices are stored in
column-major by convention).
As for the vector-matrix multiplication assigned to
each thread, to further reduce the number of memory ac-
cesses to matrix A, we use outer-product style computa-
tion instead of the regular inner-product style computa-
tion. As shown in Algorithm 1, if we use inner-product,
each element of matrix A is repeatedly referenced n
times. On the other hand, if we use outer-product as
shown in Algorithm 2, each element of matrix A is ref-
erenced only once. (Please note, as we will discuss in
later sections, when n is larger than a certain threshold,
elements in matrix A still need to be referenced more
than once due to the limited resources available for each
thread, but it is still far lower than using inner-product).
For large matrix A, the benefit is significant, since it
greatly reduces the total number of global memory ac-
cesses during the entire GEMM computation. Also, the
outer-product style does not bring any extra memory ac-
cesses to matrix B compared to inner-product style. The
only cost for outer-product is extra registers holding n
intermediate results. However, with proper tuning, they
only bring little performance impact compared with ex-
tra memory accesses.
3.1.3. Efficient Off-Chip Memory Access
One key factor of optimizing memory intensive ap-
plications is ensuring high off-chip memory access ef-
ficiency. Depending on the GPU model type or run-
time configurations, global memory (off-chip) accesses
4
of threads within the same warp can to coalesced into
128 byte- or 32 byte-transactions [28] if their access ad-
dresses fall into the same 128 byte- or 32 byte-segments
in global memory, which enables efficient use of mem-
ory bandwidth. Otherwise, the GPU still loads memory
in 128 byte- or 32 byte-transactions, but it may contain
unrequested data that are stored in neighbor addresses,
which causes inefficient memory accesses.
Since each thread reads one row of matrix A and
the matrix is stored in column-major by convention,
memory accesses are naturally coalesced when threads
within the same warp access elements on different rows
but on the same column. So, 100% memory access ef-
ficiency is achieved on matrix A. However, for matrix
B, all threads access the same element at the same time,
which results a single memory transaction containing
one requested element and several unrequested neigh-
bor elements. So, only 8 bytes128 bytes = 6.25% or
8 bytes
32 bytes =
25% memory access efficiency is achieved for access-
ing 64-bit double floating point elements. Although the
total number of elements in matrix B is small, given that
each element needs to be accessed n times, this ineffi-
cient access pattern can still greatly impact the overall
performance.
To improve the efficiency of memory accesses to ma-
trix B, we utilize shared memory in GPU. Since it is
located on-chip, shared memory gives us the speed of
L1 cache and it is fully programmable. Threads within
one thread block can use shared memory to share data.
So, one key advantage of shared memory is that it elim-
inates the need for the consistency between patterns of
data loading and data using pattern, which enables us to
load global memory in the most efficient way and keep
the way that we use data as before.
By using shared memory for accessing matrix B, we
can reduce the total number of memory accesses and
enable coalesced memory access. As shown in Algo-
rithm 3, for each iteration, instead of letting threads re-
quest elements they need individually by themselves in-
efficiently, we now let a block of threads work together
to fetch a tile of matrix B into the shared memory in a
coalesce-compatible way (Line 11). Then during com-
putation, each thread references elements in matrix B
through the shared memory instead of loading each one
of them individually from global memory. This reduces
the total number of accesses to matrix B from global
memory (from n to n/t1 per element). Also, threads
in a same thread block fetch elements of matrix B col-
umn by column, which enables coalesced memory ac-
cess and greatly improves memory-access efficiency to
100%. Moreover, we also introduce three parameters:
t1, t2, and t3 in Algorithm 3. These parameters are used
Require: input matrix A (m × k) and B (k × n), output matrix
C (m × n)
1: t1 ← tile size B, t2 ← tile size C, t3 ← tile size A
2: Register: A1, A2, ..., At3
3: Register: C1,C2, ...,Ct2
4: Shared Memory: currB with size t1 × t2
5: Threads per thread block← t1
6: Total thread blocks← m/t1
7: for p = 1 to n with step size = t2 do
8: C1:t2 ← C[global tid, p : p + t2 − 1]
9: for j = 0 to k with step size = t1 do
/* Load a tile of B into shared memory */
10: ThreadsSynchronization()
11:
currB[global tid, 1 : t2]← B[ j+global tid, p : p+ t2−1]
12: ThreadsSynchronization()
13: for l = j to j + t1 with step size = t3 do
/* Load a tile of A into registers*/
14: A1:t3 ← A[global tid, l : l + t3 − 1]
15: C1:t2 + = A1:t3 × currB[l : l + t3, 1 : t2]
16: end for
17: end for
18: C[global tid, p : p + t2 − 1]← C1:t2
19: end for
Algorithm 3: TSM2R with shared memory.
for adjusting the performance and will be discussed in
later sections.
3.1.4. Optimizing Use of Shared Memory
Although fast, elements in shared memory still need
to be loaded into registers before using [29]. Its access-
ing speed can affect the overall performance. Shared
memory is divided into several same-sized memory
banks for fast parallel accesses. Different threads ac-
cessing different memory banks can be done simultane-
ously. So, total b memory banks can speedup overall
shared memory throughput by up to b times compared
to the throughput of one single memory bank. However,
if x threads in the same warp access different data from
the same memory bank, x-way bank conflict would oc-
cur and each request is processed sequentially, which
dramatically reduces the accessing throughput by a fac-
tor of 1/x.
In our algorithm, threads in the same thread block
load data from global memory into shared memory col-
umn by column to enable fast coalesced global memory
access. Then threads access data from shared memory
row by row during computation. How we store elements
in shared memory will affect how these elements are ac-
cessed from memory banks, which affects the through-
put of shared memory. We have two ways of storing
a tile of matrix B in shared memory: column-major
storage and row-major storage. To choose between the
5
Bank 0
Bank 1
Bank 2
Bank 3
Bank 28
Bank 29
Bank 31
Bank 30
Bank 0
Bank 1
Bank 2
Bank 3
Bank 28
Bank 29
Bank 31
Bank 30
0
1
2
3
28
29
30
32
33
34
35
60
61
62
0
1
2
3
28
29
30
32
33
34
35
60
61
62
31 63 31 63
0
0
1
1
14
14
15
16
16
17
17
30
30
31
32
32
33
33
46
46
47
48
48
49
49
62
62
63
15 31 47 63
… …
No bank conflict 2-way bank conflict
Thread 0
Thread 1
Thread 2
Thread 3
Thread 28
Thread 29
Thread 30
Thread 31
Thread 0
Thread 16
Thread 1
Thread 17
Thread 14
Thread 30
Thread 15
Thread 31
Figure 1: Comparing column-major (left) with row-major (right) stor-
age for storing a 64 × 2 tile of matrix B in shared memory. Blue
and yellow squares represent elements in the first and second column.
When one warp of 32 threads accessing 32 elements in one column
(e.g. element 0 to 31 of the first column), the column-major storage
brings no bank conflict and row-major storage brings 2-way bank con-
flict, which reduces throughput by half.
two ways, we need to analyze and compare which way
brings the least overall bank conflict. We assume the
size of one tile of matrix B is t1 × t2 and t1 is the multi-
ply of total number of memory banks b for simplicity.
For column-major storage, elements (32-bit words or
64-bit words) in the same column of one tile of ma-
trix B are stored in successive memory banks. So, for
shared memory with b memory banks, every t1 elements
of one column are stored in b different successive mem-
ory banks with each bank stores at most t1b elements and
is accessed by at most warp sizeb threads at the same time,
which may potentially cause bank conflict if warp sizeb is
greater than one.
For row-major storage, elements in the same row of
matrix B are stored in successive memory banks. So,
elements of the same column are stored in bt2 different
banks with each bank stores t1×t2b elements from one col-
umn. Since each bank has t2 times more elements from
one column, totally each bank has at most t2 times more
thread accessing at the same time: warp sizeb × t2, which
may also potentially cause bank conflict.
On modern NVIDIA GPUs, the warp size is fixed to
32 and total number of banks is also 32 [28], so column-
major storage does not cause bank conflict, since each
bank can only have up to one thread accessing. Row-
major storage can cause up to t2-way bank conflict,
which decreases overall shared memory throughput to
1
t2
of the peak throughput. As shown in Figure 1,
we load a 64 × 2 matrix tile into shared memory us-
ing column-major storage (left) and row-major storage
(right). When using column-major storage, threads in
one warp all access different banks, so no bank con-
flict occurs. But when using row-major storage, 32 el-
ements are stored in 16 banks causing 2-way bank con-
flict. When accessing elements in shared memory for
computation, threads in a warp all access the same el-
ement at a time in our algorithm. Although multiple
threads are accessing one bank, they are accessing the
same element, so one broadcast is initiated, which does
not cause bank conflict. It is the same for both stor-
age styles. So, we choose column-major storage as it
brings no bank conflict and potentially brings the high-
est shared memory throughput.
3.1.5. Overlapping Computation and Memory Access
Latency
During execution, for each instruction issuing mo-
ment, each warp scheduler picks an eligible warp and
send it to the corresponding component for execution.
A warp becomes eligible only if all operands of its next
instruction are ready. However, if a warp is loading
data from global memory, it would take several hun-
dred cycles before it can be ready for execution. To hide
this long latency, we can either increase the number of
threads reside in each SM to ensure there always ex-
ist eligible warps [30] or put independent instructions in
between data loading and data consuming operations, so
that warps are also eligible for execution during mem-
ory loading time. The first approach requires us to ad-
just the on-chip resource usage of each thread block. We
will leave that discussion in the next section. In this sec-
tion, we aim to add independent instructions in between
data loading and data consuming operations.
A shown in Algorithm 3, Line 11 and 14 load data
from global memory and Line 15 consumes data once
data is loaded. However, due to data dependency, there
is no independent instruction in between, so once each
warp issues global memory access requests, it must wait
for the requested elements to be ready before it can pro-
ceed to computation.
So, to add independent instructions, we use data
prefetching to mix the data loading and consuming be-
tween neighbor iterations. Specifically, instead of let-
ting each iteration loads data that is going to be used for
current iteration, we let the data needed for current iter-
ation to be loaded by the previous iteration, so that its
calculation will not be blocked by data loading (since
the data are ready). When doing calculation, it also
loads data that is going to be used for the next itera-
tion. By overlapping data loading and computation, we
can significantly improve memory bandwidth and SM
utilization. We apply data prefetching to both matrix A
and B.
As shown in Algorithm 4, we design our TSM2R with
data prefetching. Note that global tid and local tid rep-
resent the (global) thread ID in the grid and the (local)
thread ID in the block, respectively. In Line 2 and 3,
we allocate two sets of t3 registers for storing current
tile of elements of matrix A and next tile of element of
matrix A for prefetching. In Line 4 and 6, we allocate
6
LD C
LD NextB
LD NextA
Compute
LD NextA
Compute
ST CThreads Sync.
LD NextB
LD NextA
Compute
LD NextA
Compute
Threads Sync.
LD NextB
LD NextA
Compute
LD NextA
Compute
Threads Sync.
Figure 2: Example workload of one iteration of our optimized TSM2R
with data prefetching.
t2 registers for data prefetching of elements in matrix B,
and allocate t1 × t2 for storing currently loaded tile of
matrix B. Note that we cannot store current tile of ma-
trix B in registers, because elements in matrix B need to
be shared between threads during computation.
Require: input matrix A (m × k) and B (k × n), output matrix
C (m × n)
1: t1 ← tile size B, t2 ← tile size C, t3 ← tile size A
2: Register: currA1, currA2,...,currAt3
3: Register: nextA1, nextA2,...,nextAt3
4: Register: nextB1, nextB2,...,nextBt2
5: Register: C1, C2,...,Ct2
6: Shared Memory: currB with size t1 × t2
7: Threads per thread block← t1
8: Total thread blocks← m/t1
9: for p = 1 to n with step size = t2 do
10: C1:t2 ← C[global tid, p : p + t2 − 1]
11: currB[local tid, 1 : t2]← B[local tid, p : p + t2 − 1]
12: currA1:t3 ← A[global tid, 1 : t3]
13: for j = 0 to k with step size = t1 do
14: ThreadsSynchronization()
/*prefetch the next tile of B into registers*/
15: if j + t1 < n then
16: nextB1:t2 ← B[ j + t1 + local tid, p : p + t2 − 1]
17: end if
18: for l = j to j + t1 with step size = t3 do
/*prefetch the next tile of A into registers*/
19: if l + t3 < n then
20: nextA1:t3 ← A[global tid, l + t3 : l + t3 + t3 − 1]
21: end if
22: C1:t2 + = currA1:t3 × currB[l : l + t3, 1 : t2]
23: currA1:t3 ← nextA1:t3
24: end for
25: ThreadsSynchronization()
26: currB[local tid, 1 : t2]← nextB1:t2
27: end for
28: C[global tid, p : p + t2 − 1]← C1:t2
29: end for
Algorithm 4: TSM2R with shared memory and data
prefetching.
Before the core computation iteration (Line 13-27),
we pre-load current tile of matrix A and B into regis-
ters and shared memory (Line 11 and 12), so that com-
putation can start immediately as soon as we enter the
computation loop without being blocked by any data de-
pendency. The main computation resides in Line 22. To
overlap computation with memory accesses, we initiate
A
B
C
 Thread 0
 Thread 1
 Thread 2
 Thread 3
 Thread 0
 Thread 1
 Thread 2
 Thread 3
registers
holding 
current 
tile of A
shared mem.
holding current 
tile of B
t1
t2
t3
prefetch next tile A 
to registers
next tile becomes 
current tile in next iteration 
prefetch next 
tile B to registers
load next tile to 
shared mem.
before next 
iteration.
{one thread block
{one thread block
calculation on 
current tile
Figure 3: Matrix view of tall-and-skinny matrix matrix multiplication
with data prefetching.
loading for the next tile before the computation (Line
16 for matrix B and Line 20 for matrix A). We use two
loops for loading matrix A and B, because we want to
have the flexibility to adjust loading pace (tile size) dif-
ferently for the two matrices. We will discuss this in the
next subsection. Figure 2 and 3 show one iteration of
our optimized TSM2R with data prefetching. LD C and
ST C represent loading initial values from matrix C and
storing final results back to matrix C. Each iteration we
show three sub-iterations for loading matrix B. As we
can see, we load pre-load the next tile of matrix B in
concurrent with computation to improve memory band-
width utilization. A threads synchronization is inserted
in the end of each iteration. For the inner most iteration,
we do the actual computation and pre-load elements
from matrix A each time. Please note that the length
of each rectangle does not accurately represent the ex-
act execution time length and the ratio between number
of LD nextA and LD nextB is not necessarily two in
actual computation. Also, we show one thread block
with four threads only for illustration proposes. As we
will discuss in the next subsection that different param-
eter values can affect the length of each part and the ra-
tio between number of LD nextA and LD nextB. Espe-
cially on the execution time of LD nextA and Compute,
which will affect the characteristic of computation (i.e.
memory-bound or compute-bound). Also, for simplic-
ity, we ignore the part moving data between next tile
storage to current tile storage of each iteration.
3.1.6. Parameters Definition
In Algorithm 3 and Algorithm 4, we introduced
three adjustable parameters: t1, t2, and t3. In this sec-
tion, we first discuss how each parameter controls the
computation of our TSM2R. Then, we introduce our per-
formance model that estimates how certain performance
7
metrics change with these parameters. Finally, we ex-
plain our strategies of choosing values for these param-
eters in order to achieve high GPU resources utilization
and optimized overall performance. Please note that the
following discussions are all based on Algorithm 4.
3.1.7. Behaviors of Parameters
We first list the behaviors of each parameter below:
• t1 specifies the number of rows of one tile of matrix
B. To maximize use of available active threads and
to avoid any inefficient thread execution caused by
warp divergence, we let all threads in each thread
block participating in fetching elements of matrix
B. For fast coalesced global memory access, we
let each thread fetch one row, so t1 is also the total
number of threads in each thread block. Also, since
we let total m threads working on the computation,
the total number of thread blocks can be calculated
as: m/t1.
• t2 specifies the number of elements in matrix C
that each thread is working on at a time. It is used
to divide the overall workload into several smaller
workloads that are processed iteratively by each
thread. Smaller workload makes each thread’s
SM resource usage smaller, which allows us to
keep higher SM occupancy. However, dividing the
workload means we need to load matrix A repeat-
edly for each small workload. So, there is a trade-
off. t2 also affects the ratio between total number
of memory fetches and computation operations in
core part of our algorithm, which allows us to ad-
just the computation to be compute or memory-
bound (will be discussed later in detail).
• t3 specifies the number of elements in matrix A
that each thread fetches at a time. Since elements
fetches are independent to each other, they can be
done without blocking each other, so t3 can be used
to adjust the memory loading concurrency.
3.1.8. Performance Metrics Estimation
In this section, we introduce our parameters based
performance model that is used to estimate three im-
portant performance metrics: SM occupancy, memory
bandwidth utilization and computing power utilization.
These estimations will be used for optimizing the over-
all performance.
• Max SM occupancy estimation
With these parameters we can calculate the max
occupancy of each SM, which is defined as max
number of active threads per SM. (Some works
also use max number of warps, which is similar to
ours. We found that using max thread is more con-
sistent across our performance models. We also
choose thread block size to be the dividend of this
value to ensure expected number of threads are
active.) This occupancy is mainly bound by the
maximum hardware allowable number of threads
(HW MAX) and on-chip memory utilization per
thread. We first calculate the total number of reg-
isters utilized per thread. Since register utilization
can potentially be optimized by the nvcc compiler,
we use maximum number of registers to estimate
this value. First of all, there is a relatively fix
amount of registers uses for CUDA initial setup,
and we represent this amount as C. We get its
amount through offline profiling. Then, we need
two sets of t2 registers for storing elements of ma-
trix B for both next tile fetching and current tile
calculation. Please note that although the current
tile of matrix B is stored in shared memory, it still
needs to be transferred to registers for calculation.
Next, we need t2 registers for keeping intermediate
results of matrix C. Finally, we need two sets of t3
registers for storing elements of matrix A for both
next tile fetching and current tile calculation. So,
the total number of registers is:
Rthread = (t2 × 3 + t3 × 2) × bytes per elementbytes per register + C.
As for shared memory, through shared memory is
allocated per thread block, we calculate the av-
erage amount of shared memory that each thread
uses for consistent calculation here. Since the
size of allocated shared memory per thread block
is t1 × t2, and as we will discuss earlier that we
set t1 = threads per threadblock, the amount of
shared memory allocated for each thread on aver-
age is:
S thread = t2 × bytes per element.
So, the max SM occupancy can be calculated as:
MaxOccupS M = min(HW MAX,
RS M
Rthread
, S S MS thread ).
In the above calculation, RS M and S S M stand for
the max available registers and shared memory per
SM.
• Max memory bandwidth utilization estimation
Next, we estimate the max memory bandwidth uti-
lization of our algorithm when the computation
is memory-bound. In this case, loading elements
8
of matrix A dominates the computation instead of
floating point calculations in our algorithm. So, we
can estimate max memory bandwidth utilization
using the maximum number of concurrent global
memory accesses per SM. It can be calculated as:
Concurrentmem ≈ MaxOccupS M × t3.
Please note that, we only consider the memory ac-
cesses to matrix A here for simplicity. Since the
majority memory accesses are for matrix A, this
only brings minor inaccuracy. Then, similar to
[30, 31] we calculate the least number of concur-
rent memory accesses per SM needed to achieve
max memory bandwidth utilization using Little’s
Law:
Throughputmax mem = Peak Band.# o f S M×core clock ,
Concurrentmax mem = latencymem × Throughputmax mem.
The latencymem is the average global memory ac-
cess latency, which is considered as a constant in
our model and is obtained through offline profil-
ing. The estimated memory bandwidth utilization
is:
Utilmem =
Concurrentmem
Concurrentmax mem
.
• Max computing power utilization estimation
Next, we estimate the max computing power uti-
lization of our algorithm when the computation is
compute-bound. In this case, floating point calcu-
lation dominates the computation instead of mem-
ory accesses in our algorithm. So, we can estimate
max computing power utilization using the max-
imum number of concurrent floating point opera-
tions per SM. It can be calculated as:
Concurrentcomp = MaxOccupS M × t3 × t2.
Then, also similar to [30] we calculate the least
number of concurrent floating point operations per
SM needed to achieve max computing power uti-
lization using Little’s Law:
Throughputmax comp =
Peak Per f .
# o f S M×core clock ,
Concurrentmax comp = latencycomp × Throughputmax comp.
The latencycomp is the average latency of floating
point operations in our calculations, which is con-
sidered as a constant in our model and is obtained
through offline profiling. So, the estimated com-
puting power utilization is:
Utilcomp =
Concurrentcomp
Concurrentmax comp
.
• Determine compute-bound or memory-bound
Given parameters and GPU specification, we can
determine whether the current computation is
memory or compute-bound. This is mainly de-
termined by the inner most loop (Line 20-24) of
Algorithm 4. The memory loading instructions
(Line 21) are overlapping with computation (Line
23). Since Line 24 depends on memory loading re-
sults, it serves as an implicit synchronization point
for memory load and computation. The time takes
for the two parts will determine whether the current
computation is compute-bound or memory-bound.
So, we first estimate the time takes for computation
and memory access as follows:
timecomp =
t3×t2
Peak Per f .×# o f S M×OccupancyS M ,
timemem =
t3×bytes per elem.
Peak Band.×# o f S M×OccupancyS M .
Then, by comparing the two time costs, we
can determine whether the current computation is
compute-bound or memory-bound.
r = timecomptimemem =
t2
bytes per elem. × Peak Band.Peak Per f .
As we can see, when r is greater than one, the com-
putation is compute-bound. Otherwise, the compu-
tation is memory-bound. Also, since we divide the
original workload into several smaller workloads
using t2, this ratio is determined by t2. By adjusting
t2, the actual computation can be shifted between
compute and memory-bound. The boundary be-
tween the two cases can be calculated by setting
the ratio r = 1, so we get a threshold for t2:
tthreshold2 =
Peak Per f .
Peak Band. × bytes per elem.
Similarly, we can also estimate computation char-
acteristic of the original problem, in which the
workload is not divided into smaller workloads. In
this case, t2 is always fixed to k. So, by compar-
ing k with tthreshold2 we can estimate the computa-
tion characteristic. If k is greater than tthreshold2 , the
original problem is compute-bound; otherwise, it
is memory-bound. It can be easily seen, depending
on the value of t2 and k, the computation charac-
teristic of the current problem and original prob-
lem can be different, which can affect the overall
performance. We discuss this in later part of this
section.
3.1.9. Deciding Parameters
When choosing parameters, the first thing we should
determine is whether we should optimize for compu-
tation or memory bandwidth. This is determined by
9
whether the given TSM2R computation on the given GPU
should be compute or memory-bound. In the last sec-
tion, we proposed to estimate this characteristic by com-
paring n and tthreshold2 , so that we can accordingly adjust
parameters to optimize the computation.
In the case where original problem is memory-bound
(n ≤ tthreshold2 ), we need to keep the actual computa-
tion to be memory-bound also (let 1 ≤ t2 ≤ n) and
optimize for memory bandwidth utilization. On the
other hand, if the original problem is compute-bound
(n > tthreshold2 ), we first try to keep the actual computa-
tion to be compute-bound too (let tthreshold2 ≤ t2 ≤ n)
and optimize of computing power utilization. How-
ever, in the case where tthreshold2 is too high on the given
GPU, we also try to optimize it for memory-bound (let
1 ≤ t2 ≤ tthreshold2 ) and output the result parameters that
deliver better performance.
Algorithm 5 shows the parameter optimization pro-
cedure for t2 and t3. We first determine the computation
characteristic in Line 1. If it is memory-bound, we opti-
mize for total time cost to access needed elements from
global memory (Line 4). Otherwise, we optimize for
either total computation time (Line 9) or memory ac-
cess time (Line 14). Please note that we only count the
total amount of memory accesses to matrix A for sim-
plicity, since total accesses to matrix B is much less than
matrix A, so this simplification only brings minor inac-
curacy. Also, considering the total accesses to matrix B
would bring one additional parameter (t1), which can be
hard to optimize since t1 is also related to threads orga-
nization that is hard for modeling-based estimation. The
memory bandwidth utilization term (Utilmem) and com-
puting power utilization term (Utilcomp) is calculated us-
ing the equation mentioned before. Since we have two
parameters (t2 and t3) in our optimization target, we use
Gradient Descent (GD) to do the optimization. In GD,
based on our experience, we set initial value of both t2
and t3 to be 1, and step size to be 0.1. The stop thresh-
old is set to be 1e-4, since we do not need very accurate
precision. The final t2 and t3 are rounded to the nearest
integers.
To optimize t1, we found it only controls the number
of threads in each thread block. Since the total number
of threads is fixed to m, t1 only determines how these
threads are organized into thread blocks. There is trade-
off: if t1 is large, the total number of accesses to el-
ements of matrix B is reduced, however, large thread
block means large number of threads need to participate
in the same synchronization, which may have impact on
performance. On the other hand, if t1 is small, the total
number of accesses to elements of matrix B higher, but
the smaller thread block makes scheduling more flexi-
1: if n ≤ tthreshold2 then
2: Total memory ≈ m × k × nt2 × bytes per elem.
3: Bandwidth = PeakBand. × Utilmem
4: Use Gradient Descent to Optimize (t2 and t3):
Time = Total memoryBandwidth with 1 ≤ t2 ≤ n and 1 ≤ t3
5: Output: t2 and t3
6: else
7: Total f lops = m × k × n × 2
8: Compute power = PeakPer f . × Utilcomp
9: Use Gradient Descent to Optimize (t2 and t3):
Time1 =
Total f lops
Compute power with t
threshold
2 ≤ t2 ≤ k and 1 ≤ t3
10: t2(time1) ← t2
11: t3(time1) ← t3
12: Total memory ≈ n × n × kt2 × bytes per elem.
13: Bandwidth = PeakBand. × Utilmem
14: Use Gradient Descent to Optimize (t2 and t3) in
Time2 =
Total memory
Bandwidth with 1 ≤ t2 ≤ tthreshold2 and 1 ≤ t3
15: t2(time2) ← t2
16: t3(time2) ← t3
17: if Time1 < Time2 then
18: Output: t2(time1) and t3(time1)
19: else
20: Output: t2(time2) and t3(time2)
21: end if
22: end if
Algorithm 5: Parameter optimization for TSM2R.
ble and efficient. It is hard to determine the optimum
value of t1 theoretically, so we use offline profiling to
choose the best value. Specifically, once t2 and t3 are
determined, we benchmark different t1 values that can
divide MaxOccupS M as mentioned earlier, and choose
the t1 for the best performance. Although t1 seems to
have direct effect on shared memory allocation (or max
SM occupancy), it actually has limited impact on it,
since we fix the amount of shared memory per thread
(S thread = t2 × bytes per element).
3.2. Design of TSM2L
The algorithm proposed in the above sections—
TSM2R—is optimized for the case where a large regular-
shaped matrix multiples a tall-and-skinny matrix. In this
section, we first propose a new algorithm TSM2L to han-
dle the case where a tall-and-skinny matrix multiplies a
small regular-shaped matrix. For example, an input ma-
trix A of size 102400× 4 multiples an input matrix B of
size 4×4, where the tall-and-skinny matrix is multiplied
on the left. We then introduce two different optimization
approaches to overcome the bottleneck that this kind of
tall-and-skinny input poses.
3.2.1. Performance Bottlenecks
We start by adapting our previous algorithm TSM2R to
handle the new case without further optimization. How-
ever, applying the algorithm to this case reveals a bottle-
neck. We evaluate TSM2R on this case with matrices of
15360 × k and k × 16—where k varies—on an NVIDIA
10
Figure 4: Memory bandwidth usage of very small values of k with
double precision (m=15360, n=16).
Require: input matrix A (m × k) and B (k × n)
Require: output matrix C (m × n)
1: tc f ← thread count f actor
2: t1 ← tile size B, t2 ← tile size C, t3 ← tile size A
3: Register: currA1, currA2,...,currAt3
4: Register: nextA1, nextA2,...,nextAt3
5: Register: nextB1, nextB2,...,nextBt2
6: Register: C1, C2,...,Ct2
7: Shared Memory: currB with size t1 × t2
8: Threads per thread block← t1
9: Total thread blocks← m/(t1 × tc f )
10: Total threads← m/tc f
11: for r = global tid to m with step size = Total threads do
12: Perform Line 9-29 of Algorithm 4 with all
occurrences of the identifier global tid replaced by the
identifier r
13: end for
Algorithm 6: Proposed optimization-1 for TSM2L.
Tesla V100 GPU. As shown in Figure 4, as the inner di-
mension k decreases, the memory bandwidth usage also
decreases.
To explain these results, we expand upon the perfor-
mance model proposed in Section 3.6.2. This model
assumes that the maximum theoretical occupancy is
always achieved throughout the computation. How-
ever, on the one hand, since the algorithm loops k × nt2
times, and k is very small, each thread does not per-
form enough workload to hide the latency, resulting in
low occupancy. On the other hand, the program issues
much fewer global memory reads than the case with
large k, resulting in less efficient memory usage. There-
fore, TSM2R performs in a latency-bound mode (nei-
ther compute-bound nor memory-bound) on this input
case (i.e., a tall-and-skinny matrix multiplying a small
matrix), as indicated in a prior study [30].
3.2.2. Proposed Optimizations
Based on these observations, we further design two
optimizations for TSM2L. Both optimizations intend
to trade warp latency for memory-access latency by
launching fewer threads. As a result, each thread per-
forms more computation, and the accumulated warp la-
tency can be replaced by the memory-access latency.
Since we are launching fewer threads than the number
of rows of matrix A, we must divide it into several hor-
izontal tiles. Here we introduce a new parameter tc f to
represent the tile number of matrix A in our algorithm.
We launch mtc f threads in the new kernel.
The first optimization involves dividing the multipli-
cation into tc f parts, where each part consists of mul-
tiplying a mtc f -row tile of matrix A by the entire matrix
B. In essence, this optimization repeats the TSM2R algo-
rithm once for each tile of matrix A. In this optimiza-
tion, each element of matrix A is still accessed for nt2
times, though each element of B is loaded for tc f × nt2
times, which is tc f times more than that in TSM2R. We
describe the detail in Algorithm 6.
Require: input matrix A (m × k) and B (k × n), output matrix C
(m × n)
1: tc f ← thread count f actor
2: t1 ← tile size B, t2 ← tile size C, t3 ← tile size A
3: Register: currA1, currA2,...,currAt3
4: Register: nextA1, nextA2,...,nextAt3
5: Register: nextB1, nextB2,...,nextBt2
6: Register: C1, C2,...,Ct2
7: Register: nextC1, nextC2,...,nextCt2
8: Shared Memory: currB with size t1 × t2
9: Threads per thread block← t1
10: Total thread blocks← m/(t1 × tc f )
11: Total threads← m/tc f
12: for p = 1 to n with step size = t2 do
13: {Load currB and currA as described in Line 12 and 13 of
Algorithm 4}
14: for j = 0 to k with step size = t1 do
15: ThreadsSynchronization()
/*prefetch the next tile of B into nextB*/
16: C1:t2 ← C[global tid, p : p + t2 − 1]
17: for r = global tid to m with step size = Total threads do
/*prefetch the next tile of C into registers*/
18: if r + Total threads ≤ m then
19: nextC1:t2 ← C[r+Total threads, p : p + t2 − 1]
20: end if
21: Perform Line 19-25 of Algorithm 4, with all occurrences
of the identifier global tid replaced by the identifier r
/*store the sum so far in C*/
22: C[r, p : p + t2 − 1]← C1:t2
23: C1:t2 ← nextC1:t2
24: end for
25: ThreadsSynchronization()
26: Load the next tile of B into currB
27: end for
28: end for
Algorithm 7: Proposed optimization-2 for TSM2L.
The second optimization is to interleave the compu-
tation of the tiles, rapidly loading elements from differ-
ent tiles of matrix A and loading and storing interme-
diate sums in matrix C. Once a t1 × t2 tile of matrix
B is loaded, the intermediate results are loaded, com-
puted, and stored for each tile of matrix A. The C regis-
11
(a) Speedup (single) (b) Mem. bw. util. (single)
(c) Speedup (double) (d) Mem. bw. util. (double)
Figure 5: Performance comparison with single and double precision
using different tc f (m=107, k=n=16).
ter set is loaded with values from matrix C which con-
tains the product accumulated so far. After the com-
putation is finished, the values are stored to matrix C
again as the next tile of matrix A is prepared for com-
putation. To quickly switch between tiles, values from
matrix C are prefetched in addition to the prefetching
already described in Algorithm 4. A new set of regis-
ters, nextC[1...t2], is used to store the values of C asso-
ciated with the next tile of A. The elements of A and
B are accessed only nt2 times, though each element of
C is accessed kt1 × nt2 times. However, since we do not
achieve either high occupancy or high memory band-
width in this case, we are not as concerned about issuing
more memory read instructions. We describe the detail
in Algorithm 7.
Figure 5 illustrates the effects of the two optimiza-
tions on both performance and memory bandwidth us-
age. As fewer and fewer threads are launched, the im-
pact of warp latency is replaced with that of different
kinds of latency such as memory bandwidth latency. As
a result, computation time decreases and memory ac-
cess bandwidth increases in this case. Note that for this
case the number of threads launched must be reduced to
at least 164 of m before a significant decline in speedup or
memory bandwidth utilization occurs. This is attributed
to that the kernel must manage so many threads that so
so little work when m = 107.
Therefore, we must choose an appropriate tc f , deter-
mining the number of threads to launch for each ker-
nel. If the algorithm is launched with an insufficient
number of threads, the parallelism becomes too low and
hence the performance would suffer. If the algorithm is
launched with too many threads, the performance would
be impacted by warp latency just as it does in the naive
adaptation of TSM2R. We thus must determine an appro-
priate tc f for each target system with offline profiling.
4. Experimental Evaluation
4.1. Experiments Setup
We implement our TSM2R and TSM2L using CUDA
C for both single and double floating-point input. We
disable compiler auto unrolling for better control on
register allocation. We evaluate our optimized im-
plementations on two heterogeneous testbed clusters,
which are Darwin [32] at Los Alamos National Lab-
oratory and PantaRhei [33] at the University of Al-
abama. We run each test on a single GPU node with
single GPU. We conduct our tests on four different
commonly used modern Nvidia GPUs with four differ-
ent micro-architectures: Kepler, Maxwell, Pascal, and
Volta. For Kepler GPU, we use Tesla K40c, which has
1430 GFLOPS peak double floating-point performance
and 288 GB/s memory bandwidth. For Maxwell GPU,
we use Tesla M40, which has 213 GFLOPS peak dou-
ble floating-point performance and 288 GB/s memory
bandwidth. For Pascal GPU, we use Tesla P100, which
has 4600 GFLOPS peak double floating-point perfor-
mance and 720 GB/s memory bandwidth. For Volta
GPU, we use Tesla V100, which as 7500 GFLOPS peak
double floating-point performance and 900 GB/s mem-
ory bandwidth.
For comparison, we compare our TSM2R and TSM2L
with GEMM in the current latest cuBLAS library [21]
and latest BLASX library [34]. Also, we try to
compare our work with KBLAS [35], however since
its GEMM kernel is based on cuBLAS, its perfor-
mance is identical to cuBLAS, so we omitted its re-
sults. Each test is repeated multiple times to reduce
noise and timed using CUDA Events API. We mea-
sure performance by calculating the performance of
FAMD instructions. We also measure the global mem-
ory throughput using nvprof on the command line with
--metrics gld throughput option. In addition, we
use --metrics gld efficiency option to verify if
100% global memory access efficiency is achieved in
our optimization.
Our input matrices are initialized with random
floating-point numbers between 0 and 1. We test the
multiplication of a large squared matrix and a tall-and-
skinny matrix for TSM2R and the multiplication of a
tall-and-skinny matrix and a small squared matrix for
TSM2L. Specifically, for TSM2R, the size of the large
regular-shaped matrix is from 10240×10240 to 30720×
30720, and the size of the tall-and-skinny matrix ranges
12
from 10240 × n to 30730 × n with n equals 2, 4, 8, and
16. For TSM2L, the size of the tall-and-skinny matrix
ranges from 104× k to 107× k with k equals 8 or 16, and
the size of the small regular-shaped matrix is 8 or 16.
4.2. Evaluation of TSM2R
In this section, we first evaluate the performance
of TSM2R with different input sizes and compare it
with state-of-the-art libraries on K40c, M40, P100, and
V100.
4.2.1. Tests with Different Optimization Combinations
We use the GEMM in cuBLAS as comparison base-
line. We apply different combinations of optimization
in TSM2R and compare them with GEMM in cuBLAS
and BLASX. We have totally four versions of TSM2R:
• V0: the most straightforward inner product version
as described in Algorithm 1;
• V1: the outer production version as in Algorithm
2. This version reduces the total number of global
memory accesses from algorithm level;
• V2: based on outer production version as in Algo-
rithm 2, we add the use of shared memory, leading
to more efficient global memory access to matrix
B;
• V3: based on the outer production version as in Al-
gorithm 2 and the use of shared memory, we add
data prefetch. This is the best version of our op-
timized implementation, which is described in Al-
gorithm 4.
Limited by the page space, we only show the re-
sult on K40c GPU. Our optimization behaves similar on
other GPUs. To evaluate our optimization, we need to
determine by which resource our program is bounded.
Since, tthreshold2(k40c) ≈ 40, the computation is always mem-
ory bound for the given n values. The optimized param-
eters are: t2 = n, t3 = 4, and t1 = 128. The param-
eters are only applied to the last to versions of TSM2R.
Figure 6 shows the speedup of different versions in sin-
gle and double precision. From the results, we can see
that the TSM2R-V0 suffers from really poor performance
due to the requirement of much higher number of global
memory accesses in the inner product version. The
TSM2R-V1, on the other hand, significantly improve the
performance compared to TSM2R-V0 (2.2x∼4.7x faster),
since it requires much lower number of global memory
accesses. TSM2R-V2 further improves the efficiency of
global memory access to matrix B, which plays a vital
(a) single (2) (b) double (2) (c) single (4) (d) double (4)
(e) single (8) (f) double (8) (g) single (16) (h) double (16)
Figure 6: Speedup comparison with single and double precision on
K40c (n = 2, 4, 8, 16).
(a) single (n=2) (b) double (n=2) (c) single (4) (d) double (4)
(e) single (8) (f) double (8) (g) single (16) (h) double (16)
Figure 7: Memory bandwidth utilization comparison on K40c (n = 2,
4, 8, 16).
role in the overall performance. In addition, the shared
memory shares tiles of matrix B between threads within
a thread block also reduced the total number of mem-
ory accesses to matrix B. This leads to additional 1.1x
to 2.1x speedup. Finally, the data prefetch introduced
in TSM2R-V3 further mitigate the memory access bot-
tleneck, which brings additional 1.3x∼3.5x speedup.
4.2.2. Memory Throughput Analysis
Figure 7 shows the memory throughput of
TSM2R-V3, cuBLAS and BLASX in both single
and double precision on K40c GPU. Result show that
TSM2R brings 12.5%∼26.6% (17.6% on average) im-
provement on memory bandwidth utilization compared
with cuBLAS and 20.1%∼38.8% (24.3% on average)
improvement compared with BLASX.
4.2.3. Tests on Different Micro-architectures
In addition to Kepler micro-architecture, we also con-
duct test on newer Maxwell, Pascal, and Volta GPUs.
Similar as with Kepler GPU, we get tthreshold2(m40) ≈ 6 and
tthreshold2(p100) ≈ 50. Tesla M40 has slower computing power,
13
(a) Speedup (b) Comp. power util.
Figure 8: Speedup and computing power utilization comparison with
double precision on M40 (n = 16).
(a) Speedup (b) Mem. bw. util.
Figure 9: Speedup and memory bandwidth utilization comparison
with double precision on P100 (n = 16).
so the computation with input with n = 16 is com-
pute bound. Our parameter optimization procedure also
output parameters in favor of computing optimization:
t2 = 8, t3 = 4, and t1 = 256. As shown in Figure 8, our
optimized implementation achieves 1.1x -1.9x (1.47x
on average) speedup on Tesla M40 with 7% to 37.3%
(20.5% on average) computing power usage improve-
ment compared to the GEMM function in cuBLAS 9.0.
P100 has much stronger computing power, as we can
see the computation with input with n = 16 is mem-
ory bound. Our parameter optimization procedure also
output parameters in favor of memory optimization:
t2 = 4, t3 = 4, and t1 = 128. As shown in Figure
9, our optimized implementation achieves 1.1x∼3.0x
(2.15x on average) speedup on Tesla P100 with 17% to
47.6% (34.7% on average) memory bandwidth utiliza-
tion improvement compared to the GEMM function in
cuBLAS.
We also test TSM2R on the Nvidia Tesla V100 GPU
with Volta micro-architecture. As shown in Figure 10,
speedups of up to 1.35x are achieved on single pre-
cision, while speedups of up to 3.2x are achieved on
double precision. Note that the speedup for n = 16 on
single precision is slower than cuBLAS. This is due to
cuBLAS’s single-precision GEMM being optimized for
32 × 32 matrices; thus we no longer target this case. Fi-
nally, we note that the kernels achieve higher memory
bandwidth utilization on V100 than on other GPUs, as
shown in Figure 11. This is partly attributed to the im-
provements of Volta over previous micro-architectures.
More specifically, V100 with improved HBM2 memory
(a) single (2) (b) double (2) (c) single (4) (d) double (4)
(e) single (8) (f) double (8) (g) single (16) (h) double (16)
Figure 10: Speedup comparison with single and double precision on
V100 (n = 2, 4, 8, 16).
(a) single (2) (b) double (2) (c) single (4) (d) double (4)
(e) single (8) (f) double (8) (g) single (16) (h) double (16)
Figure 11: Memory bandwidth utilization with single and double pre-
cision on V100 (n = 2, 4, 8, 16).
(a) Speedup (b) Mem. bw. util.
Figure 12: Performance comparison with double-precision rectangu-
lar input on V100 (m = 15360, n = 16).
allows more workloads to obtain up to 19% more mem-
ory bandwidth utilization than Pascal GPUs, according
to its whitepaper [36].
4.2.4. Tests on Non-squared Input
We also evaluate TSM2R with rectangular input ma-
trices (m × k) on V100, where k is smaller than m by
certain small integer factors. Evaluating this case re-
veals very little performance impact, as demonstrated
in Figure 12. Although smaller than m, k is still large
enough to ensure the kernel to follow our performance
model. The memory bandwidth utilization of the kernel
remains similar to the case where m = k, and the per-
14
(a) single (8) (b) double (8)
(c) single (16) (d) double (16)
Figure 13: Speedup comparison with single and double precision on
V100 (k = n = 8, 16).
(a) single (8) (b) double (8)
(c) single (16) (d) double (16)
Figure 14: Memory bandwidth utilization with single and double pre-
cision on V100 (k = n = 8, 16).
formance of the kernel scales linearly with the matrix
size.
4.3. Evaluation of TSM2L
We next evaluate the performance of TSM2L and com-
pare it with cuBLAS on V100. For TSM2L, we must
choose the variable tc f for each matrix input combina-
tion. We obtain these values through experiments that
vary tc f . As a result, for m = 104, 105, 106, 107, we
select tc f values as 1, 1, 2, and 8 for single precision
and values 1, 1, 1, and 4 for double precision. Consider-
ing two proposed optimizations for TSM2L, we have two
versions of TSM2L, i.e., TSM2L-Opt1 and TSM2L-Opt2
in total.
As shown in Figure 13, TSM2L can obtain speedups
over cuBLAS ranging from 1.1x∼3.5x in single preci-
sion and speedups from 1.0x∼1.7x in double precision.
TSM2L-Opt1 generally performs better on single preci-
sion input than TSM2L-Opt2, while TSM2L-Opt2 per-
forms better than TSM2L-Opt1 in several double preci-
sion cases. In addition, as shown in Figure 14, TSM2L
achieves memory bandwidth utilization of up to 55%
peak global memory bandwidth. In single precision,
TSM2L utilizes significantly more memory bandwidth
than cuBLAS, up to 40% more. However, in double
precision, TSM2L uses only slightly more memory band-
width in the case that k = n = 8, and in the case that k =
n = 16, cuBLAS uses more memory bandwidth. How-
ever, since TSM2L still outperforms cuBLAS, this can
be explained by inefficient memory use patterns in the
GEMM kernel.
5. Conclusion
In this work, we first analyze the performance bot-
tleneck of current GEMM in the latest cuBLAS li-
brary. We identify that current implementations lack
of full utilization of computing power or memory
bandwidth when the input shape is tall-and-skinny.
Then, we discovered the potential challenges of opti-
mizing tall-and-skinny GEMM since its workload can
vary between compute bound, memory bound, and la-
tency bound. Next, we propose two high-performance
GEMM algorithms—TSM2R and TSM2L—on GPUs for
tall-and-tinny input with several optimization tech-
niques focusing on GPU resource utilization. Finally,
experiment results show that our optimized implemen-
tations can achieve speedups tall-and-skinny GEMM
with diverse input sizes on modern GPUs.
Acknowledgements
This research was supported by the National Sci-
ence Foundation under Grants OAC-1948447, OAC-
2034169, OAC-2003624, and OAC-202042084. We
would like to thank the University of Alabama for pro-
viding the startup fund in supporting this work.
15
References
[1] MAGMA: Matrix Algebra on GPU and Multicore Architec-
tures, http://icl.cs.utk.edu/magma/.
[2] J. Chen, L. Tan, P. Wu, D. Tao, H. Li, X. Liang, S. Li, R. Ge,
L. Bhuyan, Z. Chen, GreenLA: green linear algebra software for
gpu-accelerated heterogeneous computing, in: SC16: Interna-
tional Conference for High Performance Computing, Network-
ing, Storage and Analysis, IEEE, 2016, pp. 667–677.
[3] L. Tan, S. Kothapalli, L. Chen, O. Hussaini, R. Bissiri, Z. Chen,
A survey of power and energy efficient techniques for high per-
formance numerical linear algebra operations, Parallel Comput-
ing 40 (10) (2014) 559–573.
[4] L. Tan, S. L. Song, P. Wu, Z. Chen, R. Ge, D. J. Kerbyson, Inves-
tigating the interplay between energy efficiency and resilience
in high performance computing, in: 2015 IEEE International
Parallel and Distributed Processing Symposium (IPDPS), IEEE,
2015, pp. 786–796.
[5] T. Dong, A. Haidar, P. Luszczek, S. Tomov, A. Abdelfattah,
J. Dongarra, Magma batched: A batched blas approach for small
matrix factorizations and applications on gpus, Tech. rep., Tech-
nical report (2016).
[6] A. Heinecke, G. Henry, M. Hutchinson, H. Pabst, Libxsmm: ac-
celerating small matrix multiplications by runtime code genera-
tion, in: SC16: International Conference for High Performance
Computing, Networking, Storage and Analysis, 2016.
[7] J. Chen, N. Xiong, X. Liang, D. Tao, S. Li, K. Ouyang, K. Zhao,
N. DeBardeleben, Q. Guan, Z. Chen, Tsm2: optimizing tall-
and-skinny matrix-matrix multiplication on gpus, in: Proceed-
ings of the ACM International Conference on Supercomputing
(ICS), 2019, pp. 106–116.
[8] I. S. Dhillon, Y. Guan, B. Kulis, Kernel k-means: spectral clus-
tering and normalized cuts, in: Proceedings of the tenth ACM
SIGKDD international conference on Knowledge discovery and
data mining, 2004, pp. 551–556.
[9] K-means by NVIDIA, https://github.com/NVIDIA/
kmeans.
[10] J. Chen, X. Liang, Z. Chen, Online algorithm-based fault tol-
erance for cholesky decomposition on heterogeneous systems
with gpus, in: 2016 IEEE International Parallel and Distributed
Processing Symposium (IPDPS), 2016.
[11] K.-H. Huang, J. Abraham, et al., Algorithm-based fault toler-
ance for matrix operations, Computers, IEEE Transactions on.
[12] J. Chen, S. Li, Z. Chen, Gpu-abft: Optimizing algorithm-based
fault tolerance for heterogeneous systems with gpus, in: 2016
IEEE International Conference on Networking, Architecture
and Storage (NAS).
[13] J. Chen, H. Li, S. Li, X. Liang, P. Wu, D. Tao, K. Ouyang, Y. Liu,
K. Zhao, Q. Guan, et al., Fault tolerant one-sided matrix decom-
positions on heterogeneous systems with gpus, in: SC18: In-
ternational Conference for High Performance Computing, Net-
working, Storage, and Analysis, IEEE Press, 2018, p. 68.
[14] D. Tao, S. L. Song, S. Krishnamoorthy, P. Wu, X. Liang, E. Z.
Zhang, D. Kerbyson, Z. Chen, New-sum: A novel online abft
scheme for general iterative methods, in: Proceedings of the
25th ACM International Symposium on High-Performance Par-
allel and Distributed Computing (HPDC), 2016.
[15] P. Wu, Q. Guan, N. DeBardeleben, S. Blanchard, D. Tao,
X. Liang, J. Chen, Z. Chen, Towards practical algorithm based
fault tolerance in dense linear algebra, in: Proceedings of the
25th ACM International Symposium on High-Performance Par-
allel and Distributed Computing (HPDC), 2016.
[16] P. Wu, N. DeBardeleben, Q. Guan, S. Blanchard, J. Chen,
D. Tao, X. Liang, K. Ouyang, Z. Chen, Silent data corruption
resilient two-sided matrix factorizations, in: Proceedings of the
22nd ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming (PPoPP), 2017.
[17] X. Liang, J. Chen, D. Tao, S. Li, P. Wu, H. Li, K. Ouyang, Y. Liu,
F. Song, Z. Chen, Correcting soft errors online in fast fourier
transform, in: SC17: International Conference for High Perfor-
mance Computing, Networking, Storage and Analysis, ACM,
2017, p. 30.
[18] P. Wu, C. Ding, L. Chen, F. Gao, T. Davies, C. Karlsson,
Z. Chen, Fault tolerant matrix-matrix multiplication: correct-
ing soft errors on-line, in: Proceedings of the second workshop
on Scalable algorithms for large-scale systems, ACM, 2011, pp.
25–28.
[19] J. Chen, Fault tolerant and energy efficient one-sided matrix de-
compositions on heterogeneous systems with gpus, Ph.D. thesis,
UC Riverside (2019).
[20] D. Tao, S. Di, X. Liang, Z. Chen, F. Cappello, Improving perfor-
mance of iterative methods by lossy checkponting, in: Proceed-
ings of the 27th International Symposium on High-Performance
Parallel and Distributed Computing (HPDC), 2018, pp. 52–65.
[21] Basic Linear Algebra on NVIDIA GPUs, https:
//developer.nvidia.com/cublas.
[22] S. Tomov, J. Dongarra, M. Baboulin, Towards dense linear al-
gebra for hybrid GPU accelerated manycore systems, Parallel
Computing.
[23] S. Tomov, R. Nath, H. Ltaief, J. Dongarra, Dense linear alge-
bra solvers for multicore with gpu accelerators, in: 2010 IEEE
International Symposium on Parallel & Distributed Processing,
Workshops and PhD Forum (IPDPSW), IEEE, 2010, pp. 1–8.
[24] J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. To-
mov, I. Yamazaki, Accelerating numerical dense linear algebra
calculations with gpus, Numerical Computations with GPUs.
[25] CULA, www.culatools.com.
[26] cuDNN, https://developer.nvidia.com/cudnn.
[27] cuBLAS Benchmark, http://developer.download.
nvidia.com/compute/cuda/compute-docs/
cuda-performance-report.pdf.
[28] CUDA Programming Guide, http://docs.nvidia.
com/cuda/cuda-c-programming-guide/index.html#
multiprocessor-level.
[29] PTX Programming Guide, http://docs.nvidia.com/
cuda/parallel-thread-execution/index.html#
data-movement-and-conversion-instructions-ld.
[30] V. Volkov, Understanding latency hiding on gpus, Ph.D. thesis,
University of California, Berkeley (2016).
[31] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi,
A. Moshovos, Demystifying gpu microarchitecture through mi-
crobenchmarking, in: Performance Analysis of Systems & Soft-
ware (ISPASS), 2010 IEEE International Symposium on, 2010.
[32] Darwin cluster, https://www.osti.gov/biblio/
1441285-darwin-cluster.
[33] PantaRhei cluster, https://www.dingwentao.com/
experimental-system.
[34] L. Wang, W. Wu, Z. Xu, J. Xiao, Y. Yang, Blasx: A high per-
formance level-3 blas library for heterogeneous multi-gpu com-
puting, in: Proceedings of the 2016 International Conference on
Supercomputing (ICS), ACM, 2016, p. 20.
[35] A. Abdelfattah, D. Keyes, H. Ltaief, Kblas: An optimized li-
brary for dense matrix-vector multiplication on gpu accelerators,
ACM Transactions on Mathematical Software (TOMS) 42 (3)
(2016) 18.
[36] Nvidia Tesla V100 GPU Architecture, https://images.
nvidia.com/content/volta-architecture/pdf/
volta-architecture-whitepaper.pdf.
Cody Rivera is an undergradu-
ate student studying Computer Sci-
ence and Mathematics at the Uni-
versity of Alabama from Fall 2018.
He is also in the Randall Re-
search Scholars Program, an hon-
ors interdisciplinary undergraduate
research program. His research in-
terests include computer science theory, algorithms, and
high-performance computing.
Jieyang Chen is a postdoctoral
researcher in the Computer Sci-
ence and Mathematics Division at
Oak Ridge National Laboratory
(ORNL). He received his master
and Ph.D. degrees in Computer
Science from University of Cali-
fornia, Riverside in 2014 and 2019.
He received a bachelor’s degree in Computer Science
and Engineering from Beijing University of Technology
in 2012. Before joining ORNL, he interned at Pacific
Northwest National Laboratory and Los Alamos Na-
tional Laboratory. His research interests include high-
performance computing, parallel and distributed sys-
tems, and big data analytics. He has published over 20
peer-reviewed high-quality papers in prestigious HPC
and Big Data conferences and journals, such as ICS,
HPDC, PPoPP, SC, BigData, IPDPS, TPDS.
Nan Xiong graduated with a
masters degree in Computer Sci-
ence from University of Califor-
nia, Riverside in 2018. She also
received a masters degree in Civil
Engineering from University of
Southern California in 2014 and a bachelors degree in
Civil Engineering from Tianjin University in 2012. She
is interested in HPC, heterogeneous computing with
GPU accelerators, and high-performance big data an-
alytics.
Shuaiwen Leon Song is cur-
rently a senior lecturer (tenured
associate professor) at the school
of computer science of Univer-
sity of Sydney and the direc-
tor of Future System Architecture
Lab (FSA). He is affiliated with
USYD nanoscience hub and Syd-
ney Quantum Academy. He is also an affiliated profes-
sor with University of Washington Electrical Engineer-
ing department. Prior to his appointment at University
of Sydney, he worked for U.S. Department of Energy
as a senior research scientist and technical lead. His
research interests include holistic system design, sys-
tem architecture and high performance computing. His
most recent works target future accelerator-driven sys-
tem design for AI and planet-scale virtual reality. He is a
Lawrence scholar, Paul E. Torgersen scholar, a recipient
of IEEE TCHPC early career award and DOE pathway
to excellence research award. He widely published in
the major HPC and computer architecture conferences,
including ISCA, HPCA, MICRO, ASPLOS and SC. His
past work received a 2017 HiPEAC paper award, two
SC best paper runner-ups, and 2018 IISWC best paper
finalist. During his tenure at PNNL, he led two LDRD
projects on AI driven future HPC system design and
large-scale data analytics acceleration.
Dingwen Tao is a tenure-track as-
sistant professor in the School of
EECS at Washington State Uni-
versity. Prior to that, he worked
as a tenure-track assistant profes-
sor at the University of Alabama,
and interned at Brookhaven Na-
tional Laboratory, Argonne Na-
tional Laboratory, and Pacific Northwest National Lab-
oratory. He received his bachelors degree in Mathe-
matics from University of Science and Technology of
China in 2013 and his Ph.D. degree in Computer Sci-
ence from University of California, Riverside in 2018.
His research interests include high-performance com-
puting, parallel and distributed systems, and big data an-
alytics. Specifically, he focuses on scientific data reduc-
tion and management, resilience and fault tolerance at
scale, and distributed machine learning and deep learn-
ing. He has published over 30 high-quality papers in
prestigious international conferences and journals, such
as ACM ICS, ACM HPDC, ACM PPoPP, IEEE Big-
Data, IEEE Cluster, IEEE IPDPS, IEEE MSST, IEEE
TPDS, ACM/IEEE SC, ICPP, PACT, IJHPCA, includ-
ing two Best Paper awards.
