




The Dissertation Committee for Jianyu Huang
certifies that this is the approved version of the following dissertation:
Practical Fast Matrix Multiplication Algorithms
Committee:









Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
DOCTOR OF PHILOSOPHY
THE UNIVERSITY OF TEXAS AT AUSTIN
August 2018
Dedicated to my family for their love and support.
Acknowledgments
“Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.”
— Samuel Beckett
My five-year Ph.D. adventure at UT Austin was an amalgam of frus-
tration, happiness, disappointment, excitement, anxiety, tranquility, and many
other emotions. But it was a priceless treasure for me to experience this unfor-
gettable journey, which further teaches me to be brave and persistent enough
to overcome any challenges going forward in life. I was especially fortunate to
work with incredible people full of inspiration and innovative spirit.
This research has been funded by the following sources: NSF Award CCF-1714091:
SHF: Small: Making Strassen’s Algorithm Practical, NSF Award ACI-1550493: Collab-
orative Research: SI2-SSI: Sustaining Innovation in the Linear Algebra Software Stack
for Computational Chemistry and other Sciences, NSF Award ACI-1148125: Collabora-
tive Research: SI2-SSI: A Linear Algebra Software Infrastructure for Sustained Innovation
in Computational Chemistry and other Sciences, NSF Award CCF-1218483: SHF: Small:
Algorithm/Architecture Co-Design for Low Power and High Performance Linear Algebra
Fabrics, an Intel Parallel Computing Center grant, a gift from Qualcomm, the Microelec-
tronics and Computer Development Fellowship, and Graduate School Summer Fellowship
from UT Austin. The author acknowledges the Texas Advanced Computing Center (TACC)
at The University of Texas at Austin for providing HPC resources that have contributed to
the research results reported within this dissertation. The author also gratefully acknowl-
edges Microway, Inc. for providing access to their Nvidia Tesla GPU-accelerated compute
cluster.
v
First and foremost, I owe my deepest gratitude to my advisor, Prof.
Robert van de Geijn. Without his continuous optimism, enthusiasm, en-
couragement, confidence, and support, it would have been impossible for me
to complete this five-year journey. Despite the supervisor-student hierarchy,
Robert always respected and trusted me, and allowed me to explore wild re-
search problems as if we were colleagues, while still guiding me in refining
new ideas by sharing his experience and insight. He, along with Dr. Maggie
Myers, whom I also want to thank deeply, appreciated my efforts regardless of
whether my research was a success or failure, treated my mistakes as a learning
experience, and supported me and my decisions whenever needed, all of which
made me feel as if I were their own child. This kind of relationship made my
whole journey much more joyful.
I would also like to thank other members of my thesis committee: Prof.
Don Batory, Prof. George Biros, Dr. Greg Henry, and Prof. Keshav Pingali.
Their sharp insights and constructive feedback were absolutely invaluable to
improving my thesis. Furthermore, Prof. George Biros taught me numeri-
cal linear algebra and parallel computing, and Prof. Don Batory taught me
automatic software design; both laid the foundation for my research with high-
performance computing software. Dr. Greg Henry has collaborated with me
in my first successful research project and offered me many pieces of advice.
Prof. Don Batory and Prof. Keshav Pingali have served on my committee
since my research preparation exams.
I am also deeply grateful to my friends and colleagues in the SHPC
group: Field Van Zee, Tyler Smith, Devin Matthews, Chenhan Yu, Devangi
vi
Parikh, Victor Eijkhout, Tze Meng Low, Kyungjoo Kim, Martin Schatz, Bryan
Marker, Woody Austin, and Leslie Rice. Field and Tyler exposed me to the
details of the BLIS framework, and answered my questions with great patience.
Chenhan is exceptionally smart and hard-working. He provided me with his
insight and intuition whenever I had a difficult problem. Devin introduced
me to tensors and facilitated the idea of applying my research for a higher-
demensional space. Devangi led pedagogical efforts towards the education of
high-performance computing, which gave my research a fresh perspective. My
research is built on the shoulders of giants and the infrastructure developed
by all members of the SHPC group.
In addition, I want to thank many other friends and fellow graduate
students during the five-year journey: Xiaofan Lu, Zuocheng Ren, Bo Xiong,
Xinyu Wang, Yuepeng Wang, Yuanzhong Xu, Yu Feng, Wenzhi Cui, Hongkun
Yang, Xue Chen, Jia Chen, Mochamad Asri, Yinan Zhao, Ruohan Gao, Wen-
guang Mao, Jian He, Kostas Ferles, Zhuode Liu, Ruohan Zhang, Mei Wang,
and Zhiting Zhu. My five-year journey has been so colorful and inspiring with
the help and support from all of you!
I want to also thank Prof. Wenjun Hu for opening the door to research
for me. Wenjun hired me to work as a research intern in Microsoft Research
Asia, a world-class research institute in Beijing. She later moved to Yale, and
continued to offer me great advice during my early Ph.D. career.
Last but not least, this work would not have been possible without the
support of my family. I would especially like to thank my mother, Haiyun
Sun, and my father, Chao Huang, for their unconditional love.
vii
Jianyu Huang
The University of Texas at Austin
August 2018
viii
Practical Fast Matrix Multiplication Algorithms
Publication No.
Jianyu Huang, Ph.D.
The University of Texas at Austin, 2018
Supervisor: Robert van de Geijn
Matrix multiplication is a core building block for numerous scientific
computing and, more recently, machine learning applications. Strassen’s al-
gorithm, the original Fast Matrix Multiplication (FMM) algorithm, has long
fascinated computer scientists due to its startling property of reducing the
number of computations required for multiplying n × n matrices from O(n3)
to O(n2.807). Over the last half century, this has fueled many theoretical
improvements such as other variations of Strassen-like FMM algorithms. Pre-
vious implementations of these FMM algorithms led to the “street wisdom”
that they are only practical for large, relatively square matrices, that they re-
quire considerable workspace, and that they are difficult to achieve thread-level
parallelism. The thesis of this work dispels these notions by demonstrating
significant benefits for small and non-square matrices, requiring no workspace
beyond what is already incorporated in high-performance implementations
of matrix multiplication, and achieving performance benefits on multi-core,





List of Figures xiv
Chapter 1. Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Outline of the dissertation . . . . . . . . . . . . . . . . . . . . 6
Chapter 2. Related Work 8
2.1 High-performance matrix multiplication algorithm . . . . . . . 8
2.1.1 Computing C = αAB + βC . . . . . . . . . . . . . . . . 8
2.1.2 Computing with submatrices . . . . . . . . . . . . . . . 9
2.1.3 Basic Linear Algebra Subprograms (BLAS) . . . . . . . 9
2.1.4 Level-3 BLAS matrix-matrix multiplication . . . . . . . 10
2.1.5 The GotoBLAS algorithm . . . . . . . . . . . . . . . . . 12
2.1.6 Hierarchical memory architectures . . . . . . . . . . . . 12
2.1.7 Multi-threaded implementation . . . . . . . . . . . . . . 14
2.1.8 Other approaches . . . . . . . . . . . . . . . . . . . . . 14
2.2 Literature review for FMM algorithms . . . . . . . . . . . . . . 15
2.2.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
x
Chapter 3. A Practical Strassen’s Algorithm 22
3.1 Strassen’s algorithm reloaded . . . . . . . . . . . . . . . . . . . 23
3.1.1 The basic idea . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.2 Classic Strassen’s algorithm . . . . . . . . . . . . . . . . 24
3.1.3 Practical considerations . . . . . . . . . . . . . . . . . . 24
3.1.4 One-level Strassen reloaded . . . . . . . . . . . . . . . 25
3.1.5 Two-level Strassen reloaded . . . . . . . . . . . . . . . 28
3.1.6 Additional levels . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Implementation and analysis . . . . . . . . . . . . . . . . . . . 30
3.2.1 Implementations . . . . . . . . . . . . . . . . . . . . . . 31
3.2.2 Performance model . . . . . . . . . . . . . . . . . . . . . 32
3.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Performance experiments . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 Single node experiments . . . . . . . . . . . . . . . . . . 40
3.3.2 Many-core experiments . . . . . . . . . . . . . . . . . . 43
3.3.3 Distributed memory experiments . . . . . . . . . . . . . 46
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 4. Practical Fast Matrix Multiplication Algorithms 50
4.1 Fast matrix multiplication basics . . . . . . . . . . . . . . . . . 51
4.1.1 One-level fast matrix multiplication algorithms . . . . . 51
4.1.2 Kronecker product . . . . . . . . . . . . . . . . . . . . . 54
4.1.3 Recursive block indexing (Morton-like ordering) . . . . . 54
4.1.4 Representing two-level FMM with the Kronecker product 55
4.1.5 Additional levels of FMM . . . . . . . . . . . . . . . . . 56
4.2 Implementation and analysis . . . . . . . . . . . . . . . . . . . 57
4.2.1 Code generation . . . . . . . . . . . . . . . . . . . . . . 57
4.2.2 Performance model . . . . . . . . . . . . . . . . . . . . . 59
4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.4 Incorporating the performance model into the code gen-
erator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3 Performance experiments . . . . . . . . . . . . . . . . . . . . . 69
4.3.1 Implementation and architecture information . . . . . . 69
xi
4.3.2 Benefit of hybrid partitions . . . . . . . . . . . . . . . . 70
4.3.3 Sequential and parallel performance . . . . . . . . . . . 72
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Chapter 5. A Practical Strassen’s Algorithm for Tensor Con-
traction 75
5.1 Background on high-performance tensor contraction . . . . . . 76
5.1.1 Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.1.2 Tensor contraction . . . . . . . . . . . . . . . . . . . . . 77
5.1.3 General stride layouts . . . . . . . . . . . . . . . . . . . 79
5.1.4 Block scatter matrix view . . . . . . . . . . . . . . . . . 81
5.2 Strassen’s algorithm for tensor contraction . . . . . . . . . . . 83
5.3 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.1 Packing . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.2 Micro-kernel . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.3 Variations . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4 Performance model . . . . . . . . . . . . . . . . . . . . . . . . 87
5.5 Performance experiments . . . . . . . . . . . . . . . . . . . . . 96
5.5.1 Single node experiments . . . . . . . . . . . . . . . . . . 97
5.5.2 Distributed memory experiments . . . . . . . . . . . . . 103
5.6 Related work on tensor contraction . . . . . . . . . . . . . . . 106
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Chapter 6. A Practical Strassen’s Algorithm on GPUs 109
6.1 Background on Nvidia GPUs . . . . . . . . . . . . . . . . . . . 110
6.1.1 GPU programming model . . . . . . . . . . . . . . . . . 110
6.1.2 Nvidia Volta GPUs . . . . . . . . . . . . . . . . . . . . 112
6.1.3 Matrix multiplication on GPUs . . . . . . . . . . . . . . 113
6.2 Strassen’s algorithm on Nvidia GPUs . . . . . . . . . . . . . . 120
6.3 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.3.1 Exploiting more parallelism . . . . . . . . . . . . . . . . 122
6.3.2 Reducing memory requirement . . . . . . . . . . . . . . 126
6.3.3 Handling the fringes . . . . . . . . . . . . . . . . . . . . 127
xii
6.3.4 Adapting software prefetching . . . . . . . . . . . . . . . 127
6.4 Performance experiments . . . . . . . . . . . . . . . . . . . . . 128
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Chapter 7. Conclusion 133
7.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Appendices 142
Appendix A. Table of Acronyms 143
Appendix B. Table of Symbols 144
Appendix C. 〈3, 2, 3〉 FMM algorithm 145
Appendix D. Derivation of Kronecker Product 148
Appendix E. Numerical Stability 155
E.1 Numerical stability for Strassen’s algorithm . . . . . . . . . . . 155
E.1.1 Theoretical bound . . . . . . . . . . . . . . . . . . . . . 155
E.1.2 Empirical experiment . . . . . . . . . . . . . . . . . . . 157
E.2 Numerical stability for FMM algorithms . . . . . . . . . . . . 157
E.2.1 Theoretical bound . . . . . . . . . . . . . . . . . . . . . 157
E.2.2 Empirical experiment . . . . . . . . . . . . . . . . . . . 159





2.1 Left: illustration of the BLIS implementation of the Goto-
BLAS algorithm. All computation is cast in terms of a highly
optimized micro-kernel. Right: the same algorithm, but ex-
pressed as loops. The left figure originally appeared in [123],
and the right figure originally appeared in [109]. . . . . . . . 11
2.2 Chronological decrease of the exponents ω for the arithmetic
compelxity of (fast) matrix multiplication, O(nω). Note that
we only cover some important papers. . . . . . . . . . . . . . 17
3.1 All operations for one-level Strassen. Note that each row is a
special case of general operation (3.2). . . . . . . . . . . . . . 23
3.2 Left: modification of Figure 2.1 that implements the represen-
tative computation M = (X+Y )(V +W );D+= M ;E+= M of
general operation (3.2). X, Y are submatrices of A; V , W are
submatrices of B; D, E are submatrices of C; M is the inter-
mediate matrix product. Note that the packing buffers Ãi and
B̃p stay in cache. Right: the same algorithm, but expressed as
loops (γ0 = γ1 = δ = ε = 1). . . . . . . . . . . . . . . . . . . . 26
3.3 A side-by-side comparison of the BLIS implementation of the
GotoBLASalgorithm and our modifications for implementing
the representive computation M = (X + Y )(V + W );D+ =
M ;E+= M . Left: Figure 2.1; Right: Figure 3.2. . . . . . . . 27
3.4 Computations for two-level Strassen. . . . . . . . . . . . . . 29
3.5 Notation table for Strassen performance model. . . . . . . . 34
3.6 Theoretical run time breakdown analysis of BLAS dgemm and
various implementations of Strassen. The time shown in
the first column for dgemm, one-level Strassen, two-level
Strassen can be computed separately by multiplying the pa-
rameter in τ column with the number in the corresponding en-
tries. Due to the software prefetching effects, the row marked
with (∗) needs to be multiplied by an additional parameter
λ ∈ [0.5, 1], which denotes the prefetching efficiency. λ is ad-
justed to match BLIS dgemm performance. . . . . . . . . . . 35
3.7 The coefficient NXm mapping table for computing Tm in the per-
formance model. . . . . . . . . . . . . . . . . . . . . . . . . . 36
xiv
3.8 Performance of the various implementations on an Intel Xeon
E5 2680 v2 (Ivybridge) processor (single core). Left: modeled
performance. Right: actual performance. The range of the y-
axis does not start at 0 to make the graphs more readable and
28.32 marks theoretical peak performance for this architecture. 39
3.9 Performance of the various implementations on an Intel Xeon
E5 2680 v2 (Ivybridge) processor (one and ten cores). Left: 5
core. Right: 10 core. The range of the y-axis does not start at
0 to make the graphs more readable. . . . . . . . . . . . . . . 42
3.10 Performance of one-level ABC Strassen, BLIS, and MKL, on
an Intel Xeon Phi (KNC) coprocessor (for 60 cores with 240
threads). The performance results are sampled such that k is a
multiple of 480 = 2× kC . . . . . . . . . . . . . . . . . . . . . . 45
3.11 Performance of the various implementations on distributed mem-
ory (weak scalability). . . . . . . . . . . . . . . . . . . . . . . 48
4.1 All operations for one-level Strassen. Compared to Figure 3.1,
the indeces for the submatrices of A, B, and C have been changed. 52
4.2 Theoretical and practical speedup for various FMM algorithms
(double precision). m̃k̃ñ is the number of multiplication for
classical matrix multiplication algorithm. R is the number of
multiplication for fast matrix multiplication algorithm. Theo-
retical speedup is the speedup per recursive step. Practical #1
speedup is the speedup for one-level FMM algorithms compared
with gemm when m = n = 14400, k = 480 (rank-k updates).
Practical #2 speedup is the speedup for one-level FMM algo-
rithms compared with gemm when m = n = 14400, k = 12000
(approximately square). We report the practical speedup of the
best implementation of our generated code (generated gemm)
and the implementations in [9] (linked with Intel MKL) on sin-
gle core. More details about the experiment setup is described
in Section 4.3.1. . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Illustration of recursive block storage indexing (Morton-like or-
dering) [37] on m× k matrix A where the partition dimensions
m̃ = k̃ = 2 for three-level recursions. . . . . . . . . . . . . . . 55
4.4 Notation table for performance model. . . . . . . . . . . . . . 61
4.5 The equations for computing the execution time T and Effective
GFLOPS in our performance model. . . . . . . . . . . . . . . 61
xv
4.6 The various components of arithmetic and memory operations
for BLAS gemm and various implementations of FMM algo-
rithms. The time shown in the first column for gemm and
L-level FMM algorithms can be computed separately by multi-
plying the parameter in τ column with the arithmetic/memory
operation number in the corresponding entries. . . . . . . . . 62
4.7 The coefficient NXa /N
X
m mapping table for computing Ta/Tm in
the performance model. Here M̃L =
∏L−1
















l=0 Wl, RL =
∏L−1
l=0 Rl. . . . . . . . . . . . . . . . . . . . . 63
4.8 Performance of generated one-level ABC, AB, Naive FMM im-
plementations on single core when m=n=14400, k varies. Left
column: actual performance; Right column: modeled perfor-
mance. Top row: one-level, ABC; Middle row: one-level, AB;
Bottom row: one-level, Naive. . . . . . . . . . . . . . . . . . . 65
4.9 Performance of generated two-level ABC FMM implementa-
tions on single core when m=k=n; m=n= 14400, k varies;
k=1024, m=n vary. Left column: actual performance; Right
column: modeled performance. Top row: m=k=n; Middle row:
m=n=14400, k varies; Bottom row: k=1024, m=n vary. . . . 66
4.10 Selecting FMM implementations with the performance model. 68
4.11 Benefit of hybrid partitions over other partitions. . . . . . . . 71
4.12 Performance of the best implementation of our generated FMM
code and reference implementations [9] on one socket (10 core).
Top row: our implementations; Bottom row: reference im-
plementations from [9] (linked with Intel MKL). Left column:
m=k=n; Middle column: m=n=14400, k varies; Right column:
k=1024, m=n vary. . . . . . . . . . . . . . . . . . . . . . . . 73
5.1 An example to illustrate Strassen’s algorithm for tensor con-
traction. The dashed lines denotes Strassen 2 × 2 partitions
mapping from block scatter matrix view (bottom) to the orig-
inal tensor (top). In this example the partitions are regular
subtensors, but this is not required in general. . . . . . . . . 80
5.2 Notation table for performance model. . . . . . . . . . . . . . 88
5.3 Equations for computing the execution time T and Effective
GFLOPS in our performance model. . . . . . . . . . . . . . . 88
xvi
5.4 Various components of arithmetic and memory operations for
tblis TC and various implementations of Strassen TC. The
time shown in the first column for tblis TC and L-level Strassen
can be computed separately by multiplying the parameter in τ
column with the arithmetic/memory operation number in the
corresponding entries. Here NIm =
∏
i∈Im Ni = Ni0 · . . . ·Nim−1 ,
NJn =
∏
j∈Jn Nj = Nj0 · . . . ·Njn−1 , NPk =
∏
p∈Pk Np = Np0 · . . . ·
Npk−1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.5 Coefficient WXa /W
X





the performance model. . . . . . . . . . . . . . . . . . . . . . 90
5.6 Modeled performance (solid line) and actual performance (dots)
of various implementations for synthetic data on single core. y-
axis for all figures is Effective GFLOPS (2·NIm ·NJn ·NPk/time).
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.7 Actual performance (dots) of various implementations for syn-
thetic data on one socket. y-axis for all figures is Effective
GFLOPS (2 ·NIm ·NJn ·NPk/time). . . . . . . . . . . . . . . 94
5.8 Normalized root-mean-square error (NRMSE) between the ac-
tual and modeled performance for synthetic data on single core.
NRMSE is defined as the root of mean square error normalized
by the mean value of the measurements, which shows the model
prediction accuracy. . . . . . . . . . . . . . . . . . . . . . . . 95
5.9 Performance for representative user cases of benchmark from
[112]. TC is identified by the index string, with the tensor
index bundle of each tensor in the order C-A-B, e.g., Cabcd+=
AaebfBdfce is denoted as abcd-aebf -dfce. Top: performance on
single core. Bottom: performance on one socket. . . . . . . . 100
5.10 Performance for the contraction Zabij+=Wabef ·Tefij with vary-
ing Na : Ni ratio. Left: performance on single core. Right: per-
formance on one socket. . . . . . . . . . . . . . . . . . . . . . 102
5.11 Weak scalability performance result of the various implemen-
tations for a 4-D tensor contraction CCSD application on dis-
tributed memory: Zabij := WbmejTaeim. CTF: the performance
of the Cyclops Tensor Framework [110] (linked with Intel MKL). 105
6.1 Memory and thread hierarchy in the CUDA programming en-
vironment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
xvii
6.2 Illustration of the gemm implementation in CUTLASS [67].
CUTLASS partitions the operand matrices into blocks in the
different levels of the device, thread block, warp, and thread.
Here we show block sizes typical for the large sgemm: mS =
128, nS = 128, kS = 8; mW = 4×mR = 32, nW = 8× nR = 64;
nR = 8, nR = 8. . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.3 CUTLASS specifies six strategies of block sizes at each level in
Figure 6.2 for different matrix shapes and problem sizes. . . . 117
6.4 CUTLASS performance with different strategies of block sizes. 118
6.5 All operations for one-level Strassen. Duplicate of Figure 4.1
for easy reference. . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.6 Specialized kernel that implements the representative computa-
tion M = (X + Y )(V + W );D+= M ;E+= M of each row of
computations in Figure 6.5 based on Figure 6.2. X, Y are sub-
matrices of A; V , W are submatrices of B; D, E are submatrices
of C; M is the intermediate matrix product. . . . . . . . . . 121
6.7 Reordered operations based on Figure 4.1 with multi-kernel
streaming. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.8 Performance of various Strassen implementations on V100
with single precision. The theoretical peak for the machine is
14.13 TFLOPS. The 1-level and 2-level reference implementa-
tions are from [72] (linked with cuBLAS 9.0). The 2-level hy-
brid implementation replaces the cublasSgemm function in the
1-level reference implementation [72] with our 1-level Strassen
implementation. . . . . . . . . . . . . . . . . . . . . . . . . . 130
C.1 All operations for an example of one-level 〈3, 2, 3〉 FMM algorithm.147
E.1 Empirical stability experiment for Strassen. The actual max-
norm error and mean error vs. theoretical error bound for
square matrices in double precision and single precision. . . . . 156
E.2 Empirical stability experiment for 1-level and 2-level FMM al-
gorithms. The actual max-norm error for square matrices in




At the core of dense linear algebra lies matrix multiplication (gemm),
a fundamental primitive of great importance to numerous scientific disciplines
such as machine learning [125, 18], numerical linear algebra [63, 39, 2, 43],
graph analysis [66], and more. Considerable effort has been made to improve
the performance of this task over the last decades.
Strassen’s algorithm (Strassen) [115], the original Fast Matrix Mul-
tiplication (FMM)1 algorithm, has fascinated theoreticians and practitioners
alike since it was first published in 1969. That paper demonstrated that mul-
tiplication of n × n matrices can be achieved in less than the O(n3) floating
point operations required by a conventional matrix multiplication. It has led to
many variants (other Strassen-like FMM algorithms) that improve upon this
result [129, 12, 100, 104] as well as practical implementations [32, 59, 26, 9].
The method can yield a shorter execution time than the best conventional
algorithm with a modest degradation in numerical stability [50, 28, 6] by only
incorporating a few levels of the recursion that underlies the method.
1Fast matrix multiplication (FMM) algorithms are those matrix multiplication algo-
rithms which perform asympototically fewer arithmetic operations than the classical algo-
rithm. Strassen can be found in Section 3.1, and another example of FMM algorithms is
given in Appendix C.
1
1.1 Motivation
From 30,000 feet, Strassen and similar Strassen-like FMM algorithms
can be described as shifting computation with submatrices from multiplica-
tions to additions, reducing the O(n3) term at the expense of adding O(n2)
complexity. For current architectures, of greater consequence is the additional
memory movements that are incurred when the algorithm is implemented in
terms of a conventional gemm provided by a high-performance implementa-
tion through the Basic Linear Algebra Subprograms (BLAS) [30] interface. A
secondary concern has been the extra workspace that is required. This simul-
taneously limits the size of problem that can be computed and makes it so
an implementation is not plug-compatible with the standard calling sequence
supported by the BLAS. These constraints expose the “street wisdom” for
implementing Strassen and similar Strassen-like FMM algorithms:
• Strassen and similar Strasen-like FMM algorithms are only practical for
very large matrices.
• For Strassen and similar Strassen-like FMM algorithms to be effective,
the matrices being multiplied should be relatively square.
• Strassen and other Strassen-like FMM algorithms inherently require sub-
stantial workspace.
• A Strassen or other similar Strassen-like FMM algorithm interface must
allow the caller to pass in workspace or allocate substantial workspace in-
ternally.
2
• It is hard to demonstrate speedup on multi-core architectures.
These limitations motivate us to work towards the theory and practice
of the fast matrix multiplication algorithms. The goal of this dissertation is
to explore the practical implementation of Strassen and Strassen-like FMM
algorithms on various architectures to overcome these limitations.
1.2 Solution
An important recent advance in the high-performance implementa-
tion of gemm is the BLAS-like Library Instantiation Software (BLIS frame-
work) [124], a careful refactoring of the best-known approach to implementing
conventional gemm introduced by Goto [38]. A similar effort made for Nvidia
GPUs is CUDA Templates for Linear Algebra Subroutines (CUTLASS) [67].
Of importance to the solution to the basic problem above are the building
blocks that BLIS and CUTLASS expose, minor modifications of which sup-
port a new approach to implementating Strassen and similar Strassen-like
FMM algorithms. This approach changes data movement between memory
layers and can thus mitigate the negative impact of the additional lower-order
terms incurred by Strassen and other Strassen-like FMM algorithms. These
building blocks have similarly been exploited to improve upon the performance
of, for example, the computation of the K-Nearest Neighbor [131, 97] and Ten-
sor Contraction [85, 112] problem. The result is a family of Strassen and
other Strassen-like FMM implementations, members of which attain superior
performance depending on the sizes of the matrices.
3
The resulting family improves upon prior implementations of Strassen
and Strassen-like FMM algorithms in a number of surprising ways:
• It can outperform classical gemm even for small square matrices.
• It can achieve high performance for rank-k updates (gemm with a small
“inner matrix size”), a case of gemm frequently encountered in the imple-
mentation of libraries like LAPACK [2].
• It requires no additional workspace beyond the small buffers that are already
incorporated in BLIS and CUTLASS.
• It can incorporate directly the multi-threading in traditional gemm imple-
mentations.
• It can be plug-compatible with the standard gemm interface supported by
the BLAS and cuBLAS.
• It can be incorporated into practical distributed memory implementations
of gemm.
Most of these advances run counter to “street wisdom.”
1.3 Contributions
This disseration makes the following contributions:
4
• It dispels the conventional notions regarding implementing Strassen
and similar Strassen-like FMM algorithms, outperforming the conven-
tional matrix multiplication for small and non-square matrices and requiring
no additional workspace other than what is already embeded in the high-
performance implementations of gemm.
• It demonstrates performance benefits of Strassen on single core,
multi-core, many-core, and distributed parallel CPU architectures.
Thread-level parallelism without the extra overhead of task parallelism is
attained by adopting the same loop structure and parallel scheme as the
high-performance gemm implementation.
• It facilitates a code generator to automatically implement families
of Strassen-like FMM algorithms, which expresses the composition of
multi-level FMM algorithms as Kronecker products.
• It builds an accurate performance model, which can also be generated
by the code generator and used for guiding the choice of a FMM implementa-
tion as a function of problem size and matrix shape, facilitating the creation
of “poly-algorithms” [77] that select the best algorithm without exhaustive
empirical search.
• It provides the first efficient implementation of Strassen for tensor
contraction, the multi-dimensional generalization of matrix multiplication,
without the explicit transposition of data that inherently incur significant
memory movement and workspace overhead.
5
• It achieves more parallelism and requires less memory with a prac-
tical Strassen implementation on Nvidia GPUs, which can be viewed
as massively-parallel, throughput-oriented computing engines. The special-
ized kernel is developed to utilize the memory hierarchy and thread hier-
archy, to reduce the additional workspace requirement through reusing the
shared memory and the register files, and to exploit the inter-kernel paral-
lelism as well as the intra-kernel parallelism.
1.4 Outline of the dissertation
The rest of this dissertation is organized as follows:
• Chapter 2 discusses the literature related to the high-performance implemen-
tations of matrix multiplication algorithms and the theoretical and practical
breakthroughs of Strassen and similar Strassen-like fast matrix multipli-
cation algorithms.
• Chapter 3 presents a novel practical implementation of Strassen, which
demonstrates significant speedups for small problem sizes and non-square
matrix shapes, avoids additional workspace requirement other than what is
already in gemm, and achieves better performance on various CPU archi-
tectures.
• Chapter 4 further develops a code generator framework to automatically
implement a large family of fast matrix multiplication algorithms suitable
for multiplications of arbitrary matrix sizes and shapes. Performance models
6
are incorporated into the code generator to guide the choice of a FMM
implementation as a function of problems sizes and matrix shapes.
• Chapter 5 extends the novel insights from Chapter 3 into tensor contraction,
a higher dimensional generalization of matrix multiplication.
• Chapter 6 describes how to apply Strassen on GPUs with less memory
and more parallelism.
• Chapter 7 concludes this dissertation and proposes ideas for future research.
• Appendices A and B provide tables of acronyms and symbols used through-
out this dissertation.
• Appendix C shows an example of Strassen-like FMM algorithms other than
Strassen.
• Appendix D derives that the composition of multi-level FMM algorithms
can be expressed as Kronecker products.





In this chapter, we will look at related work, including a brief de-
scription of the state-of-the-art implementations and algorithms for matrix
multiplication, followed by a brief literature review of Strassen and similar
Strassen-like FMM algorithms over the last half century.
2.1 High-performance matrix multiplication algorithm
We start by discussing the conventional matrix multiplication (gemm),
how it is supported as a library routine by the Basic Linear Algebra Subpro-
grams (BLAS) [30], how modern implementations block for caches, and how
that implementation supports multi-threaded parallelization.
2.1.1 Computing C = αAB + βC
Consider C = αAB + βC, where C, A, and B are m × n, m × k, and
k × n matrices, respectively, and α and β are scalars. If the (i, j) entry of
C, A, and B are respectively denoted by ci,j, ai,j, and bi,j, then computing






which requires 2mnk floating point operations (flops).
2.1.2 Computing with submatrices
Important to our discussion is that we partition the matrices and stage
the matrix-multiplication as computations with submatrices. For example, let

































C00 = (A00B00 + A01B10) + C00 C01 = (A00B01 + A01B11) + C01
C10 = (A10B00 + A11B10) + C10 C11 = (A10B01 + A11B11) + C11
computes C = AB + C via eight multiplications and eight additions with
submatrices, still requiring approximately 2mnk flops.
2.1.3 Basic Linear Algebra Subprograms (BLAS)
The Basic Linear Algebra Subprograms (BLAS) [73, 31, 30] specifies an
interface for a set of dense linear algebra (DLA) operations on which higher
level linear algebra libraries, such at LAPACK [2] and libflame [121], are
built. The BLAS operations are divided into three sets: the level-1 BLAS
(vector-vector operations), the level-2 BLAS (matrix-vector operations), and
the level-3 BLAS (matrix-matrix operations). Examples of BLAS libraries on
CPUs include IBM ESSL [60], Intel MKL [61], ATLAS [3], GotoBLAS [40],
OpenBLAS [90], and BLIS [14]. Other implementations of BLAS operations
9
with similar interfaces on GPUs include clBLAS [20], Nvidia cuBLAS [87], and
AMD rocBLAS [1].
2.1.4 Level-3 BLAS matrix-matrix multiplication
(General) matrix-matrix multiplication (gemm) is supported in the
level-3 BLAS [30] interface as
DGEMM( transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc )
SGEMM( transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc )
where “DGEMM” and “SGEMM” are for double precision and single precision,
respectively. These calls support
C = αAB + βC, C = αATB + βC,
C = αABT + βC, and C = αATBT + βC
depending on the choices of transa and transb (indicating whether the ma-
trices A and B should be transposed). The parameters C, A, and B are the
arrays used to store the matrices C, A, and B. The parameters m, n, k, alpha,
and beta are coresponding to m, n, k, α, and β in Section 2.1.1. The pa-
rameters lda, ldb, and ldc specify the leading dimension of A, B, and C,
that is, the the number of elements between successive columns (for column
major storage) in memory. In this dissertation, we focus on the special case
α = 1 and β = 1 for brevity. By internally allowing both a row stride and
a column stride for A, B, and C (as the BLIS framework does), transposition
can be easily supported by swapping these strides. It suffices then to consider





























Pack Bp  → Bp 
~ 
Bp 
~ mC Ci Ai mC 
4th#loop#around#micro5kernel#
+=#










Y W Dj 
X Vj Cj 
4th#loop#around#micro5kernel#
+=#
kC Yp Wp Dj 
kC 
Xp Vp Cj 
3rd#loop#around#micro5kernel#
+=#
Pack Vp + Wp → Bp 
~ 
Bp 
~ mC Di Yi 


























Update Cij , Dij  
nC nC 
Loop 5 for jc =0 : n−1 steps of nc
Jc =jc : jc+nc−1
Loop 4 for pc =0 : k−1 steps of kc
Pc =pc : pc+kc−1
B(Pc,Jc) → B̃ // Pack into B̃p
Loop 3 for ic =0 : m−1 steps of mc
Ic = ic : ic+mc−1
A(Ic,Pc) → Ã // Pack into Ãi
// mac o-kernel
Loop 2 for jr =0 : nc−1 steps of nr
Jr =jr : jr+nr−1
Loop 1 for ir =0 : mc−1 steps of mr
Ir = ir : ir+mr−1
// micro-kernel
Loop 0 for pr =0 : kc−1







Figure 2.1: Left: illustration of the BLIS implementation of the GotoBLAS algorithm. All computation
is cast in terms of a highly optimized micro-kernel. Right: the same algorithm, but expressed as loops.
The left figure originally appeared in [123], and the right figure originally appeared in [109].
11
2.1.5 The GotoBLAS algorithm
A key insight underlying modern high-performance implementations of
gemm is to organize the computations by partitioning the operands into blocks
for temporal locality, and to pack (copy) such blocks into contiguous buffers
that fit into various levels of memory for spatial locality. Figure 2.1 illustrates
the GotoBLAS algorithm as implemented in BLIS.
• Cache blocking parameters {mC , nC , kC} determine the submatrix sizes
of Bp (kC × nC) and Ai (mC × kC), such that they fit in various caches.
During the computation, row panels Bp are contiguously packed into
buffer B̃p to fit in the L3 cache.
1 Blocks Ai are similarly packed into
buffer Ãi to fit in the L2 cache.
• Register block sizes {mR, nR} relate to submatrices in registers that con-
tribute to C. In the micro-kernel (the inner most loop), a small mR×nR
micro-tile of C is updated by a pair of mR×kC and kC×nR micro panels
of Ãi and B̃p.
The above parameters can be analytically chosen [80].
2.1.6 Hierarchical memory architectures
The modern computer architecture with a hierarchical memory archi-
tectures has memory that is laid out as a pyramid, with fast and small memo-
1If an architecture does not have an L3 cache, this panel is still packed to make the data
contiguous and to reduce the number of TLB entries used.
12
ries close to the processor, and slow and large memories far away from it [48].
Gunnels et al. [44] describes a family of algorithms for matrix multipli-
cation on hierarchical memory architectures. The theoretical insight showed
that, depending on the matrix shape of matrix multiplication that is executed
at the current level of cache, there are two locally-optimal strategies for the
matrix shape of matrix multiplication that must happen at the next highest
level of cache. That motivates a tree of locally-optimal decisions, each path
from root to leaf of which represents a member of the family of algorithms.
Smith [106] further derived a new family of algorithms for optimiz-
ing the I/O cost of matrix multiplication at multiple levels of the memory
hierarchy simultaneously. He showed that one can compose two loops per
level of cache in order to encounter an optimal subproblem, whose shapes
arise from algorithms that attain the theoretical I/O lower bounds for matrix
multiplications for a single level of cache, at each layer of the memory hierar-
chy. With the expansion of the family of algorithms to allow more flexibility
and to include GotoBLAS as a part of this family, the Multilevel Opti-
mized Matrix-matrix Multiplication Sandbox (MOMMS) [105] was created to
demonstrate the performance improvements of practical algorithms from the
family over the state-of-the-art GotoBLAS in terms of I/O cost. Impor-
tantly to that work, it showed some members of the family of algorithms can
outperform the state-of-the-art when the bandwidth to main memory is low.
Future computer architectures are expected to be bandwidth bound, while the
conventional implementations of Strassen and similar Strassen-like FMM al-
13
gorithms reduce the total number of computations at the cost of consuming
more bandwidth. That work may be the answer to how to develop Strassen
and similar Strassen-like FMM algorithms based on matrix multiplication in
a low bandwidth scenario. Details of this go beyond this dissertation.
2.1.7 Multi-threaded implementation
BLIS exposes all the illustrated loops in Figure 2.1, requiring only the
micro-kernel to be optimized for a given architecture. In contrast, in the Go-
toBLAS implementation the micro-kernel and the first two loops around it
form an inner-kernel (known as a “macro-kernel” in BLIS terminology) that
is implemented as a unit. As a result, the BLIS implementation exposes five
loops (two more than the GotoBLAS implementation) that can be paral-
lelized, as discussed in [108].
2.1.8 Other approaches
In the early stages, high-performance gemm typically required hand-
written assembly code optimized for a specific processor. To avoid this tedious
task, autotuning is proposed where the best parameters are determined by
measuring performance empirically when thosed parameters are tuned. The
Portable High Performance ANSI C (PHiPAC) [11] project provided guidelines
for writing the high-performance gemm in C and introduced the code genera-
tor to autogenerate and find the best parameters for a given system. Further,
the Automatically Tuned Linear Algebra Software (ATLAS) [126] were built
14
on these insights and made autogeneration and autotuning of BLAS libraries
mainstream. The main contribution of ATLAS was the idea of casting matrix
multiplication in terms of a basic unit of computation, which was originally
called “on-chip multiply” and is known as “inner-kernel” or “macro-kernel”
now. However, as pointed out by Smith [106], the I/O cost of the algorithms
used by ATLAS is 50% higher than the tight lower bound in [107]. On the other
hand, model-driven optimization can be more effective to select the best pa-
rameters with the best performance for a given architecture. Yotov et al. [130]
showed that one can identify the near-optimal block sizes for gemm with the
analytical model, and thus empirical search is unnecessary. Low [80] further
applied the analytical model approach to the more modern GotoBLAS.
K̊agström [63] showed that the level-3 BLAS operations can be imple-
mented mostly based on gemm. This is sometimes referred to as “poorman’s
BLAS” in the sense that it is only necessary to optimize gemm, then all level-3
BLAS can be built in terms of this “for free”, which brings both modularity
and high performance.
2.2 Literature review for FMM algorithms
There is an extensive literature on the theory and practice for Strassen
and FMM algorithms. Here we only mention the most influential ones.
15
2.2.1 Theory
A history for complexity improvements
Until the late 1960’s, it was widely accepted that the number of oper-
ations for the multiplication of n× n matrices was essentially O(n3). In 1969,
Strassen [115] demonstrated that it suffices to perform matrix multiplication
with only O(n2.8074) arithmetic operations, which opened a new era of reseach
for fast matrix multiplication algorithms. Figure 2.2 shows the chronological
improvement of the exponents ω for the FMM arithmetic complexity O(nω)
over the last decades.
In 1971, Winograd [129] proved that the minimum required number of
multiplications for 2 × 2 and 2 × 2 submatrix multiplication is seven. This
paper further led to a more efficient Winograd variant of Strassen’s algorithm,
in which the recursive step has 15 instead of 18 submatrix additions. In 1978,
Pan [91] constructed an exact algorithm with O(n2.7951) arithmetic complexity,
by trilinear aggregation and a base case of multiplying 70 × 70 submatrix
multiplications in 143640 operations. In 1979, Bini et al. [12] presented a
method for multiplying 3×2 and 2×2 submatrices with O(n2.7799) complexity,
by introducing the notion of arbitrary precision approximate (APA) algorithms
and border rank. In 1981, Schönhage [100] further generalized that notion of
APA and border rank, proved the asymptotic sum inequality, and obtained
O(n2.522) complexity based on 3× 3 and 3× 3 submatrix multiplications. The
following year, Coppersmith and Winograd [23] achieved O(n2.4966) complexity
with similar techniques, the first result beating O(n2.5). In 1986, Strassen [116]
16


















2.522 [100] 2.4966 [23]
2.4785 [116]



























Figure 2.2: Chronological decrease of the exponents ω for the arithmetic compelxity of (fast) matrix
multiplication, O(nω). Note that we only cover some important papers.
17
introduced a new method called the laser method technique with O(n2.4785)
complexity, which was adopted by Coppersmith and Winograd [22] one year
later to push to the famous bound O(n2.376). This new bound had been the
state of the art for the following twenty years, until finally being advanced by
Stothers [114] with the new upper bound O(n2.374), in 2010. Two years later,
Williams [128] futher improved the bound to O(n2.3728642). More recently, in
2014, François [74] achieved the current world champion bound O(n2.3728639)
by polishing up Williams’ methods.
This is just a small sample of the important breakthroughs in this area.
Numerical stability
Bini and Lotti [13] proves the “normwise stability” and provides the
first general error bound for Strassen and similar Strassen-like FMM algo-
rithms. Since then, a number of papers [50, 28, 25] have been published to
analyze and improve the numerical properties of various FMM algorithms.
Even more recently, Ballard et al. [6] explored ways to improve the accuracy
both thoeretically and empirically. These papers show FMM algorithms are
reasonably numerically stable as long as only a few levels of recursions are
adopted. More details about the numerical stability of Strassen and similar
Strassen-like FMM algorithms can be found in Appendix E.
18
Feasible algorithms
A number of theoretical research papers mentioned above [12, 100, 23,
116, 23, 114, 128, 74] are of purely theoretical interest and not feasible in
practice. On the one hand, these papers only prove the existence of fast APA
algorithms with better asymptotic complexity, which are valid only for inputs
beyond any practical size [100]. On the other hand, those APA algorithms
exhibit serious numerical instability problems [9].
A recent review [33] pointed out that the current complexity record
for “moderate” problem size with n < 1000000 is O(n2.7734), achieved by Pan
with the 44×44 submatrix multiplications in [92]. Remarkably, Smirnov [104]
proposed systematic approaches to discover feasible algorithms and presented
several new algorithms, including an algorithm with complexity O(n2.7743).
Most recently, Karstadt and Schwartz [65] obtained a faster FMM algorithm
with the same base case size and asymptotic complexity as Winograd [129],
but with the coefficient reduced from 6 to 5.
2.2.2 Practice
Layering FMM algorithms upon standard gemm calls
There is a rich history of work that combines a few levels of Strassen
and related variants with calls to highly optimized implementations of the
gemm that are part of the BLAS: IBM’s ESSL library [60] long included such
routines as an option and portable libraries were reported in Huss-Lederman
et al. [59] and Douglas et al. [32].
19
Lower level implementation
A very thorough treatment was given by D’Alberto et al. [27], which
not only supported the layering upon gemm but also studied how to optimize
below the BLAS interface. However, in these studies, extra temporary matrices
are still required.
Sequential
Bailey [5], Douglas et al. [32], and Huss-Lederman [59] provided sequen-
tial implementations of Strassen’s algorithm. Kaporin [64] implemented the
algorithms presented in [91, 71], showing that the running time for one-level
of such algorithms (e.g., 70 × 70 submatrix multiplications) was comparable
with Strassen’s algorithm on a sequential machine.
Parallelization for shared memory
Scheduling tasks (multiplications with submatrices) to multiprocessors
and multi-threaded architectures has been widely studied [27, 9, 26, 70]. A
fundamental problem is that Strassen creates 7l such tasks, where l equals
the number of levels of Strassen. This makes it difficult to load balance [9].
Other Strassen-like FMM algorithms suffer similar problems.
Parallelization for distributed memory
The distributed memory implementation of Strassen by Luo and
Drake [81] based on Cannon’s algorithm [17] inspired the more practical im-
20
plementation by Grayson and van de Geijn [42] that builds upon the SUMMA
algorithm for distributed memory matrix-matrix multiplication by van de
Geijn and Watts [120]. Ballard [7] and Lipshitz [79] demonstrated a parallel
Strassen implementation for distributed memory that minimizes communi-
cation cost.
Parallelization for GPUs
Strassen and Winograd [129] variants are efficiently implemented on a
single GPU in [76, 72]. Lai et al. [72] supported arbitrary sized matrices by use
of dynamic peeling [117] and exploited multi-kernel concurrency to leverage the
inter-block and intra-block parallelism in the lowest level of recursion, based
on the cuBLAS library [87].
2.3 Summary
We have reviewed the related work on the high-performance imple-
mentations of the conventional matrix multiplication and the theoretical and
practical breakthroughs of Strassen and similar Strassen-like FMM algo-
rithms over the last half century. More details for the work directly related to
this dissertation will be given in the later chapters.
21
Chapter 3
A Practical Strassen’s Algorithm
Let us now focus on the practical implementation of Strassen, the
original Fast Matrix Multiplication algorithm. The orthodox implementation
of Strassen entails “street wisdom” that it is only practical for large and
relatively square matrices, that it needs substantial workspace, that it suf-
fers the overhead of extra memory movement that is incurred, and that it
is laborious to achieve thread-level parallelism. In this chapter, we counter
conventional wisdom, demostrating significant performance improvements for
small and non-square matrices, requiring no workspace beyond what is already
incorporated in the high-performance implementation of gemm, and achiev-
ing speedup on single core, multi-core, many-core, and distributed memory
architectures.
This chapter is based on the conference paper [56] with minor modifications: “Jianyu
Huang, Tyler M. Smith, Greg M. Henry, and Robert A. van de Geijn. Strassen’s algorithm
reloaded. In Proceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis (SC 16), pages 59:1-59:12. IEEE Press, 2016.” I am the
main contributor in charge of problem formulation, algorithm development, performance
analysis, and experimental validations.
22
M0 =(A00 + A11)(B00 +B11); C00+= M0;C11+= M0;
M1 =(A10 + A11)B00; C10+= M1;C11−= M1;
M2 =A00(B01 −B11); C01+= M2;C11+= M2;
M3 =A11(B10 −B00); C00+= M3;C10+= M3;
M4 =(A00 + A01)B11; C01+= M4;C00−= M4;
M5 =(A10 − A00)(B00 +B01); C11+= M5;
M6 =(A01 − A11)(B10 +B11); C00+= M6;
Figure 3.1: All operations for one-level Strassen. Note that each row is a
special case of general operation (3.2).
3.1 Strassen’s algorithm reloaded
In this section, we present the basic idea and practical considerations of
Strassen, decomposing it into a combination of general operations that can
be adapted to the high-performance implementation of a traditional gemm.
3.1.1 The basic idea






for X ∈ {A,B,C} (3.1)
then it can be verified that the operations in Figure 3.1 also compute C = AB+
C, requiring only seven multiplications with submatrices. The computational
cost is, approximately, reduced from 2mnk flops to (7/8)2mnk flops, at the
expense of a lower-order number of extra additions. Figure 3.1 describes what
we will call one-level Strassen.
23
3.1.2 Classic Strassen’s algorithm
Each of the matrix multiplications that computes an intermediate result
Mk can itself be computed with another level of Strassen’s algorithm. This
can then be repeated recursively.
If originally m = n = k = 2d, where d is an integer, then the cost
becomes
(7/8)d 2n3 = (7/8)log2(n) 2n3 = nlog2(7/8)2n3 = 2n2.807 flops.
In this discussion, we ignored the increase in the total number of extra addi-
tions, which contributes a lower-order term.
3.1.3 Practical considerations
A high-performance implementation of a traditional matrix-matrix mul-
tiplication requires careful attention to details related to data movements be-
tween memory layers, scheduling of operations, and implementations at a very
low level (often in assembly code). Practical implementations recursively per-
form a few levels of Strassen until the matrices become small enough so that
a traditional high-performance gemm is faster. At that point, the recursion
stops and a high-performance gemm is used for the subproblems. In prior
implementations, the crossover point is usually as large as 2000 for double
precision square matrices on a single core of an x86 CPU [26, 9]. We will see
that, for the same architecture, one of our implementations has a crossover
point as small as 500 (Figure 3.8).
24
In an ordinary matrix-matrix multiplication, three matrices must be
stored, for a total of 3n2 floating point numbers (assuming all matrices are
n × n). The most naive implementation of one-level Strassen requires an




(for M0 through M6) and ten ma-




for A00 + A11, B00 + B11, etc. A careful ordering of the
computation can reduce this to two matrices [15]. We show that the compu-
tation can be organized so that no temporary storage beyond that required
for a high-performance traditional gemm is needed. In addition, it is easy to
parallelize for multi-core and many-core architectures with our approach, since
we can adopt the same parallel scheme advocated by BLIS.
The general case where one or more dimensions are not a convenient
multiple of a power of two leads to the need to either pad matrices or to treat
a remaining “fringe” carefully [59]. Traditionally, it is necessary to pad m, n,
and k to be even. In our approach this can be handled internally by padding
Ãi and B̃p, and by using tiny (mR×nR) buffers for C along the fringes (much
like the BLIS framework does, see Section 2.1.5).
3.1.4 One-level Strassen reloaded
The operations summarized in Figure 3.1 are all special cases of
M = (X + δY )(V + εW ); D+= γ0M ; E+= γ1M ; (3.2)
for appropriately chosen γ0, γ1, δ, ε ∈ {−1, 0, 1}. Here, X and Y are subma-
























mC Ei Yi 
mC Di Xi 








kC Yp Wp Ej 
kC 
Xp Vp Dj 







Y W Ej 

























mC Ci Ai mC 
L2	cache	















Loop 5 for jc =0 : n−1 steps of nc
Jc =jc : jc+nc−1
Loop 4 for pc =0 : k−1 steps of kc
Pc =pc : pc+kc−1
V (Pc,Jc) + εW (Pc,Jc) → B̃p // Pack into B̃p
Loop 3 for ic =0 : m−1 steps of mc
Ic = ic : ic+mc−1
X(Ic,Pc) + δY (Ic,Pc) → Ãi // Pack into Ãi
// macro-kernel
Loop 2 for jr =0 : nc−1 steps of nr
Jr =jr : jr+nr−1
Loop 1 for ir =0 : mc−1 steps of mr
Ir = ir : ir+mr−1
// micro-kernel
Loop 0 for pr =0 : pc−1 steps of 1
Mr(Ir,Jr) += Ãi(Ir, pr) B̃p(pr,Jr)
endfor
D(Ir + ic,Jr + jc) += γ0Mr(Ir,Jr)






Figure 3.2: Left: modification of Figure 2.1 that implements the representative computation M = (X +
Y )(V +W );D+= M ;E+= M of general operation (3.2). X, Y are submatrices of A; V , W are submatrices
of B; D, E are submatrices of C; M is the intermediate matrix product. Note that the packing buffers Ãi























mC Ei Yi 
mC Di Xi 








kC Yp Wp Ej 
kC 
Xp Vp Dj 







Y W Ej 

























mC Ci Ai mC 
L2	cache	















Figure 3.3: A side-by-side comparison of the BLIS implementation of the GotoBLASalgorithm and our
modifications for implementing the representive computation M = (X + Y )(V + W );D+= M ;E+= M .
Left: Figure 2.1; Right: Figure 3.2.
27
Let us focus on how to modify the algorithm illustrated in Figure 2.1
in order to accommodate the representative computation
M = (X + Y )(V +W );D+= M ;E+= M.
As illustrated in Figure 3.2 and Figure 3.3, the key insight is that the additions
of matrices V +W can be incorporated in the packing into buffer B̃p and the
additions of matrices X + Y in the packing into buffer Ãi. Also, when a small
block of (X + Y )(V + W ) is accumulated in registers it can be added to the
appropriate parts of both D and E, multiplied by γ0 and γ1, as needed, inside
a modified micro-kernel. This avoids multiple passes over the various matrices,
which would otherwise add a considerable overhead from memory movements.




C0,0 C0,1 C0,2 C0,3
C1,0 C1,1 C1,2 C1,3
C2,0 C2,1 C2,2 C2,3
C3,0 C3,1 C3,2 C3,3
 , A =

A0,0 A0,1 A0,2 A0,3
A1,0 A1,1 A1,2 A1,3
A2,0 A2,1 A2,2 A2,3




B0,0 B0,1 B0,2 B0,3
B1,0 B1,1 B1,2 B1,3
B2,0 B2,1 B2,2 B2,3

















. Then it can be verified
that the computations in Figure 3.4 compute C = AB + C. The operations
found there can be cast as special cases of
M = (X0 + δ1X1 + δ2X2 + δ3X3)× (V0 + ε1V1 + ε2V2 + ε3V3);
C0+= γ0M ;C1+= γ1M ;C2+= γ2M ;C3+= γ3M
28
M0 = (A0,0+A2,2+A1,1+A3,3)(B0,0 + B2,2+B1,1+B3,3); C0,0+= M0; C1,1+= M0; C2,2+= M0; C3,3+= M0;
M1 = (A1,0+A3,2+A1,1+A3,3)(B0,0 + B2,2); C1,0+= M1; C1,1−= M1; C3,2+= M1; C3,3−= M1;
M2 = (A0,0+A2,2)(B0,1+B2,3+B1,1 + B3,3); C0,1+= M2; C1,1+= M2; C2,3+= M2; C3,3+= M2;
M3 = (A1,1+A3,3)(B1,0+B3,2+B0,0 + B2,2); C0,0+= M3; C1,0+= M3; C2,2+= M3; C3,2+= M3;
M4 = (A0,0+A2,2+A0,1+A2,3)(B1,1 + B3,3); C0,0−= M4; C0,1+= M4; C2,2−= M4; C2,3+= M4;
M5 = (A1,0+A3,2+A0,0+A2,2)(B0,0 + B2,2+B0,1+B2,3); C1,1+= M5; C3,3+= M5;
M6 = (A0,1+A2,3+A1,1+A3,3)(B1,0 + B3,2+B1,1+B3,3); C0,0+= M6; C2,2+= M6;
M7 = (A2,0+A2,2+A3,1+A3,3)(B0,0 + B1,1); C2,0+= M7; C3,1+= M7; C2,2−= M7; C3,3−= M7;
M8 = (A3,0+A3,2+A3,1+A3,3)(B0,0); C3,0+= M8; C3,1−= M8; C3,2−= M8; C3,3+= M8;
M9 = (A2,0+A2,2)(B0,1+B1,1); C2,1+= M9; C3,1+= M9; C2,3−= M9; C3,3−= M9;
M10 = (A3,1+A3,3)(B1,0+B0,0); C2,0+= M10;C3,0+= M10;C2,2−= M10;C3,2−= M10;
M11 = (A2,0+A2,2+A2,1+A2,3)(B1,1); C2,0−= M11;C2,1+= M11;C2,2+= M11;C2,3−= M11;
M12 = (A3,0+A3,2+A2,0+A2,2)(B0,0 + B0,1); C3,1+= M12;C3,3−= M12;
M13 = (A2,1+A2,3+A3,1+A3,3)(B1,0 + B1,1); C2,0+= M13;C2,2−= M13;
M14 = (A0,0+A1,1)(B0,2+B2,2+B1,3 + B3,3); C0,2+= M14;C1,3+= M14;C2,2+= M14;C3,3+= M14;
M15 = (A1,0+A1,1)(B0,2+B2,2); C1,2+= M15;C1,3−= M15;C3,2+= M15;C3,3−= M15;
M16 = (A0,0)(B0,3+B2,3+B1,3+B3,3); C0,3+= M16;C1,3+= M16;C2,3+= M16;C3,3+= M16;
M17 = (A1,1)(B1,2+B3,2+B0,2+B2,2); C0,2+= M17;C1,2+= M17;C2,2+= M17;C3,2+= M17;
M18 = (A0,0+A0,1)(B1,3+B3,3); C0,2−= M18;C0,3+= M18;C2,2−= M18;C2,3+= M18;
M19 = (A1,0+A0,0)(B0,2+B2,2+B0,3 + B2,3); C1,3+= M19;C3,3+= M19;
M20 = (A0,1+A1,1)(B1,2+B3,2+B1,3 + B3,3); C0,2+= M20;C2,2+= M20;
M21 = (A2,2+A3,3)(B2,0+B0,0+B3,1 + B1,1); C0,0+= M21;C1,1+= M21;C2,0+= M21;C3,1+= M21;
M22 = (A3,2+A3,3)(B2,0+B0,0); C1,0+= M22;C1,1−= M22;C3,0+= M22;C3,1−= M22;
M23 = (A2,2)(B2,1+B0,1+B3,1+B1,1); C0,1+= M23;C1,1+= M23;C2,1+= M23;C3,1+= M23;
M24 = (A3,3)(B3,0+B1,0+B2,0+B0,0); C0,0+= M24;C1,0+= M24;C2,0+= M24;C3,0+= M24;
M25 = (A2,2+A2,3)(B3,1+B1,1); C0,0−= M25;C0,1+= M25;C2,0−= M25;C2,1+= M25;
M26 = (A3,2+A2,2)(B2,0+B0,0+B2,1 + B0,1); C1,1+= M26;C3,1+= M26;
M27 = (A2,3+A3,3)(B3,0+B1,0+B3,1 + B1,1); C0,0+= M27;C2,0+= M27;
M28 = (A0,0+A0,2+A1,1+A1,3)(B2,2 + B3,3); C0,0−= M28;C1,1−= M28;C0,2+= M28;C1,3+= M28;
M29 = (A1,0+A1,2+A1,1+A1,3)(B2,2); C1,0−= M29;C1,1+= M29;C1,2+= M29;C1,3−= M29;
M30 = (A0,0+A0,2)(B2,3+B3,3); C0,1−= M30;C1,1−= M30;C0,3+= M30;C1,3+= M30;
M31 = (A1,1+A1,3)(B3,2+B2,2); C0,0−= M31;C1,0−= M31;C0,2+= M31;C1,2+= M31;
M32 = (A0,0+A0,2+A0,1+A0,3)(B3,3); C0,0+= M32;C0,1−= M32;C0,2−= M32;C0,3+= M32;
M33 = (A1,0+A1,2+A0,0+A0,2)(B2,2 + B2,3); C1,1−= M33;C1,3+= M33;
M34 = (A0,1+A0,3+A1,1+A1,3)(B3,2 + B3,3); C0,0−= M34;C0,2+= M34;
M35 = (A2,0+A0,0+A3,1+A1,1)(B0,0 + B0,2+B1,1+B1,3); C2,2+= M35;C3,3+= M35;
M36 = (A3,0+A1,0+A3,1+A1,1)(B0,0 + B0,2); C3,2+= M36;C3,3−= M36;
M37 = (A2,0+A0,0)(B0,1+B0,3+B1,1 + B1,3); C2,3+= M37;C3,3+= M37;
M38 = (A3,1+A1,1)(B1,0+B1,2+B0,0 + B0,2); C2,2+= M38;C3,2+= M38;
M39 = (A2,0+A0,0+A2,1+A0,1)(B1,1 + B1,3); C2,2−= M39;C2,3+= M39;
M40 = (A3,0+A1,0+A2,0+A0,0)(B0,0 + B0,2+B0,1+B0,3); C3,3+= M40;
M41 = (A2,1+A0,1+A3,1+A1,1)(B1,0 + B1,2+B1,1+B1,3); C2,2+= M41;
M42 = (A0,2+A2,2+A1,3+A3,3)(B2,0 + B2,2+B3,1+B3,3); C0,0+= M42;C1,1+= M42;
M43 = (A1,2+A3,2+A1,3+A3,3)(B2,0 + B2,2); C1,0+= M43;C1,1−= M43;
M44 = (A0,2+A2,2)(B2,1+B2,3+B3,1 + B3,3); C0,1+= M44;C1,1+= M44;
M45 = (A1,3+A3,3)(B3,0+B3,2+B2,0 + B2,2); C0,0+= M45;C1,0+= M45;
M46 = (A0,2+A2,2+A0,3+A2,3)(B3,1 + B3,3); C0,0−= M46;C0,1+= M46;
M47 = (A1,2+A3,2+A0,2+A2,2)(B2,0 + B2,2+B2,1+B2,3); C1,1+= M47;
M48 = (A0,3+A2,3+A1,3+A3,3)(B3,0 + B3,2+B3,1+B3,3); C0,0+= M48;
Figure 3.4: Computations for two-level Strassen.
29
by appropriately picking γi, δi, εi ∈ {−1, 0, 1}. Importantly, the computation
now requires 49 multiplications for submatrices as opposed to 64 for a conven-
tional gemm.
To extend the insights from Section 3.1.4 so as to integrate two-level
Strassen into the BLIS gemm implementation, we incorporate the addition
of up to four submatrices of A and B, the updates of up to four submatrices
of C inside the micro-kernel, and the tracking of up to four submatrices in the
loops in BLIS.
3.1.6 Additional levels
A pattern now emerges. The operation needed to integrate k levels of







;Cr+= γrM for r = 0, . . . , lC − 1. (3.3)
For each number, l, of levels of Strassen that are integrated, a table can
then be created that captures all the computations to be executed.
3.2 Implementation and analysis
We now discuss the details of how we adapt the high-performance Go-
toBLAS approach to these specialized operations to yield building blocks for
a family of Strassen implementations. Next, we also give a performance
model for comparing members of this family.
30
3.2.1 Implementations
We implemented a family of algorithms for up to two levels1 of Strassen
for double precision arithmetic and data, building upon the BLIS framework.
Building blocks
The BLIS framework provides three primitives for composing gemm:
a routine for packing Bp into B̃p, a routine for packing Ai into Ãi, and a micro-
kernel for updating an mR × nR submatrix of C. The first two are typically
written in C while the last one is typically written in inlined assembly code2.
To implement a typical operation given in (3.3),
• the routine for packing Bp is modified to integrate the addition of mul-
tiple matrices Vt into packed buffer B̃p;
• the routine for packingAi is modified to integrate the addition of multiple
matrices Xs into packed buffer Ãi; and
• the micro-kernel is modified to integrate the addition of the result to
multiple submatrices.
1We can support three or more levels of Strassen, by modifying the packing routines
and the micro-kernel to incorporate more summands. However, the crossover point for
the three-level Strassen to outperform all one/two-level Strassen implementations is very
large (∼ 10000 for square matrices). There are also concerns regarding to numerical stability
issues with many levels of recursions. So we don’t go beyond two levels in this chapter.
2The inlined assembly code is expressed in GNU extended assembly syntax with asm and
written to utilize SIMD vector instructions such as SSE2 and AVX.
31
Variations on a theme
The members of our family of Strassen implementations differ by how
many levels of Strassen they incorporate and which of the above described
modified primitives they use:
• Naive Strassen: A traditional implementation with temporary buffers.
• AB Strassen: Integrates the addition of matrices into the packing of
buffers Ãi and B̃p but creates explicit temporary buffers for matrices M .
• ABC Strassen: Integrates the addition of matrices into the packing
of buffers Ãi and B̃p and the addition of the result of the micro-kernel
computation to multiple submatrices of C. For small problem size k
(this shape is also called rank-k update), this version has the advantage
over AB Strassen that the temporary matrix M is not moved in and
out of memory multiple times. The disadvantage is that for large k the
submatrices of C to which contributions are added are moved in and out
of memory multiple times instead.
Different choices lead to a family of Strassen implementations.
3.2.2 Performance model
In order to compare the performance of the traditional BLAS dgemm
routine and the various implementations of Strassen, we define the effective
32
GFLOPS metric for m× k × n matrix multiplication, similar to [9, 42, 79]:
effective GFLOPS =
2 ·m · n · k
time (in seconds)
· 10−9. (3.4)
We next derive a model to predict the execution time T and the effective
GFLOPS of the traditional BLAS dgemm and the various implementations
of Strassen. Theoretical predictions allow us to compare and contrast dif-
ferent implementation decisions, help with performance debugging, and (if
sufficiently accurate) can be used to choose the right member of the family of
implementations as a function of the number of threads used and/or problem
size.
Assumption
Our performance model assumes that the underlying architecture has a
modern memory hierarchy with fast caches and relatively slow main memory
(DRAM). We assume the latency for accessing the fast caches can be ignored
(either because it can be overlapped with computation or because it can be
amortized over sufficient computation) while the latency of loading from main
memory is exposed. For memory store operations, our model assumes that
a lazy write-back policy guarantees the time for storing into fast caches can
be hidden. The slow memory operations for BLAS dgemm and the various
implementation of Strassen consist of three parts:
• memory packing in (adapted) dgemm routine;
• reading/writing the submatrices of C in (adapted) dgemm routine; and
33
τa Time (in seconds) of one arithmetic (floating point) operation.
τb
(Bandwidth) Amortized time (in seconds) of 8 Bytes contiguous data
movement from DRAM to cache.
T Total execution time (in seconds).
Ta The total time for arithmetic operations (in seconds).
Tm Time for memory operations (in seconds).

















m Tm for writing submatrices in packing routines (Figure 3.3).
T
C×








Tm for reading or writing submatrices, related to the temporary buffer
as part of Naive Strassen and AB Strassen.
NXa /N
X





Figure 3.5: Notation table for Strassen performance model.
• reading/writing of the temporary buffer that are part of Naive Strassen
and AB Strassen, outside (adapted) dgemm routine.
Based on these assumptions, the execution time is dominated by the arithmetic
operations and the slow memory operations.
Notation
Parameter τa denotes the time (in seconds) of one arithmetic (floating
point) operation, i.e., the reciprocal of the theoretical peak GFLOPS of the
system. Parameter τb (bandwidth, memory operation) denotes the amortized
time (in seconds) of one unit (one double precision floating point number, or





For single core, we need to further multiply it by the number of channels.
34
type τ dgemm one-level two-level
T×a - τa 2mnk 7× 2m2 n2 k2 49× 2m4 n4 k4
TA+a - τa - 5× 2m2 k2 95× 2m4 k4
TB+a - τa - 5× 2k2 n2 95× 2k4 n4
TC+a - τa - 12× 2m2 n2 144× 2m4 n4























































































Figure 3.6: Theoretical run time breakdown analysis of BLAS dgemm and
various implementations of Strassen. The time shown in the first column
for dgemm, one-level Strassen, two-level Strassen can be computed sep-
arately by multiplying the parameter in τ column with the number in the
corresponding entries. Due to the software prefetching effects, the row marked
with (∗) needs to be multiplied by an additional parameter λ ∈ [0.5, 1], which
denotes the prefetching efficiency. λ is adjusted to match BLIS dgemm per-
formance.
The total execution time (in seconds), T , is broken down into the time
for arithmetic operations, Ta, and memory operations:
T = Ta + Tm. (3.5)













dgemm 1 1 1 - - -
one-level
ABC 12 12 12 - - -
AB 12 12 7 - - 36
Naive 7 7 7 19 19 36
two-level
ABC 194 194 144 - - -
AB 194 194 49 - - 432
Naive 49 49 49 293 293 432
Figure 3.7: The coefficient NXm mapping table for computing Tm in the per-
formance model.
Arithmetic Operations















TC+a denote the arithmetic time of extra additions with submatrices of A,
B, C, respectively. For dgemm since there are no extra additions, Ta =
2mnk · τa. For one-level Strassen, Ta is comprised of 7 submatrix multi-
plications, 5 extra additions with submatrices of A, 5 extra additions with
submatrices of B, and 12 extra additions with submatrices of C. Therefore,
Ta = (1.75mnk+2.5mk+2.5kn+6mn) · τa. Note that the matrix addition ac-
tually involves 2 floating point operations for each entry because they are cast
as FMA instructions. Similar analyses can be applied to compute Ta of a two-




The total data movement overhead is determined by both the original
matrix sizes m, n, k, and block sizes mC , nC , kC in our implementation Fig-
ure 3.2. We characterize each memory operation term in Figure 3.6 by its
read/write type and the amount of memory (one unit=double precision float-




m · TA×m +NB×m · TB×m +NC×m · TC×m +NA+m · TA+m +NB+m · TB+m
+NC+m · TC+m , (3.7)
where TA×m , T
B×
m are the data movement time for reading from submatrices of
A, B, respectively, for memory packing in (adapted) dgemm routine; TC×m is
the data movement time for loading and storing submatrices of C in (adapted)




m are the data movement time for loading or
storing submatrices of A, B, C, respectively, related to the temporary buffer as
part of Naive Strassen and AB Strassen, outside (adapted) dgemm rou-
tine; the NXm s denote the corresponding coefficients, which are also tabulated
in Figure 3.7.
All write operations (T Ã×m , T
B̃×
m for storing submatrices of A, B, re-
spectively, into packing buffers) are omitted because our assumption of lazy
write-back policy with fast caches. Notice that memory operations can recur
37
multiple times depending on the loop in which they reside. For instance, for










submatrices of C as intermediate result inside the micro-kernel.
This is a step function proportional to k, because submatrices of C are used
to accumulate the rank-k update in the 5th loop in Figure 3.2.
3.2.3 Discussion
From the analysis summarized in Figures 3.6 and 3.7 we can make
predictions about the relative performance of the various implementations. It
helps to also view the predictions as graphs, which we give in Figure 3.8, using
parameters that capture the architecture described in Section 3.3.1.
• Asymptotically, the two-level Strassen implementations outperform
corresponding one-level Strassen implementations, which in turn out-
perform the traditional dgemm implementation.
• The graph for m = k = n, 1 core, shows that for smaller square matri-
ces, ABC Strassen outperforms AB Strassen, but for larger square
matrices this trend reverses. This holds for both one-level and two-level
Strassen. The reason is that, for small k, ABC Strassen reduced the
number of times the temporary matrix M needs to be brought in from
memory to be added to submatrices of C. For large k, it increases the
number of times the elements of those submatrices of C themselves are
moved in and out of memory.
38
































m=k=n, 1 core, modeled
Modeled DGEMM
Modeled One-level ABC Strassen
Modeled Two-level ABC Strassen
Modeled One-level AB Strassen
Modeled Two-level AB Strassen
Modeled One-level Naive Strassen
Modeled Two-level Naive Strassen









































































m=n=16000, k varies, 1 core, modeled
Modeled DGEMM
Modeled One-level ABC Strassen
Modeled Two-level ABC Strassen
Modeled One-level AB Strassen
Modeled Two-level AB Strassen
Modeled One-level Naive Strassen
Modeled Two-level Naive Strassen









































































k=1024, m=n vary, 1 core, modeled
Modeled DGEMM
Modeled One-level ABC Strassen
Modeled Two-level ABC Strassen
Modeled One-level AB Strassen
Modeled Two-level AB Strassen
Modeled One-level Naive Strassen
Modeled Two-level Naive Strassen









































Figure 3.8: Performance of the various implementations on an Intel Xeon
E5 2680 v2 (Ivybridge) processor (single core). Left: modeled performance.
Right: actual performance. The range of the y-axis does not start at 0 to
make the graphs more readable and 28.32 marks theoretical peak performance
for this architecture.
39
• The graph for m = n = 16000, k varies, 1 core, is particularly interest-
ing: it shows that for k equal to the appropriate multiple of kC (k = 2kC
for one-level and k = 4kC for two-level) ABC Strassen performs dra-
matically better than the other implementations, as expected.
The bottom line: the analysis predicts that, depending on the problem size, a
different implementation may have its advantages.
3.3 Performance experiments
We give details on the performance experiments for our implementa-
tions. The current version of Strassen dgemm is designed for the Intel
Sandy-Bridge/Ivy-Bridge processor and Intel Xeon Phi coprocessor (MIC Ar-
chitecture, KNC).
3.3.1 Single node experiments
Firstly, we reveal the details of single node experiments, using single-
core and multi-core Strassen implementations.
Implementation
The implementations are in C, utilizing SSE2 and AVX intrinsics and
assembly, compiled with the Intel C compiler version 15.0.3 with optimization
flag -O3. In addition, we compare against the standard BLIS implementation
(Version 0.1.8) from which our implementations are derived as well as Intel
MKL’s dgemm (Version 11.2.3) [61].
40
Target architecture
We measure the CPU performance results on the Maverick system at
the Texas Advanced Computing Center (TACC). Each node of that system
consists of a dual-socket (10 cores/socket) Intel Xeon E5-2680 v2 (Ivy Bridge)
processors with 12.8 GB/core of memory (Peak Bandwidth: 59.7 GB/s with
four channels) and a three-level cache: 32 KB L1 data cache, 256 KB L2 cache
and 25.6 MB L3 cache. The stable CPU clockrate is 3.54 GHz when a single
core is utilized (28.32 GFLOPS peak, marked in the graphs) and 3.10 GHz
when five or more cores are in use (24.8 GLOPS/core peak). To set thread
affinity and to ensure the computation and the memory allocation all reside
on the same socket, we use KMP AFFINITY=compact.
We choose the parameters nR = 4, mR = 8, kC = 256, nC = 4096
and mC = 96. This makes the size of the packing buffer Ãi 192 KB and
B̃p 8192 KB, which then fit the L2 cache and L3 cache, respectively. These
parameters are consistent with parameters used for the standard BLIS dgemm
implementation for this architecture.
Each socket consists of 10 cores, allowing us to also perform multi-
threaded experiments. Parallelization is implemented mirroring that described
in [108], using OpenMP directives that parallelize the third loop around the
micro-kernel in Figure 3.2.
41
















































































































































































































































Figure 3.9: Performance of the various implementations on an Intel Xeon E5
2680 v2 (Ivybridge) processor (one and ten cores). Left: 5 core. Right: 10




Results when using single core are presented in Figure 3.8 (right col-
umn). As expected, eventually two-level AB Strassen performs best, hand-
ily beating conventional dgemm. The exception is the case where k is fixed
to equal 1024 = 4 × kC , which is the natural blocking size for a two-level
Strassen based on our ideas. For those experiments ABC Strassen wins
out, as expected.
Figure 3.9 reports results for five and ten cores, all within the same
socket. We do not report results for twenty cores (two sockets), since this
results in a substantial performance reduction for all our implementations,
including the standard BLIS dgemm, relative to the MKL dgemm. This
exposes a performance bug in BLIS that has been reported.
When using many cores, memory bandwidth contention affects the per-
formance of the various Strassen implementations, reducing the benefits rel-
ative to a standard dgemm implementation.
3.3.2 Many-core experiments
To examine whether the techniques scale to a large number of cores, we




The implementations of ABC Strassen are in C and AVX512 intrinsics
and assembly, compiled with the Intel C compiler version 15.0.2 with opti-
mization flag -mmic -O3. The BLIS and ABC Strassen both parallelize the
second and third loop around the micro-kernel, as described for BLIS in [108].
Target architecture
We run the Xeon Phi performance experiments on the SE10P Copro-
cessor incorporated into nodes of the Stampede system at TACC. This copro-
cessor has a peak performance of 1056 GFLOPS (for 60 cores with 240 threads
used by BLIS) and 8 GB of GDDR5 DRAM with a peak bandwidth of 352
GB/s. It has 512 KB L2 cache, but no L3 cache.
We choose the parameters nR = 8, mR = 30, kC = 240, nC = 14400
and mC = 120. This makes the size of the packing buffer Ãi 225 KB and
B̃p 27000 KB, which fits L2 cache and main memory separately (no L3 cache
on Xeon Phi). These choices are consistent with those used by BLIS for this
architecture.
Results
As illustrated in Figure 3.10, relative to the BLIS dgemm implemen-
tation, the one-level ABC Strassen shows a nontrivial improvement for a
rank-k update with a fixed (large) matrix C. While the BLIS implementation
on which our implementation of ABC Strassen is based used to be highly
44











































Figure 3.10: Performance of one-level ABC Strassen, BLIS, and MKL, on
an Intel Xeon Phi (KNC) coprocessor (for 60 cores with 240 threads). The
performance results are sampled such that k is a multiple of 480 = 2× kC .
45
competitive with MKL’s dgemm (as reported in [108]), recent improvements
in that library demonstrate that the BLIS implementation needs an update.
We do not think there is a fundamental reason why our observations can-
not be used to similarly accelerate MKL’s dgemm, since it also implements
GotoBLAS algorithm.
3.3.3 Distributed memory experiments
Finally, we demonstrate how the ABC Strassen implementation can
be used to accelerate a distributed memory implementation of dgemm.
Implementation
We implement the Scalable Universal Matrix Multiplication Algorithm
(SUMMA) [120] with MPI. This algorithm distributes the algorithm to a mesh
of MPI processes using a 2D block cyclic distribution. The multiplication is
broken down into a sequence of rank-b updates,
C := AB + C =
(




= A0B0 + · · ·+ AK−1BK−1 + C
where each Ap consists of (approximately) b columns and each Bp consists of
(approximately) b rows. For each rank-b update Ap is broadcast within rows of
the mesh and Bp is broadcast within columns of the mesh, after which locally




The distributed memory experiments are performed on the same ma-
chine described in Section 3.3.1, using the mvapich2 version 2.1 [58] imple-
mentation of MPI. Each node has two sockets, and each socket has ten cores.
Results
Figure 3.11 reports weak scalability on up to 32 nodes (64 sockets,
640 cores). For these experiments we choose the MPI mesh of processes to
be square, with one MPI process per socket, and attained thread parallelism
among the ten cores in a socket within BLIS, MKL, or any of our Strassen
implementations.
It is well-known that the SUMMA algorithm is weakly scalable in the
sense that efficiency essentially remains constant if the local memory dedicated
to matrices A, B, C, and temporary buffers is kept constant. For this reason,
the local problem size is fixed to equal m = k = n = 16000 so that the global
problem becomes m = k = n = 16000 × N when an N × N mesh of sockets
(MPI processes) is utilized. As expected, the graph shows that the SUMMA
algorithm is weakly scalable regardless of which local gemm algorithm is used.
The local computation within the SUMMA algorithm matches the shape for
which ABC Strassen is a natural choice when the rank-k updates are per-
formed with b = 1024. For this reason, the one-level and two-level ABC
Strassen implementations achieve the best performance.
What this experiment shows is that the benefit of using our Strassen
47







































m=k=n=16000·N on N×N MPI mesh









Figure 3.11: Performance of the various implementations on distributed mem-
ory (weak scalability).
48
implementations can be easily transferred to other algorithms that are rich in
large rank-k updates.
3.4 Summary
We have presented novel insights into the implementations of Strassen
that greatly reduce overhead that was inherent in previous formulations and
had been assumed to be insurmountable. These insights have yielded a family
of algorithms that outperform conventional high-performance implementations
of gemm as well as naive implementations. Components that are part of the
BLIS framework for implementing BLAS-like libraries are modified to facilitate
implementation. Implementations and performance experiments are presented
to demonstrate performance benefits for single core, multi-core, many-core,
and distributed memory parallel implementations. Together, this advances
nearly 50 years of research into the theory and practice of Strassen.
Our analysis shows that the ABC Strassen implementation fulfills our
claim that Strassen can outperform classical gemm for small matrices and
small k while requiring no temporary buffers beyond those already internal to
high-performance gemm implementations. The AB Strassen implementation





matrix for an L-level Strassen.
49
Chapter 4
Practical Fast Matrix Multiplication
Algorithms
Over the last half century, Strassen has fueled many theoretical im-
provements such as other variations of Strassen-like FMM algorithms. In this
chapter, we extend insights on how to express multiple levels of Strassen
in terms of Kronecker products [51] to multi-level FMM algorithms, facilitat-
ing a code generator for all FMM methods from [9] (including Strassen), in
terms of the building blocks created in Chapter 3, but allowing different FMM
algorithms to be used for each level. Importantly and unique to this work,
the code generator also yields performance models that are accurate enough
to guide the choice of a FMM implementation as a function of problem size
and shape, facilitating the creation of “poly-algorithms” [77]. Performance
results from single core and multi-core shared memory systems support the
theoretical insights.
This chapter is based on the conference paper [55] with minor modifications: “Jianyu
Huang, Leslie Rice, Devin A. Matthews, and Robert A. van de Geijn. Generating families
of practical fast matrix multiplication algorithms. In 31th IEEE International Parallel and
Distributed Processing Symposium (IPDPS 2017), pages 656-667, May 2017.” I am the main
contributor in charge of problem formulation, algorithm development, performance analysis,
and experimental validations.
50
4.1 Fast matrix multiplication basics
We now present the basic idea that underlies families of FMM algo-
rithms and how to generalize a one-level formula to multi-level FMM algo-
rithms utilizing Kronecker products and recursive block storage indexing.
4.1.1 One-level fast matrix multiplication algorithms
In [9], the theory of tensor contractions is used to find a large number
of FMM algorithms. In this subsection, we use the output (the resulting
algorithms) of their approach.
Generalizing the partitioning for Strassen, consider C := C + AB,
where C, A, and B are m × n, m × k, and k × n matrices, respectively. A
〈m̃, k̃, ñ〉 FMM algorithm is defined [13, 9, 6] by partitioning
C=
 C0 ··· Cñ−1... ...
C(m̃−1)ñ ··· Cm̃ñ−1
, A=




 B0 ··· Bñ−1... ...
B(k̃−1)ñ ··· Bk̃ñ−1
 .
Note that Ai, Bj, and Cp are the submatrices of A, B and C, with a single
index in the row major order. Then, C := C + AB is computed by,














Cp+= wprMr (p = 0, ..., m̃ñ− 1)
(4.1)
51
M0 =(A0 + A3)(B0 +B3); C0+= M0;C3+= M0;
M1 =(A2 + A3)B0; C2+= M1;C3−= M1;
M2 =A0(B1 −B3); C1+= M2;C3+= M2;
M3 =A3(B2 −B0); C0+= M3;C2+= M3;
M4 =(A0 + A1)B3; C1+= M4;C0−= M4;
M5 =(A2 − A0)(B0 +B1); C3+= M5;
M6 =(A1 − A3)(B2 +B3); C0+= M6;
Figure 4.1: All operations for one-level Strassen. Compared to Figure 3.1,
the indeces for the submatrices of A, B, and C have been changed.
where (×) is a matrix multiplication that can be done recursively, uir, vjr, and
wpr are entries of a (m̃k̃)×R matrix U , a (k̃ñ)×R matrix V , and a (m̃ñ)×R
matrix W , respectively. Therefore, the classical matrix multiplication which
needs m̃k̃ñ submatrix multiplications can be completed with R submatrix
multiplications. The set of coefficients that determine the 〈m̃, k̃, ñ〉 algorithm
is denoted as JU, V,W K.
For example, assuming thatm, n, and k are all even, one-level Strassen






for X ∈ {A,B,C} (4.2)
and computations in Figure 4.1,
J
1 0 1 0 1 −1 00 0 0 0 1 0 1
0 1 0 0 0 1 0
1 1 0 1 0 0 −1
,
1 1 0 −1 0 1 00 0 1 0 0 1 0
0 0 0 1 0 0 1
1 0 −1 0 1 0 1
,
1 0 0 1 −1 0 10 0 1 0 1 0 0
0 1 0 1 0 0 0
1 −1 1 0 0 1 0
K
(4.3)
specifies JU, V,W K for one-level Strassen. Another example of 〈3, 2, 3〉 FMM
algorithm can be found in Appendix C.
52
〈m̃, k̃, ñ〉 Ref. m̃k̃ñ R
Speedup (%)
Theory
Practical #1 Practical #2
Ours [9] Ours [9]
〈2, 2, 2〉 [115] 8 7 14.3 11.9 -3.0 13.1 13.1
〈2, 3, 2〉 [9] 12 11 9.1 5.5 -13.1 7.7 7.7
〈2, 3, 4〉 [9] 24 20 20.0 11.9 -8.0 16.3 17.0
〈2, 4, 3〉 [6] 24 20 20.0 4.8 -15.3 14.9 16.6
〈2, 5, 2〉 [6] 20 18 11.1 1.5 -23.1 8.6 8.3
〈3, 2, 2〉 [6] 12 11 9.1 7.1 -6.6 7.2 7.5
〈3, 2, 3〉 [6] 18 15 20.0 14.1 -0.7 17.2 16.8
〈3, 2, 4〉 [6] 24 20 20.0 11.9 -1.8 16.1 17.0
〈3, 3, 2〉 [6] 18 15 20.0 11.4 -8.1 17.3 16.5
〈3, 3, 3〉 [104] 27 23 17.4 8.6 -9.3 14.4 14.7
〈3, 3, 6〉 [104] 54 40 35.0 -34.0 -41.6 24.2 20.1
〈3, 4, 2〉 [9] 24 20 20.0 4.9 -15.7 16.0 16.8
〈3, 4, 3〉 [104] 36 29 24.1 8.4 -12.6 18.1 20.1
〈3, 5, 3〉 [104] 45 36 25.0 5.2 -20.6 19.1 18.9
〈3, 6, 3〉 [104] 54 40 35.0 -21.6 -64.5 19.5 17.8
〈4, 2, 2〉 [6] 16 14 14.3 9.4 -4.7 11.9 12.2
〈4, 2, 3〉 [9] 24 20 20.0 12.1 -2.3 15.9 17.3
〈4, 2, 4〉 [6] 32 26 23.1 10.4 -2.7 18.4 19.1
〈4, 3, 2〉 [6] 24 20 20.0 11.3 -7.8 16.8 15.7
〈4, 3, 3〉 [6] 36 29 24.1 8.1 -8.4 19.8 20.0
〈4, 4, 2〉 [6] 32 26 23.1 -4.2 -18.4 17.1 18.5
〈5, 2, 2〉 [6] 20 18 11.1 7.0 -6.7 8.2 8.5
〈6, 3, 3〉 [104] 54 40 35.0 -33.4 -42.2 24.0 20.2
Figure 4.2: Theoretical and practical speedup for various FMM algorithms
(double precision). m̃k̃ñ is the number of multiplication for classical matrix
multiplication algorithm. R is the number of multiplication for fast matrix
multiplication algorithm. Theoretical speedup is the speedup per recursive
step. Practical #1 speedup is the speedup for one-level FMM algorithms
compared with gemm when m = n = 14400, k = 480 (rank-k updates). Prac-
tical #2 speedup is the speedup for one-level FMM algorithms compared with
gemm when m = n = 14400, k = 12000 (approximately square). We report
the practical speedup of the best implementation of our generated code (gener-
ated gemm) and the implementations in [9] (linked with Intel MKL) on single
core. More details about the experiment setup is described in Section 4.3.1.
53
Figure 4.2 summarizes a number of such algorithms1 that can be found
in the literature that we eventually test in Section 4.3. We only consider
2 ≤ m̃, k̃, ñ ≤ 6 and don’t include arbitrary precision approximate (APA)
algorithms [12], due to their questionable numerical stability.
4.1.2 Kronecker product
If X and Y are m × n and p × q matrices with (i, j) entries denoted
by xi,j and yi,j, respectively, then the Kronecker product [41] X ⊗ Y is the
mp× nq matrix given by
X ⊗ Y =
 x0,0Y · · · x0,n−1Y... . . . ...
xm−1,0Y · · · xm−1,n−1Y
 .
Thus, entry (X ⊗ Y )p·r+v,q·s+w = xr,syv,w.
4.1.3 Recursive block indexing (Morton-like ordering)
An example of recursive block storage indexing (Morton-like ordering)
[37] is given in Figure 4.3. In this example, A undergoes three levels of recursive
splitting, and each submatrix of A is indexed in row major form. By indexing
A, B, and C in this manner, data locality is maintained when operations are
performed on their respective submatrices.
1In Figure 4.2, the symmetric rotations (e.g., 〈2, 3, 4〉 vs. 〈2, 4, 3〉) may have different
performances. This is determined by the block size kC and the partition dimension k̃. If
k̃ is relatively large for rank-k updates, then the problem size k/k̃ after partition might be




































Figure 4.3: Illustration of recursive block storage indexing (Morton-like order-
ing) [37] on m × k matrix A where the partition dimensions m̃ = k̃ = 2 for
three-level recursions.
4.1.4 Representing two-level FMM with the Kronecker product
In [51], it is shown that multi-level 〈2, 2, 2〉 Strassen can be repre-
sented with Kronecker products. In this section, we extend this insight to
multi-level FMM algorithms, where each level can use a different choice of
〈m̃, k̃, ñ〉.
Assume each submatrix of A, B, and C is partitioned with another
level of 〈m̃′, k̃′, ñ′〉 FMM algorithm with the coefficients JU ′, V ′,W ′K, and Ai,
Bj, Cp are the submatrices of A, B and C, with a single index in two-level
recursive block storage indexing. Then it can be verified2 that C := C + AB
is computed by,











(V ⊗ V ′)j,rBj
)
;
Cp+= (W ⊗W ′)p,rMr(p = 0, ..., m̃ñ · m̃′ñ′ − 1)
2The complete proof is given as Theorem D.1 in Appendix D.
55
where ⊗ represents Kronecker Product. Note that U⊗U ′, V ⊗V ′, and W⊗W ′
are (m̃k̃ · m̃′k̃′)× (R ·R′), (k̃ñ · k̃′ñ′)× (R ·R′), (m̃ñ · m̃′ñ′)× (R ·R′) matrices,
respectively.
The set of coefficients of a two-level 〈m̃, k̃, ñ〉 and 〈m̃′, k̃′, ñ′〉 FMM
algorithm can be denoted as JU ⊗ U ′, V ⊗ V ′,W ⊗W ′K.
For example, the two-level Strassen is represented by the coefficients
JU⊗U, V ⊗V,W⊗W K where JU, V,W K are the one-level Strassen coefficients
given in Equation (4.3).
4.1.5 Additional levels of FMM
Comparing one-level and two-level FMM algorithms, the same skeleton
pattern emerges. The formula for defining L-level FMM algorithms is given
by3,
for r = 0, ...,
∏L−1

























Wl)p,rMr(p = 0, ...,
∏L−1
l=0 m̃lñl − 1)
(4.4)
The set of coefficients of an L-level 〈m̃l, k̃l, ñl〉 (l=0, 1, ..., L−1) FMM







3The complete proof is given as Theorem D.2 in Appendix D.
56
4.2 Implementation and analysis
The last section shows that families of one-level FMM algorithms can be
specified by 〈m̃, k̃, ñ〉 and JU, V,W K. It also shows how the Kronecker product
can be used to generate multi-level FMM algorithms that are iterative rather
than recursive. In this section, we discuss a code generator that takes as
input 〈m̃, k̃, ñ〉 and JU, V,W K and as output generates implementations that
build upon the primitives that combine taking linear combinations of matrices
with the packing routines and/or micro-kernels that underlie BLIS. The code
generator also provides a model of cost for each implementation that can then
be used to choose the best FMM algorithm for a matrix of given size and
shape. This code generator can thus generate code for arbitrary levels of FMM
algorithms that can use different FMM choices at each level. In this way, we
have generated and compared more than 20 FMM algorithms (Figure 4.2).
4.2.1 Code generation
Our code generator generates various implementations of FMM algo-
rithms with double precision arithmetic, based on the coefficient representation
JU, V,W K, levels of recursion, and packing routine/micro-kernel incorporation
specifications.
There are two stages for our code generator: generating the skeleton
framework and generating the typical operations given in (4.1).
57
Generating the skeleton framework
During this stage, the code generator
• Computes the Kronecker Product of the coefficient matrices JUl, Vl,WlK







• Generates the matrix partition code by conceptual recursive block stor-
age indexing with 〈m̃l, k̃l, ñl〉 partition dimensions for each level.







l=0 ñl, it generates dynamic peeling
[117] code to handle the remaining “fringes” after invoking FMM, which
requires no additional memory.
Generating the typical operations
To generate the code for the typical operations in (4.1), the generator
• Generates packing routines (written in C), that sum a list of submatrices
of A integrated into the packing routine, yielding Ãi, and similarly sum
a list of submatrices of B integrated into the packing routine, yielding
B̃p, extending what is illustrated in Figure 2.1 and Figure 3.2.
• Assembles a specialized micro-kernel comprised of a hand-coded opti-




In Chapter 3, a number of variations on the theme illustrated in Fig-
ure 3.2 are discussed:
• Naive FMM: A classical implementation with temporary buffers for
storing the sum of A, B, and the intermediate matrix product Mr.
• AB FMM: The packing routines incorporate the summation of sub-
matrices of A, B into the packing of buffers Ãi and B̃p but explicit
temporary buffers for matrices Mr are used.
• ABC FMM: AB FMM, but with a specialized micro-kernel that
incorporates addition of Mr to multiple submatrices of C.
Incorporating the generation of these variations into the code generator yields
over 100 FMM implementations.4
4.2.2 Performance model
In this subsection, we provide a performance model to predict the ex-
ecution time T for the various FMM implementations generated by our code
generator. Theoretical estimation helps us better understand the computation
and memory footprint of different FMM implementations, and allows us to
avoid exhaustive empirical search when searching for the best implementation
for different problem sizes and shapes. Most importantly, our code generator
423 FMM algorithms (Figure 4.2) × 3 variations × 2 levels.
59
can embed our performance model to guide the selection of an FMM imple-
mentation as a function of problem size and shape, with the input 〈m̃l, k̃l, ñl〉
and JUl, Vl,WlK specifications on each level l. These performance models are
themselves automatically generated.
Assumption
Basically, we assume that the architecture has two layers of modern
memory hierarchy: fast caches and relatively slow main memory (DRAM).
For read operations, the latency for accessing cache can be ignored, while the
latency for accessing the main memory is counted; For write operations, we
assume a lazy write-back policy such that the time for writing into fast caches
can be hidden. Based on these assumptions, the memory operations for gemm
and various implementations of FMM algorithms are decomposed into three
parts:
• memory packing shown in Figures 2.1 and 3.2;
• reading/writing the submatrices of C in Figures 2.1 and 3.2; and
• for Naive FMM and AB FMM, reading/writing of the temporary
buffer.
Notation
Notation is summarized in Figure 4.4. The total execution time, T , is
dominated by arithmetic time Ta and memory time Tm ( 2© in Figure 4.5).
60
τa Time (in seconds) of one arithmetic (floating point) operation.
τb
(Bandwidth) Amortized time (in seconds) of 8 Bytes contiguous data
movement from DRAM to cache.
T Total execution time (in seconds).
Ta Time for arithmetic operations (in seconds).
Tm Time for memory operations (in seconds).

















m Tm for writing submatrices in packing routines (Fig. 2.1 and 3.2).
T
C×








Tm for reading or writing submatrices, related to the temporary buffer
as part of Naive FMM and AB FMM.
NXa /N
X





nnz(X) Non-zero entry number in matrix or vector X.
Figure 4.4: Notation table for performance model.
1© Effective GFLOPS = 2 ·m · n · k/T · 10−9
2© T = Ta + Tm
3© Ta = N
×
a · T×a +NA+a · TA+a +NB+a · TB+a
+NC+a · TC+a
4© Tm = N
A×
m · TA×m +NB×m · TB×m +NC×m · TC×m
+NA+m · TA+m +NB+m · TB+m +NC+m · TC+m
Figure 4.5: The equations for computing the execution time T and Effective
GFLOPS in our performance model.
Arithmetic operations
Ta is decomposed into submatrix multiplications (T
×
a ) and submatrix




a ) ( 3© in Figure 4.5). TX+a has a coefficient 2 be-
cause under the hood the matrix additions are cast into FMA operations. The




type τ gemm L-level







































































Figure 4.6: The various components of arithmetic and memory operations for
BLAS gemm and various implementations of FMM algorithms. The time
shown in the first column for gemm and L-level FMM algorithms can be
computed separately by multiplying the parameter in τ column with the arith-
metic/memory operation number in the corresponding entries.
= nnz(
⊗









U):,r)− 1) = nnz(
⊗
U)−RL subma-
trix additions. Note that X:,r denotes the rth column of X.
Memory operations
Tm is a function of the submatrix sizes {m/M̃L, k/K̃L, n/ÑL}, and






































N B̃×m - - - -
NC×m 1 nnz(
⊗
W ) RL RL
NA+m - - - nnz(
⊗
U)+RL
NB+m - - - nnz(
⊗
V )+RL





Figure 4.7: The coefficient NXa /N
X
m mapping table for computing Ta/Tm in
the performance model. Here M̃L =
∏L−1
l=0 m̃l, K̃L =
∏L−1













l=0 Wl, RL =
∏L−1
l=0 Rl.
ory operation can repeat multiple times according to the loop in which they
reside. Tm is broken down into several components, as shown in 4© in Fig-
ure 4.5. Each memory operation term is characterized in Figure 4.6 by its
read/write type and the amount of memory in units of 64-bit double precision
elements. Note that T Ã×m ,T
B̃×
m are omitted in 4© because of the assumption








eτb has an additional parameter λ ∈ [0.5, 1], which
denotes the prefetching efficiency. This λ is adapted to match gemm perfor-
63
mance. Note that this is a ceiling function proportional to k, because rank-k
updates for accumulating submatrices of C recur dk/K̃L
kc
e times in 4th loop in
Figure 2.1 and Figure 3.2. The corresponding coefficients NXm are tabulated
in Figure 4.7. For example, for Naive FMM and AB FMM, computing
Cp+= (
⊗
W )p,rMr(p = 0, ...) in Equation (4.4) involves 2 read and 1 write




We can make estimations about the run time performance of the various
FMM implementations generated by our code generator, based on the analysis
shown in Figures 4.5, 4.6 and 4.7. We use effective GFLOPS (defined in 1© in
Figure 4.5) as the metric to compare the performance of these various FMM
implementations. The architecture-dependent parameters for the model are
given in Section 4.3.1. We demonstrate the performance of two representative
groups of experiments in Figures 4.8 and 4.9.
• Contrary to what was observed in Chapter 3, Naive FMM may perform
better than ABC FMM and AB FMM for relatively large problem
sizes. For example, in Figure 4.8, 〈3, 6, 3〉 (with the maximum theoretical
speedup among all FMM algorithms we test, given in Figure 4.2) has bet-
ter Naive FMM performance than ABC FMM and AB FMM. This
is because the total number of times for packing in 〈3, 6, 3〉 is very large
(NA×m = nnz(
⊗
U), NB×m = nnz(
⊗
V )). This magnifies the overhead for
packing with AB FMM and ABC FMM.
64



















m = n = 14400, 1 level, ABC, 1 core, Actual








m = n = 14400, 1 level, ABC, 1 core, Modeled



















m = n = 14400, 1 level, AB, 1 core, Actual
〈2, 2, 2〉 〈2, 3, 2〉 〈2, 3, 4〉 〈2, 4, 3〉
〈2, 5, 2〉 〈3, 2, 2〉 〈3, 2, 3〉 〈3, 2, 4〉
〈3, 3, 2〉 〈3, 3, 3〉 〈3, 3, 6〉 〈3, 4, 2〉
〈3, 4, 3〉 〈3, 5, 3〉 〈3, 6, 3〉 〈4, 2, 2〉
〈4, 2, 3〉 〈4, 2, 4〉 〈4, 3, 2〉 〈4, 3, 3〉
〈4, 4, 2〉 〈5, 2, 2〉 〈6, 3, 3〉 BLIS
MKL








m = n = 14400, 1 level, AB, 1 core, Modeled
〈2, 2, 2〉 〈2, 3, 2〉 〈2, 3, 4〉 〈2, 4, 3〉
〈2, 5, 2〉 〈3, 2, 2〉 〈3, 2, 3〉 〈3, 2, 4〉
〈3, 3, 2〉 〈3, 3, 3〉 〈3, 3, 6〉 〈3, 4, 2〉
〈3, 4, 3〉 〈3, 5, 3〉 〈3, 6, 3〉 〈4, 2, 2〉
〈4, 2, 3〉 〈4, 2, 4〉 〈4, 3, 2〉 〈4, 3, 3〉
〈4, 4, 2〉 〈5, 2, 2〉 〈6, 3, 3〉 GEMM



















m = n = 14400, 1 level, Naive, 1 core, Actual








m = n = 14400, 1 level, Naive, 1 core, Modeled
Figure 4.8: Performance of generated one-level ABC, AB, Naive FMM imple-
mentations on single core when m=n=14400, k varies. Left column: actual
performance; Right column: modeled performance. Top row: one-level, ABC;
Middle row: one-level, AB; Bottom row: one-level, Naive.
65



















m = k = n, 2 level, ABC, 1 core, Actual







m = k = n
m = k = n, 2 level, ABC, 1 core, Modeled



















m = n = 14400, 2 level, ABC, 1 core, Actual








m = n = 14400, 2 level, ABC, 1 core, Modeled



















k = 1024, 2 level, ABC, 1 core, Actual








k = 1024, 2 level, ABC, 1 core, Modeled
Figure 4.9: Performance of generated two-level ABC FMM implementations
on single core when m=k=n; m=n=14400, k varies; k=1024, m=n vary. Left
column: actual performance; Right column: modeled performance. Top row:
m=k=n; Middle row: m=n=14400, k varies; Bottom row: k=1024, m=n vary.
66
• Contrary to what was observed in [9], for rank-k updates (middle column,
right column, Figure 4.9), 〈2, 2, 2〉 still performs the best with the ABC
FMM implementations ([9] observe some other shapes, e.g., 〈4, 2, 4〉,
tend to have higher performance). This is because their implementations
are similar to Naive FMM, with the overhead for forming the Mr
matrices explicitly.
• Figure 4.8 shows that for small problem size, when k is small, ABC
FMM performs best; when k is large, AB FMM and Naive FMM
perform better. That can be quantitatively explained by comparing the
coefficients of NXm in Figure 3.7.
• The graph for m = n = 14400, k varies, ABC, 1 core (left column, Fig-
ure 4.8; middle row, Figure 4.9) shows that for k equal to the appropriate
multiple of kC ( k =
∏L−1
l=0 k̃l × kC), ABC FMM achieves the best per-
formance.
4.2.4 Incorporating the performance model into the code generator
For actual performance, even the best implementation has some unex-
pected drops, due to the “fringes” which are caused by the problem sizes not
being divisible by partition dimesions m̃, k̃, ñ. This is not captured by our
performance model. Therefore, given the specific problem size and shape, we
choose the best two implementations predicted by our performance model as
the top two candidate implementations, and then measure the performance in
practice to pick the best one.
67



















m = k = n
BLIS MKL
〈2, 2, 2〉, 2 level, ABC 〈2, 2, 2〉, 2 level, AB
〈2, 2, 2〉, 1 level, ABC 〈2, 2, 2〉, 1 level, AB
Best FMM Selected FMM



















m = n = 14400
BLIS MKL
〈2, 2, 2〉, 2 level, ABC 〈2, 2, 2〉, 2 level, AB
〈2, 2, 2〉, 1 level, ABC 〈2, 2, 2〉, 1 level, AB
Best FMM Selected FMM





















〈2, 2, 2〉, 2 level, ABC 〈2, 2, 2〉, 2 level, AB
〈2, 2, 2〉, 1 level, ABC 〈2, 2, 2〉, 1 level, AB
Best FMM Selected FMM
Figure 4.10: Selecting FMM implementations with the performance model.
68
In Figure 4.10 we show the performance results on a single core by se-
lecting the generated FMM implementation with the guide of the performance
model, when m=k=n; m=n=14400, k varies; and k=1024, m=n vary.
Overall, this experiment shows that the performance model is accurate
enough in terms of relative performance between various FMM implementa-
tions to guide the choice of a FMM implementation, with the problem sizes and
shapes as the inputs. That will reduce the potential overhead of an exhaustive
empirical search.
4.3 Performance experiments
We present performance evaluations for various generated FMM imple-
mentations.
4.3.1 Implementation and architecture information
The FMM implementations generated by our code generator are written
in C, utilizing SSE2 and AVX assembly, compiled with the Intel C compiler
version 15.0.3 with optimization flag -O3 -mavx.
We compare against our generated dgemm (based on the packing rou-
tines and micro-kernel borrowed from BLIS, marked as BLIS in the perfor-
mance figures) as well as Intel MKL’s dgemm [61] (marked as MKL in the
performance figures).
We measure performance on a dual-socket Intel Xeon E5-2680 v2 (Ivy
69
Bridge, 10 cores/socket) processor with 12.8 GB/core of memory (Peak Band-
width: 59.7 GB/s with four channels) and a three-level cache: 32 KB L1 data
cache, 256 KB L2 cache and 25.6 MB L3 cache. The stable CPU clockrate is
3.54 GHz when a single core is utilized (28.32 GFLOPS peak, marked in the
graphs) and 3.10 GHz when ten cores are in use (24.8 GLOPS/core peak). To
set thread affinity and to ensure the computation and the memory allocation
all reside on the same socket, we disable hyper-threading explicitly and use
KMP AFFINITY=compact.
The blocking parameters, nR = 4, mR = 8, kC = 256, nC = 4096
and mC = 96, are consistent with parameters used for the standard BLIS
dgemm implementation for this architecture. This makes the size of the pack-
ing buffer Ãi 192 KB and B̃p 8192 KB, which then fit the L2 cache and L3
cache, respectively.
Parallelization is implemented mirroring that described in [108], using
OpenMP directives that parallelize the third loop around the micro-kernel in
Figure 3.2.
4.3.2 Benefit of hybrid partitions
First, we demonstrate the benefit of using different FMM algorithms
for each level.
We report the performance of different combinations of one-level/two-
level 〈2, 2, 2〉, 〈2, 3, 2〉, and 〈3, 3, 3〉 in Figure 4.11, when k is fixed to 1200
and m = n vary. As suggested and illustrated in Section 4.2.3, ABC FMM
70



















k = 1200, ABC, 1 core
BLIS MKL
〈2, 2, 2〉, 1 level 〈2, 3, 2〉, 1 level
〈3, 3, 3〉, 1 level 〈2, 2, 2〉, 2 level
〈2, 3, 2〉, 2 level 〈3, 3, 3〉, 2 level
〈2, 2, 2〉+〈2, 3, 2〉 〈2, 2, 2〉+〈3, 3, 3〉



















k = 1200, ABC, 10 core
BLIS MKL
〈2, 2, 2〉, 1 level 〈2, 3, 2〉, 1 level
〈3, 3, 3〉, 1 level 〈2, 2, 2〉, 2 level
〈2, 3, 2〉, 2 level 〈3, 3, 3〉, 2 level
〈2, 2, 2〉+〈2, 3, 2〉 〈2, 2, 2〉+〈3, 3, 3〉
Figure 4.11: Benefit of hybrid partitions over other partitions.
71
performs best for rank-k updates, which is why we only show the ABC FMM
performance.
Overall the hybrid partitions 〈2, 2, 2〉 + 〈2, 3, 2〉 and 〈2, 2, 2〉 + 〈3, 3, 3〉
achieve the best performance. This is because 1200 is close to 2 × 3 × kC ,
meaning that the hybrid partitions of 2 and 3 on the k dimension are more
favorable. This is consistent with what the performance model predicts. Per-
formance benefits are less for 10 cores due to bandwidth limitations, although
performance of hybrid partitions still beats two-level homogeneous partitions.
This experiment demonstrates the benefit of hybrid partitions, facili-
tated by the Kronecker product representation.
4.3.3 Sequential and parallel performance
Results when using a single core are presented in Figure 4.2 and Fig-
ures 4.8 and 4.9. Our generated ABC FMM implementation outperforms AB
FMM and Naive FMM and reference implementations from [9] for rank-k
updates (when k is small). For very large square matrices, our generated AB
FMM and Naive FMM can achieve competitive performance with refer-
ence implementations [9] that are linked with Intel MKL. These experiments
validate our performance model.
Figure 4.12 reports performance results for ten cores within the same
socket. Memory bandwidth contention impacts the performance of various
FMM algorithms when using many cores. Nonetheless we still observe the
speedup of FMM algorithms over gemm. For smaller matrices and special
72























e) m = k = n, 10 core, Ours





m = k = n
m = k = n, 10 core, Reference























e) m = n = 14400, 10 core, Ours






m = n = 14400, 10 core, Reference
〈2, 2, 2〉 〈2, 3, 2〉 〈2, 3, 4〉 〈2, 4, 3〉
〈2, 5, 2〉 〈3, 2, 2〉 〈3, 2, 3〉 〈3, 2, 4〉
〈3, 3, 2〉 〈3, 3, 3〉 〈3, 3, 6〉 〈3, 4, 2〉
〈3, 4, 3〉 〈3, 5, 3〉 〈3, 6, 3〉 〈4, 2, 2〉
〈4, 2, 3〉 〈4, 2, 4〉 〈4, 3, 2〉 〈4, 3, 3〉
〈4, 4, 2〉 〈5, 2, 2〉 〈6, 3, 3〉 BLIS
MKL























e) k = 1024, 10 core, Ours






k = 1024, 10 core, Reference
Figure 4.12: Performance of the best implementation of our generated FMM
code and reference implementations [9] on one socket (10 core). Top row:
our implementations; Bottom row: reference implementations from [9] (linked
with Intel MKL). Left column: m=k=n; Middle column: m=n=14400, k
varies; Right column: k=1024, m=n vary.
73
shapes such as rank-k updates, our generated implementations achieve better
performance than reference implementations [9].
4.4 Summary
We have discussed a code generator framework that can automatically
implement families of Strassen-like fast matrix multiplication algorithms in a
vast design space. To explore this space, automatic generation coupled with
analytic performance analysis is a necessity. On the one hand, the prototype
code generator generates various FMM implementations by expressesing the
composition of multi-level FMM algorithms as Kronecker products. It incor-
porates the matrix summations that must be performed for FMM algorithms
into the inherent packing and micro-kernel operations inside gemm, avoiding
extra workspace requirement and reducing the overhead of memory movement.
On the other hand, it generates an accurate performance model to guide the
selection of a FMM implementation as a function of problem size and shape,
facilitating the creation of poly-algorithms that select the best algorithm for a
problem size. Compared with state-of-the-art results, we observe a significant
performance improvement for smaller matrices and special matrix multiplica-




A Practical Strassen’s Algorithm for Tensor
Contraction
Tensor contraction (TC) is an important computational kernel widely
used in numerous applications. It is a multi-dimensional generalization of
matrix multiplication (gemm). While Strassen’s algorithm for gemm is well
studied in theory and practice, extending it to accelerate TC has not been pre-
viously pursued. Thus, we believe this to be the first work to demonstrate how
one can in practice speed up tensor contraction with Strassen’s algorithm. By
adopting a block-scatter-matrix format, a novel matrix-centric tensor layout,
we can conceptually view TC as gemm for a general stride storage, with an
implicit tensor-to-matrix transformation. This insight enables us to tailor the
novel insights for the practical implementation of Strassen in Chapter 3 to
TC, avoiding explicit transpositions (permutations) and extra workspace, and
reducing the overhead of memory movement that is incurred. Performance
benefits are demonstrated with a performance model as well as in practice on
This chapter is based on the journal paper [54] with minor modifications: “Jianyu
Huang, Devin A. Matthews, and Robert A. van de Geijn. Strassen’s algorithm for tensor
contraction. SIAM Journal on Scientific Computing, 40(3):C305-C326, 2018.” I am the
main contributor in charge of problem formulation, algorithm development, performance
analysis, and experimental validations.
75
modern single core, multi-core, and distributed memory parallel architectures,
achieving up to 1.3× speedup. The resulting implementations can serve as a
drop-in replacement for various applications with significant speedup.
In this chapter, we use the the special calligraphic font for tensors
(Section 5.1.1), the upright Roman boldface letters for matrix views of tensors
or normal matrices (Section 5.1.4), the normal math italic font for the notation
(time, coefficients, etc.) in the performance model (Section 5.4). For example,
T represents a tensor, T represents the matrix view of T or a normal matrix,
and T represents the total execution time (in seconds) in our performance
model in Section 5.4.
5.1 Background on high-performance tensor contrac-
tion
The definition and notation of tensors and tensor contraction are briefly
reviewed before describing the tensor layouts that enable high-performance
tensor contraction.
5.1.1 Tensor
The concept of matrices is extended to multiple dimensions through
the use of tensors. For example, consider a 3-dimensional (3-D) tensor T of
size 4×6×3. T can be thought of as a 3-dimensional array of elements, where
each element is given by indexing: Ti,j,k ∈ R. The possible values for i, j, and
k are determined by the lengths of the dimensions as given in the tensor size,
76
i.e., 0 ≤ i < Ni = 4, 0 ≤ j < Nj = 6, and 0 ≤ k < Nk = 3.
In general, a d-dimensional tensor T ∈ RNi0 ×· · ·×RNid−1 has elements
indexed as Ti0,...,id−1 ∈ R for all (i0, . . . , id−1) ∈ Ni0 × . . . × Nid−1 , where M ×
. . . × N is a shorthand notation for the set of all tuples (i, . . . , j), 0 ≤ i <
M ∧ . . .∧ 0 ≤ j < N . The length of the dimension indexed by some symbol x
is given by Nx ∈ N. The indices may be collected in an ordered index bundle
Id = (i0, . . . , id−1), such that TId ∈ R for all Id ∈ Ni0×. . .×Nid−1 . In general we
will denote the dimension of a tensor T as dT , and the bundle length NId ∈ N
as the total length of an index bundle Id, i.e., NId =
∏
i∈Id Ni = Ni0 · . . . ·Nid−1 .
5.1.2 Tensor contraction
Tensor contraction is the generalization of matrix multiplication to
many dimensions. As an example, consider the tensor contraction illustrated in
Figure 5.1a, Ca,b,c+=
∑Nd−1
d=0 Ad,c,a · Bd,b. The summation is usually suppressed
and instead implied by the Einstein summation convention, where indices that
appear twice (once for each of A and B) on the right-hand side are summed
over. In contrast to the definition of matrix multiplication in Chapter 2, tensor
contraction may have more than one index summed over and more than one
non-summed index in each of A and B. The groups of indices that correspond
to i, j, and p in the matrix case are grouped into index bundles Im, Jn, and Pk.
For this example, the bundles are (a, c), (b), and (d), respectively. Other than
involving more indices, tensor contraction is precisely the same mathematical
operation as matrix multiplication.
77
For general tensor contractions, let A, B, and C be tensors of any
dimensionality satisfying dA + dB − dC = 2k, k ∈ N. Then, let Im, Jn,
and Pk be index bundles with m = dA − k and n = dB − k. Last, let the
index reordering ΠA((i0, . . . , idA−1)) = (iπA(0), . . . , iπA(dA−1)) be defined by the
bijective map πA : {0, . . . , dA−1} → {0, . . . , dA−1}, and similarly for ΠB and





where juxtaposition of two index bundles (e.g., ImJn) denotes concatenation.
The indices in the bundles Im and Jn are generally called free, external, or un-
contracted indices, while the indices in the Pk bundle are called bound, internal,
or contracted indices.1 In the following we will suppress the explicit summation
over Pk. The number of leading-order floating point operations required for
tensor contraction is 2NIm ·NJn ·NPk = 2(
∏
i∈Im Ni) · (
∏
j∈Jn Nj) · (
∏
p∈Pk Np).
Thus, if the length of each dimension is O(N), the tensor contraction operation
requires O(Nm+n+k) flops.
The example illustrated in Figure 5.1a has index bundles as noted above
and index reordering given by ΠA((i0, i1, i2)) = (i2, i1, i0), ΠB((i0, i1)) = (i0, i1),
and ΠC((i0, i1, i2)) = (i0, i2, i1). Note that, for example, defining Im as (c, a)
would give different index reorderings—the choice of ordering withing the index
1The preferred terms depend on context and specific field of research. In some cases,
these terms have specific meaning beyond the indication of how summation is performed;
for example in quantum chemistry the terms internal and external refer to the diagrammatic
representation of tensor contractions [24].
78
bundles and the index reorderings is not unique. The number of floating point
operations and memory accesses for this contraction is identical to that for a
matrix multiplication of (Na ·Nc)×Nd, Nd×Nb, and (Na ·Nc)×Nb matrices,
if performed entirely in-place (i.e., without transposition).
5.1.3 General stride layouts
The well-known column-major and row-major matrix layouts may be
extended to tensors as the generalized column- and row-major tensor layouts,
where elements are stored contiguously along the first dimension or last dimen-
sion, respectively. However, in general we may assume only a general tensor
layout, which extends the general matrix layout [124] by replacing matrix row
and column strides (e.g., rsM and csM) with a stride associated to each tensor
dimension. For a d-dimensional tensor T indexed by Id, the strides sT ;ik ∈ N
for all 0 ≤ k < d form the set ST = (sT ;i0 , . . . , sT ;id−1), which gives General
Stride element LOCations relative to T0,...,0,
LOCGS(TId , ST ) =
d−1∑
k=0
ik · sT ;ik .
In general, the stride of the dimension indexed in T by a particular symbol x
is denoted by sT ;x. The generalized column-major and row-major layouts can
also be represented using a general stride layout, in which case sT ;ik =
∏k−1
l=0 Nil
and sT ;ik =
∏d−1
l=k+1 Nil , respectively.
In Figure 5.1a, C is stored in the generalized column-major layout. The
numbers represent the location of the element Ca,b,c relative to the element
79
56	 57	 58	 59	 60	 61	 62	 63	
40	 41	 42	 43	 44	 45	 46	 47	
24	 25	 26	 27	 28	 29	 30	 31	
8	 9	 10	 11	 12	 13	 14	 15	
48	 49	 50	 51	 52	 53	 54	 55	
32	 33	 34	 35	 36	 37	 38	 39	
16	 17	 18	 19	 20	 21	 22	 23	
0	 1	 2	 3	 4	 5	 6	 7	
0	 8	 16	 24	 32	 40	 48	 56	
1	 9	 17	 25	 33	 41	 49	 57	
2	 10	 18	 26	 34	 42	 50	 58	
3	 11	 19	 27	 35	 43	 51	 59	
4	 12	 20	 28	 36	 44	 52	 60	
5	 13	 21	 29	 37	 45	 53	 61	
6	 14	 22	 30	 38	 46	 54	 62	
7	 15	 23	 31	 39	 47	 55	 63	
35	 39	 43	 47	 51	 55	 59	 63	
34	 38	 42	 46	 50	 54	 58	 62	
33	 37	 41	 45	 49	 53	 57	 61	
32	 36	 40	 44	 48	 52	 56	 60	
3	 7	 11	 15	 19	 23	 27	 31	
2	 6	 10	 14	 18	 22	 26	 30	
1	 5	 9	 13	 17	 21	 25	 29	









C                      +=                   A                 ×                  B 
(a) Tensor contraction Ca,b,c+ = Ad,c,a · Bd,b with Na = 4, Nb = Nd = 8, and
Nc = 2. The relative location of each data element in memory is given assuming
a generalized column-major layout. For example, C is stored in the generalized
column-major layout. The numbers represent the location of the element Ca,b,c
relative to the element C0,0,0 in the tensor storage layout. The strides are sC;a =
1, sC;b = Na = 4, and sC;c = Na · Nb = 32. The element location of Ca,b,c is
a · sC;a + b · sC;b + c · sC;c = a+ 4b+ 32c.
0	 1	 2	 3	 4	 5	 6	 7	
16	 17	 18	 19	 20	 21	 22	 23	
32	 31	 32	 33	 34	 35	 36	 37	
48	 49	 50	 51	 52	 53	 54	 55	
8	 9	 10	 11	 12	 13	 14	 15	
24	 25	 26	 27	 28	 29	 30	 31	
40	 41	 42	 43	 44	 45	 46	 47	
56	 57	 58	 59	 60	 61	 62	 63	
0	 4	 8	 12	 16	 20	 24	 28	
1	 5	 9	 13	 17	 21	 25	 29	
2	 6	 10	 14	 18	 22	 26	 30	
3	 7	 11	 15	 19	 23	 27	 31	
32	 36	 40	 44	 48	 52	 56	 60	
33	 37	 41	 45	 49	 53	 57	 61	
34	 38	 42	 46	 50	 54	 58	 62	
35	 39	 43	 47	 51	 55	 59	 63	
0	 8	 16	 24	 32	 40	 48	 56	
1	 9	 17	 25	 33	 41	 49	 57	
2	 10	 18	 26	 34	 42	 50	 58	
3	 11	 19	 27	 35	 43	 51	 59	
4	 12	 20	 28	 36	 44	 52	 60	
5	 13	 21	 29	 37	 45	 53	 61	
6	 14	 22	 30	 38	 46	 54	 62	













0	 4	 8	 12	 16	 20	 24	 28	 cscatA 	
cbsA 	












rscatA 	 rbsA 	
8	 8	


















C                      +=                        A                 ×                      B 
(b) Block scatter matrix view of (a), where Ad,c,a, Bd,b, and Ca,b,c are mapped to
matrices Ai,p, Bp,j , and Ci,j : rscatT and cscatT denote the scatter vectors; rbsT
and cbsT denote the block scatter vectors. Element locations are given by the sum
of the row and column scatter vector entries.
Figure 5.1: An example to illustrate Strassen’s algorithm for tensor contrac-
tion. The dashed lines denotes Strassen 2×2 partitions mapping from block
scatter matrix view (bottom) to the original tensor (top). In this example the
partitions are regular subtensors, but this is not required in general.
80
C0,0,0 in the tensor storage layout. The strides are sC;a = 1, sC;b = Na = 4, and
sC;c = Na ·Nb = 32. The element location of Ca,b,c is a · sC;a + b · sC;b + c · sC;c =
a+ 4b+ 32c.
5.1.4 Block scatter matrix view
In [85] it is shown that tensors can be represented in a matrix-centric
layout that allows for a simple but efficient implementation of tensor con-
traction using the BLIS framework. The main idea of that work is that the
locations of tensor elements of T can be described in a matrix format, the
Scatter Matrix layout, for an Ni ×Nj matrix view of T , T, very similarly to
the general stride matrix layout,
LOCSM(Ti,j, rscatT , cscatT ) = rscatT ;i + cscatT ;j, (5.1)
where rscatT ∈ NNi and cscatT ∈ NNj . If we define the index bundle Ip of
size p as the set of indices of T that map to columns of T and the index
bundle Jq of size q = dT − p as the set of indices that map to rows of T, then
(by inspection of the general stride layout) we can see that the scatter vector











∀ (i0, . . . , ip−1) ∈ Ni0 × . . .×Nip−1 ;
and similarly for cscatT with respect to Jq.
The relative location of Ca,b,c in Figure 5.1a, or Ci,j in the matrix view
of C in Figure 5.1b is rscatC;i+cscatC;j (e.g., LOCSM(C2,3,1) = LOCSM(C6,3) =
81
rscatC;6 +cscatC;3 = 34+12). Here, (1) rscatC;i = a ·sC;a+c ·sC;c = a+32c, i =
a+c·Na = a+4c,∀ (a, c) ∈ Na×Nc; (2) cscatC;j = b·sC;b = 4b, j = b,∀ (b) ∈ Nb.
These scatter vectors are shown on the top and the left of the matrix view of
C in Figure 5.1b.
The general definition of tensor contractions gives a natural mapping
from tensors to matrices through the index bundles Im, Jn, and Pk. Thus,
the bundle Im defines rscatA and rscatC, Jn defines cscatB and cscatC, and Pk
defines cscatA and rscatB. If we define matrices Ai,k, Bk,j, and Ci,j and imbue
them with scatter matrix layouts using the scatter vectors from the correspond-
ing tensors, we can perform tensor contraction using the high-performance
matrix multiplication algorithm introduced in Section 2.1.5, without explicitly
forming those matrices in extra working buffers and incurring the associated
cost of data movement.
Since we are using the BLIS implementation of the GotoBLAS al-
gorithm, we can leverage the fact that these matrices will be partitioned and
packed to introduce further optimizations. In the micro-kernel (Figures 2.1
and 3.2), the matrix C will be partitioned into mR×nR blocks and the matri-
ces A and B will be partitioned into mR×kC and kC×nR slivers, respectively.
If we further partition kC into smaller increments of a new parameter kR, on
the order of mR and nR, then we will end up with only matrix blocks of very
small size. As in [85], we can partition the scatter vectors into very small
blocks of size mR, nR, and kR as well and use optimized algorithms in the
packing kernels (i.e., packing process in Section 2.1.5) and micro-kernel when
82
the scatter values for the current block are regularly spaced (i.e., strided). The
regular strides for each {m,n, k}R-sized block of {r, c}scatT (mR for rscatA
and rscatC, nR for cscatB and cscatC, kR for cscatA and rscatB), or zero if
no regular stride exists, are collected in a row/column block scatter vector
{r, c}bsT of length d Ni{m,n,k}R e and similarly for the other row/column scatter
vectors. With these block scatter vectors, we can then utilize efficient SIMD
vector load/store instructions for the stride-one index, or vector gather/scatter
fetch instructions for the stride-n index, in a favorable memory access pattern.
In Figure 5.1b, assuming mR = nR = kR = 4, rbsC = (1, 1), and
cbsC = (4, 4), since the regular strides for each four elements of rscatC and
cscatC are 1 and 4, respectively.
5.2 Strassen’s algorithm for tensor contraction
The operations summarized in Figure 4.1 are all special cases of
M = (X + δY)(V + εW); D+= γ0M; E+= γ1M; (5.2)
for appropriately chosen γ0, γ1, δ, ε ∈ {−1, 0, 1}. Here, X and Y are submatri-
ces of A, V and W are submatrices of B, and D and E are submatrices of C.
As in Chapter 3, this scheme can be extended to multiple levels of Strassen.
Instead of partitioning the tensor A into subtensors X and Y and so
on for B and C, we partition the matrix representations A, B, and C (block
scatter matrix view of A, B, C) as in the matrix implementation of Strassen.
Figure 5.1 provides an example to illustrate the partition mechanism. Block
83
scatter matrix layouts for these submatrices may be trivially obtained by par-
titioning the scatter and block scatter vectors of the entire matrices along the
relevant dimensions. Once imbued with the appropriate layouts, these subma-
trices may then be used in the BLIS-based Strassen of Chapter 3 along with
modifications the packing kernels and micro-kernel as in [85].
In fusing these two methodologies, we need to further address the con-
sideration of multiple block scatter vectors as required when packing and ex-
ecuting the micro-kernel. Methods for dealing with this issue are described
in Section 5.3.1. The advantage of using matrix partitions (which is enabled
by the block scatter layout) instead of tensor partitions is primarily that only
the product of the lengths of each index bundle, {NIm , NJn , NPk}, must be
considered when partitioning, and not the lengths of individual tensor dimen-
sions. For example, Strassen may be applied to any tensor contraction where
at least one dimension in each bundle is even in our approach, whereas the
last dimension (or rather, the dimension with the longest stride) should be
even when using subtensors.2 Additionally, when applying techniques such as
zero-padding or dynamical peeling [59, 117] in order to address edge cases, the
overhead is magnified for subtensor-based algorithms because the padding or
peeling applies to only a single tensor dimension; in our algorithm padding or
peeling may be applied based on the length of the entire index bundle, which
is necessarily longer and therefore incurs less overhead.
2A dimension other than the last could also be chosen for partitioning, but the spatial
locality of the partitioning would be destroyed.
84
5.3 Implementations
We now detail the modifications to the block scatter matrix-based pack-
ing kernel and micro-kernel as described in [85] for Strassen. We focus on
the double precision arithmetic and data.
5.3.1 Packing
When packing submatrices for Strassen using Equation (5.2), multi-
ple scatter and block scatter vectors must be considered. In our implemen-
tation, the block scatter vector entries for the corresponding blocks in both
input submatrices (or all submatrices for L-level Strassen) are examined. If
all entries are non-zero, then the constant stride is used in packing the cur-
rent block.3 Otherwise, the scatter vectors are used when packing the current
block, even though one or more of the input submatrix blocks may in fact
have a regular stride. In future work, we plan to exploit these cases for further
performance improvements.
5.3.2 Micro-kernel
As in Chapter 3, we use assembly-coded micro-kernels that include the
update to several submatrices of C from registers. In order to use this efficient
update, all block scatter vector entries for the relevant submatrix blocks of
C must be non-zero. Unlike in the packing kernel implementation, the case
3Note that when non-zero, the block scatter vector entries for different submatrices will
always be equal.
85
where only one or more of the submatrix blocks is regular stride would be more
difficult to take advantage of, as the micro-kernel would have to be modified
to flexibly omit or redirect individual submatrix updates.
5.3.3 Variations
We implement three variations of Strassen for tensor contraction on
the theme illustrated in Figure 3.2, extending Chapter 3.
• Naive Strassen: A classical implementation with temporary buffers.
Submatrices of matrix representations of A and B (A and B) are ex-
plicitly copied and stored as regular submatrices. Intermediate subma-
trices M are explicitly stored and then accumulated into submatrices of
matrix representation of C (C). We store the M submatrices as regu-
lar, densely-stored matrices, and handle their accumulation onto block
scatter matrix layout submatrices of C. Thus, the Naive Strassen
algorithm for tensor contraction is extremely similar to a ttdt-based
Strassen algorithm (see Section 5.6), except that the tensors are not
required to be partitioned into regular subtensors.
• AB Strassen: The packing routines incorporate the summation of sub-
matrices of matrix representations of A and B with implicit tensor-to-
matrix transformation into the packing buffers (see Section 5.3.1), but
explicit temporary buffers for matrices M are used.
• ABC Strassen: AB Strassen, but with a specialized micro-kernel (see
86
Section 5.3.2) that incorporates additions of M to multiple submatrices
of matrix representation of C with implicit matrix-to-tensor transfor-
mation. Thus, the ABC Strassen algorithm for tensor contraction
requires no additional temporary buffers beyond the workspace already
incorporated in conventional gemm implementations.
5.4 Performance model
In Chapter 3, a performance model was proposed to predict the execu-
tion time T for variations of Strassen for matrices. In this section, we ex-
tend that performance model to estimate the execution time T of ABC, AB,
and Naive variations of L-level Strassen for TC and the high-performance
non-Strassen TC routine we build on (see Section 5.1; using TBLIS imple-
mentation [85, 84] introduced in Section 5.5; denoted as tblis henceforth).
Due to the high dimensionality of tensors and enormous types and combina-
tions of permutations (transpositions) in TC, it is impractical to exhaustively
search for every tensor shape and tensor problem size to find the best variation
using empirical performance timings. Performance modeling helps us to bet-
ter understand the memory footprint and computation of different Strassen
implementations for TC and at least reduce the search space to pick the right
implementation. In our new model, besides input problem size, block sizes,
and the hardware parameters such as the peak GFLOPS and bandwidth, T
also depends on the shape of the tensors and the extra permutations in the
packing routines and in the micro-kernel. In [55] we showed that a similar
87
τa Time (in seconds) of one arithmetic (floating point) operation.
τb
(Bandwidth) Amortized time (in seconds) of 8 Bytes contiguous
data movement from slow main memory to fast cache.
ρa Penalty factor for arithmetic operation effciency.
ρb Penalty factor for bandwidth.
λ Prefetching efficiency.
T Total execution time (in seconds).
Ta Time for arithmetic operations (in seconds).
Tm Time for memory operations (in seconds).












m Tm for reading (sub)tensors in packing routines (Figure 3.3).
T
C×








Tm for reading or writing (sub)tensors, related to the temporary
buffer as part of Naive Strassen and AB Strassen.
WXa /W
X





Figure 5.2: Notation table for performance model.
1© Effective GFLOPS = 2 ·NIm ·NJn ·NPk/T · 10−9
2© T = Ta + Tm
3© Ta = W×a · T×a +WA+a · TA+a +WB+a · TB+a +W C+a · T C+a
4© Tm = W
A×
m · TA×m +WB×m · TB×m +W C×m · T C×m
+WA+m · TA+m +WB+m · TB+m +W C+m · T C+m
5© τa = 1/(ρa · Peak GFLOPS)
6© τb = 8/(ρb · Bandwidth)
Figure 5.3: Equations for computing the execution time T and Effective
GFLOPS in our performance model.
model is capable of predicting the best-performing fast matrix multiplication
algorithms from a large set of candidates in most circumstances. The same
predictive power should be applicable to tensor contractions as well, over a
wide range of tensor shapes and sizes.
88
type τ tblis L-level




























































Figure 5.4: Various components of arithmetic and memory operations for tb-
lis TC and various implementations of Strassen TC. The time shown in the
first column for tblis TC and L-level Strassen can be computed separately
by multiplying the parameter in τ column with the arithmetic/memory oper-
ation number in the corresponding entries. Here NIm =
∏
i∈Im Ni = Ni0 · . . . ·
Nim−1 , NJn =
∏
j∈Jn Nj = Nj0 · . . . ·Njn−1 , NPk =
∏
p∈Pk Np = Np0 · . . . ·Npk−1 .
Assumption
Similar to Chapter 3, we assume two layers of memory hierarchy: slow
main memory and fast caches.4 For write operations, the lazy write-back policy
is enforced such that the time for writing into fast caches can be hidden. For
4The latency from multiple levels of cache for modern processors is hidden by hardware
prefetching. Two layers of memory are good enough for modeling performance of regular




ABC AB Naive ABC AB Naive
W×a 1 7 7 7 49 49 49
WA+a - 5 5 5 95 95 95
WB+a - 5 5 5 95 95 95
W C+a - 12 12 12 144 144 144
WA×m 1 12 12 7 194 194 49
WB×m 1 12 12 7 194 194 49
W C×m 1 12 7 7 144 49 49
WA+m - - - 19 - - 293
WB+m - - - 19 - - 293
W C+m - - 36 36 - 432 432
Figure 5.5: Coefficient WXa /W
X






read operations, the latency for accessing the slow main memory is counted,
while the latency for accessing caches can be ignored.5
Notations
We summarize our notation in Figure 5.2. The total execution time,
T , can be decomposed into a sum of arithmetic time Ta and memory time Tm
( 2© in Figure 5.3).




As shown in 3©, Ta includes (sub)tensor contraction (T×a ) and (sub)tensor




a ). The corresponding coefficients W
X
a
for tblis TC and L-level various Strassen TC are enumerated in Figure 5.5.
For example, one-level Strassen TC has coefficients W×a = 7, W
A+
a = 5,
WB+a = 5, and W
C+
a = 12, because it involves 7 submatrix multiplications,
5 additions with subtensors of A, 5 additions with subtensors of B, and 12
additions with subtensors of C. Note that TXa is calculated by multiplying the
unit time τa with the arithmetic operation number in Figure 5.4. We com-
pute τa through 5©. The penalty factor ρa ∈ (0, 1] is introduced, due to the
extra computations involved in {r, c}scatT and {r, c}bsT , and the slow micro-
kernel invocation when the corresponding entries in rbsC or cbsC are 0 (see
Section 5.3.2; non-regular stride access).
Memory operations
Based on the above assumptions, Tm can be broken down into three
parts ( 4© in Figure 5.3):
• updating the temporary buffer that are parts of Naive Strassen and AB
Strassen (W T+m · T T+m );
• memory packing shown in Figures 2.1 and 3.2 (WA×m · TA×m , WB×m · TB×m );
• updating the submatrices of C shown in Figures 2.1 and 3.2 (W C×m · T C×m ).
91
The coefficients WXm are tabulated in Figure 5.5. T
X
m is a function of block sizes
{mC , kC , nC} in Figures 2.1 and 3.2, and the bundle lengths { NIm/2L, NJn/2L,
NPk/2
L } because the memory operation can repeat multiple times according
to which loop they reside in. Figure 5.4 characterizes each memory operation
term by its read/write type and the amount of memory in units of 64-bit double
precision elements. In order to get TXm , the memory operation number needs to
be multiplied by the bandwidth τb. We compute τb through 6©. We penalize the
effect of permutations without stride-one index accesss6 by setting ρb = 0.7.
A similar parameter is introduced in [112] for regular non-Strassen TC.








an extra parameter λ ∈ (0.5, 1], which denotes the prefetching efficiency. T C×m
is a ceiling function proportional to NPk , since rank-k updates for accumulating
submatrices of C recur dNPk/K̃L
kc
e times in fourth loop (Figures 2.1 and 3.2).
Discussion
We can estimate the run time performance of various implementations,
based on the performance model presented in Figures 5.3, 5.4 and 5.5. Here
we define Effective GFLOPS ( 1© in Figure 5.3) for TC as the metric to com-
pare the performance of various Strassen TC and tblis TC. The theoretical
peak GFLOPS and bandwidth information are given in Section 5.5. In Fig-
ure 5.6, we demonstrate the modeled and actual performance for a wide range
6See Section 5.3.1; the corresponding entries in neither rbsT or cbsT are 1, i.e., using
scatter/gather operation, or indirect memory addressing with (5.1).
92








1 2 3 4 5 6·103
1-level ABC
1 2 3 4 5 6·103
1-level AB
1 2 3 4 5 6·103
1-level Naive
1 2 3 4 5 6·103
2-level ABC
1 2 3 4 5 6·103
2-level AB
1 2 3 4 5 6·103
2-level Naive
(a) m=k=n; x-axis for all figures in this row is ÑIm = ÑPk = ÑJn .







1 2 3 4 5 6·103 1 2 3 4 5 6·103 1 2 3 4 5 6·103 1 2 3 4 5 6·103 1 2 3 4 5 6·103 1 2 3 4 5 6·103
(b) m=n=14400, k varies; x-axis for all figures in this row is ÑPk .







1 2 3 4 5 6·103 1 2 3 4 5 6·103 1 2 3 4 5 6·103 1 2 3 4 5 6·103 1 2 3 4 5 6·103 1 2 3 4 5 6·103
(c) k=1024, m=n vary; x-axis for all figures in this row is ÑIm = ÑJn .
Figure 5.6: Modeled performance (solid line) and actual performance (dots) of various implementations
for synthetic data on single core. y-axis for all figures is Effective GFLOPS (2 ·NIm ·NJn ·NPk/time).
93






2 4 6 8 10 12·103
1-level ABC
2 4 6 8 10 12·103
1-level AB
2 4 6 8 10 12·103
1-level Naive
2 4 6 8 10 12·103
2-level ABC
2 4 6 8 10 12·103
2-level AB
2 4 6 8 10 12·103
2-level Naive
(a) m=k=n; x-axis for all figures in this row is ÑIm = ÑPk = ÑJn .





2 4 6 8 10 12·103 2 4 6 8 10 12·103 2 4 6 8 10 12·103 2 4 6 8 10 12·103 2 4 6 8 10 12·103 2 4 6 8 10 12·103
(b) m=n=14400, k varies; x-axis for all figures in this row is ÑPk .





2 4 6 8 10 12·103 2 4 6 8 10 12·103 2 4 6 8 10 12·103 2 4 6 8 10 12·103 2 4 6 8 10 12·103 2 4 6 8 10 12·103
(c) k=1024, m=n vary; x-axis for all figures in this row is ÑIm = ÑJn .
Figure 5.7: Actual performance (dots) of various implementations for synthetic data on one socket. y-axis
for all figures is Effective GFLOPS (2 ·NIm ·NJn ·NPk/time).
94
TC shapes
NRMSE ( % )
tblis
1-level 2-level
ABC AB Naive ABC AB Naive
m=k=n 5.26 4.27 3.23 3.49 7.82 5.64 8.65
m=n=14400, k varies 4.88 3.95 5.31 4.81 7.17 5.68 4.57
k=1024, m=n vary 4.55 4.64 5.39 5.23 9.08 7.65 7.26
Figure 5.8: Normalized root-mean-square error (NRMSE) between the actual
and modeled performance for synthetic data on single core. NRMSE is de-
fined as the root of mean square error normalized by the mean value of the
measurements, which shows the model prediction accuracy.
of synthetic tensor sizes and shapes: m=k=n; m=n=14400, k varies; k=1024,
m=n vary. How we generate synthetic data is detailed in Section 5.5.1. In
Figure 5.8, we quantitatively show the model prediction accuracy.
• The model can predict the relative performance for various implementations
within 10% error bound.
• For m=k=n (Figure 5.6a), the ABC Strassen implementations outperform
tblis, when NIm , NJn , NPk are as small as 2kC , nearly 500, while Naive
Strassen cannot beat tblis until the problem size is larger than 2000.
• The “m=n=14400, k varies” graphs (Figure 5.6b) shows that when NPk
is small, ABC Strassen performs best; when NPk is large, AB Strassen
performs better. The coefficients WXm in Figure 5.5 help to illustrate the rea-
sons quantitatively. Two-level AB Strassen can achieve over 30% speedup
compared with TBLIS.
95
• According to the model, when NPk is equal to appropriate multiple of kC
(NPk = 2
L · kC for L-level), ABC Strassen achieves the best performance.
We will leverage this observation in our distributed memory experiment.
5.5 Performance experiments
We perform our experimental evaluations for synthetic data and real-
world benchmarks on a single node and on a distributed memory architecture.
The implementations are written in C++, utilizing AVX assembly, based on the
open source TBLIS framework [84]. We compare against TBLIS’s tensor con-
traction routine (marked as tblis) as well as the TTT routine from the MAT-
LAB Tensor Toolbox [4] (linked with Intel MKL [61], marked as ttt) for single
node and the tensor contraction routine from the Cyclops Tensor Framework
[110] (also linked with Intel MKL, marked with ctf) for distributed memory.
We measure the CPU performance results on the Maverick system at
the Texas Advanced Computing Center (TACC). Each node of that system
consists of a dual-socket (10 cores/socket) Intel Xeon E5-2680 v2 (Ivy Bridge)
processor with 256 GB memory (peak bandwidth: 59.7 GB/s with four chan-
nels) and a three-level cache (32 KB L1 data; 256 KB L2; 25.6 MB L3). The
stable CPU clockrate is 3.54 GHz when a single core is utilized (28.32 GFLOPS
peak, marked in the graphs) and 3.10 GHz when all 10 cores are in use (24.8
GFLOPS/core peak). We disable hyper-threading explicitly and set thread
affinity with KMP AFFINITY=compact which also ensures the computation and
the memory allocation all reside on the same socket.
96
The cache blocking parameters, mC = 96, nC = 4096, kC = 256, and
the register block sizes, mR = 8, nR = 4, are consistent with parameters used
for the standard BLIS dgemm implementation for this architecture. We use
the default value of kR = 4 as defined in TBLIS. This makes the size of the
packing buffer Ãi 192 KB and B̃p 8192 KB, which then fits the L2 cache
and the L3 cache, respectively. Parallelization is implemented mirroring that
described in [108], but with the number of threads assigned to each of the loops
in Figures 2.1 and 3.2 automatically determined by the TBLIS framework.
5.5.1 Single node experiments
We report the experimental results on single node for the synthetic
tensor contraction, real-world benchmark, and shape dependence.
Synthetic tensor contractions
To evaluate the overall performance of various Strassen TC compar-
ing against tblis TC for different tensor problem sizes, shapes, and permuta-
tions, we randomly generate TC test cases with 2-D to 6-D randomly permuted
tensors as operands and test all these implementations for each synthetic test
case, as shown in Figures 5.6 and 5.7. We choose step size 256 to sample
uniformly {NIm , NJn , NPk} for various tensor bundle lengths: square: m=k=n;
rank-NPk : m=n=14400, k varies; fixed-NPk : k=1024, m=n vary. For each
bundle length {NIm , NJn , NPk}, we randomly generate three {Im, Jn, Pk} 1-D,
2-D, or 3-D bundles, such that the product of each index length is close to
97
{NIm , NJn , NPk}. The order of {Im, Jn, Pk} is then randomly permuted.
The generated bundle lengths may not exactly match the original sam-
pled bundle lengths. When we plot the actual performance of these synthetic
test cases, we set effective bundle lengths ÑIm = ÑJn = ÑPk = (NIm · NJn ·
NPk)
1/3 for the square bundle lengths; ÑPk = NIm ·NJn ·NPk/(16000 · 16000)
for rank-NPk bundle lengths; and ÑIm = ÑJn = (NIm · NJn · NPk/1024)1/2 for
fixed-NPk bundle lengths.
For the square and rank-NPk tensor shapes on one core, tblis is rapidly
outpaced by ABC Strassen, with a crossover point of about 500 ≈ 2 · kC .
ABC Strassen is then shortly overtaken by AB Strassen and then by
two-level AB Strassen. As predicted by the performance model, the AB
Strassen implementation is best for very large problem sizes due to repeated
updates to C in the ABC Strassen algorithm. The Naive Strassen imple-
mentations are never the best in these experiments, although they may become
more efficient than AB Strassen for extremely large, square problems. These
trends are repeated in the 10-core experiments, although the crossover points
are moved to larger tensor sizes.
For the fixed-NPk shapes, total performance is lower for AB Strassen
and Naive Strassen with scalability for the algorithms being especially im-
pacted by the relatively smaller NIm and NJn sizes. For these shapes ABC
Strassen is always the fastest method above the crossover point with standard
tblis.
98
The actual performance data matches the predicted performance very
well, with some variation due to the randomization of the tensor lengths and
permutations. Using these performance models, it may be possible to ana-
lytically decide on which algorithm to apply for a given tensor contraction to
achieve the highest performance, allowing an automated and seamless inclu-
sion of Strassen into a TBLIS-like tensor framework.
Real-world benchmark
In Figure 5.9, we measure the performance of various implementations
for a subset of tensor contractions from the Tensor Contraction Benchmark
[111] on a single core and one socket. We present representative use cases
where NPk is nearly equal to or larger than 2kC (512), for which Strassen
can show performance benefits, as illustrated in Section 5.4. The right three
test cases represent various regularly blocked tensor contractions from coupled
cluster with single and double excitations (CCSD) [103, 47, 102], a workhorse
quantum chemistry computational method. The fourth case from the right
illustrates the performance of tblis and Strassen TC for a pure matrix
case. Comparing this case and the CCSD contractions highlights some of the
performance issues that exist in the current implementation of the packing
and matrix-to-block scatter matrix copy kernels (see Section 5.3.1 for details).
On one core, all Strassen implementations improve on tblis for these right
four cases, and in parallel one-level Strassen implementations give a speedup
































TBLIS 1-ABC 1-AB 1-Naive






























TBLIS 1-ABC 1-AB 1-Naive
TTT 2-ABC 2-AB 2-Naive
Figure 5.9: Performance for representative user cases of benchmark from [112].
TC is identified by the index string, with the tensor index bundle of each tensor
in the order C-A-B, e.g., Cabcd+= AaebfBdfce is denoted as abcd-aebf -dfce. Top:
performance on single core. Bottom: performance on one socket.
100
The gap between tblis and ttt for these contractions is due to ttt’s use
of Intel’s MKL library, which is more highly optimized than the BLIS/TBLIS
framework.
The left two benchmarks are again quantum chemistry applications us-
ing 3-D tensors that arise in density-fitting (DF) calculations [127, 34]. These
contractions are also structurally equivalent to certain contractions from the
coupled cluster with perturbative triples (CCSD(T)) method [96], where the
occupied (see Section 5.5.2) indices have been sliced. These cases show the im-
provement of tblis over ttt as noted in [85] but do not show a speedup from
Strassen except for one-level ABC Strassen on one core. Our Strassen
implementation performs the submatrix multiplications sequentially, with only
parallelization of each submatrix multiplication step. A more comprehensive
parallelization scheme, for example, using task-based parallelism [9], may show
better performance. Additionally, since the DF/CCSD(T) contractions are
highly “non-square,” an alternate fast matrix multiplication algorithm [9, 55]
may perform better.
Shape-dependence experiments
The performance of the “particle-particle ladder” tensor contraction
from CCSD, Zabij+= Wabef · Tefij, is reported for a range of tensor shapes
in Figure 5.10. In these experiments, the length of the virtual dimensions
{a, b, e, f} is varied with respect to the length of the occupied dimensions {i, j}
such that the total number of FLOPs is roughly similarly to a 16000× 16000
101





















































Figure 5.10: Performance for the contraction Zabij+=Wabef ·Tefij with varying
Na : Ni ratio. Left: performance on single core. Right: performance on one
socket.
102
matrix multiplication, and the ratio Na : Ni is used as a proxy for tensor shape.
A ratio of 1:1 would reflect an extremely poor quality of basis set for the overall
calculation but is common when the calculation employs regular blocking. The
other end of the scale, with a ratio of ∼ 5 : 1, would then correspond to
uneven blocking. This type of blocking allows for better load balancing and
lower overhead when Na and Ni are very unequal in the overall calculation.
The performance of tblis and all of the one-level Strassen algorithms
shows essentially no performance degradation across the entire range tested.
The two-level Strassen algorithms show some performance degradation at
larger ratios but still show improvement over tblis. Eventually, all Strassen
algorithms will cross over and perform worse than tblis, as evidenced by the
left two contractions in Figure 5.9 (these correspond to a ratio of about 22).
However, the good performance of Strassen out to reasonably large ratios
shows that it could be beneficial in both regular blocking and uneven blocking
scenarios.
5.5.2 Distributed memory experiments
We demonstrate how to use the Strassen TC implementations to
accelerate a distributed memory implementation of 4-D tensor contraction
that exemplifies the two-particle “ring” terms from CCSD. In our tests we set
the length of virtual indices {a, b, e} to 10× that of occupied indices {i, j,m},
which is a ratio commonly encountered in quantum chemistry calculations
using popular basis sets such as 6-311++G** [69] and cc-pVTZ [35]. The
103
problem sizes tested here correspond to calculations on systems with 80, 112,
160, 192, and 224 electrons (i.e., Ni = Nj = Nm ∈ {40, 56, 80, 96, 112} and
Na = Nb = Ne = 10 · Ni). We use the contraction Zabij := WbmejTaeim as a
demonstration example to show the performance benefit.
We implement a SUMMA-like [120] algorithm for 4-D tensor contrac-
tion with MPI, similar to the distributed memory implementation in Sec-
tion 3.3.3. Initially the tensors W , T , and Z are distributed to a P ×P mesh
of MPI processes using a 2-D block distribution over the a, b, and e dimen-
sions, with the i, j, and m dimensions stored locally (i.e., not distributed).
After slicing W and T along the e dimension, the contraction is broken down
into a sequence of K contractions of tensor slice pairs,
Z :=
(




such that the e index length for each tensor slice pair {We;p, Te;p} is N ′e =
Ne/K. For each tensor slice pair, 0 ≤ p < K, We;p is broadcast within rows
of the mesh, and Te;p is broadcast within columns of the mesh. Then a local
tensor contration for the received tensor slice pair is performed to update the
local block. Here tblis TC and various Strassen TC are used as a drop-in
replacement for this local tensor contraction.
We perform the distributed memory experiment on the same machine
as the single node experiment. The dual-socket processor has 10 cores on each
socket. We run one MPI process for each socket and leverage all 10 cores in a
104










































NIm(Nb ·Nj) = NPk(Ne ·Nm) = NJn(Na ·Ni) ≈ 16000 · P
on P × P MPI mesh









Figure 5.11: Weak scalability performance result of the various implementa-
tions for a 4-D tensor contraction CCSD application on distributed memory:
Zabij := WbmejTaeim. CTF: the performance of the Cyclops Tensor Frame-
work [110] (linked with Intel MKL).
105
socket with thread parallelism for all implementations. Figure 5.11 reports the
weak scalability performance result on up to 640 cores (32 nodes, 64 sockets).
In our experiments on P × P mesh of sockets (MPI processes), the
lengths of virtual indices are set to equal Na = Nb = Ne ≈ 400
√
P and
the lengths of occupied indices are set to equal Ni = Nj = Nm ≈ 40
√
P ,
which make NIm = NJn = NPk ≈ 16000 ·P . This guarantees the local memory
buffer allocated to Z,W , T is constant. Our experiments verify that the above
SUMMA-like algorithm is weakly scalable on this constant local memory setup,
regardless of which local TC implementation we use. The local e index length
N ′e is chosen close to N
′
e = 1024/Nm (i.e., N
′
e ∈ {25, 18, 12, 10, 9}) such that
the local TC computations are performed with NPk = N
′
e · Nm ≈ 4 · kC . The
tensor slice pairs in the local TC computations matches the shape when ABC
Strassen achieves the best performance. Therefore, the one-level and two-
level ABC Strassen implementations outperform all other implementations.
We also tested the Cyclops Tensor Framework (CTF) [110], which also
uses a SUMMA or nested SUMMA algorithm but with possibly different block
sizes and tensor distributions, as well as using the ttdt algorithm for local
tensor contractions. We show it here as a reference for state-of-the-art perfor-
mance.
5.6 Related work on tensor contraction
To the best of our knowledge, this work represents the first implemen-
tation of Strassen’s algorithm for tensor contraction.
106
For tensor contraction, recent work on high-performance tensor contrac-
tion [85, 112] serves as the motivation and basis for our present work, while
other research has focused on algorithms using tensor slicing [29, 94, 75, 83]
or on improving the efficiency of the so-called ttdt algorithm for tensor con-
traction [46, 45, 82, 113], where input tensors A and B are Transposed (per-
muted) and then used in a standard gemm algorithm, with the output then
being Transposed and accumulated onto the tensor C. ttdt could be used
to construct a Strassen algorithm for TC by transposing subtensors into
submatrices and vice versa and using a matrix implementation of Strassen
instead of gemm. However, we showed that this algorithm is essentially the
same as our Naive Strassen algorithm (see Section 5.3.3), which is often less
efficient than the other algorithms that we have implemented.
The gett algorithm [112] is a high-performance tensor contraction
implementation similar in many ways to the BLIS-based implementation by
Matthews [85]. As in our current implementation, formation of linear combi-
nations of input subtensors of A and B and output to multiple subtensors of
C could be fused with the internal tensor transposition and micro-kernel steps
of gett. However, the implementation would be restricted to regular subten-
sors rather than more general submatrices (see Section 5.2), which could have
possible negative performance implications (e.g., false sharing).
107
5.7 Summary
We have presented what we believe to be the first work to demonstrate
how to leverage Strassen’s algorithm for tensor contraction, and have shown
practical performance speedup on single core, multi-core, and distributed mem-
ory implementations. Using a block scatter matrix layout enables us to par-
tition the matrix view of the tensor, instead of the tensor itself, with au-
tomatic (implicit) tensor-to-matrix transformation, and the flexibility to fa-
cilitate Strassen’s 2-D matrix partition to multi-dimensional tensor spaces.
Fusing the matrix summation that must be performed for Strassen and the
transposition that must be conducted for tensor contraction with the pack-
ing and micro-kernel operations inside high-performance implementation of
gemm avoids extra workspace requirements and reduces the cost of additional
memory movement. We provided a performance model which can predict the
speedup of the resulting family of algorithms for different tensor shapes, sizes,
and permutations, with enough accuracy to reduce the search space to pick
the right implementation. We evaluated our families of implementations for
various tensor sizes and shapes on synthetic and real-world datasets, both
observing significant speedups comparing to the baseline (tblis) and naive
implementations (Naive Strassen), particularly for smaller problem sizes
(NIm , NJn , NPk ≈ 2kC , 4kC), and irregular shape (NPk is much smaller com-
paring to NIm , NJn). Together, this work demonstrates Strassen’s algorithm
can be applied for tensor contraction with practical performance benefit.
108
Chapter 6
A Practical Strassen’s Algorithm on GPUs
In the previous chapters, we explored how to develop practical imple-
mentations of Strassen, extensions to other Strassen-like FMM algorithms,
and extensions to higher-dimensional tensor contraction, all targeting CPUs.
In this chapter, we show how similar techniques extend the practical Strassen
approach to GPUs. Unlike CPUs, GPUs are designed as parallel, throughput-
oriented computing engines. We will show that the high-performance imple-
mentation of matrix multiplication on GPUs requires a somewhat different
philosophy from that on CPUs.
Several challenges must be overcome for a practical implementation of
Strassen on GPUs. First, the GPU architecture and programming model are
different from their counterparts for a CPU. In order to achieve high perfor-
mance, a practical implementation of Strassen needs to leverage the memory
hierarchy and thread hierarchy (Section 6.1.1) of a GPU. Second, a GPU has
a limited physical memory capacity. As previously discussed, conventional
Strassen implementations require some extra temporary memory for storing
intermediate submatrices, which limit the maximum problem size that can be
computed compared to gemm because of the GPU memory capacity. Third,
109
a GPU is a highly parallel, multi-threaded, many-core processor. Strassen
needs to be parallelized in multiple ways to fully utilize the computational
horsepower of GPUs. However, there is a tension between reducing the mem-
ory and exploiting more parallelism with the conventional implementation of
Strassen. Finally, the ratio between the theoretical peak performance and
peak memory bandwidth of a GPU is even higher (less favorable) than that of
a CPU. Strassen has a lower ratio of arithmetic operations to memory opera-
tions compared to gemm, which means Strassen only becomes advantageous
when the problem sizes are sufficiently large. Thus, the practical implemen-
tation of Strassen needs to reduce the extra data movement to save the
bandwidth and outperform gemm for small or moderate problem sizes.
6.1 Background on Nvidia GPUs
In this section, the GPU programming model and the newest Nvidia
GPUs are briefly reviewed before describing the high-performance implemen-
tation of gemm on such GPUs.
6.1.1 GPU programming model
The CUDA programming environment [88] assumes that the CUDA
program (kernel) is executed on physically independent devices (GPUs) as
coprocessors to the host (CPU). Figure 6.1 shows the memory and thread























The memory hierarchy on the GPU device includes three levels: global
memory, shared memory, and register files. The latency decreases while the
bandwidth increases through the memory hierarchy from global memory to
registers.
Thread hierarchy
A thread is the smallest execution unit in a CUDA program. A thread
block is a group of threads that run on the same core and shares a partition
of resources such as shared memory. Thread blocks communicate through
barrier synchronization. Multiple blocks are combined to form a grid, which
corresponds to an active CUDA kernel on the device. At runtime, a thread
block is divided into a number of warps for execution on the cores. A warp is a
set of 32 threads to execute the same instructions while operating on different
data in lockstep.
6.1.2 Nvidia Volta GPUs
Recently, Nvidia released the Tesla V100 accelerator (V100) with Volta
GV100 GPUs to boost high-performance computing (HPC), artificial intelli-
gence (AI), and graphics applications [36]. It is reported that the incoming
pre-exascale systems, such as the Summit and Sierra supercomputers, will be
equipped with Nvidia Tesla V100 GPUs [89]. Therefore, we foresee the impor-
tance of improving the performance of V100 GPUs for the HPC community.
112
The Tesla V100 GPU accelerator comprises 80 streaming multiproces-
sors (SMs). Each SM is partitioned into 4 processing blocks. Each processing
block consists of 2 Tensor Cores, 8 FP64 (double precision) cores, 16 FP32
(single precision) cores, and 16 INT32 cores, of which the last three are also
called CUDA cores. Each Tensor Core performs matrix multiplication and ac-
cumulation operations on small matrices with a size of 4× 4, achieving up to
64 floating point Fused-Multiply-Add (FMA) operations in one clock cycle. It
multiplies two FP16 (half precision) matrices of a size of 4×4 and adds to the
accumulation FP16 (half precision) or FP32 (single precision) matrices. The
tested Tesla V100 PCIe GPU accelerator has the base clock frequency 1.254
GHz and boost clock frequency 1.38 GHz. The theoretical peak performance
can reach 14.13 TFLOPS1 with single precision and 7.065 TFLOPS2 with dou-
ble precision, while Tensor Cores can deliver 113 TFLOPS3 for FP16/FP32
mixed precision. The tested Tesla V100 GPU is built using 16 GB HBM2
memory with 900 GB/s of bandwidth.
6.1.3 Matrix multiplication on GPUs
We review the high-performance implementation of matrix multiplica-
tion on Nvidia GPUs, based on Nvidia’s CUDA Templates for Linear Algebra
11 FMA/cycle × 2 FLOP/FMA × 1.38G (boost clock frequency) × 16 (# FP32 core)
× 4 (# processing block/SM) × 80 (# SM).
21 FMA/cycle × 2 FLOP/FMA × 1.38G (boost clock frequency) × 8 (# FP64 core) ×
4 (# processing block/SM) × 80 (# SM).
364 FMA/cycle × 2 FLOP/FMA × 1.38G (boost clock frequency) × 2 (# Tensor Core)
× 4 (# processing block/SM) × 80 (# SM).
113
Subroutines (CUTLASS) [67, 86], a collection of CUDA C++ templates and
abstractions to perform high-performance gemm operations. CUTLASS in-
corporates strategies for hierarchical partition and data movement similar to
cuBLAS [87], the state-of-the-art implementation of the BLAS implementa-
tion on Nvidia GPU, and can reach more than 90% of cuBLAS performance on
V100. Without loss of generality, we will focus on single precision arithmetic
henceforth.
Blocking strategies
CUTLASS organizes the computation by partitioning the operands into
blocks in the different levels of the device, thread block, warp, and thread.
Figure 6.2 illustrates the gemm implementation in CUTLASS.
• Device level: blocking for the thread blocks and shared memory.
The three operand matrices, A, B, and C, are partitioned into mS × kS,
kS × nS, and mS × nS blocks. Each thread block computes an mS × nS
block of C by accumulating the results of matrix products of an mS × kS
block of A and a kS × nS block of B. Therefore, the mS × nS block of C,
the output of the thread block, is referred as the C Accumulator. Since the
C Accumulator is updated many times, it needs to be lifted into the fastest
memory in the SM: the register file. The global memory in which the parts
of C corresponding to the C Accumulator only needs to be updated once
after the C Accumulator has accumulated the results of all matrix products






A B C 
C += A B 
+= 






















Pack B → B Tile 
Pack A → A Tile 
Update C 
A Tile B Tile 
A Fragment 
B Fragment 





Figure 6.2: Illustration of the gemm implementation in CUTLASS [67]. CUT-
LASS partitions the operand matrices into blocks in the different levels of the
device, thread block, warp, and thread. Here we show block sizes typical
for the large sgemm: mS = 128, nS = 128, kS = 8; mW = 4 × mR = 32,
nW = 8× nR = 64; nR = 8, nR = 8.
115
and B are “packed” (copied) from global memory into shared memory as
the A Tile and B Tile for data reuse, accessible by all threads in the same
thread block.
• Thread block level: blocking for the warps.
After the A Tile and B Tile are stored in shared memory, each individ-
ual warp computes a sequence of accumulated outer product by iteratively
loading an A Fragment (a column of the A Tile with height mW ) and a
B Fragment (a row of the B Tile with width nW ) from the corresponding
shared memory into register files along the k dimension and performing a
rank-1 update. Note that the C Accumulator is spatially partitioned across
all the warps within the same thread block, with each warp storing a non-
overlapping 2-D block in the register files.
• Warp level: blocking for the threads.
Each thread in a warp computes an mR×nR outer product with subvectors
of the A Fragment and subvectors of the B Fragment in a “strip-mining”
(cyclic) pattern. Each piece has a size of 4, because the largest granularity
of vector load is 128 bits (4 single precision floating point numbers), and this
helps to maximize the effective bandwidth. The total length of all pieces
for an individual thread in m dimension is mR, while the total length in n
dimension is nR. Since each warp has 32 threads (Section 6.1.1), CUTLASS
organizes the threads within the same warp in a 4× 8 or 8× 4 fashion such
that mW/mR = 4, nW/nR = 8, or mW/mR = 8, nW/nR = 4.
116
Strategy mS nS kS mR nR mW/mR nW/nR
Small 16 16 16 2 2 4 8
Medium 32 32 8 4 4 4 8
Large 64 64 8 8 8 4 8
Tall 128 32 8 8 4 8 4
Wide 32 128 8 4 8 4 8
Huge 128 128 8 8 8 4 8
Figure 6.3: CUTLASS specifies six strategies of block sizes at each level in
Figure 6.2 for different matrix shapes and problem sizes.
• Thread level: executing on the CUDA cores.
Each thread issues a sequence of independent FMA instructions to the
CUDA cores and accumulates an mR × nR outer product.
Choices of block sizes
The block sizes at each level {mS, nS, kS,mR, nR,mW , nW} in Figure 6.2
are constrained by hardware parameters such as the register size on a SM, the
maximum thread number for a thread block, the shared memory size for a SM,
and the maximum register number per thread. To optimize the performance
of gemm on such a GPU, these block sizes also need to be chosen to match
the maximum bandwidth of the global memory and shared memory, and the
theoretical peak performance of the device. Towards this goal, CUTLASS
customizes six different strategies of block sizes for different matrix shapes
and problem sizes, as shown in Figure 6.3. We report the performance of
these different strategies for square matrices on V100 in Figure 6.4.
117



























Figure 6.4: CUTLASS performance with different strategies of block sizes.
118
Algorithm 1 gemm on GPUs with software pipelining.
//Register: fragA[2][mR], fragB [2][nR], nextA[mR], nextB [nR]
//Register: accumC [mR × nR]
//Shared memory: tileA[kS ×mS ], tileB [kS × nS ]
load one mS × kS of A into tileA[kS ][mS ]
load one kS × nS of B into tileB [kS ][nS ]
syncthreads()
load subvectors of first column in tileA into fragA[0][mR]
load subvectors of first row in tileB into fragB [0][nR]
for block k = 0 : kS : k do
prefetch one subcolumn of next mS × kS block of A into nextA[mR]
prefetch one subrow of next kS × nS block of B into nextB [nR]
for warp k = 0 : 1 : kS do
prefetch subvectors of next column in tileA into fragA[(warp k + 1)%2][mR]
prefetch subvectors of next row in tileB into fragB [(warp k + 1)%2][nR]
accumC [mR][nR] += fragA[warp k%2][mR]fragB [warp k%2][nR]
end for
store nextA[mR] into tileA[kS ][mS ]
store nextB [nR] into tileB [kS ][nS ]
syncthreads()
end for
update mS × nS block of C with accumC [mR][nR]
Software prefetching
As shown in Algorithm 1, to keep the SM busy, CUTLASS uses global
and local software prefetching to hide the data movement latency. The com-
putations on the CUDA cores are overlapped with the data preloading from
the global memory and from the shared memory. A synchronization after the
data is stored to the shared memory is required to avoid the race conditions
of read between warp for the next iteration.4
4CUTLASS also provides the option of double buffering on the thread block level to
enable concurrent reading for the current iteration and writing for the next iteration. It
eliminates the synchronization but also doubles the cost of the shared memory and the
number of registers to hold the global memory fetches. On the Tesla V100 GPUs, the
option of double buffering on the thread block level is disabled.
119
M0 =(A0 + A3)(B0 +B3); C0+= M0;C3+= M0;
M1 =(A2 + A3)B0; C2+= M1;C3−= M1;
M2 =A0(B1 −B3); C1+= M2;C3+= M2;
M3 =A3(B2 −B0); C0+= M3;C2+= M3;
M4 =(A0 + A1)B3; C1+= M4;C0−= M4;
M5 =(A2 − A0)(B0 +B1); C3+= M5;
M6 =(A1 − A3)(B2 +B3); C0+= M6;
Figure 6.5: All operations for one-level Strassen. Duplicate of Figure 4.1 for
easy reference.
6.2 Strassen’s algorithm on Nvidia GPUs
Recall that the operations encountered in Figure 6.5 are all special cases
of
M = (X + δY )(V + εW ); D+= γ0M ; E+= γ1M ; (6.1)
for appropriately chosen γ0, γ1, δ, ε ∈ {−1, 0, 1}. Here, X and Y are submatri-
ces of A, V and W are submatrices of B, and D and E are submatrices of C.
As in Chapter 3, this scheme can be extended to multiple levels of Strassen.
We will modify the gemm implementation for GPUs illustrated in Fig-
ure 6.2 to accommodate the representative computation
M = (X + Y )(V +W );D+= M ;E+= M. (6.2)
As shown in Figure 6.6, the key insights are that we develop a special-
ized kernel for the representative operation (6.2) utilizing the GPU memory
hierarchy and thread hierarchy:





M += (X+Y)(V+W); D += M; E += M 
+= 
X V D 
+= 
+= 
















D X V 






E Y W 
Pack X + Y → A Tile 
Pack V + W → B Tile 
A Tile B Tile 
A Fragment 
B Fragment 





Figure 6.6: Specialized kernel that implements the representative computation
M = (X + Y )(V + W );D+ = M ;E+ = M of each row of computations
in Figure 6.5 based on Figure 6.2. X, Y are submatrices of A; V , W are
submatrices of B; D, E are submatrices of C; M is the intermediate matrix
product.
121
A Tile during the packing process (Section 6.1.3), avoiding the extra
workspace requirement, and reducing the additional memory movement
since the A Tile is reused for the temporary matrix sum, which is held
in the shared memory.
• Similarly, the summation of matrices V + W can be also incorporated
into the packed B Tile during the packing process.
• After the C Accumulator has accumulated its result of (X +Y )(V +W )
along the k dimension, it can update the appropriate parts of D and
E in the global memory once. This optimization avoids the required
workspace for temporary intermediate matrices Mi and reduces the ad-
ditional memory movement since the C Accumulator is kept in the reg-
ister files: it is fetched from the global memory into the register once in
the beginning, and it is written to D and E only after its computation
completes.
6.3 Implementations
6.3.1 Exploiting more parallelism
A straightforward implementation of Strassen’s algorithm based on our
specialized kernel (Section 6.2) would invoke a sequence of kernels sequen-
tially (7 kernels for one level, 49 kernels for two levels). This approach has
already achieved the intra-kernel parallelism across the thread blocks, warps,
and threads, which is utilized in the gemm implementation on a GPU. How-
122
ever, it is further possible to improve concurrency by exploiting more inter-
kernel parallelism. A careful observation of Figure 6.5 reveals that
• the ordering of these operations can be arbitrary;
• the dependencies between the kernels for these operations only occur for
the concurrent writes to different submatrices of C.
By invoking multiple independent kernels without write dependencies to dif-
ferent parts of C, we can achieve inter-kernel parallelism, which is especially
important for small problem sizes when there is limited intra-kernel parallelism
such that each kernel cannot saturate the workload for the GPU device and
for multi-level Strassen when the partitioned block size is small. We exploit
the inter-kernel parallelism in the following ways:
Multi-kernel streaming
CUDA programs can manage the concurrency across kernels through
streams [88], each of which is a sequence of commands that execute in order.
While the instructions within the same stream must be scheduled in sequen-
tial order, the commands from different streams may run concurrently out of
order. This helps to overlap computation with communication, as well as run
multiple kernels at the same time. To ensure every command in a particular




0 M1 =(A2 + A3)B0; C2+= M1;C3−= M1; 0
M4 =(A0 + A1)B3; C1+= M4;C0−= M4; 1
M5 =(A2 − A0)(B0 +B1); C3+= M5; 0
M6 =(A1 − A3)(B2 +B3); C0+= M6; 1
1 M2 =A0(B1 −B3); C1+= M2;C3+= M2; 0
M3 =A3(B2 −B0); C0+= M3;C2+= M3; 1
2 M0 =(A0 + A3)(B0 +B3); C0+= M0;C3+= M0; 0
Figure 6.7: Reordered operations based on Figure 4.1 with multi-kernel
streaming.
Figure 6.7 illustrates the multi-kernel streaming implementation with
the reordered operations based on Figure 6.5. We organize the kernels for
these operations into three stages with two streams. Synchronization of the
streams is imposed between the different stages to guarantee the correctness
of the operations because of the concurrent write dependencies to different
parts of C. The kernels for these operations in different streams within the
same stage can be interleaved or executed concurrently. We note that [72]
also leverage multi-kernel streaming to make use of concurrency across kernels
for the bottom level Strassen. However, it requires an additional workspace
of 15 extra submatrices, while our approach doesn’t require any additional
workspace due to our specialized kernels.
Element-wise atomic write to C
Since the dependencies across the kernels only exist during the concur-
rent write to different parts of C, we can “partially” remove the dependencies
and execute all operations fully concurrently in one stage with one operation
124
per stream by replacing the post-processing step5 of submatrices of C with
element-wise atomic write operation through atomicAdd. We find that this
way will benefit small problem sizes. However, when the problem sizes be-
come larger, the performance drops because element-wise atomic operations
incur a greater overhead.
Atomic write to C with larger granularity
Each thread finally updates an mR× nR block of C, so it is possible to
switch the granularity of atomic write from 1 (element-wise) to mR×nR (block-
wise). We need to construct a device array for the purpose of a mutex/lock.
Each entry of the device array corresponds to an mR × nR block of C. Once
a thread completes the accumulation of the C Accumulator in the register, it
will first lock the corresponding mutex/lock entries of mR × nR block of D
and E (different submatrices of C) before updating D and E, and unlock the
corresponding mutex/lock entries after the updates finish. The locks can be
implemented with atomicCAS and atomicExch functions.
Reducing the multi-kernel launch overhead
With the multi-kernel streaming, we still need to launch seven kernels
for one-level Strassen. For small problem sizes, the overhead for launching
multiple kernels cannot be ignored. To further reduce the overhead for ker-
5The post-processing step involves loading each element of D and E in the global memory,
adding or subtracting corresponding element in Mi residing in the C Accumulator in the
register, and storing back to the original location of D and E in the global memory.
125
nel launching and avoid the possible inefficiency caused by the CUDA stream
implementation, it is possible to just launch a kernel for all operations shown
in Figure 6.5 with a 3-D grid of thread blocks, while the gemm and special-
ized kernels in Figures 6.2 and 6.6 only require a 2-D grid. The additional
z dimension for the grid (blockIdx.z) corresponds to different operations in
Figure 6.5. The inter-kernel parallelism for multi-kernel streaming has thus
transformed into one extra dimensional concurrency inside the kernel. We
can utilize the atomic write with the granularity of either 1 or mR × nR to
guarantee the correctness of the result.
6.3.2 Reducing memory requirement
The conventional implementations of Strassen on GPUs based on the
functions provided by cuBLAS for matrix multiplication and matrix addition
(i.e., cublasSgemm, cublasSgeam) led to the “street wisdom” that there is a
trade-off between reducing the effective available memory and exploiting more
parallelism. More temporary workspace for intermediate result is traditionally
required to eliminate the dependency and increase the concurrency. With our
approach, however, we can achieve both reduced memory and more homo-
geneous parallelism, similar to data parallelism except the dependencies for
concurrent writes to different parts of C. We reuse the shared memory to
store the temporary sum and the register file to store the temporary matrix
product Mi (Section 6.2). The different operations in Figure 6.5 can be easily
parallelized with the help of multi-kernel streaming or a kernel with a 3-D grid
126
(Section 6.3.1).
6.3.3 Handling the fringes
Traditionally, for matrices with odd dimensions, we need to handle
the remaining fringes before applying Strassen. There are some well-known
approaches such as padding (i.e., adding rows or columns with zeros to get
matrices of even dimensions) and peeling (i.e., deleting rows or columns to
obtain even dimensioned matrices) [59, 117] followed by post-processing. In
our approach, fringes can be internally handled by padding the A Tile and B
Tile with zeros, and aligning the mC × nC C Accumulator along the fringes.
This trick avoids the handling of the fringes with extra memory or com-
putations because the packing and accumulation processes always occur for
high-performance implementation of gemm on GPUs, and we reuse the same
buffers.
6.3.4 Adapting software prefetching
As illustrated in Algorithm 2, instead of prefetching one mS×kS block
of A and one kS × nS block of B, Strassen requires the preloading of two
mS×kS blocks of X and Y and two kS×nS blocks of V and W . The required
register number per thread has since doubled, but the required sizes of shared
memory and global memory remain the same.
127
Algorithm 2 M = (X + Y )(V + W );D+ = M ;E+ = M on GPUs with
software prefetching
//Register: fragA[2][mR], fragB [2][nR]
//Register: next0A[mR], next1A[mR], next0B [nR], next1B [nR]
//Register: accumC [mR × nR]
//Shared memory: tileA[kS ×mS ], tileB [kS × nS ]
load the sum of one mS × kS of X and corresponding mS × kS of Y into tileA[kS ][mS ]
load the sum of one kS × nS of V and corresponding kS × nS of W into tileB [kS ][nS ]
syncthreads()
load subvectors of first column in tileA into fragA[0][mR]
load subvectors of first row in tileB into fragB [0][nR]
for block k = 0 : kS : k do
prefetch one subcolumn of next mS × kS block of X into next0A[mR]
prefetch one subcolumn of next mS × kS block of Y into next1A[mR]
prefetch one subrow of next kS × nS block of V into next0B [nR]
prefetch one subrow of next kS × nS block of W into next1B [nR]
for warp k = 0 : 1 : kS do
prefetch subvectors of next column in tileA into fragA[(warp k + 1)%2][mR]
prefetch subvectors of next row in tileB into fragB [(warp k + 1)%2][nR]
accumC [mR][nR] += fragA[warp k%2][mR]fragB [warp k%2][nR]
end for
store next0A[mR] + next1A[mR] into tileA[kS ][mS ]
store next0B [nR] + next1B [nR] into tileB [kS ][nS ]
syncthreads()
end for
update mS × nS block of D with accumC [mR][nR]
update mS × nS block of E with accumC [mR][nR]
6.4 Performance experiments
Experimental setup
We perform our experiments on a Tesla V100 PCIe accelerator which is
connected to an Intel Xeon Gold 6132 Skylake server. The Operating System
is CentOS Linux version 7.4.1708. We use CUDA Driver/Runtime Version
9.1/9.0 with CUDA Capability 7.0 and cuBLAS with version 9.0. The GNU
compiler version for compiling the host code is 6.4.0. The nvcc compiler flags
-O3 -Xptxas -v -std=c++11 -gencode arch=compute 70,code=sm 70 are
128
used. As presented in Section 6.1.2, the tested Tesla V100 PCIe accelerator
has a theoretical peak performance of 14.13 TFLOPS with single precision.
Measurement
We report the single precision floating point results using square ma-
trices with the size of m = k = n for each dimension. To time the CUDA
execution of kernels running on the GPU, we use the CUDA events that have
a resolution of approximately half a microsecond. As before, we take Effective
TFLOPS as the main metric to compare the performance of various imple-
mentations.
Effective TFLOPS =
2 ·m · n · k
time (in seconds)
· 10−12. (6.3)
CUTLASS and our implementations of Strassen based on CUTLASS are
tested with different strategies of block sizes to select the highest performing
setup.
Result
Figure 6.8 reports the single precision floating point performance of
cuBLAS, CUTLASS, and various Strassen implementations on V100. The
1-level and 2-level reference implementations [72] are linked with cuBLAS 9.0,
with the operations in Strassen restructured to have only two temporary
matrices.
For the 2-level hybrid implementation, we use reference implementation
129
























1 level, Ours 1 level, Reference
2 level,Hybrid 2 level, Reference
Figure 6.8: Performance of various Strassen implementations on V100 with
single precision. The theoretical peak for the machine is 14.13 TFLOPS. The 1-
level and 2-level reference implementations are from [72] (linked with cuBLAS
9.0). The 2-level hybrid implementation replaces the cublasSgemm function
in the 1-level reference implementation [72] with our 1-level Strassen imple-
mentation.
130
in the top level, and our 1-level implementation in the bottom level.6
By comparing the performance of various implementations, we make
the following observations:
• For 1-level, our implementation of Strassen outperforms CUTLASS and
cuBLAS when the problem sizes m = k = n are as small as 3000. The
reference implementation cannot get the comparable performance with our
implementation until the problem sizes are larger than 10000.
• For 2-level, the reference implementation cannot beat the hybrid implemen-
tation until the problem size is larger than 15000.
• Our implementation has the same memory consumption as CUTLASS, while
the 1-level reference implementation consumes much more memory.
• Our 1-level Strassen implementation and 2-level hybrid implementation
achieve the best performance over the entire spectrum of problem sizes com-
pared to the reference implementation, with no or less additional memory
consumption.
6We also tried to extend the insights from Figure 6.6 to 2-level Strassen implementation.
However, it requires up to four times more registers per thread compared to gemm. Due




We have presented a practical implementation of Strassen’s algorithm
on GPUs, which outperforms the state-of-the-art implementation on small
problem sizes and consumes no additional memory compared to gemm. By
developing a specialized kernel, we utilized the GPU memory hierarchy and
thread hierarchy. By reusing the shared memory to store the temporary ma-
trix sum during the packing process and the register files to hold the tempo-
rary matrix product during the accumulation process, we avoided the extra
workspace requirement and reduced the additional memory movement. Be-
sides the intra-kernel parallelism across the thread blocks, warps, and threads
similar to gemm implementation on GPUs, we also exploited the inter-kernel
parallelism with multi-kernel streaming, atomic write with different granular-
ity, and launching a kernel with a 3-D grid to reduce the multi-kernel launch
overhead. The fringes can be handled internally during the packing and ac-
cumulation process. We also leveraged the software prefetching to hide the
latency of data movement across the memory hierarchy. Together, we achieved




In this dissertation, we explored practical implementations of Strassen’s
algorithm and other Strassen-like fast matrix multiplication algorithms on
CPU and GPU architectures. The conventional implementations of these
Strassen and similar Strassen-like FMM algorithms led to the “street wis-
dom” that they are only practical for large, relatively square matrices, that
they require considerable workspace, and that they are difficult to achieve
thread-level parallelism. We dispelled these notions, demonstrating significant
benefits for small and non-square matrices, requiring no workspace beyond
what is already incorporated in high-performance implementations of matrix
multiplication, and achieving performance benefits on the Intel Xeon Phi pro-
cessor with 60 cores executing 240 threads and on the latest Volta GPU device
with over 14 TFLOPS theoretical peak performance for single precision. The
resulting families of algorithms can serve as a drop-in replacement for matrix
multiplication, which has numerous scientific applications. This study showed
that Strassen and Strassen-like fast matrix multiplication algorithms can be
incorporated into libraries for practical use.
133
7.1 Results
In this dissertation, a number of novel contributions have been reported.
• Practical implementations of Strassen on various CPU architec-
tures. Key to the work is that the linear combinations of submatrices that
underlie Strassen can be incorporated in and composed with the prim-
itives of high-performance gemm implementations exposed by the BLAS-
like Library Instantiation Software (BLIS), which discloses the fundamental
building blocks below the BLAS interface that can be used to build the new
algorithms and fuse a sequence of BLAS-like operations to avoid repeated
memory movements that constitute overhead. Incorporating the matrix ad-
ditions that must be performed for Strassen into the inherent packing and
micro-kernel operations inside high-performance implementations of gemm
avoids extra workspace and reduces the cost of extra memory movement.
Adopting the same loop structures as high-performance gemm implemen-
tations allows parallelizations of Strassen with simple but efficient data
parallelization without the expenses of task parallelism. Our implementa-
tions demonstrated performance benefits over the conventional gemm on
a variety of architectures, such as single core, multi-core, many-core, and
distributed memory parallel CPU architectures.
• Code and model generation for the practical implementations of
Strassen-like FMM algorithms. We developed a code generator frame-
work which can automatically implement families of FMM algorithms. This
134
code generator expresses the composition of multi-level FMM algorithms
as Kronecker products. It incorporates the matrix summations that must
be performed for FMM algorithms into the inherent packing and micro-
kernel operations inside gemm, avoiding extra workspace requirement and
reducing the overhead of memory movement. Importantly, it generates a
performance model that is accurate enough to guide the selection of a FMM
implementation as a function of problem size and shape, facilitating the cre-
ation of “poly-algorithms” that select the best algorithm for a problem size.
Without the requirement for exhaustive empirical search, our generated im-
plementations of various FMM algorithms outperformed the state-of-the-art
implementations, especially for smaller matrices and special matrix multi-
plication shapes such as rank-k updates.
• Generalization of practical implementations of Strassen for higher-
dimensional tensor contraction. This work presents the first efficient
implementation of Strassen’s algorithm for tensor contraction, a significant
problem with numerous applications. It describes how to extend Strassen’s
algorithm to tensor contraction without the explicit transposition of data
that inherently incurs significant memory movement and workspace over-
head; it provides a performance model for the cost of the resulting family
of algorithms; it details the practical implementation of these algorithms,
including how to exploit variants of the primitives that underlie BLIS and a
data layout to memory for the tensors; it demonstrates practical speedup on
modern single core and multi-core CPUs; it illustrates how the local use of
135
the Strassen’s tensor contraction algorithm on each node improves perfor-
mance of a simple distributed memory tensor contraction. Together, these
results unlock a new frontier for the research and application of Strassen’s
algorithm.
• Practical implementations of Strassen on Nvidia GPUs. The GPU
architecture is different from a traditional CPU in a number of aspects:
a GPU has its own programming model; it has limited physical memory
size; it is designed as a parallel, throughput-oriented computing engine; it
has a higher ratio of the computation power to the memory bandwidth.
We overcame these challenges for a practical implementation of Strassen
on GPUs, which outperforms the state-of-the-art implementations on small
problem sizes and consumes the same memory compared to the conventional
gemm. We leveraged the GPU thread and memory hierarchy by designing
a dedicated kernel, reduced the additional memory consumption through
reusing the shared memory for the temporary matrix addition result and
the register files for the temporary matrix product in Strassen, and ex-
ploited the inter-kernel parallelism as well as the intra-kernel parallelism.
Finally, our specialized kernel for Strassen on GPUs can attain both more




There are several avenues of research for future work suggested by this
dissertation.
• Special structures of matrix multiplication and tensor contraction.
In this dissertation we target dense matrix multiplication and tensor con-
traction, which have numerous applications. However, the structure of the
matrix and tensor operands may be symmetric [63, 39, 21, 98], which yields
a number of new challenges, like more efficient storage or layout format.
How to explore those structure patterns and combine with Strassen and
similar Strassen-like FMM algorithms can be investigated.
• Other level-3 BLAS. K̊agström et al. [63] and Higham [49] demonstrated
that gemm and FMM can be the basis for level-3 BLAS. In [39, 124, 122], it
is discussed how the GotoBLAS algorithm and BLIS framework for gemm
can be extended to implement other level-3 BLAS (matrix-matrix opera-
tions). This sets the stage for a more thorough investigation of how the
various modifications of Strassen and Strassen-like FMM algorithms can
similarly be applied ot the other level-3 BLAS.
• Higher-level linear algebra functionality. Unique to our implementa-
tion of Strassen and FMM algorithms is the fact that it already attains
high performance for a rank-k update for a relatively small k. Many oper-
ations in libraries like LAPACK [2], ScaLAPACK [19], libflame [121, 43],
137
PLAPACK [118], and Elemental [95], cast higher level linear algebra func-
tionality like LU (with pivoting), QR, and Cholesky factorization in terms
of rank-k updates. A natural question becomes how to accelerate these
important operations with the new Strassen or other FMM algorithms.
This takes the subject full circle in the sense that the original Strassen paper
actually discussed how LU factorization could be accelerated [115].
• Future hierarchical memory architectures. In Section 2.1.6, we briefly
reviewed the related work on implementing a family of algorithms on gen-
eral hierarchical memory architectures. Future computer architectures are
expected to have a lower ratio between the rate of data movement from the
main memory to the caches and the rate of computation, requiring alter-
native gemm algorithms, e.g., presented in [44, 106]. We intend to explore
how to extend these algorithms to Strassen and other Strassen-like FMM
algorithms with the goal of achieving good performance in a low bandwidth
scenario.
• Machine learning primitives. Building upon BLIS, fusing gemm with
other operations can benefit the performance of operations encountered in
machine learning such as the “K-Nearest Neighbor” computation [131]. The
idea is that many memory movements can be avoided by incorporating the
processing of the output of a gemm operation into the implementation, not
unlike how we incorproate the addition of submatrices into the packing of
those submatrices in our Strassen implementation. We plan to investigate
138
whether such machine learning operations can be further accelerated by
incorporating Strassen or other FMM algorithms rather than classical
gemm.
• Mixed precision. More levels of Strassen and Strassen-like FMM al-
gorithms may lose precision due to numerical instability issues. It may
be possible to combine the proposed techniques with Extended and Mixed
Precision BLAS [78] to get higher speedup and maintain precision.
• Task parallelism. Task parallelism and various parallel schemes are pro-
posed in the recent literature [26, 9]. In [9], the authors of that paper discuss
alternative ways for achieving thread-level parallelism: parallelism with the
gemm calls for the individual multiplications with submatrices; task-level
parallelism where each multiplication with submatrices is viewed as a task;
and a combination of these techniques. In Chapter 6, we exploited such
inter-kernel and the intra-kernel parallelism. We need to pursue how our
techniques compare to these and how to combine these with our advances
in Strassen and Strassen-like FMM algorithms. It may also be possible to
utilize our performance model to help better task scheduling.
• Searching for new FMM algorithms with the performance model.
Finding new FMM algorithms by searching the coefficient matrix JU, V,W K
is an NP-hard problem [68]. It may be possible to prune branches with the
performance model as the cost function during the search process.
139
• Communication lower bound. The asymptotic communication lower
bound for Strassen and the conventional matrix multiplication has been
characterized, in [62, 7, 101, 10]. We would like to investigate how to ap-
ply our performance model to constrain the coefficients of the cubic and
quadratic terms and get more precise lower bound for specific architectures.
• Pedagogical outreach. We have created a pedagogical “sandbox” we
call BLISlab [57, 52], which teaches relative novices to the science of high-
performance computing how to optimize gemm within a simplified BLIS-like
framework in courses [119, 93]. We intend to extend this exercise to also
guide the participants through the optimizations that underlie Strassen
and the Strassen-like FMM implementations.
• Distributed memory. Practical distributed memory parallel algorithms
for gemm are invariably variations on the Scalable University Matrix Mul-
tiplication Algorithm (SUMMA) [120, 99]. In Chapter 3, we discuss how a
SUMMA algorithm can benefit from a call to Strassen for the node-level
matrix-matrix multiplication. In contrast, in [42], SUMMA was extended
to incorporate Strassen, with Strassen applied at the top level, paral-
lelism within each multiplication with submatrices, and a call to a standard
dgemm for the local call. On the one hand, the performance of our approach
that incorporates Strassen in the local gemm needs to be compared to
these implementations. On the other hand, it may be possible to add a
local Strassen gemm into these parallel implementations. Alternatively,
140
the required packing may be incorporated into the communication of the
data.
In short: there are many variations of the Strassen and similar Strassen-like







APA Arbitrary Precision Approximate.
ATLAS Automatically Tuned Linear Algebra Software.
BLAS Basic Linear Algebra Subprogram.
BLIS BLAS-like Library Instantiation Software.
CCSD Coupled Cluster with Single and Double excitations.
CUTLASS CUDA Templates for Linear Algebra Subroutines.
CTF Cyclops Tensor Framework.
DLA Dense Linear Algebra.
DF Density-fitting.
DRAM Dynamic Random Access Memory.
FMA Fused Multiply Add.
FMM Fast Matrix Multiplication.
gemm General Matrix Multiplication.
GFLOPS Giga (109) FLoating-point Operations Per Second.
HPC High-Performance Computing.
MOMMS Multilevel Optimized Matrix-matrix Multiplication Sandbox.
NRMSE Normalized Root-Mean-Square Error.
PHiPAC Portable High Performance ANSI C.
SM Streaming Multiprocessor.
SUMMA Scalable Universal Matrix Multiplication Algorithm.
TACC Texas Advanced Computing Center.
TC Tensor Contraction.




α, β, ... Scalar variables.
a, b, c, ... Vector variables.
A,B,C, ... Matrix variables.
A,B, C, ... Tensor variables.
A,B,C, ... Matrix views of tensors.
m,n, k Matrix dimensions.
mC , nC , kC Cache blocking parameters for GotoBLAS.
mR, nR Register blocking parameters for GotoBLAS or CUTLASS.
mS, nS, kS Shared memory blocking parameters for CUTLASS.
mW , nW Warp blocking parameters for CUTLASS.
Im, Jn, Pk Index bundles. See Section 5.1.1.
〈m̃, k̃, ñ〉 Partitions for FMM algorithms. See Section 4.1.1.
JU, V,W K The set of coefficients that determine the 〈m̃, k̃, ñ〉 algorithm.
rscat, cscat Scatter vectors for the matrix view of tensors.




〈3, 2, 3〉 FMM algorithm
In this appendix, we present an example of a FMM algorithm with
〈3, 2, 3〉 partition, originally from [6, 8, 9].
Consider C := C + AB, where C, A, and B are m × n, m × k, and
k × n matrices, respectively. Assuming that m and n can be evenly divided
by 3, and k is even, this 〈3, 2, 3〉 FMM algorithm has the partition
C=
 C0 C1 C2C3 C4 C5
C6 C7 C8
, A=
 A0 A1A2 A3
A4 A5


















B and C, respectively, with a single index in the row major order. It can be
verified that the operations in Figure C.1 also compute C := C+AB, requiring
only 15 multiplications with submatrices instead of 3× 2× 3 = 18 submatrix
multiplication. Theoretically this algorithm can reach (18 − 15)/15 = 20%
speedup, if we ignore the lower-order term of submatrix additions.
The following JU, V,W K specifies the set of coefficients that determine
this one-level 〈3, 2, 3〉 FMM algorithm, where U , V , and W are 6× 15, 6× 15,
and 9× 15 matrices. Note that the entries uir, vjr, and wpr in the rth column
of JU, V,W K, specify the cofficients for the operation in rth row in Figure C.1,
145
and i, j, p are mapped to the submatrix index in (C.1). For example, In the
first column of JU, V,W K, −1, 1, 1 of U are corresponding to A1, A3, and A5;
1, 1 of V are corresponding to B4 and B5; and −1 of W are corresponding to
C2. These coefficients are mapped to the operation in the first row:
M0 = (−A1 + A3 + A5)(B4 +B5);C2−= M0;
U=

0 −1 −1 0 −1 0 0 0 −1 −1 0 0 0 0 −1
−1 0 0 0 1 0 0 0 0 1 0 0 1 0 1
0 1 1 1 1 1 −1 0 1 0 0 0 0 1 1
1 0 0 0 0 0 1 1 0 0 0 0 0 0 −1
0 0 0 −1 0 0 1 0 1 0 0 1 0 −1 0




0 0 0 0 1 1 0 0 0 −1 0 0 1 1 0
0 0 −1 0 1 0 0 0 1 −1 0 0 1 0 0
0 1 0 1 0 0 0 0 1 0 1 1 0 −1 0
0 0 0 −1 0 0 1 1 0 0 0 0 1 1 0
1 −1 −1 0 1 0 0 0 0 0 0 0 1 0 1




0 0 −1 0 −1 1 0 0 0 0 0 0 1 0 0
0 0 1 0 1 −1 0 0 0 1 0 0 0 0 0
−1 −1 0 1 0 0 −1 1 0 0 1 1 0 0 −1
0 0 0 0 0 1 0 1 0 0 0 0 0 0 0
0 0 0 0 1 −1 0 0 0 1 0 0 0 0 −1
0 0 0 1 0 0 −1 1 0 0 0 1 0 0 0
0 0 0 −1 0 1 0 0 0 0 1 0 0 −1 0
0 −1 1 0 0 0 0 0 1 0 0 −1 0 0 0
0 0 0 0 0 0 0 0 0 0 1 1 0 0 0

.
Other FMM algorithms with different partitions found in the literature
and tested in our experiment in Chapter 4 are listed in Figure 4.2 and more
details can be found in [53].
146
M0 = (−A1 + A3 + A5)(B4 +B5); C2−= M0;
M1 = (−A0 + A2 + A5)(B0 −B5); C2−= M1; C7−= M1;
M2 = (−A0 + A2)(−B1 −B4); C0−= M2; C1+= M2; C7+= M2;
M3 = (A2 − A4 + A5)(B2 −B3 +B5); C2+= M3; C5+= M3; C6−= M3;
M4 = (−A0 + A1 + A2)(B0 +B1 +B4); C0−= M4; C1+= M4; C4+= M4;
M5 = (A2)(B0); C0+= M5; C1−= M5; C3+= M5; C4−= M5; C6+= M5;
M6 = (−A2 + A3 + A4 − A5)(B3 −B5); C2−= M6; C5−= M6;
M7 = (A3)(B3); C2+= M7; C3+= M7; C5+= M7;
M8 = (−A0 + A2 + A4)(B1 +B2); C7+= M8;
M9 = (−A0 + A1)(−B0 −B1); C1+= M9; C4+= M9;
M10 = (A5)(B2 +B5); C2+= M10;C6+= M10;C8+= M10;
M11 = (A4 − A5)(B2); C2+= M11;C5+= M11;C7−= M11;C8+= M11;
M12 = (A1)(B0 +B1 +B3 +B4); C0+= M12;
M13 = (A2 − A4)(B0 −B2 +B3 −B5); C6−= M13;
M14 = (−A0 + A1 + A2 − A3)(B5); C2−= M14;C4−= M14;
Figure C.1: All operations for an example of one-level 〈3, 2, 3〉 FMM algorithm.
147
Appendix D
Derivation of Kronecker Product
In this appendix, we derive that the coefficient matrices of a multi-level
FMM algorithm can be represented as the Kronecker product of the coefficient
matrices of each level.
Theorem D.1. (2-level) For a 2-level FMM algorithm, assuming the set of
coefficients that determine the first level 〈m̃, k̃, ñ〉 FMM algorithm is JU, V,W K,
the set of coefficients that determine the second level 〈m̃′, k̃′, ñ′〉 FMM algo-
rithm is JU ′, V ′,W ′K, and the submatrices of A, B and C are indexed with
a 2-level recursive block storage indexing, then the set of coefficients of this
two-level FMM algorithm is JU ⊗ U ′, V ⊗ V ′,W ⊗W ′K.
Proof. Following our representations in Section 4.1, the first level 〈m̃, k̃, ñ〉
FMM algorithm with the set of coefficients JU, V,W K can be rewritten as,














Cp+= wprMr (p = 0, ..., m̃ñ− 1)
(D.1)
where (×) is a matrix multiplication that can be done recursively, uir, vjr, and
wpr are entries of a (m̃k̃)×R matrix U , a (k̃ñ)×R matrix V , and a (m̃ñ)×R
148
matrix W , respectively. Note that here Ai, Bj, and Cp are the submatrices of
A, B and C, with a single index in the row major order.
Similarly, the second level 〈m̃′, k̃′, ñ′〉 FMM algorithm with the set of
coefficients JU ′, V ′,W ′K can be rewritten as,























′ = 0, ..., m̃′ñ′ − 1)
(D.2)
where u′i′r′ , v
′
j′r′ , and w
′
p′r′ are entries of a (m̃
′k̃′)×R′ matrix U ′, a (k̃′ñ′)×R′
matrix V ′, and a (m̃′ñ′)× R′ matrix W ′, respectively. Note that here A′i, B′j,
and C ′p are the submatrices of Ai, Bj and Cp of the first level, with a single
index in the row major order.
By passing (D.2) into (D.1), the two-level FMM algorithm composed
of the above first level and second level FMM algorithms can be written as,
for r = 0, ..., R− 1,


























(p = 0, ..., m̃ñ− 1; p′ = 0, ..., m̃′ñ′ − 1)
(D.3)
where A(i,i′), B(j,j′), and C(p,p′) are the i
′th subblock in the second level par-
tition of ith submatrix in the first level partition of A, the j′th subblock in
149
the second level partition of jth submatrix in the first level partition of B, the
p′th subblock in the second level partition of pth submatrix in the first level
partition of C, respectively.
Through interchanging the order of summation, the two-level FMM
algorithm can be rewritten as, (D.3) can be rewritten as,
for r = 0, ..., R− 1,
































(p = 0, ..., m̃ñ− 1; p′ = 0, ..., m̃′ñ′ − 1).
(D.4)
Since the submatrices of A, B and C are indexed with a 2-level recursive block
storage indexing, A(i,i′), B(j,j′), and C(p,p′) can be indexed with a single index
in the two-level recursive blocking block storage indexing, that is, A
i·m̃′k̃′+i′ ,
Bj·k̃′ñ′+j′ , Cp·m̃′ñ′+p′ , respectively. Therefore, (D.4) can be rewritten as,
for r = 0, ..., R− 1,

































(p = 0, ..., m̃ñ− 1; p′ = 0, ..., m̃′ñ′ − 1).
(D.5)
150
Let’s use a%b to denote the remainder of a/b, and assume that r∗ = r ·R′+r′,
i∗ = i · m̃k̃ + i′, j∗ = j · k̃ñ + j′, and p∗ = p · m̃ñ + p′. Then, r = br∗/R′c,













, p′ = p∗%(m̃′ñ′). By replacing r, r′, i, i′, j, j′, p, and p′
with r∗, i∗, j∗, and p∗, and merging the summation, (D.5) can be rewritten as,





























(p∗ = 0, ..., m̃ñ · m̃′ñ′ − 1).
(D.6)
If X and Y are m × n and p × q matrices with (i, j) entries denoted
by xi,j and yi,j, respectively, then the Kronecker product [41] X ⊗ Y is the
mp× nq matrix given by
X ⊗ Y =
 x0,0Y · · · x0,n−1Y... . . . ...
xm−1,0Y · · · xm−1,n−1Y
 .
Thus, entry (X ⊗ Y )p·r+v,q·s+w = xr,syv,w, or, (X ⊗ Y )i,j = xbi/pc,bj/qcyi%p,j%q.
With this definition of Kronecker product, (D.6) can be rewritten as,
151











(V ⊗ V ′)j∗,r∗ Bj∗
)
;
Cp∗+= (W ⊗W ′)p∗,r∗Mr∗
(p∗ = 0, ..., m̃ñ · m̃′ñ′ − 1).
(D.7)
Thus, assume each submatrix of A, B, and C is partitioned with another level
of 〈m̃′, k̃′, ñ′〉 FMM algorithm with the coefficients JU ′, V ′,W ′K, and Ai, Bj, Cp
are the submatrices of A, B and C, with a single index in two-level recursive
block storage indexing. Then we have proved (D.7), that is, C := C+AB can
be computed by,











(V ⊗ V ′)j,rBj
)
;
Cp+= (W ⊗W ′)p,rMr(p = 0, ..., m̃ñ · m̃′ñ′ − 1).
(D.8)
Theorem D.2. (L-level generalization) Assuming that the set of coef-
ficients that determine the l-level 〈m̃l, k̃l, ñl〉 algorithm is JUl, Vl,WlK (l =
0, 1, ..., L−1;L ≥ 1), and the submatrices of A, B and C are indexed with
a L-level recursive block storage indexing, then the set of coefficients of this








Proof. This can be derived by mathematical induction.
Base Step. For L = 1, following our representations in Section 4.1, since 1-
level recrusive block storage indexing is the same as row major order indexing,
the set of coefficients of 1-level 〈m̃0, k̃0, ñ0〉 algorithm is JU0, V0,W0K.
Inductive Hypothesis. Assume that for some L = K (K ≥ 1), the set of







we want to show that the proposition holds for L = K + 1.
Inductive Step. For L = K + 1, the (K + 1)-level recursive block storage
indexing can be viewed as the K-level recursive block storage indexing and a
following 1-level block storage indexing on top of the K-level partition. the
(K + 1)-level FMM algorithm can also be viewed as the K-level FMM algo-
rithm and a following 1-level 〈m̃K , k̃K , ñK〉 algorithm with the set of coeffi-
cients JUK , VK ,WKK. By Theorem D.1 and the inductive hypothesis, we have




















We showed that L = 1 satisfies the property and that if some K (K ≥ 1)
satisfies it, then K + 1 satisfies it as well. This shows, by the induction
principle, this theorem holds for any L ≥ 1, so this completes the proof.
The formula for defining the L-level FMM algorithm is given by,
153
for r = 0, ...,
∏L−1

























Wl)p,rMr(p = 0, ...,
∏L−1
l=0 m̃lñl − 1)
(D.9)
where Ai, Bj, Cp are the submatrices of A, B and C, with a single index in




In this appendix, we analyze the theoretical and empirical numerical
stability result of Strassen and similar Strassen-like FMM algorithms, and
discuss some possible algorithmic techniques to improving the numerical ac-
curacy.
E.1 Numerical stability for Strassen’s algorithm
E.1.1 Theoretical bound
There once was a widespread viewpoint that Strassen is severely nu-
merically unstable. However, this viewpoint is unfounded [16]. The fact is
that Strassen does meet a somewhat weaker error bound than the classical
matrix multiplication algorithm (Section 2.1) [50, 28, 6]:
• Strassen only meets a norm-wise bound, while the classical matrix
multiplication algorithm further meets an element-wise bound.
• The constant term in the norm-wise bound for Strassen is larger than
that for the classical algorithm.
As illustrated in [50], The forward norm-wise error bound for L-level
155


















Strassen’s algorithm, double precision
Theoretical bound, 3 level
Theoretical bound, 2 level
Theoretical bound, 1 level
Actual max-norm error, 3 level
Actual max-norm error, 2 level
Actual max-norm error, 1 level
Actual mean error, 3 level
Actual mean error, 2 level
Actual mean error, 1 level


















Strassen’s algorithm, single precision
Theoretical bound, 3 level
Theoretical bound, 2 level
Theoretical bound, 1 level
Actual max-norm error, 3 level
Actual max-norm error, 2 level
Actual max-norm error, 1 level
Actual mean error, 3 level
Actual mean error, 2 level
Actual mean error, 1 level
Figure E.1: Empirical stability experiment for Strassen. The actual max-
norm error and mean error vs. theoretical error bound for square matrices in
double precision and single precision.
156
Strassen is





)− 5n)‖A‖‖B‖u +O(u2) (E.1)
where M is an n × n matrix, ‖M‖ := maxi,j|Mi,j| for M ∈ {A,B,C}, and u
is the unit roundoff. For single precision, u = 2−24 ≈ 5.96× 10−8; For double
precision, u = 2−53 ≈ 1.11× 10−16.
As a comparison, the forward norm-wise error bound for the classical
algorithm given in [50] is
‖C − Ĉ‖ ≤ n‖A‖‖B‖u +O(u2).
E.1.2 Empirical experiment
The theoretical bound given in [50] is too loose. In Figure E.1, we show
the measured max-norm absolute error (maxi,j|Ci,j− Ĉi,j|) and mean absolute
error ( 1
n2
Σi,j|Ci,j−Ĉi,j|) compared to the theoretical bound for square matrices
with Strassen in Chapter 3 in double precision and single precision, where
each entry is uniformly randomly generated from [−1, 1].
E.2 Numerical stability for FMM algorithms
E.2.1 Theoretical bound
Strassen-like FMM algorithms are also norm-wise numerical stable. A
number of papers [13, 50, 28, 6] reveal that the norm-wise stability bounds for
L-level FMM algorithms can be present as
‖C − Ĉ‖ ≤ falg(n, L)‖A‖‖B‖u +O(u2) (E.2)
157















FMM, double precision, 1 level
〈2, 2, 2〉 〈2, 3, 2〉 〈2, 3, 4〉 〈2, 4, 3〉
〈2, 5, 2〉 〈3, 2, 2〉 〈3, 2, 3〉 〈3, 2, 4〉
〈3, 3, 2〉 〈3, 3, 3〉 〈3, 3, 6〉 〈3, 4, 2〉
〈3, 4, 3〉 〈3, 5, 3〉 〈3, 6, 3〉 〈4, 2, 2〉
〈4, 2, 3〉 〈4, 2, 4〉 〈4, 3, 2〉 〈4, 3, 3〉
〈4, 4, 2〉 〈5, 2, 2〉 〈6, 3, 3〉















FMM, double precision, 2 level














FMM, single precision, 1 level
〈2, 2, 2〉 〈2, 3, 2〉 〈2, 3, 4〉 〈2, 4, 3〉
〈2, 5, 2〉 〈3, 2, 2〉 〈3, 2, 3〉 〈3, 2, 4〉
〈3, 3, 2〉 〈3, 3, 3〉 〈3, 3, 6〉 〈3, 4, 2〉
〈3, 4, 3〉 〈3, 5, 3〉 〈3, 6, 3〉 〈4, 2, 2〉
〈4, 2, 3〉 〈4, 2, 4〉 〈4, 3, 2〉 〈4, 3, 3〉
〈4, 4, 2〉 〈5, 2, 2〉 〈6, 3, 3〉














FMM, single precision, 2 level
Figure E.2: Empirical stability experiment for 1-level and 2-level FMM algorithms. The actual max-norm
error for square matrices in double precision and single precision.
158
where falg is a polynomial function of n depending on the algorithm.
E.2.2 Empirical experiment
In Figure E.2, we present the measured max-norm absolute error (‖C−
Ĉ‖) for square matrices with 1-level and 2-level FMM algorithms in Chapter 4
in double precision and single precision, where each entry is uniformly ran-
domly generated from [−1, 1].
E.3 Improving the numerical accuracy
The numerical accuracy of FMM algorithms depends on not only the
properties of the algorithms, but also the intrinsic structures of the input
matrices. There are several strategies for improving the error guarantees.
• Reducing the number of levels of FMM algorithms. The numerical error
decreases as fewer levels of FMM algorithms are employed, as observed
in Figure E.1 and Figure E.2. However, fewer levels may mean less
performance improvement with FMM algorithms. There is a trade-off
between the levels of FMM algorithms and the performance.
• Selecting the FMM algorithm with better numerical stability. This may
involve an exhaustive search and more details can be found in [6].
• Diagonal scaling by preprocessing A and B and postprocessing C [28, 6]
in order to improve the intrisic structure of the input matrices, that is,
to decrease ‖A‖ and ‖B‖ in Equation (E.1) and Equation (E.2). The
159








for any nonsingular diagonal scaling matrices, DA, DB, and D. With the
associativity property of matrix multiplication, these diagonal scaling
matrices can be first applied to the input matrices A and B as a prepro-
cessing step, and later applied to the matric product as a postprocessing
step, which involve an extra O(N2) cost. The scaling technique can be
incorporated in the the packing and micro-kernel routines for CPUs, and
in the packing and accumulating process for GPUs.
Finally, we note that the numerical accuracy loss caused by using
Strassen or similar Strassen-like FMM algorithms instead of the classical
matrix multiplication is likely no worse than the other computations encoun-
tered in a larger application.
160
Bibliography
[1] AMD rocBLAS. GitHub Repository, 2018. https://github.com/ROCm
SoftwarePlatform/rocBLAS.
[2] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, Jack J.
Dongarra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney,
and D. Sorensen. LAPACK Users’ Guide (Third Ed.). SIAM, Philadel-
phia, PA, USA, 1999.
[3] ATLAS. Available Online, 2016. http://math-atlas.sourceforge.net.
[4] Brett W. Bader, Tamara G. Kolda, et al. Matlab tensor toolbox ver-
sion 2.6. Available Online, February 2015. http://www.sandia.gov/
~tgkolda/TensorToolbox/.
[5] D. H. Bailey. Extra high speed matrix multiplication on the Cray-
2. SIAM Journal of Scientific and Statistical Computing, 8(3):603–607,
1988.
[6] Grey Ballard, Austin R Benson, Alex Druinsky, Benjamin Lipshitz,
and Oded Schwartz. Improving the numerical stability of fast matrix
multiplication. SIAM Journal on Matrix Analysis and Applications,
37(4):1382–1418, 2016.
161
[7] Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded
Schwartz. Communication-optimal parallel algorithm for Strassen’s ma-
trix multiplication. In Proceedings of the 24th ACM Symposium on
Parallelism in Algorithms and Architectures (SPAA 12), pages 193–204.
ACM, 2012.
[8] Austin R. Benson. Fast matrix multiplication. GitHub Repository.
https://github.com/arbenson/fast-matmul.
[9] Austin R. Benson and Grey Ballard. A framework for practical parallel
fast matrix multiplication. In Proceedings of the 20th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming (PPoPP
2015), pages 42–53. ACM, 2015.
[10] Gianfranco Bilardi and Lorenzo De Stefani. The I/O complexity of
Strassen’s matrix multiplication with recomputation. arXiv preprint
arXiv:1605.02224, 2016.
[11] Jeff Bilmes, Krste Asanović, Chee whye Chin, and Jim Demmel. Opti-
mizing matrix multiply using PHiPAC: a Portable, High-Performance,
ANSI C coding methodology. In Proceedings of International Confer-
ence on Supercomputing, Vienna, Austria, July 1997.
[12] Dario Bini, Milvio Capovani, Francesco Romani, and Grazia Lotti. O
(n2.7799) complexity for n×n approximate matrix multiplication. Infor-
mation Processing Letters, 8(5):234–235, 1979.
162
[13] Dario Bini and Grazia Lotti. Stability of fast algorithms for matrix
multiplication. Numerische Mathematik, 36(1):63–72, 1980.
[14] BLAS-like Library Instantiation Software Framework (BLIS). GitHub
Repository, 2018. https://github.com/flame/blis.
[15] Brice Boyer, Jean-Guillaume Dumas, Clément Pernet, and Wei Zhou.
Memory efficient scheduling of Strassen-Winograd’s matrix multiplica-
tion algorithm. In Proceedings of the 2009 International Symposium
on Symbolic and Algebraic Computation, ISSAC 09, pages 55–62, New
York, NY, USA, 2009. ACM.
[16] R. P. Brent. Error analysis of algorithms for matrix multiplication
and triangular decomposition using winograd’s identity. Numerische
Mathematik, 16(2):145–156, Nov 1970.
[17] L.E. Cannon. A Cellular Computer to Implement the Kalman Filter
Algorithm. PhD thesis, Montana State University, 1969.
[18] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen,
John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient
primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
[19] J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A
scalable linear algebra library for distributed memory concurrent com-
puters. In Proceedings of the Fourth Symposium on the Frontiers of
163
Massively Parallel Computation, pages 120–127. IEEE Comput. Soc.
Press, 1992.
[20] clBLAS. GitHub Repository, 2017. https://github.com/clMathLibra
ries/clBLAS.
[21] Pierre Comon, Gene Golub, Lek-Heng Lim, and Bernard Mourrain.
Symmetric tensors and symmetric tensor rank. SIAM Journal on Matrix
Analysis and Applications, 30(3):1254–1279, 2008.
[22] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic
progressions. In Proceedings of the Nineteenth Annual ACM Symposium
on Theory of Computing, STOC 87, pages 1–6, New York, NY, USA,
1987. ACM.
[23] Don Coppersmith and Shmuel Winograd. On the asymptotic complex-
ity of matrix multiplication. SIAM Journal on Computing, 11(3):472–
492, 1982.
[24] T.D. Crawford and H.F. Schaefer III. An introduction to coupled cluster
theory for computational chemists. Rev. Comp. Chem., 14:33–136,
2000.
[25] Paolo D’Alberto. The better accuracy of Strassen-Winograd algorithms
(FastMMW). Advances in Linear Algebra & Matrix Theory, 4(01):9,
2014.
164
[26] Paolo D’Alberto, Marco Bodrato, and Alexandru Nicolau. Exploiting
parallelism in matrix-computation kernels for symmetric multiprocessor
systems: Matrix-multiplication and matrix-addition algorithm optimiza-
tions by software pipelining and threads allocation. ACM Trans. Math.
Softw., 38(1):2:1–2:30, December 2011.
[27] Paolo D’Alberto and Alexandru Nicolau. Adaptive Winograd’s matrix
multiplications. ACM Trans. Math. Softw., 36(1):3:1–3:23, March 2009.
[28] James Demmel, Ioana Dumitriu, Olga Holtz, and Robert Kleinberg.
Fast matrix multiplication is stable. Numerische Mathematik, 106(2):199–
224, 2007.
[29] E. Di Napoli, D. Fabregat-Traver, G. Quintana-Orti, and P. Bientinesi.
Towards an efficient use of the BLAS library for multilinear tensor con-
tractions. Appl. Math. Comput., 235:454–468, 2014.
[30] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff.
A set of level 3 basic linear algebra subprograms. ACM Trans. Math.
Soft., 16(1):1–17, March 1990.
[31] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J.
Hanson. An extended set of FORTRAN basic linear algebra subpro-
grams. ACM Trans. Math. Soft., 14(1):1–17, March 1988.
[32] C.C. Douglas, M. Heroux, G. Slishman, and R.M. Smith. GEMMW -
a portable level 3 BLAS Winograd variant of Strassen’s matrix-matrix
165
multiplication algorithm. J. Computational Physics, pages 1–10, 1994.
[33] Jean-Guillaume Dumas and Victor Pan. Fast matrix multiplication and
symbolic computation. arXiv preprint arXiv:1612.05766, 2016.
[34] Brett I. Dunlap. Robust and variational fitting. Phys. Chem. Chem.
Phys., 2:2113–2116, 2000.
[35] Thom H. Dunning Jr. Gaussian basis sets for use in correlated molecular
calculations. i. the atoms boron through neon and hydrogen. J. Chem.
Phys., 90:1007–1023, 1989.
[36] Luke Durant, Olivier Giroux, Mark Harris, and Nick Stam. Inside Volta:
The world’s most advanced data center GPU. Nvidia Developer Blog,
2017. https://devblogs.nvidia.com/inside-volta.
[37] Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo K̊agström. Re-
cursive blocked algorithms and hybrid data structures for dense matrix
library software. SIAM review, 46(1):3–45, 2004.
[38] Kazushige Goto and Robert A. van de Geijn. Anatomy of a high-
performance matrix multiplication. ACM Trans. Math. Soft., 34(3):12,
May 2008.
[39] Kazushige Goto and Robert A. van de Geijn. High-performance imple-
mentation of the level-3 BLAS. ACM Trans. Math. Softw., 35(1):4:1–
4:14, July 2008.
166
[40] GOTOBLAS. Available Online, 2010. https://www.tacc.utexas.edu/
research-development/tacc-software/gotoblas2.
[41] Alexander Graham. Kronecker products and matrix calculus: With
applications. John Wiley & Sons, Inc., 605 3rd Ave., New York, NY
10158, 1982.
[42] Brian Grayson and Robert van de Geijn. A high performance parallel
Strassen implementation. Parallel Processing Letters, 6(1):3–12, 1996.
[43] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A.
van de Geijn. FLAME: Formal linear algebra methods environment.
ACM Trans. Math. Softw., 27(4):422–455, December 2001.
[44] John A. Gunnels, Greg M. Henry, and Robert A. van de Geijn. A fam-
ily of high-performance matrix multiplication algorithms. In Vassil N.
Alexandrov, Jack J. Dongarra, Benjoe A. Juliano, René S. Renner, and
C. J. Kenneth Tan, editors, Computational Science — ICCS 2001, pages
51–60, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.
[45] M. Hanrath and A. Engels-Putzka. An efficient matrix-matrix multipli-
cation based antisymmetric tensor contraction engine for general order
coupled cluster. J. Chem. Phys., 133(6):064108, 2010.
[46] A. Hartono, Q. Lu, T. Henretty, S. Krishnamoorthy, H. Zhang, G. Baum-
gartner, D. E. Bernholdt, M. Nooijen, R. Pitzer, J. Ramanujam, and
167
P. Sadayappan. Performance optimization of tensor contraction expres-
sions for many-body methods in quantum chemistry. J. Phys. Chem.
A, 113(45):12715–12723, 2009.
[47] Trygve Helgaker, Poul Jorgensen, and Jeppe Olsen. Molecular Electronic-
Structure Theory. Wiley, Chichester, NY, first edition, February 2013.
[48] John L. Hennessy and David A. Patterson. Computer Architecture,
Fifth Edition: A Quantitative Approach. Morgan Kaufmann Publishers
Inc., San Francisco, CA, USA, 5th edition, 2011.
[49] Nicholas J. Higham. Exploiting fast matrix multiplication within the
level 3 BLAS. ACM Trans. Math. Softw., 16(4):352–368, December
1990.
[50] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms.
SIAM, Philadelphia, PA, USA, second edition, 2002.
[51] C-H Huang, Jeremy R Johnson, and Robert W Johnson. A tensor prod-
uct formulation of Strassen’s matrix multiplication algorithm. Applied
Mathematics Letters, 3(3):67–71, 1990.
[52] Jianyu Huang. BLISlab. GitHub Repository, 2016. https://gith
ub.com/flame/blislab.
[53] Jianyu Huang. Generating families of practical fast matrix multiplica-
tion algorithms. GitHub Repository, 2017. https://github.com/fla
me/fmm-gen.
168
[54] Jianyu Huang, Devin A. Matthews, and Robert A. van de Geijn. Strassen’s
algorithm for tensor contraction. SIAM Journal on Scientific Comput-
ing, 40(3):C305–C326, 2018.
[55] Jianyu Huang, Leslie Rice, Devin A. Matthews, and Robert A. van de
Geijn. Generating families of practical fast matrix multiplication algo-
rithms. In 31th IEEE International Parallel and Distributed Processing
Symposium (IPDPS 2017), pages 656–667, May 2017.
[56] Jianyu Huang, Tyler M. Smith, Greg M. Henry, and Robert A. van de
Geijn. Strassen’s algorithm reloaded. In Proceedings of the Interna-
tional Conference for High Performance Computing, Networking, Stor-
age and Analysis (SC 16), pages 59:1–59:12. IEEE Press, 2016.
[57] Jianyu Huang and Robert A. van de Geijn. BLISlab: A sandbox for
optimizing GEMM. FLAME Working Note #80, TR-16-13, The Uni-
versity of Texas at Austin, Department of Computer Science, 2016.
[58] W. Huang, G. Santhanaraman, H. W. Jin, Q. Gao, and D. K. Panda.
Design of high performance MVAPICH2: MPI2 over infiniband. In
Cluster Computing and the Grid, 2006. CCGRID 06. Sixth IEEE In-
ternational Symposium on, volume 1, pages 43–48, May 2006.
[59] Steven Huss-Lederman, Elaine M. Jacobson, Anna Tsao, Thomas Turn-
bull, and Jeremy R. Johnson. Implementation of Strassen’s algorithm
for matrix multiplication. In Proceedings of the 1996 ACM/IEEE Con-
ference on Supercomputing, SC 96, Washington, DC, USA, 1996. IEEE.
169
[60] IBM ESSL. Available Online, 2018. https://www.ibm.com/support/
knowledgecenter/en/SSFHY8/essl welcome.html.
[61] Intel MKL. Available Online, 2018. https://software.intel.com/en-
us/intel-mkl.
[62] Dror Irony, Sivan Toledo, and Alexander Tiskin. Communication lower
bounds for distributed-memory matrix multiplication. Journal of Par-
allel and Distributed Computing, 64(9):1017–1026, 2004.
[63] Bo K̊agström, Per Ling, and Charles van Loan. GEMM-based level 3
BLAS: High-performance model implementations and performance eval-
uation benchmark. ACM Trans. Math. Softw., 24(3):268–302, Septem-
ber 1998.
[64] Igor Kaporin. The aggregation and cancellation techniques as a practical
tool for faster matrix multiplication. Theoretical Computer Science,
315(2-3):469–510, 2004.
[65] Elaye Karstadt and Oded Schwartz. Matrix multiplication, a little
faster. In Proceedings of the 29th Annual ACM Symposium on Paral-
lelism in Algorithms and Architectures, SPAA 17, New York, NY, USA,
2017. ACM.
[66] Jeremy Kepner and John Gilbert. Graph algorithms in the language of
linear algebra. SIAM, 2011.
170
[67] Andrew Kerr, Duane Merrill, Julien Demouth, and John Tran. CUT-
LASS: Fast linear algebra in CUDA C++. Nvidia Developer Blog, Dec
2017. https://devblogs.nvidia.com/cutlass-linear-algebra-cud
a.
[68] Donald E. Knuth. The Art of Computer Programming, Volume 2 (3rd
Ed.): Seminumerical Algorithms. Addison-Wesley Longman Publishing
Co., Inc., Boston, MA, USA, 1997.
[69] R. Krishnan, J. S. Binkley, R. Seeger, and J. A. Pople. Selfconsistent
molecular orbital methods. xx. a basis set for correlated wave functions.
J. Chem. Phys., 72:650–654, 1980.
[70] Bharat Kumar, C-H Huang, P Sadayappan, and Rodney W Johnson. A
tensor product formulation of Strassen’s matrix multiplication algorithm
with memory reduction. Scientific Programming, 4(4):275–289, 1995.
[71] Julian Laderman, Victor Pan, and Xuan-He Sha. On practical algo-
rithms for accelerated matrix multiplication. Linear Algebra and Its
Applications, 162:557–588, 1992.
[72] P. W. Lai, H. Arafat, V. Elango, and P. Sadayappan. Accelerating
Strassen-Winograd’s matrix multiplication algorithm on GPUs. In
20th Annual International Conference on High Performance Comput-
ing, pages 139–148, Dec 2013.
171
[73] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic
linear algebra subprograms for Fortran usage. ACM Trans. Math. Soft.,
5(3):308–323, Sept. 1979.
[74] François Le Gall. Powers of tensors and fast matrix multiplication.
In Proceedings of the 39th International Symposium on Symbolic and
Algebraic Computation, ISSAC 14, pages 296–303, New York, NY, USA,
2014. ACM.
[75] J. Li, C. Battaglino, I. Perros, J. Sun, and R. Vuduc. An input-adaptive
and in-place approach to dense tensor-times-matrix multiply. In Pro-
ceedings of the International Conference for High Performance Comput-
ing, Networking, Storage and Analysis, SC 15, pages 76:1–76:12, New
York, NY, USA, 2015. ACM.
[76] J. Li, S. Ranka, and S. Sahni. Strassen’s matrix multiplication on
GPUs. In Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th
International Conference on, pages 157–164, Dec 2011.
[77] J. Li, A. Skjellum, and R. D. Falgout. A poly-algorithm for parallel
dense matrix multiplication on two-dimensional process grid topologies.
Concurrency: Practice and Experience, 9(5):345–389, May 1997.
[78] Xiaoye S. Li, James W. Demmel, David H. Bailey, Greg Henry, Yozo
Hida, Jimmy Iskandar, William Kahan, Suh Y. Kang, Anil Kapur,
Michael C. Martin, Brandon J. Thompson, Teresa Tung, and Daniel J.
172
Yoo. Design, implementation and testing of extended and mixed preci-
sion BLAS. ACM Trans. Math. Softw., 28(2):152–205, June 2002.
[79] Benjamin Lipshitz, Grey Ballard, James Demmel, and Oded Schwartz.
Communication-avoiding parallel Strassen: Implementation and perfor-
mance. In Proceedings of the International Conference on High Per-
formance Computing, Networking, Storage and Analysis (SC 12), pages
101:1–101:11. IEEE, 2012.
[80] Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S.
Quintana-Orti. Analytical modeling is enough for high-performance
BLIS. ACM Trans. Math. Softw., 43(2):12:1–12:18, August 2016.
[81] Qingshan Luo and John B. Drake. A scalable parallel Strassen’s matrix
multiplication algorithm for distributed-memory computers. In Pro-
ceedings of the 1995 ACM Symposium on Applied Computing, SAC 95,
pages 221–226, New York, NY, USA, 1995. ACM.
[82] Dmitry I. Lyakh. An efficient tensor transpose algorithm for multicore
CPU, Intel Xeon Phi, and NVidia Tesla GPU. Comput. Phys. Com-
mun., 189:84–91, 2015.
[83] W. Ma, S. Krishnamoorthy, O. Villa, K. Kowalski, and G. Agrawal. Op-
timizing tensor contraction expressions for hybrid CPU-GPU execution.
Cluster Comput., 16(1):131–155, 2011.
173
[84] Devin A. Matthews. TBLIS: Tensor-based library instantiation soft-
ware. Github Repository, 2016. https://github.com/devinamatth
ews/tblis.
[85] Devin A. Matthews. High-performance tensor contraction without
transposition. SIAM Journal on Scientific Computing, 40(1):C1–C24,
2018.
[86] Nvidia. CUTLASS: CUDA templates for linear algebra subroutines
(v0.1.0). GitHub Repository, 2017. https://github.com/NVIDIA/
cutlass/releases/tag/v0.1.0.
[87] Nvidia. cuBLAS. Available Online, 2018. https://developer.nvidia
.com/cublas.
[88] Nvidia. CUDA C programming guide. Available Online, 2018. https:
//docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[89] Nvidia. Summit GPU supercomputer enables smarter science. Nvidia
Developer Blog, 2018. https://devblogs.nvidia.com/summit-gpu-
supercomputer-enables-smarter-science/.
[90] OpenBLAS, an optimized BLAS library. Available Online. http:
//www.openblas.net.
[91] V. Ya. Pan. Strassen’s algorithm is not optimal trilinear technique
of aggregating, uniting and canceling for constructing fast algorithms
174
for matrix operations. In Proceedings of the 19th Annual Symposium on
Foundations of Computer Science, SFCS 78, pages 166–176, Washington,
DC, USA, 1978. IEEE Computer Society.
[92] V. Ya. Pan. Trilinear aggregating with implicit canceling for a new
acceleration of matrix multiplication. Computers & Mathematics with
Applications, 8(1):23–34, 1982.
[93] Devangi N. Parikh, Jianyu Huang, Margaret E. Myers, and Robert A.
van de Geijn. Learning from optimizing matrix-matrix multiplication.
In 8th NSF/TCPP Workshop on Parallel and Distributed Computing
Education (EduPar-18). IEEE, 2018.
[94] E. Peise, D. Fabregat-Traver, and P. Bientinesi. On the performance
prediction of BLAS-based tensor contractions. In S. A. Jarvis, S. A.
Wright, and S. D. Hammond, editors, High Performance Computing
Systems. Performance Modeling, Benchmarking, and Simulation, num-
ber 8966 in Lecture Notes in Computer Science, pages 193–212. Springer
International Publishing, 2014.
[95] Jack Poulson, Bryan Marker, Robert A. van de Geijn, Jeff R. Ham-
mond, and Nichols A. Romero. Elemental: A new framework for
distributed memory dense matrix computations. ACM Trans. Math.
Softw., 39(2):13:1–13:24, February 2013.
[96] Krishnan Raghavachari, Gary W. Trucks, John A. Pople, and Martin
175
Head-Gordon. A fifth-order perturbation comparison of electron corre-
lation theories. Chemical Physics Letters, 157(6):479 – 483, 1989.
[97] Leslie Rice. Performance optimization for the K-Nearest Neighbors
kernel using Strassen’s algorithm. 2017. Undergraduate Honors Thesis,
The University of Texas at Austin.
[98] Martin D. Schatz, Tze Meng Low, Robert A. van de Geijn, and Tamara G.
Kolda. Exploiting symmetry in tensors for high performance: Multipli-
cation with symmetric tensors. SIAM Journal on Scientific Computing,
36(5):C453–C479, 2014.
[99] Martin D. Schatz and Jack Poulson Robert A. van de Geijn. Parallel
matrix multiplication: A systematic journey. SIAM Journal on Scien-
tific Computing, 38(6):C748–C781, 2016.
[100] Arnold Schönhage. Partial and total matrix multiplication. SIAM
Journal on Computing, 10(3):434–455, 1981.
[101] Jacob Scott, Olga Holtz, and Oded Schwartz. Matrix multiplication
I/O-complexity by path routing. In Proceedings of the 27th ACM
Symposium on Parallelism in Algorithms and Architectures (SPAA 15),
pages 35–45. ACM, 2015.
[102] G. E. Scuseria, A. C. Scheiner, T. J. Lee, J. E. Rice, and H. F. Schaefer.
The closed-shell coupled cluster single and double excitation (CCSD)
model for the description of electron correlation. A comparison with
176
configuration interaction (CISD) results. J. Chem. Phys., 85(5):2881,
March 1987.
[103] Isaiah Shavitt and Rodney J. Bartlett. Many-Body Methods in Chem-
istry and Physics: MBPT and Coupled-Cluster Theory. Cambridge
University Press, Cambridge ; New York, first edition, August 2009.
[104] A. V. Smirnov. The bilinear complexity and practical algorithms for
matrix multiplication. Computational Mathematics and Mathematical
Physics, 53(12):1781–1795, 2013.
[105] Tyler M. Smith. Multilevel Optimized Matrix-matrix Multiplication
Sandbox (MOMMS). GitHub Repository. https://github.com/tlrm
chlsmth/momms.
[106] Tyler M. Smith. Theory and practice of classical matrix-matrix multi-
plication for hierarchical memory architectures. 2017. PhD thesis, The
University of Texas at Austin.
[107] Tyler M. Smith and Robert A van de Geijn. Pushing the bounds for
matrix-matrix multiplication. arXiv preprint arXiv:1702.02017, 2017.
[108] Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff R.
Hammond, and Field G. Van Zee. Anatomy of high-performance many-
threaded matrix multiplication. In 28th IEEE International Parallel
and Distributed Processing Symposium (IPDPS 2014), 2014.
177
[109] Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, and
Enrique S. Quintana-Ort́ı. Toward ABFT for BLIS GEMM. FLAME
Working Note #76. Technical Report TR-05-2015, The University of
Texas at Austin, Department of Computer Sciences, June 2015.
[110] Edgar Solomonik, Devin Matthews, Jeff R. Hammond, John F. Stanton,
and James Demmel. A massively parallel tensor contraction framework
for coupled-cluster computations. Journal of Parallel and Distributed
Computing, 74(12):3176–3190, 2014.
[111] Paul Springer and Paolo Bientinesi. Tensor contraction benchmark v0.1.
GitHub Repository, December 2016. https://github.com/hpac/tccg/
tree/master/benchmark.
[112] Paul Springer and Paolo Bientinesi. Design of a high-performance
GEMM-like tensor-tensor multiplication. ACM Trans. Math. Softw.,
44(3):28:1–28:29, January 2018.
[113] Paul Springer, Jeff R. Hammond, and Paolo Bientinesi. TTC: A high-
performance compiler for tensor transpositions. ACM Trans. Math.
Softw., 44(2):15:1–15:21, August 2017.
[114] Andrew James Stothers. On the complexity of matrix multiplication.
2010. PhD thesis, The University of Edinburgh.
[115] Volker Strassen. Gaussian elimination is not optimal. Numerische
Mathematik, 13(4):354–356, August 1969.
178
[116] Volker Strassen. The asymptotic spectrum of tensors and the exponent
of matrix multiplication. In Proceedings of the 27th Annual Symposium
on Foundations of Computer Science, SFCS 86, pages 49–54, Washing-
ton, DC, USA, 1986. IEEE Computer Society.
[117] Mithuna Thottethodi, Siddhartha Chatterjee, and Alvin R. Lebeck. Tun-
ing Strassen’s matrix multiplication for memory efficiency. In Proceed-
ings of the 1998 ACM/IEEE Conference on Supercomputing (SC 98),
pages 1–14. IEEE, 1998.
[118] Robert A. van de Geijn. Using PLAPACK: Parallel Linear Algebra
Package. The MIT Press, 1997.
[119] Robert A. van de Geijn, Jianyu Huang, Margaret E. Myers, Devangi N.
Parikh, and Tyler M. Smith. Lowering barriers into HPC through open
education. In 2017 Workshop on Education for High-Performance Com-
puting (EduHPC). IEEE, 2017.
[120] Robert A. van de Geijn and Jerrell Watts. SUMMA: Scalable universal
matrix multiplication algorithm. Concurrency: Practice and Experi-
ence, 9(4):255–274, April 1997.
[121] Field G. Van Zee. libflame: The Complete Reference. www.lulu.com,
2009.
[122] Field G. Van Zee, Tyler Smith, Francisco D. Igual, Mikhail Smelyanskiy,
Xianyi Zhang, Michael Kistler, Vernon Austel, John Gunnels, Tze Meng
179
Low, Bryan Marker, Lee Killough, and Robert A. van de Geijn. The
BLIS framework: Experiments in portability. ACM Transactions on
Mathematical Software, 42(2):12:1–12:19, June 2016.
[123] Field G. Van Zee and Tyler M. Smith. Implementing high-performance
complex matrix multiplication via the 3m and 4m methods. ACM
Trans. Math. Softw., 44(1):7:1–7:36, July 2017.
[124] Field G. Van Zee and Robert A. van de Geijn. BLIS: A framework for
rapidly instantiating BLAS functionality. ACM Trans. Math. Soft.,
41(3):14:1–14:33, June 2015.
[125] Pete Warden. Why GEMM is at the heart of deep learning. https:
//petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-
deep-learning/, 2015.
[126] R. Clint Whaley and Jack J. Dongarra. Automatically tuned linear alge-
bra software. In SC 98: Proceedings of the 1998 ACM/IEEE Conference
on Supercomputing, pages 38–38, Nov 1998.
[127] J. L. Whitten. Coulombic potential energy integrals and approxima-
tions. The Journal of Chemical Physics, 58(10):4496–4501, 1973.
[128] V. V. Williams. Multiplying matrices faster than Coppersmith-Winograd.
In Proceedings of the Forty-fourth Annual ACM Symposium on Theory
of Computing, STOC 12, pages 887–898, New York, NY, USA, 2012.
ACM.
180
[129] Shmuel Winograd. On multiplication of 2× 2 matrices. Linear algebra
and its applications, 4(4):381–388, 1971.
[130] Kamen Yotov, Xiaoming Li, Gang Ren, MJS Garzaran, David Padua,
Keshav Pingali, and Paul Stodghill. Is search really necessary to gener-
ate high-performance BLAS? Proceedings of the IEEE, 93(2):358–386,
2005.
[131] Chenhan D. Yu, Jianyu Huang, Woody Austin, Bo Xiao, and George
Biros. Performance optimization for the K-Nearest Neighbors kernel on
x86 architectures. In Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis, SC
15, pages 7:1–7:12, New York, NY, USA, 2015. ACM.
181
Vita
Jianyu Huang was born in Jiangsu, China. He received his Bache-
lor’s degree in Computer Science and Technology from Beihang University
(BUAA), Beijing, China in 2013. Since August 2013, he has been a PhD
student in the Department of Computer Science at the University of Texas at
Austin, working with Robert van de Geijn. His research has been supported by
the Microelectronics and Computer Development Fellowship, Graduate School
Summer Fellowship, and Graduate Research Assistantship at UT Austin. He
was a research and software engineering intern at Microsoft Research Asia,
VMware Inc., and Intel Labs.
Email address: jianyu.huang@utexas.edu
This dissertation was typeset with LATEX
† by the author.
†LATEX is a document preparation system developed by Leslie Lamport as a special
version of Donald Knuth’s TEX Program.
182
