The MOMMS Family of Matrix Multiplication Algorithms by Smith, Tyler M. & van de Geijn, Robert A.
ar
X
iv
:1
90
4.
05
71
7v
1 
 [c
s.M
S]
  1
1 A
pr
 20
19
The MOMMS Family of Matrix Multiplication Algorithms
Tyler M. Smith and Robert A. van de Geijn
ABSTRACT
As the ratio between the rate of computation and rate with which
data can be retrieved from various layers of memory continues to
deteriorate, a question arises: Will the current best algorithms for
computing matrix-matrix multiplication on future CPUs continue
to be (near) optimal? is paper provides compelling analytical
and empirical evidence that the answer is “no”. e analytical re-
sults guide us to a new family of algorithms of which the current
state-of-the-art “Goto’s algorithm” is but one member. e empir-
ical results, on architectures that were custom built to reduce the
amount of bandwidth to main memory, show that under different
circumstances, different and particular members of the family be-
come more superior. us, this family will likely start playing a
prominent role going forward.
KEYWORDS
matrix multiplication, dense linear algebra, performance, caches
ACM Reference format:
Tyler M. Smith and Robert A. van de Geijn. 2019. e MOMMS Family of
Matrix Multiplication Algorithms. In Proceedings of e International Con-
ference for High Performance Computing, Networking, Storage, and Analysis,
Denver, CO, November 17–27, 2019 (SC19), 12 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
For almost two decades, the so-called Goto’s algorithm for matrix-
matrixmultiplication (MMM)has guided practical implementations
on current CPUs [6, 7]. e algorithm orchestrates computation
so as to keep a packed copy of a roughly square submatrix (block)
of A in the L2 cache and a packed copy of a row panel of B in
the L3 cache. Major innovations of Goto’s algorithm include stag-
ing a block of A in the L2 cache rather than the L1 cache to re-
duce the amount of data movement by allowing the block of A
to be larger, packing the block of A and panel of B into specially-
formaed contiguous buffers for beer spatial locality, and show-
ing that translation-lookaside buffer (TLB) misses can be a perfor-
mance impediment that is alleviated by reducing the footprint of
the block of A.
For years, the rate of peak computation and the memory move-
ment have been diverging [22], and MMM has been predicted to
soon become a memory-bound operation based on such hardware
trends [2], taking into the hardware requirements for this computa-
tion to be balanced [17]. While Goto’s algorithm is well-suited to
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full cita-
tion on the first page. Copyrights for components of this work owned by others than
the author(s) must be honored. Abstracting with credit is permied. To copy other-
wise, or republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Request permissions from permissions@acm.org.
SC19, Denver, CO
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
978-x-xxxx-xxxx-x/YY/MM. . . $15.00
DOI: 10.1145/nnnnnnn.nnnnnnn
0 500 1000 1500 2000 2500 3000 3500 4000
0
50
100
150
200
250
m = n = k
G
FL
O
P
S
MOMMS Goto
MOMMSC3A2C0
MKL
Figure 1: Comparison of variousMMM implementations for
an architecture with very low bandwidth to main memory.
the relative speeds of caches in current memory hierarchies, we
show, through analysis and empirical studies, that this will not
continue to be the case as bandwidths between various memory
layers continue to deteriorate relative to the rate of computation.
Figure 1 reports the aained performance of various implementa-
tions on a custom-built computer that allows the bandwidth to dif-
ferent memory layers to be artificially reduced1. e curve labeled
MOMMS Goto uses Goto’s algorithm and we believe MKL uses
an algorithm similar to it, and the curve labeled MOMMS C3A2C0
uses an algorithm that more effectively utilizes the L3 cache. e
performance degradation of Goto’s algorithm on such an architec-
ture is significant. As MMM becomes near memory bound, it
is essential that algorithms for this extremely widely-used
operation to utilize the memory hierarchy as effectively as
possible.
is paper first reviews recent theoretical results by Smith et
al. [26] that establish an (essentially) tight lower bound on the
memory traffic incurred by a MMM under a simple model. It then
makes a number of new contributions:
• With these theoretical results as a foundation, it proposes
theMultilevel OptimizedMatrix-matrixMultiplicationSand-
box (MOMMS) family of practical algorithms ofwhichGoto’s
algorithm is a member.
• It analyzes tradeoffs in the number of transfers between
layers of the memory hierarchy that arise when simul-
taneously optimizing for multiple levels of cache. ese
tradeoffs are not explained by current state-of-the-art the-
oretical results.
• It analytically exposes different scenarios under which dif-
ferent algorithms in the family exhibit beneficial charac-
teristics.
1Details of the experiment are given in Section 5.
SC19, November 17–27, 2019, Denver, CO Tyler M. Smith and Robert A. van de Geijn
• It empirically demonstrates the benefits of different algo-
rithms on custom-built hardware that allowsmemory band-
widths to be varied.
Together, these lay the foundation for practical solutions if and
when the balance between computation and memory bandwidth
changes in the future.
2 THEORY AND FUNDAMENTAL SHAPES
We briefly review the state-of-the-art theoretical lower bounds for
the I/O complexity for MMM, and describe algorithms that aain
those bounds and thus are optimal for a simple model of a two-
level memory hierarchy. ese algorithms become our fundamen-
tal components from which we compose practical algorithms for
multiple levels of cache, in Section 3.
2.1 An I/O lower bound for MMM
Smith et al. [26] startswith a simplemodel ofmemorywith two lay-
ers of memory: a small, fast memory with capacity ofM elements
and a large, slow memory with unlimited capacity. It shows that
any algorithm for ordinary MMM2 must read at least 2mnk/
√
M −
2M elements from slow memory and additionally write at least
mn−M elements to slowmemory. Adding these two lower bounds
gives a lower bound on the number of transfers between slow and
fast memory, called the I/O lower bound, of approximately 2mnk/
√
M .
Importantly, this lower bound is tight, modulo lower order terms.
It improves upon previous work [3, 13, 15].
2.2 Resident algorithms for MMM
In [26], it is shown that three algorithms, named Resident A, Res-
ident B, and Resident C, aain the lower bound on the number of
reads from slow memory3. Additionally, Resident C aains the
lower bound on the number of writes to slow memory3. In each
algorithm, the elements of one of the operand matrices are read
from slow memory only once, and each element of the other two
operand matrices is reused approximately
√
M times each time it
is brought into fast memory. While the Resident A algorithm was
described as early as 1991 [18], and all three appear in [10], their
optimality was first noted in [26].
2.2.1 Resident C. e operation MMM Z += XY can be com-
puted by the sequence of rank-1 updates Z += x0y
T
0 + x1y
T
1 + · · · ,
where xi and y
T
i are a row and column of X and Y , respectively.
is is illustrated in Figure 2 (le), where Z is the square block on
the le, X is the middle operand, and Y is the operand on the right.
e vectors xi and y
T
i are represented by the thin partitions of X
and Y .
Suppose we have (larger) matrices C , A, and B. We compute
C += AB in the following way. Partition:
C→
©­­­«
C0,0 · · · C0,n−1
.
.
.
.
.
.
Cm−1,0 · · · Cm−1,n−1
ª®®®¬
,A→
©­­­«
A0
.
.
.
Am−1
ª®®®¬
,B→
(
B0 · · · Bn−1
)
,
2We only consider algorithms that compute the i, j element of m × n matrix C as
γi, j :=
∑k−1
p=0 αi,pβp, j where αi,p and βp, j are the i, p and p, j elements ofm × k
matrix A and k × n matrix B, respectively.
3Modulo lower order terms.
where Ci, j ismc × nc , Ai ismc × k , and B j is k × nc , except at the
margins. en we compute the suboperation Ci, j += AiB j using
the described algorithm for Z += XY . Now,Ci, j is read from slow
memory once at the beginning of the suboperation, and resides in
fast memory during the rest of the duration, and Ai and B j are
streamed one row and column at a time from slow memory. Each
element of each operand is read once for each i, j, and there are
⌈ mmc ⌉⌈
n
nc
⌉ such suboperations. Overall, this algorithm incursmn
reads andmn writes for matrix C , ⌈mnknc ⌉ reads for matrix A, and
⌈mnkmc ⌉ reads for matrix B. Whenmc ≈ nc ≈
√
M 4, the I/O cost is
2mnk/
√
M + 2mn. e highest ordered term in the I/O cost of the
Resident C algorithm is the same as the I/O lower bound for MMM.
us the algorithm is essentially optimal.
2.2.2 Resident A and B. Similarly, in the MMM Z += XY , each
columnofZ can be computed by thematrix-vectormultiplicationzi +=
Xyi , where zi and yi are columns of Z and Y , respectively. is is
illustrated in Figure 2 (middle), Z , X , andY are the le, middle, and
right operands, respectively.
Consider C += AB. Partition:
C →
©­­­«
C0
.
.
.
Cn−1
ª®®®¬
,A→
©­­­«
A0,0 · · · A0,n−1
.
.
.
.
.
.
Am−1,0 · · · Am−1,n−1
ª®®®¬
,B →
©­­­«
B0
.
.
.
Bn−1
ª®®®¬
,
where Ci is mc × n, Ai,p is mc × kc , and Bp is kc × n, except at
the margins. en we compute the suboperation Ci += Ai,pBp
using the described MVM-based algorithm for Z += XY . In Resi-
dent A,Ai,p is read from slowmemory once at the beginning of the
suboperation, and Cj and Bp are streamed from slow memory one
column at a time. e total I/O costs associated with each matrix
are: ⌈mnk
kc
⌉ reads and ⌈mnk
kc
⌉ writes of elements of C ;mk reads of
elements ofA; and ⌈mnkmc ⌉ reads of elements of B. If kc ≈ nc ≈
√
M ,
the input cost is approximately 2mnk/
√
M+nk , and the output cost
is approximatelymnk/
√
M . e input cost aains near the lower
bound on reads from slow memory.
eResident B algorithm is the obvious symmetric equivalent to
the Resident A algorithm, built upon the suboperation in Figure 2
(right). Its I/O costs mirror that of the Resident A algorithm.
e above descriptions “stream” rows and/or columns of two
matrices while keeping a block of the third matrix resident in fast
memory. Notice that one can instead stream row panels instead
of rows and/or column panels instead of columns as long as the
“small” dimension of the panel is small relative to the sizes of the
block that is resident in fast memory. is still achieves the I/O
lower bound modulo a lower order term. is insight becomes
crucial when we discuss blocking for multiple levels of memory.
2.3 Algorithms for different shapes of MMM
e number of reads and writes from slow memory for the Res-
ident A, B, and C algorithms depend on the shape of the input
matrices: ere are cases where one of the algorithms is more ef-
ficient than the other two, where we define efficiency by flops per
memop (I/O operations). ere are 2mnk flops performed during
4mc and nc must be slightly less than
√
M to make room for a row of Ai and a
column of Bj in fast memory.
The MOMMS Family of Matrix Multiplication Algorithms SC19, November 17–27, 2019, Denver, CO
+= += +=
Resident C Shape Resident A Shape Resident B Shape
Data in cache.
Data in main memory.
Figure 2: e three shapes of MMM exposed by the algorithms Resident C (le), Resident A (middle), and Resident B (right).
Each algorithm can be implemented as two loops around its corresponding shape.
MMM, and the I/O lower bound is 2mnk/
√
M . us our goal for
efficiency is
√
M flops per memop. We examine the cases for which
algorithms are efficient, assuming thatm, n, and k are at least
√
M .
Resident C is efficient if and only if k is large. It reads ⌈mnknc ⌉ +
⌈mnkmc ⌉ +mn elements from slow memory during MMM. Ifmc =
nc =
√
M , this is approximately 2mnk/
√
M +mn. is gives an ef-
ficiency of
(
1√
M
+
mn
2k
)−1
. When k is large, this is approximately
√
M . We can analyze Resident A and Resident B similarly. Here
we ignore the I/O cost for writes. If the sizes of the resident blocks
are chosen to be equal to
√
M , Resident B has an efficiency of(
1√
M
+
nk
2m
)−1
, which is approximately
√
M whenm is large. Resi-
dent A has an efficiency of
(
1√
M
+
mk
2n
)−1
, which is approximately
√
M when n is large.
is shows that one must choose the right algorithm depending
on the shape of the problem. For each of the Resident A,B, and
C algorithms, there is a minimal shape that can be implemented
efficiently. For Resident C this occurs when m ≈ n ≈
√
M , and
k is large. For Resident B this occurs when k ≈ n ≈
√
M , and m
is large. For Resident A this occurs whenm ≈ k ≈
√
M , and n is
large. In each case, the resident matrix fits into fast memory, and
the dimension shared by the other two operands should be large
so that the cost of moving the resident matrix into fast memory
can be amortized.
e fact that one must choose a different algorithm for MMM
depending on problem shape and size was previously noted for
distributed memory MMM [19, 24], and for hierarchical memory
MMM [10, 31].
2.4 A balancing act
So far in this section, we have assumed that the costs associated
with accessing an element is the same, no maer if it is an element
of A, B, or C . In doing so, we arrived at the following strategy:
Place a square block of the resident matrix in fast memory, stream-
ing the other two from slow memory. is amortizes the I/O costs
associated with the resident matrix, and equalizes the number of
accesses of the two streamed matrices.
We now re-analyze the Resident A, B, and C algorithms in the
case that the costs associated with accessing elements of the dif-
ferent operands are unequal. is can happen if, e.g., if we add a
third layer of memory of intermediate size and access cost. In this
case, at the start of a multiplication with submatrices one operand
may reside in slowmemory while another resides in some interme-
diate layer. Furthermore, in many cases reads and writes cannot
be overlapped (e.g. main memory is oen not dual-ported), and
hence it is more expensive to access elements of C since C must
be both read and wrien. One way to address this is to select the
algorithm where blocks of the operand that is most expensive to
access are kept in fast memory as much as possible. Another is to
adjust the sizes used for the the resident block in fast memory.
We now walk through an example of the second solution. Sup-
pose we are employing the Resident A algorithm, with anmc ×kc
block of A in fast memory. If the cost of accessing an element of B
costs βB , and accessing an element ofC costs βC , whenm, n, and k
are large, the efficiency in terms of flops per memop is 2mc
βB
+
2kc
βC
.
is is maximized when mc =
√
βB
βC
M and kc =
√
βC
βB
M . With
this, the total cost of I/O (rather than the number of accesses) as-
sociated with accessing the streamed matrices are equalized and
thus the cost minimized (modulo lower order terms).
2.5 Summary
e ingredients to an efficient algorithm are: (1) Fill fast memory
with a submatrix of one of the operands (the resident matrix), (2)
Amortize the I/O cost associated with (1) over enough computa-
tion, (3) Choose dimensions for the resident block that equalize
the I/O costs (rather then the number of accesses) associated with
the two streamed matrices.
3 MULTIPLE LEVELS OF CACHE
We now extend the ideas from Section 2 to MMM for computers
with multiple layers of fast memory. We carefully build up a par-
ticular algorithm for a computer with three layers of memory: a
slow main memory, a medium cache, and a fast cache, with each
level of cache faster and smaller in capacity than the one before it.
We name the fast and medium caches Lf and Lm , and name their
capacitiesMf and Mm , respectively.
We start by partitioning the matrices so that the computation is
orchestrated as a double loop over a particular subproblem, one of
the shapes in Figure 2, that that effectively utilizes the Lm cache.
e question then becomes how to implement the subproblem in
a way that effectively utilizes Lf .
SC19, November 17–27, 2019, Denver, CO Tyler M. Smith and Robert A. van de Geijn
3.1 A motivating example
For our motivating example, we choose the Resident A algorithm
when blocking for Lm .
Effectively utilizing Lm . To effectively utilize Lm , we select the
partitioning that casts computation in terms of a double loop around
themiddle shape in Figure 2. We call this subproblem the Lm block-
panel multiply.
Effectively utilizing Lf . e question now becomes how to or-
chestrate the Lm block-panel multiply in a way that effectively
utilizes Lf . is suggests again a double loop around one of the
shapes in Figure 2. It is not hard to see that creating a double loop
that again implements a Resident A algorithm is problematic: Par-
titioning A for Lf would expose panels of B and C that by design
are too large to fit into Lm , and these panels are used in multiple
Lf subproblems. Either these panels of B or C would need to be
brought into Lm multiple times or the sizes of the various matrix
partitions would need to be reduced. Either way the effect would
be that the operation would no longer be near-optimal with re-
spect to the number of transfers between Lm and slower levels of
memory. We conclude that the block-panel multiply should be im-
plemented in terms of a Resident C or Resident B algorithm. Which
of these depends on the choice of the outer and inner loop, which
we discuss next.
Choosing the outer loop for the Lf cache. In order to aain near
the lower bound, each element in the two long panels of B and C
must be used ≈
√
M times each time it is brought into Lm . is
leads us to first partition along the n dimension with blocksize nc ,
yielding partitions of B and C that are small enough to fit into the
Lm cache along with the block of A.
e inner loop for the Lf cache. e next step is to further parti-
tion the matrices to optimize for Lf . e subproblem exposed by
each iteration of the Lf outer loop is a block of A times a skinny
panel of B updating a skinny panel ofC . e Lf inner loopwill par-
tition this subproblem along one of the two dimensions that the Lf
outer loop did not. We can choose either of these. For this example,
we will choose the k dimension (with blocksize kc ). is Lf inner
loop exposes a new subproblem that we will call the Lf subprob-
lem. In this case, the Lf subproblem is a tall and skinny panel of
A times a kc × nc block of B, updating a tall and skinny panel of
C . If kc ≈ nc ≈
√
Mf , then the Lf subproblem corresponds to the
furthest le shape seen in Figure 2. en, the block of B will reside
in the Lf cache, and the panels ofA andC will be streamed in from
lower levels of cache for the duration of this subproblem.
3.2 Building a Family of Algorithms
In the the motivating example, we started with a problem resem-
bling the middle shape from Figure 2, and used two loops to par-
tition the problem, resulting in a problem resembling one of the
other two problem shapes. is suggests the following methodol-
ogy to optimize for any number of levels of cache: We begin with
one of the three shapes in Figure 2, optimizing for the I/O cost for
the Lh cache. en, to optimize for the next smaller and faster level
of cache, the Lh−1 cache, we first partition the problem along the
long dimension, and then partition along along one of the other
two dimensions. e result is one of the other two shapes shown
in Figure 2. We name the outermost of these two loops the Lh−1
outer loop and the innermost the Lh−1 inner loop. is process is
shown in Figure 3.
We note that [10] claimed that it was locally optimal to encounter
a subproblem that corresponds to one of the three optimal subprob-
lems at every level of the memory hierarchy. However that paper
did not give details on how this could be accomplished, nor did it
analyze the claim in terms of any I/O lower bounds.
3.3 Classifying matrix operands and
algorithms
e two loops for Lh−1 have exposed partitions of matrices that
differ in terms of access frequency and size. From these properties,
we can classify these different matrix partitions.
e Lh resident block is the block that is designed to remain
and reside in the Lh cache during the duration of the Lh subprob-
lem. e other two operands of an Lh subproblem are called the
Lh streamed panels, as small partitions of the streamed panels are
brought into Lh during an iteration of the Lh−1 outer loop, used for
computation, and then not used again during the Lh subproblem.
e Lh−1 inner loop partitions the Lh resident block and one of
the Lh streamed panels. e remaining Lh streamed panel is le
unpartitioned. e matrix partition not partitioned by the Lh−1
inner loop is used during every iteration of the Lh−1 inner loop.
Guided by the principle that each element of the Lh subproblem
should only by read into Lh once, it must remain in cache during
the entire inner Lh−1 loop. We name this matrix partition the Lh
guest panel. Compare this to the resident block of the Lh cache.
e elements of the Lh guest matrix, like the elements of the Lh
resident block, are reused from Lh across iterations of a loop. e
difference is that the Lh resident block is reused across every it-
eration of the outer Lh−1 loop, and the Lh guest matrix is reused
across the iterations of the inner Lh−1 loop.
Aer the two Lh−1 loops, we have exposed one of the three
shapes associated with our algorithms Resident A, Resident B, and
Resident C. e small block that will then reside in Lh−1 will be
known as the Lh−1 resident block.
e algorithms that arise from our methodology can be identi-
fied by the operand that the resident block is from in each level of
cache. We introduce a naming convention for the algorithms that
states the level of cache and the operand that resides in it. For in-
stance if an algorithm has B as the resident block of the L2 cache,
A as the resident block of the L1 cache, and C as the resident block
in registers, it is called B2A1C0.
3.4 Optimizing for registers
In our family of algorithms, we think of the register file as L0: the
smallest and fastest level of cache. For practical reasons, it should
be treated as a special case. In many implementations of MMM,
the innermost kernel implements the Resident C algorithm [7, 11,
29, 30]. ere are good reasons for this. e latency of the com-
putation instructions dictates that there is a minimum number of
registers that must be used to store elements of C to avoid the in-
struction latency becoming a boleneck. e number of elements
of C that are stored in registers must be at least the product of the
The MOMMS Family of Matrix Multiplication Algorithms SC19, November 17–27, 2019, Denver, CO
+= +=
+= +=
+= +=
+= +=
+= +=
+= +=
+= +=
+= +=
+= +=
Lh subproblem. Lh−1 outer loop. Lh−1 inner loop. Lh−1 subproblem.
Resident block of Lh cache (or part of it).
Guest panel of Lh cache.
Resident block of Lh−1 cache.
Figure 3: Possible scenarios when partitioning for Lh and Lh−1 when Resident A (top), Resident B (middle), or Resident C
(bottom) is encountered in Lh .
instruction latency and the number of elementary additions that
can be performed per cycle [20, 33]. Oen this means that a signif-
icant portion of the registers must be dedicated to storing elements
ofC , making it unnatural to use the Resident A or Resident B algo-
rithms for the registers. erefore, it is oen the case that there is
no choice but to use Resident C for the registers.
4 MULTILEVEL CACHE TRADEOFFS
When simultaneously optimizing formultiple levels of cache, there
are tensions between I/O costs at the different levels of cache.
4.1 Optimizing for Lh−1 impacts the Lh I/O cost
When simultaneously optimizing for both Lh−1 and Lh , the size of
the Lh resident block is reduced relative to its size when optimizing
only for Lh , since larger portions of the Lh streamed panels must
fit in Lh . When optimizing for both Lh and Lh−1:
• At minimum, the Lh resident matrix and Lh guest matrix
must fit into Lh .
• If Lh is inclusive, meaning that anything in Lh−1must also
be in Lh , then there must also be space in Lh for the Lh−1
resident matrix.
• If Lh is inclusive and has a LRU policy, then in order for
an element to remain in Lh , fewer thanMh elements may
be accessed in between accesses of the element for it to
remain in cache and hence every matrix partition exposed
by the Lh−1 outer loop must fit in Lh .
ese conditions represent a tradeoff between optimizing for Lh
and Lh−1. e larger Lh−1 is, the more data must fit into it, and
the smaller the Lh resident block can be. With the simplifying
assumptions that the resident blocks of both Lh and Lh−1 must be
square, the I/O cost for Lh when optimizing for both Lh and Lh−1
can be determined by the ratioMh/Mh−1.
Sometimes, it is counter-productive or of limited value to opti-
mize for Lh−1 when optimizing for Lh . In this case:
(1) A simple option is to treat Lh−1 as if it were smaller than
it is, reducing the size of the Lh−1 resident block.
(2) If Lh is LRU, another option is to tweak the blocksizes for
Lh−1 slightly. e portions of the Lh streamed panels that
must fit into Lh alongside the Lh resident block depends
on the tiling of theLh−1 outer loop but not on the blocksize
of the Lh−1 inner loop. erefore, one can tweak the shape
of the Lh−1 resident block accordingly.
(3) A third option is that one could “skip” optimizing for Lh−1
and instead simultaneously optimize for Lh and Lh−2.
Blocking for the Lh−1 cache adversely affects the number of
transfers into and out of the Lh cache but blocking for further
(smaller and faster) levels of cache does not, because the entire
Lh−2 subproblem fits within the data that must be in the Lh cache.
4.2 Optimizing for Lh impacts the Lh−1 I/O cost
Simultaneously optimizing for Lh and Lh−1 adversely effects trans-
fers into and out of Lh . We now argue that it also has an adverse
effect on the transfers into and out of Lh .
When optimizing for only one cache of sizeM , the streamed ma-
trices are each associatedwith an aggregate I/O cost of≈mnk/
√
M .
When optimizing for both Lh and Lh−1, however, the I/O cost asso-
ciated with the Lh−1 resident matrix becomes cubic because each
element of the Lh−1 resident matrix is moved into Lh−1 once per
Lh subproblem.
SC19, November 17–27, 2019, Denver, CO Tyler M. Smith and Robert A. van de Geijn
When optimizing for both Lh and Lh−1, the I/O cost associated
with the Lh−1 resident matrix will be ≈mnk/
√
Mh , whereas when
only optimizing for the Lh−1 cache, the I/O cost associated with
the Lh−1 resident matrix is equal to the number of compulsory
reads and writes. e I/O costs for the streamed matrices are not
affected.
While optimizing for the Lh cache has increased the Lh−1 I/O
cost, optimizing for Lh+1, Lh+2, etc., does not affect it, because
optimizing for further levels of cache does not reduce the number
of times each element is used every time it is brought into the Lh−1
cache.
4.3 Skipping caches
We have seen that tradeoffs occur when simultaneously optimiz-
ing for the I/O cost of multiple levels of cache. Sometimes these
tradeoffs are too great, so instead of optimizing for both Lh and
Lh−1, one may forego the Lh−1 cache, and instead simultaneously
optimize for Lh and Lh−2 I/O costs, where the Lh−1 cache is in-
termediate between Lh and Lh−2. We call this skipping the Lh−1
cache.
When the Lh−1 cache is skipped, an optimal subproblem is en-
countered at the Lh level and at the Lh−2 level, but not at the Lh−1
level, However, this does not mean that the Lh−1 is not useful. Re-
call that the Lh guest matrix is reused during each iteration of the
Lh−2 inner loop. In the right circumstances, this Lh guest matrix
may be instead reused in the Lh−1 cache, if that cache is skipped.
• In idealized circumstances, only theLh guest matrix should
need to be in the Lh−1 cache.
• If the Lh−1 cache is LRU, then a panel of the Lh resident
matrix must also fit into the Lh−1 cache.
• If the Lh−1 cache is inclusive, then the Lh−2 resident block
must also fit into the Lh−1 cache.
In this case, the Lh guest matrix is reused from Lh−1, but is not
square, and the I/O cost associated with reading the other two
operands is suboptimal. Furthermore, in many cases this panel
occupies only a fraction of Lh−1, reducing its size and further in-
creasing the I/O cost.
Goto’s algorithm is a member of the MOMMS family. It skips
optimizing for the L3 and L1 caches. Since A is the resident ma-
trix of the L2 cache, and C is the resident matrix of the registers,
Goto’s algorithm is named A2C0 according to the convention in
Section 3.3.
5 EXPERIMENTS
We now evaluate algorithms created by our methodology. We do
so by performing experiments on architectureswith a varying num-
ber of levels of cache.
For current CPUs, Goto’s algorithmaains excellent performance [28]
that is difficult to exceed despite the fact that it does not aain close
to the I/O lower bound on computers with an L3 cache. In order
to evaluate the I/O cost of different algorithms, we artificially vary
the cost of accessing main memory. In all experiments, we perform
double precision MMM.
5.1 Experimental setup
We have implemented the described family of algorithms as the
Multilevel OptimizedMatrix-MatrixMultiplication Sandbox (MOMMS).
MOMMS implements algorithms for MMM by composing compo-
nents like matrix partitioning, packing, and parallelization at com-
pile time. MOMMS is wrien in Rust [21], a modern system pro-
gramming language focusing onmemory safety. Most of this safety
is enforced at compile-time through Rust’s borrow checker. In
Rust, memory is freed when it goes out of scope, and there is no
garbage collector. From Rust, one can call C functions with very
low overhead. For low-level kernels, MOMMS calls the BLIS micro-
kernel [29] coded in C and inline assembly language.
We custom built two computers. One has an Intel i7-7700K CPU
with two 8GBDIMMS of DDR4-3200 RAM and amotherboardwith
an Intel Z270 chipset. e other has an Intel i7-5775C CPU with
two 8GB DIMMS of DDR3-2400 RAM and a motherboard with an
Intel Z97 chipset. We refer to these computers by their proces-
sor names. We chose the Z270 and Z97 chipset motherboards be-
cause these are enthusiast motherboards for consumers interested
in overclocking, and they provide the ability to change the mem-
orymultiplier. e i7-7700K computer has a 4-core Intel Kaby Lake
CPU with 64KB L1, 256KB L2 , and 6MB L3 caches. We chose this
because it is a recent readily available Intel processor with an L3
cache. e i7-5775C is a 4-core Intel Broadwell CPU. It also has
64KB L1, 256KB L2,and 6MB L3 caches. Most notably it has 128MB
of eDRAM, functioning as an L4 cache.
All experiments were performed with hyperthreading disabled.
A userspace CPU governorwas used to set the CPUs to the nominal
CPU frequency: 4.2 GHz for the i7-7700K and 3.3 GHz for the i7-
5775C.
e bandwidth to main memory can be determined by the prod-
uct of the number of memory channels, the base clock rate, the
number of bytes per transfer, and the memory multiplier. With
DDR RAM, this is doubled since it transfers on both the leading
and trailing edges of the clock signal. We increase the ratio of the
rate of I/O to the rate of computation via the BIOS seings. Reduc-
ing the memory multiplier and the number of memory channels
decreases the rate of I/O without changing the rate of computa-
tion.
5.2 Optimizing for the L3 cache
We here describe an algorithm implemented in MOMMS that op-
timizes for both L3 and L2 labeled B3A2C0. We compare this al-
gorithm to our re-implementation of Goto’s algorithm (also im-
plemented in MOMMS), and to vendor and state-of-the-art open
source BLAS [4] implementations. Figure 4 compares Goto’s algo-
rithm with other MOMMS algorithms optimized for both the I/O
cost of L3 and L2.
We now describe the B3A2C0 algorithm as implemented for the
i7-7700K and illustrated in Figure 4 (second from the le). First, we
partition for L3 cache. e L3 outer loop partitions the matrices
in the n dimension with blocksize 768. en the L3 inner loop
partitions in the k dimension, also with blocksize 768. is reveals
a 768×768 block of B that becomes the L3 resident matrix. Next, we
partition for the L2 cache. Since B is the L3 resident matrix, the L2
outer loopmust be in them dimension, and it is with blocksize 120.
The MOMMS Family of Matrix Multiplication Algorithms SC19, November 17–27, 2019, Denver, CO
Goto’s algorithm B3A2C0 Algorithm A3B2C0 Algorithm C3A2C0 Algorithm
+=
+=
+=
+=
Partition n with blocksize 3000
Partition k with blocksize 192
Partitionm with blocksize 120
Inner kernel
Block is reused in L3 cache.
Block is reused in L2 cache.
+=
+=
+=
+=
+=
Partitionm with blocksize 768
Partition k with blocksize 768
Partition n with blocksize 120
Partition k with blocksize 192
Inner kernel
+=
+=
+=
+=
+=
Partition n with blocksize 768
Partition k with blocksize 768
Partitionm with blocksize 120
Partition k with blocksize 192
Inner kernel
+=
+=
+=
+=
+=
Partition n dimension with blocksize 624
Partitionm dimension with blocksize 624
Partition k dimension with blocksize 156
Partitionm dimension with blocksize 156
Inner kernel
Figure 4: Four MOMMS algorithms for MMM.
e L2 inner loop then partitions the k dimension with blocksize
192, making a block of A resident in L2, and a 120 × 768 panel of
C the guest matrix of L3. We skip L1, since it is a quarter the size
of L2, making it is not beneficial to optimize for both L2 and L1.
e next two loops make a 4 × 12 block of C the resident matrix
of registers, and a 192 × 12 panel of B, the guest panel of L2. is
guest panel of L2 is designed to be reused in the (skipped) L1 cache.
Finally, we call a 4 × 12 micro-kernel provided by BLIS [29].
We compare this to Goto’s algorithm with similar blocksizes as
follows: nc is 3000,kc is 192,mc is 120,mr is 4, andnr is 12. Our im-
plementation of Goto’s algorithm uses the samemicro-kernel from
BLIS as does B3A2C0. For both algorithms, we parallelize the sec-
ond loop around the micro-kernel with 4 threads. is quadruples
the bandwidth requirements of our algorithms without increasing
the amount of the L3 cache that must be set-aside for elements of
A [27].
Rooflines. e roofline model is a simple model used to give an
upper bound on performance based on the arithmetic intensity of
an algorithm for a specific computer [32]. e computer is charac-
terized by its rate of computation and the rate at which it can trans-
fer data betweenmainmemory and cache. e arithmetic intensity
of an algorithm is the number of flops per byte transferred between
memory and cache during the execution of that algorithm. When
the arithmetic intensity is low it is bandwidth bound, and when
the arithmetic intensity is high it is compute bound. e roofline
model is thus a plotwhere the x-axis is the arithmetic intensity and
the y-axis is maximum rate of computation for that arithmetic in-
tensity. e roofline that serves as an upper bound on performance
is formed by two linear curves that intersect when the minimum
time spent for computation for an algorithm is equal to the min-
imum time spent for I/O. Algorithms are ploed on the roofline
1 2 4 8 16 32 64 128 256 512
10
100
flops per byte
G
FL
O
P
S
Roofline Single Channel DDR-800
Roofline Dual Channel DDR-3200
MOMMS Goto
MOMMS B3A2C0
Figure 5: Roofline models for i7-7700K at 4.2GHz with 4
cores under two bandwidth conditions. e y-axis is mea-
sured and the x-axis is theoretical. e higher points for
each algorithm represent high-bandwidth performance and
the lower points represent low-bandwidth performance.
model according to their arithmetic intensity and measured per-
formance as a way to explain their performance and to explain
whether or not they could perform beer. One can either measure
the arithmetic intensity of an algorithm or analyze it. We choose
to analyze the arithmetic intensity of the algorithms ploed.
When the matrices are large, Goto’s algorithm has an efficiency
of
(
1
kc
+
1
2nc
)−1
flops per element. With the blocksizes we used,
this is 23.26 flops per byte. e algorithm B3A2C0, with a 768×768
block of B in the L3 cache, has an efficiency of 64 flops per byte.
In Figure 5, we show the roofline model for the i7-7700K for
the case of one channel of DDR4-800 RAM, and for the case of
SC19, November 17–27, 2019, Denver, CO Tyler M. Smith and Robert A. van de Geijn
0 1000 2000 3000 4000
0
100
200
m = n = k
G
FL
O
P
S
1 channel of DDR4-800
0 1000 2000 3000 4000
0
100
200
m = n = k
1 channel of DDR4-1200
0 1000 2000 3000 4000
0
100
200
m = n = k
G
FL
O
P
S
1 channel of DDR4-1600
0 1000 2000 3000 4000
0
100
200
m = n = k
2 channels of DDR4-3200
MOMMS Goto
MOMMS B3A2C0
Figure 6: Performance on i7-7700K for square matrices,
varying problem size and available bandwidth. Matrices are
stored prepacked.
two channels of DDR4-3200 RAM. ese cases represent the min-
imum and the maximum memory bandwidth that we configure
the computer for. We plot the modeled efficiency of Goto’s al-
gorithm and the algorithm B3A2C0 against each algorithm’s mea-
sured performance. e roofline plot clearly shows that in the high-
bandwidth case, either algorithm is capable of achieving the peak
performance of the CPU based on its arithmetic intensity, but for
the low-bandwidth case, only B3A2C0 can. e improved arith-
metic intensity is caused by its more effective utilization of the L3
cache.
Varying Bandwidth. Figure 6 reports the achieved performance
of Goto’s algorithm and B3A2C0 for square matrices, varying the
amount of bandwidth to main memory. Packing is oen used to
achieve spatial locality during an algorithm. Otherwise blocks that
are designed to reside in cache may not be able to do so due to
cache conflict issues [12]. Packing incurs extra memory move-
ments that do not fundamentally need to happen during MMM.
is paper is concerned with the fundamentals of temporal local-
ity during MMM, and hence we sidestep the spatial locality issue
by storing matrices “prepacked” such that every time a matrix is
partitioned, the blocks are stored contiguously. is lets us sepa-
rate the issues of temporal and spatial locality in our experiments5.
At low bandwidth, B3A2C0 outperforms Goto’s algorithm by
thirty to forty percent. As the and the gap eventually disappears.
Comparing with existing implementations. In Figure 7, we com-
pare our implementations of Goto’s algorithm and B3A2C0 against
the dgemm routines in ATLAS [31] (3.10.3), BLIS (0.2.1), and Intel’s
Math Kernel Library (MKL 2017 Release 2) [14]. It would not be
fair to compare against implementations of MMM if we did not
need to pack, so for this experiment, input matrices are stored in
5Others have avoided packing for practical reasons. BLASFEO [5] operates on so-
called panel-major matrices for performance on small matrices. e panel-major for-
mat is similar to the format used in Goto’s algorithm for the packed panel of B. An-
other library, libxsmm[11], also targets smallmatrices, and operates on column-major
matrices, but does not perform packing.
0 1000 2000 3000 4000
0
100
200
m = n = k
G
FL
O
P
S
1 channel of DDR4-800
0 1000 2000 3000 4000
0
100
200
m = n = k
2 channels of DDR4-3200
MOMMS Goto
MOMMS B3A2C0
MOMMSC3A2C0
BLIS
MKL
ATLAS
Figure 7: Comparison with state-of-the-art open source and
vendor libraries, for both high and low bandwidth scenarios.
Matrices stored in column-major order.
columnmajor order, and our implementations of Goto’s algorithm
and B3A2C0 pack matrices the first time they become the resident
or guest matrix at some level of cache. is packing (and the fact
thatC is not stored hierarchically for Goto’s algorithm) account for
the performance difference seen for the Goto and B3A2C0 curves
between Figures 6 and 7. We see that for high bandwidth scenar-
ios, BLIS, the MOMMS implementation of Goto’s algorithm, and
B3A2C0 all aain roughly 75% of peak, and that MKL outperforms
the other implementations. For low bandwidth, implementations
that use Goto’s algorithm (or something similar) exhibit poorer
performance as they do not effectively utilize the L3. In this case,
C3A2C0 performs best, with B3A2C0 close behind. For large prob-
lem sizes, ATLAS performs almost as well as the algorithms imple-
mented in MOMMS that optimize for the L3 I/O cost but it does
not perform nearly as well for the high bandwidth case.
5.3 Optimizing for the L4 cache
In this section, we demonstrate that our methodology can be effi-
ciently applied to the Intel i7-5775C, which has four levels of cache,
where the L4 cache is 128MB of eDRAM. We implemented an al-
gorithm called C4A2C0 for this architecture. Figure 8 shows the
loop ordering and the blocksizes used for C4A2C0. In C4A2C0, a
3600×3600 block ofC resides in the L4 cache, and a 120×192 block
ofA resides in the L2 cache. We decided to skip blocking for the L3
cache, as there is sufficient bandwidth from the L4 cache without
optimizing for the number of L3 cache misses. Nevertheless, the
guest matrix of the L4 cache, a 192 × 3600 panel of B, is appropri-
ately sized to remain in the L3. C4A2C0 uses the same inner kernel
as B3A2C0.
In Figure 9, we compare the performance of Goto’s algorithm
and C4A2C0 for square matrices across several bandwidths. In
this experiment, matrices are stored hierarchically, and so pack-
ing is not performed. For high bandwidths, Goto’s algorithm and
C4A2C0 exhibit similar performance, but when bandwidth is low,
C4A2C0 outperforms Goto’s algorithm for large problem sizes.
Figure 10 compares the performance on square matrices of our
implementations of Goto’s algorithm and C4A2C0. Here, matrices
are stored in column-major order and accordingly packing is per-
formed when partitions of A and B become resident or guest ma-
trices of some level of cache. In C4A2C0, C is unpacked when it is
The MOMMS Family of Matrix Multiplication Algorithms SC19, November 17–27, 2019, Denver, CO
+=
+=
+=
+=
+=
Partitionm dimension with blocksize 3600
Partition n dimension with blocksize 3600
Partition k dimension with blocksize 192
Partitionm dimension with blocksize 120
Inner kernel
Block is reused in L4 cache.
Block is reused in L3 cache.
Block is reused in L2 cache.
Figure 8: Algorithm C4A2C0 that optimizes for both L4 and
L2. It places a square block of C in L4, and a square block of
A in L2. L3 is “skipped”, with the and the L4 guest panel, B,
reused from L3.
no longer resident in L4. In both Figures 9 and 10, the top of the
graphs is the peak computational rate of the CPU.
Because L4 is so large, we ran quite large problems since oth-
erwise the matrices would completely fit into cache. Performance
for Goto’s algorithm and MKL do not fall off until the problem size
becomesm = n = k ≈ 5000. We can see that Goto’s algorithm does
not optimally use L4 and neither does Intel’s MKL.
While BLIS’s performance does not fall off as severely as for the
other implementations when the problem size grows, its overall
performance is not as high. e algorithmic differences between
BLIS and theMOMMS implementation of Goto’s algorithm are par-
allelism and blocksizes. BLIS uses a larger kc and a smaller mc
than MOMMS and parallelizes the 2nd and 3rd loops around the
micro-kernel, whereas MOMMS parallelizes the 2nd loop around
the micro-kernel. Modifying either the parallelism or the block-
sizes so that they match that of the MOMMS implementation of
0
50
00
10
00
0
15
00
0
0
50
100
150
200
m = n = k
G
FL
O
P
S
1 channel of DDR3-800
0
50
00
10
00
0
15
00
0
0
50
100
150
200
m = n = k
1 channel of DDR3-1066
0
50
00
10
00
0
15
00
0
0
50
100
150
200
m = n = k
G
FL
O
P
S
1 channel of DDR3-1333
0
50
00
10
00
0
15
00
0
0
50
100
150
200
m = n = k
2 channels of DDR3-2400
MOMMS Goto
MOMMSC4A2C0
Figure 9: Performance for square matrices, varying prob-
lem size and available bandwidth. Matrices are stored
prepacked.
0
50
00
10
00
0
15
00
0
0
50
100
150
200
m = n = k
G
FL
O
P
S
1 channel of DDR-800
0
50
00
10
00
0
15
00
0
0
50
100
150
200
m = n = k
2 channels of DDR3-2400
MOMMS Goto
MOMMSC4A2C0
BLIS
MKL
Figure 10: Performance on i7-5775Cwith an 128MB L4 cache
for low bandwidth scenario (1 channel of DDR3-800) and a
high bandwidth scenario (2 channels of DDR3-2400).
Goto’s algorithm adversely affects performance for the low band-
width case, causing a noticeable dropoff for larger matrices. We
postulate that somehow the way that data is shared by the threads
within BLIS, coupled with the larger value of kc within BLIS, (or
the smallermc ) fosters beer reuse of data within the L4 cache.
With DDR-800, all implementations of MMM on the i7-5775C
outperform those on the i7-7700K, despite the fact that the former
processor is two generations older. e large L4 cache means that
blocksizes for the C4A2C0 algorithm can be very large, so the al-
gorithm does not need much bandwidth from main memory, but
even algorithms that do not take advantage of L4 by using such
large blocksizes benefit from having the 128MB cache. e large
capacity cache can facilitate the hiding of latency to main memory,
through techniques such as hardware prefetching.
5.4 Algorithms for different shapes of matrices
AlgorithmA3B2C0 partitions the matrices such that a square block
of A is resident in L3 and a block of B is resident in L2. It then
calls an inner kernel updating a panel of C whose elements are in
SC19, November 17–27, 2019, Denver, CO Tyler M. Smith and Robert A. van de Geijn
L3 by multiplying a block of A whose elements are in L2 times a
panel of B whose elements are in L3. AlgorithmC3A2C0 partitions
the matrices such that a square block ofC is resident in the L3 and
a block of A is resident in L2. It then calls the same inner kernel
as the algorithm B3A2C0 does. Blocksizes and loop orderings for
algorithms A3B2C0 and C3A2C0 are shown in Figure 4.
AlgorithmsA3B2C0, B3A2C0, andC3A2C0 represent three choices
for blocking for L3 . In Section 2.3, we argued that each of these
choices may be optimal for a specific problem shape where two di-
mensions are equal to
√
M3 and the other dimension is large, and
selecting the wrong algorithm for a problem shape can result in an
I/O cost that is 50% greater.
On a computer with three levels of cache and low bandwidth,
we claim the following: A3B2C0 casts its computation in terms of
a block-panel multiply, with a block ofA in L3, and so it should be
the best choice of the three algorithms whenm = k ≈ √M3, and
n is large. Similarly, B3A2C0 casts its computation in terms of a
panel-block multiply, with a block of B in L3, and so it should be
the best choice of the three algorithms when n = k ≈ √M3, andm
is large. Finally, C3A2C0 casts its computation in terms of a block
dot product multiply, with a block of C in L3, and so it should be
the best of the three algorithms whenm = n ≈ √M3, and k is large.
Figure 11 reports results using A3B2C0, B3A2C0, C3A2C0, and
Goto’s algorithm, with matrices stored hierarchically and no pack-
ing is performed. We vary the shape of the matrices. In each case,
two of the dimensions are set to 768, and one of the dimensions
is varied along the x-axis. e experiments were performed on
the Intel i7-7700K, with the DDR speed set to a single channel of
DDR4-800. When the dimension that is allowed to vary is large, the
predicted algorithm outperforms the others. We also show perfor-
mance when the matrices are square, and the size varies along the
x-axis. For our algorithms that optimize for the L3 cache, there is
very lile performance difference between the square case and the
case where an algorithm is the “correct” choice. We conclude that
when executing MMM, optimal I/O properties are aainable two
dimensions are at least the square root of the last level cache size,
as long as the third dimension is much larger.
e algorithmsA3B2C0, B3A2C0, andC3A2C0 outperformGoto’s
algorithm for larger problem sizes in this low bandwidth scenario.
is is because even when the algorithm is wrong for the problem
shape, the I/O cost is only 50% higher. In comparison, on this com-
puter, Goto’s algorithm has an I/O cost that is approximately two
times greater than the optimal algorithm.
6 SUMMARY
We have developed a new family of algorithms for MMM that effec-
tively utilizes a cache hierarchy with multiple layers of fast mem-
ory, using two loops at each level of the memory hierarchy. We
then demonstrated performance improvements over state-of-the-
art implementations of MMM when I/O cost to main memory is
a limiting factor. Algorithms like this are key to delaying the in-
evitable situation where MMM becomes memory bound.
We chose to focus on potential performance benefits of these
algorithms and demonstrated those benefits in our experiments.
However memorymovements cost far more energy than flops do [25],
0 1000 2000 3000 4000
0
100
200
m
G
F
L
O
P
S
n = k = 600
0 1000 2000 3000 4000
0
100
200
k = n
m = 600
0 1000 2000 3000 4000
0
100
200
n
G
F
L
O
P
S
m = k = 600
0 1000 2000 3000 4000
0
100
200
m = k
n = 600
0 1000 2000 3000 4000
0
100
200
k
G
F
L
O
P
S
m = n = 600
0 1000 2000 3000 4000
0
100
200
m = n
k = 600
0 1000 2000 3000 4000
0
100
200
m = n = k
G
F
L
O
P
S
Square
MOMMS Goto
MOMMS A3B2C0
MOMMS B3A2C0
MOMMSC3A2C0
Figure 11: Comparison for different shapes on i7-7700K
with low-bandwidth scenario (1 channel of DDR4-800).
so algorithms that reduce I/O costs are beneficial for the additional
reason that they reduce energy usage.
Many algorithms in libraries that implement higher-level linear
algebra, such as LAPACK [1] or libflame [9], take advantage of
the fact that MMM implementations in BLAS libraries are efficient
when the k dimension is relatively small, on the order of a cou-
ple of hundred, as Goto’s Algorithm reaches its maximal efficiency
when k is equal to the blocksize kc . We expect future computers to
be bandwidth boundwhen executingMMM in such situations, and
algorithms for MMM that have larger blocksizes will be used. To
take advantage of algorithms that use larger blocksizes, LAPACK
and FLAME can use larger blocksizes, however this is currently
disadvantageous because the larger their blocksizes are, the more
time is spent during inefficient unblocked subproblems. According
to this line of thought, LAPACK and FLAME would benefit from
algorithms that do not suffer from this weakness. One possibility
is to use recursive algorithms, as advocated in Peise and Bienti-
nesi [23].
Hard drives and other similarly slow storage devices can be
thought of as another layer of the memory hierarchy. Because
The MOMMS Family of Matrix Multiplication Algorithms SC19, November 17–27, 2019, Denver, CO
of this, we believe the methodology in this paper can be used to
instantiate out-of-core algorithms for MMM. A major difference
between such out-of-core algorithms and the ones in this paper
targeting LRU caches is that out-of-core algorithms may require
explicit transfers to disk and explicit overlapping of I/O and com-
putation.
Future work is to generalize the MOMMS family of algorithms
to other dense linear algebra operations, much in the way that
Goto’s algorithm [7]was generalized to the rest of the level-3 BLAS [8].
e key point is that most suboperations during the other level-3
BLAS operations (that operate on structured matrices) are regular,
unstructured MMM operations [16].
REFERENCES
[1] Edward Anderson, Zhaojun Bai, Christian Bischof, L Susan Blackford, James
Demmel, Jack Dongarra, Jeremy Du Croz, Anne Greenbaum, Sven Hammarling,
Alan McKenney, et al. 1999. LAPACK Users’ guide. SIAM.
[2] Kent Czechowski, Casey Baaglino, Chris McClanahan, Aparna Chan-
dramowlishwaran, and Richard W Vuduc. 2011. Balance Principles for
Algorithm-Architecture Co-Design. HotPar 11 (2011), 9–9.
[3] Jack Dongarra, Jean-Franc¸ois Pineau, Yves Robert, Zhiao Shi, and Fre´de´ric
Vivien. 2008. Revisiting matrix product on master-worker platforms. Interna-
tional Journal of Foundations of Computer Science 19, 06 (2008), 1317–1336.
[4] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S. Duff. 1990. A
set of level 3 basic linear algebra subprograms. ACM Transactions on Mathemat-
ical Soware (TOMS) 16, 1 (1990), 1–17.
[5] Gianluca Frison, Dimitris Kouzoupis, Tommaso Sartor, Andrea Zanelli, and
Moritz Diehl. 2018. BLASFEO: Basic linear algebra subroutines for embedded
optimization. ACM Transactions on Mathematical Soware (TOMS) 44, 4 (2018),
42.
[6] KazushigeGoto and Robert van de Geijn. 2002. On reducing TLBmisses in matrix
multiplication. Technical Report TR-02-55. Department of Computer Sciences,
e University of Texas at Austin.
[7] Kazushige Goto and Robert van de Geijn. 2008. Anatomy of high-performance
matrix multiplication. ACM Transactions on Mathematical Soware (TOMS) 34,
3 (May 2008), 12.
[8] Kazushige Goto and Robert van de Geijn. 2008. High-performance implementa-
tion of the level-3 BLAS. ACM Transactions on Mathematical Soware (TOMS)
35, 1 (2008), 4.
[9] JohnA. Gunnels, Fred G. Gustavson, GregM. Henry, and Robert A. van de Geijn.
2001. FLAME: Formal linear algebra methods environment. ACM Transactions
on Mathematical Soware (TOMS) 27, 4 (2001), 422–455.
[10] John A. Gunnels, Greg M. Henry, and Robert A. van de Geijn. 2001. A Family
of High-Performance Matrix Multiplication Algorithms. In Proceedings of the
International Conference on Computational Sciences. Springer-Verlag, 51–60.
[11] Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. 2016.
LIBXSMM: accelerating small matrix multiplications by runtime code genera-
tion. In Proceedings of the International Conference for High Performance Com-
puting, Networking, Storage and Analysis. IEEE Press, 84.
[12] GregHenry. 1992. BLAS based on block data structures. Technical Report. Cornell
University.
[13] Jia-Wei Hong and Hsiang-Tsung Kung. 1981. I/O complexity: e red-blue peb-
ble game. In Proceedings of the thirteenth annual ACM symposium on eory of
computing. ACM, 326–333.
[14] Intel. [n. d.]. Math Kernel Library. ([n. d.]). hps://soware.intel.com/mkl
[15] Dror Irony, Sivan Toledo, and Alexander Tiskin. 2004. Communication lower
bounds for distributed-memory matrix multiplication. J. Parallel and Distrib.
Comput. 64, 9 (2004), 1017–1026.
[16] Bo Ka˚gstro¨m, Per Ling, and Charles Van Loan. 1998. GEMM-basedLevel 3 BLAS:
High Performance Model Implementations and Performance Evaluation Bench-
mark. ACM Transactions on Mathematical Soware (TOMS) 24, 3 (1998), 268–
302.
[17] Hsiang-Tsung Kung. 1985. Memory requirements for balanced computer archi-
tectures. Journal of Complexity 1, 1 (1985), 147–157.
[18] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf. 1991. e cache per-
formance and optimizations of blocked algorithms. In ACM SIGARCH Computer
Architecture News, Vol. 19. ACM, 63–74.
[19] Jin Li, Anthony Skjellum, and Robert D. Falgout. 1996. A poly-algorithm for
parallel dense matrix multiplication on two- dimensional process grid topologies.
Master’s thesis. Mississippi State University. Department of Computer Science.
[20] Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. intana-
Ortı´. 2016. Analytical modeling is enough for high-performance BLIS. ACM
Transactions on Mathematical Soware (TOMS) 43, 2 (2016), 12.
[21] Nicholas D. Matsakis and Felix S. Klock II. 2014. e rust language. In ACM
SIGAda Ada Leers, Vol. 34. ACM, 103–104.
[22] JD McCalpin. 1995. Memory bandwidth and machine balance in high perfor-
mance computers. IEEE Technical Commiee on Computer Architecture (TCCA)
Newsleer (December 1995).
[23] Elmar Peise and Paolo Bientinesi. 2017. Algorithm 979: Recursive Algorithms
for Dense Linear Algebrafie ReLAPACK Collection. ACM Transactions on
Mathematical Soware (TOMS) 44, 2 (2017), 16.
[24] MartinD. Schatz, Robert A. van de Geijn, and Jack Poulson. 2016. Parallelmatrix
multiplication: A systematic journey. SIAM Journal on Scientific Computing 38,
6 (2016), C748–C781.
[25] John Shalf, Sudip Dosanjh, and John Morrison. 2010. Exascale computing tech-
nology challenges. In International Conference on High Performance Computing
for Computational Science. Springer, 1–25.
[26] Tyler M. Smith, Bradley Lowery, Julien Langou, and Robert A van de Geijn.
2019. A Tight I/O Lower Bound for Matrix Multiplication. arXiv preprint
arXiv:1702.02017 (2019).
SC19, November 17–27, 2019, Denver, CO Tyler M. Smith and Robert A. van de Geijn
[27] Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond,
and Field G. Van Zee. 2014. Anatomy of high-performance many-threaded ma-
trix multiplication. In 2014 IEEE 28th International Parallel and Distributed Pro-
cessing Symposium. IEEE, 1049–1059.
[28] Field G. Van Zee, Tyler M. Smith, Bryan Marker Bryan, Tze Meng Low, Robert
van de Geijn, Francisco D. Igual, Mikhail Smelyanskiy, Xianyi Zhang, Michael
Kistler, Vernon Austel, John Gunnels, and Lee Killough. 2016. e BLIS frame-
work: Experiments in portability. ACM Transactions on Mathematical Soware
(TOMS) 42, 2 (2016), 12.
[29] Field G. VanZee and RobertA. van de Geijn. 2015. BLIS: A framework for rapidly
instantiating BLAS functionality. ACM Transactions on Mathematical Soware
(TOMS) 41, 3 (2015), 14.
[30] Qian Wang, Xianyi Zhang, Yunquan Zhang, and Qing Yi. 2013. AUGEM: auto-
matically generate high performance dense linear algebra kernels on x86 CPUs.
In Proceedings of the International Conference on High Performance Computing,
Networking, Storage and Analysis. ACM, 25.
[31] R. Clint Whaley and Jack J. Dongarra. 1998. Automatically tuned linear algebra
soware. In SC’98: Proceedings of the 1998 ACM/IEEE conference on Supercom-
puting. IEEE, 38–38.
[32] Samuel Williams, Andrew Waterman, and David Paerson. 2009. Roofline: An
insightful visual performancemodel for multicore architectures. Commun. ACM
52, 4 (2009), 65–76.
[33] Kamen Yotov, Xiaoming Li, Gang Ren, MJS Garzaran, David Padua, Keshav
Pingali, and Paul Stodghill. 2005. Is search really necessary to generate high-
performance BLAS? Proc. IEEE 93, 2 (2005), 358–386.
