As the ratio between the rate of computation and rate with which data can be retrieved from various layers of memory continues to deteriorate, a question arises: Will the current best algorithms for computing matrix-matrix multiplication on future CPUs continue to be (near) optimal?
INTRODUCTION
For almost two decades, the so-called Goto's algorithm for matrixmatrix multiplication (MMM) has guided practical implementations on current CPUs [6, 7] .
e algorithm orchestrates computation so as to keep a packed copy of a roughly square submatrix (block) of A in the L2 cache and a packed copy of a row panel of B in the L3 cache. Major innovations of Goto's algorithm include staging a block of A in the L2 cache rather than the L1 cache to reduce the amount of data movement by allowing the block of A to be larger, packing the block of A and panel of B into speciallyforma ed contiguous buffers for be er spatial locality, and showing that translation-lookaside buffer (TLB) misses can be a performance impediment that is alleviated by reducing the footprint of the block of A.
For years, the rate of peak computation and the memory movement have been diverging [22] , and MMM has been predicted to soon become a memory-bound operation based on such hardware trends [2] , taking into the hardware requirements for this computation to be balanced [17] . While Goto's algorithm is well-suited to Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permi ed. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. SC19, Denver, CO the relative speeds of caches in current memory hierarchies, we show, through analysis and empirical studies, that this will not continue to be the case as bandwidths between various memory layers continue to deteriorate relative to the rate of computation. Figure 1 reports the a ained performance of various implementations on a custom-built computer that allows the bandwidth to different memory layers to be artificially reduced 1 . e curve labeled MOMMS Goto uses Goto's algorithm and we believe MKL uses an algorithm similar to it, and the curve labeled MOMMS C 3 A 2 C 0 uses an algorithm that more effectively utilizes the L 3 cache. e performance degradation of Goto's algorithm on such an architecture is significant. As MMM becomes near memory bound, it is essential that algorithms for this extremely widely-used operation to utilize the memory hierarchy as effectively as possible.
is paper first reviews recent theoretical results by Smith et al. [26] that establish an (essentially) tight lower bound on the memory traffic incurred by a MMM under a simple model. It then makes a number of new contributions:
• With these theoretical results as a foundation, it proposes the Multilevel Optimized Matrix-matrix Multiplication Sandbox (MOMMS) family of practical algorithms of which Goto's algorithm is a member.
• It analyzes tradeoffs in the number of transfers between layers of the memory hierarchy that arise when simultaneously optimizing for multiple levels of cache. ese tradeoffs are not explained by current state-of-the-art theoretical results.
• It analytically exposes different scenarios under which different algorithms in the family exhibit beneficial characteristics.
• It empirically demonstrates the benefits of different algorithms on custom-built hardware that allows memory bandwidths to be varied. Together, these lay the foundation for practical solutions if and when the balance between computation and memory bandwidth changes in the future.
THEORY AND FUNDAMENTAL SHAPES
We briefly review the state-of-the-art theoretical lower bounds for the I/O complexity for MMM, and describe algorithms that a ain those bounds and thus are optimal for a simple model of a twolevel memory hierarchy. ese algorithms become our fundamental components from which we compose practical algorithms for multiple levels of cache, in Section 3.
An I/O lower bound for MMM
Smith et al. [26] starts with a simple model of memory with two layers of memory: a small, fast memory with capacity of M elements and a large, slow memory with unlimited capacity. It shows that any algorithm for ordinary MMM 2 must read at least 2mnk/ √ M − 2M elements from slow memory and additionally write at least mn −M elements to slow memory. Adding these two lower bounds gives a lower bound on the number of transfers between slow and fast memory, called the I/O lower bound, of approximately 2mnk/ √ M. Importantly, this lower bound is tight, modulo lower order terms. It improves upon previous work [3, 13, 15] .
Resident algorithms for MMM
In [26] , it is shown that three algorithms, named Resident A, Resident B, and Resident C, a ain the lower bound on the number of reads from slow memory 3 . Additionally, Resident C a ains the lower bound on the number of writes to slow memory 3 . In each algorithm, the elements of one of the operand matrices are read from slow memory only once, and each element of the other two operand matrices is reused approximately √ M times each time it is brought into fast memory. While the Resident A algorithm was described as early as 1991 [18] , and all three appear in [10] , their optimality was first noted in [26] .
2.2.1 Resident C. e operation MMM Z += XY can be computed by the sequence of rank-1 updates Z += x 0
, where x i and T i are a row and column of X and Y , respectively. is is illustrated in Figure 2 (le ), where Z is the square block on the le , X is the middle operand, and Y is the operand on the right.
e vectors x i and T i are represented by the thin partitions of X and Y .
Suppose we have (larger) matrices C, A, and B. We compute C += AB in the following way. Partition:
2 We only consider algorithms that compute the i, j element of m × n matrix C as γ i, j := k −1 p=0 α i,p β p, j where α i,p and β p, j are the i, p and p, j elements of m × k matrix A and k × n matrix B, respectively. 3 Modulo lower order terms.
where C i, j is m c × n c , A i is m c × k, and B j is k × n c , except at the margins. en we compute the suboperation C i, j += A i B j using the described algorithm for Z += XY . Now, C i, j is read from slow memory once at the beginning of the suboperation, and resides in fast memory during the rest of the duration, and A i and B j are streamed one row and column at a time from slow memory. Each element of each operand is read once for each i, j, and there are us the algorithm is essentially optimal.
Resident A and B.
Similarly, in the MMM Z += XY , each column of Z can be computed by the matrix-vector multiplicationz i += X i , where z i and i are columns of Z and Y , respectively. is is illustrated in Figure 2 (middle), Z , X , and Y are the le , middle, and right operands, respectively.
Consider C += AB. Partition:
where C i is m c × n, A i,p is m c × k c , and B p is k c × n, except at the margins.
en we compute the suboperation C i += A i,p B p using the described MVM-based algorithm for Z += XY . In Resident A, A i,p is read from slow memory once at the beginning of the suboperation, and C j and B p are streamed from slow memory one column at a time. e total I/O costs associated with each matrix are: ⌈ mnk e Resident B algorithm is the obvious symmetric equivalent to the Resident A algorithm, built upon the suboperation in Figure 2 (right). Its I/O costs mirror that of the Resident A algorithm. e above descriptions "stream" rows and/or columns of two matrices while keeping a block of the third matrix resident in fast memory. Notice that one can instead stream row panels instead of rows and/or column panels instead of columns as long as the "small" dimension of the panel is small relative to the sizes of the block that is resident in fast memory.
is still achieves the I/O lower bound modulo a lower order term.
is insight becomes crucial when we discuss blocking for multiple levels of memory.
Algorithms for different shapes of MMM
e number of reads and writes from slow memory for the Resident A, B, and C algorithms depend on the shape of the input matrices: ere are cases where one of the algorithms is more efficient than the other two, where we define efficiency by flops per memop (I/O operations). ere are 2mnk flops performed during 4 m c and n c must be slightly less than √ M to make room for a row of A i and a column of B j in fast memory.
+=
+= += , which is approximately √ M when m is large. Resident A has an efficiency of 1
, which is approximately √ M when n is large.
is shows that one must choose the right algorithm depending on the shape of the problem. For each of the Resident A,B, and C algorithms, there is a minimal shape that can be implemented efficiently. For Resident C this occurs when m ≈ n ≈ √ M, and k is large. For Resident B this occurs when k ≈ n ≈ √ M, and m is large. For Resident A this occurs when m ≈ k ≈ √ M, and n is large. In each case, the resident matrix fits into fast memory, and the dimension shared by the other two operands should be large so that the cost of moving the resident matrix into fast memory can be amortized.
e fact that one must choose a different algorithm for MMM depending on problem shape and size was previously noted for distributed memory MMM [19, 24] , and for hierarchical memory MMM [10, 31] .
A balancing act
So far in this section, we have assumed that the costs associated with accessing an element is the same, no ma er if it is an element of A, B, or C. In doing so, we arrived at the following strategy: Place a square block of the resident matrix in fast memory, streaming the other two from slow memory. is amortizes the I/O costs associated with the resident matrix, and equalizes the number of accesses of the two streamed matrices.
We now re-analyze the Resident A, B, and C algorithms in the case that the costs associated with accessing elements of the different operands are unequal. is can happen if, e.g., if we add a third layer of memory of intermediate size and access cost. In this case, at the start of a multiplication with submatrices one operand may reside in slow memory while another resides in some intermediate layer. Furthermore, in many cases reads and writes cannot be overlapped (e.g. main memory is o en not dual-ported), and hence it is more expensive to access elements of C since C must be both read and wri en. One way to address this is to select the algorithm where blocks of the operand that is most expensive to access are kept in fast memory as much as possible. Another is to adjust the sizes used for the the resident block in fast memory.
We now walk through an example of the second solution. Suppose we are employing the Resident A algorithm, with an m c × k c block of A in fast memory. If the cost of accessing an element of B costs β B , and accessing an element of C costs β C , when m, n, and k are large, the efficiency in terms of flops per memop is
M. With this, the total cost of I/O (rather than the number of accesses) associated with accessing the streamed matrices are equalized and thus the cost minimized (modulo lower order terms).
Summary
e ingredients to an efficient algorithm are: (1) Fill fast memory with a submatrix of one of the operands (the resident matrix), (2) Amortize the I/O cost associated with (1) over enough computation, (3) Choose dimensions for the resident block that equalize the I/O costs (rather then the number of accesses) associated with the two streamed matrices.
MULTIPLE LEVELS OF CACHE
We now extend the ideas from Section 2 to MMM for computers with multiple layers of fast memory. We carefully build up a particular algorithm for a computer with three layers of memory: a slow main memory, a medium cache, and a fast cache, with each level of cache faster and smaller in capacity than the one before it. We name the fast and medium caches L f and L m , and name their capacities M f and M m , respectively.
We start by partitioning the matrices so that the computation is orchestrated as a double loop over a particular subproblem, one of the shapes in Figure 2 , that that effectively utilizes the L m cache.
e question then becomes how to implement the subproblem in a way that effectively utilizes L f .
A motivating example
For our motivating example, we choose the Resident A algorithm when blocking for L m .
Effectively utilizing L m . To effectively utilize L m , we select the partitioning that casts computation in terms of a double loop around the middle shape in Figure 2 . We call this subproblem the L m blockpanel multiply.
Effectively utilizing L f . e question now becomes how to orchestrate the L m block-panel multiply in a way that effectively utilizes L f .
is suggests again a double loop around one of the shapes in Figure 2 . It is not hard to see that creating a double loop that again implements a Resident A algorithm is problematic: Partitioning A for L f would expose panels of B and C that by design are too large to fit into L m , and these panels are used in multiple L f subproblems. Either these panels of B or C would need to be brought into L m multiple times or the sizes of the various matrix partitions would need to be reduced. Either way the effect would be that the operation would no longer be near-optimal with respect to the number of transfers between L m and slower levels of memory. We conclude that the block-panel multiply should be implemented in terms of a Resident C or Resident B algorithm. Which of these depends on the choice of the outer and inner loop, which we discuss next.
Choosing the outer loop for the L f cache. In order to a ain near the lower bound, each element in the two long panels of B and C must be used ≈ √ M times each time it is brought into L m . is leads us to first partition along the n dimension with blocksize n c , yielding partitions of B and C that are small enough to fit into the L m cache along with the block of A.
e inner loop for the L f cache. e next step is to further partition the matrices to optimize for L f . e subproblem exposed by each iteration of the L f outer loop is a block of A times a skinny panel of B updating a skinny panel of C. e L f inner loop will partition this subproblem along one of the two dimensions that the L f outer loop did not. We can choose either of these. For this example, we will choose the k dimension (with blocksize k c ). is L f inner loop exposes a new subproblem that we will call the L f subproblem. In this case, the L f subproblem is a tall and skinny panel of A times a k c × n c block of B, updating a tall and skinny panel of C. If k c ≈ n c ≈ M f , then the L f subproblem corresponds to the furthest le shape seen in Figure 2 . en, the block of B will reside in the L f cache, and the panels of A and C will be streamed in from lower levels of cache for the duration of this subproblem.
Building a Family of Algorithms
In the the motivating example, we started with a problem resembling the middle shape from Figure 2 , and used two loops to partition the problem, resulting in a problem resembling one of the other two problem shapes. is suggests the following methodology to optimize for any number of levels of cache: We begin with one of the three shapes in Figure 2 , optimizing for the I/O cost for the L h cache. en, to optimize for the next smaller and faster level of cache, the L h−1 cache, we first partition the problem along the long dimension, and then partition along along one of the other two dimensions. e result is one of the other two shapes shown in Figure 2 . We name the outermost of these two loops the L h−1 outer loop and the innermost the L h−1 inner loop. is process is shown in Figure 3 .
We note that [10] claimed that it was locally optimal to encounter a subproblem that corresponds to one of the three optimal subproblems at every level of the memory hierarchy. However that paper did not give details on how this could be accomplished, nor did it analyze the claim in terms of any I/O lower bounds.
Classifying matrix operands and algorithms
e two loops for L h−1 have exposed partitions of matrices that differ in terms of access frequency and size. From these properties, we can classify these different matrix partitions.
e L h resident block is the block that is designed to remain and reside in the L h cache during the duration of the L h subproblem. e other two operands of an L h subproblem are called the L h streamed panels, as small partitions of the streamed panels are brought into L h during an iteration of the L h−1 outer loop, used for computation, and then not used again during the L h subproblem.
e L h−1 inner loop partitions the L h resident block and one of the L h streamed panels. e remaining L h streamed panel is le unpartitioned.
e matrix partition not partitioned by the L h−1 inner loop is used during every iteration of the L h−1 inner loop. Guided by the principle that each element of the L h subproblem should only by read into L h once, it must remain in cache during the entire inner L h−1 loop. We name this matrix partition the L h guest panel. Compare this to the resident block of the L h cache.
e elements of the L h guest matrix, like the elements of the L h resident block, are reused from L h across iterations of a loop. e difference is that the L h resident block is reused across every iteration of the outer L h−1 loop, and the L h guest matrix is reused across the iterations of the inner L h−1 loop.
A er the two L h−1 loops, we have exposed one of the three shapes associated with our algorithms Resident A, Resident B, and Resident C. e small block that will then reside in L h−1 will be known as the L h−1 resident block.
e algorithms that arise from our methodology can be identified by the operand that the resident block is from in each level of cache. We introduce a naming convention for the algorithms that states the level of cache and the operand that resides in it. For instance if an algorithm has B as the resident block of the L 2 cache, A as the resident block of the L 1 cache, and C as the resident block in registers, it is called B 2 A 1 C 0 .
Optimizing for registers
In our family of algorithms, we think of the register file as L 0 : the smallest and fastest level of cache. For practical reasons, it should be treated as a special case. In many implementations of MMM, the innermost kernel implements the Resident C algorithm [7, 11, 29, 30] .
ere are good reasons for this. e latency of the computation instructions dictates that there is a minimum number of registers that must be used to store elements of C to avoid the instruction latency becoming a bo leneck. e number of elements of C that are stored in registers must be at least the product of the
Resident block of L h−1 cache. instruction latency and the number of elementary additions that can be performed per cycle [20, 33] . O en this means that a significant portion of the registers must be dedicated to storing elements of C, making it unnatural to use the Resident A or Resident B algorithms for the registers. erefore, it is o en the case that there is no choice but to use Resident C for the registers.
MULTILEVEL CACHE TRADEOFFS
When simultaneously optimizing for multiple levels of cache, there are tensions between I/O costs at the different levels of cache.
Optimizing for
When simultaneously optimizing for both L h−1 and L h , the size of the L h resident block is reduced relative to its size when optimizing only for L h , since larger portions of the L h streamed panels must fit in L h . When optimizing for both L h and L h−1 :
• At minimum, the L h resident matrix and
• If L h is inclusive and has a LRU policy, then in order for an element to remain in L h , fewer than M h elements may be accessed in between accesses of the element for it to remain in cache and hence every matrix partition exposed by the L h−1 outer loop must fit in L h .
ese conditions represent a tradeoff between optimizing for L h and L h−1 .
e larger L h−1 is, the more data must fit into it, and the smaller the L h resident block can be. With the simplifying assumptions that the resident blocks of both L h and L h−1 must be square, the I/O cost for L h when optimizing for both L h and L h−1 can be determined by the ratio M h /M h−1 .
Sometimes, it is counter-productive or of limited value to optimize for L h−1 when optimizing for L h . In this case:
(1) A simple option is to treat L h−1 as if it were smaller than it is, reducing the size of the L h−1 resident block. (2) If L h is LRU, another option is to tweak the blocksizes for L h−1 slightly. e portions of the L h streamed panels that must fit into L h alongside the L h resident block depends on the tiling of the L h−1 outer loop but not on the blocksize of the L h−1 inner loop. erefore, one can tweak the shape of the L h−1 resident block accordingly. (3) A third option is that one could "skip" optimizing for L h−1 and instead simultaneously optimize for L h and L h−2 . Blocking for the L h−1 cache adversely affects the number of transfers into and out of the L h cache but blocking for further (smaller and faster) levels of cache does not, because the entire L h−2 subproblem fits within the data that must be in the L h cache.
Simultaneously optimizing for L h and L h−1 adversely effects transfers into and out of L h . We now argue that it also has an adverse effect on the transfers into and out of L h .
When optimizing for only one cache of size M, the streamed matrices are each associated with an aggregate I/O cost of ≈ mnk/ √ M. When optimizing for both L h and L h−1 , however, the I/O cost associated with the L h−1 resident matrix becomes cubic because each element of the L h−1 resident matrix is moved into
When optimizing for both L h and L h−1 , the I/O cost associated with the L h−1 resident matrix will be ≈ mnk/ √ M h , whereas when only optimizing for the L h−1 cache, the I/O cost associated with the L h−1 resident matrix is equal to the number of compulsory reads and writes. e I/O costs for the streamed matrices are not affected.
While optimizing for the L h cache has increased the L h−1 I/O cost, optimizing for L h+1 , L h+2 , etc., does not affect it, because optimizing for further levels of cache does not reduce the number of times each element is used every time it is brought into the L h−1 cache.
Skipping caches
We have seen that tradeoffs occur when simultaneously optimizing for the I/O cost of multiple levels of cache. Sometimes these tradeoffs are too great, so instead of optimizing for both L h and L h−1 , one may forego the L h−1 cache, and instead simultaneously optimize for L h and L h−2 I/O costs, where the L h−1 cache is intermediate between L h and L h−2 . We call this skipping the L h−1 cache.
When the L h−1 cache is skipped, an optimal subproblem is encountered at the L h level and at the L h−2 level, but not at the L h−1 level, However, this does not mean that the L h−1 is not useful. Recall that the L h guest matrix is reused during each iteration of the L h−2 inner loop. In the right circumstances, this L h guest matrix may be instead reused in the L h−1 cache, if that cache is skipped.
• In idealized circumstances, only the L h guest matrix should need to be in the L h−1 cache.
• If the L h−1 cache is LRU, then a panel of the L h resident matrix must also fit into the L h−1 cache.
• If the L h−1 cache is inclusive, then the L h−2 resident block must also fit into the L h−1 cache.
In this case, the L h guest matrix is reused from L h−1 , but is not square, and the I/O cost associated with reading the other two operands is suboptimal. Furthermore, in many cases this panel occupies only a fraction of L h−1 , reducing its size and further increasing the I/O cost. Goto's algorithm is a member of the MOMMS family. It skips optimizing for the L 3 and L 1 caches. Since A is the resident matrix of the L 2 cache, and C is the resident matrix of the registers, Goto's algorithm is named A 2 C 0 according to the convention in Section 3.3.
EXPERIMENTS
We now evaluate algorithms created by our methodology. We do so by performing experiments on architectures with a varying number of levels of cache.
For current CPUs, Goto's algorithm a ains excellent performance [28] that is difficult to exceed despite the fact that it does not a ain close to the I/O lower bound on computers with an L 3 cache. In order to evaluate the I/O cost of different algorithms, we artificially vary the cost of accessing main memory. In all experiments, we perform double precision MMM.
Experimental setup
We have implemented the described family of algorithms as the Multilevel Optimized Matrix-Matrix Multiplication Sandbox (MOMMS). MOMMS implements algorithms for MMM by composing components like matrix partitioning, packing, and parallelization at compile time. MOMMS is wri en in Rust [21] , a modern system programming language focusing on memory safety. Most of this safety is enforced at compile-time through Rust's borrow checker. In Rust, memory is freed when it goes out of scope, and there is no garbage collector. From Rust, one can call C functions with very low overhead. For low-level kernels, MOMMS calls the BLIS microkernel [29] coded in C and inline assembly language.
We custom built two computers. One has an Intel i7-7700K CPU with two 8GB DIMMS of DDR4-3200 RAM and a motherboard with an Intel Z270 chipset.
e other has an Intel i7-5775C CPU with two 8GB DIMMS of DDR3-2400 RAM and a motherboard with an Intel Z97 chipset. We refer to these computers by their processor names. We chose the Z270 and Z97 chipset motherboards because these are enthusiast motherboards for consumers interested in overclocking, and they provide the ability to change the memory multiplier. e i7-7700K computer has a 4-core Intel Kaby Lake CPU with 64KB L 1 , 256KB L 2 , and 6MB L 3 caches. We chose this because it is a recent readily available Intel processor with an L 3 cache.
e i7-5775C is a 4-core Intel Broadwell CPU. It also has 64KB L 1 , 256KB L 2 ,and 6MB L 3 caches. Most notably it has 128MB of eDRAM, functioning as an L 4 cache.
All experiments were performed with hyperthreading disabled. A userspace CPU governor was used to set the CPUs to the nominal CPU frequency: 4.2 GHz for the i7-7700K and 3.3 GHz for the i7-5775C.
e bandwidth to main memory can be determined by the product of the number of memory channels, the base clock rate, the number of bytes per transfer, and the memory multiplier. With DDR RAM, this is doubled since it transfers on both the leading and trailing edges of the clock signal. We increase the ratio of the rate of I/O to the rate of computation via the BIOS se ings. Reducing the memory multiplier and the number of memory channels decreases the rate of I/O without changing the rate of computation.
Optimizing for the L 3 cache
We here describe an algorithm implemented in MOMMS that optimizes for both L 3 and L 2 labeled B 3 A 2 C 0 . We compare this algorithm to our re-implementation of Goto's algorithm (also implemented in MOMMS), and to vendor and state-of-the-art open source BLAS [4] implementations. Figure 4 compares Goto's algorithm with other MOMMS algorithms optimized for both the I/O cost of L 3 and L 2 .
We now describe the B 3 A 2 C 0 algorithm as implemented for the i7-7700K and illustrated in Figure 4 (second from the le ). First, we partition for L 3 cache.
e L 3 outer loop partitions the matrices in the n dimension with blocksize 768. en the L 3 inner loop partitions in the k dimension, also with blocksize 768. is reveals a 768×768 block of B that becomes the L 3 resident matrix. Next, we partition for the L 2 cache. Since B is the L 3 resident matrix, the L 2 outer loop must be in the m dimension, and it is with blocksize 120. e next two loops make a 4 × 12 block of C the resident matrix of registers, and a 192 × 12 panel of B, the guest panel of L 2 . is guest panel of L 2 is designed to be reused in the (skipped) L 1 cache. Finally, we call a 4 × 12 micro-kernel provided by BLIS [29] .
We compare this to Goto's algorithm with similar blocksizes as follows: n c is 3000, k c is 192, m c is 120, m r is 4, and n r is 12. Our implementation of Goto's algorithm uses the same micro-kernel from BLIS as does B 3 A 2 C 0 . For both algorithms, we parallelize the second loop around the micro-kernel with 4 threads. is quadruples the bandwidth requirements of our algorithms without increasing the amount of the L 3 cache that must be set-aside for elements of A [27] .
Rooflines. e roofline model is a simple model used to give an upper bound on performance based on the arithmetic intensity of an algorithm for a specific computer [32] . e computer is characterized by its rate of computation and the rate at which it can transfer data between main memory and cache. e arithmetic intensity of an algorithm is the number of flops per byte transferred between memory and cache during the execution of that algorithm. When the arithmetic intensity is low it is bandwidth bound, and when the arithmetic intensity is high it is compute bound. e roofline model is thus a plot where the x-axis is the arithmetic intensity and the y-axis is maximum rate of computation for that arithmetic intensity. e roofline that serves as an upper bound on performance is formed by two linear curves that intersect when the minimum time spent for computation for an algorithm is equal to the minimum time spent for I/O. Algorithms are plo ed on the roofline cores under two bandwidth conditions. e y-axis is measured and the x-axis is theoretical. e higher points for each algorithm represent high-bandwidth performance and the lower points represent low-bandwidth performance. model according to their arithmetic intensity and measured performance as a way to explain their performance and to explain whether or not they could perform be er. One can either measure the arithmetic intensity of an algorithm or analyze it. We choose to analyze the arithmetic intensity of the algorithms plo ed.
When the matrices are large, Goto's algorithm has an efficiency of 1
flops per element. With the blocksizes we used, this is 23.26 flops per byte. e algorithm B 3 A 2 C 0 , with a 768 × 768 block of B in the L 3 cache, has an efficiency of 64 flops per byte.
In Figure 5 , we show the roofline model for the i7-7700K for the case of one channel of DDR4-800 RAM, and for the case of two channels of DDR4-3200 RAM. ese cases represent the minimum and the maximum memory bandwidth that we configure the computer for. We plot the modeled efficiency of Goto's algorithm and the algorithm B 3 A 2 C 0 against each algorithm's measured performance. e roofline plot clearly shows that in the highbandwidth case, either algorithm is capable of achieving the peak performance of the CPU based on its arithmetic intensity, but for the low-bandwidth case, only B 3 A 2 C 0 can. e improved arithmetic intensity is caused by its more effective utilization of the L 3 cache.
Varying Bandwidth. Figure 6 reports the achieved performance of Goto's algorithm and B 3 A 2 C 0 for square matrices, varying the amount of bandwidth to main memory. Packing is o en used to achieve spatial locality during an algorithm. Otherwise blocks that are designed to reside in cache may not be able to do so due to cache conflict issues [12] . Packing incurs extra memory movements that do not fundamentally need to happen during MMM.
is paper is concerned with the fundamentals of temporal locality during MMM, and hence we sidestep the spatial locality issue by storing matrices "prepacked" such that every time a matrix is partitioned, the blocks are stored contiguously. is lets us separate the issues of temporal and spatial locality in our experiments 5 .
At low bandwidth, B 3 A 2 C 0 outperforms Goto's algorithm by thirty to forty percent. As the and the gap eventually disappears.
Comparing with existing implementations. In Figure 7 , we compare our implementations of Goto's algorithm and B 3 A 2 C 0 against the routines in ATLAS [31] [14] . It would not be fair to compare against implementations of MMM if we did not need to pack, so for this experiment, input matrices are stored in 5 Others have avoided packing for practical reasons. BLASFEO [5] operates on socalled panel-major matrices for performance on small matrices. e panel-major format is similar to the format used in Goto's algorithm for the packed panel of B. Another library, libxsmm [11] , also targets small matrices, and operates on column-major matrices, but does not perform packing. column major order, and our implementations of Goto's algorithm and B 3 A 2 C 0 pack matrices the first time they become the resident or guest matrix at some level of cache. is packing (and the fact that C is not stored hierarchically for Goto's algorithm) account for the performance difference seen for the Goto and B 3 A 2 C 0 curves between Figures 6 and 7. We see that for high bandwidth scenarios, BLIS, the MOMMS implementation of Goto's algorithm, and B 3 A 2 C 0 all a ain roughly 75% of peak, and that MKL outperforms the other implementations. For low bandwidth, implementations that use Goto's algorithm (or something similar) exhibit poorer performance as they do not effectively utilize the L 3 . In this case, C 3 A 2 C 0 performs best, with B 3 A 2 C 0 close behind. For large problem sizes, ATLAS performs almost as well as the algorithms implemented in MOMMS that optimize for the L 3 I/O cost but it does not perform nearly as well for the high bandwidth case.
Optimizing for the L 4 cache
In this section, we demonstrate that our methodology can be efficiently applied to the Intel i7-5775C, which has four levels of cache, where the L 4 cache is 128MB of eDRAM. We implemented an algorithm called C 4 A 2 C 0 for this architecture. Figure 8 shows the loop ordering and the blocksizes used for C 4 A 2 C 0 . In C 4 A 2 C 0 , a 3600× 3600 block of C resides in the L 4 cache, and a 120× 192 block of A resides in the L 2 cache. We decided to skip blocking for the L 3 cache, as there is sufficient bandwidth from the L 4 cache without optimizing for the number of L 3 cache misses. Nevertheless, the guest matrix of the L 4 cache, a 192 × 3600 panel of B, is appropriately sized to remain in the L 3 . C 4 A 2 C 0 uses the same inner kernel as B 3 A 2 C 0 .
In Figure 9 , we compare the performance of Goto's algorithm and C 4 A 2 C 0 for square matrices across several bandwidths. In this experiment, matrices are stored hierarchically, and so packing is not performed. For high bandwidths, Goto's algorithm and C 4 A 2 C 0 exhibit similar performance, but when bandwidth is low, C 4 A 2 C 0 outperforms Goto's algorithm for large problem sizes. Figure 10 compares the performance on square matrices of our implementations of Goto's algorithm and C 4 A 2 C 0 . Here, matrices are stored in column-major order and accordingly packing is performed when partitions of A and B become resident or guest matrices of some level of cache. In C 4 A 2 C 0 , C is unpacked when it is no longer resident in L 4 . In both Figures 9 and 10 , the top of the graphs is the peak computational rate of the CPU.
Because L 4 is so large, we ran quite large problems since otherwise the matrices would completely fit into cache. Performance for Goto's algorithm and MKL do not fall off until the problem size becomes m = n = k ≈ 5000. We can see that Goto's algorithm does not optimally use L 4 and neither does Intel's MKL. While BLIS's performance does not fall off as severely as for the other implementations when the problem size grows, its overall performance is not as high.
e algorithmic differences between BLIS and the MOMMS implementation of Goto's algorithm are parallelism and blocksizes. BLIS uses a larger k c and a smaller m c than MOMMS and parallelizes the 2nd and 3rd loops around the micro-kernel, whereas MOMMS parallelizes the 2nd loop around the micro-kernel. Modifying either the parallelism or the blocksizes so that they match that of the MOMMS implementation of Goto's algorithm adversely affects performance for the low bandwidth case, causing a noticeable dropoff for larger matrices. We postulate that somehow the way that data is shared by the threads within BLIS, coupled with the larger value of k c within BLIS, (or the smaller m c ) fosters be er reuse of data within the L 4 cache. With DDR-800, all implementations of MMM on the i7-5775C outperform those on the i7-7700K, despite the fact that the former processor is two generations older. e large L 4 cache means that blocksizes for the C 4 A 2 C 0 algorithm can be very large, so the algorithm does not need much bandwidth from main memory, but even algorithms that do not take advantage of L 4 by using such large blocksizes benefit from having the 128MB cache.
e large capacity cache can facilitate the hiding of latency to main memory, through techniques such as hardware prefetching.
Algorithms for different shapes of matrices
Algorithm A 3 B 2 C 0 partitions the matrices such that a square block of A is resident in L 3 and a block of B is resident in L 2 . It then calls an inner kernel updating a panel of C whose elements are in L 3 by multiplying a block of A whose elements are in L 2 times a panel of B whose elements are in L 3 . Algorithm C 3 A 2 C 0 partitions the matrices such that a square block of C is resident in the L 3 and a block of A is resident in L 2 . It then calls the same inner kernel as the algorithm B 3 A 2 C 0 does. Blocksizes and loop orderings for algorithms A 3 B 2 C 0 and C 3 A 2 C 0 are shown in Figure 4 .
Algorithms A 3 B 2 C 0 , B 3 A 2 C 0 , and C 3 A 2 C 0 represent three choices for blocking for L 3 . In Section 2.3, we argued that each of these choices may be optimal for a specific problem shape where two dimensions are equal to √ M 3 and the other dimension is large, and selecting the wrong algorithm for a problem shape can result in an I/O cost that is 50% greater.
On a computer with three levels of cache and low bandwidth, we claim the following: A 3 B 2 C 0 casts its computation in terms of a block-panel multiply, with a block of A in L 3 , and so it should be the best choice of the three algorithms when m = k ≈ √ M 3 , and n is large. Similarly, B 3 A 2 C 0 casts its computation in terms of a panel-block multiply, with a block of B in L 3 , and so it should be the best choice of the three algorithms when n = k ≈ √ M 3 , and m is large. Finally, C 3 A 2 C 0 casts its computation in terms of a block dot product multiply, with a block of C in L 3 , and so it should be the best of the three algorithms when m = n ≈ √ M 3 , and k is large. Figure 11 reports results using A 3 B 2 C 0 , B 3 A 2 C 0 , C 3 A 2 C 0 , and Goto's algorithm, with matrices stored hierarchically and no packing is performed. We vary the shape of the matrices. In each case, two of the dimensions are set to 768, and one of the dimensions is varied along the x-axis.
e experiments were performed on the Intel i7-7700K, with the DDR speed set to a single channel of DDR4-800. When the dimension that is allowed to vary is large, the predicted algorithm outperforms the others. We also show performance when the matrices are square, and the size varies along the x-axis. For our algorithms that optimize for the L 3 cache, there is very li le performance difference between the square case and the case where an algorithm is the "correct" choice. We conclude that when executing MMM, optimal I/O properties are a ainable two dimensions are at least the square root of the last level cache size, as long as the third dimension is much larger. e algorithms A 3 B 2 C 0 , B 3 A 2 C 0 , and C 3 A 2 C 0 outperform Goto's algorithm for larger problem sizes in this low bandwidth scenario.
is is because even when the algorithm is wrong for the problem shape, the I/O cost is only 50% higher. In comparison, on this computer, Goto's algorithm has an I/O cost that is approximately two times greater than the optimal algorithm.
SUMMARY
We have developed a new family of algorithms for MMM that effectively utilizes a cache hierarchy with multiple layers of fast memory, using two loops at each level of the memory hierarchy. We then demonstrated performance improvements over state-of-theart implementations of MMM when I/O cost to main memory is a limiting factor. Algorithms like this are key to delaying the inevitable situation where MMM becomes memory bound.
We chose to focus on potential performance benefits of these algorithms and demonstrated those benefits in our experiments. However memory movements cost far more energy than flops do [25] , so algorithms that reduce I/O costs are beneficial for the additional reason that they reduce energy usage.
Many algorithms in libraries that implement higher-level linear algebra, such as LAPACK [1] or libflame [9] , take advantage of the fact that MMM implementations in BLAS libraries are efficient when the k dimension is relatively small, on the order of a couple of hundred, as Goto's Algorithm reaches its maximal efficiency when k is equal to the blocksize k c . We expect future computers to be bandwidth bound when executing MMM in such situations, and algorithms for MMM that have larger blocksizes will be used. To take advantage of algorithms that use larger blocksizes, LAPACK and FLAME can use larger blocksizes, however this is currently disadvantageous because the larger their blocksizes are, the more time is spent during inefficient unblocked subproblems. According to this line of thought, LAPACK and FLAME would benefit from algorithms that do not suffer from this weakness. One possibility is to use recursive algorithms, as advocated in Peise and Bientinesi [23] .
Hard drives and other similarly slow storage devices can be thought of as another layer of the memory hierarchy. Because of this, we believe the methodology in this paper can be used to instantiate out-of-core algorithms for MMM. A major difference between such out-of-core algorithms and the ones in this paper targeting LRU caches is that out-of-core algorithms may require explicit transfers to disk and explicit overlapping of I/O and computation.
Future work is to generalize the MOMMS family of algorithms to other dense linear algebra operations, much in the way that Goto's algorithm [7] was generalized to the rest of the level-3 BLAS [8] .
e key point is that most suboperations during the other level-3 BLAS operations (that operate on structured matrices) are regular, unstructured MMM operations [16] .
