Matrix multiplication (GEMM) is a core operation to numerous scientific applications. Traditional implementations of Strassen-like fast matrix multiplication (FMM) algorithms often do not perform well except for very large matrix sizes, due to the increased cost of memory movement, which is particularly noticeable for non-square matrices. Such implementations also require considerable workspace and modifications to the standard BLAS interface. We propose a code generator framework to automatically implement a large family of FMM algorithms suitable for multiplications of arbitrary matrix sizes and shapes. By representing FMM with a triple of matrices U, V, W that capture the linear combinations of submatrices that are formed, we can use the Kronecker product to define a multi-level representation of Strassen-like algorithms. Incorporating the matrix additions that must be performed for Strassen-like algorithms into the inherent packing and micro-kernel operations inside GEMM avoids extra workspace and reduces the cost of memory movement. Adopting the same loop structures as high-performance GEMM implementations allows parallelization of all FMM algorithms with simple but efficient data parallelism without the overhead of task parallelism. We present a simple performance model for general FMM algorithms and compare actual performance of 20+ FMM algorithms to modeled predictions. Our implementations demonstrate a performance benefit over conventional GEMM on single core and multi-core systems. This study shows that Strassen-like fast matrix multiplication can be incorporated into libraries for practical use.
I. INTRODUCTION
Three recent advances have revived interest in the practical implementation of Strassen's algorithm (STRASSEN) and similar Fast Matrix Multiplication (FMM) algorithms. The first [1] is a systematic way in which new FMM algorithms can be identified, building upon conventional calls to the BLAS matrix-matrix multiplication GEMM routine. That work incorporated a code generator, due to the number of algorithms that are identified and the complexity of exploiting subexpressions encountered in the linear combinations of submatrices. Parallelism was achieved through a combination of task parallelism and parallelism within the BLAS. The second [2] was the insight that the BLAS-like Library Instantiation Software (BLIS) [3] framework exposes basic building blocks that allow the linear combinations of submatrices in STRASSEN to be incorporated into the packing and/or computational micro-kernels already existing in the BLIS GEMM implementation. Parallelism in that work mirrored the highly effective data parallelism that is part of BLIS. Finally, the present work also extends insights on how to express multiple levels of STRASSEN in terms of Kronecker products [4] to multi-level FMM algorithms, facilitating a code generator for all methods from [1] (including STRASSEN), in terms of the building blocks created for [2] , but allowing different FMM algorithms to be used for each level. Importantly and unique to this work, the code generator also yields performance models that are accurate enough to guide the choice of a FMM implementation as a function of problem size and shape, facilitating the creation of poly-algorithms [5] . Performance results from single core and multi-core shared memory system support the theoretical insights.
We focus on the special case of GEMM given by C := C + AB. Extending the ideas to the more general case of GEMM is straightforward.
II. BACKGROUND
We briefly summarize how the BLIS framework implements GEMM before reviewing recent results [2] on how STRASSEN can exploit insights that underlie this framework.
n C n C Figure 1 . Figure from [2] (used with permission from authors). Left: Illustration of the BLIS implementation of the GOTOBLAS GEMM algorithm. All computation is cast in terms of a micro-kernel that is highly optimized. Right: modification that implements the representative computation M = (X + Y )(V + W ); C+= M ; D+= M of each row of computations in (2) . X, Y are submatrices of A; V , W are submatrices of B; C, D are submatrices of the original matrix C; M is the intermediate matrix product. Note that the packing buffers A i and Bp stay in cache. then it can be shown that the operations
compute C := AB + C, but with seven (sub)matrix multiplications, reducing the cost by a factor of 7/8 (ignoring a lower order number of extra additions). If all matrices are square and of size n × n, classical STRASSEN exploits this recursively, reducing the cost for GEMM to O(n 2.807 ).
Only a few levels of the recursion are exploited in practice because the cost of extra additions and extra memory movements quickly offsets the reduction in floating point operations. Also, STRASSEN is known to become more numerically unstable particularly when more than two levels of recursion are employed [8] , [9] , [10] 1 .
In [2] , captured in Figure 1 (right), it was noted that the additions of the submatrices of A and B can be incorporated into the packing into buffers A i and B p , avoiding extra memory movements. In addition, once a submatrix that contributes to C is computed in the registers, it can be directly added to the appropriate parts of multiple submatrices of C, thus avoiding the need for temporary matrices M r , again avoiding extra memory movements. As demonstrated in [2] , this makes the method practical for smaller matrices and matrices of special shape (especially rank-k updates, where k is relatively small).
III. FAST MATRIX MULTIPLICATION ALGORITHMS
We now present the basic idea that underlies families of FMM algorithms and how to generalize one-level formula for multi-level FMM utilizing Kronecker products and recursive block storage indexing.
A. One-level fast matrix multiplication algorithms
In [1] , the theory of tensor contractions is used to find a large number of FMM algorithms. In this subsection, we use the output (the resulting algorithms) of their approach.
Generalizing the partitioning for STRASSEN, consider C := C + AB, where C, A, and B are m × n, m × k, and k×n matrices, respectively. [1] defines a m, k, n algorithm by partitioning
Note that A i , B j , and C p are the submatrices of A, B and C, with a single index in the row major order. Then, C := C + AB is computed by, for r = 0, ..., R − 1,
C p += w pr M r (p = 0, ..., m n − 1)
where (×) is a matrix multiplication that can be done recursively, u ir , v jr , and w pr are entries of a ( m k) × R matrix U , a ( k n) × R matrix V , and a ( m n) × R matrix W , respectively. Therefore, the classical matrix multiplication which needs m k n submatrix multiplications can be completed with R submatrix multiplications. The set of coefficients that determine the m, k, n algorithm is denoted as U, V, W . For example, assuming that m, n, and k are all even, one-level STRASSEN has 2, 2, 2 partition dimensions and, given the partitioning in (1) and computations in (2), 
specifies U, V, W for one-level STRASSEN. Figure 2 summarizes a number of such algorithms that can be found in the literature that we eventually test in Section V. We only consider 2 ≤ m, k, n ≤ 6 and don't include arbitrary precision approximate (APA) algorithms [11] , due to their questionable numerical stability. Practical #1 Practical #2 Figure 2 . Theoretical and practical speedup for various FMM algorithms. m k n is the number of multiplication for classical matrix multiplication algorithm. R is the number of multiplication for fast matrix multiplication algorithm. Theoretical speedup is the speedup per recursive step. Practical #1 speedup is the speedup for one-level FMM comparing with GEMM when m = n = 14400, k = 480 (rank-k updates) 2 . Practical #2 speedup is the speedup for one-level FMM comparing with GEMM when m = n = 14400, k = 12000 (approximately square). We report the practical speedup of the best implementation of our generated code (generated GEMM) and the implementations in [1] (linked with Intel MKL) on single core. More details about the experiment setup is described in Section V.
B. Kronecker Product
If X and Y are m×n and p×q matrices with (i, j) entries denoted by x i,j and y i,j , respectively, then the Kronecker product [12] X ⊗ Y is the mp × nq matrix given by
C. Recursive Block Indexing (Morton-like Ordering)
An example of recursive block storage indexing (Mortonlike ordering) [15] is given in Figure 3 . In this example, A undergoes three levels of recursive splitting, and each submatrix of A is indexed in row major form. By indexing A, B, and C in this manner, data locality is maintained when operations are performed on their respective submatrices.
D. Representing two-level FMM with the Kronecker Product
In [4] , it is shown that multi-level 2, 2, 2 STRASSEN can be represented as Kronecker product. In this paper, we extend this insight to multi-level FMM, where each level can use a different choice of m, k, n .
Assume each submatrix of A, B, and C is partitioned with another level of m , k , n FMM algorithm with the coefficients U , V , W , and A i , B j , C p are the submatrices of A, B and C, with a single index in two-level recursive block storage indexing. Then it can be verified that C := C + AB is computed by, for r = 0, ..., R · R − 1,
The set of coefficients of a two-level m, k, n and m , k , n FMM algorithm can be denoted as
are the one-level STRASSEN coefficients given in (4).
E. Additional levels of FMM
Comparing one-level and two-level FMM, the same skeleton pattern emerges. The formula for defining L-level FMM is given by, for r = 0, ...,
The set of coefficients of an L-level m l , k l , n l (l = 0, 1, ..., L−1) FMM algorithm can be denoted as
IV. IMPLEMENTATION AND ANALYSIS
The last section shows that families of one-level FMM algorithms can be specified by m, k, n and U, V, W . It also shows how the Kronecker product can be used to generate multi-level FMM algorithms that are iterative rather than recursive. In this section, we discuss a code generator that takes as input m, k, n and U, V, W and as output generates implementations that build upon the primitives that combine taking linear combinations of matrices with the packing routines and/or micro-kernels that underlie BLIS. The code generator also provides a model of cost for each implementation that can then be used to choose the best FMM for a matrix of given size and shape. This code generator can thus generate code for arbitrary levels of FMM that can use different FMM choices at each level. In this way, we have generated and compared more than 200 FMM algorithms.
A. Code generation
Our code generator generates various implementations of FMM, based on the coefficient representation U, V, W , levels of recursion, and packing routine/micro-kernel incorporation specifications.
There are two stages for our code generator: generating the skeleton framework, and generating the typical operations given in (3) .
Generating the skeleton framework: During this stage, the code generator
Generates the matrix partition code by conceptual recursive block storage indexing with m l , k l , n l partition dimensions for each level. • For the general cases where one or more dimensions are not multiples of corresponding
l=0 n l , it generates dynamic peeling [16] code to handle the remaining "fringes" after invoking FMM, which requires no additional memory. 
Ta for extra submatrix additions.
Tm for reading submatrices in packing routines ( Fig. 1) .
Tm for writing submatrices in packing routines ( Fig. 1) .
Tm for reading and writing submatrices in micro-kernel ( Fig. 1) .
Tm for reading or writing submatrices, related to the temporary buffer as part of Naive FMM and AB FMM.
non-zero entry number in matrix or vector X. 
B. Performance model
In [2] , a performance model was given to estimate the execution time T for the one-level/two-level ABC, AB, and Naive variations of 2, 2, 2 STRASSEN. In this subsection, we generalize that performance model to predict the execution time T for the various FMM implementations generated by our code generator. Theoretical estimation helps us better understand the computation and memory footprint of different FMM implementations, and allows us to avoid exhaustive empirical search when searching for the best implementation for different problem sizes and shapes. Most importantly, our code generator can embed 1 Effective GFLOPS = 2 · m · n · k/T · 10 −9 2 T = Ta + Tm 
our performance model to guide the selection of a FMM implementation as a function of problem size and shape, with the input m l , k l , n l and U l , V l , W l specifications on each level l. These performance models are themselves automatically generated. 
Assumption:
We have similar architecture assumptions as in [2] . Basically we assume that the architecture has two layers of modern memory hierarchy: fast caches and relatively slow main memory (DRAM). For read operations, the latency for accessing cache can be ignored, while the latency for accessing the main memory is counted; For write operations, we assume a lazy write-back policy such that the time for writing into fast caches can be hidden. Based on these assumptions, the memory operations for GEMM and various implementations of FMM are decomposed into three parts:
• memory packing shown in Figure 1 .
• reading/writing the submatrices of C in Figure 1 .
• reading/writing of the temporary buffer that are parts of Naive FMM/AB FMM. Notation: Notation is summarized in Figure 4 . The total execution time, T , is dominated by arithmetic time T a and memory time T m ( 2 in Figure 5 ).
Arithmetic operations: T a is decomposed into submatrix multiplications (T × a ) and submatrix additions (T A+ a , T B+ a , T C+ a ) ( 3 in Figure 5 ). T X+ a has a coefficient 2 because under the hood the matrix additions are cast into FMA operations. The corresponding coefficients N X a are tabulated in Figure 5 
Note that X :,r denotes the rth column of X.
Memory operations: T m is a function of the submatrix sizes {m/ M L , k/ K L , n/ N L }, and the block sizes {m C , k C , n C } in Figure 1(right) , because the memory operation can repeat multiple times according to which loop they reside in. T m is broken down into several components, as shown in 4 in Figure 5 . Each memory operation term is characterized in Figure 5 by its read/write type and the amount of memory in units of 64-bit double precision elements. Note that T , which denotes the prefetching efficiency. λ is adapted to match GEMM performance. Note that this is a ceiling function proportional to k, because rankk updates for accumulating submatrices of C recur k/ KL kc times in 4th loop in Figure 1 . The corresponding coefficients N X m are tabulated in Figure 5 . For example, for Naive FMM and AB FMM, computing C p += ( W ) p,r M r (p = 0, ...) in (5) involves 2 read and 1 write related to temporary buffer in slow memory. Therefore, N C× m = 3nnz( W ).
C. Discussion
We can make estimations about the run time performance of the various FMM implementations generated by our code generator, based on the analysis shown in Figure 5 . We use Effective GFLOPS (defined in 1 in Figure 5 ) as the metric to compare the performance of these various FMM implementations, similar to [1] , [17] , [18] . The architecturedependent parameters for the model are given in Section V. We demonstrate the performance of two representative groups of experiments in Figures 6 and 7 .
• Contrary to what was observed in [2] , Naive FMM may perform better than ABC FMM and AB FMM for relatively large problem size. For example, in Figure 6 , 3, 6, 3 (with the maximum theoretical speedup among all FMMs we test, Figure 2 ) has better Naive FMM performance than ABC FMM and AB FMM. This is because the total number of times for packing in 3, 6, 3 is very large (N A× m = nnz( U ), N B× m = nnz( V )). This magnifies the overhead for packing with AB FMM/ABC FMM. • Contrary to what was observed in [1] , for rank-k updates (middle column, right column, Figure 7 ), 2, 2, 2 still performs the best with ABC FMM implementations ( [1] observe some other shapes, e.g. 4, 2, 4 , tend to have higher performance). This is because their implementations are similar to Naive FMM, with the overhead for forming the M r matrices explicitly. • Figure 6 shows that for small problem size, when k is small, ABC FMM performs best; when k is large, AB FMM/Naive FMM perform better. That can be quantitatively explained by comparing the coefficients of N X m in the bottom table in Figure 5 . • The graph for m = n = 14400, k varies, ABC, 1core (left column, Figure 6 ; middle column, Figure 7) shows that for k equal to the appropriate multiple of k C ( k = L−1 l=0 k l × k C ), ABC FMM achieves the best performance.
D. Apply performance model to code generator
For actual performance, even the best implementation has some unexpected drops, due to the "fringes" which are caused by the problem sizes not being divisible by partition dimesions m, k, n. This is not captured by our performance model. Therefore, given the specific problem size and shape, we choose the best two implementations predicted by our performance model as the top two candidate implementations, and then measure the performance in practice to pick the best one.
In Figure 8 we show the performance results on single core by selecting the generated FMM implementation with the guide of performance model, when m=k=n; m=n= 14400, k varies; and k=1024, m=n vary.
Overall this experiment shows that the performance model is accurate enough in terms of relative performance between various FMM implementations to guide the choice of a FMM implementation, with the problem sizes and shapes as the inputs. That will reduce the potential overhead of exhaustive empirical search.
V. PERFORMANCE EXPERIMENTS
We present performance evaluations for various generated FMM implementations. A. Implementation and architecture information The FMM implementations generated by our code generator are written in C, utilizing SSE2 and AVX assembly, compiled with the Intel C compiler version 15.0.3 with optimization flag -O3 -mavx.
We compare against our generated DGEMM (based on the packing routines and micro-kernel borrowed from BLIS, marked as BLIS in the performance figures) as well as Intel MKL's DGEMM [19] (marked as MKL in the performance figures).
We measure performance on a dual-socket Intel Xeon E5-2680 v2 (Ivy Bridge, 10 cores/socket) processor with 12.8 GB/core of memory (Peak Bandwidth: 59.7 GB/s with four channels) and a three-level cache: 32 KB L1 data cache, 256 KB L2 cache and 25.6 MB L3 cache. The stable CPU clockrate is 3.54 GHz when a single core is utilized (28.32 GFLOPS peak, marked in the graphs) and 3.10 GHz when ten cores are in use (24.8 GLOPS/core peak). To set thread affinity and to ensure the computation and the memory allocation all reside on the same socket, we disable hyperthreading explicitly and use KMP_AFFINITY=compact.
The blocking parameters, n R = 4, m R = 8, k C = 256, n C = 4096 and m C = 96, are consistent with parameters used for the standard BLIS DGEMM implementation for this architecture. This makes the size of the packing buffer A i 192 KB and B p 8192 KB, which then fit the L2 cache and L3 cache, respectively.
Parallelization is implemented mirroring that described in [20] , using OpenMP directives that parallelize the third loop around the micro-kernel in Figure 1 .
B. Benefit of hybrid partitions
First, we demonstrate the benefit of using different FMM algorithms for each level.
We report the performance of different combinations of one-level/two-level 2, 2, 2 , 2, 3, 2 , and 3, 3, 3 in Figure 9 , when k is fixed to 1200 and m = n vary. As suggested and illustrated in Section IV-C, ABC FMM performs best for rank-k updates, which is why we only show the ABC FMM performance.
Overall the hybrid partitions 2, 2, 2 + 2, 3, 2 and 2, 2, 2 + 3, 3, 3 achieve the best performance. This is because 1200 is close to 2 × 3× k C , meaning that the hybrid partitions of 2 and 3 on the k dimension are more favorable. This is consistent with what the performance model predicts. Performance benefits are less for 10 cores due to bandwidth limitations, although performance of hybrid partitions still beats two-level homogeneous partitions.
This experiment shows the benefit of hybrid partitions, facilitated by the Kronecker product representation.
C. Sequential and parallel performance
Results when using a single core are presented in Figures 2, 6, and 7. Our generated ABC FMM implementation outperforms AB FMM/Naive FMM and reference implementations from [1] for rank-k updates (when k is small). For very large square matrices, our generated AB FMM or Naive FMM can achieve competitive performance with reference implementations [1] that is linked with Intel MKL. These experiments support the validity of our model. Figure 10 reports performance results for ten cores within the same socket. Memory bandwidth contention impacts the performance of various FMM when using many cores. Nonetheless we still observe the speedup of FMM over GEMM. For smaller matrices and special shapes such as rankk updates, our generated implementations achieve better performance than reference implementations [1] .
VI. CONCLUSION
We have discussed a code generator framework that can automatically implement families of FMM algorithms for Strassen-like fast matrix multiplication algorithms. This code generator expresses the composition of multi-level FMM algorithms as Kronecker products. It incorporates the matrix summations that must be performed for FMM into the inherent packing and micro-kernel operations inside GEMM, avoiding extra workspace requirement and reducing the overhead of memory movement. Importantly, it generates an accurate performance model to guide the selection of a FMM implementation as a function of problem size and shape, facilitating the creation of poly-algorithms that select the best algorithm for a problem size. Comparing with state-of-the-art results, we observe a significant performance improvement for smaller matrices and special matrix multiplication shapes such as rank-k updates, without the need for exhaustive empirical search.
There are a number of avenues for future work:
• Task parallelism and various parallel schemes are proposed in the recent literature [21] , [1] . We need to pursue how our techniques compare to these and how to combine these with our advances. It may be possible to utilize our performance model for task scheduling. • Finding the new FMM algorithms by searching the coefficient matrix U, V, W is an NP-hard problem [22] . It may be possible to prune branches with the performance model as the cost function during the search process. • In [2] , it is shown that Intel Xeon Phi coprocessor (KNC) can benefit from ABC variation of STRASSEN. It may be possible to get performance benefit by porting our code generator to generate variations of FMM implementations for many-core architecture such as second-generation Intel Xeon Phi coprocessor (KNL). • The asymptotic communication lower bound for Strassen's algorithm and matrix multiplication has been characterized in [23] , [24] , [25] , [26] . It may be possible to apply our performance model to constrain the coefficients of the cubic and quadratic terms and get more precise lower bound for specific architectures.
