Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also the shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, including our novel parallelization scheme. This allows us to rapidly benchmark over 20 fast algorithms on several problem sizes. Furthermore, we discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical.
Introduction
Matrix multiplication is one of the most fundamental computations in numerical linear algebra and scientific computing. Consequently, the computation has been extensively studied in parallel computing environments (cf. [3, 20, 34] and references therein). In this paper, we show that fast algorithms for matrix-matrix multiplication can achieve higher performance on sequential and shared-memory parallel architectures for modestly sized problems. By fast algorithms, we mean ones that perform asymptotically fewer floating point operations and communicate asymptotically less data than the classical algorithm. We also provide a code generation framework to rapidly implement sequential and parallel versions of over 20 fast algorithms. Our performance results in Section 5 show that several fast algorithms can outperform the Intel Math Kernel Library (MKL) dgemm (double precision general matrix-matrix multiplicaPermission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. PPoPP'15, , February 7-11, 2015 , San Francisco, CA, USA. Copyright c 2015 ACM 978-1-4503-3205-7/15/02. . . $15.00. http://dx.doi.org/10.1145/2688500.2688513 tion) routine and Strassen's algorithm [32] . In parallel implementations, fast algorithms can achieve a speedup of 5% over Strassen's original fast algorithm and greater than 15% over MKL.
However, fast algorithms for matrix multiplication have largely been ignored in practice. For example, numerical libraries such as Intel's MKL [19] , AMD's Core Math Library (ACML) [1] , and the Cray Scientific Libraries package (LibSci) [8] do not provide implementations of fast algorithms, though we note that IBM's Engineering and Scientific Subroutine Library (ESSL) [18] does include Strassen's algorithm. Why is this the case? First, users of numerical libraries typically consider fast algorithms to be of only theoretical interest and never practical for reasonable problem sizes. We argue that this is not the case with our performance results in Section 5. Second, fast algorithms do not provide the same numerical stability guarantees as the classical algorithm. In practice, there is some loss in precision in the fast algorithms, but they are not nearly as bad as the worst-case guarantees [14, 27] . Third, the LINPACK benchmark 1 used to rank supercomputers by performance forbids fast algorithms. We suspect that this has driven effort away from the study of fast algorithms.
Strassen's algorithm is the most well known fast algorithm, but this paper explores a much larger class of recursive fast algorithms based on different base case dimensions. We review these algorithms and methods for constructing them in Section 2. The structure of these algorithms makes them amenable to code generation, and we describe this process and other performance tuning considerations in Section 3. In Section 4, we describe three different methods for parallelizing fast matrix multiplication algorithms on shared-memory machines. Our code generator implements all three parallel methods for each fast algorithm. We evaluate the sequential and parallel performance characteristics of the various algorithms and implementations in Section 5 and compare them with MKL's implementation of the classical algorithm as well as an existing implementation of Strassen's algorithm.
The goal of this paper is to help bridge the gap between theory and practice of fast matrix multiplication algorithms. By introducing our tool of automatically translating a fast matrix multiplication algorithm to high performance sequential and parallel implementations, we enable the rapid prototyping and testing of theoretical developments in the search for faster algorithms. We focus the attention of theoretical researchers on what algorithmic characteristics matter most in practice, and we demonstrate to practical researchers the utility of several existing fast algorithms besides Strassen's, motivating further effort towards high performance implementations of those that are most promising. Our contributions are summarized as follows:
• By using new fast matrix multiplication algorithms, we achieve better performance than Intel MKL's dgemm, both sequentially and with 6 and 24 cores on a shared-memory machine.
• We demonstrate that, in order to achieve the best performance for matrix multiplication, the choice of fast algorithm depends on the size and shape of the matrices. Our new fast algorithms outperform Strassen's on the multiplication of rectangular matrices.
• We show how to use code generation techniques to rapidly implement sequential and shared-memory parallel fast matrix multiplication algorithms.
• We provide a new hybrid parallel algorithm for shared-memory fast matrix multiplication.
• We implement a fast matrix multiplication algorithm with asymptotic complexity O(N 2.775 ) for square N ×N matrices (discovered by Smirnov [31] ). In terms of asymptotic complexity, this is the fastest matrix multiplication algorithm implementation to date. However, our performance results show that this algorithm is not practical for the problem sizes that we consider.
Overall, we find that Strassen's algorithm is hard to beat for square matrix multiplication, both in serial and in parallel. However, for rectangular matrices (which occur more frequently in practice), other fast algorithms can perform much better. The structure of the fast algorithms that perform well tend to "match the shape" of the matrices, an idea that we will make clear in Section 5. We also find that bandwidth is a limiting factor towards scalability in shared-memory parallel implementations of fast algorithms. Our parallel implementations of fast algorithms suffer when memory bandwidth does not scale linearly with the number of cores. Finally, we find that algorithms that are theoretically fast in terms of asymptotic complexity do not perform well on problems of modest size that we consider on shared-memory parallel architectures. We discuss these conclusions in more detail in Section 6.
Finally, all of the software used for this paper is available at https://github.com/arbenson/fast-matmul.
Related Work
There are several sequential implementations of Strassen's fast matrix multiplication algorithm [2, 11, 17] , and parallel versions have been implemented for both shared-memory [9, 25] and distributedmemory architectures [3, 13] . For our parallel algorithms in Section 4, we use the ideas of breadth-first and depth-first traversals of the recursion trees, which were first considered by Kumar et al. [25] and Ballard et al. [3] for minimizing memory footprint and communication.
Apart from Strassen's algorithm, a number of fast matrix multiplication algorithms have been developed, but only a small handful have been implemented. Furthermore, these implementations have only been sequential. Hopcroft and Kerr showed how to construct recursive fast algorithms where the base case is multiplying a p × 2 by a 2×n matrix [16] . Bini et al. introduced the concept of arbitrary precision approximate (APA) algorithms for matrix multiplication and demonstrated a method for multiplying 3 × 2 by 2 × 2 matrices which leads to a general square matrix multiplication APA algorithm that is asymptotically faster than Strassen's [5] . Schönhage also developed an APA algorithm that is asymptotically faster than Strassen's, based on multiplying square 3 × 3 matrices [30] . These APA algorithms suffer from severe numerical issues-both lose at least half the digits of accuracy with each recursive step. While no exact solution can have the same complexity as Bini's algorithm [16] , it is still an open question if there exists an exact fast algorithm with the same complexity as Schönhage's. Pan used factorization of trilinear forms and a base case of 70×70 square matrix multiplication to construct an exact algorithm asymptotically faster than Strassen's algorithm [29] . This algorithm was implemented by Kaporin [22] , and the running time was competitive with Strassen's algorithm on a sequential machine. Recently, Smirnov presented optimization tools for finding many fast algorithms based on factoring bilinear forms [31] , and we will use these tools for finding our own algorithms in Section 2.
There are several lines of theoretical research (cf. [12, 36] and references therein) that prove existence of fast APA algorithms with much better asymptotic complexity than the algorithms considered here. Unfortunately, there remains a large gap between the substantial theoretical work and what we can practically implement.
Renewed interest in the practicality of Strassen's and other fast algorithms is motivated by the observation that not only is the arithmetic cost reduced when compared to the classical algorithm, the communication costs also improve asymptotically [4] . That is, as the relative cost of moving data throughout the memory hierarchy and between processors increases, we can expect the benefits of fast algorithms to grow accordingly. We note that communication lower bounds [4] apply to all the algorithms presented in this paper, and in most cases they are attained by the implementations used here.
Notation and Tensor Preliminaries
We briefly review basic tensor preliminaries, following the notation of Kolda and Bader [24] . Scalars are represented by lowercase Roman or Greek letters (a), vectors by lowercase boldface (x), matrices by uppercase boldface (A), and tensors by boldface Euler script letters (T). For a matrix A, we use a k and a i j to denote the kth column and i, j entry, respectively. A tensor is a multidimensional array, and in this paper we deal exclusively with order-3, real-valued tensors; i.e., T ∈ R I×J×K . The kth frontal slice of
I×J×K with entries t i jk = u i v j w k . Addition of tensors is defined entry-wise. The rank of a tensor T is the minimum number of rank-one tensors that generate T as their sum. Decompositions of the form T = R r=1 u r • v r • w r lead to fast matrix multiplication algorithms (Section 2.2), and we use U, V, W to denote the decomposition, where U, V, and W are matrices with R columns given by u r , v r , and w r . Of the various flavors of products involving tensors, we will need to know that, for
Fast Matrix Multiplication
We now review the preliminaries for fast matrix multiplication algorithms. In particular, we focus on factoring tensor representations of bilinear forms, which will facilitate the discussion of the implementation in Sections 3 and 4.
Recursive Multiplication
Matrices are self-similar, i.e., a submatrix is also a matrix. Arithmetic with matrices is closely related to arithmetic with scalars, and we can build recursive matrix multiplication algorithms by manipulating submatrix blocks. For example, consider multiplying C = A · B,
where we have partitioned the matrices into four submatrices. Throughout this paper, we denote the block multiplication of M ×K and K × N matrices by M, K, N . Thus, the above computation is 2, 2, 2 . Multiplication with the classical algorithm proceeds by combining a set of eight matrix multiplications with four matrix additions:
The multiplication to form each M i is recursive and the base case is scalar multiplication. The number of flops performed by the classical algorithm for N × N matrices, where N is a power of two, is 2N 3 − N 2 . The idea of fast matrix multiplication algorithms is to perform fewer recursive matrix multiplications at the expense of more matrix additions. Since matrix multiplication is asymptotically more expensive than matrix addition, this tradeoff results in faster algorithms. The most well known fast algorithm is due to Strassen, and follows the same block structure:
We have explicitly written out terms like T 2 = B 11 to hint at the generalizations provided in Section 2.2. Strassen's algorithm uses 7 matrix multiplications and 18 matrix additions. The number of flops performed by the algorithm is 7N log 2 7 −6N 2 = O(N 2.81 ), when we assume a base case of N = 1.
There are natural extensions to Strassen's algorithm. We might try to find an algorithm using fewer than 7 multiplications; unfortunately, we cannot [37] . Alternatively, we could try to reduce the number of additions. This leads to the Strassen-Winograd algorithm, which reduces the 18 additions down to 15. We explore such methods in Section 3.3. We can also improve the constant on the leading term by choosing a bigger base case dimension (and using the classical algorithm for the base case). This turns out not to be important in practice because the base case will be chosen to optimize performance rather than flop count. Lastly, we can use blocking schemes apart from 2, 2, 2 , which we explain in the remainder of this section. This leads to a host of new algorithms, and we show in Section 5 that they are often faster in practice.
Fast Algorithms as Low-Rank Tensor Decompositions
The approach we use to devise fast algorithms exploits an important connection between matrix multiplication (and other bilinear forms) and tensor computations. We detail the connection in this section for completeness; see [7, 23] for earlier explanations.
A bilinear form on a pair of finite-dimensional vector spaces is a function that maps a pair of vectors to a scalar and is linear in each of its inputs separately. A bilinear form B(x, y) can be represented by a matrix D of coefficients: B(x, y) = x T Dy = i j d i j x i y j , where we note that x and y may have different dimensions. In order to describe a set of K bilinear forms B k (x, y) = z k , 1 ≤ k ≤ K, we can use a three-way tensor T of coefficients:
or, in more succinct tensor notation, z = T × 1 x × 2 y.
Low-Rank Tensor Decompositions
The advantage of representing the operations using a tensor of coefficients is a key connection between the rank of the tensor to the arithmetic complexity of the corresponding operation. Consider the "active" multiplications between elements of the input vectors (e.g., x i · y j ). The conventional algorithm, following Equation (1), will compute an active multiplication for every nonzero coefficient in T. However, suppose we have a rank-R decomposition of the
for all i, j, k, where U, V, and W are matrices with R columns each. We will also use the equivalent notation T = U, V, W . Substituting Equation (2) into Equation (1) and rearranging, we have for
m r w kr , where s = U T x, t = V T y, and m = s * t. 2 This reduces the number of active multiplications (now between linear combinations of elements of the input vectors) to R. Here we highlight active multiplications with (·) notation.
Assuming R < nnz (T), this reduction of active multiplications, at the expense of increasing the number of other operations, is valuable when active multiplications are much more expensive than the other operations. This is the case for recursive matrix multiplication, where the elements of the input vectors are (sub)matrices, as we describe below.
Tensor Representation of Matrix Multiplication
Matrix multiplication is a bilinear operation, so we can represent it as a tensor computation. In order to match the notation above, we vectorize the input and output matrices A, B, and C using row-wise ordering, so that x = vec (A), y = vec (B), and z = vec (C).
For every triplet of matrix dimensions for valid matrix multiplication, there is a fixed tensor that represents the computation so that T × 1 vec (A) × 2 vec (B) = vec (C) holds for all A, B, and C. For example, if A and B are both 2 × 2, the corresponding 4 × 4 × 4 tensor T has frontal slices
This yields, for example,
By Strassen's algorithm, we know that although this tensor has 8 nonzero entries, its rank is at most 7. Indeed, that algorithm corresponds to a low-rank decomposition represented by the following triplet of matrices, each with 7 columns:
As in the previous section, for example,
Note that in the previous section, the elements of the input matrices are already interpreted as submatrices (e.g., A 11 and M 1 ); here we represent them as scalars (e.g., a 11 and m 1 ).
We need not restrict ourselves to the 2, 2, 2 case; there exists a tensor for matrix multiplication given any set of valid dimensions. When considering a base case of M ×K by K ×N matrix multiplication (denoted M, K, N ), the tensor has dimensions MK×KN×MN and MKN non-zeros. In particular, t i jk = 1 if the following three conditions hold: (1) 
Otherwise, t i jk = 0 (here we assume entries are 1-indexed).
Approximate Tensor Decompositions
The APA algorithms discussed in Section 1.1 arise from approximate tensor decompositions. With Bini's algorithm, for example, the factor matrices (i.e., the corresponding U, V, W ) have entries 1/λ and λ. As λ → 0, the low-rank tensor approximation approaches the true tensor. However, as λ gets small, we suffer from loss of precision in the floating point calculations of the resulting fast algorithm. Setting λ = √ minimizes the loss of accuracy for one step of Bini's algorithm, where is machine precision [6] , but even in this case at least half the digits are lost with a single recursive step of the algorithm.
Finding Fast Algorithms
We conclude this section with a description of a method for searching for and discovering fast algorithms for matrix multiplication. Our search goal is to find low-rank decompositions of tensors corresponding to matrix multiplication of a particular set of dimensions, which will identify fast, recursive algorithms with reduced arithmetic complexity. That is, given a particular base case M, K, N and the associated tensor T, we seek a rank R and matrices U, V, and W that satisfy Equation (1). Table 1 summarizes the algorithms that we find and use for numerical experiments in Section 5.
The rank of the decomposition determines the number of active multiplications, or recursive calls, and therefore the exponent in the arithmetic cost of the algorithm. The number of other operations (additions and inactive multiplications) will affect only the constants in the arithmetic cost. For this reason, we want sparse U, V, and W matrices with simple values (like ±1), but that goal is of secondary importance compared to minimizing the rank R. Note that these constant values do affect performance of these algorithms for reasonable matrix dimensions in practice, though mainly because of how they affect the communication costs of the implementations rather than the arithmetic cost. We discuss this in more detail in Section 3.2.
Equivalent Algorithms
Given an algorithm U, V, W for base case M, K, N , we can transform it to an algorithm for any of the other 5 permutations of the base case dimensions with the same number of multiplications. This is a well known property [15] ; here we state the two transformations that generate all permutations in our notation. We let P I×J be the permutation matrix that swaps row-order for column-order in the vectorization of an I × J matrix. In other words, if A is I × J,
Fast algorithms for a given base case also belong to equivalence classes. Two algorithm are equivalent if one can be generated from another based on the following transformations [10, 21] . 
Numerical Search
Given a rank R for base case M, K, N , Equation (1) defines (MKN) 2 polynomial equations of the form given in Equation (2) . Because the polynomials are trilinear, alternating least squares (ALS) can be used to iteratively compute an approximate (numerical) solution to the equations. That is, if two of the three factor matrices are fixed, the optimal third factor matrix is the solution to a linear least squares problem. Thus, each outer iteration of ALS involves alternating among solving for U, V, and W, each of which can be done efficiently with a QR decomposition, for example. This approach was first proposed for fast matrix multiplication search by Brent [7] , but ALS has been a popular method for general low-rank tensor approximation for as many years (see [24] and references therein).
The main difficulties ALS faces for this problem include getting stuck at local minima, encountering ill-conditioned linear least-squares problems, and, even if ALS converges to machineprecision accuracy, computing dense U, V, and W matrices with floating point entries. We follow the work of Johnson and McLoughlin [21] and Smirnov [31] in addressing these problems. We use multiple starting points to handle the problem of local minima, add regularization to help with the ill-conditioning, and encourage sparsity in order to recover exact factorizations (with integral or rational values) from the approximations.
The most useful techniques in our search have been (1) exploiting equivalence transformations [21, Eq. (6) ] to encourage sparsity and obtain discrete values and (2) using and adjusting the regularization penalty term [31, Eq. (4-5)] throughout the iteration. As described in earlier efforts, algorithms for small base cases can be discovered nearly automatically. However, as the values M, N, and K grow, more hands-on tinkering using heuristics seems to be necessary to find discrete solutions.
Implementation and Practical Considerations
We now discuss our code generation method for fast algorithms and the major implementation issues. All experiments were conducted on a single compute node on NERSC's Edison. Each node has two 12-core Intel 2.4 GHz Ivy Bridge processors and 64 GB of memory.
Code Generation
Our code generator automatically implements a fast algorithm in C++ given the U, V, and W matrices representing the algorithm. The generator simultaneously produces both sequential and parallel implementations. We discuss the sequential code in this section and the parallel extensions in Section 4. For computing C = A · B, the following are the key ingredients of the generated code:
• Using the entries in the U and V matrices, form the temporary matrices S r and T r , 1 ≤ r ≤ R, via matrix additions and scalar multiplication. The S r and T r are linear combinations of sub-blocks of A and B, respectively. For each S r and T r , the corresponding linear combination is a custom implementation. Scalar multiplication by ±1 is replaced with native addition / subtraction operators. The code generator can produce three variants of matrix additions, which we describe in Section 3.2. When a column of U or V contains a single non-zero element, there is no matrix addition (only scalar multiplication). In order to save memory, the code generator does not form a temporary matrix in this case. The scalar multiplication is piped through to subsequent recursive calls and is eventually used in a base case call to dgemm.
• Recursive calls to the fast matrix multiplication routine compute M r = S r · T r , 1 ≤ r ≤ R.
• Using the entries of W, linear combinations of the M r form the output C. Matrix additions and scalar multiplications are again handled carefully, as above.
• Common subexpression elimination detects redundant matrix additions, and the code generator can automatically implement algorithms with fewer additions. We discuss this process in more detail in Section 3.3.
• Dynamic peeling [33] accounts for matrices whose dimensions are not evenly divided by the base case of the fast algorithm. This method handles the boundaries of the matrix at each recursive level, and requires no additional memory. (Other methods, such as zero-padding, require additional memory). With dynamic peeling, the implementation can multiply matrices of any dimensions. Figure 1 shows performance benchmarks of the code generator's implementation. In order to compare the performance of matrix multiplication algorithms with different computational costs, we use the effective GFLOPS metric for P × Q × R matrix multiplication:
We note that effective GFLOPS is only the true GFLOPS for the classical algorithm (the fast algorithms perform fewer floating point operations). However, this metric lets us compare all of the algorithms on an inverse-time scale, normalized by problem size [13, 27] . We compare our code-generated Strassen implementation with MKL's dgemm and a tuned implementation of Strassen-Winograd from D'Alberto et al. [9] (recall that Strassen-Winograd performs the same number of multiplications but fewer matrix additions than Strassen's algorithm). The code generator's implementation outperforms MKL and is competitive with the tuned implementation. [9] . The problems sizes are square. The generated code easily outperforms MKL and is competitive with the tuned code.
Thus, we are confident that the general conclusions we draw with code-generated implementations of fast algorithms will also apply to hand-tuned implementations.
Handling Matrix Additions
While the matrix multiplications constitute the bulk of the running time, matrix additions are still an important performance optimization. We call the linear combinations used to form S r , T r , and C i j addition chains. For example, S 1 = A 11 +A 22 is an addition chain in Strassen's algorithm. We consider three different implementations for the addition chains:
1. Pairwise: With r fixed, compute S r and T r using the daxpy BLAS routine for all matrices in the addition chain. This requires nnz (u r ) calls to daxpy to form S r and nnz (v r ) calls to form T r . After the recursive computations of the M r , we follow the same strategy to form the output. The ith sub-block (rowwise) of C requires nnz w i,: daxpy calls. 3 2. Write-once: With r fixed, compute S r and T r with only one write for each entry (instead of, for example, nnz (v r ) writes for S r with the pairwise method). In place of daxpy, stream through the necessary submatrices of A and B and combine the entries to form S r and T r . This requires reading some submatrices of A and B several times, but writing to only one output stream at a time. Similarly, we write the output matrix C once and read the M r several times.
Streaming:
Read each input matrix once and write each temporary matrix S r and T r once. Stream through the entries of each sub-block of A and B, and update the corresponding entries in all temporary matrices S r and T r . Similarly, stream through the entries of the M r and update all submatrices of C.
Each daxpy call requires two matrix reads and one matrix write (except for the first call in an addition chain, which is a copy and requires one read and one write). Let nnz (U, V, W) = nnz (U) + nnz (V) + nnz (W). Then the pairwise additions perform
The write-once additions perform nnz (U, V, W) submatrix reads and at most 2R + MN submatrix writes. We do not need to write any data for the columns of U and V with a single nonzero entry. These correspond to addition chains that are just a copy, for example, T 2 = B 11 in Strassen's algorithm. While we perform fewer reads and writes than the pairwise additions, the complexity of our code increases (we have to write our own additions), and we can no longer use a tuned daxpy routine. We do not worry about code complexity because we use code generation. Since the problem is bandwidth-bound and compilers can automatically vectorize for loops, we don't expect the latter concern to be an issue.
Finally, the streaming additions perform MK+KN+R submatrix reads and at most 2R + MN submatrix writes. This is fewer reads than the write-once additions, but we have increased the complexity of the writes. Specifically, we alternate writes to different memory locations, whereas with the write-once algorithm, we write to a single (contiguous) output stream.
The three methods also have different memory footprints. With pairwise or write-once, S r and T r are formed just before computing M r . After M r is computed, the memory becomes available. On the other hand, the streaming algorithm must compute all temporary matrices S r and T r simultaneously, and hence needs R times as much memory for the temporary matrices. We will explore the performance of the three methods at the end of Section 3.3.
Common Subexpression Elimination
The S r , T r , and M r matrices often share subexpressions. For example, in our 4, 2, 4 fast algorithm (see Table 1 ), T 11 and T 25 are:
Both T 11 and T 25 share the subexpression B 12 + B 22 , up to scalar multiplication. Thus, there is opportunity to remove additions / subtractions:
At face value, eliminating additions would appear to improve the algorithm. However, there are two important considerations. First, using Y 1 with the pairwise or write-once approaches requires additional memory (with the streaming approach it requires only additional local variables).
Second, we discussed in Section 3.2 that an important metric is the number of reads and writes. If we use the write-once algorithm, we have actually increased the number of reads and writes. Originally, forming T 11 and T 25 required six reads and two writes. By eliminating the common subexpression, we performed two fewer reads in forming T 11 and T 25 but needed an additional two reads and one write to form Y 1 . In other words, we have read the same amount of data and written more data. In general, eliminating the same length-two subexpression k times reduces the number of matrix reads and writes by k − 3. Thus, a length-two subexpression must appear at least four times for elimination to reduce the total number of reads and writes in the algorithm. Figure 2 shows the performance all three matrix addition methods from Section 3.2, with and without common subexpression elimination (CSE). For CSE, we greedily eliminate length-two subexpressions. In general, the write-once algorithm without CSE performs the best on the rectangular matrix multiplication problem sizes. For these problems, CSE lowers performance of the writeonce algorithm and has little to modest effect on the streaming and pairwise algorithms. For square matrix problems, the best variant is less clear, but write-once with no elimination often performs the highest. We use write-once without CSE for the rest of our performance experiments.
Recursion Cutoff Point
In practice, we take only a few steps of recursion before calling a vendor-tuned library classical routine as the base case (in our case, Intel MKL's dgemm). One method for determining the cutoff point is to benchmark each algorithm and measure where the implementation outperforms dgemm. While this is sustainable for the analysis of any individual algorithm, we are interested in a large class of fast algorithms. Furthermore, a simple set of cutoff points limits understanding of the performance and will have to be re-measured for different architectures. Instead, we provide a rule of thumb based on the performance of dgemm. Figure 3 shows the performance of Intel MKL's sequential and parallel dgemm routines. We see that the routines exhibit a "ramp-up" phase and then flatten for sufficiently large problems. In both serial and parallel, multiplication of square matrices (N × N × N computation) tends to level at a higher performance than the problem shapes with a fixed dimension (N × 800 × N and N × 800 × 800). Our principle for recursion is to take a recursive step only if the sub-problems fall on the flat part of the curve. If the ratio of performance drop in the DGEMM curve is greater than the speedup per step (as listed in Table 1 ), then taking an additional recursive step cannot improve performance. 4 Finally, we note that some of our parallel algorithms call the sequential dgemm routine in the base case. Both curves will be important to our parallel fast matrix multiplication algorithms in Section 4.
Parallel Algorithms for Shared Memory
We present three algorithms for parallel fast matrix multiplication: depth-first search (DFS), breadth-first search (BFS), and a hybrid of the two (HYBRID). In this work, we target shared memory machines, although the same ideas generalize to distributed memory. For example, DFS and BFS ideas are used for a distributed memory implementation of Strassen's algorithm [27] .
Depth-First Search
The DFS algorithm is straightforward: when recursion stops, the classical algorithm uses all threads on each sub-problem. In other words, we use parallel matrix multiplication on the leaf nodes of a depth-first traversal of the recursion tree. At a high-level, the code path is exactly the same as in the sequential case, and the main parallelism is in library calls. The advantages of DFS are that the memory footprint matches the sequential algorithm and the code is simpler-parallelism in multiplications is hidden inside library calls. Furthermore, matrix additions are trivially parallelized. The key disadvantage of DFS is that the recursion cutoff point is larger (Figure 3) , limiting the number of recursive steps. On Edison's 24-core compute node, the recursion cutoff point is around N = 5000.
Breadth-First Search
The BFS algorithm uses task-based parallelism. Each leaf node in the matrix multiplication recursion tree is an independent task. The recursion tree also serves as a dependency graph: we need to compute all M r , 1 ≤ r ≤ R, (children) before forming the result (parent). The major advantage of BFS is we can take more recursive steps because the recursion cutoff point is based on the sequential dgemm curves. Matrix additions to form S r and T r are part of the task that computes M r . In the first level of recursion, matrix additions to form C i j from the M r are handled in the same way as DFS, since all threads are available.
The BFS approach has two distinct disadvantages. First, it is difficult to load balance the tasks because the number of threads Figure 2 . Effective performance (Equation (3)) comparison of common subexpression elimination (CSE) and the three matrix addition methods: write-once, streaming, and pairwise (see Section 3.2). The 4, 2, 4 fast algorithm computed N × 1600 × N ("outer product" shape) for varying N, and the 4, 2, 3 fast algorithm computed N × N × N (square multiplication). Write-once with no CSE tends to have the highest performance, especially for the 4, 2, 4 algorithm. Pairwise is slower because it performs more reads and writes. Figure 3 . Performance curves of MKL's dgemm routine in serial (left) and in parallel (right) for three different problem shapes. The performance curves exhibit a "ramp-up" phase and then flatten for large enough problems. Performance levels near N = 1500 in serial and N = 5000 in parallel. For large problems in both serial and parallel, N × N × N multiplication is faster than N × 800 × N, which is faster than N × 800 × 800. We note that sequential performance is faster than per-core parallel performance due to Intel Turbo Boost, which increases the clock speed from 2.4 to 3.2 GHz. With Turbo Boost, peak sequential performance is 25.6 GFLOPS. Peak parallel performance is 19.2 GFLOPS/core. may not divide the number of tasks evenly. Also, with only one step of recursion, the number of tasks can be smaller than the number of threads. For example, one step of Strassen's algorithm produces only 7 tasks and one step of the fast 3, 2, 3 algorithm produces only 15 tasks. Second, BFS requires additional memory since the tasks are executed independently. In a fast algorithm for M, K, N with R multiplies, each recursive step requires a factor R/(MN) more memory than the output matrix C to store the M r . There are additional memory requirements for the S r and T r matrices, as discussed in Section 3.2.
Hybrid
Our novel hybrid algorithm compensates for the load imbalance in BFS by applying the DFS approach on a subset of the base case problems. With L levels of recursion and P threads, the hybrid algorithm applies task parallelism (BFS) to the first R L −(R L mod P) multiplications. The number of BFS sub-problems is a multiple of P, so this part of the algorithm is load balanced. All threads are used on each of the R L mod P remaining multiplications (DFS). An alternative approach uses another level of hybridization: evenly assign as many as possible of the remaining R L mod P multiplications to disjoint subsets of P < P threads (where P divides P), and then finish off the still-remaining multiplications with all P threads. This approach reduces the number of small multiplications assigned to all P threads where perfect scaling is harder to achieve. However, it leads to additional load balancing concerns in practice and requires a more complicated task scheduler.
Implementation
The code generation from Section 3.1 produces code that can compile to the DFS, BFS, or HYBRID parallel algorithms. We use OpenMP to implement each algorithm. The overview of the parallelization is:
• DFS: Each dgemm call uses all threads. Matrix additions are always fully parallelized.
• BFS: Each recursive matrix multiplication routine and the associated matrix additions are launched as an OpenMP task. At each recursive level, the taskwait barrier ensures that all M r matrices are available to form the output matrix.
• HYBRID: Matrix multiplies are either launched as an OpenMP task (BFS), or the number of MKL threads is adjusted for a parallel dgemm (DFS). This is implemented with the if conditional clause of OpenMP tasks. Again, taskwait barriers ensure that the M r matrices are computed before forming the output matrix. We use an explicit synchronization scheme with OpenMP locks to ensure that the DFS steps occur after the BFS tasks complete. This ensures that there is no oversubscription of threads.
Shared-Memory Bandwidth Limitations
The performance gains of the fast algorithms rely on the cost of matrix multiplications to be much larger than the cost of matrix additions. Since matrix multiplication is compute-bound and matrix addition is bandwidth-bound, these computations scale differently with the amount of parallelism. For large enough matrices, MKL's dgemm achieves near-peak performance of the node (Figure 3 ). On the other hand, the STREAM benchmark [28] shows that the node achieves around a five-fold speedup in bandwidth with 24 cores. In other words, in parallel, matrix multiplication is near 100% parallel efficiency and matrix addition is near 20% parallel efficiency. The bandwidth bottleneck makes it more difficult for parallel fast algorithms to be competitive with parallel MKL. To illuminate this issue, we will present performance results with both 6 and 24 cores. Using 6 cores avoids the bandwidth bottleneck and leads to much better performance per core. Figure 4 shows the performance of the BFS, DFS, and HYBRID parallel methods with both 6 and 24 cores for three representative algorithms. The left plot shows the performance of Strassen's algorithm on square problems. With 6 cores, HYBRID does the best for small problems. Since Strassen's algorithm uses 7 multiplies, BFS has poor performance with 6 cores when using one step of recursion. While all 6 cores can do 6 multiplies in parallel, the 7th multiply is done sequentially (with HYBRID, the 7th multiply uses all 6 cores). With two steps of recursion, BFS has better load balance but is forced to work on smaller sub-problems. As the problems get larger, BFS outperforms HYBRID due to synchronization overhead when HYBRID switches from BFS to DFS steps. When the matrix dimension is around 15,000, the fast algorithm achieves a 25% speedup over MKL. Using 24 cores, HYBRID and DFS are the fastest. With one step of recursion, BFS can achieve only sevenfold parallelism. With two steps, there are 49 sub-problems, so one core is assigned 3 sub-problems while all others are assigned 2. In general, we see that it is much more difficult to achieve speedups with 24 cores. However, Strassen's algorithm has a modest performance gain over MKL for large problem sizes (∼ 5% faster). The middle plot of Figure 4 shows the 4, 2, 4 fast algorithm (26 multiplies) for N × 2800 × N problems. With 6 cores, HYBRID is fastest for small problems and BFS becomes competitive for larger problems, where the performance is 15% better than MKL. In Section 5, we show that 4, 2, 4 is also faster than Strassen's algorithm for these problems. With 24 cores, we see that HYBRID is drastically faster than MKL on small problems. For example, HY-BRID is 75% faster on 3500 × 2800 × 3500. 5 As the problem sizes get larger, we experience the bandwidth bottleneck and HYBRID achieves around the same performance as MKL. BFS uses one step of recursion and is consistently slower since it parallelizes 24 of 26 multiplies and uses only 2 cores on the last 2 multiplies. While multiple steps of recursion creates more load balance, the sub-problems are small enough that performance degrades even more. DFS follows a similar ramp-up curve as MKL, but the sub-problems are still too small to see a performance benefit.
Performance Comparisons
The right plot of Figure 4 shows the 4, 3, 3 fast algorithm (29 multiplies) for N × 3000 × 3000. We see similar trends as for the other problem sizes. With 6 cores, HYBRID does well for all problem sizes. Speedups are around ∼ 5% for large problems. With 24 cores, HYBRID is again drastically faster than MKL for small problem sizes and about the same as MKL for large problems.
Performance Experiments
We now present performance results for a variety of fast algorithms on several problem sizes. Based on the results of Section 4.5, we take the best of BFS and HYBRID when using 6 cores and the best of DFS and HYBRID when using 24 cores. For rectangular problem sizes in both sequential and parallel, we take the best of one or two steps of recursion. And for square problem sizes, we take the best of one, two, or three steps of recursion. Additional recursive steps do not improve the performance for the problem sizes we consider.
The square problem sizes for parallel benchmarks require the most memory-for some algorithms, three steps of recursion results in out-of-memory errors. In these cases, the original problem consumes 6% of the memory. For these algorithms, we only record the best of one or two steps of recursion in the performance plots. Finally, all timings are the median of five trials.
Sequential Performance
Figures 5 and 6 summarize the sequential performance of several fast algorithms. For N × N × N problems ( Figure 5) , we test the algorithms in Table 1 and some of their permutations. For example, we test 4, 4, 2 and 4, 2, 4 , which are permutations of 2, 4, 4 . In total, over 20 algorithms are tested for square matrices. Two of these algorithms, Bini's 3, 2, 2 and Schönhage's 3, 3, 3 are APA algorithms. We note that APA algorithms are of limited practical interest; even one step of recursion causes numerical errors in at least similar speedups using our code generator and a classical, 2, 3, 4 recursive algorithm (24 multiplies). half the digits (a better speedup with the same or better numerical accuracy can be obtained by switching to single precision). For the problem sizes N × 1600 × N and N × 2400 × 2400 (Figure 6 ), we evaluate the algorithms that are comparable to, or outperform, Strassen's algorithm. The results are summarized as follows:
1. All of the fast algorithms outperform MKL for large enough problem sizes. These algorithms are implemented with our code generator and use only the high-level optimizations described in Section 3.1. Since the fast algorithms perform less computation and communication, we expect this to happen.
2. For square matrices, Strassen's algorithm often performs the best. This is mostly due to its relatively small number of matrix additions in comparison to other fast algorithms. On large problem sizes, Strassen's algorithm provides around a 20% speedup over MKL's dgemm. The right plot of Figure 5 shows that some algorithms are competitive with Strassen's algorithm on large problems. These algorithms have large speedups per recursive step (see Table 1 ). While Strassen's algorithm can take more recursive steps, memory constraints and the cost of additions with additional recursive steps cause Strassen's algorithm to be on par with these other algorithms.
3. Although Strassen's algorithm has the highest performance for square matrices, other fast algorithms have higher performance for N × 1600 × N and N × 2400 × 2400 problem sizes (Figure 6 ). The reason is that the fixed dimension constrains the number of recursive steps. With multiple recursive steps, the matrix sub-blocks become small enough so that dgemm does not achieve good performance on the sub-problem. Thus, algorithms that get a better speedup per recursive step have higher performance for these problem sizes.
4. For rectangular matrices, algorithms that "match the shape" of the problem tend to perform the best. For example, 4, 2, 4 and 3, 2, 3 both have the "outer product" shape of the N × 1600 × N problem sizes and have the highest performance. Similarly, 4, 2, 3 and 4, 3, 3 have the highest performance of the exact algorithms for N × 2400 × 2400 problem sizes. The 4, 2, 4 and 4, 3, 3 algorithms provide around a 5% performance improvement over Strassen's algorithm and a 10% performance improvement over MKL on N × 1600 × N and N × 2400 × 2400, respectively. The reason follows from the performance explanation from Result 3. Only one or two steps of recursion improve performance. Thus, algorithms that match the problem shape and have high speedups per step perform the best.
5. Bini's 3, 2, 2 APA algorithm typically has the highest performance on rectangular problem sizes. However, we remind the reader that the approximation used by this algorithm results in severe numerical errors. Figure 7 shows the parallel performance for multiplying square matrices and Figure 8 shows the parallel performance for N × 2800 × N and N × 3000 × 3000 problem sizes. We include performance on both 6 and 24 cores in order to illustrate the bandwidth issues discussed in Section 4.5. We observe the following patterns in the parallel performance data: Figure 7 . Effective parallel performance (Equation (3)) of fast algorithms on square problems using only 6 cores (top row) and all 24 cores (bottom row). With 6 cores, bandwidth is not a bottleneck and we see similar trends to the sequential algorithms. With 24 cores, speedups over MKL are less dramatic, but Strassen's (bottom left), 3, 3, 2 (bottom left), and 4, 3, 3 (bottom right) all outperform MKL and have similar performance. Bini and Schöhage have high performance, but they are APA algorithms and suffer from severe numerical problems.
Parallel Performance
1. With 6 cores, bandwidth scaling is not a problem, and we find many of the same trends as in the sequential case. All fast algorithms outperform MKL. Apart from the APA algorithms, Strassen's algorithm is typically fastest for square matrices. The 3, 2, 3 fast algorithm has the highest performance for the N × 2800 × N problem sizes, while 4, 3, 3 and 4, 2, 3 have the highest performance for the N × 3000 × 3000. These algorithms match the shape of the problem.
2. With 24 cores, MKL's dgemm is typically the highest performing algorithm for rectangular problem sizes (bottom row of Figure 8) . In these problems, the ratio of time spent in additions to time spent in multiplications is too large, and bandwidth limitations prevent the fast algorithms from outperforming MKL. We also benchmarked the asymptotically fastest implementation of square matrix multiplication. The algorithm consists of composing 3, 3, 6 , 3, 6, 3 , 6, 3, 3 base cases. At the first recursive level, we use 3, 3, 6 ; at the second level 3, 6, 3 ; and at the third, 6, 3, 3 . The composed fast algorithm is for 3 · 3 · 6, 3 · 6 · 3, 6 · 3 · 3 = 54, 54, 54 . Each step of the composed algorithm computes 40 3 = 64000 matrix multiplications. The asymptotic complexity of this algorithm is Θ(N ω 0 ), with ω 0 = 3 log 54 (40) ≈ 2.775.
Although this algorithm is asymptotically the fastest, it does not perform well for the problem sizes considered in our experiments. For example, with 6 cores and BFS parallelism, the algorithm achieved only 8.4 effective GFLOPS/core multiplying square with dimension N = 13000. This is far below MKL's performance ( Figure 7) . We conclude that while the algorithm may be of theoretical interest, it does not perform well on the modest problem sizes of interest on shared memory machines.
Discussion
Our code generation framework lets us benchmark a large number of existing and new fast algorithms and test a variety of implementation details, such as how to handle matrix additions and how to implement the parallelism. However, we performed only high-level optimizations; we believe more detailed tuning of fast algorithms can provide performance gains. Based on the performance results we obtain in this work, we can draw several conclusions in bridging the gap between the theory and practice of fast algorithms.
First, in the case of multiplying square matrices, Strassen's algorithm consistently dominates the performance of exact algorithms (in sequential and parallel). Even though Smirnov's exact algorithm and Schönhage's APA algorithm are asymptotically faster in theory, they never outperform Strassen's for reasonable matrix dimensions in practice (sequential or parallel). This sheds some doubt on the prospect of finding a fast algorithm that will outperform Strassen's on square matrices; it will likely need to have a small base case and still offer a significant reduction in multiplications.
On the other hand, another conclusion from our performance results is that for multiplying rectangular matrices (which occurs more frequently than square in practice), there is a rich space for improvements. In particular, fast algorithms with base cases that match the shape of the matrices tend to have the highest performance. There are many promising algorithms, and we suspect that algorithm-specific optimizations will prove fruitful.
Third, in the search for new fast algorithms, our results confirm the importance of the (secondary) metric of sparsity of the U, V, W factor matrices. Although the arithmetic cost associated with the sparsity is negligible in practice, the communication cost associated with each nonzero can be performance limiting. We note that the communication costs of the streaming additions algorithm Figure 8 . Effective parallel performance (Equation (3)) of fast algorithms on rectangular problems using only 6 cores (top row) and all 24 cores (bottom row). Problem sizes are an "outer product" shape, N × 2800 × N and multiplication of tall-and-skinny matrix by a small square matrix, N × 3000 × 3000. With six cores, all fast algorithms outperform MKL, and new fast algorithms achieve about a 5% performance gain over Strassen's algorithm. With 24 cores, bandwidth is a bottleneck and MKL outperforms fast algorithms.
is independent of the sparsity, but the highest-performing additions algorithm in practice is the write-once algorithm, which is sensitive to the number of nonzeros.
Fourth, we have identified a parallel scaling impediment for fast algorithms on shared-memory architectures. Because the memory bandwidth often does not scale with the number of cores, and because the additions and multiplications are separate computations in our framework, the overhead of the additions compared to the multiplications worsens in the parallel case. This hardware bottleneck is unavoidable on most shared-memory architectures, though we note that it does not occur in distributed memory where aggregate memory bandwidth scales with the number of nodes.
We would like to extend our framework to the distributedmemory case, in part because of the better prospects for parallel scaling. A larger fraction of the time is spent in communication for the classical algorithm on this architecture, and fast algorithms can reduce the communication cost in addition to the computational cost in this case [3] . Similar code generation techniques will be helpful in exploring performance in this case.
As matrix multiplication is the main computational kernel in linear algebra libraries, we also want to incorporate these fast algorithms into frameworks like BLIS [35] and PLASMA [26] to see how they affect a broader class of numerical algorithms.
Finally, we have not explored the numerical stability of the exact algorithms in order to compare their results. While theoretical bounds can be derived from each algorithm's U, V, W representation, it is an open question which algorithmic properties are most influential in practice; our framework will allow for rapid empirical testing. As numerical stability is an obstacle to widespread use of fast algorithms, extensive testing can help alleviate (or confirm) common concerns.
