The communication cost of algorithms (also known as I/Ocomplexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. For sequential algorithms these bounds are attainable and so optimal.
INTRODUCTION
The communication of an algorithm (e.g., transferring data between the CPU and memory devices, or between parallel processors, a.k.a. I/O-complexity) often costs significantly more time than its arithmetic 1 . It is therefore of interest to obtain lower bounds for the communication needed on the one hand, and to design and implement algorithms minimizing communication 2 and attaining these lower bounds on the other hand.
While Moore's Law predicts an exponential speedup of hardware in general, the annual improvement rate of timeper-arithmetic-operation has, over the years, consistently exceeded that of time-per-word read/write [GSP04] . The fraction of running time spent on communication is thus expected to increase further.
Communication model.
We model communication costs of sequential and parallel architecture as follows. In the sequential case, with two levels of memory hierarchy (fast and slow), communication means reading data items (words) from slow memory (of unbounded size), to fast memory (of size M ) and writing data from fast memory to slow memory 3 . Words that are stored contiguously in slow memory can be read or written in a bundle which we will call a message. We assume that a message of n words can be communicated between fast and slow memory in time α + βn where α is the latency (seconds per message) and β is the inverse bandwidth (seconds per word). We define the bandwidth cost of an algorithm to be the total number of words communicated and the latency cost of an algorithm to be the total number of messages communicated. We assume that the input matrices initially resides in slow memory, and is too large to fit in the smaller fast memory. Our goal then is to minimize bandwidth and latency costs. 4 In the parallel case, we assume p processors, each with memory of size M . We are interested in the communication among the processors. As in the sequential case, we assume that a message of n consecutively stored words can be communicated in time α + βn. This cost includes the time required to "pack" non-contiguous words into a single message, if necessary. We assume that the input is initially evenly distributed among all processors, so M · p is at least as large as the input. Again, the bandwidth cost and latency cost are the words and messages counts respectively. However, we count the number of words and messages communicated along the critical path as defined in [YM88] (i.e., two words that are communicated simultaneously are counted only once), as this metric is closely related to the total running time of the algorithm. As before, our goal is to minimize the number of words and messages communicated.
We assume that (1) the cost per flop is the same on each processor and the communication costs (α and β) are the same between each pair of processors, (2) all communication is "blocking": a processor can send/receive a single message at a time, and cannot communicate and compute a flop simultaneously (the latter assumption can be dropped, affecting the running time by a factor of two at most), and (3) there is no communication resource contention among processors. For example, if processor 0 sends a message of size n to processor 1 at time 0, and processor 2 sends a message of size n to processor 3 also at time 0, the cost along the critical path is α + βn. However, if both processor 0 and processor 1 try to send a message to processor 2 at the same time, the cost along the critical path will be the sum of the costs of each message.
The Computation Graph and Implementations of an Algorithm.
The computation performed by an algorithm on a given input can be modeled as a computation directed acyclic graph (CDAG) : We have a vertex for each input / intermediate / output argument, and edges according to direct dependencies (e.g., for the binary arithmetic operation x := y +z we have a directed edge from vy to vx and from vz to vx, where the vertices vx, vy, vz stand for the arguments x, y, z, respectively). bounds in this paper thus apply to the model with memory hierarchy as well. 4 The sequential communication model used here is sometimes called the two-level I/O model or disk access machine (DAM) model (see [AV88, BBF + 07, CR06]). Our bandwidth cost model follows that of [HK81] and [ITT04] in that it assumes the block-transfer size is one word of data (B = 1 in the common notation). However, our model allows message sizes to vary from one word up to the maximum number of words that can fit in fast memory.
An implementation of an algorithm determines, in the parallel model, which arithmetic operations are performed by which of the p processors. This corresponds to partitioning the corresponding CDAG into p parts. Edges crossing between the various parts correspond to arguments that are in the possession of one processor, but are needed by another processor, therefore relate to communication. In the sequential model, an implementation determines the order of the arithmetic operations, in a way that respects the partial ordering of the CDAG (see Section 3 relating this to communication cost).
Implementations of an algorithm may greatly vary in their communication costs. The I/O-complexity of an algorithm is the minimum bandwidth cost of the algorithm, over all possible implementations. The I/O-complexity of a problem is defined to be the minimum I/O-complexity of all algorithms for this problem.
There are quite a few I/O-complexity lower and upper bounds of specific algorithms (see below). These are results of the form: any implementation of algorithm Alg requires at least X communication (or: there is an implementation for algorithm Alg that requires at most X communication). However, we are not aware of I/O-complexity lower bounds for a problem, i.e., of the form: any algorithm for a problem P requires at least X communication. The lower bounds in this paper are for all the implementations for a family of algorithms: Strassen-like 5 fast matrix multiplication.
Previous Work.
Consider the classical Θ(n 3 ) algorithm for matrix multiplication. While naïve implementations are communication inefficient, communication-minimizing sequential and parallel variants of this algorithm were constructed, and proved optimal, by matching lower bounds [Can69, HK81, FLPR99, ITT04] .
In [BDHS10a, BDHS10b] we generalize the results of [HK81, ITT04] regarding matrix multiplication, to obtain new I/Ocomplexity lower bounds for a much wider variety of algorithms. Most of our bounds are shown to be tight. This includes algorithms for LU factorization, Cholesky factorization, LDL T factorization, QR factorization, as well as algorithms for eigenvalues and singular values. Thus we essentially cover all direct methods of linear algebra. The results hold for dense matrix algorithms (most of them have O(n 3 ) complexity), as well as sparse matrix algorithms (whose running time depends on the number of non-zero elements, and their locations). They apply to sequential and parallel algorithms, to compositions of linear algebra operations (like computing the powers of a matrix), and to certain graph theoretic problems 6 .
In [BDHS10a, BDHS10b] we use the approach of [ITT04], based on the Loomis-Whitney geometric theorem [LW49, BZ88] , by embedding segments of the computation process into a three dimensional cube. This approach, however, is not suitable when distributivity is used, as is the case in Strassen [Str69] and other fast matrix multiplication algorithms (e.g., [CW90, CKSU05] ).
While the I/O-complexity of classic matrix multiplication and algorithms with similar structure is quite well understood, this is not the case for algorithms of more complex structure. Avoiding the communication of parallel classical matrix multiplication was addressed [Can69] almost simultaneously with the publication of Strassen's fast matrix multiplication [Str69] . Moreover, an I/O-complexity lower bound for the classical matrix multiplication algorithm is known for almost three decades [HK81] . Nevertheless, the I/O-complexity of Strassen's fast matrix multiplication and similar algorithms has not yet been resolved.
In this paper we obtain the first communication cost lower bound for Strassen's and other fast matrix multiplication algorithm, in the sequential and parallel models. For sequential algorithms these bounds are attainable and so optimal.
Communication Costs of Fast Matrix Multiplication
Upper bound.
The I/O-complexity IO(n) of Strassen's algorithm (see Algorithm 1, Appendix B), applied to n-by-n matrices on a machine with fast memory of size M , can be bounded above as follows (for actual uses of Strassen's algorithm, see [DHSS94, HLJJ + 96, DS04]): Run the recursion until the matrices are sufficiently small. Then, read the two input sub-matrices into the fast memory, perform the matrix multiplication inside the fast memory, and write the result into the slow memory 7 . We thus have
and IO
.
(1)
Lower bound.
In this paper, we obtain a tight lower bound:
Theorem 1. (Main Theorem) The I/O-complexity IO(n) of Strassen's algorithm on a machine with fast memory of size M , assuming that no arithmetic operation is computed twice 8 , is
(2)
It holds for any implementation and any known variant of Strassen's algorithm 9,10 . This includes Winograd's O(n lg 7 ) variant that uses 15 additions instead of 18, which is the most used fast matrix multiplication algorithm in practice [DHSS94, HLJJ + 96, DS04]. For parallel algorithms, using a reduction from the sequential to the parallel model (see e.g., [ITT04] or our [BDHS10b]) this yields:
Corollary 2. Let IO(n) be the I/O-complexity of Strassen's algorithm, run on a machine with p processors, each with a local memory of size M . Assume that no arithmetic operation is computed twice. Then
Note that although recomputation is forbidden here, replication of the input is allowed. Specifically, for multiplying matrices of maximum possible size (i.e., M = Θ
. We can extend the bounds to a wider class of all Strassenlike fast matrix multiplication algorithms. Let Alg be any Strassen-like matrix multiplication algorithm that runs in time O(n ω 0 ) for some 2 < ω0 < 3. Then, using the same arguments that lead to (1), the I/O-complexity of Alg can be shown to be
. We obtain a matching lower bound:
(3)
Corollary 4. Let IO(n) be the I/O-complexity of a Strassen-like algorithm (with arithmetic performed as in Theorem 3), run on a machine with p processors, each with a local memory of size M . Assume that no arithmetic operation is computed twice. Then
) .
For the "2D" case M = Θ
Corollaries 2 and 4 can be generalized to other models, such as the heterogenous model (where processors have different memory sizes and communication and computation speeds), and shared memory model. The reduction is achieved by observing the communication of a single processor.
The Expansion Approach
The proof of the main theorem is based on estimating the edge expansion of the computation directed acyclic graph (CDAG) of an algorithm. The I/O-complexity is shown to be closely connected to the edge expansion properties of this graph. As the graph has a recursive structure, the expansion can be analyzed directly (combinatorially, similarly to what is done in [Mih89, ASS08, KKK10]) or by spectral analysis (in the spirit of what was done for the Zig-Zag expanders [RVW02] ). There is, however, a new technical challenge. The replacement product and the Zig-Zag product act similarly on all vertices. This is not what happens in our case: multiplication and addition vertices behave differently. The expansion approach is similar to the one taken by Hong and Kung [HK81] . They use the red-blue pebble game to obtain tight lower-bounds on the I/O-complexity of many algorithms, including ordinary matrix multiplication, matrix-vector multiplication, and FFT. The proof is obtained by showing that the size of any subset of the vertices of the CDAG is bounded by a function of the size of its dominator set.
On the one hand, their dominator set technique has the advantage of allowing recomputation of any intermediate value. We were not able to allow recomputation using our edge expansion approach. On the other hand, the dominator set requires large input or output. Such an assumption is not needed by the edge expansion approach, as the bounds are guaranteed by edge expansion of many (internal) parts of the CDAG. In that regard, one can view the approach of [ITT04] (also in [BDHS10a, BDHS10b] ) as an edge expansion assertion on the CDAGs of the corresponding classical algorithms.
The study of expansion properties of a CDAG was also suggested as one of the main motivations of Lev and Valiant [LV83] in their work on superconcentrators. They point out many papers proving that classes of algorithms computing DFT, matrix inversion and other problems all have to have good expansion properties, thus providing lower-bounds on the number of the arithmetic operations required.
Other papers study connections between bounded-space computation, and combinatorial expansion-related properties of the corresponding CDAG (see e.g., [Sav94, BP99, BPD00] and their references).
Paper organization.
Section 2 contains preliminaries on the notions of graph expansion. In Section 3 we state and prove the connection between I/O-complexity and the expansion properties of the computation graph. In Section 4 we analyze the expansion of the CDAG of Strassen's algorithm. We present the generalization of the bounds to other fast matrix multiplication algorithms, conclusions and open problems in Section 5.
PRELIMINARIES
Edge expansion.
is the set of edges connecting the vertex sets A and B. We omit the subscript G when the context makes it clear.
Expansion of small sets.
For many graphs, small sets of vertices have better expansion guarantee. Let hs(G) denote the edge expansion guarantee for sets of size at most s in G:
In many cases, hs(G) does not depend on |V (G)|, although it may decrease when s increases. One way of bounding hs(G) is by decomposing G into small subgraphs of large edge expansion.
Claim 5. Let G = (V, E) be a d-regular graph that can be decomposed into edge-disjoint (but not necessarily vertex disjoint) copies of a d ′ -regular graph G ′ = (V ′ , E ′ ). Then the edge expansion guarantee of G for sets of size at most
See proof in Appendix A.
When G is not regular.
If G = (V, E) is not regular but has a bounded maximal degree d, then we can add (< d) loops to vertices of degree < d, obtaining a regular graph G ′11 . Note that for any 
I/O-COMPLEXITY AND EDGE EXPAN-SION
In this section we recall the computation graph of an algorithm, then show how a partition argument connects the expansion properties of the graph and the I/O-complexity of the algorithm. A similar partition argument already appeared in [ITT04] , and then in our [BDHS10b] . In both cases it is used to relate I/O-complexity to the Loomis-Whitney geometric bound [LW49] , which can be viewed, in this context, as an expansion guarantee for the corresponding graphs.
The computation graph.
For a given algorithm, we consider the computation (directed) graph G = (V, E), where there is a vertex for each arithmetic operation (AO) performed, and for every input element. G contains a directed edge (u, v), if the output operand of the AO corresponding to u (or the input element corresponding to u), is an input operand to the AO corresponding to v. The in-degree of any vertex of G is, therefore, at most 2 (as the arithmetic operations are binary). The out-degree is, in general, unbounded 12 , i.e., it may be a function of |V |. We next show how an expansion analysis of this graph can be used to obtain the I/O-complexity lower bound for the corresponding algorithm.
The partition argument.
Let M be the size of the fast memory. Let O be any total ordering of the vertices that respects the partial ordering of the CDAG G, i.e., all the edges are going from left to right. This ordering can be thought of as the actual order in which the computations are performed. Let P be any partition of V into segments S1, S2, ..., so that a segment Si ∈ P is a subset of the vertices that are contiguous in the total ordering O.
Let RS and WS be the set of read and write operands, respectively (see Figure 1 ). Namely, RS is the set of vertices outside S that have an edge going into S, and WS is the set of vertices in S that have an edge going outside of S. Then the total I/O-complexity due to reads of AOs in S is at least |RS| − M , as at most M of the needed |RS| operands are already in fast memory when the execution of the segment's 11 Here we use the convention that a loop adds 1 to the degree of a vertex. 12 As the lower bounds are derived for the bounded out-degree case, we will show how to convert the corresponding CDAG to obtain constant out-degree, without affecting the I/Ocomplexity too much. 
Edge expansion and I/O-complexity .
Consider a segment S and its read and write operands RS and WS (see Figure 1 ). If the graph G containing S has h(G) edge expansion 14 , maximum degree d and at least 2|S| vertices, then (using the definition of h(G)), we have
Either (at least) half of the edges E(S, V \ S) touch RS or half of them touch WS. As every vertex is of degree d, we have
Combining this with (6) and choosing to partition V into |V |/s segments of equal size s, we obtain:
= Ω (|V | · h(G)). In many cases h(G) is too small to attain the desired I/O-complexity lower bound. Typically, h(G) is a decreasing function in |V (G)|, namely the edge expansion deteriorates with the increase of the input size and with the running time of the corresponding algorithm. This is the case with matrix multiplication algorithms: the cubic, as well as the Strassen and Strassen-like algorithms. In such cases, it is better to consider the expansion of G on small sets only:
Choosing 15 the minimal s so that
13 One can think of this as a game: the first player orders the vertices. The second player partitions them into contiguous segments. The objective of the first player (e.g., a good programmer) is to order the vertices so that any consecutive partitioning by the second player leads to a small communication count. 14 The direction of the edges does not matter much for the expansion-bandwidth argument: treating all edges as undirected, changes the I/O-complexity estimate by a factor of 2 at most. For simplicity, we will treat G as undirected. 15 The existence of an s that satisfies the condition is not always guaranteed. In the next section we confirm this for we obtain
In some cases, the computation graph G does not fit this analysis: it may not be regular (with vertices of unbounded degree), or its edge expansion may be hard to analyze. In such cases, we may consider some subgraph G ′ of G instead to obtain a lower bound on the I/O-complexity :
where s is chosen so that hs(G ′ )·αs 2 ≥ 3M .
The correctness of this claim follows from Equations (7) and (8), and from the fact that at least an α/2 fraction of the segments have at least α 2 · s of their vertices in G ′ (otherwise
. We therefore have:
Lemma 8. Let Alg be an algorithm with AO(N ) arithmetic operations (N being the total input size, N = Θ(n 2 ) for matrix multiplication) and computation graph G (N 
1). Then the I/O-complexity of Alg 16 on a machine with fast memory of size M is
As AO(N ) = Θ(|V ′ |) and hs(G ′ (N )) for s = AO(M ) is Θ(h(G ′ (M ))) (recall Claim 5) we obtain, equivalently,
(11)
EXPANSION PROPERTIES OF STRASSEN'S ALGORITHM
Recall Strassen's algorithm for matrix multiplication (see Algorithm 1 in Appendix B) and consider its computation graph (see Figure 2 ). Let H lg n be the computation graph of Strassen's algorithm on input matrices of size n × n. H lg n has the following structure: encode A: generate weighted sums of elements of A. Similarly encode B. Then multiply the encodings of A and B element-wise. Finally, decode C, by taking weighted sums of the products. This is the structure of all the fast matrix multiplication algorithms that were obtained since Strassen's 17 .
Assume w.l.o.g. that n is an integer power of 2. Denote by Enc lg n A the part of H lg n that corresponds to the encoding Strassen, for sufficiently large |V (G)| (in particular, |V (G)| has to be larger than M ). Indeed this is the interesting case, as otherwise all computations can be performed inside the fast memory, with no communication, except for reading the input once. . G ′ is the graph Dec k C for k = lg M , see Section 4 for the definition of Dec k C. 17 Indeed, any fast matrix multiplication algorithm can be converted into one of this form [Raz03] . of matrix A. Similarly, Enc lg n B, and Dec lg n C correspond to the parts of H lg n that compute the encoding of B and the decoding of C, respectively. Duplicate Dec1C 7 i times. Duplicate DeciC four times. We next identify the 4 · 7 i output vertices of the copies of Dec1C with the 4 · 7 i input vertices of the copies of DeciC. Recall that each Dec1C has four output vertices. The first output vertex of the 7 i Dec1C graphs are identified with the 7 i input vertices of the first copy of DeciC. The second output vertex of the 7 i Dec1C graphs are identified with the 7 i input vertices of the second copy of DeciC. And so on. We make sure that the jth input vertex of a copy of DeciC is identified with an output vertex of the jth copy of Dec1C.
We similarly obtain Enci+1A from EnciA and Enc1A (and Enci+1B from EnciB and Enc1B). For every i, Hi is obtained by connecting edges from the jth output vertices of EnciA and EnciB to the jth input vertex of DeciC.
The graph Dec1C has no vertices which are both input and output, therefore:
Fact 9. Dec lg n C is of maximal degree 6.
However, Enc1A and Enc1B have vertices which are both input and output (e.g., A11), therefore Enc lg n A and Enc lg n B have vertices of out-degree Θ(lg n). All in-degrees are at most 2, as an arithmetic operation has at most two inputs.
As H lg n contains vertices of large degrees, it is easier to consider Dec lg n C: it contains only vertices of constant bounded degree, yet at least one third of the vertices of H lg n are in it. See proof in Section 4. Assume w.l.o.g. that n is an integer power of √ M . 18 Then Dec lg n C can be split into edge-disjoint copies of Dec 1 2 lg M C. 18 We may assume this, as we are dealing with a lower bound here, so it suffices to prove the assertion for an infinite num-Using Claim 5, we thus deduce the expansion of Dec lg n C on small sets:
Corollary 11. s·hs(Dec lg n C) ≥ 3M for s = 9·M lg 7/2 .
As Dec lg n C contains α = 1 3 of the vertices of H lg n , Lemma 8 now yields Main Theorem 1. Note that Dec lg n C has no input vertices, so no restriction on input replication is needed.
Dec1C is presented, for simplicity, with vertices of indegree larger than two (but constant). A vertex of degree larger than two, in fact, represents a full binary tree. Note that replacing these high in-degree vertices with trees changes the edge expansion of the graph by a constant factor at most (as this graph is of constant size, and connected). Moreover, as there is no change in the number of input and output vertices, the arguments in the following proof of Lemma 10 still hold.
Combinatorial Estimation of the Expansion.
Proof of Lemma 10.
) k , where c is some universal constant, and d is the constant degree of Dec k C (after adding loops to make it regular). The proof works as follows. Recall that G k is a layered graph, so all edges (excluding loops) connect between consecutive levels of vertices. We argue (in Claim 15) that each level of G k contains about the same fraction of S vertices, or else we have many edges leaving S. We also observe (in Fact 16) that such homogeneity (of a fraction of S vertices) does not hold between distinct parts of the lowest level, or, again, we have many edges leaving S. We then show that the homogeneity between levels combined with the heterogeneity of the lowest level, guarantees that there are many edges leaving S.
Let li be the ith level of vertices of G k , so 4 k = |l1| < |l2| < · · · < |li| = 4 k−i+1 7 i−1 < · · · < |l k+1 | = 7 k . Let Si ≡ S ∩ li. Let σ = |S| |V | be the fractional size of S and σi = |S i | |l i | be the fractional size of S in level i. Due to averaging, we observe the following:
Fact 12. There exist i and i ′ such that σi ≤ σ ≤ σ i ′ .
From the geometric sum, we now have:
Fact 13.
. ber of n's. Alternatively, in the following decomposition argument, we leave out a few of the top or bottom levels of vertices of Dec lg n C, so that n is an integer power of √ M and so that at most |S|/2 vertices of S are cut off.
Proof. Let G ′ be a G1 component connecting li with li+1 (so it has four vertices in li and seven in li+1). G ′ has no edges in E(S, V \ S) if all or none of its vertices are in S. Otherwise, as G ′ is connected, it contributes at least one edge to E(S, V \ S). The number of such G1 components with all their vertices in S is at most min{σi, σi+1} · |l i | 4 . Therefore, there are at least |σi − σi+1| · |l i | 4 G1 components with at least one vertex in S and one vertex that is not.
Claim 15 (Homogeneity between levels). If there exists i so that |σ−σ i | σ ≥ 1 10 , then
where c > 0 is some constant depending on G1 only.
Proof. Assume that there exists j so that
By the initial assumption, there exists j so that
for any c ≤ c ′ 10 · 3 7 . Let T k be a tree corresponding to the recursive construction of G k in the following way (see Figure 3) . T k is a tree of height k + 1, where each internal node has four children. The root r of T k corresponds to l k+1 (the largest level of G k ). The four children of r correspond to the largest levels of the four graphs that one can obtain by removing the level of vertices l k+1 from G k . And so on. For every node u of T k , denote by Vu the set of vertices in G k corresponding to u. We thus have |Vr| = 7 k where r is the root of T k , |Vu| = 7 k−1 for each node u that is a child of r; and in general we have 4 i tree nodes u corresponding to a set of size |Vu| = 7 k−i+1 . Each leaf l correspond to a set of size 1.
For a tree node u, let us define ρu = |S∩Vu| |Vu| to be the fraction of S nodes in Vu, and δu = |ρu − ρ p(u) |, where p(u) is the parent of u (for the root r we let p(r) = r). We let ti be the ith level of T k , counting from the bottom, so t k+1 is the root and t1 are the leaves.
Fact 16. As Vr = l k+1 we have ρr = σ k+1 . For a tree leaf u ∈ t1, we have |Vu| = 1. Therefore ρu ∈ {0, 1}. The number of vertices u in t1 with ρu = 1 is σ1 · |l1|.
Claim 17. Let u0 be an internal tree node, and let u1, u2, u3, u4 be its four children. where c ′′ = c ′′ (G1).
Proof. The proof follows that of Claim 14. Let G ′ be a G1 component, connecting Vu 0 with ∪ i∈ [4] Vu i (so it has seven vertices in Vu 0 and one in each of Vu 1 ,Vu 2 ,Vu 3 ,Vu 4 ). G ′ has no edges in E(S, V \S) if all or none of its vertices are in S. Otherwise, as G ′ is connected, it contributes at least one edge to E(S, V \S). The number of G1 components with all their vertices in S is at most min{ρu 0 , ρu 1 , ρu 2 , ρu 3 , ρu 4 } · |Vu 1 | 4 . Therefore, there are at least max i∈ [4] 
components with at least one vertex in S and one vertex that is not.
As each internal node has four children, this is
where v ∼ r is the path from v to the root r. This is at least c ′′ · d · ∑ v∈t 1 |ρu − ρr|, by the triangle inequality for | · |. By Fact 16, it is ≥ c ′′ · d · |l1| · ((1 − σ1) · ρr + σ1 · (1 − ρr)). By Claim 15, w.l.o.g., |σ k+1 − σ|/σ ≤ 1 10 and |σ1 − σ|/σ ≤ 1 10 . As ρr = σ k+1 , |E(S, V \ S)| ≥ 3 4 · c ′′ · d · |l1| · σ, and by Fact 13, ≥ c · d · |S| · ( 4 7 ) k , for any c ≤ 3 4 · c ′′ . This completes the proof of Lemma 10.
CONCLUSIONS AND OPEN PROBLEMS
We obtained a tight lower bound for the I/O-complexity of Strassen's and Strassen-like fast matrix multiplication algorithms. These bounds are optimal for the sequential model with two memory levels and with memory hierarchy. The lower bounds extend to the parallel model and other models. We are not aware of matching upper bounds for the parallel cases. However recently these bounds were attained by 2.5D parallel implementations for Strassen's algorithm and for Strassen-like algorithms [BDH + 11].
Strassen-like Algorithms.
A Strassen-like algorithm has a recursive structure that utilizes a base case: multiplying two n0-by-n0 matrices using m(n0) multiplications. Given two matrices of size n-by-n, it splits them into n0-by-n0 blocks (each of size n n 0 -by-n n 0 ), and works blockwise, according to the base case algorithm. Additions (and subtractions) in the base case are interpreted as additions (and subtractions) of blocks. These are performed element-wise. Multiplications in the base case are in-terpreted as multiplications of blocks. These are performed by recursively calling the algorithm. The arithmetic count of the algorithm is then T (n) = m(n0) · T ( n n 0 ) + O(n 2 ), so T (n) = Θ(n ω 0 ) where ω0 = log n 0 m(n0). We further demand that the Dec1C is a connected graph. This class includes Winograd's variant of Strassen's algorithm [Win71] , which uses 15 additions rather than 18, but not the cubic algorithm, where Dec1C is composed of four disconnected graphs (corresponding to the four outputs). We conjecture that Dec1C is connected for all the following fast matrix-multiplication algorithms and they are all Strassen-like: [Pan80, Bin80, Sch81, Rom82, CW82, Str87, CW87], (see [BCS97] for discussion of these algorithms), as well as [CKSU05] , where the base case utilizes a novel group-theoretic approach.
To prove Theorem 3, which generalizes the I/O-complexity lower bound of Strassen's algorithm (Theorem 1) to all Strassenlike algorithms, we note the following:
The entire proof of Theorem 1, and in particular, the computations in the proof of Lemma 10, hold for any Strassenlike algorithm, where we plug in n 2 0 , m(n0), and n 0 m(n 0 ) instead of 4, 7, and 4 7 . For bounding the asymptotic I/Ocomplexity , we do not care about the number of internal vertices of Dec1C; we need only to know that it is connected, and to know the sizes n0 and m(n0). The only nontrivial adjustment is to show the equivalent of Fact 9: that the Dec log n C graph is of bounded degree.
Claim 18. The Dec log n C graph of any Strassen-like algorithm is of degree bounded by a constant.
Proof. If the set of input vertices of Dec1C, and the set of its output vertices are disjoint, then Dec log n C is of constant bounded degree (its maximal degree is at most twice the largest degree of Dec1C).
Assume (towards contradiction) that the base graph Dec1C has an input vertex which is also an output vertex. An output vertex represents the inner product of two n0-long vectors. The corresponding bilinear polynomial is irreducible. This is a contradiction, since an input vertex represents the multiplication of a (weighted) sum of elements of A with a (weighted) sum of elements of B.
Other Fast Matrix Multiplication Algorithms.
Another class of matrix multiplication algorithms, the uniform, non-stationary algorithms, allows mixing of schemes of the previous (Strassen-like) class. In each recursive level, a different scheme may be used. The CDAG has a repeating structure inside one level, but the structure may differ between two distinct levels. This class includes, for example, algorithms that optimize for input sizes, (for size that is not an integer power of a constant integer). The class also includes algorithms that cut the recursion off at some point, and then switch to the classical algorithm. For these and other implementation issues, see [DHSS94, HLJJ + 96] (sequential model) and [DS04] (parallel model). The I/Ocomplexity lower-bound generalizes to this class, and will appear in a separate note [BDHS11c] .
A third class, the non-uniform, non-stationary algorithms, allows recursive calls to have different structure, even when they refer to multiplication of matrices in the same recursive level. It is not clear how to analyze the expansion of the CDAG of an algorithm in the third class, although we are not aware of any algorithm in this class. Such an analysis, applied to the base case of [CKSU05] , may improve the I/O-complexity lower bound by a (large) constant.
Multiplication of rectangular matrices have seen a series of increasingly fast algorithms culminating in Coppersmith's algorithm [Cop97] . We suggest the first lower bounds of the communication costs for these algorithms, and show that in some cases they are attainable, and therefore optimal [BDHS11b] .
As fast matrix multiplication is a basic building block in many fast algorithms in linear-algebra (e.g., LU, QR, Sylvester equation) their I/O-complexity can in many cases be determined using our approach [BDHS11a] .
Other Algorithms, Other Hardware.
Our lower bounds, as well as most of the previous lower bounds deal with linear algebra and numerical analysis algorithms 19 . Our new approach, however, seems general enough to address I/O-complexity lower bounds of other algorithms (recall Lemma 8).
In many cases, the simplest recursive implementation of an algorithm turns out to be communication optimal (e.g., in the cases of matrix multiplication [FLPR99] and Cholesky decomposition [AP00, BDHS10a], but not in the case of LU decomposition [Tol97] ). This leads to the question: when is the communication optimality of an algorithm determined by the expansion properties of the corresponding computation graphs? In this work we showed that such is the case for Strassen-like fast matrix multiplication algorithms.
It is of great interest to construct new models general enough to capture the rich and evolving design space of current and predicted future computers. Such models can be homogeneous, consisting of many layers, where the components of each layer are the same (e.g., a supercomputer with many identical multi-core chips on a board, many identical boards in a rack, many identical racks, and many identical levels of associated memory hierarchy); or heterogeneous, with components with different properties residing on the same level (e.g., CPUs alongside GPUs, where the latter can do some computations very quickly, but are much slower to communicate with).
Some experience has been acquired with such systems (e.g., using GPU assisted linear algebra computation [VD08] ). A first step in analyzing such systems has been recently introduced by Ballard, Demmel, and Gearhart [BDG11], where they modeled heterogenous shared memory architectures, such as mixed GPU/CPU architecture, and obtained tight lower and upper bounds for O(n 3 ) matrix multiplication.
However, there is currently no systematic theoretic way of obtaining upper and lower bounds for arbitrary hardware models. Expanding such results to other architectures and algorithmic techniques is a challenging goal. For example, recursive algorithms tend to be cache oblivious and communication optimal for the sequential hierarchy model. Finding an equivalent technique that would work for an arbitrary architecture is a fundamental open problem.
ACKNOWLEDGMENT
We thank Eran Rom, Edgar Solomonik, and Chris Umans for helpful discussions.
B. STRASSEN'S FAST MATRIX MULTIPLICATION ALGORITHM
Strassen's original algorithm follows [Str69] . See [Win71] for Winograd's variant, which reduces the number of additions.
Algorithm 1 Matrix Multiplication: Strassen's Algorithm Input: Two n × n matrices, A and B. 1: if n = 1 then 2: C11 = A11 · B11 3: else 4:
{Decompose A into four equal square blocks A = ( A11 A12 A21 A22
) and the same for B.} 5: M1 = (A11 + A22) · (B11 + B22) 6: M2 = (A21 + A22) · B11 7: M3 = A11 · (B12 − B22) 8: M4 = A22 · (B21 − B11) 9: M5 = (A11 + A12) · B22 10: M6 = (A21 − A11) · (B11 + B12) 11: M7 = (A12 − A22) · (B21 + B22) 12: 
