A tight Ω((n/ √ M ) log 2 7 M ) lower bound is derived on the I/O complexity of Strassen's algorithm to multiply two n × n matrices, in a two-level storage hierarchy with M words of fast memory. A proof technique is introduced, which exploits the Grigoriev's flow of the matrix multiplication function as well as some combinatorial properties of the Strassen computational directed acyclic graph (CDAG). Applications to parallel computation are also developed. The result generalizes a similar bound previously obtained under the constraint of no-recomputation, that is, that intermediate results cannot be computed more than once. For this restricted case, another lower bound technique is presented, which leads to a simpler analysis of the I/O complexity of Strassen's algorithm and can be readily extended to other "Strassen-like" algorithms.
Introduction
Data movement is increasingly playing a major role in the performance of computing systems, in terms of both time and energy. This current technological trend [19] is destined to continue, since the very fundamental physical limitations on minimum device size and maximum message speed lead to inherent costs when moving data, whether across the levels of a hierarchical memory system or between processing elements of a parallel systems [8] .
The communication requirements of algorithms have been the target of considerable research in the last four decades; however, obtaining significant lower bounds based on such requirements remains and important and challenging task.
In this paper we focus on the I/O complexity of Strassen's matrix multiplication algorithm. Matrix multiplication is a pervasive primitive utilized in many applications. Strassen [24] showed that two n × n matrices can be multiplied with O(n ω ), with ω = log 2 7 ≈ 2.8074, hence with asymptotically fewer than the n 3 arithmetic operations required by the straightforward implementation of the definition of matrix multiplication. This result has motivated a number of efforts which have lead to increasingly faster algorithms, at least asymptotically, with the current record being at ω < 2.3728639 [17] .
XX:2
The I/O complexity of Strassen's matrix multiplication with recomputation Previous and related work I/O complexity has been introduced in the seminal work by Hong and Kung [15] ; it is essentially the number of data transfers between the two levels of a memory hierarchy with a fast memory of M words and a slow memory with an unbounded number of words. They presented techniques to develop lower bounds to the I/O complexity of computations modeled by computational directed acyclic graphs (CDAGs). The resulting lower bounds apply to all the schedules of the given CDAG, including those with recomputation, that is, where some vertices of the CDAG are evaluated multiple times. Among other results, they established an Ω n 3 / √ M lower bound to the I/O complexity of the definition-based matrix multiplication algorithm, which matched a known upper bound upper bound [10] . [4, 3] by using the approach proposed in [14] based on the Loomis-Whitney geometric theorem [18, 26] . The same papers presents tight I/O complexity bounds for various classical linear algebra algorithms, for problems such as LU/Cholesky/LDLT/QR factorization and eigenvalues and singular values computation.
It is natural to wonder what is the impact of Strassen's reduction of the number of arithmetic operations on the number of data transfers. In an important contribution Ballard et al. [5] , obtained an Ω((n/ √ M ) log 2 7 M ) I/O lower bound for Strassen's algorithm, using the "edge expansion approach". The authors extend their technique to a class of "Strassen-like" fast multiplication algorithms and to fast recursive multiplication algorithms for rectangular matrices [2] . This result was later generalized to a broader class of "Strassen-like" algorithms of by Scott et. al [23] using the "path routing" technique. A parallel, "communication avoiding" implementation of Strassen's algorithm whose performance matches the known lower bound [5, 23] , was proposed by Ballard et al. [1] .
The [6] , it is shown that some algorithms admit a portable schedule (i.e., a schedule which achieves optimal performance across memory hierarchies with different access costs) only if recomputation is allowed. A number of lower bound techniques that allow for recomputation have been presented in the literature, including the "S-partition technique" [15] , the "S-span technique" [21] , and the "S-covering technique" [7] which merges and extends aspects from both [15] and [21] . None of these has however been successfully applied to fast matrix multiplication algorithms.
Our results
Our main result is the extension of the Ω((n/ √ M ) log 2 7 M ) I/O complexity lower bound for Strassen's algorithm to schedules with recomputation. A matching upper bound is known, and obtained without recomputation; hence, we can conclude that, for Strassen's algorithm, recomputation does not help in reducing I/O complexity if not, possibly, by a constant factor. In addition to the result itself, the proof technique appears to be of independent interest, since it exploits to a significant extent the divide&conquer nature which is exhibited by many algorithms. We do follow the dominator-set approach pioneered by Hong and Kung in [15] . However, we focus the dominator analysis only on a select set of target vertices, specifically the outputs of the sub-CDAGs of Strassen's CDAG that correspond to sub-problems of a suitable size (a size chosen as a function of the fast memory capacity, M ). Any dominator set of a set of target vertices can be partitioned into two subsets, one internal and one external to the sub-CDAGs. The analysis of the external component of the dominator does require rather elaborate arguments that are specific to Strassen's CDAG. In contrast, the analysis of the internal component can be carried out based only on the fact that the sub-CDAGs compute matrix products, irrespective of the algorithm (in our case, Strassen's) by which the products are computed. To achieve this independence of the algorithm, we resort on the concept of Grigoriev's flow of a function [13] and on a lower bound to such flow established by Savage [22] for matrix multiplication.
As it turns out, for schedules without recomputation, the analysis of the external component of the dominator sets becomes unnecessary and the analysis of the internal components can be somewhat simplified. The result is a derivation of the I/O complexity without recomputation considerably simpler than those in [5, 23] . The technique is easily extended to a class of Strassen-like algorithms.
The paper is organized as follows: in Section 2, we provide the details of our model and of several theoretical notions needed in our analysis. Section 3 describes the simplified lower bound for Strassen's and Strassen-like algorithms, when implemented with no recomputation. In Section 4, we present the I/O complexity lower bound for Strassen's algorithm, when recomputation is allowed. Extensions of the result to a parallel model are also discussed.
Preliminaries
We consider algorithms which compute the product C = AB of n × n matrices A, B with entries form a ring R. Specifically, we focus on algorithms whose execution, for any given n, can be modeled as a computational directed acyclic graph (CDAG) G(V, E), where each vertex v ∈ V represents either an input value or the result of an unit time operation (i.e. an intermediate result or one of the output values), while the directed edges in E represent data dependences. A directed path connecting vertices u, v ∈ V is an ordered sequence of vertices for which u and v are respectively the first, and last vertex such that there is in E a (directed) edge pointing from each vertex in the sequence to its successor. We say that a
Model
We assume that sequential computations are executed on a system with a two-level memory hierarchy consisting of a fast memory or cache of size M , measured in words, and a slow memory of unlimited size. We assume that each memory word can store at most one value form R. An operation can be executed only if all its operands are presents in cache. Data can be moved from the slow memory to the cache by read operations and in the other direction by write operations. Read and write operations are also called I/O operations. We assume the input data to be stored in slow memory at the beginning of the computation. The evaluation of a CDAG in this model can be analyzed by means of the "red-blue pebble game" [15] . The number of I/O operations executed when evaluating a CDAG depends on the "computational schedule," that is, on the order in which vertices are evaluated and on which values are kept in/discarded from cache. The I/O complexity IO G (M ) of CDAG G is defined as the minimum number of I/O operations over all possible computational schedules.
We also consider a parallel model where P processors, each with a local memory of size M , are connected by a network. We assume that the input is initially distributed among the processors, thus requiring that M P ≥ 2n 2 . Processors can exchange point-to-point messages among each other. For this model, we derive lower bounds to the number of words that must be either sent or received by at least one processor during the CDAG evaluation.
Preliminary definitions
The concept of information flow of a function was originally introduced by Grigoriev [13] . We use a revised formulation presented by Savage [22] . We remark that the flow is an inherent property of a function, not of a specific algorithm by which the function may be computed. 
The dominator set concept was originally introduced in [15] .
◮ Definition 3 (Dominator set). Given a CDAG G(V, E), let I ⊂ V denote the set of input vertices. A dominator set for V ′ ⊆ V is a set Γ ⊆ V such that every path from a vertex in I to a vertex in V ′ contains at least a vertex of Γ. A minimum dominator set for V ′ is a dominator set with minimum cardinality.
We will use a specular concept, commonly referred as "post-dominator set". The following lemma relates the size of certain post-dominator sets to the Grigoriev's flow.
Proof. 
Properties of Strassen's Algorithm
Consider Strassen's algorithm [24] when used to compute C = AB, with A and B matrices n×n with entries from the ring R (see Algorithm 1 in Appendix A for a detailed presentation). Let H n×n denote the corresponding CDAG. For n ≥ 2, H n×n can be obtained by using a recursive construction which mirrors the recursive structure of the algorithm. The base of the construction is the H 2×2 CDAG which corresponds to the multiplication of two 2 × 2 matrices using Strassen's algorithm (Figure 2a) . H 2n×2n can then be constructed by composing seven copies of H n×n , each corresponding to one of the seven sub-products generated by the algorithm (see Figure 2b) : n 2 vertex-disjoint copies of CDAGs Enc A (resp., Enc B ) are used to connect the input vertices of H 2n×2n which correspond to the values of the input matrix A (resp., B) to the appropriate input vertices of the seven sub-CDAGs H n×n i
; the output vertices sub-CDAGs H n×n i (which correspond to outputs of the seven sub-products) are connected to the opportune output vertices of the entire H 2n×2n CDAG using n 2 copies of the decoder sub-CDAG Dec. In our proofs we will leverage the following two properties of Strassen's CDAG: 
The validity of this lemmas can be verified through an analysis of respectively, the recursive structure of Strassen's CDAG and the properties of the encoding sub-CDAGs Enc A and Enc B . We refer the reader to Appendix A for detailed proofs.
Strassen-like algorithms
A (n 0 , m 0 )-Strassen-like algorithm has a recursive structure for which in the "base case" two n 0 × n 0 matrices are multiplied using m 0 scalar multiplications, whose result are then combined to obtain the product matrix. Given input matrices of size n × n, the algorithm splits them into n 2 0 sub-matrices of size n/n 0 ×n/n 0 and then proceeds block-wise, according to the base case. Additions (resp., subtractions) in the base case are interpreted as additions (resp., subtractions) of blocks and are performed element-wise. Multiplications in the base case are interpreted as multiplications of blocks and are executed by recursively calling the algorithm. In our analysis we consider Strassen-like algorithms for which each linear combination of the input sub-matrices is used in only one multiplication.
Let H n×n be the CDAG corresponding to a (n 0 , m 0 )-Strassen-like algorithm for input matrices of size n × n . H n×n has a recursive structure analogous to that of Strassen's CDAG H n×n . The base element H n0×n0 of the CDAG construction corresponding to the base case consists of two encoding graphs (corresponding respectively to the sub-CDAGs Enc A and Enc B for H 2×2 ), which compute m 0 linear combinations of entries of respectively the factor matrix A and of B. Corresponding pairs are then multiplied and the outputs are then combined using an decoder sub-CDAG to obtain the output.
Although in general the sub-problems generated by a Strassen-like algorithm may not be input disjoint, in [23] Scott et al. showed that a property similar to Lemma 6 holds:
For completeness, we summarize their result in Appendix A.
Lower bounds for schedules without recomputation
In our presentation we denote as G n×n the CDAG corresponding to the execution of an unspecified algorithm which implements the square matrix multiplication function f n×n : R 2n 2 → R n 2 . Let G q,n×n be a CDAG composed by q vertex-disjoint CDAGs G n×n . The set I (resp., O) of input (resp., output) vertices of G q,n×n is given by the union of the sets of the input (resp., output) vertices of the q sub-CDAGs G n×n . According to this definition, the CDAG of Strassen H n×n is a G
i CDAG for i ∈ 0, 1, . . . , log 2 n − 1 (Lemma 6). 
If run on P processor each equipped with a local memory of size M the I/O complexity is:
Proof. As first observation, note that the q vertex-disjoint sub-CDAGs G
M is composed by the union of the input (resp., output) vertices of the q sub-CDAGs G 2 √ M×2 √ M . Clearly, |X| = |Y | = q4M . We start by proving (3). Let C be any computational schedule for the sequential execution of the algorithm A using a cache of size M for which each intermediate value is computed just once. Each intermediate value must therefore be kept in memory (either cache or slow) until all operations using it as an operand have been computed. The operations corresponding to vertices in G A are executed according to a topological ordering of the vertices, that is, any value corresponding to a vertex in u ∈ X which can be connected through a directed path to a vertex in v ∈ Y will be computed in C before v. We partition C into q segments such that during each segment C j the values corresponding to exactly 4M vertices in Y (denoted as Y j ) are computed for the first (and only) time. We shall now verify that at least M I/O operations are to be executed during every segment C j . The proof is by contradiction. Let Γ j denote the set of vertices of G A which correspond to the at most M values which are stored in the cache at the beginning of the segment and the at most M − 1 values which are loaded into the cache form the slow memory during C j by means of a read I/O operation. We thus have |Γ j | ≤ 2M − 1. In order for the 4M values corresponding to vertices in Y j to be computed during the segment without any additional I/O operation, there must be no path connecting any vertex in Y i to any vertex in I which does not have at least one vertex in Γ j (i.e. Γ j has to be a dominator set of Y j ). If any such path exists then there is a previously computed intermediate result v which is required for the computation of at lest one of the values corresponding to a vertex in Y j which is neither residing in the cache at the beginning of C j , nor loaded from slow memory to the cache. As no intermediate result can be computed twice this would lead to a contradiction. From Corollary 10, we have that any sub-set of 4M elements of X has dominator size at least 2M . This leads to a contradiction.
At least M I/O operations are thus executed during each segment C j . Since, by construction, the q segments are not overlapping, we can conclude that at least qM I/O operations are necessary for the execution of C. This concludes the proof for lower bound in (3).
The proof for the lower bound for the parallel model in equation (4), follows a similar strategy. If P processors are being used, at least one such processor P * must compute at least |Y |/P = q4M/P values corresponding to vertices in Y . The bound follows by applying the same argument discussed for the sequential case to the computation executed by P * . ◭ This general result can be steadily applied to Strassen-like algorithms.
◮ Corollary 12. Consider Strassen's algorithm being used to multiply two square matrices A, B ∈ R n×n . Assuming no intermediate result is ever computed more than once, the I/Ocomplexity of the algorithm run on a sequential machine with a cache of size M is:
IO H n×n (M ) ≥ 1 7 n √ M log 2 7 M.(5)
If run on P processor each equipped with local memory of size M ≤ n 2 the I/O complexity is:
Additionally, the I/O-complexity of any (n 0 , m 0 )-Strassen-like algorithm run on a sequential machine with a cache of size M is:
If run on P processor each equipped with local memory of size M ≤ n 2 the I/O complexity is:
Proof. We provide a simple proof for the sequential cases in (5) and (7). Let us assume without loss of generality that n = 2 a and √ M = 2 b for some a, b ∈ N. At least 3M I/O operations are necessary in order to read all the 2n
2 input values form slow memory to the cache and to write the n 2 output values to the slow memory once they have been computed. The statement of the theorem is therefore trivially verified if n ≤ 2 √ M . For n ≥ 2 √ M , the result in (5) follows from applying Theorem 11 to the H n×n CDAG which, from Lemma 6,
The result in (7) similarly follows from applying Lemma 8. The results for the parallel model in (6) and (8) can be obtained using a generalization similar to the one described in Theorem 11. ◭ While our lower bounds correspond asymptotically to the known bounds in [23], our technique yields a simpler analysis, especially for the I/O complexity of Strassen-like algorithms. Our proof technique is based on the analysis of the recursive structure of Strassen-like algorithms and on the identification of the vertex-disjoint sub-CDAGs corresponding to the various sub-problems. The fact that Theorem 11 applies for any square matrix multiplication algorithm, allows us to obtain significant bounds without a detailed analysis of the CDAG corresponding to the specific algorithm being considered. This property further suggests that our technique may be amenable to deal with hybrid "non-stationary" multiplication algorithms, which allow mixing of schemes of the previous Strassen-like class in different recursive levels (for the "uniform" sub-class) [11] , or even within the same level (for the "non-uniform" sub-class).
Lower bounds for schedules with recomputation
Under no-recomputation assumption once an input value is loaded in memory or an intermediate result is calculated it is then necessary to maintain it in memory (either cache or slow) until the result of each operation which uses it as an input argument has been evaluated. In this section we present a new lower bound technique which yields a novel asymptotically tight I/O lower bound for Strassen's algorithm both in the sequential and parallel model. We start by presenting some technical lemmas which will then be used for the proof of our main result in Theorem 15. denote the seven sub-CDAGs of H 2n×2n , each corresponding to one of the seven sub-products generated by the first recursive step of Strassen's algorithm. Let Z i (resp., Y i ) denote the subset of Z (resp., Y) which correspond to vertices in H it is possible to connect all the vertices in K i to a subset of the vertices in X via vertex disjoint paths.
◮

For each of the 2n
2 encoder sub-CDAGs, let us consider the vector y j ∈ {0, 1} 7 . We have that y j [i] = 1 if the corresponding i-th output vertex (respectively according to the numbering indicated in Figure 1a or Figure 1b ) is in K i or y j [i] = 0 otherwise. We therefore have that |y j | corresponds to the number of output vertices of the j-th encoder sub-CDAG which are in K. From Lemma 7, we have that for each encoder sub-CDAG there exists a subset X j ∈ X of the input vertices of the j-th encoder sub-CDAG for which is possible to connect each vertex in X j to a distinct output vertex of the j-th encoder sub-CDAG using vertex disjoint paths, each constituted by a singular edge with min{|y j |, 1+⌈(|y j | − 1) /2⌉} ≤ |X j | ≤ |y j |. The number of vertex disjoint paths connecting vertices in X , to vertices in
Without loss of generality, let us assume that δ 1 ≥ δ 2 ≥ . . . ≥ δ 7 . As previously stated, it is possible to connect all vertices in K 1 to vertices in X through vertex disjoint paths. Consider now all possible dispositions of the vertices in ∪ 7 i=2 K i over the outputs of the 2n 2 encoder sub-CDAGs. Recall that the output vertices of an encoder sub-CDAG belong each to a different H n×n sub-CDAG. From Lemma 7, we have that for each encoder, there exists a subset X j ⊂ X of the input vertices of the j-th encoder sub-CDAG,
, for which is possible to connect all vertices in X j to |X j | distinct output vertices of the j-th encoder sub-CDAG which are in ∪ 7 i=1 K i using |X j |, thus using vertex disjoint paths. As all the Enc sub-CDAGs are vertex disjoint, we can sum their contributions and we can therefore conclude that the number of vertex disjoint paths connecting values in X to vertices in
Squaring this quantity leads to:
G. Bilardi and L. De Stefani XX:11
As, by assumption, δ 1 ≥ δ 2 ≥ . . . δ 7 , we have:
. . , 7. Thus:
There are therefore at least 4 M (|Z| − 2|Γ|) vertex disjoint paths connecting vertices in X to vertices in ∪ 
If run on P processors each equipped with a local memory of size M the I/O complexity is:
at connections between I/O complexity, (pebbling) space-time tradeoffs [22] , and VLSI areatime tradeoffs [25] ; these connections deserve further attention. Some CDAGs for which non trivial I/O complexity lower bounds are known only in the case of no recomputations are described in [9] . These CDAGs are of particular interest as examples of speedups superlinear with the number of processors, in the "limiting technology", defined by fundamental limitations on device size and message speed. Whether such speedups hold even when recomputation is allowed remains an open question, which the techniques introduced here might help answer.
In general, while it is well known that recomputation may reduce the I/O complexity of some CDAGs, we are far from a characterization of those CDAGs for which recomputation is effective. This broad goal remains a challenge for any attempt toward a general theory of the communication requirements of computations. Decompose A and B into four equally sized bloc matrices as follows:
6:
9:
10:
11:
12:
13:
14:
15: Figure 1a . Note that the index assigned to each output corresponds to the index of the sub-problem generated by Strassen's algorithm for which the corresponding value is used as an input (see Figure 2a and Figure 2b in Section 2).
In order to verify that this lemma holds, we study all possible compositions of a subset Y of the output vertices of Enc A . Each of these compositions is identified by a vector y with seven components, where y i = 1 if the i-th output of Enc A is in Y or zero otherwise, for i ∈ {1, 2, . . . , 7}. In order to improve the presentation, we associate to each of the possible 128 compositions a code given by 7 i=1 y i 2 7−i . In Table 1 , we study each of the 128 possible compositions of Y , which are ordered by the value of |Y | and by their code. The value in the last column |X| denotes the maximum size of a sub-set X of the input vertices of Enc A such that each vertex in X can be connected to a distinct vertex in the subset Y corresponding to y. Each of these values can be obtained through a straightforward analysis of Enc A . The lemma then follows from observing that for any possible composition of the set Y we have which multiplies matrices A 1 and B 1 . Without loss of generality we assume that for both the basic encoder CDAGs for A and B at least one of the outputs is given by a nontrivial linear combination of elements of the input matrix. It is in fact well known that any algorithm which computes linear combinations of only one of the input matrices performs no better than the naíve-definition based matrix multiplication and it is therefore not a Strassen-like fast matrix multiplication algorithm. This implies that at least one of the subproblems P 2 generated by P 1 multiplies matrices A 2 and B 2 such that A 2 is a nontrivial linear combination of sub-matrices of A 1 . Similarly, at least one of the sub-computations P 3 generated by P 2 multiplies matrices A 3 by B 3 such that B 3 is non-trivial linear combination of sub-matrices of B. A 3 (resp., B 3 ) shares no input with A (resp., B). Thus at least one of the m 2 0 sub-computations for input size n/n i 0 × n/n i 0 generated by P 1 is input-disjoint from it. As by hypothesis, non-trivial combinations of the input matrices are used as input for at most one sub-problem, all the sub-computations P 3 corresponding to each P 1 are input-disjoint and the corresponding sub-CDAGs H 
