Asymptotically tight lower bounds are derived for the I/O complexity of a general class of hybrid algorithms computing the product of n × n square matrices combining "Strassen-like" fast matrix multiplication approach with computational complexity Θ n log 2 7 , and "standard " matrix multiplication algorithms with computational complexity Ω n 3 . We present a novel and tight Ω n max{ √ M,n0}
lower bound for the I/O complexity a class of "uniform, non-stationary" hybrid algorithms when executed in a two-level storage hierarchy with M words of fast memory, where n 0 denotes the threshold size of sub-problems which are computed using standard algorithms with algebraic complexity Ω n 3 . The lower bound is actually derived for the more general class of "non-uniform, nonstationary" hybrid algorithms which allow recursive calls to have a different structure, even when they refer to the multiplication of matrices of the same size and in the same recursive level, although the quantitative expressions become more involved. Our results are the first I/O lower bounds for these classes of hybrid algorithms. All presented lower bounds apply even if the recomputation of partial results is allowed and are asymptotically tight.
The proof technique combines the analysis of the Grigoriev's flow of the matrix multiplication function, combinatorial properties of the encoding functions used by fast Strassen-like algorithms, and an application of the Loomis-Whitney geometric theorem for the analysis of standard matrix multiplication algorithms. Extensions of the lower bounds for a parallel model with P processors are also discussed.
Introduction
Data movement plays a critical role in the performance of computing systems, in terms of both time and energy. This technological trend [28] appears destined to continue, as physical limitations on minimum device size and on maximum message speed lead to inherent costs when moving data, whether across the levels of a hierarchical memory system or between processing elements of a parallel system [10] . While the communication requirements of algorithms have been widely investigated in literature, obtaining significant and tight lower bounds based on such requirements remains an important and challenging task.
In this paper, we focus on the I/O complexity of a general class of hybrid algorithms for the computing the product of square matrices which combine. Such algorithms combine fast algorithms with base case 2 × 2 similar to Strassen's matrix multiplication algorithm [35] with algebraic (or computational) complexity O n log 2 7 with standard (or classic) matrix multiplication algorithms with algebraic complexity Ω n 3 . Further, these algorithms allow recursive calls to have a different structure, even when they refer to the multiplication of matrices in the same recursive level and of the same input size. These algorithms are referred in literature as "non-uniform, non-stationary". This class includes, for example, algorithms that optimize for input sizes [15, 16, 20] . Matrix multiplication is a pervasive primitive utilized in many applications.
While of actual practical importance, to the best of our knowledge, no characterization of the I/O complexity of such algorithms has presented before this work. This is likely due to the the fact that the irregular nature of hybrid algorithms and, hence, the irregular structure of the corresponding Computational Directed Acyclic Graphs (CDAGs), complicates the analysis of the combinatorial properties of the CDAG which is the foundation of many of I/O lower bound technique presented in literature (e.g., [8, 18, 29] ).
The technique used in this work overcomes such challenges and yields asymptotically tight I/O lower bounds which hold even if recomputation of intermediate values is allowed.
Previous and Related Work: Strassen [35] showed that two n × n matrices can be multiplied with O(n ω ) operations, where ω = log 2 7 ≈ 2.8074, hence with asymptotically fewer than the n 3 arithmetic operations required by the straightforward implementation of the definition of matrix multiplication. This result has motivated a number of efforts which have lead to increasingly faster algorithms, at least asymptotically, with the current record being at ω < 2.3728639 [24] .
I/O complexity has been introduced in the seminal work by Hong and Kung [18] ; it is essentially the number of data transfers between the two levels of a memory hierarchy with a fast memory of M words and a slow memory with an unbounded number of words. Hong and Kung presented techniques to develop lower bounds to the I/O complexity of computations modeled by computational directed acyclic graphs (CDAGs). The resulting lower bounds apply to all the schedules of the given CDAG, including those with recomputation, that is, where some vertices of the CDAG are evaluated multiple times. Among other results, they established a Ω n 3 / √ M lower bound to the I/O complexity of standard, definitionbased matrix multiplication algorithms, which matched a known upper bound [13] . The techniques of [18] have also been extended to obtain tight communication bounds for the definition-based matrix multiplication in some parallel settings [4, 21, 33] and for the special As the matching upper bounds do not recompute any intermediate value, we conclude that using recomputation may reduce the I/O complexity of the considered classes of hybrid algorithms by at most a constant factor.
Our proof technique is of independent interest since it exploits to a significant extent the "divide and conquer " nature exhibited by many algorithms. Our approach combines elements from the "G-flow " I/O lower bound technique originally introduced by Bilardi and De Stefani, with an application of the Loomis-Whitney geometric theorem, which has been used by Irony et al. to study the I/O complexity of standard matrix multiplication algorithms [21] , to recover an information which relates to the concept of Minimum set introduced in Hong and Kung's method. We follow the dominator set approach pioneered by Hong and Kung in [18] . However, we focus the dominator analysis only on a select set of target vertices, which, depending on the algorithm structure, correspond either to the outputs of the sub-CDAGs that correspond to sub-problems of a suitable size (i.e., chosen as a function of the fast memory capacity M ) computed using a fast Strassen-like algorithm, or to the the vertices corresponding to the elementary products evaluated by a standard (definition) matrix multiplication algorithm.
We derive our main results for the hierarchical memory model (or external memory model). Our results generalize to parallel models with P processors. For these parallel models, we derive lower bounds for the "bandwith cost", that is for the number of messages (and, hence, the number of memory) that must be either sent or received by at least one processor during the CDAG evaluation.
Paper organization: In Section 2 we outline the notation and the computational models used in the rest of the presentation. In Section 3 we rigorously define the class of hybrid matrix multiplication algorithms H being considered. In Section 4 we discuss the construction and important properties of the CDAGs corresponding to algorithms in H. In Section 5 we introduce the concept of Maximal Sup-Problem (MSP) and describe their properties which lead to the I/O lower bounds for algorithms in H in Section 6.
Preliminaries
We consider algorithms that compute the product of two square matrices A × B = C with entries from a ring R. We use A to denote the set variables each corresponding to an entry of matrix A. We refer to the number of entries of a matrix A as its "size" and we denote it as |A|. We denote the entry on the i-th row of the j-th column of matrix A as
In this work we focus on algorithms whose execution can be modeled as a computational directed acyclic graph (CDAG) G = (V, E). Each vertex v ∈ V represents either an input value or the result of a unit-time operation (i.e., an intermediate result or one of the output values) which is stored using a single memory word. For example, each of the input (resp., output) vertices of G corresponds to one of the 2n 2 entries of the factor matrices A and B (resp., to the n 2 entries of the product matrix C). The directed edges in E represent data dependencies. That is, a pair of vertices u, v ∈ V are connected by an edge (u, v) directed from u to v if and only if the value corresponding to u is an operand of the unit time operation which computes the value corresponding to v. A directed path connecting vertices u, v ∈ V is an ordered sequence of vertices starting with u and ending with v, such that there is an edge in E directed from each vertex in the sequence to its successor.
We say that
. Note that, according to this definition, every CDAG is a sub-CDAG of itself. We say that two sub-CDAGs G ′ = (V ′ , E ′ ) and
Analogously, two directed paths in G are vertex disjoint if they do not share any vertex.
When analyzing the properties of CDAGs we make use of the concept of dominator set originally introduced in [18] . We use the following -slightly different -definition: I/O model and machine models: We assume that sequential computations are executed on a system with a two-level memory hierarchy, consisting of a fast memory or cache of size M (measured in memory words) and a slow memory of unlimited size. An operation can be executed only if all its operands are in cache. We assume that each entry of the input and intermediate results matrices (including entries of the output matrix) is maintained in a single memory word (the results trivially generalize if multiple memory words are used).
Data can be moved from the slow memory to the cache by read operations and, in the other direction, by write operations. These operations are also called I/O operations. We assume the input data to be stored in slow memory at the beginning of the computation. The evaluation of a CDAG in this model can be analyzed by means of the "red-blue pebble game" [18] . The number of I/O operations executed when evaluating a CDAG depends on the "computational schedule", that is, it depends on the order in which vertices are evaluated and on which values are kept in/discarded from cache.
The I/O complexity IO G (M ) of a CDAG G is defined as the minimum number of I/O operations over all possible computational schedules. We further consider a generalization of this model known as the "External Memory Model" by Aggarwal and Vitter [2] , where B ≥ 1 values can be moved between cache and consecutive slow memory locations with a single I/O operation. For B = 1, this model clearly reduces to the red-blue pebble game.
Given an algorithm A, we only consider "parsimonious execution schedules", that is schedules such that: (i) each time an intermediate result (excluding the output entries of C) is computed, such value is then used to computed to compute at least one of the values of which it is an operand before being removed from the memory (either the cache or slow memory); and (ii) any time an intermediate result is read from slow to cache memory, such value is then used to computed to compute at least one of the values of which it is an operand before being removed from the memory or moved back to slow memory using a write I/O operation. Clearly, any non-parsimonious schedule C can be reduced to a parsimonious schedule C ′ by removing all the steps which violate the definition of parsimonious computation. C ′ has therefore less computational and I/O operations than C. Hence, restricting the analysis to parsimonious computations leads to no loss of generality.
We also consider a parallel model where P processors, each with a local memory of size 2n 2 /P ≤ M < n 2 , are connected by a network. We do not, however, make any assumption on the initial distribution of the input data nor regarding the balance of the computational load among the P processors. Processors can exchange point-to-point messages, with every message containing up to B m memory words. For this parallel model, we derive lower bounds for the number of messages that must be either sent or received by at least one processor during the CDAG evaluation. The notion of "parsimonious execution schedules" straightforwardly extends to this parallel model.
Hybrid matrix multiplication algorithms
In this work, we consider a family of hybrid matrix multiplication algorithms obtained by hybridizing the two following classes of algorithms:
Standard matrix multiplication algorithms: This class includes all the square matrix multiplication algorithms which, given the input factor matrices A, B ∈ R n×n , satisfy the following properties: This class, also referred in literature as classic, naive or conventional algorithms, correspond to that studied for the by Hong and Kung [18] (for the sequential setting) and by Irony et al. [21] (for the parallel setting). Algorithms in this class have computational complexity Ω n 3 . This class includes, among others, the sequential iterative definition algorithm, the sequential recursive divide and conquer algorithm based on block partitioning, and parallel algorithms such as Cannon's "2D " algorithm [13] , the "2.5D " algorithm by Solomonik and Demmel [34] , and "3D " algorithms [1, 23] .
Fast Strassen-like matrix multiplication algorithms with base case 2×2: This class includes algorithms following a structure similar to that of Strassen's [35] (see Appendix B) and Winograd's variation [37] (which reduces the leading coefficient of the arithmetic complexity reduced from 7 to 6). Algorithms in this class generally follow three steps:
1. Encoding: Generate the inputs, of size n/2 × n/2 of seven sub-problems, as linear sums of the input matrices; 2. Recursive multiplications: Compute (recursively) the seven generated matrix multiplication sub-products;
3. Decoding: Computing the entries of the product matrix C via linear combinations of the output of the seven sub-problems.
Algorithms in this class have algebraic complexity O n log 2 7 , which is optimal for algorithms with base case 2 × 2 [37] .
Remarkably, the only properties of relevance for the analysis of the I/O complexity of algorithms in these classes are those used in the characterization of the classes themselves.
In this work we consider a general class of non-uniform, non-stationary hybrid square matrix multiplication algorithms, which allow mixing of schemes from the fast Strassenlike class with algorithms from the standard class. Given an algorithm A let P denote the problem corresponding to the computation of the product of the input matrices A and B. Consider an "instruction function" f A (P ), which, given as input P returns either (a) indication regarding the algorithm from the standard class which is to be used to compute P , or (b) indication regarding the fast Strassen-like algorithm to be used to recursively generate seven sub-problems P 1 , P 2 , . . . , P 7 and the instruction functions f A (P i ) for each of the seven sub-problems. We refer to the class of non-uniform, non-stationary algorithms which can be characterized by means of such instruction functions as H. Algorithms in H allow recursive calls to have a different structure, even when they refer to the multiplication of matrices in the same recursive level. E.g., some of the sub-problems with the same size may be computed using algorithms form the standard class while others may be computed using recursive algorithms from the fast class. This class includes, for example, algorithms that optimize for input sizes, (for sizes that are not an integer power of a constant integer).
We also consider a sub-class UH (n 0 ) of H constituted by uniform, non-stationary hybrid algorithms which allow mixing of schemes from the fast Strassen-like class for the initial ℓ recursion levels, and then cut the recursion off once the size the generated sub-problems is smaller or equal to a set threshold n 0 × n 0 , and switch to using algorithm form the standard class. Algorithms in this class are uniform, i.e., sub-problems of the same size are all either recursively computed using a scheme form the fast class, or are all computed using algorithms from the standard class.
This corresponds to actual practical scenarios, as the use of Strassen-like algorithms is mostly beneficial for large input size. As the size of the input of the recursively generated sub-problems decreases, the asymptotic benefit of fast algorithms is lost due to the increasing relative impact of the constant multiplicative factor, and algorithms in the standard class exhibit lower actual algebraic complexity. For a discussion on such hybrid algorithms and their implementation issues we refer the reader to [16, 20] (sequential model) and [15] (parallel model).
The CDAG of algorithms in H
Let G A = (V A , E A ) denote the CDAG that corresponds to an algorithm A ∈ H used to multiply input matrices A, B ∈ R n×n . The challenge in the characterization of G A comes from the fact that rather than considering to a single algorithm, we want to characterize the CDAG corresponding to the class H. Further, the class H is composed by a rich variety of vastly different and irregular algorithm. Despite such variety, we show a general template for the construction of G A and we identify some of it properties which crucially hold regardless of the implementation details of A and, hence, of G A .
(a) G A CDAG for base case n = 2, using Strassen's algorithm [35] (see Appendix B). Construction: G A can be obtained by using a recursive construction that mirrors the recursive structure of the algorithm itself. Let P denote the entire matrix multiplication problem computed by A. Consider the case for which, according to the instruction function f A (P ), P is to be computed using an algorithm from the standard class. As we do not fix a specific algorithm, we do not correspondingly have a fixed CDAG. The only feature of interest for the analysis is that, in this case, the CDAG G A corresponds to the execution of an algorithm from the standard class for input matrices of size n × n.
Consider instead the case for which, according to f A (P ), P is to be computed using an algorithm from the fast class. In the base case for n = 2 the problem P is computed without generating any further sub-problems. As an example, we present in Figure 1a the base case for Strassen's original algorithm [35] . If n > 2, then f A (P ) specifies the divide and conquer scheme to be used to generate the seven sub-problems P 1 , P 2 , . . . , P 7 , and the instruction function for each of them. The sub-CDAGs of G A corresponding to each of the seven sub-problems P i , denoted as G A P i are constructed according to f A (P i ), following recursively the steps discussed previously. G A can then be constructed by composing the seven sub-CDAGs G A P i
. n 2 disjoint copies of an encoder sub-CDAG Enc A (resp., Enc B ) are used to connect the input vertices of G 2n×2n , which correspond to the values of the input matrix A (resp., B) to the appropriate input vertices of the seven sub-CDAGs G A P i ; the output vertices of the sub-CDAGs G A P i (which correspond to the outputs of the seven sub-products) are connected to the appropriate output vertices of the entire G A CDAG using n 2 copies of the decoder sub-CDAG Dec. We present an example of such recursive construction in Figure 1b .
Properties of G A : While the actual internal structure G A , and, in particular, the structure of encoder and decoder sub-CDAGs depends on the specific Strassen-like algorithm being used by A, all versions share some properties of great importance. Let G(X, Y, E) denote an encoder CDAG for a fast multiplication algorithm 2×2 base case, with X (resp., Y ) denoting the set of input (resp., output) vertices, and E denoting the set of edges directed from X to Y .
Lemma 1 (Lemma 3.3 [26]). Let G = (X, Y, E) denote an encoder graph for a fast matrix multiplication algorithm with base case 2 × 2. There are no two vertices in Y with identical neighbors sets.
While the correctness of this Lemma can be simply verified by inspection in the case of Strassen's algorithm [35] , Lemma 1 generalizes the statement to all encoders corresponding to fast matrix multiplication algorithms with base case 2 × 2. From Lemma 1 we have: Lemma 2. Let A ∈ H and let P 1 and P 2 be any two sub-problems generated by A with input size greater than n 0 × n 0 , such that P 2 is not recursively generated while computing P 1 and vice versa. Then, the sub-CDAGs of G A corresponding, respectively, to P 1 and to P 2 are vertex disjoint.
The following lemma, originally introduced for Strassen's algorithm in [11] and then generalized for Strassen-like algorithms with base case 2×2 in in [26] , captures a connectivity property of encoder sub-CDAGs.
Lemma 3 (Lemma 3.1 [26] ). Given an encoder CDAG for any Strassen- The proofs of Lemma 1 and Lemma 3 are based on an argument originally presented by Hopcroft and Kerr [19] . We refer the reader to [26] for the proofs.
Maximal sub-problems and their properties
For an algorithm A ∈ H, let P ′ denote a sub-problem generated by A. In our presentation we consider the entire matrix multiplication problem an improper sup-problem generated by A. We refer as the ancestor sub-problems of P ′ as the sequence of sub-problems P ′ 0 , P ′ 2 , . . . , P ′ i generated by A such that P ′ j+1 was recursively generated to compute P ′ j for j = 0, 1, . . . , i−1, and such that P was recursively generated to compute P ′ i . Clearly, if P ′ is the entire problem, then P ′ has no ancestors.
Towards studying the I/O complexity of algorithms in H we focus on the analysis of a particular set of sub-problems.
Definition 2 (Maximal Sub-Problems (MSP)). Let A ∈ H be an algorithm used to multiply matrices
A, B ∈ R n×n . If n ≤ 2 √ M we
say that A does not generate any Maximal SubProblem (MSP).
Let P i be a sub-problem generated by A with input size n i × n i , with n i ≥ 2M , and such that all its ancestors sub-problems are computed, according to A using algorithms from the fast class. We say that: In the following we denote as ν 1 (resp., ν 2 ) the number of Type 1 (resp., Type 2) MSPs generated by A. Let P i denote the i-th MSP generated by A and let G A P i denote the corresponding sub-CDAG of G A . We denote as A i and B i (resp., C i ) the input factor matrices (resp., the output product matrix) of P i .
Properties of MSPs and their corresponding sub-CDAGs: By Definition 2, we have that for each pair of distinct MMSPs P 1 and P 2 , P 2 is not recursively generated by A in order to compute P 1 or vice versa. Hence, by Lemma 2, the sub-CDAGs of G A that correspond each to one of the MSPs generated by A are vertex, disjoint.
In order to obtain our I/O lower bound for algorithms in H, we characterize properties regarding the minimum dominator size of an arbitrary subset of Y and Z. For each Type 1 MSP P i generated by A, with input size 2 n i × n i , we denote as T i the set of variables whose value correspond to the n 3 i elementary products
Further, we denote as T i the set of vertices corresponding to the variables in T i , and we define 
Lemma 6 (Proof in Appendix A.3). For any Type 1 MSPs generated by
Proof. We prove the result in (1) (resp., (2)) for the case B = 1 (resp., B m = 1). The result then trivially generalizes for a generic B (resp., B m ). We first prove the result for the sequential case in in (1). The bound for the parallel case in (2) will be obtained as a simple generalization. The fact that IO A (n, M, 1) ≥ 2n 2 follows trivially from the fact that as in our model the input matrices A and B are initially stored in slow memory, it will necessary to move the entire input to the cache at least once using at least 2n 2 I/O operations. If A does not generate any MSPs the statement in (1) is trivially verified. In the following, we assume ν 1 + ν 2 ≥ 1.
Let G A denote the CDAG associated with algorithm A according to the construction in Section 4. By definition, and from Lemma 2, the ν 1 + ν 2 sub-CDAGs of G A corresponding each to one of the MSPs generated by A are vertex-disjoint. Hence, the T i 's are a partition of T and |T | =
By Definition 2, the MSP generated by A have input (resp., output) matrices of size greater or equal to 2 √ M × 2 √ M . Recall that we denote as Z the set of vertices which correspond to the outputs of the ν 2 Type 2 MSPs, we have |Z| ≥ 4M ν l .
Let C be any computation schedule for the sequential execution of A using a cache of size M . We partition C into non-overlapping segments C 1 , C 2 , . . . such that during each C j either (a) exactly M 3/2 distinct values 3 corresponding to vertices in T , denoted as T (j) , are explicitly computed (i.e., not loaded from slow memory), or (b) 4M distinct values corresponding to vertices in Z (denoted as Z j ) are evaluated for the first time. Clearly there are at least max{|T |/M 3/2 , ν 2 } such segments.
Below we show that the number g j of I/O operations executed during each C j satisfies g j ≥ cM for case (a) and g j ≥ M for case (b), from which the theorem follows. i 's constitute a partition of T (j) . Let A i and B i (resp., C i ) denote the input matrices (resp., output matrix) of P i with A i , B i , C i ∈ R n i ×n i , and let A i and B i (resp., C i ) denote the set of the variables corresponding to the entries of A i and B i (resp., C i ). Further, we denote as T i the set of values corresponding to the vertices in T
) denote the set of vertices corresponding to the entries of A i (resp., B i ) which are accessed during C j , and let Z ′ i denote the set of vertices corresponding to the entries of C i which are active during C j . Then
Lemma 8 is a reworked version of a property originally presented by Irony et al. [21, Lemma 2.2], which itself is a consequence of the Loomis-Whitney geometric theorem [25] .
Let
[s] entirely during C j (i.e., without using partial accumulation of the summation
), it will be necessary to evaluate all the n i elementary products
[s] denote an entry of C i which is active but not entirely computed during C j . There are two possible scenarios:
[s] is computed during C j : The computation thus requires for a partial accumulation of
[s] to have been previously computed and either held in the cache at the beginning of C j , or to moved to cache using a read I/O operation during C j ;
is not computed during C j : As C is a parsimonious computation, the partial accumulation of
obtained from the elementary products computed during C j must either remain in the cache at the end of C j , or be moved to slow memory using a write I/O operation during C j ;
In both cases, any partial accumulation either held in memory at the beginning (resp., end) of C j or read from slow memory to cache (resp., written from cache to slow memory) during C j is, by definition, not shared between multiple entries in C i .
Let i |/n i } entries of C i which are active but not entirely computed during C j , either one of the entries of the cache must be occupied at the beginning of C j , or one I/O operation is executed during C j . Let
As, by Lemma 2, the sub-CDAGs corresponding to the ν 1 Type 1 MSPs are vertex disjoint, so are the the sets
We have:
where the last passage follows from the fact that, by Definition 2, n i ≥ 2M . From Lemma 8, the set of vertices Y ′ i,A (resp., Y ′ i,B ) which correspond to entries of A i (resp., B i ) which are accessed during
i |/ |Z ′ i | entries from the input matrices of P i are accessed during C j . Let Y denote the set of vertices corresponding to the entries of the input matrices A i , B i of P i . From Lemma 6 we have that there exists a set 
As, by Lemma 2, the sub-CDAGs corresponding to the ν l Type 1 MSPs are vertex disjoint, so are the the sets Y ′ i for i = 1, 2, . . . , ν 1 . Hence
From Lemma 4 any dominator D Y of Y , must be such that
Hence, we can conclude that any dominator D ′′ of T (j) must be such that
Consider the set D of vertices of G A corresponding to the at most M values stored in the cache at the beginning of C j and to the at most g j values loaded into the cache form the slow memory (resp., written into the slow memory from the cache) during C j by means of a read (resp., write) I/O operation. Clearly, |D| ≤ M + g j .
In order for the M 3/2 values from T (j) to be computed during C j there must be no path connecting any vertex in T (j) , and, hence, Y , to any input vertex of G A which does not have at least one vertex in D, that is D has to admit a subset D ′′ ⊆ D such that D ′′ is a dominator set of T (j) . Note that, as the values corresponding to vertices in T (j) are actually computed during C J (i.e., not loaded from memory using a read I/O operation), D ′′ does not include vertices in T (j) itself. Further, as motivated in the previous discussion, D must include all the vertices in the set D ′ corresponding to values of partial accumulators of the active output values of Type 1 MSPs during C j .
By construction, D ′ and D ′′ are vertex disjoint. Hence, from (4) and (5) we have:
As, by construction, |T (j) | = M 3/2 , we have:
By studying its derivative after opportunely accounting for the minimum, we have that (6) is minimized for |Z ′ | = 2 −2/3 M . Hence we have:
Case (b):
In order for the 4M values from Z j to be computed during C j there must be no path connecting any vertex in Z j to any input vertex of G A which does not have at least one vertex in D j , that is D j has to be a dominator set of Z j . From Lemma 5, any dominator set D of any subset Z ⊆ Z with |Z| ≤ 4M satisfies |D| ≥ |Z|/2, whence M + g i ≥ |D i | ≥ |Z j |/2 = 2M , which implies g j ≥ M as stated above. This concludes the proof for the sequential case in (1).
The proof for the bound for the parallel model in (2), follows from the observation that at least one of the P processors, denoted as P * , must compute at least |T |/P values corresponding to vertices in T or |Z|/P values corresponding to vertices in Z (or both). The bound follows by applying the same argument discussed for the sequential case to the computation executed by P * .
Note that if A is such that the product is entirely computed using an algorithm from the standard class (resp., a fast matrix multiplication algorithm), the bounds of Theorem 7 corresponds asymptotically to the results of [18] for the sequential case and [21] for the parallel case (resp., the results in [11] ).
The bound in (2) can be further generalized to a slightly different model in which each of the P processors is equipped with a cache memory of size M and a slow memory of unbounded size. In such case, the I/O complexity of the algorithms in H corresponds to the total number of both messages received and the number of I/O operations used to move data from cache to slow memory and vice versa (i.e., read and write) executed by at least one of the P processors.
I/O lower bound for uniform, non stationary algorithms in UH (n 0 ) : For the subclass of uniform, non stationary algorithms UH (n 0 ) , given the values of n, M and n 0 is possible to compute a closed form expression for the values of ν 1 , ν 2 and |T |. 4 Then, by applying Theorem 7 we have: Theorem 9. Let A ∈ UH (n 0 ) be an algorithm to multiply two square matrices A, B ∈ R n×n . If run on a sequential machine with cache of size M and such that up to B memory words stored in consecutive memory locations can be moved from cache to slow memory and vice versa using a single memory operation, A's I/O complexity satisfies:
4 The constants terms in Theorem 9 assume that n, n0 and M are powers of two. If that not the case the statement holds with minor adjustments.
If run on P processors each equipped with a local memory of size M < n 2 and where for each I/O operation it is possible to move up to B m memory words, A's I/O complexity satisfies:
Proof. The statement follows by bounding the values ν 1 ,ν 2 and |T | for A ∈ UH (n 0 ) , and by applying the general result from Theorem 7. In order to simplify the presentation, in the following we assume that the values n, n 0 , M are powers of two. If that is not the case, the theorem holds with some minor adjustments to the constant multiplicative factor. Let let i be the smallest value in N such that n/2 i = max{n 0 , 2 √ M }. By definition of H, at each of the i recursive levels A generates 7 i sub-problems of size n/2 i .
Type 1 MSP each with input size n 0 × n 0 . As, by Definition 2, the Type 1 MSP are input dijoint we have:
Type 2 MSP each with input size 2
The statement then follows by applying the result in Theorem 7.
Theorem 9 extends the result by Scott [32] by expanding the class of hybrid matrix multiplication algorithms being considered (e.g., it does not limit the class of standard matrix multiplication to the divide and conquer algorithm based on block-partitioning), and by removing the assumption that no intermediate value may be recomputed.
On the tightness of the bound: An opportune composition of the cache-optimal version of the Strassen's algorithm [35] (as discussed in [3] ) with the standard cache-optimal divide and conquer algorithm for square matrix multiplication based on block-partitioning [13] leads to a sequential hybrid algorithms in H (resp., UH (n 0 ) ) whose I/O cost asymptotically matches the I/O complexity lower bounds in Theorem 7 (1) (resp., Theorem 9 (7)).
Parallel algorithms in H (resp., UH (n 0 ) ) asymptotically matching the I/O lower bounds in for the parallel case in Theorem 7 (2) (resp., Theorem 9 (8)) can be obtained by composing the communication avoiding version of Strassen's algorithm by Ballard et al. [3] with the communication avoiding "2.5 " standard algorithm by Solomonik and Demmel [34] .
Hence, the lower bounds in Theorem 7 and Theorem 9 are asymptotically tight and the mentioned algorithms form H and UH (n 0 ) whose I/O cost asymptotically match the lower bounds are indeed I/O optimal.
Further, as the mentioned I/O optimal algorithms from H and UH (n 0 ) do not recompute any intermediate value 5 , we can conclude that using recomputation may lead to at most a constant factor reduction of the I/O cost of hybrid algorithms in H and UH (n 0 ) .
Generalization to fast matrix multiplication model with base other than 2 × 2: The general statement of Theorem 7 can be extended to by enriching H to include any fast Strassen-like algorithm with base case other than 2 × 2 provided that the associated encoder CDAG satisfies properties equivalent to those expressed by Lemma 2(i.e., the input disjointedness of the sub-problems generated at each recursive step) and Lemma 3 (i.e., the connectivity between input and output of the encoder CDAGs via vertex disjoint paths) for the 2 × 2 base. If these properties hold, so does the general structure of Theorem 7, given an opportune adjustment of the definition of maximal sub-problem.
Conclusion
This work introduced the first characterization of the I/O complexity of hybrid matrix multiplication algorithms combining fast Strassen-like algorithms with standard algorithms. We established asymptotically tight lower bounds that hold even when recomputation is allowed. The generality of the technique used the analysis makes it promising for the analysis of other hybrid recursive algorithms, e.g., for hybrid algorithms for integer multiplication [12] .
Our results contribute to the study of the effect of recomputation with respect to the I/O complexity of CDAG algorithms. While we are far from a characterization of those CDAGs for which recomputation is effective, this broad goal remains a fundamental challenge for any attempt toward a general theory of the communication requirements of computations.
A Proofs of technical lemmas A.1 Proof of Lemma 4
The proof of Lemma 4 uses an analysis similar to that in [11] (Lemma 6), albeit with several important variations. Before delving into the details of the proof of Lemma 4 we introduce the following technical lemma: Let a i , a 2 , . . . a i ∈ N (resp., b i , b 2 , . . . b i ∈ N) such that b j = 0 if and only if a i = 0 and such that:
using the convention that 0/0 = 0. Then we have:
Proof. The proof is by induction on the value of i. In the base case case i = 1, hence a = a 1 and b = b 1 . Thus the statement is trivially verified. We assume inductively that the statement holds for i ≥ 1 and we show that it holds also for i + 1. Let
be can therefore apply the inductive hypothesis to the first i elements and obtain:
Let a ′ = i j=1 a j and b ′ = i j=1 b j . If a i+1 = 0 the statement is trivially verified. In the following we consider the case for which a i+1 > 0 and, by assumption, b i+1 > 0.
From (10), we have that the lemma is guaranteed to be verified if the following holds:
Clearly (11) is verified for b ′ ≥ b i+1 . Since, by assumption, b = b ′ + b i+1 , we can therefore conclude that the statement is verified for b ′ ≥ 3b/4. In the following we consider the case for b ′ < 3b/4. Let a ′ = ax (resp., b ′ = by), with x ∈ (0, 1]. By assumption x, y = 0. Further, as we are considering the case such that b ′ < 3b/4, we have that y ∈ (0, 3/4). Since, by assumption, a = a ′ + a i+1 and b = b ′ + b i+1 , we have:
Therefore, the Lemma is clearly verified if the following holds:
By multiplying both sides of the previous inequality by 2 √ y √ y − 1 (which is clearly greater than 0), we have that (12) holds if:
which in turn holds if:
As y ∈ (0, 3/4) we have 2 √ 1 − y > 1 > √ y and x − √ y (x − 1). Hence (14) holds ant the Lemma follows.
Recall that given an algorithm A ∈ H used to multiply input squared matrices A, B ∈ R n×n , we denote as G A the corresponding CDAG constructed according to the description in Section 4. Further, we denote as X the set of input vertices of G A . That is the set of vertices corresponding each to the entries in the input matrices A and B. We now present the proof of Lemma 4.
Proof of Lemma 4.
The proof proceeds by induction on the number of Type 1 MSPs ν 1 generated by A. In the base case ν 1 = 1 and, by Definition 2 the entire problem is the only, improper Type 1 MSP generated by A. Hence, the sets Y and X coincide and the statement is trivially verified.
Assuming now inductively that the statement holds ν 1 = k ≥ 1, we shall show it also holds for ν 1 = k + 1.
As ν 1 > 1, we have that the algorithm executes at least one recursive step. Let P (j) denote the 7 sub-problems generated at the first recursion step. We distinguish two cases: (a) at least two of of the seven sub-problems P (j) generate each at least one Type 1 MSP; (b) only one of the seven sub-problems generates all ν 1 Type 1 MSPs. We first address case (a), as case (b) will follow from a simple extension.
For case (a), let G (j) , for j = 1, 2, . . . , 7 denote the seven sub-CDAGs of G A , each corresponding to one of the seven sub-problems generated in the first recursive step of A according to the chosen Strassen-like scheme as discussed in Section 3. By Definition 2 and as, by assumption, A generates Type 1 MSPs, we have that each of these seven sub-problems has input size greater or equal to 2
Further, each of the seven sub-problems
, By Lemma 2 the G (j) 's have distinct input values and, hence, are pairwise vertex-disjoint sub-CDAGs of G A . Thus, the
By the inductive hypothesis, any dominator set D (j) of Y (j) with respect to the set K (j) composed by the input vertices of G (j) must be such that
of the input vertices of G (j) such that |K (j) | ≥ a j / √ b j , using vertex-disjoint paths. Since the sub-CDAGs G (j) are vertex disjoint, so are the paths connecting vertices in Y (j) to vertices in K (j) . In the following we show that it is indeed possible to extended at least min{2M, 7 j=1 a j / 7 j=1 b j } of these paths to vertices in X while maintaining them vertex disjoint.
According to the construction of G A as discussed in Section 4, vertices in X corresponding to the entries of input matrix A (resp., B) are connected to vertices in K (1) , K (2) , . . . , K (7) (and, hence, K (1) , K (2) , . . . , K (7) ) by means of n 2 encoding sub-CDAGs Enc A (resp., Enc B ). None of these 2n 2 encoding sub-CDAGs share any input or output vertices. No two output vertices of the same encoder sub-CDAG belong to the same sub-CDAG G (j) , for j = 1, 2, . . . , 7. This fact ensures that for a single sub-CDAG G (j) , for j = 1, 2, . . . , 7, it is possible to connect all the vertices in K (j) (and, hence, K (j) ) to a subset of the vertices in X via vertex disjoint paths.
For each of the 2n 2 encoder sub-CDAGs, let us consider the vector y l ∈ {0, 1} 7 such that y l [j] = 1 iff the corresponding j-th output vertex of the encoder, which is an input of G (j) , is in K (j) . Therefore |y l | equals the number of output vertices of the l-th encoder sub-CDAG which are in K. From Lemma 3, for each encoder sub-CDAG there exists a subset X l ∈ X of the input vertices of the l-th encoder sub-CDAG for which it is possible to connect each vertex in X l to a distinct output vertex of the l-th encoder sub-CDAG using vertex disjoint paths, each constituted by a singular edge with min{|y l |,
The number of vertex disjoint paths connecting vertices in X , to vertices in ∪ 7 j=1 K (j) is therefore at least Let us assume w.l.o.g. that
As previously stated, it is possible to connect all vertices in K 1 to vertices in X through vertex disjoint paths. Consider now all possible dispositions of the vertices in ∪ 7 j=2 K (j) over the outputs of the 2n 2 encoder sub-CDAGs. Hence, if a 1 / √ b 1 ≥ 2M we have that there are therefore at least M vertex disjoint paths connecting vertices in X to vertices in K 1 , and, thus, to vertices in Y as desired. In the following we assume a 1 / √ b 1 < M . Recall that the output vertices of an encoder sub-CDAG belong each to a different sub-CDAG G (j) . From Lemma 3 we have that for each encoder there exists a subset X l ⊂ X of the input vertices of the l-th encoder sub-CDAG, with |X l | ≥ min |y l |, 1 +
, for which is possible to connect all vertices in X l to |X l | distinct output vertices of the l-th encoder sub-CDAG which are in ∪ 7 j=1 K (j) using |X l | vertex disjoint paths. As all the Enc sub-CDAGs are vertex-disjoint, we can add their contributions so that the number of vertex-disjoint paths connecting vertices in X to vertices in ∪ 7 j=1 K (j) is at least
where the last passage follows by applying Lemma 10. There are therefore at least a/ √ b vertex disjoint paths connecting vertices in X to vertices in ∪ 7 j=1 K (j) , and, thus, to vertices in Y as desired. This concludes the proof for case (a).
For case (b), only one of the seven sub-problems P (j) generates all ν 1 Type 1 MSPs. Without loss of generality, let P (1) denote such sub-problem and let G A P (1) be the corresponding sub-CDAG. According to the construction of G A as discussed in Section 4, vertices in X corresponding to the entries of input matrix A (resp., B) are connected to the input vertices of G A P (1) , by means of n 2 encoding sub-CDAGs Enc A (resp., Enc B ). None of these 
A.2 Proof of Lemma 5
Given an algorithm A ∈ H used to multiply input squared matrices A, B ∈ R n×n , let G A denote the corresponding CDAG constructed according to the description in Section 4. In the following we denote as X the set of input vertices of G A . That is the set of vertices corresponding each to the entries in the input matrices A and B. Further, for each Type 2 MSP P i we denote as Y i the set of input vertices of the sub-CDAG G A P i associated with P i . That is the set of vertices corresponding each to the entries in the input matrices A i and
In order to simplify the presentation of the proof of Lemma 5 we first introduce Lemma 11 which is a heavily modified version of a result previously introduced in [11, Lemma 6] . Proof. The proof proceeds by induction on the number ν 2 of Type 2 MSPs generated by A. In the base case ν 2 = 1 and, by Definition 2 the entire problem is the only, improper, Type 2 MSP generated by A and, thus, the sets Y and X coincide. Consider now the following lemma: As, by Definition 2, G A is a G n×n CDAG, with n ≥ 2 √ M , we can conclude that the statement of Lemma 11 is verified in the base case as a consequence of Lemma 12 and of the fact that, as previously mentioned, Y and X coincide, Lemma 12 was originally introduced in [11, Lemma 5] and is based on the analysis of the Grigoriev's flow of the matrix multiplication function. For the sake of completeness we present the proof in Appendix A.4.
Assuming now inductively that the statement holds for ν 2 = k > 1, we shall show it also holds for ν 2 = k + 1. As ν 2 > 1, we have that the algorithm executes at least one recursive step. Let P (j) denote the 7 sub-problems generated at the first recursion step. We distinguish two cases: (a) at least two of of the seven sub-problems P (j) generate each at least one Type 2 MSP; (b) only one of the seven sub-problems generates all ν 2 Type 2 MSPs. We first address case (a), as case (b) will follow from a simple extension.
For case (a), let G (j) , for j = 1, 2, . . . , 7 denote the seven sub-CDAGs of G A , each corresponding to one of the seven sub-problems generated in the first recursive step of A according to the chosen Strassen-like scheme as discussed in Section 4. By Definition 2 and as, by assumption, A generates Type 2 MSPs, we have that n/2 ≥ 2 √ M . Further, each of the seven sub-problems P (j) generates at most ν 2 − 1 Type 2 MSPs.
Let Z (j) , Y (j) and Q (j) respectively denote the subsets of Z, Y and Q in G (j) , for j = 1, 2, . . . , 7. By Lemma 2 the G (j) 's have distinct input values and, hence, are pairwise vertexdisjoint sub-CDAGs of G A . Thus,
Applying the inductive hypothesis to each G (j) , we have that there is a subset
such that vertices of Y (j) are connected to vertices in Z (j) via paths with no vertex in Q (j) . In the sequel the set Y referred to in the statement will be identified as a suitable subset of ∪ 7 i=j Y (j) so that property (b) will be automatically satisfied. Towards property (a), we observe by the inductive hypothesis that vertices in Y (j) can be connected to a subset K (j) of the input vertices of G (j) with |K (j) | = |Y (j) |, using vertex-disjoint paths. Since the sub-CDAGs G (j) are vertex disjoint, so are the paths connecting vertices in Y (j) to vertices in K (j) . It remains to show that at least 4 M (|Z| − 2|Q|) of these paths can be extended to X while maintaining them vertex-disjoint.
According to the construction of G A as discussed in Section 4, vertices in X corresponding to the entries of input matrix A (resp., B) are connected to vertices in K 1 , K 2 , . . . , K 7 by means of n 2 encoding sub-CDAGs Enc A (resp., Enc B ). None of these 2n 2 encoding sub-CDAGs share any input or output vertices. No two output vertices of the same encoder sub-CDAG belong to the same sub-CDAG G (j) , for j = 1, 2, . . . , 7. This fact ensures that for a single sub-CDAG G (j) , for j = 1, 2, . . . , 7, it is possible to connect all the vertices in K i to a subset of the vertices in X via vertex disjoint paths.
For each of the 2n 2 encoder sub-CDAGs, let us consider the vector y l ∈ {0, 1} 7 such that y l [j] = 1 iff the corresponding j-th output vertex of the encoder, which is an input of G (j) , is in K (j) . Therefore |y l | equals the number of output vertices of the l-th encoder sub-CDAG which are in K. From Lemma 3, for each encoder sub-CDAG there exists a subset X l ∈ X of the input vertices of the l-th encoder sub-CDAG for which it is possible to connect each vertex in X l to a distinct output vertex of the l-th encoder sub-CDAG using vertex disjoint paths, each constituted by a singular edge with min{|y l |, 1 + ⌈(|y l | − 1) /2⌉} ≤ |X l | ≤ |y l |. The number of vertex disjoint paths connecting vertices in X , to vertices in ∪ 7 j=1 K (j) is therefore at least
Let us assume w.l.o.g. that δ (1) ≥ δ (20 ≥ . . . ≥ δ (7) . As previously stated, it is possible to connect all vertices in K 1 to vertices in X through vertex disjoint paths. Consider now all possible dispositions of the vertices in ∪ 7 j=2 K (j) over the outputs of the 2n 2 encoder sub-CDAGs. Recall that the output vertices of an encoder sub-CDAG belong each to a different sub-CDAG G (j) . From Lemma 3 we have that for each encoder there exists a subset X l ⊂ X of the input vertices of the l-th encoder sub-CDAG, with
, for which is possible to connect all vertices in X l to |X l | distinct output vertices of the l-th encoder sub-CDAG which are in ∪ 7 j=1 K (j) using |X l | vertex disjoint paths. As all the Enc subCDAGs are vertex disjoint, we can add their contributions so that the number of vertex disjoint paths connecting vertices in X to vertices in
There are therefore at least 4 M (|Z| − 2|Q|) vertex disjoint paths connecting vertices in X to vertices in ∪ 7 j=2 K (j) as desired. This concludes the proof for case (a).
For case (b), only one of the seven sub-problems P (j) generates all ν 1 Type 1 MSPs. Without loss of generality, let P (1) denote such sub-problem and let G A P (1) be the corresponding sub-CDAG. According to the construction of G A as discussed in Section 4, vertices in X corresponding to the entries of input matrix A (resp., B) are connected to the input vertices of G A P (1) , by means of n 2 encoding sub-CDAGs Enc A (resp., Enc B ). None of these 2n 2 encoding sub-CDAGs share any input or output vertices. No two output vertices of the same encoder sub-CDAG belong to the same sub-CDAG G (j) , for j = 1, 2, . . . , 7. This fact ensures that it is possible to connect all the input vertices of G A P (1) to a subset of the vertices in X via vertex disjoint paths. The proof for case (b) then follows by recursively applying the arguments in the Proof of Lemma 11 to G A P (1) .
Lemma 11, provides the base for the proof of Lemma 5, which is itself a reworked version a result from [11] (Lemma 7) modified in order to account for the hybrid nature of algorithms being considered in this work. 
Proof of Lemma
Again, |D| ≤ 2M − 1 implies M (|Z| − 2 (|D| − |D ′ |)) > 0. Hence, we can take the square root on both sides of (15) and conclude that φ > 0. Therefore, for |D| ≤ 2M − 1 there are at least φ > 0 paths connecting a global input vertex to a vertex in Z with no vertex in D, contradicting the assumption that D is a dominator of Z.
A.3 Proof of Lemma 6
Proof of Lemma 6. Without loss of generality, let us assume |Y at least one of the elementary products in T ′ i assumes value a. The lemma follows combining statements (i) and (ii): (i) There exists an assignment of the input variables of P i corresponding to vertices in paths connecting these vertices to the output vertices in O ′ include at least a vertex in D (i.e., I ′′ is the largest subset of I with respect to whom D is a dominator set for O ′ ). From Lemmas 14 and 13 the following must hold:
Let I ′ = I \I ′′ . By the definition of I ′′ , the vertices in I ′ are exactly those that are connected to vertices in O ′ by directed paths with no vertex in D. Since |I| = 2n 2 , from (17) we have |I ′ | 2 ≥ 4n 2 (|O ′ | − 2|D|). if n = 1 then Decompose A and B into four equally sized block matrices as follows: 
The original version of Strassen's fast matrix multiplication [35] is reported in Algorithm 1. We refer the reader to [37] for Winograd's variant, which reduces the number of additions.
