Almost asymptotically tight lower bounds are derived for the I/O complexity IO A (n, M) of a general class of hybrid algorithms computing the product of two integers, each represented with n digits in a given base s, in a twolevel storage hierarchy with M words of fast memory, with different digits stored in different memory words. The considered hybrid algorithm combine the Toom-Cook-k (or Toom-k) fast integer multiplication approach with computational complexity Θ c k n log k (2k−1) , and "standard" integer multiplication algorithms which compute Ω n 2 digit multiplications. We present an Ω (n/max{M, n 0 }) log k (2k−1) (max{1, n 0 /M}) 2 M lower bound for the I/O complexity a class of "uniform, non-stationary" hybrid algorithms when executed in a two-level storage hierarchy with M words of fast memory, where n 0 denotes the threshold size of sub-problems which are computed using standard algorithms with algebraic complexity Ω n 2 .
I. INTRODUCTION
Data movement has a critical impact on the performance of computing systems as it affects both execution time and energy utilization. This technological trend [1] is destined to continue, as physical limitations on minimum device size and on maximum message speed lead to unavoidable costs when moving data, whether across the levels of a hierarchical memory system or between processing elements of a parallel system [2] . While communication requirements of algorithms have been widely investigated in the literature, obtaining meaningful lower bounds on the data movement requirement remains an important and challenging task.
In this paper, we study the I/O complexity of a class of hybrid algorithms for computing the product of two n-digit integer numbers. Such algorithms combine with standard (or long) integer multiplication algorithms that require Ω n 2 digit operations, with fast algorithms based on the Toom-Cook (or Toom-k) recursive algorithm [3] which require Θ c k n log k (2k−1) digit operations where k denotes the number of sub-problems generated at each recursive step, and where c k is a parameter that is constant with respect to n while it grows with k. The considered hybrid algorithms are "nonuniform, non-stationary" as they allow for different algorithms to be used in different recursion level (i.e., non-stationary), and as they allow recursive calls to have a different structure, even when they refer to the multiplication of integer values in the same recursive level and of the same size in terms of number of digits (i.e., non-statitionary). This class models, for example, algorithms that optimize for different input sizes of recursively generated sub-problems. This corresponds to an actual practical scenario: while for large input size the use of fast algorithms such as Toom-Cook's and Karatsuba's (which can be considered as a special case of the former) leads to a reduction of the number of the required digit-wise operations compared to standard Ω n 2 algorithms, as the size of the input of the recursively generated sub-problems decreases, the asymptotic benefit of fast algorithms wanes due to the increasing relative impact of the constant multiplicative factor.
While of actual practical importance, the characterization of hybrid algorithms is challenging due to the intrinsic irregular nature of such algorithms and, hence, the irregular structure of the corresponding Computational Directed Acyclic Graphs (CDAGs). In turn, this considerably complicates the analysis of the combinatorial properties of the CDAG which is the foundation of many of I/O lower bound technique presented in the literature (e.g., [4] - [6] ).
In [7] , De Stefani proposed a technique that overcomes such challenges and yields asymptotically tight I/O lower bounds of a general class of hybrid algorithms for matrix multiplication combining fast Strassen-like algorithms and standard "definition" algorithms. In this work, we demonstrate the generality of the approach in [7] by deploying it to the analysis of hybrid integer multiplication algorithms. However, the analysis requires several non-trivial adaptations that correspond to the specific properties of the integer multiplication function and to the combinatorial properties of Toom-Cook algorithms. All the bounds obtained in this work are within at most a O k 2 multiplicative factor of corresponding upper bounds, where k denotes the maximum value for any version of Toom-k used in the hybrid scheme, and hence are asymptotically tight whenever k is constant. To the best of our knowledge, ours are the first I/O lower bounds for hybrid algorithms for integer multiplication.
Previous and Related Work: Karatsuba [8] showed that two n-digits integers can be multiplied with O(n log 2 3 ) digit operations, hence with asymptotically fewer than the n 2 operations required by the straightforward implementation of the standard long, or "schoolbook, integer multiplication multiplication algorithms.
Following this effort, the Toom-Cook algorithmic scheme was originally introduced by Andrei Toom [3] for circuits and later adapted by Stephen Cook [9] to software programs. The multiplicands A and B are viewed as the result of evaluating, at a suitable point z, two polynomials p(x) and q(x) of degree k−1 (where k is a design parameter within the scheme), whose coefficients are integers directly obtained from the digits of A and B, respectively. Therefore, the product AB = p(z)q(z) can then be obtained by evaluating at z the polynomial r(x) = p(x)q(x), of degree (2k−2). When the scheme is implemented using for k the same value at all levels of recursion the resulting algorithm is commonly referred as "Toom-k" and executes in Θ c k n log k (2k−1) digit operations, where c k is constant with n, while it grows with k. Knuth ( [10] , p. 286) shows how, by suitably choosing k as a function of the (sub)problem size, the Toom-Cook scheme can be implemented to achieve O n2 √ 2 log n log n operations.
Asymptotically faster algorithms based on the application of the Fast Fourier Transform have been presented in literature. Among these the Schönhage-Strassen algorithm [11] with complexity Θ (n log n log log n), and Fürer's algorithm [12] with complexity Θ n log n2 O(log * n) , where log * n is the iterated logarithm. While the former is indeed used in practice for input integers greater than 2 2 17 [13], Fürer's algorithm and others, asymptotically faster ones (i.e., [14] - [17] ), are not practical for reasonable input size.
In practice, Toom-Cook is slower than the standard algorithm for small numbers. It becomes competitive for input integers of intermediate size, up to 2 2 17 [13] , before the Schönhage-Strassen becomes actually faster. I/O complexity has been introduced in the seminal work by Hong and Kung [5] ; it denotes the number of data transfers between the two levels of a memory hierarchy with a fast memory (or cache) of M words and a slow memory with an unbounded number of words. Hong and Kung presented techniques to develop lower bounds for the I/O complexity of computations modeled by computational directed acyclic graphs (CDAGs). The resulting lower bounds apply to all the possible execution schedules of the given CDAG, including those with recomputation, that is, those for which some vertices of the CDAG are evaluated multiple times. In the same paper, the authors presented I/O lower bounds for several important algorithms including standard matrix multiplication and Fast Fourier Transform. The techniques of [5] have also been extended to obtain tight communication bounds for the definition-based matrix multiplication in some parallel settings [18] - [20] .
While a general (S + 1) T = n 2 /64 time (T )-space (S) tradeoff for the integer multiplication function is known [21, Theorem 10. 5.3] , to the best of our knowledge, this work provides the first asymptotically tight I/O lower bounds for standard long integer multiplication algorithms which require Ω n 2 digit-wise operations.
In [22] , Bilardi and De Stefani applied the "G-flow technique" I/O technique (which was previously introduced in [23] ) to obtain almost asymptotically tight for a class of uniform, non-stationary Toom-Cook algorithms. Their technique is based on the analysis of the Partial Grigoriev's flow of the integer multiplication function. In this work, we extend their result to a class of hybrid non-uniform, non-stationary algorithms.
In [7] , De Stefani provided an analysis of the I/O complexity of a class of hybrid algorithms for matrix multiplication combining fast Strassen-like algorithms and standard "definition" algorithms. In this work, we use the general framework introduced in [7] for the analysis of hybrid integer multiplication algorithms.
Our results: In our main result, we present an I/O lower bound for a class H of non-uniform, non-stationary hybrid integer multiplication algorithms when executed in a two-level storage hierarchy with M words of fast memory. Algorithms in H combine the Toom-Cook recursive strategy and standard algorithms that require Ω n 2 digit operations. Such hybrid algorithms allow recursive calls to have a different structure, even when they refer to the multiplication of matrices in the same recursive level and of the same input size. Following the technique introduced in [7] , the result in Theorem 7 relates the I/O complexity of algorithms in H to the number and the input size of an opportunely selected set of the sub-problems generated by the algorithms themselves.
As a special case, Theorem 9 yields an asymptotically tight Θ n 2 /M I/O lower bound for standard integer multiplication algorithms (Corollary 8).
By specializing the result in Theorem 7, we also present,
lower bound for the I/O complexity of algorithms in a subclass UH (k, n 0 ) of H composed by uniform non-stationary hybrid algorithms where Toom-k is used in the inital levels of recursion until the size of the generated subproblems falls below a set threshold n 0 and those are computed using standard algorithms executing Ω n 2 digit operations. Our lower bounds in Theorem 7 and Theorem 9 allow for recomputation of intermediate values and are tight within at most a O k 2 multiplicative factor. Hence, they are asymptotically tight for constant values of k. Since the (almost) matching algorithms do not recompute any intermediate value, we conclude that using recomputation may reduce the I/O complexity of the considered classes of hybrid algorithms by at most a O k 2 multiplicative factor.
The application of the technique introduced in [7] for the analysis of algorithms in H requires a non-trivial combination of the "G-flow" I/O lower bound technique originally introduced by Bilardi and De Stefani for Toom-Cook algorithms in [22] , with an analysis of the combinatorial properties of standard integer multiplication algorithms.
We derive our main results for the hierarchical memory model (or I/O model). Our results generalize to parallel models with P processors. For these parallel models, we derive lower bounds for the "bandwidth cost", that is for the number of messages (and, hence, the number of memory words) transmitted (either sent or received by at least one processor during the execution of the algorithm.
Paper organization: In Section II we outline the notation, the computational models, and preliminary concepts used in the rest of the presentation. In Section III we rigorously define the class of hybrid integer multiplication algorithms H being considered. In Section IV we discuss the construction and important properties of the CDAGs corresponding to algorithms in H. In Section V we discuss Maximal Sup-Problem (MSP) and describe their properties which lead to the I/O lower bounds for algorithms in H in Section VI.
II. PRELIMINARIES
We consider algorithms that compute the product of two integers: C = A × B. We assume the input integers to be expressed as a sequence of n base-s digits in positional notation. We further assume that each integer is represented as an unsigned integer with an additional bit to denote the sign.
For a given integer A, we denote its expansion in base s as:
where n is the number of digits in the base-s expansion of A, and its digits are indexed in order from the least significant digit A[0] to the most significant digit A[n − 1]. Further, we refer to A[i − 1] (resp., A[n − i]) as the i-th least (resp., most) significant digit of A for i = 1, . . . , n. With a slight abuse of notation, we use A (resp., B) to denote both the value being multiplied and the set of input variables to the algorithm. We refer to the number of digits of the base-s expansion of an integer A as its "size", and we denote it as | A|.
In this work we focus on algorithms whose execution can be modeled as a computational directed acyclic graph (CDAG) G = (V, E). Each vertex v ∈ V represents either an input value or the result of a unit-time operation (i.e., an intermediate result or one of the output values) which is stored using a single memory word. For example, each of the input (resp., output) vertices of G corresponds to one of the n digits of the base s expansion of A and B (resp., C). The directed edges in E represent data dependencies. That is, a pair of vertices u, v ∈ V are connected by an edge (u, v) directed from u to v if and only if the value corresponding to u is an operand of the unit time operation which computes the value corresponding to v. A directed path connecting vertices u, v ∈ V is an ordered sequence of vertices starting with u and ending with v, such that there is an edge in E directed from each vertex in the sequence to its successor.
We say that
. Note that, according to this definition, every CDAG is a sub-CDAG of itself. We say that two sub-CDAGs
Analogously, two directed paths in G are vertex-disjoint if they do not share any vertex.
Dominator sets When analyzing the properties of CDAGs we use the concept of dominator set originally introduced in [5] . In this work, we use the following, slightly modified, definition:
. Given a CDAG G = (V, E), let I ⊂ V denote the set of its input vertices. A set D ⊆ V is a dominator set for V ′ ⊆ V with respect to I ′ ⊆ I if every path from a vertex in I ′ to a vertex in V ′ contains at least a vertex of D. When I ′ = I, D is simply referred as "a dominator set for V ′ ".
Let G A n denote the CDAG corresponding to an unspecified straight-line algorithm for computing the product of two input integers A, B whose base-s expression has up to n digits. The following lemma provides a lower bound to the size of any dominator set for selected sets of the output vertices of G A n : Lemma 1. Given G A n , let A and B denote the two input factor integers to be multiplied expressed as n-digit numbers in base s, and let C denote the product integer whose base-s expansion has up to 2n digits, with n ≥ 8M. Let X A (resp., X B ) denote the set of input (resp., output) vertices of G A n associated with the variables corresponding to the ⌊n/2⌋ least significant digits of A (resp., B). Further, let Y denote the set composed by the ⌊n/2⌋ variables corresponding to the digits of C from the ⌊n/2⌋ + 1-th least significant to the n-th least significant. For any X ′ ⊆ X A (resp., X ′ ⊆ X B ) and any Y ′ ⊆ Y we have that any dominator D of Y ′ with respect to X ′ must be such that |D| ≥ |X ′ ||Y ′ |/n.
The property stated in Lemma 1 was originally introduced by Bilardi and De Stefani in [22] . Its proof combines an analysis of the Partial Grigoriev's flow of the integer multiplication function [22, Corollary 5] , and its relation with the size of dominator sets of selected vertices of G A n with respect to opportunely chose subsets of the input vertices [22, Lemma 1] . We refer the reader to [22] for the detailed proofs.
Remarkably, the property synthesized by Lemma 1 captures a property that is common to all CDAGs corresponding to any integer multiplication integer. This derives from the analysis of the Partial Grigoriev's Flow which is a property of the integer multiplication function itself, rather than of a particular algorithm used to compute it. Such generality makes this result particularly helpful for the analysis of hybrid algorithms, due to their intrinsic heterogeneity.
I/O model and machine models:
We assume that sequential computations are executed on a system with a two-level memory hierarchy, consisting of a fast memory or cache of size M (measured in memory words) and a slow memory of unlimited size. An operation can be executed only if all its operands are in cache.
We assume both the input integers and intermediate results to be stored in memory expressed as their base-s expansion, with s ∈ N + , and with s being at most equal to the maximum value which can be maintained in a single memory word plus one. (That is, if a memory word can fit 32 bits, we have 2 ≤ s ≤ 2 32 − 1.) In particular, we assume each digit in the base-s expansion of a value to be stored in a different memory word.
We assume that the processor is equipped with digit-wise product and algebraic sum elementary operations. Further, we assume that the processor is equipped with operations for producing the most and least significant digits of an integer in base s. Unless explicitly stated otherwise, when referring to the "digits" of integers, we mean the digits of their expansion in the base chosen for their representation in memory. Data can be moved from the slow memory to the cache by read operations and, in the other direction, by write operations. These operations are also called I/O operations. We assume the input data to be stored in the slow memory at the beginning of the computation. The evaluation of a CDAG in this model can be analyzed by means of the "red-blue pebble game" [5] . The number of I/O operations executed when evaluating a CDAG depends on the "computational schedule", that is, it depends on the order in which vertices are evaluated and on which values are kept in/discarded from cache.
The I/O complexity IO M (G) of a CDAG G is defined as the minimum number of I/O operations over all possible computational schedules. We further consider a generalization of this model known as the "External Memory Model" by Aggarwal and Vitter [24] , where B ≥ 1 values can be moved between cache and consecutive slow memory locations with a single I/O operation.
For any given algorithm A, in this work we only consider "parsimonious execution schedules", that is schedules such that: (i) each time an intermediate result (excluding the output digits of C) is computed, such value is then used to compute at least one of the values of which it is an operand before being removed from the memory (either the cache or slow memory); and (ii) any time an intermediate result is read from slow to cache memory, such value is then used to compute at least one of the values of which it is an operand before being removed from the memory or moved back to slow memory using a write I/O operation. Clearly, any nonparsimonious schedule C can be reduced to a parsimonious schedule C ′ by removing all the steps which violate the definition of parsimonious computation. C ′ has therefore less computational and I/O operations than C. Hence, restricting the analysis to parsimonious computations leads to no loss of generality.
We also consider a parallel model where P processors, each with a local memory of size 2n 2 /P ≤ M < n 2 , are connected by a network. We do not, however, make any assumption on the initial distribution of the input data nor regarding the balance of the computational load among the P processors. Processors can exchange point-to-point messages, with every message containing up to B m memory words. For this parallel model, we derive lower bounds for the number of messages that must be either sent or received by at least one processor during the CDAG evaluation. The notion of "parsimonious execution schedules" straightforwardly extends to this parallel model.
III. HYBRID INTEGER MULTIPLICATION ALGORITHMS
In this work, we consider a family of hybrid integer multiplication algorithms obtained by combining the two following classes of algorithms:
Standard integer multiplication algorithms: This class includes all the integer multiplication algorithms which, given the input factor integers a, b which have up to n digits in their respective base-s expansion, satisfy the following properties: Algorithms in this class have computational complexity Ω n 2 . This class includes, among others, the long or schoolbook algorithm, the Grid-method multiplication, and the Lattice multiplication.
Toom-Cook algorithms: This class includes algorithms following the Toom-Cook paradigm [3] . Algorithms in this class are also referred as Toom-k algorithms, where k > 0 denotes the number of parts in which the digits of the input operands are divided at every recursive step to generate 2k − 1 subproblems. While variants of the Toom-Cook paradigm for which k can be a non-integer number (e.g., k = 1.5, k = 2.5) have been presented in the literature, in this work we consider only versions of the Toom-Cook paradigm for which k ≥ 2 is an integer. Algorithms in this class follow five common steps: 1) Splitting: The digits of the base-s expansion of A (resp., B) are split into k blocks of ⌈n/k⌉ successive digits:
The A i 's and B i 's are then used as coefficients to construct the k − 1-degree polynomials,
with the property that p(s ⌈n/k ⌉ ) = A and q(s ⌈n/k ⌉ ) = B.; 2) Evaluation: In this step the algorithm evaluates the polynomials p(x) and q(x) for be 2k − 1 chosen distinct evaluation points x 0 , x 1 , . . . , x 2k−2 . Let x denote the vector such that x[i] = x i for 0 ≤ i ≤ 2k − 2, and let V (x) denote the Vandermonde matrix, such that the the value of the entry in the i-th row and j-th column is given by:
Let p(x) (resp., q(x)) denote the column vector such that
Finally, let A (resp., B) denote the column vectors such that
. The values of the p(x i )'s and q(x i )'s are computed as:
While any choice of 2k − 1 distinct evaluations point for the Vandermonde matrix is admissible, small integer values (eg., 0, 1, −1, 2, . . .) are generally chosen in order to reduce the computational cost of the algorithm. 3) Recursive multiplications: Let r(·) = p(·)q(·) denote the product polynomial with degree 2k − 1. In this step, the algorithm computes the integer products r( of the product polynomial r are computed as:
As the 2k −1 evaluations points are chosen to be distinct, due to the properties of Vandermonde matrices [25] , V(x), can guaranteed to be invertible. 5) Recomposition: Compute C = 2k−2 i=0 r i s i ⌈n/k ⌉ where multiplications by powers of s ⌈n/k ⌉ correspond to shifting the r i by i⌈n/k⌉ base-s digits to the left.
Algorithms in this class compute Θ c k n log k (2k−1) digit operations, where c k is constant with n, while it grows with k. For further details on the Toom-Cook algorithm and its variations, we refer the reader to [22] , [26] .
Remarkably, the only properties of relevance for the analysis of the I/O complexity of algorithms in these classes are those used in the characterization of the classes themselves. While implementation details of these algorithms (e.g., the choice of the evaluation points used for Toom-Cook algorithms) may impact the execution time, they will not affect our I/O lower bound analysis. Such property is particularly desirable when analyzing hybrid algorithms, due to their intrinsic heterogeneous nature.
In this work we consider a general class of non-uniform, non-stationary hybrid integer multiplication algorithms, which allow mixing of schemes from the fast Toom-Cook class with algorithms from the standard class. Given an algorithm A let P denote the problem corresponding to the computation of the product of the input factor integers A and B. Consider an "instruction function" f A (P), which, given as input P returns either (a) indication regarding the algorithm from the standard class which is to be used to compute P, or (b) indication regarding the value k for the Toom-k algorithm to be used to recursively generate seven sub-problems P 1 , P 2 , . . . , P 2k−1 and the instruction functions f A (P i ) for each of the 2k − 1 subproblems, where k ∈ N and k ≥ 2. In this work, we denote as H the class of non-uniform, non-stationary algorithms which can be characterized by means of such instruction functions and such that the size of the input integers of the sub-problems recursively generated when applying the Toom-Cook scheme is strictly decreasing. Algorithms in H allow recursive calls to have a different structure, even when they refer to the multiplication of matrices in the same recursive level. E.g., some of the sub-problems with the same size may be computed using algorithms form the standard class, while others may be computed using recursive algorithms from the fast class.
We also consider a sub-class of UH (k, n 0 ) of H constituted by uniform, non-stationary hybrid algorithms for which the same version of Toom-k (i.e., using the same value k) is used in the first ℓ recursion levels, before switching to using algorithm form the standard class. Algorithms in this class are uniform, i.e., sub-problems of the same size are all either recursively computed using a scheme form the fast class, or are all computed using algorithms from the standard class.
IV. THE CDAG OF ALGORITHMS IN H
Let G A = (V A , E A ) denote the CDAG that corresponds to an algorithm A ∈ H used for multiplying input integers A, B of size n. Rather than considering a single algorithm, we aim to characterize the properties of CDAG corresponding to any algorithm in the class H that will be relevant for the I/O analysis. Such analysis appears challenging due to the variety and irregularity of algorithms in H.In this section, we show a general template for the construction of G A and we identify some of its properties which crucially hold regardless of the implementation details specific of A and, hence, of G A .
Construction: G A can be obtained by using a recursive construction that mirrors the recursive structure of the algorithm itself. Let P denote the entire integer multiplication problem computed by A. Consider the case for which, according to the instruction function f A (P), P is to be computed using an algorithm from the standard class. As we do not fix a specific algorithm, we do not correspondingly have a fixed CDAG. The only feature of interest for the analysis is that, in this case, the CDAG G A corresponds to the execution of an algorithm from the standard class for input integers of size n.
Consider instead the case for which, according to f A (P), P is to be computed using a Toom-k algorithm by generating the 2k − 1 sub-problems P 1 , P 2 , . . . , P 2k−1 each associated with an instruction function. The sub-CDAGs of G A corresponding to each of the 2k − 1 sub-problems P i , denoted as G A P i are constructed according to f A (P i ), following recursively the previos steps. G A can then be constructed by composing the 2k − 1 sub-CDAGs G A P i by means of an encoder sub-CDAG Enc k,n which is used to connect the input vertices of G A , which correspond to the values of the input integers A (resp., B) to the appropriate input vertices of the 2k − 1 sub-CDAGs G A P i ; the output vertices of the sub-CDAGs G A P i (which correspond to the outputs of the 2k − 1 sub-products) are connected to the appropriate output vertices of the entire G A CDAG by means of a decoder sub-CDAG Dec. The input vertices of the sub-CDAG Enc k,n correspond to the n digits in the base-s expression of the inputs integer A and B, while 
the output vertices of Enc k,n correspond the digits of the inputs of the 2k − 1 sub-problems generate by the recursive strategy of A. We present an example of such recursive construction in Figure 1 .
Properties of G A : While the actual internal structure G A , and, in particular, the structure of encoder and decoder sub-CDAGs depends on the actual implementation details of the specific Toom-Cook algorithm being used by A (e.g., the choice of evaluation points, how to deal with carryover), all versions share the following properties, originally discussed in [22] , which are of great importance for the remainder of the analysis. We refer the reader to [22] for the proofs.
Lemma 2. [22, Lemma 3.1] Let A ∈ H and denote an algorithm that recursively computes the product of two input integers A and B, whose base-s expansion have n digits each.
Let P 1 and P 2 be any two sub-problems generated by A using algorithms from the Toom-Cook class, such that P 2 is not recursively generated by P 1 and vice versa. Then, the sub-CDAG of G A corresponding to P 1 and the sub-CDAG corresponding to P 2 are vertex disjoint.
The following lemma captures a connectivity property shared by the encoder sub-CDAGs for all Toom-k algorithms:
. For k ≥ 2, given the encoder Enc k,n , let I ′ A (resp., I ′ B ) denote the subset of the input vertices that correspond to the ⌈n/k⌉ least significant digits in the base-s expansion of A (resp., B) . A (resp., B) .
denote the subset of the output vertices of Enc k,n that correspond to the ⌈n/k⌉ least significant digits of the input of the i-th subproblem obtained from the values of
. We then have that there exists X A ⊆ I ′ A (resp., X B ⊆ I ′ B ) with |X A | = |Y | (resp., |X b | = |Y |), such that there exist |Y | vertex disjoint paths connecting the vertices in X A (resp., X B ) to the vertices in Y .
V. MAXIMAL SUB-PROBLEMS AND THEIR PROPERTIES For a given a sub-problem P ′ generated by an algorithm A ∈ H, let P ′ 0 , P ′ 2 , . . . , P ′ i be the sequence of sub-problems generated by A such that P ′ j+1 was recursively generated to compute P ′ j for j = 0, 1, . . . , i − 1, and such that P ′ was recursively generated to compute P ′ i . We refer to the subproblems in such sequence as the ancestor sub-problems of P ′ . If P ′ is the entire integer multiplication problem being computed by A, it has no ancestors.
To study the I/O complexity of algorithms in H we focus on the analysis of a particular set of sub-problems.
Definition 2 (Integer Multiplication Maximal Sub-Problems (MSP)). Let A ∈ H be an algorithm used to multiply two integers A, B whose base-s expansion has at most n digits. If n ≤ 8M we say that A does not generate any Maximal Sub-Problem (MSP). Let P i be a sub-problem generated by A with input integers of size n i ≥ 8M, and such that all its ancestors sub-problems are computed, according to A using algorithms from the Toom-Cook class. We say that:
using an algorithm from the standard class. If the entire problem is solved using an algorithm from the standard class, we say that the entire problem is the unique (improper) Type 1 MSP generated by A. • P i is a Type 2 MSP of A if, according to A, is computed by generating 2k − 1 sub-problems using to the recursive Toom-k scheme, and if the generated sub-problems have input size strictly smaller than 8M. If the entire problem uses a recursive algorithm from the Toom-Cook class to generate sub-problems with input size smaller than Y, we say that the entire problem is the unique, improper, Type 2 MSP generated by A.
In the following we denote as ν 1 (resp., ν 2 ) the number of Type 1 (resp., Type 2) MSPs generated by A. Let P i denote the i-th MSP generated by A and let G A P i denote the corresponding sub-CDAG of G A . We denote as A i and B i (resp., C i ) the input integers (resp., the output product) of P i .
Properties of MSPs and their corresponding sub-CDAGs:
By Definition 2, we have that for each pair of distinct MMSPs P 1 and P 2 , P 2 is not recursively generated by A in order to compute P 1 or vice versa. Hence, by Lemma 2, the sub-CDAGs of G A that correspond each to one of the MSPs generated by A are vertex-disjoint. Note that, in general, different MSPs may have different input sizes. For a given MSP P i with input integers of size n i ≥ 8M, we denote as Y i (resp., Z i ) the set of input (resp., output) vertices of the sub-CDAG G A P i which correspond each to one of the n i /2 ≥ 4M least significant digits of the input integers A i and B i (resp., to one of the digits of the output product C i form the (n i /2 + 1)least significant to the n i -least significant). Finally, we define
In order to obtain our I/O lower bound for algorithms in H, we characterize properties regarding the minimum dominator size of an arbitrary subset of Y and Z.
For each Type 1 MSP P i generated by A, with input integers of size n i ≥ 8M, we denote as T i the set of variables whose value correspond to the n 2 i /4 elementary products
for j, k = 0, 1, . . . , (n i − 1) /2 which are computed, by definition, by any integer multiplication algorithm form the standard class. Further, we denote as T i the set of vertices corresponding to the variables in T i , and we define T = ∪ ν 1 i=1 T i . Lemma 4. For any Type 1 MSPs generated by A consider
) denote a subset of the vertices corresponding to entries of A i (resp., B i ) which are multiplied in at least one of the elementary products in T ′ i . Then any dominator D of the vertices corresponding to T ′ i with respect to the the vertices in Y i is such that Proof. Let ℓ denote the number of recursive steps of algorithm A at which the MSP P i is generated. The proof is by induction on ℓ. In the base case ℓ = 0, that is P i is the only improper MSP generated by A. Hence, Y and X coincide and the statement is trivially verified. We assume inductively that the statement holds for ℓ = j ≥ 0, and we show it holds also for ℓ = j + 1. As ℓ > 0, we have that in the first level of recursion algorithm, A generates 2k − 1 subproblems, P (1) , P (2) , . . . , P (2k−1) , where the value k is given by the instruction function associated with A. Let us assume, without loss of generality, that P i is recursively generated by A while solving P (1) , and let G A P 1 be the sub-CDAG of G A corresponding to P 1 . By inductive hypothesis, and by the definition of dominator set, there exists a subset K of the input vertices of G A P 1 with |K | = |Y | such that there exists |K | vertex disjoint paths connecting vertices of K to vertices in Y . From the construction of G A , that the input vertices of G A P 1 , and hence the vertices in K, are connected to the vertices in X by means of an encoder sub-CDAG Enc k,n of G A , where n is the number of digits in the base-s expression of the input integers A and B. From Lemma 3, we have that there exists a set X ⊆ X, with |X | = |K |, such that there exist |K | vertex disjoint paths connecting the vertices in X to the vertices in K. By composing the |X | = |K | vertex-disjoint paths connecting vertices in X to vertices K with the |K | = |Y | vertex-disjoint paths connecting vertices in K to vertices in Y, we can conclude that there exist indeed |X | = |Y | vertex-disjoint paths connecting vertices in X to vertices in Y .
Proof. Without loss of generality, let us assume |Y
By composing the result of Lemma 5 and an analysis of the Partial Grigoriev's flow of the integer multiplication function [22] we obtain the following result. Lemma 6. Let G A be the CDAG corresponding to an algorithm A ∈ H which admits ν 2 Type 2 MSPs. Given any subset Z ⊆ Z in G A with |Z | ≤ 4M, any dominator set D of Z satisfies |D| ≥ |Z |/2.
Proof. If ν 2 = 0, the statement is vacuously true. In the following, we assume ν 2 ≥ 1.
Assume towards contradiction that there exists a dominator D for Z such that |D| < |Z |/2. Thus, any path connecting input an input vertex of G A to a vertex of Z must include at least one vertex in D. Let G A P 1 , G A P 2 . . . , G A Pν 2 denote the sub-CDAGs of G A , each corresponding to one of the ν 2 Type 2 MSPs generated by A. We define as D ′ the subset of D composed by vertices internal to the sub-CDAGs G A P i , excluding their respective inputs For i = 1, 2, . . . , ν 2 , let Z i (resp., D i ) denote the subset of Z (resp., D ′ ) composed by vertices in G A P i . As, from Lemma 2, the sub-CDAGs G A P i are vertex disjoint, {Z 1 , Z 2 , . . . , Z ν 2 } partition Z (resp.,
In the following, we denote as n i * the input size of the Type 2 MSP P i * , that is, n i * denotes the number of digits of the base-s expansion of each of the input integers of P i * . Let Y ′ ⊆ Y i * denote the subset such that all paths connecting these vertices to the output vertices in Z i * include at least a vertex in D i * (i.e., Y ′ is the largest subset of Y i * with respect to whom D i * is a dominator set for Z i * ). From Lemma 1, it must hold that
By the definition of Y ′ , the vertices in Y are those among the ones in Y i * that are connected to vertices in Z i * by directed paths with no vertex in D i * . We have:
where the last passage follows from the fact that, by definition of Y i * , we have |Y i * | = n i * /2. According to a well-known property of the ratios of sums we have:
and thus:
|Y | = n i * 1/2 − |D ′ i * |/|Z i * | ≥ n i * (1/2 − |D ′ |/|Z |) , By Lemma 5, there exists a set X ⊆ X, with |X | = |Y |, such that vertex in X are be connected to vertices in Y by |X | = |Y | vertex disjoint paths. As such paths are vertex disjoint, at most |D \ D ′ | of these paths may include at least a vertex in D \ D ′ . Hence, we have that there exist π paths connecting vertices in Z to vertices in X where:
where the last passage follows as P i * is a Type 2 MSP and, by Definition 2, n i * ≥ 8M ≥ 2|Z |. For |D| ≤ |Z |/2 − 1, we have that π ≥ 1, which contradicts D being a dominator set for Z.
The property synthesized in Lemma 6 extends a result originally presented in [22] for uniform, non-stationary Toom-Cook algorithms, to the more general family of non-uniform, non-stationary hybrid algorithms considered in this work.
VI. I/O LOWER BOUNDS

Theorem 7. Let A ∈ H be an algorithm to multiply two integers A, B represented as n-digit base-s numbers. If run on a sequential machine with cache of size M and such that up to B memory words stored in consecutive memory locations can be moved from cache to slow memory and vice versa using a single memory operation, A's I/O complexity satisfies:
where |T | denotes the total number of internal elementary products computed by the Type 1 MSPs generated by A and ν 2 denotes the total number of Type 2 MSPs generated by A. Proof. We prove the result in (1) (resp., (2)) for the case B = 1 (resp., B m = 1). The result then trivially generalizes for a generic B (resp., B m ). We first prove the result for the sequential case in in (1) . The bound for the parallel case in (2) will be obtained as a simple generalization. The fact that IO A (n, M, 1) ≥ 2n follows from the fact that as in our model the input factor integers A and B are initially stored in slow memory, it will necessary to move the entire input to the cache at least once using at least 2n I/O operations. If A does not generate any MSPs the statement in (1) is trivially verified. In the following, we assume ν 1 + ν 2 ≥ 1.
Let G A denote the CDAG associated with algorithm A according to the construction in Section IV. By definition, and from Lemma 2, the ν 1 + ν 2 sub-CDAGs of G A corresponding each to one of the MSPs generated by A are vertex-disjoint. Hence, the T i 's are a partition of T and |T | = ν 1 i=1 |T i |. By Definition 2, the MSP generated by A have input (resp., output) integer factors whose base-s expansion has at least 8M digits. Recall that we denote as Z the set of vertices which correspond to the digits of outputs of the ν 2 Type 2 MSPs, starting from the (4M + 1)-least significant up to the 8M-th least significant one. We have |Z| ≥ 4Mν 2 .
Let C be any computation schedule for the sequential execution of A using a cache of size M. We partition C into non-overlapping segments C 1 , C 2 , . . . such that during each C j either (a) exactly 4M 2 distinct values corresponding to vertices in T , denoted as T (j) , are explicitly computed (i.e., not loaded from slow memory), or (b) 4M distinct values corresponding to vertices in Z (denoted as Z j ) are evaluated for the first time.
Clearly there are at least max{|T |/4M 2 , ν 2 } such segments.
Below we show that the number g j of I/O operations executed during each C j satisfies g j ≥ cM for case (a) and g j ≥ M for case (b), from which the theorem follows. i 's constitute a partition of T (j) . Let A i and B i (resp., C i ) denote the input integers (resp., output product) of P i , where A i and B i can be expressed in base-s using up to n i digits.
For k = n i /4, n i /4 + 1, . . . , 3n i /4 − 1, we say that C i [k] is "active during C j " if any of the at least n i /4 elementary products A i [r]B i [s], for 0 ≤ r, s < n i /2 and r + s = k, correspond to any of the vertices in T (j)
for k = 0, 1, . . . , n i − 1, correspond to any of the vertices in T (j) i . In the following we denote as α i (resp., β i ) as the number of digits of A i (resp., B i ) which are accessed during C j , and let γ i denote the number of digits of C i which are active during C j . By definition, max{α i , β i } ≥ ⌈ |T i |⌉ and γ i ≥ ⌈|T i |/max{α i , β i }⌉.
Let C[k] be active during C j . In order to compute C[k] entirely during C j (i.e., without using partial accumulation of the summation r,s ≥0|r+s=k A i [r]B[s]), it will be necessary to evaluate all the at least n i /4 elementary products which are added int the summation during C j itself. Thus, at least n i /4 values corresponding to digits form A i and B i must be accessed during C j . Consider the set Y of at least n i /4 vertices corresponding to the values of A i being which are active during C j .
Consider the set D of vertices of G A corresponding to the at most M values stored in the cache at the beginning of C j and to the at most g j values loaded into the cache form the slow memory (resp., written into the slow memory from the cache) during C j by means of a read (resp., write) I/O operation. Clearly, |D| ≤ M + g j .
In order for the at least n i /4 values corresponding to the digits of A i which are accessed during C j to be either stored in memory at the beginning of C j , read form slow memory to the cache during C j or computed from other values during C j there must be no path connecting any vertex in Y, to any input vertex of G A which does not have at least one vertex in D. That is D has to be a dominator set of Y . By Lemma 4, any denominator D od Y satisfies
Hence, we can conclude that if any of the active values C[k] is computed entirely during C j we have g j ≥ M.
Suppose instead that none of the values of C[k]
which are active during C j are computed entirely during C j itself. Thus, as C is a parsimonious computation, for any value of C[k] which is active during C j the values of the elementary products which are computed as part of the computation of C[k] must be added to partial accumulator corresponding to the summation tree of C [k] . As, by assumption, none of the active C[k] is computed entirely during C j , Such accumulator may have either already been previously initiated and be maintained in the memory (either cache or slow), or it may necessary to maintain it in memory (either cache or slow) at the end of C j . Thus, all the ν 1 i=1 γ i partial accumulators corresponding to active values of C must be included in the cache at the beginning of C j , read form slow memory to the cache by means of a read I/O operation, maintained in the cache at the end of C j , or written form the cache to the slow memory by means of a write I/O operation. Assume that for all Type 1 MSPs generated by A at most 2M input digits are accessed during C j . That is max i {α i , β i } < 2M. By the previous considerations, we have:
Since at most M of such accumulators may be in the cache at the beginning of (resp., remain in the cache at the of) C j , we can conclude that at least M I/O operations are executed during C j . Assume instead that for at least one of the Type 1 MSPs generated by A, the number of digits of the inputs A i or B i accessed during C j is at least 2M. That is, assume that max{α i , β i } ≥ 2M for some i. Then, by applying the same considerations discussed in the case for which at least one of the C[k]'s which are active during C j is entirely computed during C j we can conclude that at least g j ≥ 2M I/O operations are executed during C j .
Case (b):
In order for the 4M values from Z j to be computed during C j there must be no path connecting any vertex in Z j to any input vertex of G A which does not have at least one vertex in D j , that is D j has to be a dominator set of Z j . From Lemma 6, any dominator set D of any subset Z ⊆ Z with |Z | ≤ 4M satisfies |D| ≥ |Z |/2, whence M + g i ≥ |D i | ≥ |Z j |/2 = 2M, which implies g j ≥ M as stated above. This concludes the proof for the sequential case in (1).
The proof for the bound for the parallel model in (2), follows from the observation that at least one of the P processors, denoted as P * , must compute at least |T |/P values corresponding to vertices in T or |Z|/P values corresponding to vertices in Z (or both). The bound follows by applying the same argument discussed for the sequential case to the computation executed by P * .
Note that if A is a uniform, non-stationary algorithm such that the product is entirely computed using algorithms from the Toom-Cook class (i.e., at each level of recursion the same version of Toom-k -with the same value k-is used for all the sub-problems generated in all recursion branches, and at level of recursion different version of the Toom-k may be used), then A only generates Type 2 MSPs and the the bounds of Theorem 7 correspond asymptotically to the results in [22] .
Instead, if A is such that the integer product is entirely computed using algorithms form the standard class, Theorem 7 yields the following bound. 
For the sub-class of uniform, non stationary algorithms UH (k, n 0 ) , given the values of n, M, n 0 and k, it is possible to compute a closed form expression for the values of ν 1 , ν 2 and |T |. Then, by applying Theorem 7 we have: Proof. As until the size of the sub-problems is higher than max{n 0 , 8M} A behaves as an uniform Toom-k algorithm we have that the inputs of the (2k − 1) i sub-problems generated in the i-th level of recursion have size at least n/k i . Let ℓ be the minimum level of recursion levels ℓ, necessary for the size of the input of the sub-problems to fall below max{n 0 , 8M}. We have: ℓ = ⌈log k n max{n 0 , 8M} ⌉.
Then, by the definition of MSP, either A generates ν 2 ≥ Ω (n/M) log k (2k−1) Type 2 MSPs, or A generates ν 1 ≥ Ω (n/n 0 ) log k (2k−1) vertex disjoint Type 1 MSP, each with input size at least n 0 . Thus, |T | ≥ Ω (n/n 0 ) log k (2k−1) n 2 0 .
On the tightness of the bound: As an I/O efficient sequential implementation of the Grid method multiplication requires O n 2 /M I/O operations, we can conclude that such implementations are asymptotically optimal, and that the I/O lower bound in (3) for the sequential model is asymptotically tight. In [22] , Bilardi and De Stefani presented a sequential version of the Toom-Cook algorithm which is uniform, nonstationary, which we henceforth refer to as A B D . That is, at each level of recursion the same version of Toom-k (with the same value k) is used for all the sub-problems generated in all recursion branches, while different versions of the Toomk scheme may be used in different levels of recursion. The algorithm proceeds according to such a scheme until one such recursive level for which all the generated sub-problems can be computed in the cache. According to our definition of MSP, A B D generates only Type-2 MSPs, and as shown in [22, Theorem 5.1 and Lemma 5.2], the number of I/O operations executed by it is within a O k 2 max multiplicative of the I/O lower bound in Theorem 7, provided that the size of the available cache memory is such that M ≥ Ω k 3 log s k .
Through simple, albeit tedious modifications of algorithm A B D , is it possible to obtain non-uniform, non-stationary algorithms for integer multiplication using the Toom-Cook scheme H whose I/O complexity is is within a O k 2 max multiplicative of the I/O lower bound in Theorem 7. Let us refer to such modified algorithms as A * B D . A * B D follows the same steps as A B D but, according to the instruction function f A * BD , may use different versions of Toom-k even for subproblems with inputs of the same size generated at the same levels of recursion. A * B D computes entirely in cache subproblem whose input size fits in the available cache.
An opportune composition of version of algorithm A * B D (for different choices of the associated instruction function) with the I/O optimal sequential Grid method multiplication algorithm lead to sequential hybrid algorithms in H (resp., UH (k, n 0 ) ) whose I/O cost asymptotically is within an O k 2 max multiplicative factor of the I/O complexity sequential lower bounds in Theorem 7 and Theorem 9. For constant values of k max (with respect to n, M) the such hybrid algorithms are I/O optimal and the corresponding lower bounds are asymptotically tight.
Interestingly, as all the mentioned algorithms from H and UH (k, n 0 ) do not recompute any intermediate value, we can conclude that using recomputation may lead to at most a O k 2 max multiplicative factor reduction of the I/O cost of hybrid algorithms in H and UH (k, n 0 ) .
VII. CONCLUSION
This work presented a characterization of the I/O complexity of a general class of non uniform, non stationary hybrid integer multiplication algorithms combining fast Toom-Cook algorithms with standard algorithms. Our work further showcase the generality of the approach based on the analysis of Maximal Sub-Problems introduced in [7] .
While the presented lower bounds are (almost) asymptotically tight for the sequential setting, an important open question is whether it is possible to obtain parallel versions of these algorithms whose I/O bandwidth cost matches the lower bound presented in this work.
