The increasingly high relative cost of moving data on modern parallel machines has caused a paradigm shift in the design of high-performance algorithms: to achieve efficiency, one must focus on strategies which minimize data movement, rather than minimize arithmetic operations. We call this a communication-avoiding approach to algorithm design.
Introduction
The runtime of an algorithm can be modeled as a function of computation cost, proportional to the number of arithmetic operations, and communication cost, proportional to the amount of data movement. Traditionally, to increase the efficiency of an algorithm, one sought to minimize the arithmetic complexity. On modern computer systems, however, the time to move one word of data is much greater than the time to complete one floating point operation. This can cause the performance of algorithms with low arithmetic intensity (see [11]) to be communication bound. Technology trends indicate that the performance gap between communication and computation will only widen in future systems. This has resulted in a paradigm shift in the design of high-performance algorithms: to achieve efficiency, one must focus on strategies which minimize data movement, rather than minimize arithmetic operations. We call this a communication-avoiding approach to algorithm design.
We consider simplified machine models, where a sequential machine moves words of data between a slow memory of unbounded capacity and a fast memory of size M words, and a parallel machine consists of p processors, which move data between their local memories, each of size M words. In both cases, data has a fixed width and is moved in messages of up to M contiguous words. In the parallel case, we assume the processors communicate in a point-to-point fashion over a completely connected network (so no contention) and each processor can send or receive at most one message at a time. We characterize an algorithm by its latency cost, the number of messages sent; its bandwidth cost, the number of words moved; and its arithmetic (flop) cost, the number of arithmetic operations performed. We characterize a sequential or parallel machine by its latency α, reciprocal bandwidth β, and arithmetic (flop) rate γ, in addition to M and p, and will estimate the runtime T of an algorithm (along the critical path) as T = (#messages · α) + (#words moved · β) + (#flops · γ).
(For simplicity, we will not discuss overlapping communication and computation, although this potential factor of 2 savings may be important in practice.)
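To make the model concrete, the following sketch evaluates T for given message, word, and flop counts. The function and example counts are ours, for illustration only; the default machine parameters mirror the 'Peta' model quoted in §5.

```python
def modeled_runtime(n_messages, n_words, n_flops,
                    alpha=1e-5,    # latency, s/message ('Peta', see Sec. 5)
                    beta=2e-9,     # reciprocal bandwidth, s/word
                    gamma=2e-11):  # time per flop, s/flop
    """Estimate runtime along the critical path:
    T = #messages*alpha + #words*beta + #flops*gamma."""
    return n_messages * alpha + n_words * beta + n_flops * gamma

# Example: with 10^4 messages, 10^6 words, and 10^8 flops, the latency
# term (0.1 s) dominates the bandwidth and flop terms (2e-3 s each),
# i.e., the computation is latency bound under these parameters.
t = modeled_runtime(n_messages=10**4, n_words=10**6, n_flops=10**8)
```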
Computing $k$ repeated sparse matrix-vector multiplications (SpMVs), or, a matrix powers computation, with $A \in \mathbb{C}^{n \times n}$ and $x \in \mathbb{C}^{n \times q}$, where typically $q \ll n$, can be written

$$[x^{(0)}, x^{(1)}, \ldots, x^{(k)}] := [p_0(A)x, p_1(A)x, \ldots, p_k(A)x], \qquad (2)$$
where $p_j$ is a degree-$j$ polynomial. For $i = 1$ to $q$, the span of the $i$-th columns of $x^{(0)}, x^{(1)}, \ldots$ is often called a Krylov (sub)space. Matrix powers computations constitute a core kernel in a variety of applications, including steepest descent algorithms, the power method to compute PageRank, and Krylov subspace methods for linear systems and eigenvalue problems. Due to its low arithmetic intensity, SpMV performance is communication bound on modern architectures.
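For concreteness, here is a minimal sketch of the standard approach, computing the basis by $k$ repeated SpMVs in the monomial special case $p_j(z) = z^j$; each step applies $A$ once, so each step incurs one round of communication in a parallel or out-of-core setting.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers_naive(A, x, k):
    """Compute [x^(0), ..., x^(k)] with x^(j) = A^j x via k SpMVs
    (the monomial special case of (2); a sketch for illustration)."""
    X = [x]
    for _ in range(k):
        X.append(A @ X[-1])  # one SpMV per step
    return X

# Example with a random sparse A and a block of q vectors.
n, q, k = 1000, 4, 8
A = sp.random(n, n, density=1e-3, format="csr")
x = np.random.rand(n, q)
basis = matrix_powers_naive(A, x, k)
```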
In [4], the authors derive parallel communication-avoiding matrix powers algorithms to compute (2) that achieve an $O(k)$ reduction in parallel latency cost versus computing $k$ repeated SpMVs, for a fixed number of iterations [4, 9]. These algorithms assume that $A$ is well partitioned, i.e., we can partition $A$ such that, for each processor, computing powers up to $A^k$ involves communication only between $O(1)$ nearest neighbors.
Although such advances show promising speedups for a wide variety of problems, the requirement that $A$ be well partitioned in the communication-avoiding matrix powers formulation of [4] excludes matrices whose partitions have poor surface-to-volume ratio. In this work, we consider matrices of the form $A = D + USV^H$, where $D$ is well partitioned and $USV^H$ may not be well partitioned but has low rank. (Recall that $x^H = \overline{x}^T$ denotes the Hermitian transpose of $x$, and that if an $n$-by-$n$ matrix has rank $r$, it can be stored and applied to a vector with $O(nr)$ words and flops, instead of the usual $O(n^2)$.) There are many practical situations where such structures arise, including analysis of power-law (or scale-free) graphs representing the Web and social networks, and circuit simulation. Hierarchical (ℋ-) matrices (see, e.g., [1]), which appear in numerical partial differential equation solvers, multigrid methods, fast multipole methods, and preconditioned iterative methods, also have this form.
In this paper, we derive a new parallel communication-avoiding matrix powers algorithm for matrices of the form $A = D + USV^H$, where we only require that $D$ (not $A$) be well partitioned. If this splitting is possible, our algorithm enables a communication-avoiding approach; we demonstrate that, with respect to the cost of computing $k$ SpMVs (the standard algorithm), our algorithm asymptotically reduces the parallel latency by a factor of $O(k)$ for small additional bandwidth and computational costs. Using detailed complexity analysis, our performance model predicts that this reduction in communication allows for up to 24× speedups over the standard algorithm on petascale machines.
Background and Related Work
We discuss work in related areas, namely parallel communication-avoiding matrix powers algorithms and the serial blocking covers technique. We touch on applications that can benefit from our approach.
Efficient Matrix Powers Algorithms
The communication-avoiding matrix powers algorithms derived in [5] fuse a sequence of $k$ SpMV operations into one kernel invocation. Depending on the nonzero structure of $A$ (more precisely, of the polynomials $\{p_j(A)\}_{j=1}^{k}$), this enables communication avoidance in both serial and parallel implementations.
The serial matrix powers kernel reorganizes the $k$ SpMVs to maximize reuse of $A$ and the $k+1$ vectors, ideally reading $A$ and $x$ once, and writing the $k$ output vectors only once. When reading $A$ is the dominant communication cost relative to reading/writing the vectors (a common situation), this implies a $k$-fold decrease in both latency and bandwidth costs.
In parallel, the matrix powers kernel reorganizes the computation in a similar way but with a slightly different goal, since in a parallel SpMV operation only vector entries must be communicated. The parallel matrix powers optimization avoids interprocessor synchronization by storing some redundant elements of $A$ and $x$ on different processors and performing redundant computation to compute the $k$ matrix powers without further synchronization. Provided the additional communication cost to distribute $x$ is a lower-order term (equivalently, $A^k$ is well partitioned), this gives a $k$-fold savings in latency.
We note that this discussion applies even if $A$ is not given explicitly. Serial and parallel variants of the matrix powers kernel, for both structured and general sparse matrices, are described in [8], which summarizes most of [4] and elaborates on the implementation in [9]. Within [8], we refer the reader to the complexity analysis in Tables 2.3-2.4, the performance modeling in §2.6, and the performance results in §2.10.3 and §2.11.3, which demonstrate that this optimization leads to speedups in practice. For example, for a 2D 9-point stencil on an $n^{1/2}$-by-$n^{1/2}$ mesh with $p$ processors, assuming $k \ll (n/p)^{1/2}$, the number of arithmetic operations grows by a factor of $1 + 2k(p/n)^{1/2}$, the number of messages decreases by a factor of $k$, and the number of words moved grows by a factor of $1 + k(p/n)^{1/2}$ [8]. Therefore, since the additional arithmetic operations and additional words moved are lower-order terms, we expect the parallel matrix powers algorithm of [8] to give a $\Theta(k)$ speedup on latency-bound problems.
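As a quick sanity check of these factors, consider hypothetical sizes (ours, for illustration):

```python
# 2D 9-point stencil factors from the text, evaluated numerically.
n, p, k = 1024**2, 256, 8                   # so (n/p)**0.5 = 64 >> k
flop_factor = 1 + 2 * k * (p / n) ** 0.5    # = 1.25, i.e., 25% more flops
word_factor = 1 + k * (p / n) ** 0.5        # = 1.125, i.e., 12.5% more words
msg_factor = k                              # 8x fewer messages
```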
The Blocking Covers Technique
Many earlier research efforts sought to reorganize out-of-core algorithms to minimize data movement within the memory hierarchy on a single processor. In [6], Hong and Kung prove a lower bound on I/O complexity for a matrix powers computation on a regular mesh: if the computational graph has a $\tau$-neighborhood cover, then the number of I/Os can be reduced by a factor of $\tau$ over the standard method. This is the same restriction required in the matrix powers algorithms in [4], namely, that $A$ be well partitioned.
A shortcoming of Hong and Kung's technique is that certain graphs with low diameter (such as multigrid graphs) cannot be covered efficiently with small, high-diameter subgraphs. In [7], this restriction is overcome by relaxing the constraint that dependencies in the computational graph be respected, using a blocking covers technique. A blocking cover of the computational graph has the property that the subgraphs forming the cover have large diameters (equivalent to our definition of well partitioned) once a small number of vertices (the blockers) have been removed. The authors show that, by introducing a variable for each blocker for each of the $\tau$ iterations and maintaining linear dependences among the removed vertices, each subgraph can compute $\tau$ matrix powers on its data without communicating with other subgraphs. We extend the blocking covers technique of [7] to the parallel case, and further generalize it to handle a larger class of data-sparse matrix representations.
Motivating Applications
Many scientific applications require solving linear systems $Ax = b$. Iterative methods are commonly used when the coefficient matrix is large and sparse. The most general and flexible class of iterative methods is preconditioned Krylov subspace methods. In each iteration $m$, the approximate solution is chosen from the expanding Krylov subspace

$$\mathcal{K}_m(A, r_0) = \mathrm{span}\{r_0, Ar_0, \ldots, A^{m-1} r_0\},$$

or some variation (depending on the preconditioner $M^{-1}$ used). This includes the power method, e.g., in the PageRank algorithm.
Significant reductions in communication can be achieved by rearranging the iterations to compute a $k$-element basis in one step (a matrix powers computation), essentially unrolling the inner loop $k$ times. Such methods are called $k$-step, or communication-avoiding, Krylov subspace methods. There is a wealth of literature related to $k$-step methods; space constraints prohibit an in-depth discussion, so we direct the reader to the thorough overview given in [5, §1.5-1.6].
As mentioned above, there are many applications where iterative methods are used for matrices with the property of being mostly sparse, but with a few low-rank components that may be dense. This includes matrices from Web graphs, social networks, and circuit simulations. Hierarchical semiseparable (HSS) matrices [3] , commonly used in PDE solvers, also have this property. Our technique is general and can be implemented to avoid communication in all these applications.
Derivation
Recall that, given matrices $A \in \mathbb{C}^{n \times n}$ and $x \in \mathbb{C}^{n \times q}$, and $k \in \mathbb{N}$, our task is to compute the matrix in (2). In this work, we will assume that the polynomials $p_j$ in (2) satisfy a three-term recurrence

$$p_0(z) = 1, \qquad p_{j+1}(z) = \bigl((z - \alpha_j)\,p_j(z) - \beta_j\,p_{j-1}(z)\bigr)/\gamma_j, \quad 0 \le j < k, \qquad (3)$$

with $p_{-1}(z) := 0$.
It is convenient to represent the polynomials $\{p_j\}_{j=0}^{k}$ by their coefficients, stored in a tridiagonal matrix $T_k$.
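A direct translation of this recurrence into code (a sketch, assuming the coefficient convention in (3) as reconstructed above; other conventions permute the roles of $\alpha_j$, $\beta_j$, $\gamma_j$):

```python
import numpy as np

def poly_basis_3term(A, x, k, alphas, betas, gammas):
    """Compute x^(j) = p_j(A) x for j = 0..k, where the p_j satisfy
    p_{j+1}(z) = ((z - alpha_j) p_j(z) - beta_j p_{j-1}(z)) / gamma_j,
    with p_0 = 1 and p_{-1} := 0."""
    X = [x]
    prev = np.zeros_like(x)   # plays the role of p_{-1}(A) x = 0
    for j in range(k):
        nxt = (A @ X[-1] - alphas[j] * X[-1] - betas[j] * prev) / gammas[j]
        prev = X[-1]
        X.append(nxt)
    return X
```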
Parallel Matrix Powers Algorithms
For reference, we review two of the parallel matrix powers algorithms from [4, 5, 8], referred to by the authors as PA0 and PA1. PA0 refers to the naive algorithm for computing (2) via $k$ SpMV operations, and PA1 is the communication-avoiding variant. For simplicity, we use versions of PA0 and PA1 which do not exploit overlapping communication and computation. While our approach can be extended to exploit overlap, the potential savings are limited by a factor of 2 and thus will not affect the asymptotic complexity of the algorithms given here. Let $\mathrm{nz}(A) = \{(i, j) : A_{ij} \text{ treated as nonzero}\}$ represent the edges in the directed graph of $A$, and let $A_I$ indicate the submatrix of $A$ consisting of rows $i \in I$. To simplify the discussion (without affecting correctness), we will ignore cancellation, meaning we assume $\mathrm{nz}(p_j(A)) \subseteq \mathrm{nz}(p_{j+1}(A))$ and every entry of $x^{(j)}$ is treated as nonzero for all $j \ge 0$. We construct a directed graph $G = (V, E)$ representing the dependencies in computing (2). The edge set $E$ consists of $k$ copies of $\mathrm{nz}(A)$, between each adjacent pair of the $k+1$ levels $V^{(j)} := \{x_i^{(j)} : 1 \le i \le n\}$, unioned with the edges due to the polynomial recurrence, i.e., edges connecting each $x_i^{(j)}$ to $x_i^{(j')}$ for $\max(0, j-d) \le j' < j$, where $d$ is the number of nonzero superdiagonals of $T_k$ (counting the main diagonal). We ignore possible sparsity within these diagonals; this simplifies the discussion without affecting correctness. Now we partition $V$ 'rowwise,' that is, each row $\{x_i^{(j)} : 0 \le j \le k\}$ is assigned a processor affinity $m \in \{0, \ldots, p-1\}$. Let $V_m$ and $V_m^{(j)}$ restrict $V$ and $V^{(j)}$ to their elements with affinity $m$. Let $R(S)$ denote the reachability set of $S \subseteq V$, i.e., the set $S$ and vertices reachable from $S$ via (directed) paths in $G$; then, as with $V$, we define the subsets $R^{(j)}(S)$ and $R_m(S)$, e.g., $R^{(0)}(V_m)$ denotes the level-0 elements of the reachability set of $V_m$. With this notation, we present the parallel matrix powers algorithms PA0 (Alg. 1) and PA1 (Alg. 2), as pseudocode for processor $m$. The advantage of PA1 over PA0 is that it may send fewer messages between processors: whereas PA0 requires $k$ rounds of messages, PA1 requires only one. If the number of other processors $\ell \ne m$ that each processor $m$ must communicate with is the same for both algorithms, PA1 obtains a $k$-fold latency savings. In general, however, PA1 incurs greater bandwidth, arithmetic, and storage costs, as processors may perform redundant computations to avoid communication (see Table 1). Whether this tradeoff is worthwhile depends on many machine and algorithm parameters; we refer to the detailed analysis and modeling in [4, 5, 8].
Algorithm 1 PA0. Code for processor m.
1: for j = 1 to k do
2:   for all processors ℓ ≠ m do
3:     Send to processor ℓ the entries of $x_m^{(j-1)}$ it needs.
4:     Receive from processor ℓ the entries of $x_\ell^{(j-1)}$ needed to compute $x_m^{(j)}$.
5:   end for
6:   Compute $x_m^{(j)}$ via recurrence (3).
7: end for

Algorithm 2 PA1. Code for processor m.
1: for all processors ℓ ≠ m do
2:   Send to processor ℓ the entries of $x_m^{(0)}$ in $R(V_\ell)$.
3:   Receive from processor ℓ the entries of $x_\ell^{(0)}$ in $R(V_m)$.
4: end for
5: for j = 1 to k do
6:   Compute the entries of $x^{(j)}$ in $R(V_m)$ via recurrence (3).
7: end for
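The idea behind PA1 is easiest to see in one dimension. In the following toy sketch (ours, for illustration), a processor receives $k$ ghost entries on each side of its owned block of $x^{(0)}$ in a single communication round, then computes all $k$ levels locally; each local step shrinks the valid region by one entry on each side.

```python
import numpy as np

def pa1_local_1d(x_ext, k):
    """Toy PA1 analogue for a 1D 3-point stencil A (here an averaging
    stencil), monomial basis. x_ext holds the owned entries of x^(0)
    plus k ghost entries on each side, received in one message round.
    Returns [x^(0)_ext, ..., x^(k)]; level j retains k - j ghost
    entries per side, so level k is exactly the owned block."""
    levels = [x_ext]
    y = x_ext
    for _ in range(k):
        y = (y[:-2] + y[1:-1] + y[2:]) / 3.0  # y_i <- avg of neighbors
        levels.append(y)
    return levels
```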
Parallel Blocking Covers Algorithm
Consider computing (2) with a dense matrix $A$. As before, PA0 must communicate at every step, but now the cost of PA1 may be much worse: when $k > 1$, every processor needs all $n$ rows of $A$ and $x^{(0)}$, and so there is no parallelism in computing all but the last SpMV operation. (Note that when $k = 1$, PA1 degenerates to PA0.) If $A$ can be split in the form $D + USV^H$, where $D$ is well partitioned and $USV^H$ has low rank, we can use a generalization of the blocking covers approach [7] to recover parallelism; we call this algorithm PA1-BC. Split the matrix $A =: D + USV^H$, and write
Now we manipulate the three-term recurrence (3) to obtain the following identity, which we will then apply to PA1 to obtain PA1-BC (the blocking covers algorithm). Let $p_j^i(z)$ represent a degree-$j$ polynomial related to $p_j(z)$ by reindexing the coefficients, $(\alpha_j, \beta_j, \gamma_j) := (\alpha_{i+j}, \beta_{i+j}, \gamma_{i+j})$ in (3).
Lemma 3.1. Given the additive splitting $z = z_1 + z_2$, (3) can be rewritten as
Proof. The result is readily established for j = 0 and j = 1; supposing it holds up through j = t,
Observe the following identities (for any $j \ge 0$ and $z$): $p_0^j(z) = 1$, and
Substituting these identities, we obtain
This completes the induction. Now take $z_1 := D$ and $z_2 := USV^H$ in (6), premultiply by $SV^H$, and postmultiply by $x$, to obtain
To simplify (7), we define
and rewrite the polynomials $p_j^i$ in terms of these quantities.

Lemma 3.2. The quantities defined in (8) satisfy (9)-(11), and they can be computed by the recurrence

where $I_{r,c}$ denotes the leading $r$-by-$c$ submatrix of an identity matrix.
Proof. For any $j$, we verify the claim for $i = 0$ and $i = 1$; supposing it holds up to some $i = t$, write the recurrence (whose terms have the form $p_j^i \otimes I_{r,r}$) for $i = t-1$ and $i = t$. The conclusion follows from rewriting (3) accordingly (for any $i$) and substituting.
Substituting (8), (9), (10), and (11) into (7), we obtain
Ultimately we must evaluate (5), which we rewrite as
Given the notation established, we construct PA1-BC (Alg. 3). We redefine $G = (V, E)$ in terms of the graph of $D$ (as opposed to that of $A$, which may be complete). Processor $m$ must own the following rows of $D$, $U$, $V$, and $x^{(0)}$:
Algorithm 3 PA1-BC. Code for processor m.

In exact arithmetic, PA1-BC returns the same output as PA0. However, by exploiting the splitting $A = D + USV^H$, PA1-BC may overcome the problems of PA1, and allow us to avoid communication when $A$ itself is not well partitioned.
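To make the blocking covers idea concrete, here is a serial sketch for the monomial basis $p_j(z) = z^j$ (a simplification of (3)); the names and organization are ours. The small matrices $C_i = V^H D^i U$ and $w_i = V^H D^i x^{(0)}$ let the 'blocker' quantities $b_j = SV^H x^{(j)}$ be computed by a small dense recurrence, after which the powers are recovered using only $D$ (well partitioned) and $U$:

```python
import numpy as np

def matrix_powers_blocking_covers(D, U, S, Vh, x, k):
    """Compute [x, Ax, ..., A^k x] for A = D + U S V^H without ever
    applying A directly (serial sketch, monomial basis)."""
    # 'Offline': coupling matrices C_i = V^H D^i U (r-by-r), i = 0..k-2.
    C, Z = [], U.copy()
    for _ in range(k - 1):
        C.append(Vh @ Z)
        Z = D @ Z
    # 'Online': w_i = V^H D^i x (r-by-q), i = 0..k-1.
    w, y = [], x.copy()
    for _ in range(k):
        w.append(Vh @ y)
        y = D @ y
    # Small recurrence for b_j = S V^H x^(j), using
    #   V^H x^(j) = V^H D^j x + sum_{i<j} (V^H D^{j-1-i} U) b_i.
    b = []
    for j in range(k):
        acc = w[j].copy()
        for i in range(j):
            acc += C[j - 1 - i] @ b[i]
        b.append(S @ acc)
    # Recover the powers locally: x^(j+1) = D x^(j) + U b_j.
    X = [x]
    for j in range(k):
        X.append(D @ X[j] + U @ b[j])
    return X
```

In a parallel setting, the $C_i$ are precomputed once, the $w_i$ require one reduction, and the $b_j$ recurrence involves only $r$-by-$q$ quantities, which is what allows the $D$-powers to proceed without further synchronization.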
One can think of the sequential blocking covers algorithm in [7] as a special case of a sequential execution of PA1-BC. Given a set of blocker vertices, indexed by $I \subseteq \{1, \ldots, n\}$, the algorithm in [7] removes the outgoing edges from those vertices, i.e., removes rows $I$ from $A$, to obtain a well partitioned matrix $D$. In our notation, this means setting $U := [e_i : i \in I]$ and $SV^H := A_I$, where $e_i$ is the $i$-th identity column.

[Table 2: asymptotic costs of PA0 and PA1 for a $(2s+1)^2$-point stencil on an $n^{1/2}$-by-$n^{1/2}$ mesh. These results are modified from [4, Table 1] to allow for $x$ to have $q > 1$ columns. When computing the 3-term recurrence (3) resp. (15), the flop costs for both algorithms increase by additive factors of $(5k-7)q(n/p)$ resp. $(6k-8)q(n/p)$.]
[Table 3: asymptotic costs of PA0 and PA1-BC for the model problem; the '(3)' resp. '(3c)' suffixes in the subscripts of $F$ indicate the 3-term recurrences (3) resp. (15) (see caption of Table 2).]
Complexity Analysis for a Model Problem
To demonstrate the potential performance benefits of PA1-BC over PA1 and PA0, we consider the following example. Let $A = D + UV^H$ be a dense $n \times n$ matrix, where the graph of $D$ is a $(2s+1)^2$-point stencil on an $n^{1/2}$-by-$n^{1/2}$ mesh, and $U$ and $V$ are dense $n$-by-$r$ matrices. We assume $p$ and $n$ are perfect squares and $p^{1/2}$ divides $n^{1/2}$, and we partition the vertices $\{1, \ldots, n\}$ so that each processor owns an $(n/p)^{1/2}$-by-$(n/p)^{1/2}$ square of the domain. Following [4], we assume $k < (n/p)^{1/2}$, so that PA1 (applied to $D$) need only communicate among neighboring processors. We give the complexity of PA0 and PA1 in Table 2, modified from [4, Table 1] to allow $x$ to have $q > 1$ columns, and to use the three-term recurrences (3) and (15).
We assume our collective communications attain the lower bounds in [2, Table 1 ]. We consider Lines 1-3 of PA1-BC as 'offline' operations, and Lines 4-7 as 'online' computations, since typically the matrix A and the polynomials p j are fixed across many calls to PA1-BC (or PA0/1), while the input matrix x changes on each invocation. We compare PA0 and PA1-BC in Table 3 .
We use the following arithmetic counts: $\mathrm{apply}(S, n_1, n_2) = 0$ is the cost of applying $S$ to an $n_1$-by-$n_2$ matrix (in this example $S$ is the identity), $\mathrm{add}(n_1, n_2) = \mathrm{scal}(n_1, n_2) = n_1 n_2$ is the cost of adding two $n_1$-by-$n_2$ matrices or multiplying one by a scalar, and $\mathrm{mm}(n_1, n_2, n_3) = 2 n_1 n_2 n_3 - n_1 n_3$ is the cost of multiplying matrices of size $n_1$-by-$n_2$ and $n_2$-by-$n_3$.
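In code, the counts used throughout this section are simply:

```python
def apply_S(n1, n2):  # applying S to an n1-by-n2 matrix; S = I here
    return 0

def add(n1, n2):      # adding two n1-by-n2 matrices
    return n1 * n2

def scal(n1, n2):     # scaling an n1-by-n2 matrix
    return n1 * n2

def mm(n1, n2, n3):   # multiplying n1-by-n2 by n2-by-n3
    return 2 * n1 * n2 * n3 - n1 * n3
```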
PA0
We modify PA0 slightly to exploit the splitting $Ax = Dx + U(V^H x)$ to save computation and communication. We assign each processor $m$ a block row (indices $\{i : mn/p + 1 \le i \le (m+1)n/p\}$) of $D$, $U$, $V$, and $x^{(0)}$, a nonoverlapping partition of each (as opposed to PA1, whose partitions may overlap when $k > 1$). We first compute $y = V^H x$ via a local matrix multiplication (costing $\mathrm{mm}(r, n/p, q)$ flops) followed by an Allreduce of $y$ ($(p-1)rq$ flops, $2(p-1)rq$ words, and $\lg(p)$ messages). Then we compute $Uy$ by a local matrix multiplication (costing $\mathrm{mm}(n/p, r, q)$ flops). Then we communicate rows of $x^{(j)}$ in order to compute the local rows of $Dx^{(j)}$; the costs of this step follow from Table 2.
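A sketch of one step of this modified PA0 in mpi4py (ours, for illustration; the boundary exchange needed for the sparse product $Dx$ is omitted, as if $D$ were block diagonal):

```python
import numpy as np
from mpi4py import MPI

def pa0_split_step(D_local, U_local, V_local, S, x_local, comm):
    """One step x <- (D + U S V^H) x under the splitting above. The
    low-rank term costs one Allreduce of the small r-by-q matrix
    y = V^H x; the stencil term D x needs only a boundary exchange
    (omitted here for brevity)."""
    y_local = V_local.conj().T @ x_local     # local piece of V^H x
    y = np.empty_like(y_local)
    comm.Allreduce(y_local, y, op=MPI.SUM)   # r-by-q, lg(p) messages
    return D_local @ x_local + U_local @ (S @ y)
```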
PA1-BC

PA1-BC makes 3 calls to PA1, in Lines 1, 4, and 7. These costs are shown in Table 2 (note that the third PA1 call uses the PA1(3c) variant). Now we analyze the applications of $V^H$ (Lines 1 and 4) and the subsequent Allreduce collectives (Lines 2 and 5). Applying the local columns of $V^H$ costs $\mathrm{mm}(r, n/p, (k-1)r)$ (resp. $\mathrm{mm}(r, n/p, kq)$) flops, and the subsequent Allreduce costs $(p-1)(k-1)r^2$ flops and $2(p-1)(k-1)r^2$ words (resp. $(p-1)krq$ flops and $2(p-1)krq$ words), and $\lg(p)$ messages (both).
Extension to Hierarchical Matrices
Hierarchical (ℋ-) matrices (mentioned above) are amenable to the splitting $D + USV^H$. Particularly interesting is a special class of ℋ-matrices called hierarchical semiseparable (HSS) matrices. We briefly review the HSS representation, using the notation in [3] (see also Fig. 1). For any $L \in \{0, \ldots, \lfloor \lg n \rfloor\}$, we can write $A$ hierarchically by recursively defining its diagonal blocks as $A =: D_{0;1}$ and, for $1 \le \ell \le L$ and $1 \le i \le 2^{\ell-1}$,

$$D_{\ell-1;i} =: \begin{bmatrix} D_{\ell;2i-1} & U_{\ell;2i-1} B_{\ell;2i-1,2i} V_{\ell;2i}^H \\ U_{\ell;2i} B_{\ell;2i,2i-1} V_{\ell;2i-1}^H & D_{\ell;2i} \end{bmatrix},$$

whose off-diagonal blocks are defined recursively by

$$U_{\ell-1;i} =: \begin{bmatrix} U_{\ell;2i-1} R_{\ell;2i-1} \\ U_{\ell;2i} R_{\ell;2i} \end{bmatrix} \quad \text{and} \quad V_{\ell-1;i} =: \begin{bmatrix} V_{\ell;2i-1} W_{\ell;2i-1} \\ V_{\ell;2i} W_{\ell;2i} \end{bmatrix}.$$

For any level $\ell$ we assemble the block diagonal matrices

$$D_\ell := \bigoplus_{i=1}^{2^\ell} D_{\ell;i}, \qquad U_\ell := \bigoplus_{i=1}^{2^\ell} U_{\ell;i}, \qquad V_\ell := \bigoplus_{i=1}^{2^\ell} V_{\ell;i}, \qquad (16)$$

denoted here as direct sums of their diagonal blocks. We also define matrices $S_\ell$, representing the recurrences for $f_{\ell;i}$ and $g_{\ell;i}$.
We will now discuss parallelizing $v = Ax$, to generalize PA0 and PA1 to HSS matrices. We make the following assumptions for the subsequent sections. The matrix $A$ is $n$-by-$n$, and its HSS representation has a perfect binary tree structure to some level $L > 2$. We have $p \ge 4$ processors, and for simplicity assume that $p$ is a power of 2. For each processor $m \in \{0, 1, \ldots, p-1\}$, let $L_m$ denote the smallest level $\ell \ge 1$ such that $p/2^\ell$ divides $m$. We also define the intermediate level $L_p := \lg p$; note that $1 \le L_m \le L_p$, where equality is attained when $m$ is odd.
PA0 for HSS Matrices
Given the notation and assumptions in the preceding section, we show how to modify PA0 when $A$ is HSS, exploiting the $v = Ax$ recurrences for each $1 \le j \le k$; the result is called PA0-HSS. For clarity, we write the parallel upsweep/downsweep subroutine separately, in Alg. 5.
Our PA0-HSS algorithm is based on the serial algorithm for HSS matrix-vector multiplication from [3], summarized in the recurrences in the preceding section. We are unaware of previous work demonstrating a parallelization of these recurrences, although the parallelization is straightforward given the perfect binary tree structure. First, on the upsweep, each processor locally computes $V_{L_p}^H x$ (its subtree, rooted at level $L_p = \lg(p)$) and then performs $L_p$ steps of parallel reduction, until two processors are active; then a downsweep proceeds until level $L_p$, at which point each processor is active, owns its block of $S_{L_p} V_{L_p}^H x$, and recurses into its local subtree to finally compute its rows of $v = Ax$.
More precisely, we assign processor $m$ the computations $f_{\ell;i}$ and $g_{\ell;i}$ for $\{\ell, i : L \ge \ell \ge L_p,\ 2^{\ell} m/p + 1 \le i \le 2^{\ell}(m+1)/p\}$ and for $\{\ell, i : L_m \le \ell < L_p,\ i = 2^{\ell} m/p + 1\}$ (so only processors 0 and $p/2$ are active at the top level, $\ell = 1$), and the matrices $D_L$, $U_L$, and $V_L$ are distributed contiguously block rowwise, so processor $m$ stores blocks $D_{L_p;m+1}$, $U_{L_p;m+1}$, and $V_{L_p;m+1}$. The $R_{\ell;i}$, $W_{\ell;i}$, and $B_{\ell;i}$ matrices are distributed so that they are available for the computations in the upsweep/downsweep (Alg. 5); we will omit further details for brevity, but will analyze the memory requirements when we compare with PA1-HSS, below.
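For reference, here is a serial sketch of the upsweep/downsweep recurrences from [3] that PA0-HSS parallelizes (the container layout, dicts keyed by level and block index, is ours, for illustration):

```python
import numpy as np

def hss_matvec(D, U, V, R, W, B, x, L):
    """Serial HSS matvec v = A x. D, U, V map leaf index i = 1..2^L
    to leaf blocks; R, W map (level, i) to transfer matrices (levels
    2..L); B maps (level, i, j) to coupling blocks (levels 1..L);
    x is a list of leaf blocks of the input vector."""
    # Upsweep: g[l, i] accumulates V_{l;i}^H x restricted to block i.
    g = {(L, i): V[i].conj().T @ x[i - 1] for i in range(1, 2**L + 1)}
    for l in range(L, 1, -1):
        for i in range(1, 2**(l - 1) + 1):
            g[l - 1, i] = (W[l, 2*i - 1].conj().T @ g[l, 2*i - 1]
                           + W[l, 2*i].conj().T @ g[l, 2*i])
    # Downsweep: f[l, i] collects the off-diagonal contributions.
    f = {(1, 1): B[1, 1, 2] @ g[1, 2], (1, 2): B[1, 2, 1] @ g[1, 1]}
    for l in range(1, L):
        for i in range(1, 2**l + 1):
            f[l + 1, 2*i - 1] = (B[l + 1, 2*i - 1, 2*i] @ g[l + 1, 2*i]
                                 + R[l + 1, 2*i - 1] @ f[l, i])
            f[l + 1, 2*i] = (B[l + 1, 2*i, 2*i - 1] @ g[l + 1, 2*i - 1]
                             + R[l + 1, 2*i] @ f[l, i])
    # Leaves: v_i = D_{L;i} x_i + U_{L;i} f_{L;i}.
    return [D[i] @ x[i - 1] + U[i] @ f[L, i] for i in range(1, 2**L + 1)]
```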
Algorithm 5 Parallel upsweep/downsweep. Code for processor m.

[Upsweep: compute $g_{\ell;i}$ for the local subtree, from level $L$ up to level $L_p$; then, for $\ell = L_p - 1$ down to $L_m$, active processors exchange $g_{\ell+1;2i-1}$ and $g_{\ell+1;2i}$ with partner processors $m \pm p/2^{\ell+1}$ and, if $\ell > 0$, compute $g_{\ell;i}$. Downsweep: for $\ell = L_m$ up to $L_p - 1$, active processors compute $f_{\ell+1;2i-1}$ and $f_{\ell+1;2i}$, sending them to partner processors $m + p/2^{\ell+2}$ (when $\ell < L_p - 1$) and receiving $f_{\ell;i}$ from processor $m - p/2^{\ell+1}$ as needed; each processor then recurses into its local subtree, computing $f_{\ell+1;2i-1}$ and $f_{\ell+1;2i}$ down to level $L$. Finally, each processor computes $v_{L;i}$ for $i = mn/(pr)+1, \ldots, (m+1)n/(pr)$.]
PA1 for HSS Matrices
The block-diagonal structure of $D_\ell$, $U_\ell$, and $V_\ell$ in (16) suggests an efficient parallel implementation of PA1-BC, which we present as PA1-HSS (Alg. 6). As opposed to PA0-HSS, the only parallel communication in PA1-HSS occurs in two Allgather operations, in Lines 1 and 5. The computation cost increases, however, since each processor performs the entire upsweep/downsweep between levels 1 and $L_p$ locally. Recall that in PA0-HSS there was parallelism, albeit less than full, during these levels (Lines 7-39 of Alg. 5). This further loss of parallelism shows up in our complexity analysis (see Table 4) as a factor of $p$, compared to a factor of $\lg(p)$ in PA0-HSS; we also illustrate this tradeoff in our performance modeling (see §5).
We assume the same data layout as in PA0-HSS: each processor stores a diagonal block of $D_{L_p}$, $U_{L_p}$, and $V_{L_p}$ (but only stores the smaller blocks of level $L$). We assume each processor is able to apply $S_{L_p}$. We rewrite (14) for the local rows, and exploit the block diagonal structure of $D_{L_p}$ and $U_{L_p}$, to write
We will not exploit the fact that each processor ultimately needs only a subset of the rows of $b_j$ computed in Line 6. Each processor locally computes all rows of
where z is the maximal parenthesized term in (13), using the HSS recurrences:
The rest of PA1-HSS is similar to PA1-BC, except that the Allreduce operations have now been replaced by Allgather operations, to exploit the block structures of $W_i$ and $y_i$.
Algorithm 6 PA1-HSS (blocking covers). Code for processor m.
Complexity Analysis for a Model Problem
We compare the asymptotic complexity of PA0-HSS and PA1-HSS. We assume $A$ is $n$-by-$n$ and dense, and represented by the dense $r$-by-$r$ matrices $R_{\ell;i}$ and $W_{\ell;i}$ (for levels 2 through $L$), $B_{\ell;i,i\pm1}$ (for levels 1 through $L$), and $D_{L;i}$, $U_{L;i}$, and $V_{L;i}$ (i.e., at the leaf level). To simplify the presentation, we assume $n$ and $r$ are powers of 2 and $L = \lg(n/r)$. A matrix satisfying these assumptions is said to have HSS rank $r$; here we assume $A$ is already represented as such a set of $r$-by-$r$ matrices. We summarize the results in Table 4.

PA0-HSS

PA0-HSS requires enough memory to store the local blocks of $D_L$, $U_L$, and $V_L$, a total of $3rn/p$ words. Furthermore, each processor must store the $R_{\ell;i}$, $W_{\ell;i}$, and $B_{\ell;i,i\pm1}$ matrices for its local upsweep/downsweep, a total of $\sum_{\ell=L_p}^{L} (2^{\ell}/p) \cdot 2r^2$ words, and some subset of these matrices for the parallel portion of the upsweep/downsweep. Processor 0 must be able to store $2L_p \cdot 2r^2$ more words, an upper bound for the other processors. Lastly, each processor must be able to store its $(k+1)qn/p$ entries of $x^{(0)}, \ldots, x^{(k)}$.
PA1-HSS
As opposed to PA1-BC, Lines 1, 4, and 7 of PA1-HSS do not require communication, due to the block diagonal structure of $D$, $U$, and $V$. We assume all three lines are implemented to exploit the upsweep/downsweep recurrences for (locally) applying the (HSS rank $r$) matrix $D_{L_p;m+1}$ to an $n/p$-by-$r$ matrix (Line 1) or to an $n/p$-by-$q$ matrix (Lines 4 and 7). (Note that in Line 1 we could further expand $U_{L_p;m+1}$ into a block diagonal matrix (with $r$-by-$r$ blocks), using the $R_{\ell;i}$ matrices; however, this seems only to increase the arithmetic cost.) The arithmetic cost of (locally) multiplying $D_{L_p;m+1}$ (with HSS rank $r$) by a dense $n/p$-by-$q$ matrix is:
flops. We apply $D_{L_p;m+1}$ $k-2$, $k-1$, and $k$ times, in Lines 1, 4, and 7, resp. Evaluating the three-term recurrences (in the same three lines) increases the cost by $(5k-17)r(n/p)$, $(5k-12)q(n/p)$, and $(6k-8)q(n/p)$ flops, resp. Then applying $V_{L_p;m+1}^H$ in Lines 1 and 4 costs $\mathrm{mm}(r, n/p, r)$ and $\mathrm{mm}(r, n/p, q)$ flops, resp. (note that it would nearly quadruple these costs to apply $V_{L_p;m+1}^H$ using the HSS upsweep recurrences, rather than as a full matrix). The Allreduce operations in PA1-BC have become Allgather operations in Lines 2 and 5, due to the structure and parallel layout of the matrices being communicated. Since we parallelize at HSS level $L_p$ (rather than the leaf level $L$), each processor distributes $k-1$ $r$-by-$r$ (resp. $k$ $r$-by-$q$) blocks. This costs 0 flops, $(k-1)r^2(p-1)$ (resp. $kqr(p-1)$) words, and $\lg(p)$ messages. Line 3 (performed locally) has the same $O(k^3)$ cost as in PA1-BC, a lower-order term. The analysis of Line 6 (also performed locally) is similar to that of PA1-BC, except that $r$ is replaced by $pr$, save for the terms involving $W_i$, which is block diagonal with $p$ $r$-by-$r$ diagonal blocks. In total, each processor performs
flops, where each application of S to a pr-by-q matrix costs
flops.
The memory requirements for PA1-HSS are the same as for PA0-HSS, except that each processor needs all the $R_{\ell;i}$, $W_{\ell;i}$, and $B_{\ell;i}$ matrices for levels 1 through $L_p$, a cost of $8pr^2$ words.
Performance Modeling
We model speedups of our new algorithms for the two example problems discussed in the text: a 2D stencil plus rank-$r$ component, and an HSS matrix. The complexity counts used can be found in Tables 3 and 4, resp. We use two machine models from [8]: 'Peta,' an 8100-processor petascale machine, and 'Grid,' 125 terascale machines connected via the Internet. The Peta machine has flop rate $\gamma = 2 \cdot 10^{-11}$ s/flop, latency cost $\alpha = 10^{-5}$ s/message, and bandwidth cost $\beta = 2 \cdot 10^{-9}$ s/word. The Grid machine has flop rate $\gamma = 10^{-12}$ s/flop, latency cost $\alpha = 10^{-1}$ s/message, and bandwidth cost $\beta = 25 \cdot 10^{-9}$ s/word. For the stencil-plus-rank-$r$ test, we ran the performance model for $q = r = s = 1$, i.e., a 2D 9-point stencil plus a rank-1 matrix (times a vector). We picked $n$ and $k$ based on those chosen in [8]; note that the dimension of $A$ is $n^2$. Fig. 2 shows the best speedup of PA1-BC over PA0 over all $p$ values such that $p \le \min(p_{\max}, (n/k)^2)$. The $p$ values which resulted in the maximum speedup for this test problem are shown in Fig. 3. On Peta, the largest speedup was 7.8×. Speedups were generally higher on Grid, with a maximum speedup of 25.9× for this range of $k$. We expect higher speedups on Grid since PA0 is extremely latency bound on this machine. For both models, predicted speedups decrease with increasing $n$ and $k$, due to growing additional flop and bandwidth terms.
Speedups of PA1-HSS over $k$ invocations of PA0-HSS, for both Peta and Grid, are shown in Fig. 4. We use the parameter triplets $(n_i, p_i, r_i)$, where $p = (4, 16, 64, 256, 1024, 4096)$, $n = (2.5, 5, 10, 20, 40, 80) \cdot 10^3$, and $r = (5, 5, 5, 5, 6, 7)$, based on the parameters for parallel HSS performance tests in [10]. Note that for Grid we only use the first 3 parameter triplets, since $p_{\max} = 125$. On Grid, PA0-HSS is extremely latency bound, so our $3k$ reduction in latency results in a $3k\times$ faster algorithm (the extra flop and bandwidth costs are insignificant for these values of $k$). This is the best we can expect. As shown in Fig. 5, for very large $k$, the extra terms begin to dominate, and speedups eventually begin to decrease with $k$. On Peta, we see $O(k)$ speedups for smaller $p$ and $k$, but as these quantities increase, the expected speedup drops. This is due to the extra factor of $p/\lg(p)$ in the bandwidth cost and the factor of $k^2 p/\lg(p)$ in the flop cost of PA1-HSS. Since the relative latency cost is lower on Peta, the effect of the extra terms becomes apparent for larger $k$ and $p$.
Future Work and Conclusions
In this work, we derive a new parallel communication-avoiding algorithm for the matrix powers computation with $A = D + USV^H$, where $D$ is well partitioned and $USV^H$ has low rank but may not be well partitioned. This is a significant improvement over the algorithms in [4, 8], which require $A$ to be well partitioned. Our approach allows us to exploit the low-rank structure to asymptotically reduce the parallel latency cost: on latency-bound problems, our model indicates an $O(k)$ speedup. We demonstrate the generality of our parallel blocking covers technique by applying it to matrices with hierarchical structure. Detailed performance modeling suggests up to 24× speedups on petascale machines, and up to $3k\times$ speedups on extremely latency-bound machines, despite growth in arithmetic and bandwidth costs. Future work includes a high-performance parallel implementation of our matrix powers kernel variants to verify the speedups predicted by performance modeling. We also plan to explore the application of blocking covers to other parallel algorithms.
