Many large-scale scientific computations require eigenvalue solvers in a scaling regime where efficiency is limited by data movement. We introduce a parallel algorithm for computing the eigenvalues of a dense symmetric matrix, which performs asymptotically less communication than previously known approaches. We provide analysis in the Bulk Synchronous Parallel (BSP) model with additional consideration for communication between a local memory and cache. Given sufficient memory to store c copies of the symmetric matrix, our algorithm requires Θ( √ c) less interprocessor communication than previously known algorithms, for any c ≤ p 1/3 when using p processors. The algorithm first reduces the dense symmetric matrix to a banded matrix with the same eigenvalues. Subsequently, the algorithm employs successive reduction to O (log p) thinner banded matrices. We employ two new parallel algorithms that achieve lower communication costs for the full-to-band and band-to-band reductions. Both of these algorithms leverage a novel QR factorization algorithm for rectangular matrices.
INTRODUCTION
The eigenvalue decomposition of a symmetric matrix A is A = U DU T where D is a diagonal matrix of eigenvalues and the columns of the orthogonal matrix U are the eigenvectors of A. Dense symmetric eigensolvers typically reduce the matrix to a tridiagonal matrix with the same eigenvalues, compute the eigenvalues D of this intermediate banded matrix rather than directly to tridiagonal, has been used to reduce vertical communication and synchronization costs [8] . Further, in practice, algorithms using a two-stage (fullto-banded and banded-to-tridiagonal) approach [5, 25] have been shown to outperform libraries that reduce directly to tridiagonal (like ScaLAPACK [12] ). However, a disadvantage of successive band reduction is an increase in cost of the back transformations done to compute eigenvectors. Unlike the forward application of transformations whose computation cost scales linearly with the matrix band-width, known algorithms for back transformations require O (n 3 ) operations for each intermediate band-width used.
The BSP model allows us to formulate and analyze algorithms as compositions of a set of common building-blocks. We leverage algorithms for matrix multiplication and QR factorization within our symmetric eigensolvers. For QR factorization, we extends known algorithms for tall-and-skinny matrices [16] and square matrices [38] to be efficient for arbitrary rectangular matrices.
We use these building blocks to define algorithms for reducing a dense matrix to a banded matrix, and a banded matrix to a thinner band-width, while preserving eigenvalues. Our main algorithm combines these, using O (log p) intermediate band-widths. The algorithm is work-efficient for computing eigenvalues, requires O (n 2 / √ cp) horizontal communication, O (n 2 log p/ √ cp) vertical communication, and O ( √ cp log 2 p) synchronizations (BSP supersteps). Known approaches for back-transformations to compute eigenvectors require the same asymptotic amount of computation for matrices of any band-width, meaning our approach may require a computation cost of O (n 2 k log p/p) if k eigenvectors are needed.
We leave the analysis of back-transformation computation for future work, but propose a potential approach to reduce the number of intermediate band-widths needed by our symmetric eigensolver.
THEORETICAL COST MODEL
We use the Bulk Synchronous Parallel (BSP) model [39] with an additional parameter to measure the cost of traffic between memory and cache. We derive asymptotic bounds on the parallel runningtime of our algorithms for this two-level architectural model, with consideration for both communication between processors and in the memory hierarchy of each processor. The BSP model permits an all-to-all communication to be done with unit synchronization cost, which will allow us to construct BSP algorithms for general matrix distributions and compose them without significant overhead. We employ cost notation typically used for the α-β communication model. As all stored and communicated datasets in this paper consist exclusively of floating-point numbers, we quantify sizes in terms of 'words' (floating-point numbers of a given precision). We model the memory hierarchy of each processor by a main 'slow' memory (i.e. DRAM) and a 'fast' memory (i.e. cache). We permit interprocessor (horizontal) communication to move data between main memories of different processors, and intraprocessor (vertical) communication to move data between main memory and cache of a single processor. Our architectural model is characterized by the following parameters:
• p -processors on a fully-connected network,
• M -words of memory owned by each processor,
• H -words of cache owned by each processor,
• γ -time to compute a floating point operation,
• β -time to send or receive a word, • ν -time to move a word between cache and memory, • α -time to perform a (global) synchronization.
We bound the cost of each algorithm by measuring four quantities:
• F -number of local floating point operations performed (computation cost), • W -number of words of data moved between processors (horizontal communication cost), • Q -number of words of data moved between main memory and cache (vertical communication cost), • S -number of BSP supersteps (synchronization cost).
If at each superstep i ∈ [1, S], processor j performs F j i local operations, sends and receives W j i total words, and performs Q j i reads and writes to memory, then the costs of the BSP algorithm are
and the BSP execution time of this algorithm is
This model does not consider overlap between communication and computation (or between other costs), as such overlap does not affect the overall asymptotic time.
We simplify asymptotic cost expressions by assuming γ ≤ β. Further, we write only vertical communication terms which are not associated with horizontal communication or with computations that achieve a factor of √ H cache reuse (optimal for matrix multiplication [27] ). These simplifications correspond to the assumptions on the relative communication times, ν ≤ β and the floating point rate ν ≤ γ · √ H . However, general vertical communication cost
upper-bounds may be obtained from our stated results for arbitrary ν by reinserting the term O (ν · (F / √ H + W )).
We will provide asymptotic bounds for the BSP cost of all algorithms in the paper. Sometimes, we will employ algorithms as building blocks whose cost has been analyzed in the standard α − β model, which is restricted to point-to-point messaging (pairwise synchronization). These algorithms are trivially translated to the BSP model used in this paper, which is less restrictive (allows bulk synchronizations).
Throughout the paper, we will assume that matrix dimensions are greater than and divisible by the number of processors. When it is clear that the asymptotic costs would not be affected, we will also omit floors and ceilings when subdividing the number of processors and matrix dimensions.
BUILDING BLOCKS
We first state known results and provide minor extensions to quantify the complexity of matrix multiplication and of QR factorization in our cost model. These results will be critical in the cost analysis of the new symmetric eigensolvers, which use matrix multiplication and QR factorization as subroutines.
Matrix Multiplication
Our symmetric eigensolvers will perform matrix multiplications, often of nonsquare matrices. We consider the BSP cost of multiplication of arbitrary rectangular matrices with any starting distribution. Additionally, we specially consider the BSP cost of a matrix multiplication of a pre-replicated matrix with another matrix in an arbitrary distribution. We start with the vertical communication cost of a matrix multiplication done by a single processor.
Lemma 3.1. The multiplication of matrices of dimensions m × n and n × k can be done by a single processor in time,
The Rec-Mult algorithm [22, Theorem 1] obtains the vertical communication cost given in Lemma 3.1. We omit the usual term
We now consider the full BSP cost of parallel rectangular matrix multiplication. The communication cost of square matrix multiplication is well known [1, 3, 9, 14, 29, 31] . The horizontal costs of rectangular matrix multiplication have also been analyzed within the α-β communication model, where a recursive algorithm was proposed [15] that attains the communication lower bound. We show that the algorithm in [15] can be executed within the time specified in the subsequent Lemma, for any initial load balanced distribution of the matrices. It is possible to also design different matrix multiplication algorithms in the BSP model with a Θ(log p) factor less in synchronization cost, but the overall synchronization costs of our QR and symmetric eigensolve algorithms would not improve. We parameterize the memory used by the algorithm by a parameter v, which controls how many block matrix multiplications are performed. Lemma 3.2. For any v ≥ 1, the multiplication of matrices of dimensions m × n and n × k in any load-balanced starting layout can be done in BSP time,
memory.
Proof. We consider the cost of the recursive 'CARMA' algorithm [15] . The algorithm assumes specific initial matrix layouts, but does not assume any initial data is replicated. Therefore, starting from load balanced layouts, the BSP time to move to the layouts specified by CARMA is O (β · mn+nk+mk p + α ). Because the computation is load balanced, the computation cost is O (γ · mnk/p). The latency cost of the CARMA algorithm is an upper-bound on the number of BSP supersteps necessary to execute it. In [15] , the latency cost is shown to be O mnk pM 3/2 log p = O (v log p). The communication cost of CARMA is presented in cases for 1D, 2D, and 3D processor grids. We show that the postulated BSP time upper-bound holds for all cases.
We first argue that the vertical communication cost of the local matrix multiplications (given by Lemma 3.1) is dominated by horizontal communication due to the assumption β ≥ ν . In the 3D case, the operand matrix blocks are nearly square, and either one of the operands or the output is always communicated, so horizontal communication cost dominates vertical communication cost. In the 1D and 2D cases, each processor performs a single local matrix multiplication, where the largest operand has size O ( mn+nk+mk p ), since it is the local block of the largest matrix.
We leverage previous analysis [15] to derive the horizontal communication cost. Let
The algorithm analyzed in Lemma 3.2 allows any initial load balanced matrix distributions. We now consider Algorithm 3.1, which assumes an initial distribution with replicated data and subsequently can multiply certain matrices in less time than given by Lemma 3.2. In Algorithm 3.1, one of the input matrices is stored redundantly on c = p 2δ −1 2D processor grids for any c ∈ [1,
The parameterization by δ is the same as α in [38] , while c is the same replication factor as in [35] . The parameter w controls the number of supersteps in Algorithm 3.1.
The algorithm permits the distribution to be defined as a blocking of the matrices after permutation by P (1) , P (2) . Our analysis assumes the blocking is roughly, but not necessarily exactly load balanced, permitting both cyclic and block-cylic matrix factorization algorithms where different processors perform updates (matrix multiplications) with a slightly different amount of local data at each step. We will employ Algorithm 3.1 with cyclic distributions, for which P (1) i j = 1 for i = (j mod q)(m/q) + ⌊j/q⌋ and P (2) jk = 1 for k = (j mod q)(n/q) + ⌊j/q⌋. On each processor grid layer, the algorithm executes a variant of the SUMMA algorithm [40] , which communicates the operand B and reduces the output C. This variant is chosen, since we will use the algorithm with the operand A being larger in dimensions than B and C. Lemma 3.3. Consider Algorithm 3.1 for multiplication of matrices A and B of dimensions m × n and n × k, where the initial distributions of A and B satisfy the stated requirements for permutations P (1) and P (2) where each block
when H ≥ mn/p 2(1−δ ) and the copies of A start inside cache, and otherwise with an extra cost of O (ν · wmn p 2(1−δ ) ).
Proof. As required by Algorithm 3.1, B starts in any load-balanced distribution over the p processors. As the initial layout is loadbalanced the redistribution done on line 4 costs O (β ·nk/p +α ). The gather on line 9 and reduce-scatter on line 11 are dual communication patterns that together have cost O (β · (mk + nk )/(qcw ) + α ).
Require: Given positive integers p, m, n, k, w and δ ∈ [1/2, 2/3]:
Π is a grid of q × q × c processors with q = p 1−δ and c
elements in A i j , defined by square permutation matrices P (1) ,
B is in any load balanced layout over all p processors. 2: Let z = wc 3: Partition B into blocks:
% Execute loop iterations in sequence 8:
Gather B jh on Π[i, j, l]
10:
Reduce-scatter
Over all w iterations of index h,
The w local matrix multiplications take time,
by Lemma 3.1. However, if the entire matrix A starts in cache, which is possible if H ≥ mn/p 2(1−δ ) , it suffices to read only the entries of B jh from memory into cache and write the entries of C i jh out to memory. In this case, the vertical communication cost
This term is dominated by the interprocessor communication term since β ≥ ν . The memory usage corresponds to the storage necessary for each block:
QR Factorization
We will use QR factorization within our symmetric eigensolver algorithms to obtain orthogonal transformations that introduce zeros when applied to the symmetric matrix. The vertical communication cost of executing a sequential QR factorization is proportional to that of matrix multiplication.
Lemma 3.4. The QR factorization of an m ×n matrix A with m ≥ n can be done sequentially in time, O (γ · mn 2 + ν · mn).
The sequential Communication-Avoiding QR (CAQR) algorithm achieves the vertical communication cost given above [16] . The Householder representation, lower trapezoidal m × n matrix U and upper-triangular n × n matrix T so that Q = I − UTU T , may be obtained with the cost of Lemma 3.4 using Householder reconstruction [6] .
We first consider parallel QR factorization of square matrices.
Lemma 3.5. The QR factorization of an n × n matrix A distributed in any load-balanced layout can be computed using
The QR algorithm given by [38] in the BSP model achieves the costs given in Lemma 3.5. The vertical communication cost was not analyzed in [38] . However, the algorithm consists purely of distributed matrix multiplications or QR factorizations, which by Lemma 3.1 and Lemma 3.4 have a vertical communication cost proportional to the matrix sizes. As the analysis in [38] assumes all matrices that participate in multiplication or QR factorization are communicated, due to ν < β, the horizontal communication cost dominates the vertical communication costs associated with these operations.
We now adapt the QR algorithm from [38] to handle rectangular matrices with a desirable asymptotic cost (the embedding used in [38] is inefficient for tall-and-skinny matrices). Our adaptation is based on a binary QR reduction tree, with QR factorizations of nearly square matrices done at every node in the tree performed using the algorithm from [38] . An approach employing a QR reduction tree using Givens rotations goes back to [23] , a blocked flat tree approach (optimal sequentially) was presented in [24] , and a parallel block reduction tree approach was given earlier in [13] . Our approach is closest to the TSQR algorithm [16] , except a set of up to q max processors works on each tree node. Algorithm 3.2 computes the QR factorization of an m × n matrix, outputting the first n columns of the orthogonal Q factor, as well as the n × n upper-triangular matrix R. The algorithm assumes the existence of a sequential routine 'QR' and a parallel routine for (nearly) square matrices 'square-QR'. Theorem 3.6. Algorithm 3.2 can compute the QR factorization of any m × n matrix A with m ≥ n in a load-balanced layout, using
Proof. We assume without loss of generality that m/n and p are powers of two. Let T (m) be the cost of Algorithm 3.2 for an m × n matrix using p processors. Note that m corresponds to the number of rows in the original input matrix, whilem will be used to refer to the number of rows at a given recursive step. We select the maximum number of processors to be used in base-case square QR factorizations to be q max = pn m log(p) 1/δ , in order to minimize synchronization cost while achieving an optimal horizontal communication cost.
The cost of the sequential base case of Algorithm 3.2 is, by Lemma 3.4, T bs1 (m) = O (γ ·mn 2 +ν ·mn). When reaching the square base case (dimension 2n × n, since m/n is a power of two), we employ the square QR algorithm [38] with up to
Require: Given positive integers p, m, n, q max and δ ∈ [1/2, 2/3]:
Π is a set of p processors, A is m × n, m/n and p are powers of two, and each Π[i] owns mn/p elements of A.
min(p, q max )]) and exit.
is m × n with orthogonal columns, R is n × n and upper-triangular, both are distributed in load balanced layouts across Π.
processors. We can bound the cost of this QR by Lemma 3.5. We break the cost into two cases: T bp (p) = T bp1 (p) whenp < q max and T bp (p) = T bp2 whenp ≥ q max , where
The square QR algorithm requires that the matrix be embedded into a slanted panel [38] . This can be done generally by using a somewhat larger matrix, but in all except the first recursive call, the 2b × b matrix will have the structure of two stacked uppertriangular matrices. The rows of these upper-triangular matrices can be interleaved to produce a slanted panel without embedding into a larger matrix. The recursive calls on line 6 always immediately encounter one of the base-cases. The only time base cases can have a matrix with dimension other than 2n × n is during the invocations on line 6 at the first recursive step of the algorithm, and only when m > 2np. Therefore, we consider this first recursive step of Algorithm 3.2 separately. The cost of the first recursive step, when m > 2np, includes
• the cost of a potential redistribution, O (β · mn/p + α ),
• the cost of the invocations on line 6 (which lead to base cases), T bs1 (m/p), since r = min(p, ⌈m/2n⌉) = p, • the cost of the matrix multiplications on line 11, which are done concurrently, each by a single processor, is O (γ ·mn 2 /p+ ν · mn/p).
Therefore, the total BSP time of the algorithm for m > 2np is
The cost of this initiale step for m > 2np is no greater than the cost in the theorem. Subsequent recursive calls into line 11 or the case when m ≤ 2np, the matrix multiplications done on line 11 involve matrices of size at most 2n×n, each executed using pn/m processors. By Lemma 3.2 with v = (pn/m) 2−3δ , these matrix multiplications (done concurrently) take time, T MM (m)
whereT b (pn/m) is a base case where up to q max processors perform the QR. We consider the two cases (form ≤ 2np),
Since q max = pn m log(p) 1/δ , andm decreases by a factor of two at each step, up to the first (1/δ ) log log p recursive steps make the call on line 6 with more than q max processors. The computation and communication cost of these calls are no greater than that of matrix multiplication (part of T MM (m)), while the synchronization cost increases geometrically, up to the latency cost in T bp2 . Therefore, the recurrence is asymptotically equivalent to (form ≤ 2np),
Since,m ≤ 2np, one of the base-cases is reached after log p steps, and so the above time is the one postulated in the theorem. □ Alternate communication-efficient formulations of a rectangular QR algorithm are also possible (for instance by combining columnrecursion [20] with communication-efficient matrix multiplication, see [32] ). We would like to work with the Householder representation to apply orthogonal transformations efficiently in our symmetric eigensolver algorithms, so we give the following corollary.
Corollary 3.7. The Householder representation of the m × n orthogonal matrix Q computed by Algorithm 3.2, Q = (I − UTU T 1 ), where U 1 is the lower triangular top n × n block of U , while T is upper-triangular and U T U = T −1 + T −T , can be obtained with the same cost and memory usage as in Theorem 3.6.
Proof. The Householder representation U ,T can be obtained stably by executing [U 1 ,W 1 ] = LU(Q 1 − S ) where Q 1 is the top n × n block of Q and S is a diagonal sign matrix, then computing U = QW −1 1 and T = W 1 U −T 1 [6] . The matrices U 1 , W 1 , U −1 1 , and W −1 1 can be obtained by a parallel non-pivoted LU factorization algorithm augmented to subtract S as in [6] , which makes the matrix diagonally dominant. The LU algorithms in [37] and [35] both obtain the desired costs. We use the former in our analysis.
When executed using pn/m processors, the algorithm in [37] takes BSP time, O (γ ·mn 2 /p +β ·m δ n 2−δ /p δ +α · (np/m) δ ). This cost was presented in [37] , modulo analysis of vertical communication cost, but as the algorithm is based purely on parallel multiplication of square matrices, the vertical communication cost is dominated by the horizontal communication cost. The algorithm also outputs the inverses of the triangular factors [37] , so matrix multiplications suffice to compute U = QW −1 1 and T = W 1 U −T 1 . These can be done using all the processors in time, O (γ ·mn 2 /p+β·m δ n 2−δ /p δ +α ) with M = O n δ m 1−δ p 1−δ 2 memory. As these costs and memory usage are no greater than in Theorem 3.6, we arrive at the postulated conclusion. □
SYMMETRIC EIGENSOLVERS
Algorithms for blocked computation of the eigenvalue decomposition of a symmetric matrix via a tridiagonal matrix were studied by [18, 19, 28] . These algorithms reduce an n × n symmetric matrix A to a matrix B with band-width b and the same eigenvalues as A via a series of k = (n − b)/b orthogonal transformations,
. A key property employed by these algorithms is that each twosided trailing matrix update of blocked Householder transformations may be done as a rank-2b symmetric update. To compute the two-sided transformation Q T XQ where X = X T and Q = (I − UTU T ), we can write
where V = 1 2 UT T U T XUT − XUT . This form of the update is cheaper to compute than the explicit two-sided update and is easy to aggregate by appending additional vectors to U (to aggregate the Householder form itself requires computing a largerT matrix). Since the trailing matrix update does not have to be applied immediately, but only to the columns which are factorized, this two-sided update can also be aggregated and used in a left-looking algorithm. For instance, to multiply Q T XQ by a matrix Y , we can compute
Returning to algorithms that compute a series of k two-sided transformations, we note that when computing V 2 from U 2 (to apply Q 2 ), we need to multiply U 2 by a submatrix of Q T 1 AQ 1 , which can be done without applying Q 1 , using the above form. Left-looking algorithms which generalize this idea and employ a delayed trailing matrix update have been used to reduce directly to tridiagonal form (b = 1) [18] .
However, there are disadvantages to reducing the symmetric matrix directly to tridiagonal form, since it requires that a vector be multiplied by the trailing matrix for each computation of V i of which there are n − 2. These matrix-vector multiplications require O (n) synchronizations and O (n) transfers of the trailing matrix between memory and cache (so long as it does not fit into cache). These disadvantages motivated approaches where the matrix is not reduced directly to tridiagonal form, but rather to banded form, which allows for b > 1 Householder vectors to be computed via QR at each step without needing to touch the trailing matrix from within the QR. After such a reduction to banded form, it is then necessary to reduce the banded matrix to tridiagonal form. However, this can be significantly less expensive because the trailing matrix is banded and requires less work and vertical communication to update than during the full-to-banded reduction step.
Such a multi-stage reduction approach was introduced by [10, 11] with the aim of achieving BLAS 3 reuse. These algorithms can reduce the banded matrix to tridiagonal or perform more stages of reduction, employing multiple intermediate band-widths. Performing more stages of successive band reduction can improve the synchronization cost of the overall approach, from O (n) as needed if reducing to tridiagonal form directly, to O ( √ p) as shown by [8] .
ELPA [5] is a distributed-memory library implementing a two-step reduction approach, motivated by reducing vertical communication cost. ELPA employs the parallel banded-to-tridiagonal algorithm introduced by [30] . Performance studies by [5] have demonstrated that this approach is particularly beneficial for large matrices. We first introduce an algorithm for reducing a full dense matrix to banded form, with up to O (p 1/6 ) less horizontal communication than previously known schemes. We subsequently introduce an algorithm for reducing a banded matrix to a smaller band-width, again with less communication than known approaches. Both of these reduction algorithms use a parallel routine 'QR', which performs QR factorization and outputs the Householder representation (U , T ) of the Q factor. We then give a combined, 2.5D symmetric eigensolver algorithm, that uses the first algorithm to reduce the dense symmetric matrix to band-width n max(p 2−3δ ,log p )
, then uses O (log p) calls to our band-to-band reduction, to arrive at a bandwidth of n/p, which is small enough to allow for efficient sequential computation of eigenvalues. The resulting symmetric eigensolver has the same BSP complexity as QR factorization (Lemma 3.5), modulo logarithmic factors in the number of processors for the vertical communication and synchronization costs. 
where U
1 and
Replicate U 1 and V 1 so that they are distributed cyclically over
11: % Recursively reduce the trailing matrix to banded form 12:
Ensure: B is a symmetric n-by-n matrix with band-width b and the same eigenvalues as The algorithm replicates the matrix A and aggregates as well as replicates the updates U (0) and V (0) (these update matrices should have m = 0 columns for the initial invocation of Algorithm 4.1) over c = p 2δ −1 layers of q 2 = p 2(1−δ ) processors. In the definition of the algorithm and the analysis we assume that c and q are integers for any given p. Each of these replicated matrices is stored in a 2D cyclic distribution on each processor grid layer, adhering to the layout assumptions of Algorithm 3.1. A cyclic layout yields local blocks which can be used within sequential routines the same way as done in a blocked layout. The assumption b mod q = 1 ensures that whenever each new panel of U and V is replicated (U 1 and V 1 on line 10), they can be concatenated to previously replicated panels while maintaining a perfectly load balanced cyclic distribution. Algorithm 4.1 performs the update correctly since, first, the computation of W =ĀU whereĀ = Q T AQ (line 8) follows the identity Eqn. (4.2) . Further, as computed on line 9, V takes the desired form,
Full-to-Band Reduction
the same one as the aggregated update matrix derived in Eqn. (4.1) . Consequently, the eigenvalues of the original matrix are preserved in the resulting banded matrix due to the ensured condition on the result of the tail recursion, which performs the update and factorization of the trailing matrix. In the base case, the matrix dimension is less than or equal to the desired matrix band-width, which means it suffices to perform the aggregated update and return the result, which would appear in the lower right block of the full banded matrix. We now analyze the execution time of Algorithm 4.1.
Lemma 4.1. Algorithm 4.1 can reduce any symmetric n-by-n matrix (input in any evenly-distributed layout and with n ≥ p) to a banded matrix with the same eigenvalues and any band-width
Proof. Since b ≥ n/p δ , we assume without loss of generality that b mod p 1−δ = 0. We also note that since b ≥ n/p δ , z = (bp δ /n) (1−δ )/δ ≥ 1. We note that the dimensions of A, U (0) , and V (0) at any recursive step will always be less than the dimension of the original matrix, n. Algorithm 4.1 assumes A, U (0) , and V (0) are initially replicated. Since each b × b block of these matrices is distributed cyclically and since b mod q = 0 (q = p 1−δ ), the submatrix extraction and concatenation done between recursive steps, can preserve perfect load balance without communication. To satisfy initial assumptions of the first invocation of Algorithm 4.1, we need to replicate the A matrix. Since, by assumption, it is distributed over all processors initially, the replication can be done
At each recursive step, Algorithm 4.1 performs a QR factorization, several matrix multiplications, and replicates U 1 and V 1 . Each O (n)× b QR factorization is done using a processor subgrid of dimensions p 1−δ × z × p 2δ −1 with a total of zp δ = p(b/n) (1−δ )/δ processors (picked to minimize both communication and synchronization) using Algorithm 3.2. By Theorem 3.6 and the fact that z ≥ 1, it takes BSP time,
memory. The two matrix multiplications on line 5 and the five matrix multiplications on line 8 (done right to left), all correspond to an O (n) ×O (n) replicated matrix multiplied by an O (n) ×b rectangular matrix. By Lemma 3.3, with w = max(1, bp 2−3δ /n), using M = O (n 2 /p 2(1−δ ) + nb/(wp δ )) = O (n 2 /p 2(1−δ ) ) memory, the time to compute these matrix multiplications is, if U (0) and V (0) start in cache,
In general (for any cache size), there is an additional cost of
The memory usage needed for these matrix multiplications is greater than that needed for the QR factorizations done by each set of processors. Since n 1/δ b 3−1/δ < n 2 b the computation cost of these matrix multiplications also dominates that of the QR factorizations.
The matrix multiplications needed to compute line 9 from right to left either operate on an O (n) × b matrix and a b × b matrix, like W · T , or result in a b × b matrix, like U T · (W T ). By Lemma 3.2 any matrix multiplication where two of the matrix dimensions are b and one is O (n), with v = p 2−3δ , takes BSP time,
Since b ≤ n/ log p, the above communication cost is never greater than that of the larger matrix multiplications, i.e. n 2/3 b 4/3 /p δ ≤ nb/p δ . The synchronization cost of the QR factorizations dominates that of of the matrix multiplications.
Replicating U 1 and V 1 over c subsets of q 2 processors (line 10) can be done in time, O β · nb/p 2(1−δ ) + α . Therefore, the cost over all n/b − 1 recursive steps when all replicated matrices fit into cache (when H > 3n 2 /p 2(1−δ ) ) is the total cost postulated in the theorem. In the second scenario (when H < 3n 2 /p 2(1−δ ) ), the algorithm incurs an extra additive factor of O (n/b) wn 2 p 2(1−δ ) in vertical communication cost. The memory usage is dominated by the replicated matrix multiplication (invocation of Lemma 3.3 above), which is also as stated in the theorem. □
Band-to-Band Reduction
We now consider algorithms for reducing a banded matrix to a smaller band-width, while preserving eigenvalues. We start by recalling a parallel algorithm designed for small band-widths [8] , then present Algorithm 4.2, which is designed to exploit additional parallelism given larger starting band-widths. Algorithm 4.2 describes the QR factorizations and applications necessary to reduce a symmetric banded matrix A from band-width b to band-width h = b/k via bulge chasing. The algorithm eliminates n/h trapezoidal panels via QR factorization, each of which generate bulges of nonzeros in the trailing matrix. Each bulge is subsequently chased down the band by O (n/b) eliminations again done by QR factorizations. Every new panel elimination is done immediately after the previously generated bulge is chased twice (including its initial panel elimination). Figure 2 depicts the QR factorizations necessary to eliminate a trapezoidal panel and chase two bulges generated from eliminating the first two panels, which are done concurrently in the algorithm. This type of pipelined successive band reduction approach was first considered by [10, 11] . The CA-SBR algorithm in [8] is similar, but assigns each processor a set of bulge chases at 
Proof. We consider the cost of one step of the CA-SBR algorithm [8] . A redistribution from any initial layout costs O (β ·nb +α ). The analysis in [8] shows that the cost of reducing from bandwidth b to b/2 has the computation, horizontal communication, and synchronization costs, as well as the memory usage postulated in the lemma. The algorithm consits of a bulge chase pipeline, executed in O (p) parallel steps, in which each processor works on O (n/p) columns, chasing O (n/(pb)) bulges O (n/(pb)) times, for a total of O (n 2 /(p 2 b 2 )) bulge chases. Since each bulge chase consists of a QR factorization and a matrix multiplication, with matrices of size O (b) × O (b), by Lemma 3.1 and Lemma 3.4, the vertical communication cost is O (ν · b 2 ) for each bulge chase. Summing the costs of the bulge chases over all parallel steps yields the postulated total cost. □
We now consider the cost of Algorithm 4.2. Its primary innovation is to perform each QR factorization and update in parallel using a subset of processors, leveraging both pipelined parallelism across different bulge chases as well as parallelism within a bulge chase. When the band-width becomes smaller, fewer processors are used to execute each bulge chase. %Π j applies chase j of bulge i as soon asΠ j−1 executes chase (j − 1) 6: for j = 1 :
% Define row and column offsets 8:
% Define index ranges needed for bulge chase 12: n r = min(n − o qr.r , b), n c = min(n − o up.c , h + 3b) 13: I qr.rs = o qr.r + (1 : n r ), I qr.cs = o qr.c + (1 : h) 14:
% Perform a rectangular parallel QR factorization 
Proof. The cost of each inner loop iteration (loop on line 6) can be derived from the costs of the matrix multiplications and QR done inside it. Let the pair (i, j) correspond to the the ith iteration of the outer loop and jth iteration of the inner loop. Figure 2 
The amount of memory needed for this QR factorization is given in Lemma 3.6 as
The is multiplied by a (b − h) × h matrix, while the update UV T involve (3b − h) × h matrix multiplied by an h × (b − h) matrix (VU T is just the transpose of the former). In both cases, by Lemma 3.2 with v =p 2−3δ /(k − 1) (we subtract one from k to make sure v ≥ 1), the BSP time to compute the matrix multiplications usingp processors is For a given outer loop (line 4) iteration i, each j loop iteration (line 6) is done by a different processor group. The total number of inner loop iterations is roughly (n/h)(n/b)/2 and they are pipelined among n/b groups of processors, up to n/(2b) of them working concurrently on different bulge chases at any given time. Consequently, the algorithm can be executed in O (n/h) phases, where at the ith phase, min(i − 1, (n − ih)/(2b)) processor groups chase bulges concurrently and the ith panel is eliminated. At each phase, a synchronization and data exchange is required between the QR factorization and trailing matrix updates computed by adjacent active processor groups. Therefore, the BSP cost of each recursive step of the algorithm corresponds to the cost of computing O (n/h) = O (kn/b) inner loop iterations using one processor group, which corresponds to the cost postulated in the lemma. □
Complete Symmetric Eigensolver
Algorithm 4.3 combines our algorithms for full-to-band reduction (Algorithm 4.1) with multiple subsequent stages of band-to-band reduction (Algorithm 4.2) and band-halving steps of the CA-SBR algorithm from [8] , which we refer to as CA-BR. Algorithm 4.1 reduces the symmetric matrix to one with band-width at most
The new 2.5D-Symmetric-Eigensolver algorithm trades off a variable amount of extra work, synchronization, and memory usage for a lower communication cost. Implementations of the algorithms in this paper permit optimizations such as
• alternating between left-looking partial updates and complete trailing matrix updates in Algorithm 4.1, • smaller bulge width in Algorithm 4.2 to increase parallelism in the bulge chase pipeline, • lookahead [2, 36] (overlapping QR with updates).
Our analysis shows that a carefully parameterized collage of parallel algorithms and optimizations yields asymptotic cost improvements with minimal overhead. We combine approaches (2.5D algorithms, aggregation, successive band reduction) that have been successful on modern architectures [5, 6, 33] , so our innovations should pave the path for practical improvements in scalability of applications computing singular values or eigenvalues of matrices.
