A communication-avoiding parallel algorithm for the symmetric eigenvalue
  problem by Solomonik, Edgar et al.
A communication-avoiding parallel algorithm for
the symmetric eigenvalue problem
Edgar Solomonik
ETH Zurich
Email: solomonik@inf.ethz.ch
Grey Ballard
Sandia National Laboratory
Email: gmballa@sandia.gov
James Demmel
University of California, Berkeley
Email: demmel@cs.berkeley.edu
Torsten Hoefler
ETH Zurich
Email: htor@inf.ethz.ch
Abstract—Many large-scale scientific computations require
eigenvalue solvers in a scaling regime where efficiency is limited
by data movement. We introduce a parallel algorithm for comput-
ing the eigenvalues of a dense symmetric matrix, which performs
asymptotically less communication than previously known ap-
proaches. We provide analysis in the Bulk Synchronous Parallel
(BSP) model with additional consideration for communication
between a local memory and cache. Given sufficient memory to
store c copies of the symmetric matrix, our algorithm requires
Θ(
√
c) less interprocessor communication than previously known
algorithms, for any c ≤ p1/3 when using p processors. The
algorithm first reduces the dense symmetric matrix to a banded
matrix with the same eigenvalues. Subsequently, the algorithm
employs successive reduction to O(log p) thinner banded ma-
trices. We employ two new parallel algorithms that achieve
lower communication costs for the full-to-band and band-to-
band reductions. Both of these algorithms leverage a novel QR
factorization algorithm for rectangular matrices.
I. INTRODUCTION
The eigenvalue decomposition of a symmetric matrix A is
A = UDUT where D is a diagonal matrix of eigenvalues and
the columns of the orthogonal matrix U are the eigenvectors of
A. Dense symmetric eigensolvers typically reduce the matrix
to a tridiagonal matrix with the same eigenvalues, compute the
eigenvalues D of this tridiagonal matrix [1], and, if desired,
apply the orthogonal transformation backwards to compute
the eigenvectors U . Although algorithms for tridiagonalizing a
symmetric matrix require the same asymptotic amount of work
as one-sided decompositions such as LU and QR factorization,
they have a more complex dependency structure, which makes
communication-efficient parallelization challenging. Efficient
execution of scientific applications such as electronic struc-
ture methods, which compute eigenvalue decompositions of
a sequence of symmetric matrices (see, e.g. Hartree-Fock
method [2], [3]), requires scalable symmetric eigensolvers.
We analyze the scalability of parallel algorithms in a Bulk
Synchronous Parallel (BSP) cost model [4]. In addition to
quantifying horizontal communication (data movement be-
tween processors) and synchronization, we augment the BSP
model with an additional bandwidth cost parameter for vertical
communication (data movement between memory and cache).
There are known algorithms for Cholesky, LU, and QR fac-
torization [5]–[7], which for n × n input matrices on a p-
processor system, have horizontal communication complexity
W = O(n2/
√
cp), require S = O(
√
cp) synchronizations, and
use M = O(cn2/p) memory per processor. Most commonly,
2D processor grids are used by algorithms that achieve this
communication complexity for c = 1, but 3D processor
grids and more complicated schemes are needed to achieve
the complexity with any c ∈ [1, p1/3] and obtain practical
performance improvements [7]. For Cholesky factorization,
which is simpler than LU and QR, these algorithms attain
communication lower bounds W = O
(
n3
pM1/2
)
[8] and
W · S = Ω(n2) [9], for a range of W parameterized by c.
The best previously known algorithms for solving the sym-
metric eigenvalue problem directly, use 2D parallelizations and
achieve the cost W = O(n2/
√
p). We introduce algorithms
that reduce the horizontal communication cost asymptotically
by a factor of
√
c, while using a factor of c more memory and√
c more synchronizations, in the same fashion as previously
done for one-sided factorizations. The new algorithms are gen-
eralizations of previously known approaches, and the flexibil-
ity offered by the parameter c increases the dimensionality of
the tuning space for symmetric eigensolver implementations.
In particular, employing a large c is attractive for bandwidth-
constrained problems on massively-parallel architectures.
Our algorithms focus on reducing the symmetric matrix to
thinner and thinner banded matrices with the same eigenvalues.
This “successive band reduction” approach [10], [11], i.e.
reducing to an intermediate banded matrix rather than directly
to tridiagonal, has been used to reduce vertical communication
and synchronization costs [12]. Further, in practice, algorithms
using a two-stage (full-to-banded and banded-to-tridiagonal)
approach [13], [14] have been shown to outperform libraries
that reduce directly to tridiagonal (like ScaLAPACK [15]).
However, a disadvantage of successive band reduction is in-
creasing the number of back transformations, which are needed
to compute eigenvectors. Unlike the forward application of
transformations whose computation cost scales linearly with
the matrix band-width, known algorithms for back transfor-
mations require O(n3) operations for each intermediate band-
width used.
The BSP model allows us to formulate and analyze al-
gorithms as compositions of a set of common building-
blocks. We leverage algorithms for matrix multiplication and
QR factorization within our symmetric eigensolvers. For QR
factorization, we provide an approach that extends approaches
for tall-and-skinny matrices [16] and square matrices [6] to be
efficient for arbitrary rectangular matrices.
ar
X
iv
:1
60
4.
03
70
3v
1 
 [c
s.D
C]
  1
3 A
pr
 20
16
We use these building blocks to define algorithms for reduc-
ing a dense matrix to a banded matrix, and a banded matrix
to a thinner band-width, while preserving eigenvalues. Our
main algorithm combines these, using O(log p) intermediate
band-widths. The algorithm is work-efficient for computing
eigenvalues, requires O(n2/
√
cp) horizontal communication,
O(n2 log p/
√
cp) vertical communication, and O(
√
cp log2 p)
synchronizations (BSP supersteps). Known approaches for
back-transformations to compute eigenvectors require the
same asymptotic amount of computation for matrices of any
band-width, meaning our approach may require a computation
cost of O(n3 log p/p) if all eigenvectors are needed. We
leave the analysis of back-transformation computation for
future work, but propose a potential approach to reduce the
number of intermediate band-widths needed by our symmetric
eigensolver.
II. THEORETICAL COST MODEL
We use the Bulk Synchronous Parallel (BSP) model [4]
with an additional parameter to measure the cost of traffic
between memory and cache. We derive asymptotic bounds on
the parallel running-time of our algorithms for this two-level
architectural model, with consideration for both communica-
tion between processors and in the memory hierarchy of each
processor. The BSP model permits an all-to-all communication
to be done with unit synchronization cost, which will allow
us to construct BSP algorithms for general matrix distributions
and compose them without significant overhead.
We employ cost notation typically used for the α–β com-
munication model. As all stored and communicated datasets
in this paper consist exclusively of floating-point numbers, we
quantify sizes in terms of ‘words’ (floating-point numbers of
a given precision). We model the memory hierarchy of each
processor by a main ‘slow’ memory (i.e. DRAM) and a ‘fast’
memory (i.e. cache). We permit interprocessor (horizontal)
communication to move data between main memories of dif-
ferent processors, and intraprocessor (vertical) communication
to move data between main memory and cache of a single
processor. Our architectural model is characterized by the
following parameters:
• p – processors on a fully-connected network,
• M – words of memory owned by each processor,
• H – words of cache owned by each processor,
• γ – time to compute a floating point operation,
• β – time to send or receive a word,
• ν – time to move a word between cache and memory,
• α – time to perform a (global) synchronization.
We bound the cost of each algorithm by measuring four
quantities:
• F – number of local floating point operations performed
(computation cost),
• W – number of words of data moved between processors
(horizontal communication cost),
• Q – number of words of data moved between main
memory and cache (vertical communication cost),
• S – number of BSP supersteps (synchronization cost).
If at each superstep i ∈ [1, S], processor j performs F ji local
operations, sends and receives W ji total words, and performs
Qji reads and writes to memory, then the costs of the BSP
algorithm are
F =
S∑
i=1
max
j∈[1,p]
F ji , W =
S∑
i=1
max
j∈[1,p]
W ji , Q =
S∑
i=1
max
j∈[1,p]
Qji ,
and the BSP execution time of this algorithm is
T = Θ(γ · F + β ·W + ν ·Q+ α · S).
This model does not consider overlap between communication
and computation (or between other costs), as such overlap does
not affect the overall asymptotic time.
We simplify asymptotic cost expressions by assuming γ ≤
β. Further, we write only vertical communication terms which
are not associated with horizontal communication or with com-
putations that achieve a factor of
√
H cache reuse (optimal for
matrix multiplication [17]). These simplifications correspond
to the assumptions on the relative communication times, ν ≤ β
and the floating point rate ν ≤ γ · √H . However, general
vertical communication cost upper-bounds may be obtained
from our stated results for arbitrary ν by reinserting the term
O(ν · (F/√H +W )).
We will provide asymptotic bounds for the BSP cost of
all algorithms in the paper. Sometimes, we will employ
algorithms as building blocks whose cost has been analyzed
in the standard α − β model, which is restricted to point-to-
point messaging (pairwise synchronization). These algorithms
are trivially translated to the BSP model used in this paper,
which is less restrictive (allows bulk synchronizations).
Throughout the paper, we will assume that matrix di-
mensions are greater than and divisible by the number of
processors. When it is clear that the asymptotic costs would
not be affected, we will also omit floors and ceilings when
subdividing the number of processors and matrix dimensions.
III. BUILDING BLOCKS
We first state known results and provide minor extensions
to quantify the complexity of matrix multiplication and of QR
factorization in our cost model. These results will be critical
in the cost analysis of the new symmetric eigensolvers, which
use matrix multiplication and QR factorization as subroutines.
A. Matrix Multiplication
Our symmetric eigensolvers will perform matrix multipli-
cations, often of nonsquare matrices. We consider the BSP
cost of multiplication of arbitrary rectangular matrices with
any starting distribution. Additionally, we specially consider
the BSP cost of a matrix multiplication of a pre-replicated
matrix with another matrix in an arbitrary distribution. We start
with the vertical communication cost of a matrix multiplication
done by a single processor.
Lemma III.1. The multiplication of matrices of dimensions
m× n and n× k can be done by a single processor in time,
O(γ ·mnk + ν · [mn+mk + nk]).
The Rec-Mult algorithm [18, Theorem 1] obtains the ver-
tical communication cost given in Lemma III.1. We omit the
usual term O(ν ·mnk/√H), since we have ν ≤ γ · √H .
We now consider the full BSP cost of parallel rectangular
matrix multiplication. The communication cost of square ma-
trix multiplication is well known [5], [19]–[23]. The horizontal
costs of rectangular matrix multiplication have also been ana-
lyzed within the α–β communication model, where a recursive
algorithm was proposed [24] that attains the communication
lower bound. We show that the algorithm in [24] can be
executed within the time specified in the subsequent Lemma,
for any initial load balanced distribution of the matrices.
It is possible to also design different matrix multiplication
algorithms in the BSP model with a Θ(log p) factor less in
synchronization cost, but the overall synchronization costs of
our QR and symmetric eigensolve algorithms (which use the
subsequent Lemma) would not be affected. We parameterize
the memory used by the algorithm by a parameter v, which
controls how many block matrix multiplications are performed
by each processor.
Lemma III.2. For any v ≥ 1, the multiplication of matrices
of dimensions m×n and n×k in any load-balanced starting
layout can be done in BSP time,
O
(
γ · mnk
p
+ β ·
[
mn+ nk +mk
p
+ v1/3
(
mnk
p
)2/3]
+ α · v log p
)
,
using M = O
(
mn+nk+mk
p +
(
mnk
vp
)2/3)
memory.
Proof. We consider the cost of the recursive ‘CARMA’ al-
gorithm [24]. The algorithm assumes specific initial ma-
trix layouts, but does not assume any initial data is repli-
cated. Therefore, starting from load balanced layouts, the
BSP time to move to the layouts specified by CARMA is
O(β · mn+nk+mkp + α). Because the computation is load
balanced, the computation cost is O(γ ·mnk/p). The latency
cost of the CARMA algorithm is an upper-bound on the
number of BSP supersteps necessary to execute it. In [24],
the latency cost is shown to be O
(
mnk
pM3/2
log p
)
= O(v log p).
The communication cost of CARMA is presented in cases for
1D, 2D, and 3D processor grids. We show that the postulated
BSP time upper-bound holds for all cases.
We first argue that the vertical communication cost of
the local matrix multiplications (given by Lemma III.1) is
dominated by horizontal communication due to the assumption
β ≥ ν. In the 3D case, the operand matrix blocks are nearly
square, and either one of the operands or the output is always
communicated, so horizontal communication cost dominates
vertical communication cost. In the 1D and 2D cases, each
processor performs a single local matrix multiplication, where
the largest operand has size O(mn+nk+mkp ), since it is the
local block of the largest matrix, which is distributed across
all processors.
For the horizontal communication costs, we let d1 =
min(m,n, k), d2 = median(m,n, k), and d3 = max(m,n, k)
as in [24]. If p < d3/d2 (1D case), then d1d2 < d1d3/p, so
the provided cost O(β · d1d2) = O(β · (mn+ nk +mk)/p).
If d3/d2 ≤ p ≤ d2d3/d21 (2D case), then the provided
cost O(β ·
√
d21d2d3/p) = O(β · (mn + nk + mk)/p).
Finally, if p > d2d3/d21 (3D case), the provided cost O(β ·
[mnk/(p
√
M) + (mnk/p)2/3]) = O(β · v1/3(mnk/p)2/3).
The algorithm analyzed in Lemma III.2 allows any ini-
tial load balanced matrix distributions. We now consider
Algorithm III.1, which assumes an initial distribution with
replicated data and subsequently can multiply certain matrices
in less time than given by Lemma III.2. In Algorithm III.1,
one of the input matrices is stored redundantly on c = p2δ−1
2D processor grids for any c ∈ [1, p1/3] (δ ∈ [1/2, 2/3]). The
parameterization by δ is the same as α in [6], while c is the
same replication factor as in [7]. The parameter w controls
the number of supersteps (block matrix multiplications) in
Algorithm III.1.
The algorithm permits the distribution to be defined as a
blocking of the matrices after permutation by P (1), P (2). Our
analysis assumes the blocking is roughly, but not necessarily
exactly load balanced, permitting the analysis to be used within
a cyclic or block-cylic matrix factorization algorithm where
different processors perform updates (matrix multiplications)
with a slightly different amount of local data at each step.
We will employ Algorithm III.1 with cyclic distributions,
for which P (1)ij = 1 for i = (j mod q)(m/q) + bj/qc
and P (2)jk = 1 for k = (j mod q)(n/q) + bj/qc. On each
processor grid layer, the algorithm executes a variant of the
SUMMA algorithm [25], which communicates the operand B
and reduces the output C. This variant is chosen, since we will
use the algorithm in cases where the operand A is of greater
size than B and C.
Lemma III.3. Consider Algorithm III.1 for multiplication of
matrices A and B of dimensions m×n and n× k, where the
initial distributions of A and B satisfy the stated requirements
for permutations P (1) and P (2) where each block Aij of
P (1)AP (2) has dimensions O(m/p1−δ) × O(n/p1−δ). Then,
using M = O(mn/p2(1−δ) + (mk+ nk)/(wpδ)) memory for
any w ∈ [1, p1−δ], the algorithm can be executed in BSP time,
O
(
γ · mnk
p
+ β · mk + nk
pδ
+ α · w
)
,
when H ≥ mn/p2(1−δ) and the copies of A start inside
cache, and otherwise with an extra cost of O(ν · wmn
p2(1−δ) ).
Proof. As required by Algorithm III.1, B starts in any load-
balanced distribution over the p processors. As the initial
layout is load-balanced the redistribution done on line 4 costs
O(β · nk/p+ α). The gather on line 9 and reduce-scatter on
line 11 are dual communication patterns with respect to each
other. Together, they cost O(β · (mk + nk)/(qcw) + α), and
over all w iterations over index h cost, O(β ·(mk+nk)/(qc)+
α · w) = O(β · (mk + nk)/pδ + α · w).
Algorithm III.1 [C]← Streaming-MM(A,B, P (1), P (2),Π)
Require: Given positive integers p,m, n, k, w and δ ∈ [1/2, 2/3]:
Π is a grid of q×q×c processors with q = p1−δ and c = p2δ−1,
A is m × n, B is n × k. For each l ∈ [1, c], Π[i, j, l] owns all
elements in Aij , defined by square permutation matrices P (1),
P (2), as P (1)AP (2) =
A11 · · · A1q... . . . ...
Aq1 · · · Aqq
.
1: B is in any load balanced layout over all p processors.
2: Let z = wc
3: Partition B into blocks: P (2)TB =
B11 · · · B1z... ...
Bq1 · · · Bqz
.
4: Redistribute B so that each Π[i, j, l] owns k/(zq) columns of
Bjh for each h ∈ {l, l + c, . . . , l + (w − 1)c}.
5: % Execute loop iterations in parallel
6: for i ∈ [1, q], j ∈ [1, q], l ∈ [1, c] do
7: % Execute loop iterations in sequence
8: for h ∈ {l, l + c, . . . , l + (w − 1)c} do
9: Gather Bjh on Π[i, j, l]
10: Compute C¯ijh = Aij ·Bjh on Π[i, j, l]
11: Reduce-scatter Cih =
∑c
j=1 C¯ijh so that each Π[i, j, l]
owns k/(zq) columns of Cih
Require: C = A ·B is distributed so that each processor in Π owns
mk/p elements of C.
The w local matrix multiplications take time,
O
(
γ · mnk
p
+ ν ·
(
wmn
p2(1−δ)
+
mk + nk
pδ
))
,
by Lemma III.1. However, if the entire matrix A starts in
cache, which is possible if H ≥ mn/p2(1−δ), it suffices to
read only the entries of Bjh from memory into cache and
write the entries of C¯ijh out to memory. In this case, the
vertical communication cost is O(ν ·mk+nkqc ) = O(ν ·mk+nkpδ ).
This term is dominated by the interprocessor communication
term since β ≥ ν. The memory usage corresponds to the
storage necessary for each block: Aij , Bjh, and C¯ijh, M =
O
(
mn
p2(1−δ) +
mk+nk
wpδ
)
.
B. QR Factorization
We will use QR factorization within our symmetric eigen-
solver algorithms to obtain orthogonal transformations that
introduce zeros when applied to the symmetric matrix. The
vertical communication cost of executing a sequential QR
factorization is proportional to that of matrix multiplication.
Lemma III.4. The QR factorization of an m × n matrix A
with m ≥ n can be done by a single processor in time,
O(γ ·mn2 + ν ·mn).
The sequential Communication-Avoiding QR (CAQR) al-
gorithm achieves the vertical communication cost given above
[16]. The Householder representation, lower trapezoidal m×n
matrix U and upper-triangular n × n matrix T so that Q =
I − UTUT , may be obtained with the cost of Lemma III.4
using Householder reconstruction [26].
We now consider parallel QR factorization, firstly for square
matrices.
Lemma III.5. The QR factorization of an n × n matrix A
distributed in any load-balanced layout can be computed using
M = O
(
n2
p2(1−δ)
)
memory for any δ ∈ [1/2, 2/3] in BSP time,
O
(
γ · n
3
p
+ β · n
2
pδ
+ α · pδ
)
.
The QR algorithm given by [6] in the BSP model achieves
the costs given in Lemma III.5. The vertical communica-
tion cost was not analyzed in [6]. However, the algorithm
consists purely of distributed matrix multiplications or QR
factorizations, which by Lemma III.1 and Lemma III.4 have a
vertical communication cost proportional to the matrix sizes.
As the analysis in [6] assumes all matrices that participate
in multiplication or QR factorization are communicated, due
to ν < β, the horizontal communication cost dominates the
vertical communication costs associated with these operations.
We now adapt the QR algorithm from [6] to handle rectan-
gular matrices with a desirable asymptotic cost (the embedding
used in [6] is inefficient for tall-and-skinny matrices). Our
adaptation is based on a binary QR reduction tree, with
QR factorizations of nearly square matrices done at every
node in the tree performed using the algorithm from [6].
An approach employing a QR reduction tree using Givens
rotations goes back to [27], a blocked flat tree approach
(optimal sequentially) was presented in [28], and a parallel
block reduction tree approach was given earlier in [29]. Our
approach is closest to the TSQR algorithm [16], except a set
of up to qmax processors works on each tree node.
Algorithm III.2 computes the QR factorization of an m×n
matrix, outputting the first n columns of the orthogonal Q
factor, as well as the n × n upper-triangular matrix R. The
algorithm assumes the existence of a sequential routine ‘QR’
and a parallel routine for (nearly) square matrices ‘square-QR’.
Theorem III.6. Algorithm III.2 can compute the QR fac-
torization of any m × n matrix A with m ≥ n in a load-
balanced layout, using M = O
((
nδm1−δ
p1−δ
)2)
memory for any
δ ∈ [1/2, 2/3], in BSP time,
O
(
γ · mn
2
p
+ β ·
(
mδn2−δ
pδ
+
mn
p
)
+ α ·
(
np
m
)δ
log2 p
)
.
Proof. We assume without loss of generality that m/n and p
are powers of two. Let T (m¯) be the cost of Algorithm III.2 for
an m¯×n matrix using p processors. Note that m corresponds to
the number of rows in the original input matrix, while m¯ will
be used to refer to the number of rows at a given recursive step.
We select the maximum number of processors to be used in
base-case square QR factorizations to be qmax = pnm log(p)
1/δ ,
in order to minimize synchronization cost while achieving an
optimal horizontal communication cost.
The cost of the sequential base case of Algorithm III.2 is, by
Lemma III.4, Tbs1(m¯) = O(γ · m¯n2 +ν · m¯n). When reaching
the square base case (dimension 2n×n, since m/n is a power
of two), we employ the square QR algorithm [6] with up to
qmax =
pn
m log(p)
1/δ processors. We can bound the cost of
this QR is by Lemma III.5. We break the cost into two cases:
Algorithm III.2 [Q,R]← rect-QR(A,Π)
Require: Given positive integers p,m, n, qmax and δ ∈ [1/2, 2/3]:
Π is a set of p processors, A is m× n, m/n and p are powers
of two, and each Π[i] owns mn/p elements of A.
1: if p = 1 then Compute [Q,R] = QR(A) sequentially and exit.
2: if m ≤ 2n then Compute [Q,R] = square-QR(A,Π[1 :
min(p, qmax)]) and exit.
3: Let r = min(p, d m
2n
e) and partition A = [AT1 · · · ATr ]T so
that each Ai is m/r × n
4: % Execute loop iterations in parallel
5: for i ∈ [1, r] do
6: [Wi, Ri] = rect-QR(Ai,Π[(i−1)(p/r)+1 : i(p/r)])
7: [Z,R] = rect-QR
([
RT1 · · · RTr
]T
,Π
)
8: Partition Z =
[
ZT1 · · · ZTr
]T so that each Zi is n× n
9: % Execute loop iterations in parallel
10: for i ∈ [1, r] do
11: Compute Qi = WiZi using Π[(i−1)(p/r) + 1 : i(p/r)]
Require: A = Q ·R where Q = [QT1 · · · QTr ]T is m× n with
orthogonal columns, R is n × n and upper-triangular, both are
distributed in load balanced layouts across Π.
Tbp(p¯) = Tbp1(p¯) when p¯ < qmax and Tbp(p¯) = Tbp2 when
p¯ ≥ qmax, where
Tbp1(p¯) = O(γ · n3/p¯+ β · n2/p¯δ + α · p¯δ),
Tbp2 = O
(
γ · mn
2
p log(p)1/δ
+ β · m
δn2−δ
pδ log p
+ α ·
(np
m
)δ
log p
)
.
The square QR algorithm requires that the matrix be em-
bedded into a slanted panel [6]. This can be done generally
by using a somewhat larger matrix, but in all except the first
recursive call, the 2b× b matrix will have the structure of two
stacked upper-triangular matrices. The rows of these upper-
triangular matrices can be interleaved to produce a slanted
panel without embedding into a larger matrix.
The recursive calls on line 6 always immediately encounter
one of the base-cases. The only time base cases can have
a matrix with dimension other than 2n × n is during the
invocations on line 6 at the first recursive step of the algorithm,
and only when m > 2np. Therefore, we consider this first
recursive step of Algorithm III.2 separately. The cost of the
first recursive step, when m > 2np, includes
• the cost of a potential redistribution, O(β ·mn/p+ α),
• the cost of the invocations on line 6 (which lead to base
cases), Tbs1(m/p), since r = min(p, dm/2ne) = p,
• the cost of the matrix multiplications on line 11, which
are done concurrently, each by a single processor, is O(γ ·
mn2/p+ ν ·mn/p).
We can therefore bound the total BSP time of the algorithm
for m > 2np by
T (m) = T (np) + Tbs1
(m
p
)
+O
(
γ · mn
2
p
+ β · mn
p
)
= T (np) +O(γ ·mn2/p+ β ·mn/p+ α).
We note that the cost of this first recursive step for m > 2np
is no greater than the cost postulated in the theorem. We now
focus on subsequent recursive calls into line 11 or the case
when m ≤ 2np, the matrix multiplications done on line 11
involve matrices of size at most 2n× n, each executed using
pn/m¯ processors. By Lemma III.2 with v = (pn/m¯)2−3δ ,
these matrix multiplications (done concurrently) take time,
TMM(m¯) =
O
(
γ · m¯n
2
p
+ β ·
(
m¯n
p
+
m¯δn2−δ
pδ
)
+ α ·
(
pn
m¯
)2−3δ
log p
)
and use M = O
((
nδm¯1−δ
p1−δ log p
)2)
memory. When combined
with the concurrent recursive calls on line 11 on matrices of
size 2n × n with pn/m¯ processors and the recursive call on
line 7 on a matrix of size m¯/2× n with all p processors, we
obtain the following BSP time recurrence for m¯ ≤ 2np,
T (m¯) =T (m¯/2) + Tbp(pn/m¯) + TMM(m¯),
where Tb(pn/m¯) is a base case where up to qmax processors
perform the QR. We consider the two cases (for m¯ ≤ 2np),
T (m¯) =T (m¯/2) + TMM(m¯) +
{
Tbp1(pn/m¯) : pn/m¯ < qmax
Tbp2 : pn/m¯ ≥ qmax
Since qmax = pnm log(p)
1/δ , and m¯ decreases by a factor of
two at each step, up to the first (1/δ) log log p recursive steps
make the call on line 6 with more than qmax processors. The
computation and communication cost of these calls are no
greater than that of matrix multiplication (part of TMM(m¯)),
while the synchronization cost increases geometrically, going
up to the latency cost in Tbp2. Therefore, the recurrence is
asymptotically equivalent to (for m¯ ≤ 2np),
T (m¯) = T (m¯/2) + TMM(m¯) + Tbp2
=T (m¯/2) +O
(
γ ·
(
m¯n2
p
+
mn2
p log(p)1/δ
)
+ β ·
(
m¯n
p
+
m¯δn2−δ
pδ
+
mδn2−δ
pδ log p
)
+ α ·
(
pn
m
)δ
log p
)
.
Since, m¯ ≤ 2np, one of the base-cases is reached after log p
steps, and so the above time reduces to the one postulated in
the theorem.
Alternate communication-efficient formulations of a rectan-
gular QR algorithm are also possible (for instance by com-
bining column-recursion [30] with communication-efficient
matrix multiplication, see [31]). We would like to work with
the Householder representation to apply orthogonal transfor-
mations efficiently in our symmetric eigensolver algorithms,
so we give the following corollary.
Corollary III.7. The Householder representation of the m×n
orthogonal matrix Q computed by Algorithm III.2, Q = (I −
UTUT1 ), where U1 is the lower triangular top n×n block of
U , while T is upper-triangular and UTU = T−1 +T−T , can
be obtained with no greater asymptotic cost or memory than
given in Theorem III.6.
Proof. The Householder representation U, T can be obtained
stably by executing [U1,W1] = LU(Q1 − S) where Q1 is the
top n × n block of Q and S is a diagonal sign matrix, then
computing U = QW−11 and T = W1U
−T
1 [26]. The matrices
U1, W1, U−11 , and W
−1
1 can be obtained by a parallel non-
pivoted LU factorization algorithm augmented to subtract S as
in [26], which makes the matrix diagonally dominant. The LU
algorithms in [32] and [7] would both obtain the desired costs,
but the former is slightly more convenient for our analysis.
When executed using pn/m processors, the algorithm
in [32] takes BSP time, O(γ · mn2/p + β · mδn2−δ/pδ +
α · (np/m)δ). This cost was presented in [32], modulo anal-
ysis of vertical communication cost, but as the algorithm is
based purely on parallel multiplication of square matrices, the
vertical communication cost is dominated by the horizontal
communication cost. The algorithm also outputs the inverses
of the triangular factors [32], so matrix multiplications suffice
to compute U = QW−11 and T = W1U
−T
1 . These can be done
using all the processors in time, O(γ·mn2/p+β·mδn2−δ/pδ+
α) with M = O
((
nδm1−δ
p1−δ
)2)
memory. As these costs and
memory usage are no greater than in Theorem III.6, we arrive
at the postulated conclusion.
IV. SYMMETRIC EIGENSOLVERS
Algorithms for blocked computation of the eigenvalue de-
composition of a symmetric matrix via a tridiagonal matrix
were studied by [33]–[35]. These algorithms reduce an n× n
symmetric matrix A to a matrix B with band-width b and the
same eigenvalues as A via a series of k = (n−b)/b orthogonal
transformations,
B = QT1 · · ·QTkBQk · · ·Q1,
where each Qi is representable in terms of b Householder
vectors, aggregated in a trapezoidal matrix Ui, as Qi = (I −
UiTiU
T
i ).
A key property employed by these algorithms is that each
two-sided trailing matrix update of blocked Householder trans-
formations may be done as a rank-2b symmetric update. To
compute the two-sided transformation QTXQ where X =
XT and Q = (I − UTUT ), we can write
QTXQ =(I − UTTUT )X(I − UTUT )
=X + UV T + V UT , (IV.1)
where V = 12UT
TUTXUT−XUT . This form of the update
is cheaper to compute than the explicit two-sided update and
is easy to aggregate by appending additional vectors to U (to
aggregate the Householder form itself requires computing a
larger T matrix). Since the trailing matrix update does not
have to be applied immediately, but only to the columns which
are factorized, this two-sided update can also be aggregated
and used in a left-looking algorithm. For instance, to multiply
QTXQ by a matrix Y , we can compute
QTXQY = XY + UV TY + V UTY. (IV.2)
Returning to algorithms that compute a series of k two-
sided transformations, we note that when computing V2 from
U2 (to apply Q2), we need to multiply U2 by a submatrix
of QT1 AQ1, which can be done without applying Q1, using
the above form. Left-looking algorithms which generalize this
idea and employ a delayed trailing matrix update have been
used to reduce directly to tridiagonal form (b = 1) [33].
However, there are disadvantages to reducing the symmetric
matrix directly to tridiagonal form, since it requires that a vec-
tor be multiplied by the trailing matrix for each computation of
Vi of which there are n−2. These matrix-vector multiplications
require O(n) synchronizations and O(n) transfers of the
trailing matrix between memory and cache (so long as it does
not fit into cache). These disadvantages motivated approaches
where the matrix is not reduced directly to tridiagonal form,
but rather to banded form, which allows for b > 1 Householder
vectors to be computed via QR at each step without needing
to touch the trailing matrix from within the QR. After such
a reduction to banded form, it is then necessary to reduce
the banded matrix to tridiagonal form. However, this can
be significantly less expensive because the trailing matrix is
banded and requires less work and vertical communication to
update than during the full-to-banded reduction step.
Such a multi-stage reduction approach was introduced
by [10], [11] with the aim of achieving BLAS 3 reuse.
These algorithms can reduce the banded matrix to tridiagonal
or perform more stages of reduction, employing multiple
intermediate band-widths. Performing more stages of succes-
sive band reduction can improve the synchronization cost
of the overall approach, from O(n) as needed if reducing
to tridiagonal form directly, to O(
√
p) as shown by [12].
ELPA [13] is a distributed-memory library implementing a
two-step reduction approach, motivated by reducing vertical
communication cost. ELPA employs the parallel banded-to-
tridiagonal algorithm introduced by [36]. Performance studies
by [13] have demonstrated that this approach is particularly
beneficial for large matrices.
We first introduce an algorithm for reducing a full dense
matrix to banded form, with up to O(p1/6) less horizontal
communication than previously known schemes. We subse-
quently introduce an algorithm for reducing a banded matrix
to a smaller band-width, again with less communication than
known approaches. Both of these reduction algorithms use
a parallel routine ‘QR’, which performs QR factorization
and outputs the Householder representation (U , T ) of the Q
factor. We then give a combined, 2.5D symmetric eigensolver
algorithm, that uses the first algorithm to reduce the dense
symmetric matrix to band-width n
max(p2−3δ,log p) , then uses
O(log p) calls to our band-to-band reduction, to arrive at
a band-width of n/p, which is small enough to allow for
efficient sequential computation of eigenvalues. The resulting
symmetric eigensolver has the same BSP complexity as QR
factorization (Lemma III.5), modulo logarithmic factors in
the number of processors for the vertical communication and
synchronization costs.
A. Full-to-Band Reduction
Algorithm IV.1 reduces a symmetric n-by-n matrix A to
band-width b using replication of data and aggregation. It
achieves a horizontal communication cost of W = O(n2/pδ),
when the amount of available memory on each processor is
M = O(n2/p2(1−δ)). The algorithm is left looking, meaning
it updates the next matrix panel (line 5) immediately prior to
performing the QR of the panel. Figure 1 displays the key
matrices employed in Algorithm IV.1, specifically the third
and fourth steps of recursion.
The algorithm replicates the matrix A and aggregates as
well as replicates the updates U (0) and V (0) (these update
matrices should have m = 0 columns for the initial invocation
of Algorithm IV.1) over c = p2δ−1 layers of q2 = p2(1−δ) pro-
cessors. In the definition of the algorithm and the analysis we
assume that c and q are integers for any given p. Each of these
replicated matrices is stored in a 2D cyclic distribution on
each processor grid layer, adhering to the layout assumptions
of Algorithm III.1. A cyclic layout yields local blocks which
can be used within sequential routines the same way as done
in a blocked layout. The assumption b mod q = 1 ensures
that whenever each new panel of U and V is replicated (U1
and V1 on line 10), they can be concatenated to previously
replicated panels while maintaining a perfectly load balanced
cyclic distribution.
Algorithm IV.1 performs the update correctly since, first, the
computation of W = A¯U where A¯ = QTAQ (line 8) follows
the identity Eqn. (IV.2). Further, as computed on line 9, V
takes the desired form,
V =
(
1
2
UTTUT − I
)
WT =
1
2
UTTUT A¯UT − A¯UT,
the same one as the aggregated update matrix derived in
Eqn. (IV.1). Consequently, the eigenvalues of the original
matrix are preserved in the resulting banded matrix due to
the ensured condition on the result of the tail recursion, which
performs the update and factorization of the trailing matrix.
In the base case, the matrix dimension is less than or equal
to the desired matrix band-width, which means it suffices to
perform the aggregated update and return the result, which
would appear in the lower right block of the full banded
matrix. We now analyze the execution time of Algorithm IV.1
in the BSP cost model.
Lemma IV.1. Algorithm IV.1 can reduce any symmetric n-
by-n matrix (input in any evenly-distributed layout and with
n ≥ p) to a banded matrix with the same eigenvalues and any
band-width n/pδ ≤ b ≤ n/ log p, using M = O(n2/p2(1−δ))
memory for any δ ∈ [1/2, 2/3], when H > 3n2/p2(1−δ), in
BSP time,
O
(
γ · n
3
p
+ β · n
2
pδ
+ α · pδ log2 p
)
.
If H ≤ 3n2/p2(1−δ), then there is an additional vertical
communication cost of O(ν · (n/b)n2/p2(1−δ)).
Proof. Since b ≥ n/pδ , we assume without loss of generality
that b mod p1−δ = 0. We also note that since b ≥ n/pδ ,
z = (bpδ/n)(1−δ)/δ ≥ 1. We note that the dimensions of
A, U (0), and V (0) at any recursive step will always be less
than the dimension of the original matrix, n. Algorithm IV.1
assumes A, U (0), and V (0) are initially replicated. Since each
b × b block of these matrices is distributed cyclically and
since b mod q = 0 (q = p1−δ), the submatrix extraction
and concatenation done between recursive steps, can preserve
perfect load balance without communication. To satisfy initial
assumptions of the first invocation of Algorithm IV.1, we need
to replicate the A matrix. Since, by assumption, it is distributed
Algorithm IV.1 [B]← 2.5D-Full-to-Band(A,U (0), V (0),Π, b)
Require: Given nonnegative integers p, n,m, b and δ ∈ [1/2, 2/3],
z = (bpδ/n)(1−δ)/δ: Π is a grid of q × q × c processors where
q = p1−δ and c = p2δ−1 and b mod q = 1, A is an n-by-n
symmetric matrix, U (0) and V (0) are n-by-m matrices where
U (0) is trapezoidal (zero in top right upper b-by-b triangle) and
V (0) is dense, A (stored as a nonsymmetric matrix), U (0), and
V (0) are distributed cyclically over Π[:, :, k] for each k ∈ [1, c].
1: if n ≤ b then
2: Compute B = A+ U (0)V (0)
T
+ V (0)U (0)
T
and exit.
3: Subdivide A =
[
A11 A
T
21
A21 A22
]
where A11 is b-by-b
4: Subdivide U (0) =
[
U
(0)
1
U
(0)
2
]
and V (0) =
[
V
(0)
1
V
(0)
2
]
where U (0)1 and
V
(0)
1 are b-by-m
5: Compute
[
A¯11
A¯21
]
=
[
A11
A21
]
+ U (0)V
(0)
1
T
+ V (0)U
(0)
1
T
6: % Compute QR of matrix panel
7: [U1, T,R]← QR(A¯21,Π[:, 1 : z, :])
8: Compute W = A22U1 + U
(0)
2 (V
(0)
2
T
U1) + V
(0)
2 (U
(0)
2
T
U1)
9: Compute V1 = 12U1(T
T (UT (WT )))−WT
10: Replicate U1 and V1 so that they are distributed cyclically over
Π[:, :, k] for each k ∈ [1, c]
11: % Recursively reduce the trailing matrix to banded form
12: B2 = 2.5D-Full-to-Band(A22, [U
(0)
2 , U1], [V
(0)
2 , V1],Π, b)
13: B =
 A¯11 RT 0R
B20

Ensure: B is a symmetric n-by-n matrix with band-width b and the
same eigenvalues as A+ U (0)V (0)
T
+ V (0)U (0)
T
.
Fig. 1. A depiction of matrices used in Algorithm IV.1 for two subsequent
recursive steps.
over all processors initially, the replication can be done with
O(n2/q2) = O(n2/p2(1−δ)) horizontal communication cost.
At each recursive step, Algorithm IV.1 performs a QR
factorization, several matrix multiplications, and replicates U1
and V1. Each O(n) × b QR factorization is done using a
processor subgrid of dimensions p1−δ×z×p2δ−1 with a total
of zpδ = p(b/n)(1−δ)/δ processors (picked to minimize both
communication and synchronization) using Algorithm III.2.
By Theorem III.6 and the fact that z ≥ 1, it takes BSP time,
O
(
γ · n
1/δb3−1/δ
p
+ β · nb
pδ
+ α · b
n
pδ log2 p
)
,
using M = O
((
bδn1−δ
(zpδ)1−δ
)2)
= O
((n(b/n)(2δ−1)/δ
p1−δ
)2)
memory.
The two matrix multiplications on line 5 and the five
matrix multiplications on line 8 (done right to left), all
correspond to an O(n)×O(n) replicated matrix multiplied by
an O(n) × b rectangular matrix. By Lemma III.3, with w =
max(1, bp2−3δ/n), using M = O(n2/p2(1−δ) +nb/(wpδ)) =
O(n2/p2(1−δ)) memory, the time to compute these matrix
multiplications is, if U (0) and V (0) start in cache,
O
(
γ · n
2b
p
+ β · nb
pδ
+ α · w
)
.
In general (for any cache size), there is an additional cost
of O(ν · (n/b)n2
p2(1−δ) ). The memory usage needed for these matrix
multiplications is greater than that needed for the QR factoriza-
tions done by each set of processors. Since n1/δb3−1/δ < n2b
the computation cost of these matrix multiplications also
dominates that of the QR factorizations.
The matrix multiplications needed to compute line 9 from
right to left either operate on an O(n)× b matrix and a b× b
matrix, like W ·T , or result in a b×b matrix, like UT · (WT ).
By Lemma III.2 any matrix multiplication where two of the
matrix dimensions are b and one is O(n), with v = p2−3δ ,
takes BSP time,
O
(
γ · nb
2
p
+ β ·
[
nb
p
+
n2/3b4/3
pδ
]
+ α · p2−3δ log p
)
.
Since b ≤ n/ log p, the above communication cost is never
greater than that of the larger matrix multiplications, i.e.
n2/3b4/3/pδ ≤ nb/pδ . The synchronization cost of the QR
factorizations dominates that of of the matrix multiplications.
Replicating U1 and V1 over c subsets of q2 processors
(line 10) can be done in time, O
(
β · nb/p2(1−δ) + α).
Therefore, the cost over all n/b − 1 recursive steps when
all replicated matrices fit into cache (when H > 3n2/p2(1−δ))
is the total cost postulated in the theorem. In the second sce-
nario
(
when H < 3n2/p2(1−δ)
)
, the algorithm incurs an extra
additive factor of O
(
(n/b) wn
2
p2(1−δ)
)
in vertical communication
cost. The memory usage is dominated by the replicated matrix
multiplication (invocation of Lemma III.3 above), which is
also as stated in the theorem.
B. Band-to-Band Reduction
We now consider algorithms for reducing a banded matrix
to a smaller band-width, while preserving eigenvalues. We
start by recalling a parallel algorithm designed for small band-
widths [12], then present Algorithm IV.2, which is designed
to exploit additional parallelism given larger starting band-
widths. Algorithm IV.2 describes the QR factorizations and
applications necessary to reduce a symmetric banded matrix
A from band-width b to band-width h = b/k via bulge
chasing. The algorithm eliminates n/h trapezoidal panels via
QR factorization, each of which generate bulges of nonzeros
in the trailing matrix. Each bulge is subsequently chased
down the band by O(n/b) eliminations again done by QR
factorizations. Every new panel elimination is done imme-
diately after the previously generated bulge is chased twice
(including its initial panel elimination). Figure 2 depicts the
QR factorizations necessary to eliminate a trapezoidal panel
and chase two bulges generated from eliminating the first two
panels, which are done concurrently in the algorithm. This
type of pipelined successive band reduction approach was first
considered by [10], [11]. The CA-SBR algorithm in [12] is
similar, but assigns each processor a set of bulge chases at
each pipeline step, rather than performing each bulge chase
with a set of processors as done in Algorithm IV.2.
Lemma IV.2. An n×n symmetric matrix (input in any load-
balanced layout) of band-width b ≤ n/p can be reduced to one
with the same eigenvalues and band-width b/2, using M =
O(nb/p) memory, in BSP time,
O
(
γ · n
2b
p
+ β · nb+ ν · n
2
p
+ α · p
)
.
Proof. We consider the cost of one step of the CA-SBR
algorithm [12]. A redistribution from any initial layout costs
O(β · nb + α). The analysis in [12] shows that the cost
of reducing from bandwidth b to b/2 has the computation,
horizontal communication, and synchronization costs, as well
as the memory usage postulated in the lemma. The algorithm
consits of a bulge chase pipeline, executed in O(p) parallel
steps, in which each processor works on O(n/p) columns,
chasing O(n/(pb)) bulges O(n/(pb)) times, for a total of
O(n2/(p2b2)) bulge chases. Since each bulge chase consists of
a QR factorization and a matrix multiplication, with matrices
of size O(b) × O(b), by Lemma III.1 and Lemma III.4, the
vertical communication cost is O(ν · b2) for each bulge chase.
Summing the costs of the bulge chases over all parallel steps
yields the postulated total cost.
We now consider the cost of Algorithm IV.2. Its primary
innovation is to perform each QR factorization and update in
parallel using a subset of processors, leveraging both pipelined
parallelism across different bulge chases as well as parallelism
within a bulge chase.
Lemma IV.3. Algorithm IV.2 can reduce an n×n symmetric
matrix (input in any evenly-distributed layout) of band-width
b ≥ n/p to one with the same eigenvalues and band-width
b/k, using M = O((n1−δbδ/p1−δ)2) memory for any δ ∈
[1/2, 2/3] and any k ≤ 1 + p2−3δ , in BSP time,
O
(
γ · n
2b
p
+ β · n
1+δb1−δ
pδ
+ α · k
δn1−δpδ
b1−δ
log p
)
.
Proof. The cost of each inner loop iteration (loop on line 6)
can be derived from the costs of the matrix multiplications and
QR done inside it. Let the pair (i, j) correspond to the the ith
iteration of the outer loop and jth iteration of the inner loop.
Figure 2 displays the QR factorizations and updates computed
during a few such iterations. Each iteration computes a QR
factorization of a matrix with dimensions at most (b−h)×h,
B[Iqr.rs, Iqr.cs] on line 16 with p¯ = pb/(nk(1−δ)/δ) proces-
sors. The BSP time to compute such a QR factorization is by
Theorem III.6 for δ ∈ [1/2, 2/3],
O
(
γ · bh
2
p¯
+ β · b
δh2−δ
p¯δ
+ α · p¯δ log(p¯)
)
= O
(
γ · nb
2
k3−1/δp
+ β · n
δb2−δ
kpδ
+ α · kδ−1(pb/n)δ log p
)
Fig. 2. QR factorizations and updates in iterations (i, j) ∈
{(3, 1), (2, 3), (1, 5)} (left) and (i, j) ∈ {(3, 2), (2, 4), (1, 6)} (right) of
Algorithm IV.1 with k = 2. These two sets of iterations are executed
concurrently by processor groups Πˆ1, Πˆ3, and Πˆ5 (left) and Πˆ2, Πˆ4, and
Πˆ6 (right), respectively. Only the unique part of the trailing matrix update is
shown, while the pseudocode performs both symmetric reflections of it. Each
matrix V is labeled with the iteration in which it is computed.
The amount of memory needed for this QR factorization
is given in Lemma III.6 as M = O
(
(hδb1−δ/p¯1−δ)2
)
=
O((n1−δbδ/(p1−δk(2δ−1)/δ))2).
The matrix multiplications to form the V matrix are on lines
19 and 20, while those to perform the updates are on lines 21
and 22. The matrix multiplications on line 20 should be done
from right to left. We can then observe that the most costly
matrix multiplications in Algorithm IV.2 are B[Iup.cs, Iqr.rs]U
on line 19 and the updates UV T and V UT on lines 21 and
22. In the first case, a (3b − h) × (b − h) is multiplied by a
(b−h)×h matrix, while the update UV T involve (3b−h)×h
matrix multiplied by an h× (b− h) matrix (V UT is just the
transpose of the former). In both cases, by Lemma III.2 with
v = pˆ2−3δ/(k − 1) (we subtract one from k to make sure
v ≥ 1), the BSP time to compute the matrix multiplications
using pˆ processors is
O
(
γ · b
2h
pˆ
+ β · b
2
kpˆδ
+ α · pˆ
2−3δ
k
log p
)
= O
(
γ · nb
2
kp
+ β · n
δb2−δ
kpδ
+ α · (pb/n)
2−3δ
k
log p
)
,
with a memory footprint of M = O(b2/pˆ+(b2h/(vpˆ))2/3) =
O((b/pˆ1−δ)2) = O((n1−δbδ/p1−δ)2), which is greater than
the memory needed to perform the QR factorizations. The
other matrix multiplications have strictly lower cost and the
cost of redistributions necessary for all of these matrix mul-
tiplications is included in the horizontal communication cost
of Lemma III.2. As A and B are stored in load balanced
layouts, each processor subset can obtain the submatrix which
it factorizes and the submatrix which it updates at every
iteration with O(b2/pˆ) horizontal communication.
Thus, the overall cost for each iteration of Algorithm IV.2
is the sum of the two different costs above,
O
(
γ · nb
2
kp
+ β · n
δb2−δ
kp
+ α · kδ−1(pb/n)δ log p
)
.
For a given outer loop (line 4) iteration i, each j loop
iteration (line 6) is done by a different processor group. The
total number of inner loop iterations is roughly (n/h)(n/b)/2
and they are pipelined among n/b groups of processors, up
to n/(2b) of them working concurrently on different bulge
Algorithm IV.2 [B]← 2.5D-Band-to-Band(A,Π, b, k)
Require: Given positive integers b, p, n, k and h = b/k with n mod
b ≡ 0 and b mod k ≡ 0: A is a banded symmetric matrix of
dimension n with band-width b ≤ n, Πˆj ⊂ Π is the jth group
of pˆ ≡ pb/n processors for j ∈ [1, n/b].
1: Set B = A
2: Let B[(j − 1)b+ 1 : jb , (j − 1)b+ 1 : jb] be replicated in Πˆj
over (bp/n)2δ−1 subsets of (bp/n)2(1−δ) processors.
3: % Iterate over panels of B
4: for i ∈ [1, n/h− 1] do
5: % Πˆj applies chase j of bulge i as soon as Πˆj−1 executes
chase (j − 1)
6: for j = 1 : b(n− ih− 1)/bc do
7: % Define row and column offsets
8: Let oblg = (i− 1)h+ (j − 1)b, oqr.r = oblg + h
9: if j = 1 then oqr.c = oqr.r − h, ov = 0
10: else oqr.c = oqr.r − b, ov = b− h, oup.c = oqr.c + h
11: % Define index ranges needed for bulge chase
12: nr = min(n− oqr.r, b), nc = min(n− oup.c, h+ 3b)
13: Iqr.rs = oqr.r + (1 : nr), Iqr.cs = oqr.c + (1 : h)
14: Iv.rs = ov + (1 : nr), Iup.cs = oup.c + (1 : nc)
15: % Perform a rectangular parallel QR factorization
16: [U, T,R]← QR(B[Iqr.rs, Iqr.cs], Πˆj [1 : ph/n])
17: B[Iqr.rs, Iqr.cs] =
[
R
0
]
, B[Iqr.cs, Iqr.rs] =
[
R
0
]T
18: % Perform trailing matrix updates
19: W = B[Iup.cs, Iqr.rs]UT, V = −W
20: V [Iv.rs, :] = V [Iv.rs, :] + 12U(T
T (UTW [Iv.rs, :]))
21: B[Iqr.rs, Iup.cs] = B[Iqr.rs, Iup.cs] + UV
T
22: B[Iup.cs, Iqr.rs] = B[Iup.cs, Iqr.rs] + V U
T
Ensure: B is a banded matrix with band-width h and the same
eigenvalues as A
chases at any given time. Consequently, the algorithm can
be executed in O(n/h) phases, where at the ith phase,
min(i − 1, (n − ih)/(2b)) processor groups chase bulges
concurrently and the ith panel is eliminated. At each phase, a
synchronization and data exchange is required between the
QR factorization and trailing matrix updates computed by
adjacent active processor groups. Therefore, the BSP cost of
each recursive step of the algorithm corresponds to the cost
of computing O(n/h) = O(kn/b) inner loop iterations using
one processor group, which corresponds to the cost postulated
in the lemma.
C. Complete Symmetric Eigensolver
Algorithm IV.3 combines our algorithms for full-to-band
reduction (Algorithm IV.1) with multiple subsequent stages
of band-to-band reduction (Algorithm IV.2) and band-halving
steps of the CA-SBR algorithm from [12], which we refer to as
CA-BR. Algorithm IV.1 reduces the symmetric matrix to one
with band-width at most n/ log p. Algorithm IV.2 is then used
to successively half the band-width to n/pδ . Subsequently, the
CA-BR algorithm (same function signature as 2.5D-Band-to-
Band) is used to reduce the band-width to n/p. At that point,
the matrix is small enough for one processor to compute the
eigenvalues efficiently.
For every 2.5D-Band-to-Band step that reduces the band-
width by a factor of k, Algorithm IV.2 reduces the number of
Algorithm IV.3 [D]← 2.5D-Symmetric-Eigensolver(A,Π)
Require: Given positive integers p, n, and δ ∈ [1/2, 2/3] with
n mod b ≡ 0, A is a symmetric matrix of dimension n.
1: Let b = n
max(p2−3δ,log p) , k = 2, and ζ = (1− δ)/δ
2: Set B = A
3: Execute B = 2.5D-Full-to-Band(A, {}, {},Π, b)
4: for i = 0 : log2(bp
δ/n)− 1 do
5: Let Π¯ = Π[1 : p/kiζ ]
6: Gather B onto Π¯
7: Execute B = 2.5D-Band-to-Band(B, b/ki, Π¯, k)
8: Let Π¯ = Π[1 : pδ]
9: for i = 0 : log2(p
1−δ)− 1 do
10: Execute B = CA-BR(B,n/(pδki), Π¯, k)
11: Gather B onto a processor and compute its eigenvalues D
Ensure: D is a vector containing the eigenvalues of A
processors used by kζ where ζ = (1−δ)/δ. The parameter ζ is
chosen to be (1−δ)/δ in order to keep the per-stage horizontal
cost term O(nb/pδ) from increasing at each recursive step,
since n(b/k)/(p/kζ)δ = nb/pδ . Decreasing the number of
active processors in this way also keeps the synchronization
cost equal at every stage. Overall, we now obtain a parallel
algorithm that has horizontal communication of O(n2/pδ),
vertical communication of O(n2 log p/pδ), and O(pδ log2 p)
synchronizations. Modulo logarithmic cost factors in vertical
communication and synchronization, this amounts to the same
communication cost as the best known algorithms for LU and
QR factorization [5]–[7].
Theorem IV.4. Algorithm IV.3 computes the eigenvalues of
a symmetric n-by-n matrix (input in any evenly-distributed
layout), using M = O(n2/p2(1−δ)) memory for any δ ∈
[1/2, 2/3], in BSP time,
O
(
γ · n
3
p
+ β · n
2
pδ
+ ν · n
2 log p
pδ
+ α · pδ log2 p
)
.
Proof. The cost of the gather/redistribution of B onto Π¯ is
dominated by the subsequent 2.5D-Band-to-Band invocation.
The cost of computing the eigenvalues of B sequentially at
the end is O(γ ·n3/p+β ·n2/p+α), since the band-width is
n/p [12]. We employ Lemma IV.1 with b = n
max(p2−3δ,log p)
to obtain the cost of 2.5D-Full-to-Band. The computation,
horizontal communication, and synchronization costs are the
same for the call to 2.5D-Full-to-Band as the overall costs
postulated in Theorem IV.4. The vertical communication cost
term incurred for small cache sizes, O(ν · (n/b)n2/p2(1−δ)) is
bounded by O(ν·[n2/pδ+n2 log p/p2/3]) = O(ν·n2 log p/pδ).
We now consider the memory footprint and cost of the
invocations of 2.5D-Band-to-Band. By Lemma IV.3 with k =
2, the memory usage is M = O((n1−δ b¯δ/p¯1−δ)2), where
b¯ = b/ki where p¯ = p/kiζ at iteration i. We observe that
(n1−δ b¯δ/p¯1−δ)2 = O(n2/p2(1−δ)) for all iterations i, because
at each subsequent iteration b¯ decreases by k while p¯ decreases
by kζ , and so b¯δ/p¯1−δ ≤ bδ/p1−δ ≤ nδ/p1−δ for all i, since
k(1−δ)ζ/kδ = k(1−δ)
2/δ/kδ ≤ k1−δ/kδ ≤ 1. The cost of each
band reduction with starting band-width b¯ and p¯ processors is
by Lemma IV.3 with k = 2,
O
(
γ · n
2b¯
p¯
+ β · n
1+δ b¯1−δ
p¯δ
+ α · n
1−δ p¯δ
b¯1−δ
log p
)
.
Algorithm W (β) Q (ν) S (α)
ScaLAPACK [15] n2/
√
p n3/p n log p
ELPA [37] n2/
√
p - n log p
CA-SBR [12] n2/
√
p n2 logn/
√
p
√
p(log2 p+ logn)
Theorem IV.4 n2/pδ n2 log p/pδ pδ log2 p
TABLE I
ASYMPTOTIC COMMUNICATION COSTS FOR COMPUTING EIGENVALUES
(WITH δ ∈ [1/2, 2/3]). ALL VARIANTS REQUIRE O(n3/p) COMPUTATION.
The computation cost clearly decreases with each itera-
tion i. The horizontal communication cost is O(nb/pδ) =
O(n2/(pδ log p)) (since b ≤ n/ log p) at each iteration, since
b¯1−δ
p¯δ
=
(b/ki)1−δ
(p/kiζ)δ
=
b1−δ
pδ
.
Therefore, over all O(log p) iterations, the bandwidth cost of
the SBR invocations is O(n2/pδ). Finally, the synchronization
cost is n
1−δ p¯δ
b¯1−δ log p = O(p
δ log p) at each iteration, since
p¯δ/b¯1−δ = pδ/b1−δ . Thus, the overall synchronization cost
is bounded by the cost postulated in the theorem.
The time to execute CA-BR using pδ processors start-
ing from band-width n/pδ and reducing it to n/p is via
Lemma IV.2, O(γ · n3
p2δ
+β · n2
pδ
+ν · n2 log p
pδ
+α ·pδ log p).
A disadvantage of this multi-stage approach arises when
eigenvectors are required in addition to eigenvalues. The cost
of the back-transformations scales linearly with the number
of band-reduction stages (each stage requires O(n2) memory
and O(n3) computation). We leave the consideration of eigen-
vector construction for future work. To reduce the number of
band-reduction stages when δ < 2/3, one can use k = p2−3δ
with each invocation of 2.5D-Band-to-Band, but this results
in a greater synchronization cost. It may also be possible to
improve the costs of the 2.5D-Band-to-Band algorithm, by
using aggregation as in the 2.5D-Full-to-Band algorithm.
V. CONCLUSION
Table I provides a comparison of communication and syn-
chronization costs to previous work. Our new direct method for
computing the eigenvalues of a symmetric matrix, performs up
to p1/6 less horizontal communication than alternatives. The
vertical communication cost (Q) for ScaLAPACK assumes
H < n2/p and arises from the matrix-vector multiplications
computing V for each column. For CA-SBR, Q is inferred
from Lemma IV.2. For ELPA, we assume the full-to-band
step reduces to band-width b =
√
H , in which case either
(when
√
H > n/p) the banded matrix fits in cache, or
ν ·Q = O(ν · [n3/(pb) + nb2]) = O(γ · F/√H) [37].
The new 2.5D-Symmetric-Eigensolver algorithm trades off
a variable amount of extra work, synchronization, and memory
usage for a lower communication cost. Implementations of the
algorithms in this paper permit optimizations such as
• alternating between left-looking partial updates and com-
plete trailing matrix updates in Algorithm IV.1,
• smaller bulge width in Algorithm IV.2 to increase paral-
lelism in the bulge chase pipeline,
• lookahead [38], [39] (overlapping QR with updates).
Our analysis shows that a carefully parameterized col-
lage of parallel algorithms and optimizations yields asymp-
totic cost improvements with minimal overhead. We combine
approaches (2.5D algorithms, aggregation, successive band
reduction) that have been successful on modern architec-
tures [13], [26], [40], so our innovations should pave the
path for practical improvements in scalability of applications
computing singular values or eigenvalues of matrices.
REFERENCES
[1] I. S. Dhillon, B. N. Parlett, and C. Vo¨mel, “The design and implemen-
tation of the MRRR algorithm,” ACM Transactions on Mathematical
Software, vol. 32, no. 4, pp. 533–560, Dec. 2006.
[2] D. R. Hartree, “The wave mechanics of an atom with a non-coulomb
central field. Part I. Theory and methods,” Mathematical Proceedings of
the Cambridge Philosophical Society, vol. 24, pp. 89–110, 1 1928.
[3] V. Fock, “Na¨herungsmethode zur Lo¨sung des quantenmechanischen
Mehrko¨rperproblems,” Zeitschrift fu¨r Physik, vol. 61, no. 1-2, pp. 126–
148, 1930. [Online]. Available: http://dx.doi.org/10.1007/BF01340294
[4] L. G. Valiant, “A bridging model for parallel computation,” Communi-
cations of the ACM, vol. 33, no. 8, pp. 103–111, 1990.
[5] A. Aggarwal, A. K. Chandra, and M. Snir, “Communication complexity
of PRAMs,” Theoretical Computer Science, vol. 71, no. 1, pp. 3 – 28,
1990.
[6] A. Tiskin, “Communication-efficient parallel generic pairwise elimina-
tion,” Future Generation Computer Systems, vol. 23, no. 2, pp. 179 –
188, 2007.
[7] E. Solomonik and J. Demmel, “Communication-optimal parallel 2.5D
matrix multiplication and LU factorization algorithms,” in Euro-Par
2011 Parallel Processing, ser. Lecture Notes in Computer Science.
Springer Berlin Heidelberg, 2011, vol. 6853, pp. 90–109. [Online].
Available: http://dx.doi.org/10.1007/978-3-642-23397-5 10
[8] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz, “Minimizing
communication in numerical linear algebra,” SIAM Journal on Matrix
Analysis and Applications, vol. 32, no. 3, pp. 866–901, 2011.
[9] E. Solomonik, E. Carson, N. Knight, and J. Demmel, “Tradeoffs
between synchronization, communication, and computation in parallel
linear algebra computations,” in Proceedings of the 26th ACM
Symposium on Parallelism in Algorithms and Architectures, ser.
SPAA ’14. ACM, 2014, pp. 307–318. [Online]. Available: http:
//doi.acm.org/10.1145/2612669.2612671
[10] C. Bischof, B. Lang, and X. Sun, “A Framework for Symmetric Band
Reduction,” ACM Transactions on Mathematical Software, vol. 26, no. 4,
pp. 581–601, Dec 2000.
[11] ——, “Algorithm 807: The SBR Toolbox – Software Successive Band
Reduction,” ACM Transactions on Mathematical Software, vol. 26, no. 4,
pp. 602–616, Dec 2000.
[12] G. Ballard, J. Demmel, and N. Knight, “Avoiding communication in
successive band reduction,” ACM Transactions on Parallel Computing,
vol. 1, no. 2, pp. 11:1–11:37, Feb. 2015. [Online]. Available:
http://doi.acm.org/10.1145/2686877
[13] T. Auckenthaler, H.-J. Bungartz, T. Huckle, L. Kra¨mer, B. Lang, and
P. Willems, “Developing algorithms and software for the parallel solution
of the symmetric eigenvalue problem,” Journal of Computational Sci-
ence, vol. 2, no. 3, pp. 272 – 278, 2011, social Computational Systems.
[14] A. Haidar, H. Ltaief, and J. Dongarra, “Parallel reduction to condensed
forms for symmetric eigenvalue problems using aggregated fine-grained
and memory-aware kernels,” in Proceedings of 2011 International
Conference for High Performance Computing, Networking, Storage and
Analysis, ser. SC ’11. New York, NY, USA: ACM, 2011, pp. 8:1–8:11.
[Online]. Available: http://doi.acm.org/10.1145/2063384.2063394
[15] L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Dem-
mel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Pe-
titet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users’
Guide. Philadelphia, PA, USA: SIAM, May 1997, also available from
http://www.netlib.org/scalapack/.
[16] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, “Communication-
optimal parallel and sequential QR and LU factorizations,” SIAM Jour-
nal on Scientific Computing, vol. 34, no. 1, pp. A206–A239, 2012.
[17] H. Jia-Wei and H. T. Kung, “I/O complexity: The red-blue pebble game,”
in Proceedings of the thirteenth annual ACM symposium on Theory of
computing, ser. STOC ’81. New York, NY, USA: ACM, 1981, pp.
326–333.
[18] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, “Cache-
oblivious algorithms,” in Proceedings of the 40th Annual Symposium on
Foundations of Computer Science, ser. FOCS ’99. Washington, DC,
USA: IEEE Computer Society, 1999, p. 285.
[19] W. F. McColl and A. Tiskin, “Memory-efficient matrix multiplication in
the BSP model,” Algorithmica, vol. 24, pp. 287–297, 1999.
[20] E. Dekel, D. Nassimi, and S. Sahni, “Parallel matrix and graph algo-
rithms,” SIAM Journal on Computing, vol. 10, no. 4, pp. 657–675, 1981.
[21] R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar,
“A three-dimensional approach to parallel matrix multiplication,” IBM
Journal of Research and Development, vol. 39, pp. 575–582, September
1995.
[22] J. Berntsen, “Communication efficient matrix multiplication on hyper-
cubes,” Parallel Computing, vol. 12, no. 3, pp. 335–342, 1989.
[23] S. L. Johnsson, “Minimizing the communication time for matrix mul-
tiplication on multiprocessors,” Parallel Computing, vol. 19, pp. 1235–
1257, November 1993.
[24] J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and
O. Spillinger, “Communication-optimal parallel recursive rectangular
matrix multiplication,” in Proceedings of the 27th IEEE International
Symposium on Parallel and Distributed Processing, ser. IPDPS ’13, May
2013, pp. 261–272.
[25] R. A. Van De Geijn and J. Watts, “SUMMA: Scalable Universal Ma-
trix Multiplication Algorithm,” Concurrency: Practice and Experience,
vol. 9, no. 4, pp. 255–274, 1997.
[26] G. Ballard, J. Demmel, L. Grigori, M. Jacquelin, H. D. Nguyen, and
E. Solomonik, “Reconstructing Householder vectors from tall-skinny
QR,” in Proceedings of the 28th IEEE International Symposium on
Parallel and Distributed Processing, ser. IPDPS ’14, May 2014, pp.
1159–1170.
[27] G. H. Golub, R. J. Plemmons, and A. Sameh, Parallel block schemes
for large-scale least-squares computations. University of Illinois Press,
1986.
[28] B. C. Gunter and R. A. Van De Geijn, “Parallel out-of-core
computation and updating of the QR factorization,” ACM Transactions
on Mathematical Software, vol. 31, no. 1, pp. 60–78, Mar. 2005.
[Online]. Available: http://doi.acm.org/10.1145/1055531.1055534
[29] R. D. da Cunha, D. Becker, and J. C. Patterson, “New parallel (rank-
revealing) QR factorization algorithms,” in Euro-Par 2002 Parallel
Processing. Springer, 2002, pp. 677–686.
[30] E. Elmroth and F. Gustavson, “New serial and parallel recursive QR
factorization algorithms for SMP systems,” in Applied Parallel Comput-
ing. Large Scale Scientific and Industrial Problems., ser. Lecture Notes
in Computer Science, B. K. et al., Ed. Springer, 1998, vol. 1541, pp.
120–128.
[31] E. Solomonik, “Provably efficient algorithms for numerical tensor alge-
bra,” Ph.D. dissertation, University of California, Berkeley, 2014.
[32] A. Tiskin, “Bulk-synchronous parallel Gaussian elimination,” Journal
of Mathematical Sciences, vol. 108, pp. 977–991, 2002. [Online].
Available: http://dx.doi.org/10.1023/A%3A1013588221172
[33] J. J. Dongarra, D. C. Sorensen, and S. J. Hammarling, “Block reduction
of matrices to condensed forms for eigenvalue computations,” Journal
of Computational and Applied Mathematics, vol. 27, no. 1, pp. 215–227,
1989.
[34] J. J. Dongarra and R. A. van de Geijn, “Reduction to condensed form for
the eigenvalue problem on distributed memory architectures,” Parallel
Computing, vol. 18, no. 9, pp. 973 – 982, 1992. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/016781919290011U
[35] T. Joffrain, T. M. Low, E. S. Quintana-Ortı´, R. v. d. Geijn, and F. G. V.
Zee, “Accumulating Householder transformations, revisited,” ACM
Transactions on Mathematical Software, vol. 32, no. 2, pp. 169–179, Jun.
2006. [Online]. Available: http://doi.acm.org/10.1145/1141885.1141886
[36] B. Lang, “A parallel algorithm for reducing symmetric banded
matrices to tridiagonal form,” SIAM Journal on Scientific Computing,
vol. 14, no. 6, pp. 1320–1338, 1993. [Online]. Available: http:
//dx.doi.org/10.1137/0914078
[37] T. Auckenthaler, “Highly scalable eigensolvers for petaflop applica-
tions,” Ph.D. dissertation, Universita¨t Mu¨nchen, 2012.
[38] R. C. Agarwal and F. G. Gustavson, “A parallel implementation of matrix
multiplication and LU factorization on the IBM 3090,” in Proceedings
of the IFIP WG, vol. 2, 1988, pp. 217–221.
[39] P. Strazdins, “A comparison of lookahead and algorithmic blocking tech-
niques for parallel matrix factorization,” International Journal Parallel
and Distributed Systems and Networks, vol. 4, no. 1, pp. 26–35, 2001.
[40] E. Solomonik, A. Bhatele, and J. Demmel, “Improving communication
performance in dense linear algebra via topology aware collectives,” in
Proceedings of 2011 International Conference for High Performance
Computing, Networking, Storage and Analysis, ser. SC ’11. New
York, NY, USA: ACM, 2011, pp. 77:1–77:11. [Online]. Available:
http://doi.acm.org/10.1145/2063384.2063487
