Applying OOC Techniques in the Reduction to Condensed Form for Very Large Symmetric Eigenproblems on GPUs by Davidović, Davor & Quintana-Ortí, Enrique S.
Applying OOC Techniques in the
Reduction to Condensed Form for
Very Large Symmetric Eigenproblems on GPUs
Davor Davidovic´
Rudjer Bosˇkovic´ Institute
Centre for Informatics and Computing
Bijenicˇka street 54, 10000–Zagreb, Croatia
davor.davidovic@irb.hr
Enrique S. Quintana-Ortı´
Universitat Jaume I
Depto. de Ingenierı´a y Ciencia de Computadores
Av. Vincent Sos Baynat, 12.071–Castello´n, Spain
quintana@icc.uji.es
Abstract
In this paper we address the reduction of a dense matrix
to tridiagonal form for the solution of symmetric eigenvalue
problems on a graphics processor (GPU) when the data is
too large to fit into the accelerator memory. We apply out-
of-core techniques to a three-stage algorithm, carefully re-
designing the first stage to reduce the number of data trans-
fers between the CPU and GPU memory spaces, maintain
the memory requirements on the GPU within limits, and en-
sure high performance by featuring a high ratio between
computation and communication.
1 Introduction
A few applications arising, e.g., in molecular dynamics,
computational quantum chemistry, finite element modeling
and multivariate statistics, require the solution of large-scale
dense eigenproblems of the form:
AX = ΛX, (1)
where A ∈ Rn×n is symmetric, Λ ∈ Rn×n is a diagonal
matrix with the sought-after eigenvalues, and X ∈ Rn×n
contains the associated eigenvectors [5]. When a significant
fraction of the eigenvalues are desired, efficient and numer-
ically stable solvers first reduce A to tridiagonal form,
QTAQ → T, (2)
where T ∈ Rn×n is tridiagonal and Q ∈ Rn×n stands
for the applied orthogonal transforms [5], to then obtain
the eigenvalues/eigenvectors of T using, e.g. the MR3 al-
gorithm [4]. In case the eigenvectors of the original matrix
are also desired, a back-transform is necessary to recover
them from those of T .
In this paper we focus on the reduction to tridiagonal
form via orthogonal transforms. There exist basically two
alternatives to perform this reduction. The one-stage ap-
proach computes a sequence of n− 2 Householder (orthog-
onal) transforms [5] which annihilate the entries below the
first sub-diagonal. Overall the algorithm requires 4n3/3
flops but, unfortunately, half of them are cast in terms of the
symmetric matrix-vector product, a memory-bounded oper-
ation which renders the global algorithm quite inefficient on
current general-purpose architectures. The memory bottle-
neck is partially alleviated in [7] where the authors leverage
the higher memory bandwidth of the GPU to report an im-
portant increase in the performance of this algorithm.
The multi-stage approaches [2] diminish the number of
level 2 BLAS flops in the reduction to tridiagonal form in
exchange for an increment in the computational cost. In
particular, these algorithms first compute
QT1 AQ1 → B1, (3)
where B1 ∈ Rn×n is a matrix of bandwidth w1 and Q1 ∈
R
n×n collects the corresponding orthogonal transforms [5].
In the two-stage variant, this band matrix is then reduced to
tridiagonal form
QT2 B1Q2 → T, (4)
so that Q = Q1Q2 ∈ Rn×n yields the reduction/orthogonal
transform in (2). Alternatively, a truly multi-stage algorithm
can be employed to successively transform A into a series
of matrices of narrower band, w1 > w2 > w3, . . . , wm−1.
In [1] the authors show the potential of the two-stage algo-
rithm to leverage the data-parallel architecture of a GPU.
In [6] the same approach is show to outperform the one-
stage procedure on multi-core processors as well.
All previous work for the reduction of a symmetric ma-
trix to tridiagonal form on a hybrid CPU-GPU platform
assume that the matrix fits into the memory of the GPU.
However, the storage capacity of as-of-today GPUs is rel-
atively small, typically between 1.5 and 3 Gbytes, so that
the dimension of the matrices that they can hold is lim-
ited (roughly n ≈ 13,500-19,000 for real double-precision
data). For very large problems, as those appearing in molec-
ular dynamics or computational quantum chemistry, this is
clearly insufficient since, in these applications, n can be as
large as 300,000. The main contribution of this paper is a re-
formulation of the multi-stage tridiagonalization algorithm
that is not constrained by the dimension of the memory of
the GPU. In particular, we maintain the whole matrix A in
the main memory of the system, and carefully orchestrate
the data transfers with the device (GPU) to hide the latency
of the (PCI-Express) interconnect between the two memory
spaces. As a result, the limitation on the dimension of the
problems that can be solved using hardware accelerators is
overcome while delivering high performance. Our solution
can be viewed as the result of applying Out-of-Core (OOC)
techniques to deliver sustainable (scalable) performance in
the reduction to tridiagonal form, where the (in-)core mem-
ory corresponds to the device storage while the secondary
(out-of-core) storage is the main memory. In this article we
refer to the scalability as the ability to handle growing ma-
trix sizes on the fixed architecture.
In our approach, we choose the bandwidth w1 so that
the whole matrix B1 in (4), when saved in packed storage
using a w1×n array, fits into the GPU memory. In the prac-
tical applications that we target, n ranges between 100,000
and 300,000 so that w1 can still be chosen large enough to
hide the interconnect latency. Thus, the application of OOC
techniques is only necessary in the first stage.
The paper is structured as follows. In the next section
we review the algorithm in the SBR toolbox [2] for the or-
thogonal reduction of a dense symmetric matrix to band
form which is the basis for the software that we have de-
veloped. In Section 3 we introduce hybrid CPU-GPU al-
gorithms for the computation of the QR factorization of a
rectangular panel and the update of a symmetric matrix in-
volved in the reduction to band form. The hybrid reduc-
tion of the band matrix to narrower band form is addressed
next, in Section 4. These codes are evaluated on a system
equipped with an Intel-based multi-core processor and an
NVIDIA “Fermi” GPU in Section 5. A few remarks close
the paper in Section 6.
2 The SBR Toolbox
SBR (Successive Band Reduction) is a software pack-
age for the reduction of dense symmetric matrices to band
form (SYRDB) and the reduction of band matrices to nar-
rower band (SBRDB) or tridiagonal form (SBRDT). Accu-
mulation of the orthogonal transformations and repacking
routines for storage rearrangement are also provided in the
w j
j
0
b
k
w−b k=n−(j+w)+1
0 1 2A A A
Figure 1. Partitioning of the matrix during one
iteration of routine SYRDB.
toolbox. We next describe routine SYRDB for the reduc-
tion of a dense matrix to band form that performs the initial
transformation in (3).
Consider that the first j − 1 columns of the matrix A
have been already reduced to band form with bandwidth
w. Let b denote the algorithmic block size, and assume for
simplicity that j + w + b − 1 ≤ n, and n, w are integer
multiples of w, b, respectively; see Figure 1. Then, during
the current iteration of routine SYRDB, b new columns of
the band matrix are computed as follows:
1. Compute the QR factorization of A0 ∈ Rk×b, k =
n− (j + w) + 1:
A0 = Q0R0, (5)
where R0 ∈ Rb×b is upper triangular and the orthog-
onal factor Q0 is implicitly stored as a sequence of
b Householder vectors. The cost of this first step is
2b2(k − b/3) flops.
2. Construct the factors of the compact WY representa-
tion [5] of the orthogonal matrix Q0 = Ik + WY T ,
with W, Y ∈ Rk×b. The cost of this step is kb2 flops.
3. Apply the orthogonal matrix to A1 ∈ Rk×w−b:
A1 := Q
T
0 A1 = A1 + Y (W
TA1). (6)
The cost of this step becomes 4kb(w − b) flops. In
case the bandwidth equals the block size (w = b), A1
comprises no columns and, therefore, no operation is
performed in this step.
4. Apply the orthogonal matrix to A2 ∈ Rk×k:
A2 := Q
T
0 A2Q0
= A2 + YW
TA2 +A2WY
T + YWTA2WY
T .
(7)
During this step only the lower (or the upper) triangu-
lar part of A2 is updated as follows:
(SYMM) X1 := A2W, (8)
(GEMM) X2 := X
T
1 W/2, (9)
(GEMM) X3 := X1 + Y X2, (10)
(SYR2K) A2 := A2 +X3Y T + Y XT3 . (11)
The major cost of this step is in the computation of the
symmetric matrix product (8) and the symmetric rank-
2k update (11), each with a cost of 2k2b flops. The
global cost of this step is 4k2b+4kb2, which is higher
than those of the Steps 1–3.
Provided b and w are both small compared with n, the
global cost of the reduction of a full matrix to band form is
4n3/3 flops. The bulk of the computation is cast in terms of
the BLAS-3 operations in (8) and (11), so that high perfor-
mance can be expected from routine SYRDB in case a tuned
implementation of BLAS is used.
3 OOC Reduction to Band Form
Our algorithm for the OOC reduction to band form on a
hybrid CPU-GPU platform is based on the SBR reduction
to band form. To achieve high utilization and performance
in the OOC algorithm, we have to operate with wide factors
W and Y . As the column-size of these panels is upper-
bounded by the algorithmic block size b, we set the w = b
so that our algorithm only consists of steps 1, 2 and 4 as in
Section 2.
Algorithm 1 illustrates the two phases of the reduction
procedure. The first phase (line 2) performs a blocked QR
decomposition of A˜ = A(j + w : n, j : j + w − 1) while
simultaneously constructing the factors W and Y (steps 1
and 2). The second phase (line 3) computes the two-sided
update Aˆ = A(j + w : n, j + w : n) := QT Aˆ Q, with Q =
I +WY T (step 4).
Algorithm 1 Reduction to band form
Input: Real symmetric matrix A ∈ Rn×n, bandwidth w
Output: A overwritten with the resulting band-matrix
1: for j := 1, j < n, j := j + w do
2: QR decomposition of A˜ →W,Y
3: Two-sided update: Aˆ := (I+WY T )T Aˆ(I+WY T )
4: end for
Subsection 3.1 introduces a hybrid algorithm where CPU
and GPU collaborate to obtain the QR factorization of A˜
and the construction of the correspondingW and Y . Given
our premises, we can safely assume that A˜ fits into the
GPU storage so that this algorithm operates in-core. Sub-
section 3.2 describes the hybrid OOC update of the trailing
k × k block Aˆ.
In our implementations we do not apply overlapping be-
tween computation and copying. Our research aims at in-
creasing the ratio between these two factors by reducing
the number of transfers and at the same time increasing the
amount of computation per GPU kernel. The overlapping
of copying with the computation is among future research.
3.1 Hybrid (in-core) QR decomposition
Consider the computation of the QR factorization of the
k × w matrix A˜, k = n − (j + w) + 1, and the construc-
tion of the corresponding factors of the same dimension.
For clarity, in the following discussion (d)Wi stands for
(d)W (i : k, i : i + b˜ − 1) and (d)Yi for (d)Y (i : k, i : i +
b˜ − 1), where b˜ is the algorithmic block dimension for this
phase. The prefix “d” identifies matrices that are stored in
the GPU memory; all operations on these matrices imply
the execution on the GPU without further explicit mention
of it. For simplicity, we also assume that w is an exact mul-
tiple of b˜.
Our implementation of the hybrid QR decomposition is
illustrated in Algorithm 2. The procedure operates on col-
umn blocks of width b˜. Consider that the first i− 1 columns
of A˜ have already been factorized. During the current itera-
tion, we copy the first b˜ columns starting from the dA˜(1, i)
of the GPU to the CPU (line 3), obtain QR factorization on
the CPU (line 4), and generate the factors Wi and Yi there
(line 5). Then,Wi and Yi are transferred to the GPU (line 7)
and, to complete the iteration, sub-matrix dA˜(i : k, i+ b˜ : k)
is updated on the GPU from the left (line 8), and the previ-
ous transforms are applied to the current factor dWi (line
9). At the end of the for-loop the updated A˜ and Y reside
on both the CPU and the GPU, while dW is stored on the
GPU and has to be copied back to the CPU. The cost of this
algorithm is 4(k2w − kw2 + w3/3) flops.
We have implemented two versions of this algorithm,
which differ in the number of data transfers between GPU
and CPU and the amount of workspace needed in the GPU
memory.
3.1.1 Variant QR-1
This version requires space on the GPU to hold two ma-
trices of size k × w (with k ≤ n), for dA˜ and dW , and
an additional workspace dZ of size w × w. At each itera-
tion, the factor dYi is stored overwriting the entries of dAi
since, once dAi is copied to A˜i at the beginning of each
iteration, it is not referenced anymore. The update of the
Algorithm 2 Hybrid QR decomposition
Input: Real matrix A˜ ∈ Rk×w , block size b˜
Output: A˜ overwritten with factor R. W,Y ∈ Rk×w
1: Copy A˜→ dA
2: for i := 1, i < w, i := i+ b˜ do
3: Copy dA( : , i : i+ b˜− 1) → A˜( : , i : i+ b˜− 1)
4: (GEQRL) Compute QR factorization of block A˜i
5: (GEWYG) Construct Wi and Yi
6: W (1 : i − 1, i : i+ b˜− 1) := 0
Y (1 : i− 1, i : i+ b˜− 1) := 0
7: Copy Wi → dWi, and Yi → dYi
8: dA(i : k, i+ b˜ : w) :=
(I + dWidY
T
i )
TdA(i : k, i+ b˜ : w)
9: dWi := dWi + dW (i : k, 1: i− 1)×
dY (i : k, 1: i− 1)TdWi
10: end for
11: Copy dW →W
rest of dA (line 8) is split into two parts, with each one be-
ing performed via a single invocation of the GEMM routine
(matrix-matrix multiplication). Specifically, dZ is used to
temporarily hold the partial result of the first matrix product
dZ := dWTi dA(i : k, i + b˜ : w) while, in the second prod-
uct, dA(i : k, i + b˜ : w) := dA(i : k, i + b˜ : w) + dY dZ .
The application of the previous transforms to the current
Wi can also be divided into two parts, being obtained with
two calls of the GEMM routine as both dW (i : k, 1: i − 1)
and dY (i : k, 1: i− 1) reside on the GPU.
The advantage of this version is that it performs a re-
duced number of data transfers to/from the GPU. However,
it requires the full dA˜ and dW to reside on the GPU so that,
as the matrix dimension n grows, the bandwidthw has to be
conformally reduced so that they fit into the GPU memory.
For very large matrices, this results in operating with long,
narrow column blocks of A˜ as well as narrow factors Wi
and Yi, which may decrease the ratio between computation
and transfers, and impair overall performance.
3.1.2 Variant QR-2
This version requires less storage space on the GPU: a
workspace of size k × w for dA˜, a panel of size k × b˜ for
a single column block of dY , and an additional workspace
dZ of size w×w. The factor dWi is now stored overwriting
dA once the entries of this panel have been copied to A˜i (in
the previous version we did this for dYi).
The update of the rest of the matrix dA˜ (line 8) is per-
formed in the same way as in the previous version using dZ
as a temporary workspace. As we do not hold the full factor
Y in the GPU memory in this variant, the application of the
previous transformations to the currentWi differs from that
of the previous version. In particular, this computation is
split into the following two parts:
dZ := dY (i : k, 1: i− 1)T dWi, (12)
dWi := dWi + dW (i : k, 1: i− 1) dZ, (13)
where dW is stored in dA(1 : k, 1: i + b˜ − 1). The update
(12) is performed by copying one-by-one the column blocks
Y (i : k, p : p+ b˜− 1) to dY and then computing dZ(p : p+
b˜− 1, : ) := dY T dWi, for p = 1, b˜+ 1, 2b˜+ 1, . . . , i− b˜.
Once all blocks of Yj have been processed and the full dZ
is computed, the update in (13) can be performed by single
call of GEMM.
The advantage of this approach is that it requires less
memory on the GPU and, therefore, the bandwidth (i.e. col-
umn width of matrix A˜ and the factors) can be larger. How-
ever, the workspace on the GPU is still proportional to num-
ber of rows in the panel k and, as the problem size grows,
the bandwidth w has to be conformally reduced. Neverthe-
less, for our test cases, this version permits the solution of
symmetric eigenproblems in the desired range, with n up to
300,000 (provided it fits into the CPU memory). Compared
with the previous version, the drawback of the Variant QR-
2 is that it requires more transfers to/from the GPU as, for
each p, one panel of size (k − i)× b˜ is copied to the GPU,
which results in additional overhead.
3.2 Hybrid OOC two-sided update
The OOC algorithm for the two-sided update of the sym-
metric trailing matrix Aˆ = A(i + w : n, i + w : n) imple-
ments step 4 from Section 2. The update is divided into 4
operations, see (8)–(11), and each one of these has to be re-
implemented to work with out-of-core data (i.e., matrices
that reside out of the GPU memory).
Consider a partitioning of matrix Aˆ into square blocks of
size bˆ× bˆ, with the the algorithmic block size for this phase,
bˆ, being fixed to be small enough so that a “few” blocks fit
into the GPU memory (check in Subsections 3.2.1 and 3.2.2
for the exact number). For simplicity, we introduce Aˆip as
substitution for the block Aˆ(i : i+ bˆ− 1, p : p+ bˆ− 1), Wi
and Yi for W (i : i+ bˆ− 1, 1: w) and Y (i : i+ bˆ− 1, 1: w)
respectively, and Xi for X(i : i+ bˆ− 1, 1: w).
We have implemented two versions of the OOC two-
sided update matrix which differ in the requirements of stor-
age in the GPU. This affects the performance of the global
algorithm, especially when the size of the matrix grows.
3.2.1 Variant Update-1
This version requires two panels of size k×w, and one block
of size bˆ× bˆ. The block size bˆ is calculated at the beginning
of the algorithm so that these data fit into the GPU memory.
Consider first the computation ofX1 in equation (8). The
OOC implementation for this operation is given in Algo-
rithm 3. Note that, because of the symmetry of Aˆ, only
the lower triangle of this matrix is referenced. Thus, once
we copy the block Aˆip to the GPU, we can update both Xi
and Xp (lines 9–10). The diagonal blocks are updated in
advance (lines 2–4).
Algorithm 3 OOC version 1: X1 := AˆW
Input: Real symmetric matrix Aˆ ∈ Rk×k, W ∈ Rk×w
Output: Real matrix X1 ∈ Rk×w
1: Copy W → dW
2: for p := 1, p < n, p := p+ bˆ do
3: Copy Aˆpp → dA
4: dXp := dA dWp
5: end for
6: for p := 1, p < n, p := p+ bˆ do
7: for i := p+ bˆ, i < n, i := i+ bˆ do
8: Copy Aˆip → dA
9: dXp := dXp + dAT dWi
10: dXi := dXi + dA dWp
11: end for
12: end for
The next two operations, corresponding to equations (9)
and (10), can be computed with two calls to the GEMM rou-
tine providedX1,X3 andW (Y ) are small enough to fit into
the GPU memory. Note that the factor Y has to be copied
to the GPU before the third operation (10) (we can reuse the
storage space dW ).
After the first step, both matrices W and X1 are stored
in the GPU memory, while matrix Aˆ is unchanged. In the
second phase, X2, of size w×w, is stored in the workspace
dA as w is chosen so that w ≤ bˆ. For the third step, X3
overwrites dX ; therefore no additional workspace on the
GPU is required.
The last operation, corresponding to (11), is performed
by copying the blocks in the lower triangular part of Aˆ to
the GPU and updating them as:
Aˆip := Aˆip + (X3)iY
T
p + Yi(X3)
T
p . (14)
Now, as the block size bˆ is chosen so that only one block
Aˆip fits into the GPU memory, after each update the corre-
sponding block has to be retrieved back to the CPU.
The advantage of this version is that we keep dW and
one of X1, X2 or X3 in the GPU memory, and only the
block Aˆip is copied to/from the GPU. This reduces the num-
ber of memory transfers and enables data reuse once it re-
sides on the GPU. However, with the increase of the n, the
block size/bandwidth have to be reduced which leads to a
loss of efficiency as the number of transfers grows and the
algorithm operates on smaller blocks.
3.2.2 Variant Update-2
The second version of the algorithm requires less memory
space on the GPU, but performs a higher number of memory
transfers to the GPU. Nevertheless, the memory transfers
can be compensate with the computation as the block size
remains large enough, as presented in the section 5.
This version requires one block of size bˆ×bˆ to keep block
Aˆip, one panel of size k × w, and one additional panel of
size bˆ×w. Therefore, this version requires less storage than
the previous variant so that the block size/bandwidth can be
larger. The drawback of the new version is that it increases
the number of memory transfers as we will now hold only
one factor, W or Y , in the GPU memory, while X1 or X3
will be transferred by blocks.
Consider first the computation in (8). As we keep only
one block of size bˆ × w of X1 on the GPU, some changes
are required to Algorithm 3 in lines 5–11. These lines are
exchanged with the lines presented in Algorithm 4. Each
Algorithm 4 OOC version 2 difference to version 1
1: for i := 1, i < n, i := i+ bˆ do
2: if i = p then
3: Copy Aˆip → dA
4: dX := dX + dA dWp
5: end if
6: end for
7: Copy dX → X1(p : p+ bˆ− 1, : )
block Aˆip has to be transferred to the GPU which results
in twice as much copying as occurs in OOC version 1. In
each pass of p-loop one block of bˆ rows X1 (stored in dX)
is calculated (line 8). As we do not have storage space on
the GPU to keep full X1, the block has to be saved in the
CPU memory. Therefore, an additional workspace on the
CPU of size k × w is required.
At the beginning of the second step/equation (9), the ma-
trix X1 is stored on the CPU and, therefore, X1(i : i +
bˆ − 1, 1: w) is copied to dX block-by-block, and then
X2 is updated as X2 := X2 + 12 dX
T dWi, for i =
1, bˆ+1, 2bˆ+1, . . . , k− bˆ. Block dA is used to keep X2, so
no additional workspace is necessary. The only requirement
is that bˆ ≥ w, so the dA can hold the w × w block X2.
Before equation (10), factor Y is saved in dW (as
the later is not required in future computations). Then,
X1(i : i + bˆ − 1, 1: w) is copied to dX and the block
X3(i : i+bˆ−1, 1: w) is computed as dX := dX+dWi dA,
for i = 1, bˆ + 1, 2bˆ + 1, . . . , k − bˆ. Once each block is
updated, the resulting dX is transferred to X3(i : i + bˆ −
1, 1: w). X3 is then saved in the same workspace as X1
(X1 is not required anymore).
In the last step, corresponding to equation (11), blocks
Xi = X3(i : i + bˆ − 1, 1: w) and Xp = X3(p : p + bˆ −
1, 1: w) are needed in order to update each block Aˆip.
Thus, two additional transfers of blocks to the GPU are
required to update Aip, as shown in Algorithm 5. can be
updated as follows:
Algorithm 5 OOC version 2: Symmetric 2k update
1: Copy Aˆip → dA
2: Copy Xi → dX
3: dA := dA+ dX dY Tp
4: Copy Xp → dX
5: dA := dA+ dYi dXT
6: Copy dA→ Aˆip
4 Remaining Stages
In this section, we briefly describe the hybrid CPU-GPU
implementation of the reduction of the band matrix result-
ing from the initial stage. We focus on how to exploit the
GPU in this second stage, organizing the computations care-
fully to reduce the number of data transfers between the
memory spaces of the host and the GPU.
Our algorithm is a hybrid CPU-GPU implementation of
routine SBRDB from SBR. For a full description of the cor-
responding CPU algorithm, refer to [3].
The algorithm operates on the band matrix B stored into
aw1×n array, where thew1 is the band size and n is the di-
mension of the matrix. The main diagonal occupies the first
row of the array, and the second, third, . . . subdiagonals are
saved in successive rows of this matrix. In other words, if
Bˆ is a band storage representation of B, then each element
B(i, j) is saved in entry Bˆ(1 + i− j, j).
Assume the n × n band matrix Bw1 , of bandwidth w1,
is to be transformed into a band matrix Bw2 , with narrower
bandwidth w2 < w1. The goal is thus to annihilate the ele-
ments of d = w1−w2 subdiagonals. The procedure is based
on a “annihilate-and-chase” strategy and employs House-
holder transforms for this purpose. The first step annihilates
the d outermost subdiagonals from the first nb columns. In
particular, consider Figure 2 (left). After the computation
of the QR factorization of the block labelled as “QR”, the
transforms accumulated in Q = I+WY T are applied from
the left to block “PRE”, from both sides to the symmetric
block “SYM” (only the lower triangular part is actually up-
dated/referenced), and from the right to block “POST”.
The application of the transforms to block “POST” fills-
in d diagonals below the band. Next, the first nb columns
of the space that was filled-in –see Figure 2 (right)– are
removed by a second QR decomposition, and the corre-
sponding transforms are applied to the blocks marked with
“PRE”, “SYM”, and “POST” in the figure. This generates
new fill-in below the “POST” block, and the process is re-
peated, chasing down the fill-in until it is pushed off the
matrix. After the completion of this, a new annihilate-and-
chase loop starts in order to eliminate a new block of nb
columns.
POST
PRE
w1
w2
QR
SYM
w2
PREQR
SYM
POST
w1
Figure 2. Left: Annihilation of the outermost
sub–diagonals and updating the rest of the
matrix. Right: bulge chasing part of the first
fill-in block.
During the reduction to narrower band form, the updates
marked as “SYM“ and “POST“ are candidates for being
computed on the GPU, while the QR decomposition (”QR”)
is better suited for the CPU. The block “PRE“ can be up-
dated in either the CPU or the GPU but, to reduce memory
transfers, in our code this computation is performed in the
GPU as well. Now, assume that initially the entire matrix
Bw1 resides in the GPU memory and that in the first i − 1
column blocks of size nb all sub–diagonals below w2 are
annihilated. The algorithm then proceeds by repeating the
“annihilate-and-chase“ step as shown in Algorithm 6.
Algorithm 6 ith Annihilate-and-chase step
Input: Band matrix Bw1 ∈ Rw1×n
Output: Matrix Bw2
1: Copy “QR” from GPU to CPU
2: Compute QR of the “QR” →W , Y
3: Copy W , Y to the GPU and update “PRE”
4: for k = i+ w2, k < n, k = k + w1 do
5: Update “SYM” and “POST” on the GPU
6: Copy fill-in block “QR” from GPU to CPU
7: Compute the QR of the “QR” →W , Y
8: Copy W , Y to the GPU, and update “PRE”
9: end for
Upon completion, there exists an updated copy of the
band matrix Bw2 in the CPU. (The updated parts of the ma-
trix were transfered at the beginning of each step.) Addi-
tional storage is required in the GPU to keep the W and Y
factors as well as an d× n workspace for the fill-ins gener-
ated during the successive annihilate-and-chase steps.
The last stage of the algorithm, the reduction from
banded to the tridiagonal form, is done using rou-
tine SBRDT.
5 Experimental Results
The target platform was a workstation with two Intel
Xeon E5520 QuadCore CPUs running at 2.27 GHz, with
24 GB RAM, and an aggregated theoretical peak perfor-
mance of 74.6 GFLOPS in double-precision (1 GFLOPS =
109 flops/second). The workstation was also equipped with
an NVIDIA Tesla C2050 running at 1.15 GHz, with 2.8 GB
RAM, and a theoretical peak performance of 515 GFLOPS
in double-precision. The Intel chipset and the Tesla board
are connected via a PCI-Express Gen2 interface with a peak
bandwidth of 4.6 GB/second. GNU C compiler (gcc 4.1.2)
and GotoBLAS2 were employed for all computations per-
formed on the Intel cores. NVIDIA CUBLAS (version 4.0)
built on top of CUDA (version 4.0) were used in our tests.
Double precision was employed in all our experiments and
no page-locked memory on the CPU was used.
In our results, we consider that the cost of the OOC re-
duction from full to banded form is 2n3/3 flops, and that of
the full multi-stage reduction from full to tridiagonal form
(either using SBR SYRDB+SBRDT or OOC banded form
reduction+SBRDB+SBRDT) is 4n3/3 flops. This count
may be considerably smaller than the actual number of flops
performed, but it allows a fair comparison of the different
variants that we designed for any bandwidth. We do not
build the orthogonal factors in our experiments. The cost of
all data transfers between main memory and GPU memory
is included in the timings.
Note that the matrices with n > 70,000 can not fit into
the main memory of our platform. Therefore we have sim-
ulated the computation on such matrices by generating the
matrix that can fit into the main memory (i.e. one panel).
The calculations and copying to the GPU are done as we
calculate with the large matrix (but always operating on the
same panel). In that way we have made the correct number
of data transfers and the computations although the final re-
sult is not correct.
Our first experiments analyze the performance of the
OOC reduction of a full symmetric matrix to banded form.
These experiments evaluate the algorithms described in
Subsection 3.1 and 3.2. Specifically, we consider the fol-
lowing two variants for the reduction from full symmetric
matrix to banded form:
Variant 1 employs variants QR-1 and Update-1.
Variant 2 employs variants QR-2 and Update-2.
Variant 2 requires less storage space on the GPU than Vari-
ant 1 at the cost of increasing the number of memory trans-
fers from/to the GPU.
In all our tests the parts of the code that are executed on
the CPU employ multi-threaded BLAS/LAPACK libraries
utilizing all 8 available cores.
20
30
40
50
60
70
80
90
100
0 50000 100000 150000 200000
G
FL
O
PS
Matrix size
Variant 1,w=1024
Variant 1,w=2048
Variant 1,w=3072
Variant 2,w=1024
Variant 2,w=2048
Variant 2,w=3072
Figure 3. Performance of the OOC Variant 1
(red) and Variant 2 (blue).
Preliminary experiments showed that the optimal block
size for the hybrid QR algorithm was b˜ = 128. The per-
formance of the reduction from full to banded form mostly
depends on the band size w, as shown in Figure 3. In partic-
ular, increasing the band size decreases the number of com-
putations and increments the performance. The best perfor-
mance attained with Variant 1 is 89 GFLOPS (2, 048 sec.)
for n=70,000 andw=2,048. The performance is bounded by
the total memory capacity of GPU, as for the matrix of di-
mension n=70,000 and the maximum bandwidth w=2,272
the largest block size for Variant 1 is bˆ=3,035. By decreas-
ing the bandwidth, the block size can be made larger, but
the overall performance decreases. The problem with the
scalability occurs when the matrix size n grows as two ma-
trices of size n × w have to be stored in the GPU mem-
ory and, thus, less storage is available for the workspace
of size bˆ × bˆ. This is illustrated in Figure 3 for w=3,072.
In that case, the largest problem dimension that could be
evaluated for w=3,072 was 30,000. The performance of
Variant 2 (Figure 3 blue) is only slightly lower than that
of Variant 1. The reason for that is that we perform more
memory transfers from/to the GPU. However, at the same
time, we can operate with larger blocks bˆ. For the matrix
of size n=200,000 and the bandwidth w=1,024, Variant 2
achieves 83,9 GFLOPS (58,885 sec). Compared with Vari-
ant 1, for the same bandwidth w, we can solve much larger
problems. The maximum matrix size that we have tested
was n=250,000 with the band size w=1,024. The execution
rate for that specific problem is 91 GFLOPS which outper-
forms Variant 1 for the same bandwidth but smaller problem
size (n=70,000).
The impact of the memory transfer to/from the GPU in
the total execution time decreases as the band size increases.
The impact of the copy is starting from≈46% forw=512 up
to 9.5% for w=4256 and does not depend on the problem
size n. For the same problem size, decreasing the band size
results in the considerable increase of the transfers while the
computational time remains almost the same.
The second experiment compares our OOC multi-stage
algorithm for the reduction of a full symmetric matrix to
tridiagonal form with the (in-core) SBR two-stage approach
(SYRDB + SBRDT) as presented in [1]. Our approach per-
forms the first and second stage (SYRDB and SBRDB) on
the GPU, while the rest of the computation is done on the
CPU. The flop counts for both alternatives is estimated as
4n3/3. Our algorithm for reduction to tridiagonal form con-
sists of three phases: The first phase performs the reduction
from full symmetric matrix to banded form employing the
OOC algorithm (Section 3); the second phase is the GPU
implementation of the SBR routine SBRDB (Section 4) for
reduction from band to narrower band; and the third phase
invokes the SBR routine SBRDT for reduction of the banded
matrix to tridiagonal form. After the first phase, band ma-
trices are stored in compact format using a w × n array.
In Table 1 we compare the performance of the two OOC
multi-stage algorithms, that use Variant 1 and Variant 2 of
the OOC reductions from full to banded form, with that of
the SBR two-stage GPU implementation.
Table 1. Performance in seconds of two OOC
versions and SBR two-stage algorithms on
the GPU.
Size SBR 2-stage OOC version 1 OOC version 2
total copy total copy total copy
5000 7.25 0.28 8.35 0.32 8.42 1.00
10000 39.60 0.83 58.78 3.02 40.04 5.15
16000 128.80 2.23 108.52 12.95 122.94 18.65
18000 174.20 2.62 150.38 11.16 156.55 22.46
The largest problem size that can be solved in double-
precision arithmetic with the in-core GPU implementation
of the SBR two-stage approach is n=18,000. Our approach,
on the other hand, can be used for problems of size up to
n=300,000 (as long as we can maintain w ≥512, the bot-
tleneck is the amount of the CPU main memory). In our
algorithm for the reduction to tridiagonal form, the most
time-consuming part is the first phase (reduction from full
to band form) as this routine performs most of the compu-
tations. The best performance for this stage, as presented in
the Figure 3 is achieved when the band size is large. The
band size for the last phase (reduction from band to tridi-
agonal) affects the performance as well as should be kept
small. Therefore, the best performance is achieved when
the band size for the first phase is 2048, and for the third is
64. The second phase is an in-core algorithm that operates
on the GPU, and requires only a workspace of dimension
(2w + 1) × n. The design of an efficient OOC reduction
algorithm from band to narrower band, that solves the cur-
rent constraints on the bandwidth, is among future research
goals.
6 Conclusions
In this paper, we have proposed new OOC algorithms
for the reduction of a dense symmetric matrix to tridiagonal
form when the problem data does not fit into the GPU mem-
ory. The crucial point is the transformation from a full dense
matrix to banded form, when most of the computations are
performed on the full matrix. The second phase, the re-
duction from band matrix to narrower band or directly to
tridiagonal form, can be executed using in-core algorithms,
as the bandwidth can be chosen so that the band matrix,
saved in compact form, fits into the GPU memory. The im-
provements in our algorithms come from a careful reformu-
lation of the two-sided update as an OOC algorithm. Specif-
ically, the input matrix is divided into blocks that fit into the
GPU memory and the algorithm is redesigned to amortize
the transfers of data with the computation, calculating with
large blocks so that the ratio between the computation and
transfer remains high.
Our experiments demonstrate that the performance of the
OOC algorithm greatly depends on the bandwidth, increas-
ing with it. The experiments also show that, for the re-
duction to band form, the use of OOC techniques provides
higher scalability and higher performance, up to 2− 3× for
large problems, than their in-core counterparts.
References
[1] P. Bientinesi, F. D. Igual, D. Kressner, M. Petschow, and
E. S. Quintana-Ortı´. Condensed forms for the symmetric
eigenvalue problem on multi-threaded architectures. Conc.
& Comp.: Practice and Experience, 23(7):694–707, 2011.
[2] C. H. Bischof, B. Lang, and X. Sun. Algorithm 807: The
SBR Toolbox—software for successive band reduction. ACM
Trans. Math. Soft., 26(4):602–616, 2000.
[3] C. H. Bischof, B. Lang, and X. Sun. A framework for sym-
metric band reduction. ACM Trans. Math. Soft., 26(4):581–
601, 2000.
[4] S. Dhillon, N. Parlett, and Vomel. The design and imple-
mentation of the MRRR algorithm. ACM Trans. Math. Soft.,
32(4):533–560, 2006.
[5] G. H. Golub and C. F. V. Loan. Matrix Computations. The
Johns Hopkins University Press, Baltimore, 3rd edition, 1996.
[6] P. Luszczek, H. Ltaief, and J. Dongarra. Two-stage tridiagonal
reduction for dense symmetric matrices using tile algorithms
on multicore architectures. In IPDPS, pages 944–955. IEEE,
2011.
[7] S. Tomov, R. Nath, and J. Dongarra. Accelerating the reduc-
tion to upper hessenberg, tridiagonal, and bidiagonal forms
through hybrid GPU-based computing. Parallel Computing,
36(12):645–654, 2010.
