Unleashing GPU acceleration for symmetric band linear algebra kernels and model reduction by Benner, Peter et al.
Unleashing GPU Acceleration for Symmetric Band Linear
Algebra Kernels and Model Reduction
Peter Benner · Ernesto Dufrechou · Pablo Ezzatti ·
Enrique S. Quintana-Ort´ı · Alfredo Remo´n
Received: date / Accepted: date
Abstract Linear algebra operations arise in a myriad
of scientific and engineering applications and, therefore,
their optimization is targeted by a significant number of
high performance computing (HPC) research eﬀorts. In
particular, the matrix multiplication and the solution
of linear systems are two key problems with eﬃcient
implementations (or kernels) for a variety of high per-
formance parallel architectures. For these specific prob-
lems, leveraging the structure of the associated matrices
often leads to remarkable time and memory savings,
as is the case, e.g., for symmetric band problems. In
this work, we exploit the ample hardware concurrency
of many-core graphics processors (GPUs) to accelerate
the solution of symmetric positive definite band linear
systems, introducing highly tuned versions of the corre-
sponding LAPACK routines. The experimental results
with the new GPU kernels reveal important reductions
of the execution time when compared with tuned imple-
mentations of the same operations provided in Intel’s
MKL. In addition, we evaluate the performance of the
GPU kernels when applied to the solution of model or-
der reduction problems and the associated matrix equa-
tions.
P. Benner · A. Remo´n
Max Planck Institute for Dynamics of Complex Technical
Systems, Magdeburg, Germany
E-mail: {benner,remon}@mpi-magdeburg.mpg.de
E. Dufrechou · P. Ezzatti
Instituto de Computacio´n, Universidad de la Repu´blica, Mon-
tevideo, Uruguay
E-mail: {edufrechou,pezzatti}@fing.edu.uy
E. S. Quintana-Ort´ı
Depto. de Ingenier´ıa y Ciencia de Computadores, Universi-
dad Jaime I, Castello´n, Spain
E-mail: quintana@icc.uji.es
1 Introduction
Linear systems with symmetric positive definite (s.p.d.)
band coeﬃcient matrix arise, among others, in the nu-
merical solution of partial diﬀerential equations, finite
element analysis in civil engineering, and as part of
matrix equation solvers in control and systems theory.
In practice, exploiting the structure of the matrix in
these problems yields huge savings, both in the num-
ber of computations and storage space. Therefore, it is
natural that LAPACK [1,10] comprises eﬃcient meth-
ods to solve s.p.d. band linear systems on general-pur-
pose (multicore) processors, provided a tuned (multi-
threaded) implementation of BLAS is available.
Hybrid compute servers, consisting of general-pur-
pose multicore processors (CPUs) and a graphics pro-
cessing unit (GPU), have evolved dramatically in the
last decade, becoming an interesting platform to tackle
many scientific and engineering applications with high
computational requirements [19]. Among the reasons
that have contributed to the progressive adoption of
GPU hardware accelerators, we can highlight the in-
troduction of application programming interfaces such
as CUDA [15,12], OpenCL [14] and OpenACC [16], in
conjunction with aﬀordable price, impressive raw per-
formance, and favorable power–performance ratio. In
particular, in the area of dense linear algebra, many
studies have recently demonstrated the benefits of GPU
computing; see, among many others, [20,3,6,5].
In this paper we present new LAPACK-style rou-
tines (kernels) that leverage the large hardware con-
currency of hybrid CPU-GPU platforms to accelerate
the solution of s.p.d. band linear systems. In particular,
our experimental evaluation, performed on two plat-
forms equipped with hardware from the two latest gen-
erations of nVIDIA’s GPUs (nVIDIA “Fermi” S2070
2 Peter Benner et al.
and nVIDIA “Kepler” K20), exposes that the GPU-
enabled routines oﬀer superior performance and scala-
bility as compared to the highly-tuned multithreaded
symmetric band kernels in Intel’s MKL (Math Ker-
nel Library). Furthermore, the application of the novel
codes to a matrix equation solver shows how these ben-
efits carry over to the solution of model reduction prob-
lems arising in control theory.
The rest of the paper is structured as follows. In
Section 2 we revisit the BLAS routines for the band
symmetric matrix multiplication and the GPU versions
proposed in [11]. Next, in Section 3, we study the LA-
PACK strategies for the solution of s.p.d. symmetric
band linear systems on multicore architectures, and we
introduce our new hybrid CPU-GPU routines in de-
tail. Section 4 presents the experimental evaluation of
our new solvers, and Section 5 analyses the application
of the new kernels to the solution of band symmetric
Lyapunov matrix equations arising in model reduction.
Finally, a few concluding remarks close the paper in
Section 6.
2 Symmetric Band Matrix Multiplication
In this section we describe several kernels that lever-
age the computational power of GPUs to accelerate the
computation of the matrix multiplication:
C = AB + C, (1)
where A is a symmetric band matrix. In this operation,
for simplicity we consider that A,B,C are all three of
dimension n× n hereafter.
2.1 The operation in BLAS
The BLAS specification contains several routines to
operate with symmetric band matrices. In particular,
BLAS include kernel sbmv to compute a matrix-vector
product involving a band symmetric matrix but, in con-
trast, the interface does not oﬀer the equivalent ker-
nel for matrix multiplication when one of the matrices
presents a band structure. To tackle this, the matrix
multiplication can be easily implemented on top of the
sbmv routine. Concretely, in (1) we can partition the
matrix B column-wise, and perform the sequence of
matrix-vector products:
Cj = ABj + Cj , 1 ≤ j ≤ n, (2)
where Cj , Bj stand for the j-th columns of B and C
respectively.
Although this simple approach allows to compute
the matrix multiplication using only BLAS kernels, it
is based on a level-2 BLAS routine, while the prod-
uct of matrices is, in principle, a level-3 BLAS oper-
ation. Thus, with this implementation each element of
the matrix A is accessed n times, in general resulting
in a suboptimal usage of the memory hierarchy.
2.2 Algorithm sbmmBLK
Blocked algorithms for linear algebra operations take
advantage of the memory hierarchy of current computer
architectures to hide the limited memory latency and
bandwidth and deliver higher performance. In this con-
text, in [11] we proposed the blocked algorithm to com-
pute the matrix multiplication (1) given in Figures 1
and 2. This implementation only accesses the elements
in the lower triangular part of A, adhering to the packed
storage format employed by BLAS and LAPACK for
this type of matrices. Analogous procedures that only
access the elements in the upper triangle or all the ele-
ments of A are straight-forward.
The algorithm consists of two loops. The outer loop
(Figure 1) partitions the matrices B and C into blocks
of c columns and, at each iteration, updates the ele-
ments in the active column-block of C. The inner loop
(Figure 2) proceeds along the main diagonal of A (i.e.,
from the top-left corner to the bottom-right one), up-
dating the corresponding elements of C. Matrices B
and C are partitioned row-wise, while A is partitioned
into 3× 3 blocks. At each iteration, the blocks Ai1 and
Bi, with i = 1, 2, 3, are accessed; while the blocks Ci,
are updated. Note that A11 and A31 are, respectively,
lower and upper triangular blocks. Figure 3 details the
blocks accessed and updated at a given iteration of the
inner loop.
Algorithm sbmmBLK can be conveniently adapted
to the underlying architecture and problem by care-
fully choosing the blocking parameters c and b, which
strongly depend on the memory organization of the tar-
get architecture and the bandwidth k of the matrix A.
For example, in current multicore processors, a “small”
b is usually a convenient choice (e.g., b = 32), while for
GPUs, larger values are recommended (e.g., b = 128),
see [4,17].
2.3 GPU implementations
We next describe two routines to compute the symmet-
ric band matrix multiplication on a GPU accelerator.
These implementations intensively invoke kernels from
CUBLAS to carry out the computation. In both cases,
the matrices are initially sent to the GPU; the corre-
Unleashing GPU Acceleration for Symmetric Band Linear Algebra Kernels and Model Reduction 3
Algorithm: [C] := sbmmBLK outer(C,A,B, k)
Partition C → ￿CL CR ￿ , B → ￿BL BR ￿
where CL, BL have 0 columns
while n(CL) < n(C) do
Determine block size c
Repartition￿
CL CR
￿→ ￿C0 C1 C2 ￿ , ￿BL BR ￿→ ￿B0 B1 B2 ￿
where C1, B1 have c columns
C1 := sbmmBLK inner(C1, A,B1, k)
Continue with￿
CL CR
￿← ￿C0 C1 C2 ￿ , ￿BL BR ￿← ￿B0 B1 B2 ￿
endwhile
Fig. 1 Outer loop of Algorithm sbmmBLK that computes C := AB+C. In the notation, n(·) returns the number of columns
of a matrix.
Algorithm: [E] := sbmmBLK inner(E,A,D, k)
Partition E →
 ETEM
EB
 , A→
 ATLAML AMR
ABR
 , D →
 DTDM
DB

where ET , DT have 0 elements; ATL is 0× 0 and EM , AML have k rows
while m(ET ) < m(E) do
Determine block size b
Repartition
 ETEM
EB
→

E0
E1
E2
E3
E4
 ,
 ATLAML AMR
ABR
→

A00
A10 A11
A20 A21 A22
A31 A32
A42
 ,
 DTDM
DB
→

D0
D1
D2
D3
D4

where D1 has b rows;
E1 has b rows;
E3 has 0 rows if m(D0) > (n(A)− k − 1) and has b rows otherwise;
A11 is b× b;
A31 is empty if m(D0) > (n(A)− k − 1) and is b× b otherwise;
E1 := E1 +A11D1
E1 := E1 +AT21D2
E1 := E1 +AT31D3
E2 := E2 +A21D1
E3 := E3 +A31D1
Continue with
 ETEM
EB
←

E0
E1
E2
E3
E4
 ,
 ATLAML AMR
ABR
←

A00
A10 A11
A20 A21 A22
A31 A32
A42
 ,
 DTDM
DB
←

D0
D1
D2
D3
D4

endwhile
Fig. 2 Inner loop of Algorithm sbmmBLK that computes C := AB +C. In the notation, m(·) returns the number of rows of
a matrix.
sponding operations are next executed; and the result
is finally transferred back to the CPU.
2.3.1 Kernel sbmmblk
Routine sbmmblk is an implementation of Algorithm
sbmmBLK with all the computations performed via the
appropriate CUBLAS kernels. The update of C1 re-
4 Peter Benner et al.
A30
.
k+1
= +
b
b
updated
read
cc k+1
k+1
c
B1C1 C1
C2C2
C3 C3 A31
A21
A11
Fig. 3 Elements read and updated during an iteration of the inner loop of the sbmmBLK algorithm. The matrix bandwidth
is denoted as k.
quires three matrix multiplications: one involving a sym-
metric block (A11); a second with an upper triangular
block (A31); and the last one with two general matrices.
CUBLAS provides specific routines for all these oper-
ations. In addition, C2 and C3 are respectively updated
via a product of two general matrices and a product of
an upper triangular matrix times a general matrix.
The use of CUBLAS routines also presents a draw-
back, as the kernel that implements the product of a tri-
angular matrix times a general matrix (routine trmm)
indeed computes
C = op(A) B, (3)
where A is an upper or lower triangular matrix and
op(A) denotes A or AT . In contrast, the updates of C1
and C3 both require an operation of the type
C = C + op(A) B, (4)
To overcome this problem, routine sbmmblk updates C1
with the following sequence of operations:
(trmm) W = AT31B1,
(geam) C1 = C1 +W.
(Next to each operation, we indicate the CUBLAS ker-
nel that implements it.) The update of C3 is analo-
gous. In both cases, this procedure requires an auxiliary
workspace W of dimension b× c.
High performance can be expected of this imple-
mentation due to the use of tuned CUBLAS routines.
2.3.2 Kernel sbmmblk+ms
The sbmmblk implementation presents some drawbacks
with a negative impact on performance. In particular, it
requires up to 6 operations per iteration. Furthermore,
two of the operations involve a triangular matrix and
are computed in two steps (as discussed in the previ-
ous subsection). Thus, up to 8 kernels are invoked at
A31
A21
A11
Padding
Fig. 4 Modified storage scheme for symmetric band matrices
and how it is accessed in the inner loop from sbmmBLK .
each iteration, and some of them require a low com-
putational eﬀort (e.g., the two invocations to geam).
The sbmmblk+ms implementation aims to reduce the
number of routine invocations by removing those calls
with a lower computational cost. To make this possi-
ble, we perform some changes in the matrix storage.
Concretely, consider that the lower triangular part of
the symmetric band matrix A is stored following the
BLAS specific format. Then, we add b additional rows
to the bottom of A where, for a GPU accelerator, this
specific value of b is chosen to enable a coalesced access
to the elements of A. (When the upper part of A is
stored, then the new rows should be added at the top
of A.)
Figure 4 shows the modified storage scheme and how
it is accessed during an iteration of the inner loop of
sbmmBLK . The strictly lower triangular part of A31
is now conveniently placed in the added rows. Conse-
quently, the blocks A21 and A31 can be merged, and
the operations they are involved in can be fused. Thus,
the updates performed at each step of the inner loop
can be reorganized as:
Unleashing GPU Acceleration for Symmetric Band Linear Algebra Kernels and Model Reduction 5
E1 := E1 +A11D1,
E1 := E1 +
￿
AT21A
T
31
￿ ￿D2
D3
￿
,￿
E2
E3
￿
:=
￿
E2
E3
￿
+
￿
A21
A31
￿
D1.
This approach presents two major advantages:
– The number of invocations to CUBLAS kernels is
reduced from 8 to 3 per step and, consequently, the
overhead introduced by the invocations is also re-
duced.
– It eliminates the calls to kernels with a moderate to
low cost, which can not exploit the massively paral-
lel architecture of the GPU. Concretely, the opera-
tions that disappear involve triangular matrices and
present load-balancing problems.
There are also two drawbacks in this implementation:
the memory requirement is enlarged; and the number
of arithmetic operations is also increased, as the new
kernel operates with the null elements in A11 and A31.
3 Solution of S.P.D. Band Linear Systems
In this section, we review the procedure for the solution
of s.p.d. band linear systems in LAPACK, and detail
how to accelerate this computation when the target is
a hybrid CPU-GPU platform.
3.1 LAPACK solver
LAPACK supports the solution of linear systems of the
form
AX = B, (5)
where A ∈ Rn×n, B ∈ Rn×m, and X ∈ Rn×m is the
sought-after solution. When A is s.p.d., the system is
solved in three steps:
1. Compute the Cholesky factorizationA = LLT , where
L is lower triangular1.
2. Solve the lower triangular system L Y = B
3. Solve the upper triangular system LT X = Y
BLAS and LAPACK support this method by pro-
viding tailored triangular solvers for steps 2–3 and the
factorization in step 1, respectively. When A exhibits a
band structure, the factorization of this matrix is ini-
tially computed using the kernel pbtrf, and then two
1 Alternatively, one can decompose the matrix as A =
UTU , where U = LT is upper triangular.
band triangular systems (involving L and its transpose)
are solved for each column of Y,X (routine tbsv). This
is necessary since there is no routine in BLAS to solve a
linear system with multiple right-hand sides when the
coeﬃcient matrix presents a triangular band form.
This implementation benefits from the intensive use
of tuned BLAS kernels, but has also some important
limitations. Concretely, an important drawback lies in
that L is fully accessed 2m times, yielding an ineﬃcient
use of the memory hierarchy.
3.2 Hybrid CPU-GPU s.p.d. band solvers
3.2.1 Basic solver (GPUV 1)
This blocked algorithm, illustrated in Figure 5, oﬀ-loads
the computationally-intensive operations of the factor-
ization to the accelerator, while exploiting the multi-
core processor to eﬃciently execute the fine-grain oper-
ations. In particular, the CPU computes the factoriza-
tion of block A11 and the GPU performs the remaining
updates. Additionally, we leverage a modified storage
scheme analogous to that described in Section 2 in or-
der to reduce the number of operations performed by
the GPU at each iteration from 5 to only 2: a triangular
solver to update [A21;A31]; and a rank-k update for the
remaining three blocks.
Once the factorization is computed, the subsequent
triangular-band linear systems are solved in the CPU.
The reason is twofold: (i) usually the number of columns
of B is reduced and therefore this operation presents a
moderate cost; and (ii) its degree of data-parallelism is
limited by the inherent dependencies.
In summary, the factorization is accelerated on the
GPU via the fusion of operations. Moreover, the tri-
angular solves are accelerated when B presents several
columns, as the number of memory accesses to L is re-
duced to only once per solve.
3.2.2 Merged solver (GPUV 2)
The three main operations (factorization and two tri-
angular solvers) in the previous variant are computed
sequentially. However, the factorization and the first
of the triangular band solves can be obtained concur-
rently, as the first solve requires the elements of L in
the same order as they are computed during the factor-
ization. Consequently, as soon as one element/block of
L is computed, the corresponding operations of the first
triangular band linear solve can be performed. This is
especially appealing in our implementation, as the bulk
6 Peter Benner et al.
Algorithm: [A] := pbtrfBLK(A, k)
Partition A→
 ATL ATM ATRAML AMM AMR
ABL ABM ABR

where ATL is 0× 0, AMM is k × k
while m(ATL) < m(A) do
Determine block size b
Repartition
 ATL ￿ ￿AML AMM ￿
ABL ABM ABR
→

A00 ￿ ￿
A10 A11 ￿ ￿
A20 A21 A22 ￿ ￿
A31 A32 A33 ￿
A42 A43 A44

where A11, A33 are b× b, A22 is l × l,
with l := min(k − b,m(A)−m(ATL)− b)
Copy to CPU(A11)
A11 := pbtrfUNB(A11) (in CPU)
Copy to GPU(A11)
A21 := A21 tril(A11)−T (in GPU)
A31 := A31 tril(A11)−T (in GPU)
A22 := A22 −A21AT21 (in GPU)
A32 := A32 −A31AT21 (in GPU)
A33 := A33 −A31AT31 (in GPU)
Continue with
 ATL ￿ ￿AML AMM ￿
ABL ABM ABR
←

A00 ￿ ￿
A10 A11 ￿ ￿
A20 A21 A22 ￿ ￿
A31 A32 A33 ￿
A42 A43 A44

endwhile
Fig. 5 Algorithm pbtrfBLK that computes the factorization A = LLT . In the notation, pbtrfUNB(·) is an unblocked
version of this routine and tril(·) returns the lower triangular part of a matrix.
of the computations during the factorization are per-
formed in the GPU while the system solve is executed
in the CPU.
Figure 6 details how the computations involved in
the factorization and the first system solve are reor-
ganized in this variant. The new order allows to over-
lap concurrent computations in both architectures and,
in an ideal scenario, completely hides the cost of the
first triangular band system solve. This option may also
yield a better exploitation of the memory system, due
to the use of the elements of L immediately after they
are computed, which may incur a lower number volume
of cache misses.
4 Experimental Evaluation
In this section we analyse the computational perfor-
mance of the new routines for the solution of s.p.d.
band linear systems, GPUV 1 and GPUV 2, comparing
them against analogous routines in release 11.1 of In-
tel’s MKL (multi-threaded version).
Buenaventura Enrico
GPU nVIDIA K20 nVIDIA S2070
#CUDA cores 2,496 448
Frequency (GHz) 0.71 1.15
GPU memory (GB) 6 5
CPU Intel i3-3220 Intel i7-2600
#CPU cores 2 4
Frequency (GHz) 3.4 3.3
Memory (GB) 16 8
Operating System CentOS 6.4 CentOS 6.2
C/Fortran gcc v4.4.7 gcc v4.4.6
CUDA/CUBLAS 5.0 5.5
Table 1 Hardware and software employed.
The evaluation was carried out on two servers, Bue-
naventura and Enrico, equipped with state-of-the-
art nVIDIA GPUs and Intel multi-core processors; see
Table 1. All experiments were performed using ieee
double-precision real arithmetic. We generated s.p.d.
band coeﬃcient matrices of 8 dimensions n = 38, 400,
Unleashing GPU Acceleration for Symmetric Band Linear Algebra Kernels and Model Reduction 7
Algorithm: [A,B] := pbtrf solver(A, k,B)
Partition A→
 ATL ATM ATRAML AMM AMR
ABL ABM ABR
 , B → ￿BT
BB
￿
where ATL is 0× 0, AMM is k × k, BT is 0× 0
while m(ATL) < m(A) do
Determine block size b
Repartition
 ATL ￿ ￿AML AMM ￿
ABL ABM ABR
→

A00 ￿ ￿
A10 A11 ￿ ￿
A20 A21 A22 ￿ ￿
A31 A32 A33 ￿
A42 A43 A44
 ,
￿
BT
BB
￿
→

B0
B1
B2
B3

where A11, A33 are b× b, A22 is l × l,
B1 is b× n(B), B2 is l × n(B),
with l := min(k − b,m(A)−m(ATL)− b),
Copy to CPU(A11)
A11 := pbtrfUNB(A11) (in CPU)
Copy to GPU(A11)￿
A21
A31
￿
:=
￿
A21
A31
￿
tril(A11)−T (in GPU)
B1 := B1 tril(A11)−T (in CPU)
Copy to CPU([A21;A31])￿
A11 ￿
A21 A22
￿
:=
￿
A22 ￿
A23 A33
￿
−
￿
A21
A23
￿ ￿
AT21A
T
23
￿
(in GPU)
B2 := B2 −
￿
A21
A31
￿
B1 (in CPU)
Continue with
 ATL ￿ ￿AML AMM ￿
ABL ABM ABR
←

A00 ￿ ￿
A10 A11 ￿ ￿
A20 A21 A22 ￿ ￿
A31 A32 A33 ￿
A42 A43 A44

endwhile
Fig. 6 Algorithm pbtrf solver that computes the factorization A = LLT and simultaneously solves L · Y = B, overwriting
B with the result Y .
51,200, 64,000, 76,800, 89,600, 102,400, 115,200 and
128,000. For each matrix size, we produced 3 problem
instances which varied in bandwidth, k =1%, 2% and
4% of n, except for the two largest dimensions in which
the GPU memory could not allocate the problem in-
stance with largest bandwidth. Matrices B and X were
selected to have a single column (i.e., m = 1 and they
are column vectors), so the impact of the solver with
multiple right-hand sides is not analyzed. We evaluated
several values for the algorithmic blocking parameter b
but, for brevity, we only include the results correspond-
ing to the best block size.
Table 2 and Figure 7 compare the performance of
the three kernels for band s.p.d. linear systems: the two
GPU-accelerated kernels (GPUV 1 and GPUV 2) and the
MKL implementation on both platforms. For GPUV 1
and MKL, the costs of the factorization and the solution
of triangular systems are decoupled into two columns
of Table 2: “Fact.” and “Tr. Sol.”, respectively. For
GPUV 2, the time includes the factorization and both
triangular band solves.
The experimental results demonstrate the superi-
ority of the new routines over the MKL solver, espe-
cially when the computational cost is large. In particu-
lar, both GPUV 1 and GPUV 2 outperform the MKL im-
plementation for all the problem instances in platform
Buenaventura. In contrast, in Enrico the MKL im-
plementation is faster for those cases presenting a mod-
erate computational cost. Specifically, the GPU-based
kernels outperform the MKL variant when k = 4% of n
for all tested dimensions, and when n >76,800 and k =
2% of n in this platform.
These results were expected considering the diﬀer-
ence between the peak performance of the CPU and the
GPU that form each platform. Concretely, the CPU
in Enrico has a higher peak performance than that
8 Peter Benner et al.
in Buenaventura. However, the K20 GPU in Bue-
naventura contains a larger number of CUDA cores
than the S2070 GPU in Enrico. In consequence, the
gap between the peak performance of the CPU and the
GPU in Buenaventura is larger than that for the pro-
cessors in Enrico; and Buenaventura is more suited
for the GPU-based solvers.
This gap between the CPU and the GPU computa-
tional power is especially important for GPUV 2. This
variant oﬀ-loads most of the computations during the
factorization to the GPU, while the CPU solves a trian-
gular system. Obviously, this technique suits platforms
where the GPU is much more powerful than the CPU.
As a consequence, GPUV 1 outperforms GPUV 2 in En-
rico, but in platform Buenaventura it is the other
way around.
On the other hand, B has a single column, which
motivates that the MKL triangular solver outperforms
the new codes. A separate evaluation of both solvers
showed that the cost of the MKL triangular solver in-
creases linearly with the number of columns in B (as
expected), while the computational time of the new rou-
tine reports only minimal increments when a moder-
ate number of columns (e.g., 10) is added to B. Thus,
we suggest to use our routine whenever the number of
columns of B is at least 2.
In general, both hybrid codes outperform the MKL
routine for large matrices while they are still competi-
tive for relatively small ones. This is explained by the
fact that the hybrid routines incur in a communication
overhead that can be compensated only when the prob-
lem is considerably large.
5 Application in Model Reduction
Several important problems in the area of control, such
as model order reduction and linear-quadratic (LQ) op-
timal control, involve the solution of linear and quadratic
matrix equations [2]. In this particular work, we focus
our eﬀort on the Lyapunov equation
AX +XAT +BBT = 0, (6)
where A ∈ Rn×n is sparse, B ∈ Rn×m, with m ￿ n,
and X ∈ Rn×n is the sought-after solution. When n
is large, these equations represent the major computa-
tional challenge for the solution of model order reduc-
tion and LQ optimal control problems.
5.1 The LRCF-ADI method
The low-rank Cholesky factor–alternating directions im-
plicit (LRCF-ADI) method [18] is an eﬃcient approach
to tackle (6) when the coeﬃcient matrix A is large and
sparse or structured (e.g. a symmetric band matrix).
This iterative solver benefits from the frequently en-
countered low-rank property of the BBT factor in (6) to
deliver a low-rank approximation to a Cholesky or full-
rank factor of X. Specifically, given an “l–cyclic” set of
complex shift parameters {p1, p2, . . .}, pk = αk + βk ı,
with ı =
√−1 and pk = pk+l, the cyclic low-rank al-
ternating directions implicit (LR-ADI) iteration can be
formulated as shown in Algorithm 1.
Algorithm 1: LR-ADI
begin
V0 ← (A+ p1In)−1B,
Sˆ0 ← √−2 α1 V0,
k ← 0
repeat
Vk+1 ← Vk − δk(A+ pk+1In)−1Vk,
Sˆk+1 ←
￿
Sˆk , γkVk+1
￿
,
k ← k + 1
until until convergence;
There, γk =
￿
αk+1/αk, δk = pk+1+pk, with pk the
conjugate of pk, and In denotes the identity matrix of
order n. On convergence, after kˆ iterations, a low-rank
matrix Sˆkˆ ∈ Rn×kˆm is computed such that SˆkˆSˆTkˆ ≈ X.
From a computational viewpoint, the performance
of the algorithm is determined by that of the matrix
multiplication and the sequence of shifted linear sys-
tems of the form (A + pkIn)−1X = Y , where we note
that, in practice, Y presents only few columns. More-
over, the algorithm used to calculate the shift parame-
ters pk is an Arnoldi iteration that relies on the solution
of linear systems of the form Ax = y, where x and y
are both vectors. Hence, it is interesting to evaluate
the impact of the new symmetric band solvers on this
method.
Note that this variant does not incorporate recent
improvements in the ADI method avoiding complex
arithmetic and automating the shift selection process [8].
Here, we use the publicly available implementations in
Lyapack [7].
5.2 Implementation on hybrid CPU-GPU platforms
In Section 3 we presented two hybrid solvers for s.p.d.
band linear systems: GPUV 1 exploits the GPU to ac-
celerate the computation of the Cholesky factorization,
and computes the subsequent triangular band systems
in the CPU. GPUV 2 reorders the operations such that
the factorization and the first band triangular solve are
Unleashing GPU Acceleration for Symmetric Band Linear Algebra Kernels and Model Reduction 9
Platform
Dimension Bandwidth MKL GPUV 1 GPUV 2n k Total Fact. Tr. Sol. Total Fact. Tr. Sol.
Buenaventura
38,400
1% 0.24 0.21 0.02 0.23 0.16 0.07 0.22
2% 0.85 0.80 0.04 0.33 0.19 0.13 0.31
4% 3.89 3.79 0.09 0.65 0.39 0.26 0.62
51,200
1% 0.57 0.52 0.04 0.34 0.21 0.12 0.32
2% 2.63 2.55 0.08 0.57 0.33 0.24 0.53
4% 15.67 15.50 0.17 1.21 0.75 0.46 1.14
64,000
1% 0.92 0.85 0.06 0.49 0.30 0.19 0.47
2% 4.37 4.24 0.13 0.92 0.55 0.37 0.87
4% 17.16 16.89 0.27 2.03 1.30 0.72 1.91
76,800
1% 1.68 1.59 0.09 0.66 0.39 0.27 0.63
2% 7.90 7.70 0.19 1.32 0.79 0.52 1.24
4% 36.67 36.27 0.39 3.15 2.09 1.05 2.94
89,600
1% 3.92 3.79 0.13 0.89 0.53 0.36 0.85
2% 11.75 11.49 0.26 1.82 1.11 0.71 1.73
4% 46.99 46.44 0.54 4.68 3.23 1.44 4.52
102,400
1% 5.32 5.13 0.19 1.14 0.66 0.47 1.08
2% 31.44 31.08 0.35 2.44 1.50 0.93 2.31
4% 96.16 95.43 0.73 7.80 5.67 2.12 6.11
115,200
1% 6.47 6.23 0.23 1.44 0.84 0.60 1.36
2% 24.60 24.17 0.43 3.22 2.04 1.18 3.04
128,000
1% 8.88 8.62 0.26 1.90 1.12 0.78 1.76
2% 35.86 35.30 0.56 4.25 2.76 1.48 3.99
Enrico
38,400
1% 0.14 0.12 0.01 0.23 0.19 0.04 0.23
2% 0.40 0.37 0.03 0.50 0.43 0.07 0.51
4% 1.79 1.72 0.06 0.81 0.67 0.13 0.81
51,200
1% 0.31 0.28 0.03 0.43 0.35 0.07 0.42
2% 0.98 0.92 0.05 1.04 0.92 0.12 1.07
4% 6.02 5.91 0.11 1.56 1.32 0.24 1.60
64,000
1% 0.49 0.45 0.04 0.66 0.55 0.11 0.68
2% 1.68 1.59 0.08 1.81 1.61 0.19 1.84
4% 11.07 10.89 0.17 2.68 2.30 0.38 2.69
76,800
1% 0.81 0.75 0.06 1.01 0.86 0.14 1.03
2% 3.64 3.51 0.12 1.59 1.30 0.28 1.64
4% 19.50 19.25 0.25 4.25 3.70 0.54 4.24
89,600
1% 1.19 1.10 0.08 1.43 1.23 0.19 1.46
2% 6.62 6.46 0.16 2.24 1.86 0.37 2.29
4% 30.40 30.06 0.34 6.29 5.56 0.73 6.27
102,400
1% 1.97 1.85 0.11 2.10 1.85 0.25 2.15
2% 12.23 12.00 0.22 3.15 2.65 0.49 3.23
4% 74.84 74.39 0.44 8.99 8.03 0.96 8.95
115,200
1% 2.37 2.23 0.14 2.79 2.48 0.31 2.85
2% 16.02 15.75 0.27 4.15 3.53 0.61 4.21
128,000
1% 3.40 3.23 0.17 3.63 3.24 0.38 3.70
2% 22.49 22.13 0.35 5.39 4.63 0.76 5.44
Table 2 Execution time (in seconds) of the two versions of the hybrid CPU+GPU s.p.d. band system solver and Intel’s MKL
implementation.
performed concurrently using both architectures, and
the second band triangular solver completes the pro-
cess in the CPU.
In our implementation, we applied GPUV 1 to cal-
culate the shift parameters, as a single matrix factor-
ization is required for this process. On the other hand,
a new factorization is computed per LR-ADI iteration,
and we leverage GPUV 2 for this purpose. We note that
only l diﬀerent coeﬃcient matrices appear during the
LR-ADI procedure while, in general, the number of it-
erations for convergence is larger. Here, due to memory
Problem n k m
RAILS 5,177 139 7
RAILM 20,209 276 7
RAILL 79,841 550 7
Table 3 Instances of the RAIL example from the Oberwol-
fach model reduction collection employed in the evaluation.
restrictions, we decided to recompute the factorizations
instead of storing the Cholesky factors.
10 Peter Benner et al.
Problem
Solver
Speed-up
MKL GPU-based
RAILS 0.45 0.70 0.64
RAILM 2.79 2.81 0.99
RAILL 30.51 15.04 2.02
Table 4 Execution time (in seconds) of the hybrid CPU-
GPU Lyapunov solvers and the MKL-based counterpart in
Buenaventura, and speed-ups of the GPU-accelerated rou-
tines with respect to the CPU-only (MKL) implementation.
5.3 Experimental evaluation
We employed three instances of the RAIL model re-
duction problem from the Oberwolfach benchmark col-
lection [13] for the evaluation of the Lyapunov solvers.
The coeﬃcient matrix A in (6) for these problems is
s.p.d. and sparse. Therefore, we applied the Reverse
Cuthill-McKee algorithm [9] to transform the unstruc-
tured sparse linear systems in the expressions for V0
and Vk+1 in Algorithm 1 into s.p.d. problems with band
coeﬃcient matrix; see Table 3 where n is the matrix di-
mension, k is the resulting bandwidth and m the num-
ber of columns of the rigth hand side matrix (i.e. B).
Table 4 reports the execution times obtained from
the Lyapunov solvers that employ Intel’s MKL to per-
form the computations in the CPU compared with our
hybrid CPU-GPU solvers (GPUV 1-GPUV 2) in buena-
ventura. The results show that the hybrid Lyapunov
solver outperforms its MKL counterpart for the largest
problem instance, reaching a speed-up factor around
2×. For the medium-size problem, the performance of
both solvers is similar; and for the smallest one, the
MKL-based solver is the best option. This was expected
as large problems are needed to exploit the capabilities
of the GPU and, therefore, compensate the overhead
incurred by the CPU-GPU data transfers. It should be
noted that this result is strongly aligned with the ob-
tained results for s.p.d. band linear systems, and im-
plicitily indicates the behaviour that can be expected
from an execution with other problems.
6 Concluding Remarks
We have presented new hybrid CPU-GPU routines that
accelerate the solution of s.p.d. band linear systems by
oﬄoading the computationally-expensive operations to
the GPU. Our first CPU-GPU implementation com-
putes the Cholesky factorization by executing the in-
volved BLAS-3 kernels in the GPU, while maintaining
a modest volume of CPU-GPU communication. The al-
ternative CPU-GPU variant overlaps the Cholesky fac-
torization with the solution of the first triangular band
linear systems. Both variants rely on appropriate ker-
nels from nVIDIA’s CUBLAS and Intel’s MKL.
The experimental results observed using several band
test cases (with dimensions between 38,400 and 128,000
and a bandwidth of 1%, 2% and 4% of the problem
size), in two platforms equipped with modern GPUs,
reveal that our hybrid codes outperform the MKL rou-
tines for large matrices while they are still competi-
tive for relatively small ones. Additionally, we have ap-
plied the new solvers to the solution of Lyapunov equa-
tions. The results show that the proposed Lyapunov
solver outperforms its MKL counterpart in the solu-
tion of large problems, reaching a speed-up of 2×, when
tackling a model order reduction problem of dimension
5,177.
Acknowledgements Ernesto Dufrechou and Pablo Ezzatti
acknowledge the support from Programa de Desarrollo de
las Ciencias Ba´sicas, and Agencia Nacional de Investigacio´n
e Innovacio´n, Uruguay. Enrique S. Quintana-Ort´ı was sup-
ported by project TIN2011-23283 of the Ministry of Science
and Competitiveness (MINECO) and EU FEDER, and project
P1-1B2013-20 of the Fundacio´ Caixa Castello´-Bancaixa and
UJI.
References
1. Anderson, E., Bai, Z., Demmel, J., Dongarra, J.E.,
DuCroz, J., Greenbaum, A., Hammarling, S., McKen-
ney, A.E., Ostrouchov, S., Sorensen, D.: LAPACK Users’
Guide. SIAM, Philadelphia (1992)
2. Antoulas, A.: Approximation of Large-Scale Dynamical
Systems. SIAM Publications, Philadelphia, PA (2005)
3. Barrachina, S., Castillo, M., Igual, F.D., Mayo, R.,
Quintana-Ort´ı, E.S.: Solving dense linear systems on
graphics processors. In: Euro-Par, pp. 739–748 (2008)
4. Barrachina, S., Castillo, M., Igual, F.D., Mayo, R.,
Quintana-Ort´ı, E.S., Quintana-Ort´ı, G.: Exploiting the
capabilities of modern gpus for dense matrix computa-
tions. Concurrency and Computation: Practice and Ex-
perience 21(18), 2457–2477 (2009)
5. Benner, P., Dufrechou, E., Ezzatti, P., Quintana-Ort´ı,
E.S., Remo´n, A.: Accelerating Band Linear Algebra Op-
erations on GPUs with Application in Model Reduction,
Lecture Notes in Computer Science, vol. 8584. Springer
(2014)
6. Benner, P., Ezzatti, P., Quintana-Ort´ı, E.S., Remo´n, A.:
Matrix inversion on CPU–GPU platforms with applica-
tions in control theory. Concurrency and Computation:
Practice and Experience 25(8), 1170–1182 (2013)
7. Benner, P., Ku¨rschner, P., Saak, J.: Eﬃcient handling of
complex shift parameters in the low-rank Cholesky fac-
tor ADI method. Numerical Algorithms 62(2), 225–251
(2013)
8. Benner, P., Ku¨rschner, P., Saak, J.: Self-generating and
eﬃcient shift parameters in ADI methods for large Lya-
punov and Sylvester equations. Electronic Transactions
on Numerical Analysis 43, 142–162 (2014)
9. Cuthill, E., McKee, J.: Reducing the bandwidth of sparse
symmetric matrices. In: Proceedings of the 1969 24th
Unleashing GPU Acceleration for Symmetric Band Linear Algebra Kernels and Model Reduction 11
National Conference, ACM ’69, pp. 157–172. ACM, New
York, NY, USA (1969)
10. Du Croz, J., Mayes, P., Radicati, G.: Factorization of
band matrices using level 3 BLAS. LAPACK Working
Note 21, Technical Report CS-90-109, University of Ten-
nessee (1990)
11. Dufrechou, E., Ezzatti, P., Quintana-Ort´ı, E., Remo´n,
A.: Eﬃcient symmetric band matrix-matrix multiplica-
tion on GPUs. Communications in Computer and Infor-
mation Science 485, 1–12 (2014)
12. Farber, R.: CUDA application design and development.
Morgan Kaufmann (2011)
13. IMTEK: Oberwolfach model reduction benchmark col-
lection. http://www.imtek.de/simulation/benchmark/
14. Khronos group: URL http://www.khronos.org/opencl
15. Kirk, D., Hwu, W.: Programming Massively Parallel Pro-
cessors, Second Edition: A Hands-on Approach. Morgan
Kaufmann (2012)
16. OpenACC.org: URL http://www.openacc-standard.org
17. Peise, E., Bientinesi, P.: Performance modeling for dense
linear algebra. In: Proceedings of the 2012 SC Compan-
ion: High Performance Computing, Networking Storage
and Analysis, SCC ’12, pp. 406–416. IEEE Computer So-
ciety, Washington, DC, USA (2012)
18. Penzl, T.: A cyclic low-rank Smith method for large
sparse Lyapunov equations. SIAM J. Sci. Comput. 21(4),
1401–1418 (1999)
19. The Top500 list: Available at http://www.top500.org
(2013)
20. Volkov, V., Demmel, J.: LU, QR and Cholesky
factorizations using vector capabilities of GPUs.
Tech. Rep. UCB/EECS-2008-49, EECS Department,
University of California, Berkeley (2008). URL
http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-
2008-49.html
12 Peter Benner et al.
1 2 4
0
0.5
1
1.5
2
2.5
3
3.5
4
Percentage
s
e
c
o
n
ds
 
 
MKLbuen
GPUv1buen
GPUv2buen
MKL
enrico
GPUv1
enrico
GPUv2
enrico
(a) n = 38, 400
1 2 4
0
2
4
6
8
10
12
14
16
Percentage
s
e
c
o
n
ds
 
 
MKLbuen
GPUv1buen
GPUv2buen
MKL
enrico
GPUv1
enrico
GPUv2
enrico
(b) n = 51, 200
1 2 4
0
2
4
6
8
10
12
14
16
18
Percentage
se
co
nd
s
 
 
MKLbuen
GPUv1buen
GPUv2buen
MKL
enrico
GPUv1
enrico
GPUv2
enrico
(c) n = 64, 000
1 2 4
0
5
10
15
20
25
30
35
Percentage
se
co
n
ds
 
 
MKLbuen
GPUv1buen
GPUv2buen
MKL
enrico
GPUv1
enrico
GPUv2
enrico
(d) n = 76, 800
1 2 4
0
5
10
15
20
25
30
35
40
45
Percentage
se
co
nd
s
 
 
MKLbuen
GPUv1buen
GPUv2buen
MKL
enrico
GPUv1
enrico
GPUv2
enrico
(e) n = 89, 600
1 2 4
0
10
20
30
40
50
60
70
80
90
Percentage
s
e
c
o
n
ds
 
 
MKLbuen
GPUv1buen
GPUv2buen
MKL
enrico
GPUv1
enrico
GPUv2
enrico
(f) n = 102, 400
1 2
0
5
10
15
20
25
Percentage
s
e
c
o
n
ds
 
 
MKLbuen
GPUv1buen
GPUv2buen
MKL
enrico
GPUv1
enrico
GPUv2
enrico
(g) n = 115, 200
1 2
0
5
10
15
20
25
30
35
Percentage
se
co
n
ds
 
 
MKLbuen
GPUv1buen
GPUv2buen
MKL
enrico
GPUv1
enrico
GPUv2
enrico
(h) n = 128, 000
Fig. 7 Performance of the two versions of the hybrid CPU+GPU s.p.d. band system solver and Intel’s MKL implementation.
