Holistic Performance Engineering for Sparse Iterative Solvers by Thies, Jonas et al.
www.DLR.de • Chart 1 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Holistic Performance Engineering for Sparse Iterative Solvers
Jonas Thies? Dominik Ernst† Melven Röhrig-Zöllner?
? Institute of Simulation and Software Technology
German Aerospace Center (DLR)
† Erlangen Regional Computing Center (RRZE)
University of Erlangen-Nuremberg
project ESSEX
www.DLR.de • Chart 2 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Sparse Eigenvalue Problems
Formulation Find some Eigenpairs
(λj , vj) of a large and sparse matrix
(pair) in a target region of the
spectrum
Avj = λjBvj
• A Hermitian or general, real or
complex
• B may be identity matrix (or
not)
• ‘some’ may mean ‘quite a few’,










Driven cavity Rayleigh-Benard convection
www.DLR.de • Chart 3 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18




Block GMRES ⇐⇒ Block Krylov-Schur
Jacobi-Davidson ⇐⇒ inexact Newton
etc.
Typical algorithmic pattern:
• obtain new search direction(s) vj (e.g. by using a matvec or solving a
system),
• orthogonalize against previous vectors V = [v0 . . . vj−1 and expand
V ← [V , vj ],
• solve projected problem for V TAV directly.
www.DLR.de • Chart 4 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Performance of Sparse Matrix Algorithms
Typical operations are
memory-bounded:
• ‘spMVM’ y ← A · x ,
• vector operations, s ← xT y ,
x ← αx + βy
... unless the data sets are small:
• CPU/KNL: OpenMP overhead
≈ 25µs
• GPU: launch latency ≈ 35µs
www.DLR.de • Chart 5 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Why Performance Engineering?
simple(?) operation: C = V TV ,V ∈ R1M×4 on an Intel Haswell CPU
www.DLR.de • Chart 5 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Why Performance Engineering?
simple(?) operation: C = V TV ,V ∈ R1M×4 on an Intel Haswell CPU
www.DLR.de • Chart 5 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Why Performance Engineering?
simple(?) operation: C = V TV ,V ∈ R1M×4 on an Intel Haswell CPU
www.DLR.de • Chart 5 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Why Performance Engineering?
simple(?) operation: C = V TV ,V ∈ R1M×4 on an Intel Haswell CPU
www.DLR.de • Chart 6 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Test Hardware
• “Skylake”: Intel Xeon Scalable, 4× 14 cores @2.6GHz, 384 GB DDR4 RAM
• “KNL”: Intel Xeon Phi, 64 cores @1.4GHz, 16 GB HBM (cache mode)
• “Volta”: NVidia Tesla V100-SXM2 GPU, 16 GB HBM2 (+UVM)
benchmark Skylake KNL Volta
load 360 338 812
store 200 167 883
triad 260 315 843
Measured streaming memory
bandwidth [GB/s]
www.DLR.de • Chart 6 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Test Hardware
• “Skylake”: Intel Xeon Scalable, 4× 14 cores @2.6GHz, 384 GB DDR4 RAM
• “KNL”: Intel Xeon Phi, 64 cores @1.4GHz, 16 GB HBM (cache mode)
• “Volta”: NVidia Tesla V100-SXM2 GPU, 16 GB HBM2 (+UVM)
benchmark Skylake KNL Volta
load 360 338 812
store 200 167 883
triad 260 315 843
Measured streaming memory
bandwidth [GB/s]
www.DLR.de • Chart 6 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Test Hardware
• “Skylake”: Intel Xeon Scalable, 4× 14 cores @2.6GHz, 384 GB DDR4 RAM
• “KNL”: Intel Xeon Phi, 64 cores @1.4GHz, 16 GB HBM (cache mode)
• “Volta”: NVidia Tesla V100-SXM2 GPU, 16 GB HBM2 (+UVM)
benchmark Skylake KNL Volta
load 360 338 812
store 200 167 883
triad 260 315 843
Measured streaming memory
bandwidth [GB/s]









www.DLR.de • Chart 6 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Test Hardware
• “Skylake”: Intel Xeon Scalable, 4× 14 cores @2.6GHz, 384 GB DDR4 RAM
• “KNL”: Intel Xeon Phi, 64 cores @1.4GHz, 16 GB HBM (cache mode)
• “Volta”: NVidia Tesla V100-SXM2 GPU, 16 GB HBM2 (+UVM)
benchmark Skylake KNL Volta
load 360 338 812
store 200 167 883
triad 260 315 843
Measured streaming memory
bandwidth [GB/s]
nb 1M 2M 4M 8M 16M 32M
1 12 23 37 58 78 83
2 31 35 53 68 81 88
4 34 53 66 83 88 95
8 51 70 85 87 99 100
“% roofline” of XTY ,X ,Y ∈ RN×nb on Volta.
www.DLR.de • Chart 7 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18

























nb 1M 2M 4M 8M 16M 32M
1 12 23 37 58 78 83
2 31 35 53 68 81 88
4 34 53 66 83 88 95
8 51 70 85 87 99 100
• 1000 CG iterations
• 3D 7-point Laplacian
grid N memory
2563 16.7M 2.2 GB
5123 132M 18 GB
www.DLR.de • Chart 8 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Increasing the Flop Intensity
Aim: push operations to
the right
Block solvers (block size nb)
• inner product =⇒ factor n2b more flops
• vector updates remain BLAS1 (X ← X + αY )
• Caveat: may increase number of iterations
• Example: block GMRES for multiple RHS
Kernel fusion
• example: compute Y ← AX and simultaneously
C = XTY ‘for free’
• requires specialized kernels
• no deterioration of numerics
Mixed Precision (work in progress)
• store (block) vectors in single precision
• compute in double to maintain numerical stability
• also allows larger problems on GPU
www.DLR.de • Chart 8 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Increasing the Flop Intensity
Aim: push operations to
the right
Block solvers (block size nb)
• inner product =⇒ factor n2b more flops
• vector updates remain BLAS1 (X ← X + αY )
• Caveat: may increase number of iterations
• Example: block GMRES for multiple RHS
Kernel fusion
• example: compute Y ← AX and simultaneously
C = XTY ‘for free’
• requires specialized kernels
• no deterioration of numerics
Mixed Precision (work in progress)
• store (block) vectors in single precision
• compute in double to maintain numerical stability
• also allows larger problems on GPU
www.DLR.de • Chart 8 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Increasing the Flop Intensity
Aim: push operations to
the right
Block solvers (block size nb)
• inner product =⇒ factor n2b more flops
• vector updates remain BLAS1 (X ← X + αY )
• Caveat: may increase number of iterations
• Example: block GMRES for multiple RHS
Kernel fusion
• example: compute Y ← AX and simultaneously
C = XTY ‘for free’
• requires specialized kernels
• no deterioration of numerics
Mixed Precision (work in progress)
• store (block) vectors in single precision
• compute in double to maintain numerical stability
• also allows larger problems on GPU
www.DLR.de • Chart 9 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Example: Block Orthogonalization
Problem definition
• Given orthogonal vectors
(
w1, . . . ,wk
)
= W
• For X ∈ Rn×nb find orthogonal Y ∈ Rn×n˜b with
YR1 = X −WR2, and W TY = 0
Two phase algorithms
Phase 1 Project: X¯ ← (I −WW T )X
Phase 2 Orthogonalize: Y ← f (X¯ )
• suitable f :
• SVQB (Stathopoulos and Wu, SISC 2002)
• TSQR (Demmel et al., SISC 2012)
• Each phase messes with the accuracy of the other. → iterate
www.DLR.de • Chart 10 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Block Orthogonalization with Kernel Fusion
Rearrange and fuse operations to reduce memory traffic:
Phase 2 X¯ ← XM¯, N ←W T X¯
Phase 1 X¯ ← X −WN, M ← X¯T X¯
Phase 3 X¯ ← XM¯, M ← X¯T X¯
⇒ use SVQB or Cholesky-QR
Increased precision
Idea Calculate value and error of each arithmetic operation
• Store intermediate results as double-double (DD) numbers
• Based on arithmetic building blocks (2Sum, 2Mult)
Muller et al.: Handbook of Floating-Point Arithmetic, Springer 2010
• Exploit FMA operations (AVX2)
www.DLR.de • Chart 11 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18




























" " with fused operations
" " with fused DD operations
Orthog. nb vectors against a block of nproj , 12-core Haswell CPU
www.DLR.de • Chart 12 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Software I: our Kernel Library
General, Hybrid-parallel and
Optimized Sparse Toolkit
• provides memory-bounded kernels for sparse solvers
• data structures:
• row- or col-major block vectors
• SELL-C − σ for sparse matrices
• written (mostly) in C
• ‘MPI+X’ with X OpenMP, CUDA and SIMD intrinsics
• runs on Peta-scale systems (Piz Daint, Oakforest-PACS)
• can use heterogeneous systems (e.g. including CPUs, MIC and GPUs)
https://bitbucket.org/essex/ghost
www.DLR.de • Chart 13 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Software II: Algorithms and Integration Framework
PHIST Pipelined, Hybrid-parallelIterative Solver Toolkit
• Interfaces: C, C++, Fortran,
Python
• testing and benchmarking tools
• includes performance models
• various linear and eigensolvers
Select ‘backend’ at compile time:
, builtin (Fortran), , PETSc
https://bitbucket.org/essex/phist
www.DLR.de • Chart 14 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Example: Anasazi Block Krylov-Schur on Skylake CPU
Matrix: non-symmetric 7-point stencil, N = 1283
(var. coeff. reaction/convection/diffusion)













8.83 8.15 7.36 8.1







Tpetra, their orthog Tpetra, PHIST orthog
GHOST, their orthog GHOST, PHIST orthog
• Anasazi’s kernel interface
is mostly a subset of
PHIST’s
=⇒ extends PHIST by
e.g. BKS and LOBPCG
• not optimized for block
vectors in row-major
storage
www.DLR.de • Chart 14 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Example: Anasazi Block Krylov-Schur on Volta GPU
Matrix: non-symmetric 7-point stencil, N = 1283
(var. coeff. reaction/convection/diffusion)

















• Anasazi’s kernel interface
is mostly a subset of
PHIST’s
=⇒ extends PHIST by
e.g. BKS and LOBPCG
• not optimized for block
vectors in row-major
storage
www.DLR.de • Chart 14 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Example: Anasazi Block Krylov-Schur on Volta GPU
Matrix: non-symmetric 7-point stencil, N = 1283
(var. coeff. reaction/convection/diffusion)




















Tpetra, their orthog Tpetra, PHIST orthog
• Anasazi’s kernel interface
is mostly a subset of
PHIST’s
=⇒ extends PHIST by
e.g. BKS and LOBPCG
• not optimized for block
vectors in row-major
storage
www.DLR.de • Chart 14 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Example: Anasazi Block Krylov-Schur on Volta GPU
Matrix: non-symmetric 7-point stencil, N = 1283
(var. coeff. reaction/convection/diffusion)























Tpetra, their orthog Tpetra, PHIST orthog
GHOST, their orthog
• Anasazi’s kernel interface
is mostly a subset of
PHIST’s
=⇒ extends PHIST by
e.g. BKS and LOBPCG
• not optimized for block
vectors in row-major
storage
www.DLR.de • Chart 14 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Example: Anasazi Block Krylov-Schur on Volta GPU
Matrix: non-symmetric 7-point stencil, N = 1283
(var. coeff. reaction/convection/diffusion)
























Tpetra, their orthog Tpetra, PHIST orthog
GHOST, their orthog GHOST, PHIST orthog
• Anasazi’s kernel interface
is mostly a subset of
PHIST’s
=⇒ extends PHIST by
e.g. BKS and LOBPCG
• not optimized for block
vectors in row-major
storage
www.DLR.de • Chart 15 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Can we do better?
PHIST PerfCheck: replace timing output by simple performance model
• Anasazi BKS with nb = 4 gives lines like this:
function(dim) / (formula) total time %roofline count
phist_Dmvec_times_sdMat_inplace(nV=4,nW=4,*iflag=0) 6.156e+00 11.7 174
STREAM_TRIAD((nV+nW)*n*sizeof(_ST_))
‘realistic’ option: report strided data accesses due to ‘views’ and adjust
perf. model
function(dim) / (formula) total time %roofline count
phist_Dmvec_times_sdMat_inplace(nV=4,nW=4,ldV=85,*iflag=0) 6.013e+00 23.8 174
STREAM_TRIAD((nV_+nW_)*n*sizeof(_ST_))
www.DLR.de • Chart 15 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Can we do better?
PHIST PerfCheck: replace timing output by simple performance model
• Anasazi BKS with nb = 4 gives lines like this:
function(dim) / (formula) total time %roofline count
phist_Dmvec_times_sdMat_inplace(nV=4,nW=4,*iflag=0) 6.156e+00 11.7 174
STREAM_TRIAD((nV+nW)*n*sizeof(_ST_))
‘realistic’ option: report strided data accesses due to ‘views’ and adjust
perf. model
function(dim) / (formula) total time %roofline count
phist_Dmvec_times_sdMat_inplace(nV=4,nW=4,ldV=85,*iflag=0) 6.013e+00 23.8 174
STREAM_TRIAD((nV_+nW_)*n*sizeof(_ST_))
www.DLR.de • Chart 16 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Block Jacobi-Davidson QR
• Use inexact Newton rather than Krylov sequence
• JDQR: subspace acceleration, locking and restart (Fokkema’99)
Block Jacobi-Davidson correction equation
• nb current approximations: Av˜i − λ˜i v˜i = ri , i = 1, . . . , nb
• previously converged Schur vectors
(
q1, . . . , qk
)
= Q
• solve approximately (with Q˜ =
(
Q v˜1 . . . v˜nb
)
):
(I − Q˜Q˜T )(A− λ˜i I )(I − Q˜Q˜T )xi = −ri i = 1, . . . , nb
• use some steps of a block(ed) iterative solver
• orthogonalize new directions x1, . . . , xnb (outer subspace iteration)
www.DLR.de • Chart 17 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
BJDQR: ‘Numerical Overhead’
block size 2 With larger block size...
• number of (outer) iteratios
decreases
• total number of operations
increases
• tested here for various matrices
www.DLR.de • Chart 17 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
BJDQR: ‘Numerical Overhead’
block size 4 With larger block size...
• number of (outer) iteratios
decreases
• total number of operations
increases
• tested here for various matrices
www.DLR.de • Chart 17 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
BJDQR: ‘Numerical Overhead’
block size 8 With larger block size...
• number of (outer) iteratios
decreases
• total number of operations
increases
• tested here for various matrices
www.DLR.de • Chart 18 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
BJDQR on Different Hardware (here: Laplace problem)













www.DLR.de • Chart 18 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
BJDQR on Different Hardware (here: Laplace problem)
















www.DLR.de • Chart 18 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
BJDQR on Different Hardware (here: Laplace problem)




















www.DLR.de • Chart 18 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
BJDQR on Different Hardware (here: Laplace problem)
























www.DLR.de • Chart 18 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
BJDQR on Different Hardware (here: Laplace problem)




























www.DLR.de • Chart 19 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Preconditioning with ML
• non-symm. PDE as before
• use AMG preconditioner ML from Trilinos
(“non-symmetric smoothed aggregation”)
problem size preconditioner iterations spMVMs ttot tgmres
1283 GMRES 447 9 875 52.6s 25.0s
GMRES+ML 26 612 21.2s 11.4s
2563 GMRES 781 17 223 1 300s 571s
GMRES+ML 40 922 346s 183s
5123 GMRES >1k >22k >1h >1h
GMRES+ML 32 746 624s 320s
www.DLR.de • Chart 20 > PASC’18 > J. Thies et al. • Sparse Solvers > PASC’18
Summary
• Two libs for high-performance sparse solvers: GHOST & PHIST
• support algorithm developer by
• kernel interface inspired by MPI
• kernels & algorithmic core operations
• built-in performance models
• Holistic performance engineering needed for sparse block solvers
• tight interplay of data structures, kernels, core and algorithm
• Example: Block Krylov-Schur with different kernels and
orthogonalization routines
Next steps:
• model that takes fast and slow memory segments into account
• demonstrate strong scaling advantage of block methods
https://bitbucket.org/essex/[ghost|phist]
