Performance Engineering for Sparse Eigensolers on Heterogenous Clusters by Thies, Jonas et al.
www.DLR.de • Chart 1 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
Performance Engineering for Sparse Eigensolvers on
Heterogenous Clusters
Jonas Thies Melven Ro¨hrig-Zo¨llner Moritz Kreutzer
Achim Basermann Georg Hager Gerhard Wellein
German Aerospace Center
Simulation and Software Technology
and University of Erlangen
project ESSEX
www.DLR.de • Chart 2 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
Motivation
Mathematical problem
• Find 20− 50 eigenpairs
Axi = λixi
of a large, sparse matrix A
• interior or extreme λi
• symmetric or general A
Memory gap
• small memory bandwidth vs.
high peak flop rate
→ increase the compute intensity
Roofline performance model
(2x 12 core Haswell EP)
1
4
16
64
256
1024
1/4 1 4 16 64
Pe
ak
ba
nd
wi
dt
h
Peak Flop/s
BLAS1 (ddot)
D
P
G
F
lo
p
/s
Compute intensity [Flop / DP element]
www.DLR.de • Chart 3 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
Block JDQR Method
Block Jacobi-Davidson correction equation
• nb current approximations: Av˜i − λ˜i v˜i = ri , i = 1, . . . , nb
• previously converged Schur vectors
(
q1, . . . , qk
)
= Q
• solve approximately (with Q˜ =
(
Q v˜1 . . . v˜nb
)
):
(I − Q˜Q˜T )(A− λ˜i I )(I − Q˜Q˜T )xi = −ri i = 1, . . . , nb
• use some steps of a block(ed) iterative solver
• orthogonalize new directions x1, . . . , xnb (outer subspace iteration)
Properties (compared to single-vector method)
• usually needs more operations → shunned in practice
• more cache-friendly, fewer global operations
www.DLR.de • Chart 4 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
Numerical Behavior
Block size 2
1
1.5
2
2.5
3
3.5
5 10 20 30 40 50
re
la
ti
ve
ov
er
h
ea
d
(#
sp
M
V
M
)
# eigenvalues found
Andrews
cfd1
finan512
torsion1
ck656
cry10000
dw8192
rdb3200l
Block size 4
1
1.5
2
2.5
3
3.5
5 10 20 30 40 50
re
la
ti
ve
ov
er
h
ea
d
(#
sp
M
V
M
)
# eigenvalues found
from: Ro¨hrig-Zo¨llner et al. SISC 2015
www.DLR.de • Chart 5 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
Software I:
(General Hybrid and Optimized Sparse Toolkit) provides
• intelligent resource management for heterogenous systems
• automatic pinning of threads to cores
• asynchronous execution of (larger) tasks
• some fully optimized kernels for sparse matrix methods
• sparse matrix-(multi)vector multiplication (spM(M)VM)
• ‘tall and skinny’ matrices in row or column major ordering
• target platforms right now: Intel CPUs, Xeon Phi and Nvidia GPUs
• programming model: ‘MPI+X’,
with X=SIMD intrinsics, OpenMP and CUDA
www.DLR.de • Chart 6 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
MPI+X with
• System with multiple CPUs
(NUMA domains) and GPUs
• -np 1: use entire CPU
• -np 2: use CPU and first GPU
• -np 3: use CPU and both GPUs
• -np 4: use one process per socket
and one for each GPU
Option: distribute problem according
to memory bandwidth measured
www.DLR.de • Chart 6 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
MPI+X with
• System with multiple CPUs
(NUMA domains) and GPUs
• -np 1: use entire CPU
• -np 2: use CPU and first GPU
• -np 3: use CPU and both GPUs
• -np 4: use one process per socket
and one for each GPU
Option: distribute problem according
to memory bandwidth measured
www.DLR.de • Chart 6 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
MPI+X with
• System with multiple CPUs
(NUMA domains) and GPUs
• -np 1: use entire CPU
• -np 2: use CPU and first GPU
• -np 3: use CPU and both GPUs
• -np 4: use one process per socket
and one for each GPU
Option: distribute problem according
to memory bandwidth measured
www.DLR.de • Chart 6 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
MPI+X with
• System with multiple CPUs
(NUMA domains) and GPUs
• -np 1: use entire CPU
• -np 2: use CPU and first GPU
• -np 3: use CPU and both GPUs
• -np 4: use one process per socket
and one for each GPU
Option: distribute problem according
to memory bandwidth measured
www.DLR.de • Chart 6 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
MPI+X with
• System with multiple CPUs
(NUMA domains) and GPUs
• -np 1: use entire CPU
• -np 2: use CPU and first GPU
• -np 3: use CPU and both GPUs
• -np 4: use one process per socket
and one for each GPU
Option: distribute problem according
to memory bandwidth measured
www.DLR.de • Chart 6 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
MPI+X with
• System with multiple CPUs
(NUMA domains) and GPUs
• -np 1: use entire CPU
• -np 2: use CPU and first GPU
• -np 3: use CPU and both GPUs
• -np 4: use one process per socket
and one for each GPU
Option: distribute problem according
to memory bandwidth measured
www.DLR.de • Chart 7 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
What is NOT
• a DSL for programming heterogenous hardware
• easily portable to platforms other than Intel and Nvidia
• easy to integrate in existing code
• a mature library
=⇒ For implementing iterative solvers we use an interface layer (up next)
www.DLR.de • Chart 8 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
Software II: PHIST
a Pipelined Hybrid-parallel Iterative Solver Toolkit
• facilitate algorithm development using
• holistic performance engineering
• portability and interoperability
application vertical integration
algorithms
preconditioners
computational core
«abstraction»
eigenproblem
setup/apply
sparseMat mVec sdMat
solver templates
FT strategies
algo core
«interface»
kernel interface h
ol
is
ti
c
pe
rf
or
m
an
ce
en
gi
ne
er
in
g
C wrapper
adapter
www.DLR.de • Chart 9 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
Useful abstraction: kernel interface
Choose from several ‘backends’ at compile time, to
• easily use PHIST in existing applications
• perform the same run with different kernel libraries
• compare numerical accuracy and performance
• exploit unique features of a kernel library (e.g. preconditioners)
www.DLR.de • Chart 10 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
Cool features of PHIST
Task macros
out-of-order execution of code blocks
• overlap comm. and comp.
• asynchronous checkpointing
• ...
Consistent random vectors
make PHIST runs comparable
• across platforms (CPU, GPU...)
• across kernel libraries
• independent of #procs, #threads
PerfCheck:
print achieved roofline performance of
kernels after complete run to reveal
• deficiencies of kernel lib
• implemntation issues of algorithm
(strided data access etc.)
Special-purpose operations
• fused kernels, e.g. compute
Y = αAX + βY and Y TX
• highly accurate core functions, e.g.
block orthogonalization in simulated
quad precision
www.DLR.de • Chart 11 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
Sparse matrix-vector multiplication (in a Chebyshev solver)
1 64 256 10244 16
Number of heterogeneous nodes
0.1
1
10
100
Pe
rfo
rm
an
ce
 in
 T
flo
p/
s
100% Parallel Efficiency
Square, Weak Scaling
Bar, Weak Scaling
Square, Strong Scaling
Weak and strong scaling
(on Piz Daint @ CSCS Lugano)
from: Kreutzer et al. IPDPS’15
SELL-C-σ sparse matrix storage format for heterogenous systems
www.DLR.de • Chart 12 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
‘Tall & skinny’ kernel performance (V ∈ R10M×40,W ∈ R10M×4)
0
20
40
60
80
100
120
SVQB project shrink
%
of
ro
ofl
in
e
p
er
fo
rm
an
ce
10 core Intel(R) IvyBridge
Nvidia(R) Tesla K20m
C4×4 =W TW
W =W ·C4×4 C40×4 =V TW
W =W −V ·C40×4
V:,1:20 =V ·C40×20
⇒ some fallback kernels needed on GPU, further experiments postponed
www.DLR.de • Chart 13 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
Strong scaling performance
Setup
• non-symmetric matrix from
7-point 3D PDE discretization
(n ≈ 1.3 · 108, nnz ≈ 9.4 · 108)
• find 20 eigenvalues
• Ivy Bridge Cluster
Results
• nb = 2: significantly faster
• nb = 4: no further improvement 0
1000
2000
3000
4000
5000
8 16 32 64 128
nodes
nb=1
nb=2
nb=4
www.DLR.de • Chart 14 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
Block method faster for various matrices
Setup
• different large matrices from
• Quantum physics
• PDE discretization
• find 20 outmost eigenvalues
using (block) Jacobi-Davidson
• block size nb = 2 (similar for 4)
Results
• typically faster by a factor 1.2
• less synchronization but larger
messages during spMMVM
0
0.2
0.4
0.6
0.8
1
1.2
1.4
8 16 32 64 128
sp
ee
d
u
p
th
ro
u
gh
b
lo
ck
in
g
nodes
SpinSZ[28]
tV[28]
Hubbard[16]
BosHub[22]
MATPDE[16k]
MATPDE3D[512]
www.DLR.de • Chart 15 > SIAM PP’16 > J. Thies et al. • Performance Engineering > 2016-04-12
Further information
and PHIST are developed within the DFG (SPPEXA) funded
project ESSEX (Equipping Sparse Solvers for the EXa-scale).
• project website incl. list of publications:
http://blogs.fau.de/essex/
• source code: https://bitbucket.org/essex/[ghost|phist]
We are happy to collaborate on
building blocks, algorithms and applications
and support ‘friendly users’ !
Contact: Jonas.Thies@DLR.de
