Block Distributed Schur Complement Preconditioners for CFD Computations on Many-Core Systems by Basermann, Achim & Zöllner, Melven
Slide 1
20120217-1 SIAM PP2012 Basermann.ppt 
Block Distributed Schur Complement Preconditioners for 
CFD Computations on Many-Core Systems 
Dr.-Ing. Achim Basermann, Melven Zöllner** 
 
German Aerospace Center (DLR) 
Simulation- and Software Technology 
Distributed Systems and Component Software 
Porz-Wahnheide, Linder Höhe, D-51147 Cologne, Germany 
 
achim.basermann@dlr.de  
**also RWTH Aachen University 
Slide 2
DLR 
Project Management Agency 
Slide 3
Locations and employees
Germany: 6,900 employees across 33 research institutes and facilities at 15 sites.
Offices in Brussels, Paris and Washington.
Slide 4
Survey 
CFD computations at DLR 
 
Storage schemes for sparse matrices 
 
The Distributed Schur Complement method (DSC) 
 
Experiments with TRACE and TAU matrices 
 
Conclusions and future work 
Slide 5
Parallel Simulation System TRACE 
TRACE: Turbo-machinery Research 
Aerodynamic Computational Environment 
 
Developed by the Institute for Propulsion 
Technology of the German Aerospace 
Center (DLR-AT) 
 
Calculates internal turbo-machinery flows 
 
Finite volume method with block-structured 
grids 
 
The linearized TRACE modules require the
parallel, preconditioned iterative solution
of large, sparse, non-symmetric, real or
complex systems of linear equations
Slide 6
Preconditioners for TAU: Background
TAU: developed for the aerodynamic design of aircraft by the DLR Institute of
Aerodynamics and Flow Technology
 
Unstructured RANS (Reynolds-averaged Navier-Stokes) solver based on finite
volumes
 
Requires the parallel, preconditioned iterative solution of large, sparse,
real, non-symmetric systems of linear equations
 
Solvers used: geometric multigrid, AMG-preconditioned GMRes
 
Here: experiments with DSC methods
Slide 7
Storage Schemes for Sparse Matrices
Compressed Row Storage (CSR) and Block Compressed Row Storage (BCSR)
[Figure not recoverable from the extraction: an example matrix and its non-zero values stored row-wise]
TRACE and TAU apply BCSR with 5x5 blocks.
Advantage: less indirect addressing.
Disadvantage: a few zeros are stored explicitly.
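The trade-off between the two storage schemes can be illustrated with a minimal sketch in plain Python. The function names and the 4x4 toy matrix with 2x2 blocks are assumptions made for illustration; they are not taken from TRACE or TAU, which use 5x5 blocks. The point is that CSR needs one column-index lookup per stored value, while BCSR needs one per block but stores every entry of a block that contains at least one non-zero.

```python
def dense_to_csr(a):
    """Return (values, col_idx, row_ptr) for the non-zeros of a dense matrix."""
    values, col_idx, row_ptr = [], [], [0]
    for row in a:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def dense_to_bcsr(a, bs):
    """Return (blocks, block_col_idx, block_row_ptr) with dense bs x bs blocks.

    A block is stored as soon as it contains one non-zero, so some zeros
    end up being stored explicitly.
    """
    n = len(a)
    assert n % bs == 0 and all(len(row) == n for row in a)
    blocks, block_col_idx, block_row_ptr = [], [], [0]
    for bi in range(0, n, bs):
        for bj in range(0, n, bs):
            block = [a[bi + r][bj:bj + bs] for r in range(bs)]
            if any(v != 0.0 for brow in block for v in brow):
                blocks.append(block)
                block_col_idx.append(bj // bs)
        block_row_ptr.append(len(blocks))
    return blocks, block_col_idx, block_row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """y = A x for a matrix in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]  # one index lookup per value
    return y


def bcsr_matvec(blocks, block_col_idx, block_row_ptr, x, bs):
    """y = A x for a matrix in BCSR form with bs x bs blocks."""
    y = [0.0] * ((len(block_row_ptr) - 1) * bs)
    for bi in range(len(block_row_ptr) - 1):
        for k in range(block_row_ptr[bi], block_row_ptr[bi + 1]):
            col0 = block_col_idx[k] * bs       # one index lookup per block
            for r in range(bs):
                for c in range(bs):
                    y[bi * bs + r] += blocks[k][r][c] * x[col0 + c]
    return y


# Toy matrix (hypothetical, for illustration only).
A = [
    [4.0, 1.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 2.0],
    [0.0, 0.0, 5.0, 0.0],
    [1.0, 0.0, 0.0, 6.0],
]
x = [1.0, 2.0, 3.0, 4.0]
csr = dense_to_csr(A)
bcsr = dense_to_bcsr(A, 2)
```

For this toy matrix, CSR stores 7 values with 7 column indices, whereas BCSR stores 4 blocks (16 values, 9 of them zeros) with only 4 block column indices. Both `csr_matvec(*csr, x)` and `bcsr_matvec(*bcsr, x, 2)` give the same product.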
DSC Method (2)
DSC algorithm: schematic view on each processor
[Schematic not recoverable from the extraction]
BiCGstab or GMRes iteration for the local interface rows (unknowns)
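The slide that derives the method did not survive the extraction. As a reminder of what the interface iteration solves, the following is the usual formulation of the Distributed Schur Complement method from the literature (Saad and Sosonkina); the symbols are the conventional ones and are an assumption here, not notation taken from the slides. On processor i, the local unknowns are ordered as interior unknowns u_i followed by local interface unknowns y_i, and N_i denotes the neighbouring processors.

```latex
% Local equations on processor i
\begin{pmatrix} B_i & F_i \\ E_i & C_i \end{pmatrix}
\begin{pmatrix} u_i \\ y_i \end{pmatrix}
+
\begin{pmatrix} 0 \\ \sum_{j \in N_i} E_{ij}\, y_j \end{pmatrix}
=
\begin{pmatrix} f_i \\ g_i \end{pmatrix}

% Eliminate the interior unknowns: u_i = B_i^{-1} (f_i - F_i y_i)
S_i\, y_i + \sum_{j \in N_i} E_{ij}\, y_j = g_i - E_i B_i^{-1} f_i,
\qquad
S_i = C_i - E_i B_i^{-1} F_i
```

The second line is a system in the interface unknowns only. It is this system that is iterated with BiCGstab or GMRes; the interior unknowns u_i are then recovered locally from the first block row.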
Slide 10
DSC Method (3) 
Preconditioning within the DSC algorithm 
Slide 11
Hardware System 
RWTH Bull HPC cluster 
 
Intel Westmere X5675 CPUs 
6 cores per CPU at 3.06 GHz
12 cores (2 CPUs) per node 
 
 
Computations with 1 MPI process per core 
Slide 12
Experiments: CSR versus BCSR Format 
Block-Jacobi-ILU preconditioning with 12 processes 
[Two plots not recoverable from the extraction; panel titles: ILU construction, Iterations]
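Block-Jacobi-ILU preconditioning can be sketched in a few lines of plain Python. The slides do not say which ILU variant or which implementation was used, so this sketch assumes ILU(0) (no fill-in outside the non-zero pattern), stores matrices as dense lists for brevity, and models the processes as loop iterations. All function names are hypothetical.

```python
def ilu0(a):
    """ILU(0): LU factors restricted to the non-zero pattern of a."""
    n = len(a)
    lu = [row[:] for row in a]
    for i in range(1, n):
        for k in range(i):
            if a[i][k] == 0.0:
                continue                      # outside the pattern: no fill-in
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                if a[i][j] != 0.0:            # update only pattern entries
                    lu[i][j] -= lu[i][k] * lu[k][j]
    return lu


def lu_solve(lu, b):
    """Solve L U z = b; unit lower L and upper U are stored together in lu."""
    n = len(lu)
    z = b[:]
    for i in range(n):                        # forward substitution
        for j in range(i):
            z[i] -= lu[i][j] * z[j]
    for i in reversed(range(n)):              # backward substitution
        for j in range(i + 1, n):
            z[i] -= lu[i][j] * z[j]
        z[i] /= lu[i][i]
    return z


def block_jacobi_ilu(a, block_sizes):
    """Factor each diagonal block on its own; couplings between blocks are dropped."""
    factors, start = [], 0
    for size in block_sizes:
        rows = range(start, start + size)
        local = [[a[i][j] for j in rows] for i in rows]
        factors.append((start, ilu0(local)))
        start += size
    return factors


def apply_preconditioner(factors, r):
    """Return z = M^{-1} r; each block solve is independent of the others."""
    z = [0.0] * len(r)
    for start, lu in factors:
        size = len(lu)
        z[start:start + size] = lu_solve(lu, r[start:start + size])
    return z
```

Because every block is factored and applied without reference to the others, the preconditioner needs no communication, which is what makes it a natural baseline for the DSC method. The price is that the couplings between blocks are ignored, so the iteration count tends to grow as the number of blocks increases.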
Slide 13
Experiments: Strong Scaling, Iterations 
Experiments: Strong Scaling, Time 
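Strong scaling keeps the problem size fixed while the number of processes grows. The measured times from the plots are not recoverable, so none are reproduced here; the sketch below only shows how speedup and parallel efficiency are usually derived from such timings. The function names are hypothetical.

```python
def speedup(t_ref, t_p):
    """Speedup of a run taking t_p seconds relative to a reference run taking t_ref."""
    return t_ref / t_p


def parallel_efficiency(t_ref, p_ref, t_p, p):
    """Efficiency relative to the reference run on p_ref processes.

    A value of 1.0 means ideal strong scaling: the time drops by the same
    factor by which the process count grows.
    """
    return speedup(t_ref, t_p) * p_ref / p
```

For example, halving the solver time when doubling the number of processes gives an efficiency of 1.0, while an unchanged time gives 0.5.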
linearTRACE Performance: Internal versus DSC Solver
(2x Intel Xeon E5520 with 4 cores each, 2.26 GHz)
[Plot not recoverable from the extraction; surviving labels follow.]
internal solver (gmres100, ssor(0.7,3))
dsc2011 solver for linearTRACE
(8 processes, test case "THD stator": dim = 0.8 million, nnz = 90 million)
Slide 16
Conclusions 
Using the BCSR format significantly outperforms using the CSR
format for real TRACE and TAU problems.
 
The DSC method achieves higher scalability and faster iteration
than block-local methods.
 
The DSC method is very suitable for TRACE and TAU problems.
 
Future work
Hybrid parallelization is an appropriate way to further improve
scalability.
Slide 17
Questions? 
Slide 18
DSC Solver: CSR versus BCSR Format
(2x Intel Xeon E5520 with 4 cores each, 2.26 GHz)
[Chart not recoverable from the extraction; annotated numbers of iterations: 76, 40, 32, 29]
Slide 19
DSC Method: Effect of the Interface Iteration (real) 
threshold = 10^-3
[Plot not recoverable from the extraction. x-axis: interface iterations (0 to 10); y-axis: solver iteration time in seconds; curves: interf-bicgstab, bs=1; interf-gmres, bs=1; interf-bicgstab, bs=5; interf-gmres, bs=5]
