Multi-block/multi-core SSOR preconditioner for the QCD quark solver for
  K computer by Boku, T. et al.
ar
X
iv
:1
21
0.
73
98
v1
  [
he
p-
lat
]  
28
 O
ct 
20
12
Multi-block/multi-core SSOR preconditioner for the
QCD quark solver for K computer
T. Bokua, K.-I. Ishikawa∗b,c, Y. Kuramashia,c, f , K. Minamic, Y. Nakamurac, F. Shojic,
D. Takahashia, M. Teraic, A. Ukawaa, T. Yoshiéa, f
aCenter for Computational Sciences, University of Tsukuba, Tsukuba, Ibaraki 305-8577, Japan
bGraduate School of Science, Hiroshima University, Higashi-Hiroshima, Hiroshima 739-8526,
Japan
cRIKEN Advanced Institute for Computational Science, Kobe, Hyogo 650-0047, Japan
f Faculty of Pure and Applied Sciences, University of Tsukuba, Tsukuba, Ibaraki 305-8571,
Japan
E-mail: ishikawa@theo.phys.sci.hiroshima-u.ac.jp
We study the algorithmic optimization and performance tuning of the Lattice QCD clover-fermion
solver for the K computer. We implement the Lüscher’s SAP preconditioner with sub-blocking
in which the lattice block in a node is further divided to several sub-blocks to extract enough
parallelism for the 8-core CPU SPARC64TM VIIIfx of the K computer. To achieve a better
convergence property we use the symmetric successive over-relaxation (SSOR) iteration with
locally-lexicographical ordering for the sub-blocks in obtaining the block inverse. The SAP pre-
conditioner is included in the single precision BiCGStab solver of the nested BiCGStab solver.
The single precision part of the computational kernel are solely written with the SIMD oriented
intrinsics to achieve the best performance of the SPARC64TM VIIIfx on the K computer. We
benchmark the single precision BiCGStab solver on the three lattice sizes: 123×24, 243×48 and
483×96, with fixing the local lattice size in a node at 63×12. We observe an ideal weak-scaling
performance from 16 nodes to 4096 nodes. The performance of a computational kernel exceeds
50% efficiency, and the single precision BiCGstab has ∼26% susutained efficiency.
The 30th International Symposium on Lattice Field Theory
June 24 – 29, 2012
Cairns, Australia
∗Speaker.
c© Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence. http://pos.sissa.it/
Multi-block/multi-core SSOR preconditioner for the QCD quark solver for K computer K.-I. Ishikawa
1. Lattice QCD on the K computer
The K computer has been developed by RIKEN and Fujitsu since 2006 as the national leader-
ship computer of Japan to promote science and technology [1]. The construction has been finished
on July 2012 and the system is now provided to the strategic programs of the High Performance
Infrastructure (HPCI) and to the research users those who apply the computer resource through the
HPCI [2]. The strategic programs consist of five fields: (i) Predictable life science, healthcare and
drug discovery foundation, (ii) New Materials and Energy Creation, (iii) Projection of Planet Earth
Variations for Mitigating Natural Disasters, (iv) Next-generation manufacturing technology, and (v)
The origin of matter and the universe. Lattice QCD is contained in the fifth strategic program [3].
The K computer consists of over 80,000 computational nodes connected by the so called
“Tofu” network. The Tofu network topology is six dimensional topology with 3D-mesh times
3D-torus shape, with which robustness against the node failure and high node flexibility and avail-
ability are ensured. Each node has a single CPU chip called “SPARC64TM VIIIfx ”. Equipping
8 cores with SIMD enabled 256 registers and 6MB shared L2 cache, the CPU can achieve a high
efficiency for scientific applications (for the detailed system structure see [4]). Lattice QCD is one
of the suitable applications for this kind of massively parallel system architecture.
In this paper we present the algorithmic and performance tuning of the O(a) improved Wilson
quark solver for the K computer. We focus on the solver algorithm preconditioned by the domain-
decomposed Schwartz alternating procedure (SAP) with mixed precision [5]. In the next section we
briefly introduce the SAP preconditioner and the nested BiCGStab algorithm [6]. The optimization
of the algorithm and tuning methods for the computational kernels of the SAP are presented in
section 3. We give the performance benchmark results in section 4 and summarize the paper in the
last section.
2. Nested BiCGStab with Lüscher’s SAP preconditioner
Our target problem is solving the following linear equation:
Dx = b, (2.1)
where D is the clover term preconditioned O(a)-improved Wilson-Dirac operator:
Da,bα ,β (n,m) = δ
a,bδα ,β δ (n,m)−κFa,cα ,γ(n)
4
∑
µ=1
[
(1− γµ)γ ,β (Uµ(n))c,bδ (n+ µˆ ,m)
+(1+ γµ)γ ,β ((Uµ(m))b,c)∗δ (n− µˆ ,m)
]
. (2.2)
Where F(n) is the inverse clover term (1− (cSWκ/2)σµνFµν(n))−1, (n,m) are lattice site, a,b
color, α ,β are spin indexes.
Lüscher has introduced the Schwartz alternating procedure (SAP) to further preconditioning
the Eq. (2.1) [5]. By dividing the whole lattice into two colored blocks in checkerboard manner,
Eq. (2.1) can be rewritten in the following 2×2 block form:(
DEE DEO
DOE DOO
)(
xE
xO
)
=
(
bE
bO
)
, (2.3)
2
Multi-block/multi-core SSOR preconditioner for the QCD quark solver for K computer K.-I. Ishikawa
where DEE (DOO) is a block restricted operator in even-domain (odd-domain), while DEO (DOE )
contains hopping operations from odd-domain to even-domain (and vice versa). The SAP practi-
tioner MSAP is introduced as
MSAP = K
NSAP−1∑
j=0
(1−DK) j, with K =
(
AEE 0
−AOODOEAEO AOO
)
, (2.4)
where AEE (AOO) can be any approximation for (DEE)−1 ((DOO)−1). When AEE and AOO are exact,
DK becomes block triangular and is expected to be well preconditioned. In the case |DK|< 1, MSAP
converges to D−1 when NSAP → ∞.
The nested BiCGStab solver is designed to be flexible against the preconditioner changing
iteration by iteration [6] using an inner-outer strategy. The outer BiCGStab contains the inner
BiCGStab solver as the flexible preconditioner. The flexibility ensures the double precision accu-
racy of the solution vector even if we use the single precision for the inner BiCGStab solver [7].
We apply the SAP preconditioner MSAP to the inner single precision BiCGStab solver. The use of
single precision has a merit as it requires less system resources than double precision. We target
the single precision part of the solver as the tuning part for the K computer. In the following section
we focus on the tuning of the single precision DK of MSAP.
3. Performance tuning for the K computer
Inverse of block operator and OpenMP threading The approximate inverse of the block op-
erator AEE (AOO) are the important part of the SAP. The exactness is not required for the SAP,
however, the better approximation with less computational cost is preferred for AEE (AOO). The
even-odd site preconditioning has been used in [5] and the SSOR preconditioning has been used
in [8]. The latter has a better performance than the former at the same computational cost. The
SSOR preconditioning is derived by decomposing the original operator into the sum of an upper
and a lower triangular matrices. The preconditioning is achieved by solving the upper and lower
triangular parts through forward and backward substitutions. The decomposition and the efficiency
of the SSOR depend on the site ordering. It is observed that the SSOR with natural ordering have a
better performance [8]. However the SSOR with natural ordering is not suitable for the multi-core
CPU architecture because the natural ordering has a global data recurrence pattern and less paral-
lelism in the forward and backward substitutions. To extract 8 core parallelism for the K computer,
we further divide the block into 16 sub-blocks via the locally-lexicographical ordering (ll-ordering)
described in [9].
Figure 1 shows an example of 64 lattice in a node. The block in a node is divided into 24
sub-blocks. Each single core contains two sub-blocks (two sub-blocks with size 34 adjacent in
temporal direction). The numbering on the sites is the ordering according to the ll-ordering. The
arrows on the links represent the data recurrence direction. The most of the recurrence are limited
in each sub-block and there are little data reference on the surface of sub-blocks. Based on this
ordering we wrote the forward and backward solver to construct AEE (AOO) with the SSOR. The
parallelization in a node is achieved by explicit OpenMP threading. 8 threads are invoked and two
sub-blocks are assigned to each OpenMP thread. The spatial division is mapped to OpenMP 8
3
Multi-block/multi-core SSOR preconditioner for the QCD quark solver for K computer K.-I. Ishikawa
threads, and the temporal division is dedicated to the loop unrolling in a single thread. To resolve
the recurrence dependency among threads, we carefully insert explicit omp barrier’s at the sites
that require data on other threads.

ᲫᲭ
Ჯ
ᲫᲱ ᲬᲫ ᲫᲮ ᲫᲲ ᲬᲬ
ᲬᲰ ᲭᲪ ᲭᲮ
ᲫᲫ ᲫᲬ
ᲫᲯ ᲫᲳ ᲬᲭ ᲫᲰ ᲬᲪ ᲬᲮ
ᲬᲱ ᲭᲫ ᲭᲯ ᲬᲲ ᲭᲬ ᲭᲰ
ᲭᲭᲬᲯ ᲬᲳ
Წ Ჰ
Ჭ Ჱ Ხ Ჲ
Ჳ ᲫᲪ
x
y

ᲫᲭ
Ჯ
ᲫᲱ ᲬᲫ ᲫᲮ ᲫᲲ ᲬᲬ
ᲬᲰ ᲭᲪ ᲭᲮ
ᲫᲫ ᲫᲬ
ᲫᲯ ᲫᲳ ᲬᲭ ᲫᲰ ᲬᲪ ᲬᲮ
ᲬᲱ ᲭᲫ ᲭᲯ ᲬᲲ ᲭᲬ ᲭᲰ
ᲭᲭᲬᲯ ᲬᲳ
Წ Ჰ
Ჭ Ჱ Ხ Ჲ
Ჳ ᲫᲪ
x,y,z
%QTG %QTG
%QTG %QTG
t
%QTG %QTG
5RCVKCNFKTGEVKQP 5RCVKCN6GORQTCNFKT
Figure 1: Site ordering in a CPU. This is an example with a 64
block in a node. The numbering of sites is limited in the 2-D
plane for simplicity.
The sub-block size affects the
performance of the SAP. It has been
reported that the size of 24 for block
still has a gain against the even-odd
site ordering on a larger size lat-
tice [9]. Although we apply the ll-
ordering to a small lattice (corre-
sponds to a block in a node), we
observed that the use of the SSOR
with sub-blocking via ll-ordering
for AEE still has a gain against the
even-odd site ordering.
SIMDzation The SPARC64TM VIIIfx CPU has 256 double precision (64bit) FP registers per
core. Each core can issue two independent fused multiply add (FMAD) on paired registers (SIMD)
in a cycle resulting in 8 floating operations per cycle. To fully make use of these core infrastructure
we have to extract enough independent data-computation stream in the kernel code DEE and AEE .
The SIMD FMAD operation can be explicitly invoked using intrinsic functions defined in the
C/C++ language provided by Fujitsu. Although our entire program codes were developed in the
Fortran90 at first, we totally write the single precision inner BiCGStab solver using the SIMD
intrinsic functions in the C language. The details of the intrinsics are not available publicly, while
these have one-to-one correspondence to the mnemonic of the SPARC64TM VIIIfx [10]. The
linkage between the Fortran90 and the C codes is realised through the ISO_C_BINDING feature
of the Fortran 2003 which is partly included in the Fujitsu Fortran. We use double precision intrinsic
functions for the SIMD computation because the single precision FLOPS performance is identical
to that in double precision in this CPU architecture. The data are converted from single precision
to double precision (and vice verse) at the data loading (storing) from (to) memory. We employ the
SU(3) reconstruction method in the kernel code where the link variables are stored in compressed
form (keeps only two-columns of the SU(3) matrix) and the third column is reconstructed using
the unitarity condition on the fly.
SU(3)
reconst.
Flop
count
Load/Store
count (real) Real/Flop
Yes 2232 372 0.1667
No 1896 420 0.2215
Table 1: Flop count and load/store count for the
clover hopping matrix per site.
To further improve the kernel performance
we unroll the temporal direction loop, which is
the innermost site loop, by three times for DEE .
For the forward/backward solver in AEE the two
independent sub-blocking in temporal direction is
used as unrolling. This is based on the technique
of the register blocking which hide the data load
latency. The unrolling also contains branch elimination by condition merging. This loop unrolling
with huge loop body is possible thanks to the many registers of the SPARC64TM VIIIfx .
In table 1 we summarize the total flop and load/store count per site for the hopping matrix of D.
4
Multi-block/multi-core SSOR preconditioner for the QCD quark solver for K computer K.-I. Ishikawa
These numbers are estimated for bulk sites in a block (not for the surface sites). We use the Dirac
representation for γµ (γ4 is diagonal) in the spin projection, and use the chiral representation for the
inverse clover term F(n). When we multiply F(n), it is converted to the Dirac representation on
the fly. The required byte/flop is 0.667 with the SU(3) reconstruction method in single precision.
The theoretical system byte/flop is 64 [GByte/s]/128 [GFlops] = 0.5. Thus our computation is still
limited by the memory bandwidth. The maximum theoretical performance is estimated to be ∼ 96
GFlops from the system memory bandwidth of 64 GByte/s. Excluding the redundant flop count
caused by the SU(3) reconstruction, the effective performance is ∼ 82 GFlops, which is still better
than that without the SU(3) reconstruction (∼ 72 GFlops). This performance number could be
reduced or enhanced by the cache system, site loop control, conditional branch for site location,
thread controlling etc. We test and benchmark the DEE and AEE kernels.
Algorithm 1 w = DKv computation with com-
munication hiding. fE and fO are working vec-
tors.
1: xE = AEEvE
2: MPI_Isend and MPI_Irecv for wO = DOExE .
3: wE = DEExE
4: MPI_Wait for wO = DOExE .
5: fO = vO−wO
6: xO = AOO fO
7: MPI_Isend and MPI_Irecv for fE = DEOxO.
8: fO = DOOxO
9: wO = wO + fO
10: MPI_Wait for fE = DEOxO.
11: wE = wE + fE
Communication hiding The internode com-
munication in the SAP kernel DK is arranged
in the DEO and DOE kernels. The structure
of DEO (DOE) is basically organized as fol-
lows: (i) spin projection, link U†µ(m) multi-
plication and data packing, (ii) sending data,
(iii) receiving data, (iv) link Uµ(n) multipli-
cation, spin reconstruction and accumulation.
Other computation is possible during the step
(ii) and (iii).
We organize the SAP kernel DK to hide
the communication time of DEO (DOE) behind
the computation of DEE (DOO) as shown in
Alg. 1. To hide the communication we em-
ploy the non-blocking MPIs: MPI_Isend, MPI_Irecv, and MPI_Wait. The steps (i)-(ii) are done
at the lines 2 and 7, and the steps (iii)-(iv) are at the lines 4 and 10. We benchmark the commu-
nication performance in the weak-scaling test by comparing the performance of DEE and DMSAP,
where the latter contains the internode communication while the former does not.
4. Results
We benchmark the single precision BiCGStab with the SAP preconditioner on the K computer.
The lattice sizes benchmarked are 123×24, 243×48, and 483×96. The block size of the SAP is
kept fixed at 64 and the local lattice size in a node is 63× 12. The sub-block size for the SSOR is
thus 34. The number of nodes used for the benchmark is 16, 256 and 4096 nodes. The performance
is measured using the profiler provided by the Fujitsu’s compiler system.
Figure 2 shows the SIMD rate in the instruction executed in benchmarking runs. We achieve
90% SIMD rate for the kernel DEE and AEE by explicitly using the SIMD intrinsic functions.
At the solver level it reduces to 85% as it contains internode communication. Figs 3 and 4 rep-
resents the weak-scaling property of the flops performance. The efficiency (Fig. 3) is almost at
constant for the kernels resulting an ideal weak-scaling (Fig. 4) from 16 nodes to 4096 nodes.
5
Multi-block/multi-core SSOR preconditioner for the QCD quark solver for K computer K.-I. Ishikawa
 80
 90
 100
123x24 243x48 483x96
SI
M
D 
ra
te
 [%
]
Lattice size
DEEAEEDMSAP
Inner-BiCGStab
Figure 2: Weak-scaling of the SIMD rate of the
single precision kernels in a node.
 20
 30
 40
 50
 60
 70
123x24 243x48 483x96
Pe
rfo
rm
an
ce
 E
ffi
cie
nc
y 
[%
]
Lattice size
DEEAEEDMSAP
Inner-BiCGStab
Figure 3: Same as Fig. 2, but for the performance
efficiency in a node.
102
103
104
105
106
123x24 243x48 483x96
Pe
rfo
rm
an
ce
 G
FL
O
PS
Lattice size
DEEAEEDMSAP
Inner-BiCGStab
Figure 4: Weak-scaling of the total flops perfor-
mance.
The efficiency reaches over 50% for the DEE
kernel, while the AEE kernel has lower 35%.
There are two reasons for the lower per-
formance for AEE . One is the load imbalance
of the computation among the threads in AEE .
For example, the number of red arrows is more
than that of purple in the left panel of figure 1.
This means that the core #0 has much compu-
tation than that of the core #3. The second is
the less computational density in the loop body
compared to that of DEE . As mentioned in
previous section the loop of DEE is unrolled
by three times, while AEE is by two with sub-
blocking. The forward/backward substitution
of AEE only has one sided hopping operation for bulk sites. This also reduces the computa-
tional density in the loop body. The theoretical peak efficiency for DEE is estimated be 75%(=96
[GFlops]/128 [GFlops]), while the observed 50% efficiency does not reach the theoretical peak
efficiency. A reason for this could be the conditional branch to distinguish the sites in the bulk or
on the surface of the block. We still need a further investigation on the gap between the theoretical
peak efficiency and the observed efficiency.
The performance efficiency of the single precision inner solver is at ∼26% in all. We observe
an ideal weak-scaling property as shown in Fig. 4. Note that the measured performance presented
in the figures contains redundant floating operations coming from the SU(3) reconstruction and the
use of FMAD for the spin projection. We have to roughly multiply 0.8 on the Flops numbers in
Fig. 4 to exclude the redundant operations and to obtain the effective performance.
5. Summary
We have benchmarked the single-precision BiCGStab solver for the O(a)-improved Wilson
fermion on the K computer. The solver code has been successfully optimized and tuned for the
6
Multi-block/multi-core SSOR preconditioner for the QCD quark solver for K computer K.-I. Ishikawa
system. We have applied the SSOR with the ll-ordered sub-blocking for the approximate block in-
verse of the SAP preconditioner to enhance the parallelism for the multi-core architecture. Thanks
to the science specific architecture of th K computer we could achieve the ideal weak-scaling per-
formance and ∼ 26% efficiency for the single-precision BiCGStab solver. It partly remains unclear
the origin of the gap between the theoretical system performance and the measured performance.
We expect a further improvement on the internode communication by using the “Tofu” specific
optimization which is not covered in this study.
Acknowledgement This work has been done in the “collaborative research for the performance
analysis of large scale simulations on next-generation supercomputers” between RIKEN and Uni-
versity of Tsukuba. We thank the members of the next-generation Technical Computing Unit of
Fujitsu for giving us the technical advice and support, and for tuning the computational kernels.
Part of the results is obtained by using the K computer at the RIKEN Advanced Institute for Com-
putational Science (Proposal numbers hp120108, hp120153, hp120170, hp120281), T2K-Tsukuba
System in Center for Computational Sciences, University of Tsukuba, and a PC-cluster of HPCI
strategic program Field 5. This work is supported in part by Grants-in-Aid for Scientific Re-
search from the Ministry of Education, Culture, Sports, Science and Technology (Nos. 22244018,
24540276).
References
[1] RIKEN Advanced Institute for Computational Science (AICS),
[http://www.aics.riken.jp/en/].
[2] High Performance Computing Infrastructure (HPCI),
[https://www.hpci-office.jp/index.html].
[3] HPCI Strategic Program Field 5, The origin of matter and the universe,
[http://www.jicfus.jp/field5/en/].
[4] FUJITSU SCIENTIFIC & TECHNICAL JOURNAL (FSTJ), The K computer, 2012-7 (Vol.48, No.3)
[http://www.fujitsu.com/global/news/publications/periodicals/fstj/].
[5] M. Lüscher, Comput. Phys. Commun. 165 (2005) 199 [arXiv:hep-lat/0409106]; JHEP 0305 (2003)
052 [hep-lat/0304007].
[6] J.A. Vogel, Appl. Math. Comput, 167 (2005) 1004-1025; H. Tadano and T. Sakurai, LSSC’07, Lec.
Notes Comput. Sci. 4818 (2008) 721.
[7] A. Buttari, J. Dongarra, J. Kurzak, P. Luszczek, and S. Tomov, Using mixed precision for sparse
matrix computations to enhance the performance while achieving 64-bit accuracy, ACM Trans. Math.
Soft., 34 (2008) 1.
[8] S. Aoki et al. [PACS-CS Collaboration], Phys. Rev. D 81 (2010) 074503 [arXiv:0911.2561 [hep-lat]];
Phys. Rev. D 79 (2009) 034503 [arXiv:0807.1661 [hep-lat]].
[9] N. Eicker, W. Bietenholz, A. Frommer, T. Lippert, B. Medeke and K. Schilling, Nucl. Phys. Proc.
Suppl. 73 (1999) 850 [arXiv:hep-lat/9809038]; S. Fischer, A. Frommer, U. Glassner, T. Lippert,
G. Ritzenhofer and K. Schilling, Comput. Phys. Commun. 98 (1996) 20 [arXiv:hep-lat/9602019].
[10] Fujitsu Limited, “SPARC64TM VIIIfx Extensions” Version 15, 2010.
7
