Practical Implementation of Lattice QCD Simulation on Intel Xeon Phi
  Knights Landing by Kanamori, Issaku & Matsufuru, Hideo
ar
X
iv
:1
71
2.
01
50
5v
1 
 [h
ep
-la
t] 
 5 
De
c 2
01
7
Practical Implementation of Lattice QCD Simulation
on Intel Xeon Phi Knights Landing
Issaku Kanamori
Department of Physics,
Hiroshima University
Higashi-hiroshima 739-8526, Japan
Email: kanamori@hiroshima-u.ac.jp
Hideo Matsufuru
Computing Research Center,
High Energy Accelerator Research Organization (KEK)
Oho 1-1, Tsukuba 305-0801, Japan
Email: hideo.matsufuru@kek.jp
©2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or
reuse of any copyrighted component of this work in other works.
Abstract—We investigate implementation of lattice Quantum
Chromodynamics (QCD) code on the Intel Xeon Phi Knights
Landing (KNL). The most time consuming part of the numerical
simulations of lattice QCD is a solver of linear equation for
a large sparse matrix that represents the strong interaction
among quarks. To establish widely applicable prescriptions, we
examine rather general methods for the SIMD architecture of
KNL, such as using intrinsics and manual prefetching, to the
matrix multiplication and iterative solver algorithms. Based on
the performance measured on the Oakforest-PACS system, we
discuss the performance tuning on KNL as well as the code
design for facilitating such tuning on SIMD architecture and
massively parallel machines.
I. INTRODUCTION
Quantum Chromodynamics (QCD) is the fundamental the-
ory of the strong interaction among quarks. Despite its es-
sential roles in the elementary particle physics, the lack
of analytical method to treat its large coupling at low en-
ergy (equivalently at long distance) prevents us from solving
QCD in general. The lattice QCD, that is formulated on a
4-dimensional Euclidean lattice, enables us to tackle such
problems by numerical simulations [1]. Quanti ing with the
path integral formalism, the theory resembles a statistical
mechanical system to which the Monte Carlo methods apply.
Typically the most time consuming part of the lattice QCD
simulations is solving a linear equation for a large sparse
fermion matrix that represents the interaction among quarks.
As the lattice size becomes large necessarily to precision
calculation, the numerical cost grows rapidly. The lattice QCD
simulations have been a typical problem in high performance
computing and conducted development of supercomputers
such as QCDPAX [3] and QCDOC [4].
There are two distinct trends in high performance machines.
One is accelerators, represented by GPUs, which is composed
of many cores of O(1000). The Intel Xeon Phi architecture is
a variant of the other type, massively parallel clusters, while
it has potential to work as an accelerator. Knights Landing
(KNL), the second generation of Xeon Phi series, reinforced
Submitted to LHAM’17 “5th International Workshop on Legacy HPC
Application Migration” in CANDAR’17 “The Fifth International Symposium
on Computing and Networking” and to appear in the proceedings.
usability as a massive parallel machine. Its performance is
assured by the SIMD architecture. Elaborated assignments of
vector registers and exploiting SIMD instructions are essential
to achieve desired performance.
In this paper, we port a lattice QCD simulation code to Intel
Xeon Phi Knights Landing. The aim of this work is not a state-
of-the-art tuning on KNL, but to establish prescriptions that
are applicable to wide part of the application with practical
performance. While a hot spot of the QCD simulation tends
to concentrate in small part of programs, there are plenty
amount of code for measuring various physical quantities that
have been accumulated in legacy codes. One frequently needs
to accelerate such a code on a new architecture. Thus it is
important to establish simple procedures to achieve acceptable
performance as well as implication to future development of
code. For this reason, we restrict ourselves in rather general
methods: change of data layout, application of Intel AVX-512
intrinsics and prefetching. As a testbed of our analysis, we
choose two types of fermion matrices that are widely used
together with iterative linear equation solver algorithms.
This paper is organized as follows. The next section briefly
introduce the linear equation system in lattice QCD simula-
tions with fermion matrices employed in this work. Features
of KNL architecture are summarized in Section III. Section IV
describes the details of our implementation. In the following
sections, we measure the performance of individual fermion
matrices and the whole linear equation solvers. The last section
discusses implication of our results.
II. LATTICE QCD SIMULATION
A. Structure of lattice QCD simulation
For the formulation of lattice QCD and the principle of the
numerical simulation, there are many textbooks and reviews
[1]. Thus we concentrate on the linear equation system for the
fermion matrix, to which high computational cost is required.
The lattice QCD theory consists of fermion (quark) fields
and a gauge (gluon) field. The latter mediates interaction
among quarks and are represented by ‘link variables’, Uµ(x) ∈
SU(3), where x = (x1, x2, x3, x4) stands for a lattice site
and µ=1–4 is the spacetime direction. In numerical sim-
ulations the lattice size is finite, xµ = 1, 2, . . . , Lµ. The
fermion field is represented as a complex vector on lattice
Fig. 1. The schematic feature of the Wilson fermion matrix.
sites, which carries 3 components of ‘color’ and 4 com-
ponents of ‘spinor’, thus in total 12, degrees of freedom
on each site. The dynamics of fermion is governed by a
functional SF =
∑
x,y ψ
†(x)D[U ]−1(x, y)ψ(y), where D[U ]
is a fermion matrix. A Monte Carlo algorithm is applied to
generate an ensemble of the gauge field {Uµ(x)}, that requires
to solve a linear equation x = D−1ψ many times.
B. Fermion matrix
There is a variety of the fermion operator D[U ], since its
requirement is to coincide with that of QCD in the continuum
limit, the lattice spacing a→ 0. Each formulation has advan-
tages and disadvantages. As a common feature, the matrix is
sparse because of the locality of the interaction. In this paper,
we examine the following two types of fermion matrices.
1) Wilson fermion matrix: The first one called the Wilson
fermion matrix has the form
DW (x, y) = (m0 + 4)δx,y −
1
2
4∑
µ=1
[
(1− γµ)Uµ(x)δx+µˆ,y
+(1 + γµ)U
†
µ(x− µˆ)δx−µˆ,y
]
, (1)
where x, y are lattice sites, µˆ the unit vector along µ-th axis,
and m0 the quark mass. Fig. 1 indicates how the interaction to
the neighboring sites are involved in the matrix. As mentioned
above, the link variable Uµ(x) is a 3 × 3 complex matrix
acting on color and γµ is a 4× 4 matrix acting on the spinor
degrees of freedom. Thus DW is a complex matrix of the
rank 4NcLxLyLzLt. It is standard to impose the periodic or
antiperiodic boundary conditions.
2) Domain-wall fermion matrix: The second fermion ma-
trix we treat is called the domain-wall fermion. This matrix
DDW is defined by extending the spacetime to a 5-dimensional
lattice. The structure of DDW in the 5th direction reads
DDW =


DW + 1 −P− 0 · · · mP+
−P+ DW + 1 −P− 0
0 −P+ DW + 1
. . .
...
...
. . .
. . . −P−
mP− 0 · · · −P+ DW + 1


,
(2)
where DW (x, y) is the Wilson fermion matrix above (with
m0 set to certain value), and m instead represents the quark
mass. P± are 4×4 matrices acting on the spinor components.
The size of the 5th direction, Ns, is also a parameter of
DDW . Eq. (2) means that DDW is a block tridiagonal matrix
TABLE I
FEATURES OF FERMION MATRICES: THE NUMBER OF FLOATING POINT
OPERATION AND THE DATA TRANSFER BETWEEN MEMORY AND
PROCESSOR PER SITE. FOR THE DOMAIN-WALL MATRIX, THE 5TH
DIRECTIONAL SIZE IS SET AS Ns = 8.
Fermion type Nflop/site data/site [B] (float) Byte/Flop
Wilson 1368 1536 B 1.12
Domain-wall 11520 8256 B 0.72
including the boundary components in the fifth direction. Note
that Uµ(x) is common in the 5th direction. The domain-wall
fermion is widely used because of its good theoretical prop-
erties despite larger numerical cost than the Wilson matrix.
There are several possible ways to implement DDW . The
simplest way is to repeatedly use the implementation of DW
to a set of 4-dimensional vectors. Another way is to treat the
structure in the 5th direction as additional internal degrees
of freedom, together with the color and spinor ones. The
latter has an advantage in cache reuse. We compare the
both implementation and found that the latter achieves better
performance, and thus concentrate on it in the following.
3) Features of the fermion matrices: Although these
fermion matrices share the property of locality and sparseness,
they have different features in data size transferred between
the memory and the processor cores, number of arithmetic
operations, and data size of communication to other MPI
process. Table I summarizes the former two values per site
for single precision data. For the domain-wall fermion matrix,
these numbers depend on the size of 5th direction, Ns.
Hereafter we adopt Ns = 8 as a typical example. As quoted
in Table I, the domain-wall matrix tends to have smaller byte-
per-flop value, due to the independence of the link variable
Uµ(x) in the 5th direction.
C. Linear equation solver
Since the fermion matrices are large and sparse, iterative
solvers based on the Krylov subspace method are widely used.
For the Wilson fermion matrix, we employ the BiCGStab
algorithm for a nonhermitian matrix. As for the domain-
wall fermion, BiCGStab algorithm does not work because its
eigenvalues scatter also in the left side of the imaginary axis.
We thus apply the conjugate gradient (CG) algorithm to a
hermitian positive matrix D
†
DWDDW .
In practice a mixed precision solver is applicable. In this
case the single precision solver applied as the inner solver
determines the performance. Therefore we measure the per-
formance with the single precision. While there are variety of
improvement techniques for a large-scale linear systems, such
as a multi-grid or domain-decomposition methods, they are
beyond the scope of this paper.
III. KNIGHTS LANDING ARCHITECTURE
The Knights Landing is the second generation of Intel Xeon
Phi architecture, whose details are found in [2]. Its maximal
peak performance is 3 and 6 TFlops for double and single
precision, respectively. It is composed of maximally 72 cores,
2
in units of a tile containing two cores. Each tile has distributed
L2 cache that is shared with 2 cores. In addition to DDR4 with
about 90 GB/s, MCDRAM of maximally 16 GB accessible
with 400 GB/s is available with one of three modes: cache,
flat, and hybrid. Each core supports 4-way hyperthreading. The
SIMD instruction AVX-512 is available. 32 vector registers of
512-bit length are assigned to 8 double or 16 single precision
numbers.
Our target machine is the Oakforest-PACS system hosted
by Joint Center for Advances High Performance Computing
(JCAHPC, University of Tokyo and University of Tsukuba)
[5]. The system is composed of 8208 nodes of Intel Xeon Phi
7250 (68 cores, 1.4 GHz) connected by full-bisection fat tree
network of the Intel Omni-Path interconnect. It has 25 PFlops
of peak performance in total, and started public service in
April 2017.
IV. IMPLEMENTATION
A. Simulation code
As the base code, we choose the Bridge++ code set [6],
[7], which is described in C++ with the object-oriented design.
This code set allows us to replace fermion matrices and solver
algorithms independently. Porting of Bridge++ to accelerators
are performed in [8]. In the original Bridge++, hereafter called
the Bridge++ core library, the data is in double precision and in
fixed data layout. Following the strategy employed in [8], we
extend the vectors and matrices so as to enable any data layout
and changing the data type. From the core library, the linear
equation solver is offloaded to the newly developed code. This
enables us to tune the hot spot on the specific architecture with
keeping the remaining part of Bridge++ available. Following
the implementation of Bridge++, we parallelized the code with
MPI and employ OpenMP for multi-threading.
B. Related works
Since the lattice QCD simulation, in particular the fermion
matrix multiplication, is considered a typical problem of high
performance computing, it is examined in a textbook of KNL
[2]. Its chapter 26 is devoted to performance tuning of the
Wilson-Dslash operator, corresponding to the Wilson matrix
above, using the QPhiX library [9]. On a single KNL with
68 cores at 1.4 GHz, the maximal performance of the Wilson
matrix multiplication achieved 272 and 587 GFlops for double
and single precision, respectively.
As for the first generation of Xeon Phi, Knights Corner,
there are several works [10], [11], [12], [13], [14], [15], [16].
Joo et al. [10] is the direct base of the QPhiX library on
KNL. Ref. [14] developed a library named ‘Grid’ for the SIMD
architecture, and has largely affected our work. These works
would be extended to the KNL as well.
C. Implementation
To fully exploit the SIMD architecture of KNL, rearrange-
ment of data is inevitably important. For double and single
precision data types, 512-bit register corresponds to 8 and
16 floating point numbers, respectively, so we rearrange the
Fig. 2. The site index ordering. For float type complex variables, we use 3
dimensional analogues of this figure.
date in these units. We implement the code in C++ template
classes and instantiate them for double and float data types
individually. Since the vector data in lattice QCD are complex,
there are several possibilities for the ordering of real and imag-
inary parts. Considering the number of SIMD registers and the
number of the degree of freedom on each site, we decide to
place the real and imaginary parts as consecutive data on the
memory. The color and spinor components are distributed to
separate registers. For float type, data on eight sites are packed
into a single register and processed simultaneously.
There is flexibility how to fold lattice sites into the data
array. We have tested several reasonable types of site ordering
for the Wilson fermion matrix and found similar performance.
Throughout this paper, we adopt one of them in which
the implementation has been made most progress. This site
ordering is introduced in [14]. Fig. 2 displays how the index
of site is folded into SIMD-vector data. This indexing does not
require shuffling inside the vector variables during the stencil
calculation except for the boundary sites in a local process,
which makes the implementation easy. It also allows us a
flexible choice of local lattice volume. In our implementation,
local lattice sizes in y-, z-, and t-directions must be even
to pack 8 complex numbers into a single SIMD-vector. A
possible disadvantage is a load imbalance between bulk and
boundary sites. In most cases, however, this is hidden in the
larger imbalance due to packing and unpacking of the data for
MPI communication.
D. Tuning procedure
1) Data alignment: For a better performance with SIMD-
vector data, the data must be aligned by 64 bytes (i.e. 512
bits) in AVX512 architecture. To allocate the data, we use
std::vector in the standard C++ template library with
providing an aligned allocator. This allocation is done when
only the master thread is working. The use of std::vector
ensures that the data are allocated on contiguous area of the
memory. For the KNL architecture, this may cause imbalance
of affinity between cores and memory and decrease of perfor-
mance in particular with large number of cores per process. We
nonetheless adopt this setup because of the least modification
of previously developed codes.
3
2) Using Intrinsics: The arithmetics for the SIMD-vector
variables are implemented with intrinsics. For example, the
following code is an implementation of complex multiplication
a ∗ b using AVX-512 intrinsics.
/ / a , b a r e v e c t o r v a r i a b l e s
m512 a r = mm512 moveldup ps ( a ) ;
m512 a i = mm512 movehdup ps ( a ) ;
m512 b t = mm512 permute ps ( b , 0 xB1 ) ;
a i = mm512 mul ps ( a i , b t ) ;
m512 c= mm512 fmaddsub ps ( a r , b , a i ) ;
One can incorporate such codes with intrinsics by making use
of inline functions, C++ templates, or preprocessor macros.
We instead make use of simd directory in the ‘Grid’ library
[14], where a wrapper to the vector variable (__m512) and
complex arithmetics are defined by using the intrinsics. Using
a template class defined there, operation a ∗ b can be written
such as vComplexF c=a*b.
3) Prefetching: We compare the manual prefetch and the
automated prefetch by compiler. The most outer loop of the
matrix is in the site index. At each site, we accumulate 8 stencil
contributions, from +x, −x,...,−t directions in order. The
prefetch to L1 and L2 cache is inserted 1 and 3 steps before
the computation, respectively. That is, before accumulating
a (+x)-contribution, data needed for (−x)-contribution is
prefetched to the L1 cache and (−y)-contribution is to the
L2 cache. We use _mm_prefetch with _MM_HINT_T0 and
_MM_HINT_T1 to generate the prefetch order. The following
is a pseudo code to show the prefetch insertions:
f o r ( s =0 ; s<num of s i t e s ; s ++){
#pragma n o p r e f e t c h
/ / +x
p r e f e t c h t o L1 (−x ) ;
p r e f e t c h t o L2 (−y ) ;
accumla t e f rom (+x ) ;
/ / −x
p r e f e t c h t o L1 (+y ) ;
p r e f e t c h t o L2 (+ z ) ;
accumla t e f rom (−x ) ;
. . .
}
It is not straightforward to insert prefetch commands appro-
priately. One needs to tune the variables and the place to
insert referring to a profiler, e.g. Intel Vtune amplifier. The
performance may sensitive to the problem size, choice of
parameters such as the number of threads, and so on.
4) Thread task assignment: Since the lattice extends over
the machine nodes, the matrix and the reduction of vectors
require communication among nodes. The function of matrix
processes the following steps in order: (1) Packing of the
boundary data for communication, (2-a) Doing communica-
tion, (2-b) Operations of the bulk part, and (3) Unpacking the
boundary data. (2-a) and (2-b) can be overlapped, and how
efficient is this is important for the total performance.
We restrict ourselves in the case that only the master
thread performs the communication, i.e. corresponding to
MPI_THREAD_FUNNELED. For the implementation of the
steps (2-a) and (2-b) above, there are two possibilities: (i)
arithmetic operational tasks are equally assigned to all the
available threads, and (ii) the master thread concentrates the
communication and other threads bear the arithmetic tasks. In
this work, we adopt the case (ii).
V. PERFORMANCE OF WILSON FERMION MATRIX
A. Machine environment
The performance is measured on the Oakforest-PACS sys-
tem. We use the Intel compiler of version 17.0.4 with options
-O3 -ipo -no-prec-div -xMIC-AVX512. On execu-
tion, we use job classes with the cache mode of MCDRAM.
According to the tuning-guide provided by JCAHPC, we
adopt the following setup. To avoid OS jitter, the 0th and
1st cores on each KNL card are not assigned for execution.
KMP AFFINITY=compact is set if more than 1 thread is
assigned to a core (unset for 1 thread/core).
B. Wilson fermion matrix
1) Task assignment to threads: We start with the Wil-
son fermion matrix. We first compare the performance for
combinations of a number of cores per MPI process and a
number of threads per core. For the former, (1) one core per
process, (2) two cores (one tile) per process, and (3) 64 cores
(whole KNL card) per process. For the latter, making use of
the hyperthreading, the following three cases are examined:
(a) one thread per core, (b) two threads per core, and (c)
four threads per core. The actual number of threads per MPI
process is multiple of numbers of cores and threads per core.
We compare one, two, and four threads per core cases. Fig. 3
shows the results with 1-node and 16-node, generated with the
strong scaling on 323×64 lattice. For the case (3), copy of the
boundaries is enforced so as to compare with the other cases
on the same footing. For the case (1) and (2), dependence
on the number of threads per core is not strong. Including
the cases of other numbers of nodes, there is a tendency of
achieving the best performance with 2 threads per MPI process
for case (1) and (2), and 1 thread per core for case (3).
2) Prefetching: Effect of manual prefetch against the auto-
mated prefetch by compiler is displayed in Fig. 4. In the single
node case, where no inter-node communication is needed,
the manual prefetch improves the performance by more than
20%. Contrary to a statement in [2], manual prefetching is
effective to our case. Increasing the number of nodes, however,
the effect is gradually washed and becomes a few percent
at 16 nodes as shown in Fig. 4. Since our target lattice
sizes assumes more than O(10) KNL nodes, the advantage
of manual prefetch is not manifest compared to involved
tuning effort. For this reason, we do not apply it to the linear
algebraic functions and the domain-wall fermion matrix, while
the following measurement of the Wilson matrix is done with
the tuned code.
Here we summarize the output of Intel Vtune Amplifier
for the Wilson matrix multiplication with a single process of
64 threads. Applying the manual prefetch, L2 cache hit rate
increases from 76.7% to 99.1%. With the manual prefetch,
the UTLB overhead and the page walk are 0.0% and 0.2%
4
 0
 50
 100
 150
 200
 250
 300
 350
 400
cores
/proc
1 2 64 1 2 64
 1 node 16 nodes
performance of Wilson mult: 323x64 latt. [GFlops/node]
1 thread/core
2 thread/core
4 thread/core
Fig. 3. Comparison of numbers of threads per core.
of clockticks, respectively. The metric SIMD Compute-to-
L2 Access Ratio reaches as large as 2,023. For the whole
BiCGStab solver examined below, however, the L2 hit rate
decreases to 84.4%, as explained by the larger byte-per-flop
rate of the solver algorithm.
3) Comparison to other codes: Now we compare the per-
formance of the Wilson matrix multiplication to other codes
under the condition of a single process on a single KNL.
As quoted already, QPhiX library achieves 587 GFlops for
single precision [2] on a 323 × 96 lattice. The GRID library
[14] provides a benchmark of the Wilson matrix that we
run on the same environment as this work. On 323 × 64
lattice, based on v0.7.0, it gives the best performance with one
thread/core and amounts to 348.6 GFlops that is comparable
to our result. While our result is not as fast as QPhiX, it shows
that large fraction of performance can be achieved with widely
applicable techniques. These values are reasonable considering
the memory bandwidth of MCDRAM and the byte-per-flop in
Table I, while far below the peak performance.
For reference, we also measure the performance of the
original Bridge++, that is implemented in not SIMD-oriented
manner and only in the double precision. The best performance
is obtained with 4 threads/core and results in 60.0 GFlops,
which roughly corresponds to 120 GFlops in single precision.
This indicates the impact of the SIMD-oriented tuning.
4) Scaling property of matrix multiplication: Now we ob-
serve the scaling properties of the Wilson matrix multiplica-
tion. In the following, we adopt 1, 2, and 2 threads/core for 64,
2, and 1 cores/process, respectively. The top panel of Fig. 5
shows the weak scaling plot for the 163 × 32 lattice in each
node. In the measurements, we do not enforce the copy of
the boundary data unless it is really needed. For reference,
if it is enforced on a single node with 64 cores/process, the
performance becomes 100.4 and 176.0 GFlops on 163 × 32
and 323×64 lattices, respectively. For the 64 cores/process on
multiple nodes, the performance is about the half of the other
two cases. This may be explained by that all the 64 cores in a
node share the whole memory of the node so that imbalance of
accessibility to the memory among the cores exists. The cases
of 1 or 2 cores/process achieve more than 170 GFlops/node
 0
 50
 100
 150
 200
 250
 300
 350
 400
with prefetch
cores
/proc
1 2 64 1 2 64
 1 node 16 nodes
performance of Wilson mult: 323x64 latt. [GFlops/node]
1 thread/core
2 thread/core
4 thread/core
Fig. 4. Effect of manual prefetching. The performance of the Wilson matrix
multiplication is measured on a 323 × 64 lattice.
up to 256 nodes.
The bottom panel of Fig. 5 shows the strong scaling on a
323×64 lattice. In this case, as the number of nodes increases,
the local lattice volume inside each node decreases so that the
communication overhead becomes more and more significant.
Again the 64 cores/process case exhibits less performance
than other two cases. For the strong scaling, the performance
depends on how the lattice is divided into sublattices. We plot
several cases at each number of nodes. For the case of 1 and
2 cores/process, the performance at 16 nodes is about 2/3 of
that of single node.
C. Performance of BiCGStab solver
For the Wilson matrix, the BiCGStab solver works effi-
ciently. We compose the BiCGStab algorithm with BLAS-like
library. For the linear algebraic functions, we apply neither
manual prefetch nor additional compiler option for prefetch.
While the Wilson matrix part is improved by manual prefetch,
this effect is small because the performance is determined by
the linear algebraic functions.
The top panel of Fig. 6 shows the weak scaling for the
BiCGStab solver on a 163×32 lattice in each node. Because of
larger byte-per-flop values of the linear algebraic functions, the
performance reduces to about 1/4 of the matrix multiplication
at 256 nodes. The difference of 64 cores/process and other two
cases also decrease because of the linear algebraic functions.
The worse scaling as the number of node is caused by the
reduction operations. The bottom panel of Fig. 6 shows the
strong scaling plot of the solver on 323×64 lattice. While the
difference of 64 cores/process and other two cases shrinks,
smaller numbers of cores/process achieve better performance
for large number of nodes. In total, these results indicate that
small number of cores per MPI process has advantage, if the
memory size and the local lattice volume allow.
VI. PERFORMANCE OF DOMAIN-WALL FERMION MATRIX
A. Performance of matrix multiplication
For the domain-wall fermion matrix, the condition of per-
formance measurement is the same as the Wilson matrix
5
050
100
150
200
250
300
350
400
 1  10  100
G
Fl
op
s/
no
de
number of nodes
weak scaling of wilson mult: 163x32 latt.
1 core/proc.
2 cores/proc.
64 cores/proc.
0
50
100
150
200
250
300
350
400
 0  2  4  6  8  10  12  14  16
G
Fl
op
s/
no
de
number of nodes
strong scaling of wilson mult: 323x64 latt.
1 core/proc.
2 cores/proc.
64 cores/proc.
Fig. 5. Scaling plots for the Wilson matrix multiplication. Top: weak scaling
with a 163 ×32 lattice in each node. Bottom: strong scaling with a 323 ×64
lattice.
except for no manual prefetch is applied. The domain-wall
multiplication achieves better performance than the Wilson
matrix, as expected from smaller byte-per-flop value due to
the reuse of link variable Uµ(x), as well as the larger number
of arithmetic operations per lattice site.
The top panel of Fig. 7 displays a weak scaling plot of the
domain-wall matrix multiplication for a 163×32 lattice in each
node. On a single node with 64 cores/process, if the boundary
data copy is enforced, the performance becomes 155.2 GFlops
on a 163×32 lattice and 211.8 GFlops on a 323×64 lattice for
the weak and strong scaling, respectively. While the 1 and 2
cores/process cases exhibit good scaling behavior, multi-node
result of 64 cores/process is quite bad. What is different from
the Wilson matrix is larger size of boundary data transferred
at the communication. How such reduction of performance
occurs and how can be avoided is now under investigation.
The strong scaling on a 323 × 64 lattice in the bottom panel
of Fig. 7 shows similar tendency as the Wilson matrix, except
for the reduction of the 64 cores/process data at 16 nodes.
For the 64 cores/process data, we observe large fluctuations
up to almost factor 10 in the elapsed time. We have not
understood why such fluctuations occur and are investigating
the cause. For this reason, we do not include the data for 64
cores/process in the measurement of the CG solver below.
0
20
40
60
80
100
120
140
 1  10  100
G
Fl
op
s/
no
de
number of nodes
weak scaling of wilson BiCGStab: 163x32 latt.
1 core/proc.
2 cores/proc.
64 cores/proc.
0
20
40
60
80
100
120
140
 0  2  4  6  8  10  12  14  16
G
Fl
op
s/
no
de
number of nodes
strong scaling of wilson BiCGStab: 323x64 latt.
1 core/proc.
2 cores/proc.
64 cores/proc.
Fig. 6. Scaling property of the BiCGStab solver with Wilson matrix. Top:
weak scaling plot for a 163 × 32 lattice in each node. Bottom: strong scaling
on a 323 × 64 lattice.
B. Performance of CG solver
For the domain-wall fermion, we apply the CG algorithm
to the matrix D†D. Fig. 7 displays the weak (top) and
strong (bottom) scaling plots for the CG solver. Better scaling
behaviors than those of the Wilson matrix are explained by
larger weight of the matrix multiplication in the algorithm
and larger size of vectors. Even for the strong scaling plot, the
performance of 1 and 2 cores/process cases almost unchanged
as increasing the number of nodes from 1 to 16.
VII. CONCLUSION
In this paper, we apply rather simple prescription to make
use of the SIMD architecture of the Intel Xeon Phi KNL
processor to a typical problem in lattice QCD simulation.
Aiming at widely applying to existing codes, we examine
the rearrangement of data layout and AVX-512 intrinsics to
arithmetic operations, and examined the effect of manual
prefetching. The former two are inevitable to achieve accept-
able performance, compared to the original Bridge++ code.
The effect of manual prefetching is more restrictive. It amounts
to dedicated efforts only on single node or small number
of nodes. Comparing the choices of numbers of cores per
MPI process, small numbers of cores per MPI process have
advantages as increasing number of nodes.
6
050
100
150
200
250
300
350
400
 1  10  100
G
Fl
op
s/
no
de
number of nodes
weak scaling of domainwall mult: 163x32 latt.
1 core/proc.
2 cores/proc.
64 cores/proc.
0
50
100
150
200
250
300
350
400
 0  2  4  6  8  10  12  14  16
number of nodes
strong scaling of domainwall mult: 323x64 latt.
1 core/proc.
2 cores/proc.
64 cores/proc.
Fig. 7. The weak and strong scaling plots of the domain-wall matrix
multiplication. For the weak scaling, 163×32 lattice per node. For the strong
scaling, 323 × 64 lattice.
Our results indicate two efficient ways of using KNL. On
single KNL, multi-thread application without MPI paralleliza-
tion may works efficiently. The manual prefetch is worth to
try. For a multi-node case, adopting small numbers of cores
per MPI process, like a massively parallel machine, one can
optimize the number of nodes against a given problem size. As
for the code of application, it is essential to employ design that
enables flexible rearrangement of data layout and incorporation
of intrinsics.
ACKNOWLEDGMENT
The authors would like to thank Peter Boyle, Guido Cossu,
Ken-Ichi Ishikawa, Daniel Richtmann, Tilo Wettig, and the
members of Bridge++ project for valuable discussion. Numer-
ical simulations were performed on Oakforest-PACS system
hosted by JCAHPC, with support of Interdisciplinary Com-
putational Science Program in CCS, University of Tsukuba.
This work is supported by JSPS KAKENHI (Grant Numbers
JP25400284, JP16H03988), and by Priority Issue 9 to be
tackled by Using Post K Computer, and Joint Institute for
Computational Fundamental Science (JICFuS).
REFERENCES
[1] For modern textbooks, e.g., T. DeGrand, and C. DeTar, “Lattice Methods
for Quantum Chromodynamics” (World Scientific Pub., 2006); C. Gat-
0
50
100
150
200
250
300
350
400
 1  10  100
G
Fl
op
s/
no
de
number of nodes
weak scaling of domainwall CG: 163x32 latt.
1 core/proc.
2 cores/proc.
0
50
100
150
200
250
300
350
400
 0  2  4  6  8  10  12  14  16
number of nodes
strong scaling of domainwall CG: 323x64 latt.
1 core/proc.
2 cores/proc.
Fig. 8. The weak and strong scaling plot for the CG solver with the domain-
wall fermion matrix. For the weak scaling, 163 ×32 lattice per node. For the
strong scaling, 323 × 64 lattice.
tringer and C. B. Lang, ”Quantum Chromodynamics on the Lattice”
(Springer, 2010)
[2] J. Jeffers, J. Reinders, A. Sodani, “Intel Xeon Phi Processor High
Performance Programming Knights Landing Edition” (Elsevier, 2016).
[3] Y. Iwasaki, T.Hoshino T.Shirakawa Y.Oyanagi T.Kawai “QCDPAX: A
parallel computer for lattice QCD simulation”, Comp. Phys. Commun.
49, 449 (1988).
[4] P.A. Boyle et al., “QCDOC: A 10 Teraflops Computer for Tightly-
Coupled Calculations”, SC ’04: Proceedings of the 2004 ACM/IEEE
Conference on Supercomputing, DOI: 10.1109/SC.2004.46.
[5] Joint Center for Advanced High Performance Computing (JCAHPC),
https://ofp-www.jcahpc.jp/.
[6] Bridge++ project, http://bridge.kek.jp/Lattice-code/.
[7] S. Ueda et al., “Bridge++: an object-oriented C++ code for lattice
simulations” PoS LATTICE2013, 412 (2014).
[8] S. Motoki et al., “Development of Lattice QCD Simulation Code Set on
Accelerators” Procedia Computer Science 29, 1701 (2014). H. Matsufuru
et al., “OpenCL vs OpenACC: Lessons from Development of Lattice
QCD Simulation Code” Procedia Computer Science 51, 1313 (2015).
[9] QPhiX library, https://github.com/JeffersonLab/qphix.
[10] B. Joo et al., “Lattice QCD on Intel Xeon Phi Coprocessors” Supercom-
puting Vol. 7905 of ser. Lecture Notes in Computer Science pp 40-54
(2013).
[11] R. Li and S. Gottlieb, “Staggered Dslash Performance on Intel Xeon
Phi Architecture,” PoS LATTICE 2014, 034 (2015).
[12] S. Heybrock et al., “Lattice QCD with Domain Decomposition on Intel
Xeon Phi Co-Processors,” doi:10.1109/SC.2014.11.
[13] P. Arts et al., “QPACE 2 and Domain Decomposition on the Intel Xeon
Phi,” PoS LATTICE 2014, 021 (2015).
[14] P. A. Boyle, G. Cossu, A. Yamaguchi and A. Portelli, “Grid: A next
generation data parallel C++ QCD library,” PoS LATTICE 2015, 023
(2016).
7
[15] H. Kobayashi, Y. Nakamura, S. Takeda and Y. Kuramashi, “Optimiza-
tion of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi
Coprocessor,” PoS LATTICE 2015, 029 (2016).
[16] T. Boku et al., “A performance evaluation of CCS QCD Benchmark on
the COMA (Intel(R) Xeon PhiTM , KNC) system,” PoS LATTICE 2016,
261 (2016).
8
