Three Dirac operators on two architectures with one piece of code and no
  hassle by Durr, Stephan
Three Dirac operators on two architectures with one
piece of code and no hassle
Stephan Dürr∗
University of Wuppertal, D-42119 Wuppertal, Germany
IAS/JSC Forschungszentrum Jülich, D-52425 Jülich, Germany
E-mail: durr(AT)itp.unibe.ch
A simple minded approach to implement three discretizations of the Dirac operator (staggered,
Wilson, Brillouin) on two architectures (KNL and core i7) is presented. The idea is to use a
high-level compiler along with OpenMP parallelization and SIMD pragmas, but to stay away
from cache-line optimization and/or assembly-tuning. The implementation is for Nv right-hand-
sides, and this extra index is used to fill the SIMD pipeline. On one KNL node single precision
performance figures for Nc = 3, Nv = 12 read 475 Gflop/s, 345 Gflop/s, and 790 Gflop/s for the
three discretization schemes, respectively.
The 36th Annual International Symposium on Lattice Field Theory - LATTICE2018
22-28 July, 2018
Michigan State University, East Lansing, Michigan, USA.
∗Speaker.
c© Copyright owned by the author(s) under the terms of the Creative Commons
Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). https://pos.sissa.it/
ar
X
iv
:1
80
8.
05
50
6v
2 
 [h
ep
-la
t] 
 20
 N
ov
 20
18
Three Dirac operators on two architectures Stephan Dürr
1. Introduction
Recent years brought plenty of machines with peak performances in the multi-petaflop/s range,
but it gets increasingly difficult to achieve good strong-scaling behavior with actual scientific codes.
Lattice QCD is still in a fortunate position to harness these capabilities [1, 2, 3], as it does not
require any run-time dependent data structures. On the other hand, an un-optimized non-parallel
code tends to have O(105) lines. This means that significant human resources must be spent to
parallelize a lattice code and to obtain good performance figures on a given architecture.
Ideally one would have a piece of code, written in a high-level language, which parallelizes and
reaches decent (read: non-optimal but non-disastrous) performance upon compilation on a given
new architecture. In these proceedings I report on an attempt to do this on the single-node level,
based on OpenMP (OMP) threads and again OpenMP pragmas for SIMD-pipelining. I concentrate
on the part which takes most time in actual computations – the matrix-times-vector operation for a
given Dirac operator, considering the Susskind (“staggered”), Wilson and Brillouin varieties.
2. Vector layout options
The routines are written in Fortran 2008, using the stride notation, as this allows for com-
pact source files (like in matlab). The gauge field U is defined as a 7-dimensional array through
complex(kind=sp),dimension(Nc,Nc,4,Nx,Ny,Nz,Nt)::U, with parameters like
Nc=3 and sp,dp (for single and double precision, respectively) specified at compile time. Hence
U(:,:,3,x,y,z,t) defines N2c complex numbers, arranged contiguously in memory.
With Wilson-type vectors arranged in blocks of Nv right-hand sides, the layout options include
vec(Nc,4,Nv,...), vec(4,Nc,Nv,...), vec(Nc,Nv,4,..), vec(Nv,Nc,4,..),
vec(4,Nv,Nc,...), vec(Nv,4,Nc,...), where the dots stand for Nx*Ny*Nz*Nt. This
restriction of having the space-time index as the slowest (right-most) index precludes sophisticated
SIMD strategies (see Refs. [1, 2]), but it may facilitate the use of PGAS concepts (see below). With
Susskind-type vectors arranged in blocks of Nv right-hand sides, the layout options under the same
restriction are suv(Nc,Nv,Nx*Ny*Nz*Nt), suv(Nv,Nc,Nx*Ny*Nz*Nt).
Our task is to optimize the performance under the self-imposed set of restrictions. An impor-
tant ingredient in the code is that all contributions to the “out” vector are collected in the thread-
private variable site, which for each space-time index is written once. This avoids write collisions
among threads in a natural way. We have the same 6 layout options as for vec to define site as
an array of dimension 3 in the Wilson case (or the same 2 options in the staggered case).
3. Staggered kernel details and performance
The Susskind (“staggered”) Dirac operator is defined through
DS(x,y) =∑
µ
ηµ(x)
1
2
[Vµ(x)δx+µˆ,y−V †µ (x− µˆ)δx−µˆ,y] (3.1)
with η1(x) = 1, η2(x) = (−1)x1 , η3(x) = (−1)x1+x2 , η4(x) = (−1)x1+x2+x3 . HereVµ(x) represents a
smeared version of the (original) gauge link Uµ(x), i.e. a gauge-covariant parallel transporter from
1
Three Dirac operators on two architectures Stephan Dürr
 10
 100
 1  10  100
G
flo
p/
s 
 [s
p]
# threads
    1 thread  :     6.6 Gflop/s
  68 threads: 400.0 Gflop/s
136 threads: 469.7 Gflop/s
272 threads: 451.7 Gflop/s
239 threads: 476.2 Gflop/s
stag_sp @ KNL_68cores
 10
 100
 1  10  100
G
flo
p/
s 
 [d
p]
# threads
    1 thread  :     4.9 Gflop/s
  68 threads: 295.3 Gflop/s
136 threads: 335.1 Gflop/s
272 threads: 309.0 Gflop/s
153 threads: 342.7 Gflop/s
stag_dp @ KNL_68cores
 100
 1  10
G
flo
p/
s 
 [s
p]
# threads
  1 thread  :   27.3 Gflop/s
  6 threads: 127.8 Gflop/s
12 threads: 129.8 Gflop/s
24 threads: 124.4 Gflop/s
stag_sp @ ci7_6cores
 100
 1  10
G
flo
p/
s 
 [d
p]
# threads
  1 thread  :   17.8 Gflop/s
  6 threads:   87.9 Gflop/s
12 threads:   92.6 Gflop/s
24 threads:   90.7 Gflop/s
stag_dp @ ci7_6cores
Figure 1: Log-log scaling plot of the single processor performance in Gflop/s versus the number of OMP
threads for the staggered Dirac operator in sp (left) and dp (right). The top panels feature a KNL processor
with 68 cores, the bottom panels a Broadwell chip with 6 cores. The lattice size is 343×68.
x+ µˆ to x, to reduce taste-symmetry breaking. From a HPC viewpoint, a clear advantage of this
operator with precomputed Vµ is that its stencil is restricted to sites which are at most one hop
away. Still, it is not trivial to reach an acceptable performance on a many-core architecture [4, 5].
In our framework we have 2 options for the suv layout, 2 options for site, and 2 reasonable
loop nestings. It is thus possible to implement all these options, and to compare the timings. For
one choice (the one performing best on the KNL architecture) thread scaling results are shown in
Fig. 1. Performance on the Broadwell architecture seems far less sensitive to these choices.
4. Wilson kernel details and performance
The Wilson Dirac operator is defined through
DW(x,y) =∑
µ
γµ∇stdµ (x,y)−
a
2
4std(x,y)+m0δx,y− cSW2 ∑µ<ν
σµνFµνδx,y , (4.1)
where ∇stdµ is a 2-point discretization of the covariant derivative
a∇stdµ (x,y) =
1
2
[Vµ(x)δx+µˆ,y−V−µ(x)δx−µˆ,y] (4.2)
2
Three Dirac operators on two architectures Stephan Dürr
 10
 100
 1  10  100
G
flo
p/
s 
 [s
p]
# threads
    1 thread  :     4.9 Gflop/s
  68 threads: 297.6 Gflop/s
136 threads: 328.1 Gflop/s
272 threads: 324.7 Gflop/s
154 threads: 345.0 Gflop/s
wils_sp @ KNL_68cores
 10
 100
 1  10  100
G
flo
p/
s 
 [d
p]
# threads
    1 thread  :     3.2 Gflop/s
  68 threads: 183.5 Gflop/s
136 threads: 209.2 Gflop/s
272 threads: 199.2 Gflop/s
138 threads: 219.8 Gflop/s
wils_dp @ KNL_68cores
 10
 100
 1  10
G
flo
p/
s 
 [s
p]
# threads
  1 thread  :    17.3 Gflop/s
  6 threads:    82.9 Gflop/s
12 threads:    95.2 Gflop/s
24 threads:    91.9 Gflop/s
wils_sp @ ci7_6cores
 10
 100
 1  10
G
flo
p/
s 
 [d
p]
# threads
  1 thread  :      8.1 Gflop/s
  6 threads:    38.4 Gflop/s
12 threads:    46.6 Gflop/s
24 threads:    44.7 Gflop/s
wils_dp @ ci7_6cores
Figure 2: Log-log scaling plot of the single processor performance in Gflop/s versus the number of OMP
threads for the Wilson Dirac operator at cSW = 0 in sp (left) and dp (right). The top panels feature a KNL
processor with 68 cores, the bottom panels a Broadwell chip with 6 cores. The lattice size is 343×68.
and4std is a 9-point discretization of the covariant Laplacian (sum over 4 pos. and 4 neg. indices)
a24std(x,y) =−8δx,y+1∑µVµ(x)δx+µˆ,y . (4.3)
Also this operator is HPC friendly, since its stencil contains at most 1-hop terms.
In our framework we have 6 options for the vec layout, times 6 options for site, times a few
reasonable loop nestings. For the standard Laplacian 4std in the Wilson operator the latter set of
options is 4, resulting in a total of 144 routines. Evidently, this brings some limitations to the “best
of breed” ansatz, since it may be a little awkward to test each routine with each possible thread
count (e.g. 1 through 272 on the KNL). In practice, it seems reasonable to restrict the selection
process to just a few thread counts (e.g. 68, 136, and 272 on the KNL).
Full thread scaling results for one such choice are displayed in Fig. 2. Similar to Fig. 1, we see
an almost linear increase in performance up to 68 threads on the KNL. This is followed by another
linear gain (albeit with a smaller slope) up to 136 threads. Beyond that point results wiggle a bit out
to 272 threads. The maximum appears in a number of threads (154 in sp, 138 in dp) which seems
hard to predict. Again, the code seems to perform (without any change) reasonably well on the
Broadwell architecture, too. Having more than 2 threads per core does not enhance performance,
but the good news is that it’s not really detrimental either. Next steps of performance tuning would
naturally include eo-decomposition and gauge compression (from Nc to Nc−1 columns).
3
Three Dirac operators on two architectures Stephan Dürr
#hop #terms #paths formula
1 8 1!=1 Wµ(x) =Vµ(x) (smeared link, µ∈{±1,±2,±3,±4})
2 24 2!=2 Wµ+ν(x) = 12 [Vµ(x)Vν(x+µˆ)+perm]
3 32 3!=6 Wµ+ν+ρ(x) = 16 [Vµ(x)Vν(x+µˆ)Vρ(x+µˆ+νˆ)+perms]
4 16 4!=24 Wµ+ν+ρ+σ (x) = 124 [Vµ(x)Vν(x+µˆ)Vρ(x+µˆ+νˆ)Vσ (x+µˆ+νˆ+σˆ)+perms]
Table 1: Overview of the set of off-axis linksWdir(x), with lengths ranging from 1 to 4 hops. Given a site x,
81 directions are possible, but one is trivial, and the remaining 80 can be reduced to 40 based on W−dir(x) =
W †dir(x−dir). In the codeW is precomputed and stored in the array W(Nc,Nc,40,Nx,Ny,Nz,Nt). Note
that for 36 of the 40 directions the entry is not special unitary, and no gauge compression is possible.
5. Brillouin kernel details and performance
The Brillouin Dirac operator is defined through [6]
DB(x,y) =∑
µ
γµ∇isoµ (x,y)−
a
2
4bri(x,y)+m0δx,y− cSW2 ∑µ<ν
σµνFµνδx,y , (5.1)
where the isotropic derivative ∇isoµ is a 54-point discretization of the covariant derivative
a∇isoµ (x,y) = ρ1 [Wµ(x)δx+µˆ,y−W−µ(x)δx−µˆ,y]
+ ρ2∑6=(ν ;µ)[Wµ+ν(x)δx+µˆ+νˆ ,y− (µ →−µ)]
+ ρ3∑6=(ν ,ρ;µ)[Wµ+ν+ρ(x)δx+µˆ+νˆ+ρˆ,y− (µ →−µ)]
+ ρ4∑6=(ν ,ρ,σ ;µ)[Wµ+ν+ρ(x)δx+µˆ+νˆ+ρˆ+σˆ ,y− (µ →−µ)] (5.2)
and the Brillouin Laplacian4bri is a 81-point discretization of the covariant Laplacian
a24bri(x,y) = λ0 δx,y + λ1∑µWµ(x)δx+µˆ,y
+ λ2∑6=(µ,ν)Wµ+ν(x)δx+µˆ+νˆ ,y
+ λ3∑6=(µ,ν ,ρ)Wµ+ν+ρ(x)δx+µˆ+νˆ+ρˆ,y
+ λ4∑6=(µ,ν ,ρ,σ)Wµ+ν+ρ+σ (x)δx+µˆ+νˆ+ρˆ+σˆ ,y (5.3)
with (ρ1,ρ2,ρ3,ρ4)≡ (64,16,4,1)/432 and (λ0,λ1,λ2,λ3,λ4)≡ (−240,8,4,2,1)/64 respectively.
In (5.2) the last sum extends over (pos. and neg.) indices (ν ,ρ,σ) which are mutually unequal and
different from µ . In (5.3) the last sum extends over indices (µ,ν ,ρ,σ) which are pairwise unequal.
In these formulasWdir(x) denotes a link in direction “dir” which may be on-axis (dir=µ) or off-axis
with length
√
2 (dir=µν) or
√
3 (dir=µνρ) or
√
4 (dir=µνρσ ). More details are given in Tab. 1.
The Brillouin operator (5.1) brings new perspectives on PDFs [6, 7] and – when used in con-
junction with the overlap procedure [8, 9] – on heavy-quark physics [10, 11]. From a HPC view-
point the Brillouin operator is interesting, since its computational intensity is by a factor 2.5 higher
than for the Wilson operator. The choice of treating Nv right-hand-sides simultaneously enhances
the computational intensity of either operator, while keeping this ratio almost invariant [11]. In a
very distant future, when cycles are totally irrelevant, one might opt for not precomputing W ; this
would trigger a massive enhancement of the computational intensity of the operator (5.1).
4
Three Dirac operators on two architectures Stephan Dürr
 10
 100
 1  10  100
G
flo
p/
s 
 [s
p]
# threads
    1 thread  :   10.0 Gflop/s
  68 threads: 635.7 Gflop/s
136 threads: 767.4 Gflop/s
272 threads: 704.8 Gflop/s
138 threads: 791.6 Gflop/s
bril_sp @ KNL_68cores
 10
 100
 1  10  100
G
flo
p/
s 
 [d
p]
# threads
    1 thread  :     6.1 Gflop/s
  68 threads: 366.9 Gflop/s
136 threads: 455.2 Gflop/s
272 threads: 358.5 Gflop/s
139 threads: 467.4 Gflop/s
bril_dp @ KNL_68cores
 10
 100
 1  10
G
flo
p/
s 
 [s
p]
# threads
  1 thread  :   27.4 Gflop/s
  6 threads: 150.8 Gflop/s
12 threads: 222.8 Gflop/s
24 threads: 222.1 Gflop/s
bril_sp @ ci7_6cores
 10
 100
 1  10
G
flo
p/
s 
 [d
p]
# threads
  1 thread  :   17.2 Gflop/s
  6 threads:   94.8 Gflop/s
12 threads: 116.3 Gflop/s
24 threads: 115.8 Gflop/s
bril_dp @ ci7_6cores
Figure 3: Log-log scaling plot of the single processor performance in Gflop/s versus the number of OMP
threads for the Brillouin Dirac operator at cSW = 0 in sp (left) and dp (right). The top panels feature a KNL
processor with 68 cores, the bottom panels a Broadwell chip with 6 cores. The lattice size is 343×68.
Thread scaling results are shown in Fig. 3. As in previous cases, we see nearly-perfect scaling
behavior up to 68 threads on the KNL. After this, there is another performance increase up to 136
threads. Beyond that point, performance figures wiggle a bit out to 272 threads. And again, the
(unchanged, just recompiled) code performs reasonably well on the Broadwell architecture, too.
Unlike in the Wilson case, the author is unaware of simple recipes for further improvement.
6. Summary and outlook
In summary, thread scaling results for the Susskind, Wilson and Brillouin varieties of the
Dirac operator in lattice QCD were presented. They are based on straightforward implementations
in Fortran 2008, with OpenMP pragmas for shared-memory parallelization and SIMD pipelining.
No cache-access optimization and no hand-assembly tuning have been applied, and still rea-
sonable performance figures can be obtained. Key to the improvement over last year’s version [12]
is an ansatz where performance critical routines are written for a variety of layouts of the in/out-
vectors, of accumulation variables, and possibly loop nestings. For a given architecture and com-
piler combination, all of these routines are compiled “out of the box”, and a few test calls will
quickly reveal which option features best on a particular machine. With this “best of breed” ansatz,
the winning combination is subsequently used for actual computations.
5
Three Dirac operators on two architectures Stephan Dürr
The plots presented in this contribution illustrate that this strategy proves successful on the
single-node level. What would be important for actual calculations, however, is a working concept
(along these lines) on the multi-node level. The author’s hope is that vendors will finally provide
efficient PGAS (Coarray Fortran and/or Unified Parallel C) support for CPUs. In fact, this is the ra-
tionale for not sacrificing any of the space-time indices for the SIMD vectorization. In the event the
GPU world offers this feature more promptly, this will be a clear case for switching to OpenACC.
Acknowledgements: Program development and test runs were performed on the DEEP-ER
system at IAS/JSC in Jülich. The author likes to thank Eric Gregory for useful discussion.
References
[1] P. A. Boyle, PoS LATTICE 2016, 013 (2017) doi:10.22323/1.256.0013 [arXiv:1702.00208 [hep-lat]].
[2] A. Rago, EPJ Web Conf. 175, 01021 (2018) doi:10.1051/epjconf/201817501021 [arXiv:1711.01182].
[3] M. Lin, “Machines and Algorithms for Lattice QCD,” talk at Lattice 2018 (these proceedings).
[4] C. DeTar, D. Doerfler, S. Gottlieb, A. Jha, D. Kalamkar, R. Li and D. Toussaint, PoS LATTICE 2016,
270 (2016) doi:10.22323/1.256.0270 [arXiv:1611.00728 [hep-lat]].
[5] C. DeTar, S. Gottlieb, R. Li and D. Toussaint, EPJ Web Conf. 175, 02009 (2018).
doi:10.1051/epjconf/201817502009 [arXiv:1712.00143 [hep-lat]].
[6] S. Durr and G. Koutsou, Phys. Rev. D 83, 114512 (2011) [arXiv:1012.3615 [hep-lat]].
[7] S. Durr, G. Koutsou and T. Lippert, Phys. Rev. D 86, 114514 (2012) [arXiv:1208.6270 [hep-lat]].
[8] H. Neuberger, Phys. Lett. B 417, 141 (1998) [hep-lat/9707022].
[9] H. Neuberger, Phys. Lett. B 427, 353 (1998) [hep-lat/9801031].
[10] Y. G. Cho, S. Hashimoto, A. Juttner, T. Kaneko, M. Marinkovic, J. I. Noaki and J. T. Tsang, JHEP
1505, 072 (2015) [arXiv:1504.01630 [hep-lat]].
[11] S. Durr and G. Koutsou, arXiv:1701.00726 [hep-lat].
[12] S. Durr, EPJ Web Conf. 175, 02001 (2018) doi:10.1051/epjconf/201817502001 [arXiv:1709.01828].
6
