Lattice QCD based on OpenCL by Bach, Matthias et al.
Lattice QCD based on OpenCL
Matthias Bachb, Volker Lindenstruthb, Owe Philipsena, Christopher Pinkea
aInstitut fu¨r Theoretische Physik, Goethe-Universita¨t, Max-von-Laue-Str. 1, 60438 Frankfurt am Main
bFrankfurt Institute for Advanced Studies / Institut fu¨r Informatik - Johann Wolfgang Goethe-Universita¨t,
Ruth-Moufang-Str. 1, 60438 Frankfurt am Main
Abstract
We present an OpenCL-based Lattice QCD application using a heatbath algorithm for the pure gauge
case and Wilson fermions in the twisted mass formulation. The implementation is platform independent
and can be used on AMD or NVIDIA GPUs, as well as on classical CPUs. On the AMD Radeon HD
5870 our double precision 6D implementation performs at 60GFLOPS over a wide range of lattice sizes.
The hybrid Monte-Carlo presented reaches a speedup of four over the reference code running on a
server CPU.
Keywords: Lattice QCD, Heatbath, HMC, Graphic Cards, GPGPU, OpenCL, HPC
Contents
1 Introduction 2
2 Lattice QCD and Monte Carlo simulations 2
2.1 Choice of Lattice action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Ensemble generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3 OpenCL and Graphic Cards 6
3.1 OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2 GPU Architecture and OpenCL Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4 Implementation strategy 7
4.1 Memory requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.2 Global memory storage formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.3 Common code for CPUs and GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5 Performance results 11
5.1 Heatbath performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5.2 6D performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5.3 HMC performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6 Conclusions 17
Email addresses: bach@compeng.uni-frankfurt.de (Matthias Bach), voli@compeng.de (Volker Lindenstruth),
philipsen@th.physik.uni-frankfurt.de (Owe Philipsen), pinke@th.physik.uni-frankfurt.de (Christopher Pinke)
Preprint submitted to Elsevier September 27, 2012
ar
X
iv
:1
20
9.
59
42
v1
  [
he
p-
lat
]  
26
 Se
p 2
01
2
1. Introduction
Lattice Quantum Chromodynamics (LQCD) is the only a priory approach to describing the strong
force. State-of-the-art lattice simulations require high-performance computing and constitute actually
one of the most compute intensive problems overall. With processor clock speeds no longer improving
and processors instead increasing their core count, Graphics Processing Units (GPUs) with their
high peak performance and bandwidth have become an interesting platform for high-performance
computing. In June 2011 three of the top five systems in the Top500 list of supercomputers [2] were
GPU accelerated clusters. An example of heterogeneous architecture for general purpose computing
is the LOEWE-CSC [3], which solely consists of AMD hardware and provides two 12-core AMD
Magny-Cours CPUs and one AMD Radeon HD 5870 GPU in the majority of its nodes. Originally, it
was ranked 22nd in the Top500 list of supercomputers [2] and ranked eighth in the Green500 list of
energy-efficient supercomputers (with 718 MFLOPS/Watt) [4].
Originating in the high-end computer gaming market, GPUs nowadays offer highest computing
capabilities at a very attractive price-per-flop ratio. Current high-end gaming GPUs by AMD and
NVIDIA are priced at about 500 Euros. However, while the pioneer work in the field [5], using
application programming interfaces (APIs) designed for graphics rendering, was in principle platform
agnostic, nearly all later developments in the field were based on NVIDIA CUDA [6], therefore being
limited to hardware by this single GPU vendor.
We present the first application of OpenCL [7] to Wilson fermions, enabling the code to be run on
AMD and NVIDIA GPUs and also on classical CPUs. The work, of which early prototypes have been
shown in [1], has been extended to a full hybrid Monte-Carlo application that shows major performance
gains over a purely CPU based reference code.
We begin by stating the physical problem we want to solve and then explain the tools used.
Afterwards we will describe the important parts of our implementation and end with an analysis of
the performance of our application.
2. Lattice QCD and Monte Carlo simulations
The strong interactions between elementary constituents of matter are described by Quantum Chro-
modynamics (QCD). In this section, we will give an introductory overview of QCD and its evaluation
by means of Monte-Carlo methods. For more detailed information, we refer to [8, 9].
QCD is a SU(NC) gauge theory consisting of gauge and fermion fields. An analytical access by
means of perturbation theory may be applied only in regions where the coupling constant, g, is small.
This is ensured for high momentum-transfer (“asymptotic freedom”). To explore the non-perturbative
regime, euclidean spacetime is discretized on a hypercube with lattice spacing a. Accordingly, the
QCD action SQCD is replaced by a lattice version afflicted with discretization errors,
SLQCD = SQCD + aS1 + a2S2 + . . . , (1)
and continuum physics can be obtained in the limit a→ 0.
In statistical physics, the central object is the partition function Z of the system, which allows for
the measurement of an observable A of interest. On the lattice, the expectation value of A is then:
〈A〉 = Z−1
∫
DUDψDψ¯A exp {−SLQCD[U ]}
= Z−1
∫
DUA detD[U ] exp {−Sgauge[U ]} , (2)
where the exponential in the first line is the Boltzmann factor if one identifies SLQCD = βH. The
grassmann-valued fermion fields ψ can be integrated out exactly, yielding the determinant of the
2
Uµ(x)
ψ(w)
a
Uµ(y)
Uν(y + ~µ)
U†µ(y + ~ν)
U†ν (y)
Uµ(z) Uµ(z + ~µ)
Uν(z + 2~µ)
U†µ(z + ~µ+ ~ν)U
†
µ(z + ~ν)
U†ν (z)
Figure 1: Sketch of lattice discretization at lattice spacing a. w, x, y, z denote lattice sites, ψ a fermion field and U a
gauge link, respectively. Also shown are the plaquette Pµν(y) and the rectangle product of links Rµν(z).
fermion matrix D, and one is left with an integral over all possible gauge configurations U , which are
the relevant degrees of freedom. It is convenient to rewrite the fermion determinant as an integral over
a bosonic field φ (pseudo fermions),
detD[U ] ∼
∫
Dφ exp{−φ†D−1[U ]φ} , (3)
giving an effective action Seff[U, φ] = Sgauge[U ] + φ
†D−1[U ]φ. Since it is unfeasible to solve this high-
dimensional integral, importance sampling methods are used to generate an ensemble of N gauge
configurations {Um} using the Boltzmann-weight p[U, φ] = exp {−Seff[U, φ]} as probability measure.
Then, 〈A〉 can be approximated by
〈A〉 ≈ 1
N
∑
m
A[Um] . (4)
2.1. Choice of Lattice action
There is some freedom in the specific form of the lattice action, since it has to agree with QCD in
the continuum limit only. Different discretizations will suffer differently from discretization artifacts
and often also do not hold all the continuum properties. Especially the discretization of the fermion
matrix is non-trivial, since the naive one gives rise to non-physical fermion copies (“doublers”). It is
also possible to add improvement terms irrelevant in the continuum to the lattice action, which help
to reduce discretization errors.
At this point, it is adequate to insert some words related to the dimensions of the quantities involved
and to fix our notation. We will denote the spatial and temporal extent of the system with Nσ and
Nτ , respectively, so that the total volume of the system is Vtot = N
3
σ × Nτ . The number of colors,
Dirac indices and spacetime dimensions are denoted as NC, NDirac and ND, respectively and they
take on the values 3, 4 and 4. On each lattice site x, the component φ(x) of φ is a complex valued,
(NDirac × NC)-dimensional vector, whereas the gauge field U = Uµ(x) is a complex valued NC × NC
matrix linking x and its neighbour x+ ~µ in spacetime-direction ~µ (therefore called links, see fig. 1).
In our numerical studies [10–12] we employ the tree-level Symanzik improved Wilson action in the
gauge sector,
Stlsym =
β
NC
∑
x
(
c0
∑
µ,ν>µ
{1− Re Tr(Pµν(x))}+ c1
∑
µ,ν
{1− Re Tr(Rµν(x))}
)
. (5)
Here, Pµν(x) and Rµν(x) denote path-ordered plaquette and rectangle products of link variables (see
fig. 1) and the parameters are β = 6/g2, c0 = 1− 8c1 and c1 = 1/12. The unimproved gauge action is
regained setting c1 to zero.
3
In the fermionic sector we use the so-called twisted mass Wilson fermions [13] to simulate two
flavours of fermions, whose matrix on the lattice reads
D±tm = (1± 2iaκµγ5) δxyδαβδab −
κ
2
∑
µ
(1− γ±µ)αβ U±µ(x)abδn+~µ,y
= M±diag + 6D , (6)
with shorthand notation γ−µ = −γµ and U−µ(x) = Uµ(x − ~µ)†. The γµs are matrices in Dirac space
satisfying {γµ, γν} = 2gµν and a, b, α, β are color and Dirac indices, respectively. The sign in the
diagonal mass matrix M±diag corresponds to up and down flavour. Two mass parameters appear, the
twisted mass µ and the usual m (via the hopping parameter κ = (2(am+4))−1). Pure Wilson fermions
can be reobtained at vanishing twisted mass, DWilson ≡ Dtm(µ = 0). Twisted mass fermions have
the advantage that if κ is tuned such that the renormalized quark-mass vanishes for pure Wilson
fermions, the O(a) discretization errors in (1) vanish too. For µ 6= 0 the mass is then determined
solely by µ and the system is said to be tuned to Maximal Twist. In addition, for two flavours one
has detDtm = det
(
D2Wilson + 4κ
2µ2
)
. Thus, the twisted mass acts as an infrared regulator, which is
important in avoiding so called exceptional configurations, for which the fermion matrix has very small
eigenvalues rendering its inversion ill-defined (see below). The most used discretization type besides
Wilson fermions are staggered fermions. Regarding numerics, these differ a lot from the Wilson type
since the “staggering” essentially means that on each site there is only one spinor component instead
of four.
We now want to dwell on the role the fermion matrix plays in LQCD simulations a bit more closely.
For simplicity, we will drop the indices on D from now on. To evaluate the fermion determinant in
the partition function one has to know the inverse of D (see (3)). In addition, the entries of D−1 are
related to the fermion propagator, thus its inversion is also crucial for measuring fermionic observables.
To give an example, the two-point correlation function at sites n and m for a generic meson d¯(n)Γu(n)
is given by:
CΓ ≡ 〈u¯(m)Γd(m)d¯(n)Γu(n)〉 = −Tr
[
Γ(D−1)(n,m)Γγ5((D−1)(n,m))†γ5
]
. (7)
For more involved observables, more occurrences of D−1 will appear.
D is a (NC × NDirac × Vtot)-dimensional sparse square matrix, so one uses iterative Krylov space
based methods to calculate D−1 indirectly out of equations like
Dφ = ψ ⇒ φ = D−1ψ . (8)
During the inversion, the matrix-vector product Dφ has to be carried out many times. The most
common solvers used are CG and BiCGStab. The convergence of the solver depends inversely on
how light the simulated quarks are, and also scales with Vtot. This renders the inversion the most
involved part of the simulation. In addition, there is a lot of machinery around to simplify (8), such
as mass-preconditioning and, most prominently, even-odd-preconditioning.
In a simulation the dimensions of the lattice strongly depend on the physical problem under inves-
tigation. In order to rule out finite size effects the spatial volume is typically large. For simulations
of thermal systems, temperature is set by T = 1/aNτ and thus lattices have a rather small tempo-
ral extent here. Accordingly, in vacuum simulations, the opposite is the case. On the other hand,
thermal systems require simulations over ranges of parameter values and higher statistics compared
to vacuum simulations. Typical lattices have currently Nσ ≈ 32, Nτ ≈ 12 in thermal studies and
Nσ ≈ 64, Nτ ≈ 128 for vacuum simulations. Therefore a thermal lattice is typically simulated using
less parallelism.
4
2.2. Ensemble generation
The standard simulation algorithm to generate QCD gauge configurations is the Hybrid Monte-
Carlo (HMC) algorithm [14], where the effective action is embedded in a fictitious classical system
governed by the Hamiltonian H = 12P
2+Seff[U, φ] by introducing Gaussian momenta P conjugate to U .
Starting from a given configuration (U,P ), the system is evolved over a time τ to a new configuration
(U ′, P ′), according to the hamiltonian equations of motions with a force F,
P˙ = −∂Seff/∂U ≡ F ,
U˙ = P . (9)
This evolution is carried out by numerical integration, the most common integration schemes being the
leapfrog- and the second order minimal (2MN) scheme. φ and P are chosen with a Gaussian distribution
initially and φ is held constant throughout the evolution of the system. Since the numerical integration
is not exact, a metropolis step is carried out in the end, thus ensuring detailed balance. This means
that the new configuration is only accepted with a probability min (1, exp(H[P ′, U ′]−H[P,U ])). Here,
P = Pµ(x) is a real-valued, (NC × NC − 1) dimensional vector, as is the force F = Fµ(x). The latter
has three contributions with our choice of lattice action:
F = F gaugeplaquette + F
gauge
rectangles + F
fermion , (10)
where the gauge part of the action has been divided into plaquette and rectangle parts. These are
proportional to sums of link products, so called staples. For example, (F gaugeplaquette)µ(x) ∼
∑
µ 6=ν P˜µν(x),
where we denote the staple P˜µν(x) as the plaquette product of links without Uµ(x) (see fig. 1). Note
that these sums are no longer elements of SU(NC).
By means of the identity ∂UD
−1 = D−1(∂UD)D−1, the fermion force can be evaluated. Note that
this means that one needs to invert D for the input of the force calculation. In addition, this part of
the force receives contributions from the 6D part of the fermion matrix only since our diagonal matrix
does not depend on U .
Techniques exist to refine the integration of the equations of motions. τ is usually divided into
smaller substeps and the inverter can be preconditioned by introducing additional pseudo fermions
via det(D) = det(A)det(D)det(A) , where A denotes a fermion matrix with a different mass than D (mass
preconditioning). Furthermore, it is beneficial to integrate cheaper force contributions more often
(multiple timescales). For details we refer to [15].
L(0) = U
for i < m = # subgroups:
W = L(i) * Staple
w = project(W,i)
v = update(w)
V = extend(v,i)
L(i) = L(i-1) * V
U = L(m)
Figure 2: Heatbath algorithm
In many cases, one is interested in QCD-like systems without
fermions, what is then called pure gauge theory. In principal, this
gives detM = 1 in the equations above and the HMC algorithm is
still applicable. Nevertheless, the so-called heatbath algorithm [16–
18] provides an alternative and more direct way. It is based on an
exact algorithm for SU(2) that updates a specific link according to
its neighbours. This can be extended to the general SU(NC) case by
systematically reducing the SU(NC) link to a number of SU(2) sub-
groups, which for the NC = 3 case is usually also 3. Figure 2 shows
a sketch of the algorithm, where W, U, V, L denote SU(NC) links and
w, v SU(2) links, respectively. project and extend denote reduction
and extension to and from the ith subgroup. The update routine
corresponds to the aforementioned exact SU(2) update or to an over-
relaxation update [19], which serves to cover the whole configuration space more quickly. Only in the
first update random numbers are necessary.
5
Chip Peak SP Peak DP Peak BW
[GFLOPS] [GFLOPS] [GB/s]
AMD Radeon HD 5870 Cypress 2720 544 154
AMD FirePro V7800 Cypress 2016 403 128
AMD Radeon HD 6970 Cayman 2703 683 176
AMD Radeon HD 7970 Tahiti 3789 947 264
AMD FirePro W8000 Tahiti 3230 810 176
NVIDIA Tesla C1060 Tesla 933 78 102
NVIDIA GeForce GTX 280 Tesla 933 78 142
NVIDIA GeForce GTX 480 Fermi 1345 132 177
NVIDIA GeForce GTX 580 Fermi 1581 198 192
NVIDIA Tesla M2090 Fermi 1331 665 177
NVIDIA GeForce GTX 680 Kepler 3090 258 192
AMD Opteron 6172 Magny-Cours 202 101 42.7
AMD Opteron 6278 Interlagos 307 154 51.2
Intel Xeon E5-2690 Sandy Bridge EP 371 186 51.2
Table 1: Theoretical peak performance of current GPUs and CPUs. SP and DP denote single and double precision,
respectively. BW denotes bandwidth
3. OpenCL and Graphic Cards
Table 1 shows an overview of available GPUs and CPUs. One sees immediately that GPUs surpass
CPUs in peak performance as well as in memory bandwidth. However, one also notes the drop in
performance when going from single to double precision on the GPU. To take advantage of this perfor-
mance an appropriate programming model and a certain understanding of the hardware architecture
is required. One of these programming models is provided by OpenCL.
3.1. OpenCL
GPU applications generally consist of a controlling program (“host”), running on the CPU, that
executes suitable smaller programs (“kernels”) on the GPU. All the memory handling is performed
by the host program. The most prominent library of such kind is NVIDIA’s CUDA [6]. Virtually all
existing LQCD applications are based on CUDA, at the disadvantage that these are destined to run
on NVIDIA hardware exclusively [20–24]. A hardware independent approach to parallel computing
is given by the Open Computing Language (OpenCL [7]), which is an open standard to perform
calculations on heterogeneous computing platforms. Thus, in OpenCL one is not determined to use
CPUs or GPUs, but it is in principal possible to combine these and distribute the computations among
all available computing “devices” (see fig. 3). For this OpenCL defines a programming language that
is based on C99. Implementations of OpenCL can be found both from AMD (AMD Accelerated
Parallel Processing (APP) [25], formerly ATI Stream SDK) and NVIDIA (as part of CUDA), as well
as other vendors. As LQCD applications are predominantly written in CUDA, OpenCL performance
results are scarcely found. There is but one reported for a HMC using staggered fermions [22], where
a significantly lower performance of OpenCL compared to CUDA is reported.
3.2. GPU Architecture and OpenCL Terms
Like modern CPUs, GPUs are multi-core Single Instruction Multiple Data (SIMD) processors.
However, where CPUs are tuned towards processing each thread as fast as possible, the GPU archi-
tecture is tuned towards high throughput over thousands of threads.
6
Figure 3: OpenCL Concept
While the SIMD model on CPUs is based on vector registers,
where a single processor executes an instruction in parallel for
multiple elements in the vector register, the GPUs implement a
variant known as Single Instruction Multiple Threads (SIMT).
On the GPU registers are seen as scalar by each thread, however
the execution units, called “processing elements” in OpenCL ter-
minology, share a common instruction decoder. Therefore a core,
called “compute unit” in OpenCL, always executes a group of
threads in lock-step. The group of lock-stepped threads is basi-
cally equivalent to the SIMD thread on a CPU. However, the memory access is more flexible due to
the scalar nature of the thread execution. On CPUs the group executed on a compute unit in lock-step
might contain only one processing element.
GPUs provide a larger set of registers than CPUs. The AMD Radeon HD 5870 provides 16384
registers, each 16bytes in size. The NVIDIA GTX 480 provide 32768 registers of 4bytes each. These
registers are dynamically mapped to threads. Therefore, a compute unit can run either a smaller
number of threads using even more than a hundred registers each, or more than a thousand threads,
each using only a dozen registers each. Like hyper-threading on a CPU, running more threads will
allow to hide memory latencies, increasing overall throughput. The scheduling of the thread groups
is performed by a hardware scheduler with minimal overhead. Registers stay allocated to each thread
from its creation until it finished execution.
The memory architecture of GPUs is more complex than that of CPUs, which appears uniform
to the user. It is split into multiple logical regions. Global memory is the normal main memory of
the GPU that can be read and written to by all threads running on the GPU. Private memory is a
part of global memory that is partitioned among all threads running on the GPU. When addressing
into private memory each thread accesses its own partition. This memory is also used as a swapping
place for registers if the register file cannot hold a threads full working set. Registers swapped into the
private memory are also known as scratch registers. As private memory is part of the global memory
it shares the same performance characteristics. This means, large latencies can be incurred when
accessing data from local memory. Therefore, usage of scratch registers usually comes with a large
performance penalty. Another part of global memory is constant memory. That memory can only
be written to from the host. GPUs usually are able to cache accesses to this memory and broadcast
values from this memory to all threads very efficiently. In addition modern GPUs also provide a local
memory. Just like registers, local memory is on-die and can be accessed with similar performance.
Local memory is shared between threads running on the same compute unit and can be used as a
user programmed explicit cache. Allocations of memory on the GPU are referred to as “Buffers” by
OpenCL.
Stemming from their graphics tradition, GPUs originally only had dedicated read-only caches for
constant memory and textures. Textures are images stored in memory in a special format. On the
AMD Radeon HD 5870 and 6970, when keeping to some restrictions, the AMD OpenCL compiler is
capable to automatically utilize the texture cache to access buffers which are only read by a kernel.
More modern GPUs like the NVIDIA GTX 480 and the AMD Radeon HD 7970 provide multi-level
read-write caches. While CPU caches target to minimize latencies in memory access for a single thread,
GPU caches are shared by many threads. One of their main functions is to coalesce access by multiple
threads to close-by addresses into single memory transactions.
4. Implementation strategy
Since OpenCL needs a quite different approach to LQCD than existing software, we decided to start
from scratch instead of modifying an existing application. Consequently, all parts of the simulation code
7
are carried out in OpenCL. We set up the host program in C++, which nicely allows for independent pro-
gram parts using C++ classes and also naturally provides extension capabilities. The main object is the
class gaugefield, where the initialization process of OpenCL is incorporated. It manages the physical
gauge field when several devices are used and holds an application-specific number of opencl device-
C++-Hostprogram
Gaugefield
Get Platform Init Context
Task 0 Task 1 . . .
Init Queue
OpenCL device 0 OpenCL device 1 . . .
Compile/Input-
Parameters
Collect Sourcefiles/Compileoptions Init Buffers/Kernels
Further Execution. . .
Figure 4: Schematic flow of program initialization
objects. This class in turn contains all
compute device-related parts, such as ker-
nels, and eventually executes the kernels
on a specific device. Child classes of
gaugefield and opencl module contain
problem-related functionality and provide
algorithmic logic. For hybrid applications,
different “tasks” can be defined to carry out
a physics problem. These may then again
contain a number of device objects them-
selves. The OpenCL environment has to be
initialized before the actual calculations can
be carried out (Fig. 4). To generate the ker-
nels, the code files are collected, compiled
and linked into an OpenCL “program“ using the OpenCL compiler. To ease debugging we build each
kernel as a stand-alone program. The binary files, which the compiler produces during the kernel
compilation, can be reused at a later point and also provide information about the kernel, e.g. register
usage statistics, important for benchmarking and optimization. Thus, the kernels are created only at
runtime of the application, which allows to pass runtime parameters (e.g. NT, NS, . . . ) to them as
compile-time parameters. We can avoid many kernel arguments like this, and parameters are “hard
coded” into the kernel code. On the device all data types are implemented as structures, with all
the required operations defined for them. Although this might in some cases require more registers
than simply operating on arrays of scalars stored in main memory, we opted for this implementation
strategy for its better code readability.
It is not trivial to estimate how much the register overhead of the structure based implementation
strategy is, as the exact register requirements highly depend on the optimizer. The possibly higher
register requirements of the structure based approach can easily be seen when looking at the addition
of two structures of four floats. When operating on arrays of scalars there only has to be space for
three floats in registers. As addition is element-wise, only one float from each operand has to be
loaded at the same time. Additionally there has to be room for one element of the result. Using
actual structures four times this space is required, as the whole structure has to be stored for each
operand and the result. In the case of spinors register requirements would increase from three floats
to 72. This is, however, a worst case situation as for most operations, e.g. multiplication of a SU3
matrix with a spinor, more than one element of each structure is required at the same time anyway.
In addition, given the high latency of GPU memory it does not make sense to completely serialize
handling of each element in a structure. Therefore the register usage should be higher even when
performing all operations using scalar types, as the optimizer will use different registers for different
elements to enable the exploitation of instruction-level parallelism.
As up till now they have not been relevant for the overall runtime, all LAPACK operations have
been implemented in a straightforward manner. The only optimization was implicitly given by the
data type storage format which is described below. ILDG compatible I/O has been implemented as
well as the Ranlux [26] PRNG (Pseudo Random Number Generator), as it is the standard choice
for LQCD simulations. We use the original implementation on the host while on the device we use
8
RANLUXCL1, an open-source OpenCL implementation of Ranlux. For testing purposes we have also
implemented the NR3 [27] generator, but it has not been used for the measurements in this paper.
Initialization of the random number generator follows the usual Ranlux rules which are applied across
the host and the device, where each OpenCL thread runs on its own Ranlux PRNG state.
Since on different GPU drivers we have observed multiple miscompilations of our code during
development, we added regression tests for most of our OpenCL functionality. This allows us to
quickly check new drivers for incorrect output. A special challenge is that the likelihood for compiler
errors scales with code complexity. Thus, a function might work perfectly in a simple test case but
will produce errors when integrated into a larger kernel. Therefore it is important to not only test
each building block for regressions, but also repeatedly check whether they still work as expected when
being used in larger kernels.
4.1. Memory requirements
LQCD calculations are always memory bound. To see that consider the theoretical peak perfor-
mances collected in table 1 and the characteristics of some kernels displayed in table 2. Especially for
the fermion related kernels, looking at the respective ratios of bandwidth to flops (numerical density)
shows how the memory bandwidth dominates performance. One the other hand, the heatbath related
kernels may be expected to perform well on GPUs, since here the bandwidth is somewhat less impor-
tant, compared to the flops the kernels perform. However, the numerical density is still low compared
to the peak flops per bandwidth ratio of the GPUs.
Table 3 shows that the gauge field is the biggest object to store. In addition, one needs more than
one field of each species during the simulations. However, there are possibilities to reduce these sizes,
most prominent the aforementioned even-odd preconditioning for the spinorfields. In general, each
gauge link may be displayed with N2C − 1 real numbers, one for each generator of the group. Since
this representation induces additional computational overhead, other methods are more feasible. For
NC = 3, one column of the matrix can be reconstructed from the other two exactly, since it has to
obey ~c = ~a × ~b , so one can discard 6 floats of the full matrix (REC12). Additional restrictions on
the left 12 numbers allow to save 2 or 4 more floats (REC10 and REC8, respectively). However, the
last procedure may run into arithmetic problems during the reconstruction [20]. As the reconstruction
adds additional complexity to the code we have not used it so far except for the 6D kernel. Most of our
kernels are already at the edge of the register limit and additional code complexity tends to increase
the register pressure, even if with perfect register reuse no additional registers would be required.
4.2. Global memory storage formats
Our first implementation of the 6D kernel used an array of structures (AoS) storage format for the
gauge and spinor fields. Observing this kernel reaching only about 22GB/s of memory throughput on
the AMD Radeon HD 5870 we started studying characteristics of the memory controller.
First we checked the effects of utilizing the texture cache, which on AMD hardware can be done
with minimal code modification. While this provides a significant speedup, the achieved 46GB/s are
still far from that GPUs theoretical bandwidth limit of 155GB/s.
Another speedup can be reached by specifying a proper alignment for the larger data types. The key
is to use largest possible alignment that is still a divisor of the data type size. If no alignment is specified
the compiler will use the alignment of the smallest contained data type, resulting in superfluous memory
fetches. While the number of fetches in some cases could be reduced even further by using an alignment
that is larger than the data type, it turns out that the additional bandwidth required overcompensates
the benefit. Using proper alignment of all types the code reaches 68GB/s, which is still far from the
performance limit.
1https://bitbucket.org/ivarun/ranluxcl
9
float double float4 su3 spinor
0
50
100
150
200
G
B
y
te
s/
s
AMD V7800
AMD 5870
AMD 6970
AMD 7970
Figure 5: Copy performance using different data types to copy a buffer of 100MiB. Note that the AMD Radeon HD
5870 was run using the Catalyst 11.7 driver, while all other measurements where performed using Catalyst 12.4.
Figure 5 shows how using different data types for global memory access effects the speed of copying
a fixed size buffer on the GPU. The float and double types are scalar data types of 4 bytes and 8 bytes
size. The float4 type is a built-in data type of OpenCL that acts like a struct of four floats and
therefore has a size of 16 bytes. The types su3 and spinor are complex structs of size 144 bytes and
192 bytes, respectively. The figure shows that the best way to access memory is using the double or
float4 data type. Actually all other types with the same size and alignment will work, too. Therefore
we now use a double precision floating point complex type as the base storage type for all our data in
GPU global memory. From size and alignment this type is equivalent to the float4 type. All other
types are stored in a SoA fashion based on this type. This allows to utilize memory bandwidths of
up to 120GB/s on the AMD Radeon HD 5870. Care has to be taken, though, as SoA increases the
register pressure. This can lead to register spilling and therefore cause a performance penalty.
Another important effect is given by the mapping of the gauge field indices to memory addresses. As
we are using even-odd preconditioning, neighboring threads will always access either links originating
in even or those originating in odd sites only. If links originating in even sites are stored interleaving
with those originating in odd sites, which is the naive format, and a SoA pattern is used, a memory
access will span data of both types of links. As neighboring threads will only require one type of link
at that moment, half the memory bandwidth will be wasted. Therefore we split our gauge field such
that links originating in even sites are stored separate by those originating in odd sites.
4.3. Common code for CPUs and GPUs
As OpenCL can be used both for CPU and GPU programming, we use a single source code for
the CPU and the GPU implementation. To cater for the different architectures, we introduces some
abstractions to achieve best performance in both worlds.
An important difference between the CPU and the GPU is the optimal looping strategy. On CPU
loops perform best when each CPU works on its own consecutive block of memory. Thereby, it can
best utilize its time-local cache to reduce the number of actual requests performed on the memory.
On the GPU however, the best memory throughput is achieved if consecutive cores read consecutive
elements from memory. Therefore, a loop should always move though the data in a strided way. We
use a macro PARALLEL_FOR to implement these different looping strategies transparently.
10
kernel RW-Size Flops
6D 2880 1632
F gaugeplaquette 2800 2717
F gaugerectangles 13168 14813
Ffermion 4736 1748
heatbath 2880 3912
overrelax 2880 3846
saxpy 608 64
Table 2: Read-Write (RW) sizes in Bytes and flop sizes
for some HMC related kernels (for double precision).
All numbers are per site (and direction). “saxpy” cor-
responds to the algebraic operation ~z = α~x+ ~y.
general size [Vtot] Bytes [Vtot]
φ NDirac ×NC × C 192
φeo (NDirac ×NC × C)/2 96
U N2C ×ND × C 576
UREC12 2NC ×ND × C 384
UREC10 (2NC − 1)×ND × C 320
UREC8 (2NC − 2)×ND × C 256
P, F (N2C − 1)×ND 256
Table 3: Overview over memory requirements of
LQCD quantities in double precision per site (and di-
rection). C denotes the size of one complex number (2
real numbers).
Another important difference between CPU and GPU is that the GPU prefers a SoA pattern, while
the CPU tends to prefer an AoS pattern. Therefore we encapsulated all memory accesses into separate
functions which transparently perform the SoA conversion if required. When moving data onto or
from a device the same automatism occurs, ensuring best memory access patterns on all devices.
While OpenCL provides vector data types, which the AMD platform uses for vectorization on the
CPU, we did not use those in our code. Besides complication of the source code, they would increase
the amount of registers required on the GPU which are already a sparse resource.
5. Performance results
In this section we test our implementations of the heatbath and HMC algorithm and in addition
report on the performance of the 6D, the crucial time consuming part for fermionic observables and
the HMC itself. We tested our LQCD implementations on different architectures with different lattice
sizes and compared the results to existing applications and literature data. Since we are interested in
thermal systems, we benchmarked our programs on lattice sizes ranging from Nσ= 16, 24, 32, 48 and
Nτ= 4, 8, 12, 16. The 6D and heatbath kernels have also been benchmarked for some additional values
of Nτ to explore a wider range of used memory sizes.
As double precision results are desired for physical measurements we concentrated on such precision
calculations throughout. However, single precision calculations do promise twice the performance for a
given device’s memory bandwidth and have provided good results in other applications running on the
same GPUs[28]. Therefore our results are only a lower bound for the achievable performance. For the
heatbath algorithm, where no summations over the whole lattice are required, single precision results
are given.
The GPUs used are an AMD Radeon HD 5870, as it is used in the LOEWE-CSC, and an AMD
FirePro V7800, which is the professional grade version of the AMD Radeon HD 5870. It is clocked
somewhat lower and equipped with more memory. In addition we also used the newer AMD Radeon
HD 6970 and the AMD Radeon HD 7970, which is the latest GPU available by AMD. For comparison,
the 6D benchmarks have also been executed on the NVIDIA GeForce GTX 480, which is of comparable
age to the AMD Radeon HD 5870, and the NVIDIA GeForce GTX 680, which is the latest GPU
available by NVIDIA. To show the flexibility of OpenCL, we also simulated on Intel Xeon E5520,
AMD Opteron 6172 and AMD Opteron 6278 CPUs. All of the benchmarks, except those on the AMD
Radeon HD 7970, used Catalyst 12.4 which is the most current available AMD GPU driver available
at the time of writing. On the AMD Radeon HD 7970 the Catalyst 12.2 was used. The NVIDIA GPUs
were using version 295.41 of NVIDIA’s GPU driver.
11
1
6
^
3
x
1
6
1
6
^
3
x
3
2
1
6
^
3
x
4
8
3
2
^
3
x
8
2
4
^
3
x
2
4
2
4
^
3
x
3
2
3
2
^
3
x
1
6
2
4
^
3
x
4
8
3
2
^
3
x
2
4
4
8
^
3
x
8
3
2
^
3
x
3
2
4
8
^
3
x
1
2
3
2
^
3
x
4
8
4
8
^
3
x
1
6
Lattice Size
10
20
30
40
50
60
70
G
fl
o
p
s
AMD FirePro V7800
AMD Radeon HD 5870
AMD Radeon HD 6970
2 Intel Xeon E5520
2 AMD Opteron 6278
1
6
^
3
x
1
6
1
6
^
3
x
3
2
1
6
^
3
x
4
8
3
2
^
3
x
8
2
4
^
3
x
2
4
2
4
^
3
x
3
2
3
2
^
3
x
1
6
2
4
^
3
x
4
8
3
2
^
3
x
2
4
4
8
^
3
x
8
3
2
^
3
x
3
2
4
8
^
3
x
1
2
3
2
^
3
x
4
8
4
8
^
3
x
1
6
Lattice Size
20
40
60
80
100
120
140
160
180
G
fl
o
p
s
AMD FirePro V7800
AMD Radeon HD 5870
AMD Radeon HD 6970
2 Intel Xeon E5520
2 AMD Opteron 6278
Figure 6: Performance of the single precision heatbath and overrelax kernels.
5.1. Heatbath performance
First, we present results of our implementation of the heatbath algorithm, that is we concentrate
on the heatbath and overrelax kernels. Figure 6 shows the performance in GFLOPS achieved in single
precision. While we don’t manage to saturate device bandwidth, in both kernels the Cypress and
Cayman based GPUs provide at least 30GFLOPS for the heatbath kernel for any lattice size. The
overrelax kernel performs a lot better, here we can achieve at least 100GFLOPS on the GPU, peaking
up at about 160GFLOPS on the AMD Radeon HD 6970. This behavior can most likely be explained
by how the compiler manages with the random number generator, since this is the essential difference
between both kernels. One can also see the different performance of different card generations. CPU
performances are also given, but they are only a factor 1/4 of those on the GPUs. However, the
overrelax kernel again outperforms the heatbath kernel. The comparison is not entirely fair, as the
CPU code did not receive any optimizations besides the proper looping strategy and is not using a
SoA access pattern. Vectorization might be able to close the gap, although one should keep in mind
that this is a comparison of two server CPUs to one consumer GPU, where the latter is a much less
costly solution.
Not shown is the double precision case, where on the contrary, the performance is of the order
of 10GFLOPS and thus poor for both kernels on all GPUs. Again, the overrelax kernel performs
better than the heatbath kernel, however, this time the difference is small. The massive drop in
performance can be understood when looking at the working set sizes of the kernels. Both kernels
require storing of a relatively large amount of data throughout kernel execution, rendering register
reuse difficult. Therefore, in double precision the working set for each thread gets too large to fit
into the available registers, leading to massive spilling to scratch registers. This reduces performance
12
by both, the additional latency when accessing the spilled data, as well as the increased memory
bandwidth required to access the spilled data. Therefore, effective algorithmic density is less than
what is given in Table 2. However, this is also specific to heatbath and overrelax kernels, which can
be seen in the performances reported below.
Despite performance, the limiting factor for physical studies here is clearly the GPU main memory.
It is limited to 1GB on the AMD Radeon HD 5870 and to 2GB on the other two. This sets an upper
limit for possible lattice sizes. Additionally, on AMD GPUs prior to the HD 7000 generation there is
no official support to use more than half of the GPUs memory from an OpenCL application. The limit
can however be circumvented by setting environment variables as specified by the AMD Knowledge
Base[29]. Also note that an OpenCL runtime can do a multitude of things when it runs out of physical
memory on the GPU. While current implementations all seem to give an error when running out of
memory, from the API specification it is perfectly legal to swap buffers to host memory, which will
come with a major performance penalty.
Comparing the heatbath kernel performance with [24], where performances of a CUDA based
heatbath implementation on NVIDIA GeForce GTX 295 and NVIDIA GeForce GTX 580 is given,
we observe only half of the reported single precision performances for the heatbath kernel. The
overrelax kernel performs at around 90% of the peak value of [24] for the AMD Radeon HD 5870 and a
slightly better performance for the AMD Radeon HD 6970 is found. Note that we do not use memory
bandwidth reducing storage techniques for the links as they are used in [24] (REC12). Other than the
code in [24], our implementation can work with any number of lattice points that fit into the GPU
memory and is not limited by the number of threads used. This shows that all the tested GPUs can
in principle be used to efficiently run the algorithm.
5.2. 6D performance
For simulations including fermions the inversion of the fermion matrix is essential and thus the
performance of evaluating Dφ is crucial. Since for twisted mass fermions the diagonal part of D is not
dependent on the gauge field, the 6D performance has to be considered most carefully. We measured
it on various lattice sizes in double precision, the results can be seen in figure 7.
The slowest of our GPUs, the AMD FirePro V7800 levels at nearly 60GFLOPS and is above
50GFLOPS even for small lattices like 163 × 4. The gamer GPUs AMD Radeon HD 5870 and AMD
Radeon HD 6970 scale nicely with their higher peak memory bandwidth, achieving about 60GFLOPS
and 80GFLOPS, respectively. In addition, one can observe better performances with the AMD Radeon
HD 5870 for lattices sizes that have a spatial extent that is a multiple of 16, where the kernels performs
at about 70GFLOPS. The performance of the AMD Radeon HD 7970 is not better than that of the
AMD Radeon HD 6970, however this GPU was measured using a preview driver. As that driver does
not provide any useful resource usage statistic we did not optimize our kernel for that GPU. As we
cannot ensure proper register usage we simply reused the memory stride optimization found to be
optimal on the AMD Radeon HD 5870.
Also shown is the bandwidth utilization, where apart from the unoptimized AMD Radeon HD 7970
at least 70% of the theoretical peak bandwidth are reached on all AMD GPUs. On the AMD Radeon
HD 6970, when running at 140GB/s, 80% of the peak bandwidth are achieved, peaking up to 86% for
lattices like 163 × 24. For the AMD Radeon HD 5870, the utilization peaks at 120GB/s (about 78%
of the theoretical peak bandwidth), which is actually even higher than the about 105GB/s achieved
on this GPU in a simple float4 copying kernel as shown in figure 5. This can be explained by the
loop-unrolled characteristic of the SoA memory access and the higher read-to-write ratio of the 6D
kernel, as only reads can benefit from the cache of the AMD Radeon HD 5870.
Additionally, we give 6D performances obtained on the NVIDIA GPUs GeForce GTX 480 and
GeForce GTX 680. Both cards constantly reach about 25GFLOPS and 40GB/s, respectively. The
latter corresponds to about 20% of the theoretical peak. The NVIDIA GPUs show register spilling,
13
1
6
^
3
x
4
1
6
^
3
x
1
6
1
6
^
3
x
2
8
2
4
^
3
x
1
2
2
4
^
3
x
1
6
2
4
^
3
x
2
0
2
4
^
3
x
2
4
3
2
^
3
x
1
2
4
8
^
3
x
4
3
2
^
3
x
1
6
3
2
^
3
x
2
0
3
2
^
3
x
2
4
3
2
^
3
x
2
8
3
2
^
3
x
3
2
4
8
^
3
x
1
2
Lattice Size
10
20
30
40
50
60
70
80
90
G
fl
o
p
s
AMD HD 5870
AMD FirePro V7800
AMD HD 6970
AMD HD 7970
NVIDIA GTX 480
NVIDIA GTX 680
1
6
^
3
x
4
1
6
^
3
x
1
6
1
6
^
3
x
2
8
2
4
^
3
x
1
2
2
4
^
3
x
1
6
2
4
^
3
x
2
0
2
4
^
3
x
2
4
3
2
^
3
x
1
2
4
8
^
3
x
4
3
2
^
3
x
1
6
3
2
^
3
x
2
0
3
2
^
3
x
2
4
3
2
^
3
x
2
8
3
2
^
3
x
3
2
4
8
^
3
x
1
2
Lattice Size
20
40
60
80
100
120
140
160
B
a
n
d
w
id
th
 G
B
/s
AMD HD 5870
AMD FirePro V7800
AMD HD 6970
AMD HD 7970
NVIDIA GTX 480
NVIDIA GTX 680
Figure 7: Performance and memory bandwidth utilization of the 6D kernel on several GPUs (double precision).
which is probably the performance limiting factor. While the NVIDIA GPUs have a smaller register
file than their AMD counterparts, that is only part of the problem. In CUDA it is possible to have the
compiler increase the number of available registers at the cost of fewer threads being able to run on the
GPU concurrently. In addition, it is possible to reconfigure the ratio of L1 cache to shared memory. A
larger L1 cache would probably avoid some of the slowdown due to the spilled registers. Sadly neither
of these features is exposed when using OpenCL on the NVIDIA hardware. With the current code,
the AMD GPUs achieve 3-4 times more performance with the same kernel, but the comparison is not
entirely fair, as the AMD GPUs have been the primary development platform.
There are a couple of reported performances in the literature, which are all based on CUDA
implementations. Using the AMD Radeon HD 5870 as our reference value, our 6D kernel is nearly
twice as fast as the one shown in [20], even though that kernel uses REC12 to reduce bandwidth
requirements. One should however keep in mind that the NVIDIA GeForce GTX 280 is one generation
older than the AMD Radeon HD 5870. Our 6D is also 40% faster than that shown in [23]. That
performance was measured on a NVIDIA GeForce GTX 480 which, as mentioned above, is as old as
the AMD Radeon HD 5870 used by us.
Furthermore, figure 8 shows the effect of REC12 on the 6D performance exemplarily for the AMD
Radeon HD 5870 and the AMD FirePro V7800. This technique reduces the RW load of the kernel
by 13% (see table 3), and thus should result in a visible speedup. An increase in performance up to
9% can indeed be observed and proves again that the performance is bandwidth limited. Note that
we did not include the computational overhead induced by the reconstruction technique in the FLOP
count. Compared to [20], where actually a decrease in double precision performance is reported when
decreasing the RW load, these results support the good performance picture of our 6D implementation.
14
1
6
^
3
x
8
1
6
^
3
x
1
6
1
6
^
3
x
2
4
3
2
^
3
x
4
2
4
^
3
x
1
2
2
4
^
3
x
1
6
2
4
^
3
x
2
0
2
4
^
3
x
2
4
3
2
^
3
x
1
2
4
8
^
3
x
4
3
2
^
3
x
1
6
3
2
^
3
x
2
0
3
2
^
3
x
2
4
Lattice Size
10
20
30
40
50
60
70
80
G
fl
o
p
s
AMD HD 5870 REC12
AMD HD 5870 w/o REC12
AMD FirePro V7800 REC12
AMD FirePro V7800 w/o REC12
Figure 8: Performance of the 6D kernel using REC12 (double precision).
setup A B C
aµ 0.0025 0.0035 0.1
mpi 260 310 520
Table 4: Setups used for benchmarking. β = 3.9 and κ = κc = 0.160856 were used throughout to simulate at maximal
twist. The value of mpi is only approximate.
5.3. HMC performance
In order to test our HMC program under realistic conditions, we used setups corresponding to one
heavy and two lighter pion masses, see table 4. The parameters were chosen according to [30] in order
to simulate at maximal twist. In each setup, we started from a prior generated gauge configuration
and performed 10 HMC steps with τ = 1 (Setup C) and τ = 0.1 (Setup A and B), respectively.
We used separate time scales for the gauge and fermion parts and on each the 2MN integrator with
10 integration steps. To gain statistics, each run has been carried out O(10) times. The OpenCL
application was run on the AMD FirePro V7800 and the AMD Radeon HD 6970. For comparison, we
performed the same measurements using the CPU-based application tmlqcd [31], running on different
numbers of cores on the LOEWE-CSC (AMD Opteron 6172 CPUs). Statistical errors have been
neglected since they were found to be below the percentage limit.
The results for setup C are shown in figure 9, where runs for fixed Nτ = 8 and fixed Nσ = 24 have
been performed. The reference tmlqcd was executed on one and two whole nodes, which means 16 and
32 and 24 and 48 CPU cores, respectively, for the different lattice sizes, since the lattice extent has to
be divisible by the number of cores here.
One can see nice scaling in all reference runs, the additional MPI overhead is only visible for small
lattices. The GPU runtimes are always comparable to or better than the reference values. Especially
the AMD Radeon HD 6970 can achieve a better performance than 32 and 48 CPU cores throughout,
respectively. Considering the Nσ = 24 runs, we measure a speedup of the AMD FirePro V7800 of
about 1.7 compared to the 24 core reference runs and a slow-down of 0.9 compared to the 48 core runs.
The AMD Radeon HD 6970 is faster in both comparisons, about 2.5 and 1.3, respectively. Note that
this is a comparison to two and four full CPUs, where each CPU is more expensive then the GPUs
used. We did not perform reference runs on a single CPU core, but we observed a scaling of about 0.75
when running tmlqcd on more than one core, which can be explained by the additional MPI overhead.
This scaling yields a speedup of about 30 for the AMD FirePro V7800 and a speedup of about 44 for
the AMD Radeon HD 6970, compared to one AMD Opteron 6172 core. However, this comparison is
a bit academic, since one would practically not use only one core of a multi core CPU.
A similar picture emerges also for the lighter pion masses (see figure 10), where the runtime scales
15
0 2,000 4,000 6,000
16
24
32
Seconds
N
σ
tmlqcd on 16 Cores
tmlqcd on 32 Cores
AMD FirePro V7800
AMD Radeon HD 6970
0 1,000 2,000 3,000 4,000
4
8
12
16
Seconds
N
τ
tmlqcd on 24 Cores
tmlqcd on 48 Cores
AMD FirePro V7800
AMD Radeon HD 6970
Figure 9: HMC runtimes in seconds for setup C (see table 4) for fixed Nτ = 8 and Nσ = 24, respectively.
approximately the same on all systems. For these setups, the inversion of the fermion matrix takes
longer. Therefore the force calculation, which only achieves about half the bandwidth utilization of the
6D kernel and takes 40% of the total execution time in setup C, becomes less important for the overall
performance. The limited performance of the force calculation is also caused by register spilling, even
though it is not as problematic as in the similar heatbath kernels.
Not displayed are additional runs of the OpenCL code which we did on AMD Opteron 6172 CPUs.
Since no specific optimizations for the CPUs have been carried out, these needed significantly more
runtime. However, switching between both types of devices is smooth, showing that OpenCL code is
absolutely versatile.
The comparison of these results to the extensive tests of the CUDA based HMC for the staggered
fermion discretization reported in [22] is somewhat difficult because of the fundamental differences
0 1,000 2,000 3,000 4,000 5,000 6,000 7,000
A
B
C
Seconds
tmlqcd on 24 Cores
tmlqcd on 48 Cores
AMD Radeon HD 5870
AMD Radeon HD 6970
Figure 10: HMC runtimes in seconds for setups A, B, and C (see table 4) for a lattice with Nτ = 8 and Nσ = 24.
16
between the two discretizations. Additionally, the authors of [22] mainly use single precision within
the HMC, whereas we perform double precision calculations throughout. Although we did not test
this, the results obtained for the 6D on NVIDIA cards strongly suggest that our code would currently
perform significantly slower on these cards compared to AMD GPUs. This is the opposite picture to
the significantly slower performance of the staggered HMC reported in [22]. Just like our NVIDIA
results their AMD results are probably affected by the choice of the primary development platform,
resulting in less optimization work in the other. In addition the staggered fermionic data types should
be smaller, reducing the working set and therefore avoiding register spilling. This should especially
benefit the NVIDIA GPUs used, as their register file is smaller than that of the AMD GPU.
6. Conclusions
We have presented the first implementation of LQCD using OpenCL and shown it is working on
CPUs as well as the two major GPU platforms, NVIDIA and AMD.
In single precision our heatbath implementation is able to achieve a competitive 160GFLOPS on
the AMD Radeon HD 6970, while the double precision variant is still work in progress. Fine tuning
the applied memory optimizations, which are currently tuned for double precision codes, and the use
of bandwidth-reducing techniques like REC12 should provide further speedups here.
For the Wilson 6D kernel we were able to show excellent performance, utilizing more than 70% of
the available memory bandwidth for all lattice sizes on multiple AMD GPUs outperforming published
performances of CUDA based codes. We also see a positive effect of REC12 on the performance.
Extended to a full HMC for twisted-mass Wilson fermions we showed a speedup factor of four of our
code running on an AMD Radeon HD 5870 compared to a reference code running on an AMD Opteron
6172.
Further speedups to our HMC will be reached by further optimizing the inverter performance. One
way to reach this will be a mixed precision solver, which up till now we avoided due to the small
memory of the AMD Radeon HD 5870. As proposed in [1] we will also start to use all compute
resources on hybrid systems by performing parts of the calculations on the CPU. This will require the
SIMD capabilities of the CPUs, too. One approach for this could be the redefinition of our base types
for each device. In addition, we are currently investigating possible performance gains by using REC12
in the force calculation. More complex reduction techniques may also be of advantage here. Finally, to
reduce processing time and circumvent memory limitations we will expand our code to run on multiple
GPUs.
Acknowledgments
O. P. and C. P. are supported by the German BMBF grant FAIR theory: the QCD phase diagram at
vanishing and finite baryon density, 06MS9150. M. B., O. P, and C. P. are supported by the Helmholtz
International Center for FAIR within the LOEWE program of the State of Hesse. M.B. and C.P. are
supported by the GSI Helmholtzzentrum fu¨r Schwerionenforschung. C.P. acknowledges travel support
by the Helmholtz Graduate School HIRe for FAIR.
The authors want to thank Lars Zeidlewicz for his participation in the early stages of this project.
Some of the calculations have been performed on LOEWE-CSC. The authors thank the LOEWE-CSC
team for all the support.
References
[1] O. Philipsen, C. Pinke, C. Scha¨fer, L. Zeidlewicz, M. Bach, LatticeQCD using OpenCL, PoS
LATTICE2011 (2011) 044. arXiv:1112.5280.
17
[2] Top500 List – June 2011.
URL http://www.top500.org/list/2011/06/100
[3] M. Bach, M. Kretz, V. Lindenstruth, D. Rohr, Optimized HPL for AMD GPU and multi-core
CPU usage, Comput. Sci. 26 (3-4) (2011) 153–164. doi:10.1007/s00450-011-0161-5.
[4] The Green500 List – November 2010.
URL http://www.green500.org/lists/2010/11/top/list.php
[5] G. I. Egri, Z. Fodor, C. Hoelbling, S. D. Katz, D. Nogradi, K. K. Szabo, Lattice QCD as a video
game, Computer Physics Communications 177. arXiv:0611022, doi:10.1016/j.cpc.2007.06.
005.
[6] NVIDIA, NVIDIA CUDA C Programming Guide.
URL http://developer.nvidia.com/nvidia-gpu-computing-documentation
[7] Khronos Working Group, The OpenCL Specification.
URL http://www.khronos.org/registry/cl/
[8] T. DeGrand, C. DeTar, Lattice Methods for Quantum Chromodynamics, World Scientific, 2006.
[9] C. Gattringer, C. Lang, Quantum Chromodynamics on the Lattice: An Introductory Presentation,
Springer, 2010.
[10] O. Philipsen, L. Zeidlewicz, Cutoff effects of Wilson fermions on the QCD equation of state to
O(g2), Phys. Rev. D81 (2010) 077501. arXiv:0812.1177, doi:10.1103/PhysRevD.81.077501.
[11] E.-M. Ilgenfritz, K. Jansen, M. P. Lombardo, M. Mu¨ller-Preuker, M. Petschlies, O. Philipsen,
L. Zeidlewicz, Phase structure of thermal lattice QCD with Nf = 2 twisted mass Wilson fermions,
Phys. Rev. D80 (2009) 094502. arXiv:0905.3112, doi:10.1103/PhysRevD.80.094502.
[12] F. Burger, E.-M. Ilgenfritz, M. Kirchner, M. P. Lombardo, M. Mu¨ller-Preussker, O. Philipsen,
C. Urbach, L. Zeidlewicz, The thermal QCD transition with two flavours of twisted mass fermions
(2011). arXiv:1102.4530.
[13] A. Shindler, Twisted mass lattice QCD, Phys. Rept. 461 (2008) 37–110. arXiv:0707.4093,
doi:10.1016/j.physrep.2008.03.001.
[14] S. Duane, A. D. Kennedy, B. J. Pendleton, D. Roweth, Hybrid Monte Carlo, Phys. Lett. B195
(1987) 216–222. doi:10.1016/0370-2693(87)91197-X.
[15] C. Urbach, K. Jansen, A. Shindler, U. Wenger, HMC algorithm with multiple time scale inte-
gration and mass preconditioning, Comput.Phys.Commun. 174 (2006) 87–98. arXiv:hep-lat/
0506011, doi:10.1016/j.cpc.2005.08.006.
[16] M. Creutz, Monte Carlo Study of Quantized SU(2) Gauge Theory, Phys. Rev. D21 (1980) 2308–
2315. doi:10.1103/PhysRevD.21.2308.
[17] N. Cabibbo, E. Marinari, A new method for updating SU(N) matrices in computer simulations
of gauge theories, Physics Letters B 119 (4-6) (1982) 387 – 390. doi:10.1016/0370-2693(82)
90696-7.
[18] A. Kennedy, B. Pendleton, Improved heatbath method for monte carlo calculations in lattice gauge
theories, Physics Letters B 156 (5-6) (1985) 393 – 399. doi:10.1016/0370-2693(85)91632-6.
18
[19] R. Petronzio, E. Vicari, An overrelaxed Monte Carlo algorithm for SU (3) lattice gauge theories,
Physics Letters B 248 (1990) 159–162. doi:10.1016/0370-2693(90)90032-2.
[20] M. A. Clark, R. Babich, K. Barros, R. C. Brower, C. Rebbi, Solving Lattice QCD systems of
equations using mixed precision solvers on GPUs, Comput. Phys. Commun. 181 (2010) 1517–
1528. arXiv:0911.3191, doi:10.1016/j.cpc.2010.05.002.
[21] R. Babich, M. A. Clark, B. Joo, Parallelizing the QUDA Library for Multi-GPU Calculations in
Lattice Quantum Chromodynamics (2010). arXiv:1011.0024.
[22] C. Bonati, G. Cossu, M. D’Elia, P. Incardona, QCD simulations with staggered fermions on GPUs
(2011). arXiv:1106.5673.
[23] A. Alexandru, C. Pelissier, B. Gamari, F. Lee, Multi-mass solvers for lattice QCD on GPUs,
J.Comput.Phys. 231 (2012) 1866–1878. arXiv:1103.5103, doi:10.1016/j.jcp.2011.11.003.
[24] N. Cardoso, P. Bicudo, Generating SU(Nc) pure gauge lattice QCD configurations on GPUs with
CUDA and OpenMP (2011). arXiv:1112.4533.
[25] AMD, AMD Accelerated Parallel Processing OpenCL Programing Guide.
URL http://developer.amd.com/sdks/AMDAPPSDK/documentation
[26] M. Lu¨scher, A portable high-quality random number generator for lattice field theory simula-
tions, Computer Physics Communications 79 (1) (1994) 100 – 110. doi:10.1016/0010-4655(94)
90232-1.
[27] W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes, Cambridge
University Press, 2007.
[28] J. Gerhard, V. Lindenstruth, M. Bleicher, Relativistic Hydrodynamics on Graphic Cards (2012).
arXiv:1206.0919.
[29] Preview feature: AMD APP SDK v2.3 support for accessing additional physical memory on the
GPU from OpenCL applications.
URL http://developer.amd.com/support/KnowledgeBase/Lists/KnowledgeBase/DispForm.
aspx?ID=123
[30] R. Baron, P. Boucaud, P. Dimopoulos, F. Farchioni, R. Frezzotti, V. Gimenez, G. Herdoiza,
K. Jansen, V. Lubicz, C. Michael, G. Muenster, D. Palao, G. Rossi, L. Scorzato, A. Shindler,
S. Simula, T. Sudmann, C. Urbach, U. Wenger, Light Meson Physics from Maximally Twisted
Mass Lattice QCD, JHEP 1008 (2010) 097. arXiv:0911.5061, doi:10.1007/JHEP08(2010)097.
[31] K. Jansen, C. Urbach, tmLQCD: a program suite to simulate Wilson Twisted mass Lattice QCD,
Comput. Phys. Commun. 180 (2009) 2717–2738. arXiv:0905.3331, doi:10.1016/j.cpc.2009.
05.016.
19
