We present an OpenCL-based Lattice QCD application using a heatbath algorithm for the pure gauge case and Wilson fermions in the twisted mass formulation. The implementation is platform independent and can be used on AMD or NVIDIA GPUs, as well as on classical CPUs. On the AMD Radeon HD 5870 our double precision D implementation performs at 60GFLOPS over a wide range of lattice sizes. The hybrid Monte-Carlo presented reaches a speedup of four over the reference code running on a server CPU.
Introduction
Lattice Quantum Chromodynamics (LQCD) is the only a priory approach to describing the strong force. State-of-the-art lattice simulations require high-performance computing and constitute actually one of the most compute intensive problems overall. With processor clock speeds no longer improving and processors instead increasing their core count, Graphics Processing Units (GPUs) with their high peak performance and bandwidth have become an interesting platform for high-performance computing. In June 2011 three of the top five systems in the Top500 list of supercomputers [2] were GPU accelerated clusters. An example of heterogeneous architecture for general purpose computing is the LOEWE-CSC [3] , which solely consists of AMD hardware and provides two 12-core AMD Magny-Cours CPUs and one AMD Radeon HD 5870 GPU in the majority of its nodes. Originally, it was ranked 22nd in the Top500 list of supercomputers [2] and ranked eighth in the Green500 list of energy-efficient supercomputers (with 718 MFLOPS/Watt) [4] .
Originating in the high-end computer gaming market, GPUs nowadays offer highest computing capabilities at a very attractive price-per-flop ratio. Current high-end gaming GPUs by AMD and NVIDIA are priced at about 500 Euros. However, while the pioneer work in the field [5] , using application programming interfaces (APIs) designed for graphics rendering, was in principle platform agnostic, nearly all later developments in the field were based on NVIDIA CUDA [6] , therefore being limited to hardware by this single GPU vendor.
We present the first application of OpenCL [7] to Wilson fermions, enabling the code to be run on AMD and NVIDIA GPUs and also on classical CPUs. The work, of which early prototypes have been shown in [1] , has been extended to a full hybrid Monte-Carlo application that shows major performance gains over a purely CPU based reference code.
We begin by stating the physical problem we want to solve and then explain the tools used. Afterwards we will describe the important parts of our implementation and end with an analysis of the performance of our application.
Lattice QCD and Monte Carlo simulations
The strong interactions between elementary constituents of matter are described by Quantum Chromodynamics (QCD). In this section, we will give an introductory overview of QCD and its evaluation by means of Monte-Carlo methods. For more detailed information, we refer to [8, 9] .
QCD is a SU (N C ) gauge theory consisting of gauge and fermion fields. An analytical access by means of perturbation theory may be applied only in regions where the coupling constant, g, is small. This is ensured for high momentum-transfer ("asymptotic freedom"). To explore the non-perturbative regime, euclidean spacetime is discretized on a hypercube with lattice spacing a. Accordingly, the QCD action S QCD is replaced by a lattice version afflicted with discretization errors,
and continuum physics can be obtained in the limit a → 0. In statistical physics, the central object is the partition function Z of the system, which allows for the measurement of an observable A of interest. On the lattice, the expectation value of A is then:
where the exponential in the first line is the Boltzmann factor if one identifies S LQCD = βH. The grassmann-valued fermion fields ψ can be integrated out exactly, yielding the determinant of the
ψ ψ ψ(w) a a a U µ (y) Figure 1 : Sketch of lattice discretization at lattice spacing a. w, x, y, z denote lattice sites, ψ a fermion field and U a gauge link, respectively. Also shown are the plaquette Pµν (y) and the rectangle product of links Rµν (z).
fermion matrix D, and one is left with an integral over all possible gauge configurations U , which are the relevant degrees of freedom. It is convenient to rewrite the fermion determinant as an integral over a bosonic field φ (pseudo fermions),
giving an effective action
Since it is unfeasible to solve this highdimensional integral, importance sampling methods are used to generate an ensemble of N gauge configurations {U m } using the Boltzmann-weight p[U, φ] = exp {−S eff [U, φ]} as probability measure. Then, A can be approximated by
Choice of Lattice action
There is some freedom in the specific form of the lattice action, since it has to agree with QCD in the continuum limit only. Different discretizations will suffer differently from discretization artifacts and often also do not hold all the continuum properties. Especially the discretization of the fermion matrix is non-trivial, since the naive one gives rise to non-physical fermion copies ("doublers"). It is also possible to add improvement terms irrelevant in the continuum to the lattice action, which help to reduce discretization errors.
At this point, it is adequate to insert some words related to the dimensions of the quantities involved and to fix our notation. We will denote the spatial and temporal extent of the system with N σ and N τ , respectively, so that the total volume of the system is V tot = N 3 σ × N τ . The number of colors, Dirac indices and spacetime dimensions are denoted as N C , N Dirac and N D , respectively and they take on the values 3, 4 and 4. On each lattice site x, the component φ(x) of φ is a complex valued, (N Dirac × N C )-dimensional vector, whereas the gauge field U = U µ (x) is a complex valued N C × N C matrix linking x and its neighbour x + µ in spacetime-direction µ (therefore called links, see fig. 1 ).
In our numerical studies [10] [11] [12] we employ the tree-level Symanzik improved Wilson action in the gauge sector,
Here, P µν (x) and R µν (x) denote path-ordered plaquette and rectangle products of link variables (see fig. 1 ) and the parameters are β = 6/g 2 , c 0 = 1 − 8c 1 and c 1 = 1/12. The unimproved gauge action is regained setting c 1 to zero.
In the fermionic sector we use the so-called twisted mass Wilson fermions [13] to simulate two flavours of fermions, whose matrix on the lattice reads
with shorthand notation γ −µ = −γ µ and U −µ (x) = U µ (x − µ) † . The γ µ s are matrices in Dirac space satisfying {γ µ , γ ν } = 2g µν and a, b, α, β are color and Dirac indices, respectively. The sign in the diagonal mass matrix M ± diag corresponds to up and down flavour. Two mass parameters appear, the twisted mass µ and the usual m (via the hopping parameter κ = (2(am+4)) −1 ). Pure Wilson fermions can be reobtained at vanishing twisted mass, D Wilson ≡ D tm (µ = 0). Twisted mass fermions have the advantage that if κ is tuned such that the renormalized quark-mass vanishes for pure Wilson fermions, the O(a) discretization errors in (1) vanish too. For µ = 0 the mass is then determined solely by µ and the system is said to be tuned to Maximal Twist. In addition, for two flavours one has det
Thus, the twisted mass acts as an infrared regulator, which is important in avoiding so called exceptional configurations, for which the fermion matrix has very small eigenvalues rendering its inversion ill-defined (see below). The most used discretization type besides Wilson fermions are staggered fermions. Regarding numerics, these differ a lot from the Wilson type since the "staggering" essentially means that on each site there is only one spinor component instead of four.
We now want to dwell on the role the fermion matrix plays in LQCD simulations a bit more closely. For simplicity, we will drop the indices on D from now on. To evaluate the fermion determinant in the partition function one has to know the inverse of D (see (3) ). In addition, the entries of D −1 are related to the fermion propagator, thus its inversion is also crucial for measuring fermionic observables. To give an example, the two-point correlation function at sites n and m for a generic mesond(n)Γu(n) is given by:
For more involved observables, more occurrences of D −1 will appear. D is a (N C × N Dirac × V tot )-dimensional sparse square matrix, so one uses iterative Krylov space based methods to calculate D −1 indirectly out of equations like
During the inversion, the matrix-vector product Dφ has to be carried out many times. The most common solvers used are CG and BiCGStab. The convergence of the solver depends inversely on how light the simulated quarks are, and also scales with V tot . This renders the inversion the most involved part of the simulation. In addition, there is a lot of machinery around to simplify (8) , such as mass-preconditioning and, most prominently, even-odd-preconditioning. In a simulation the dimensions of the lattice strongly depend on the physical problem under investigation. In order to rule out finite size effects the spatial volume is typically large. For simulations of thermal systems, temperature is set by T = 1/aN τ and thus lattices have a rather small temporal extent here. Accordingly, in vacuum simulations, the opposite is the case. On the other hand, thermal systems require simulations over ranges of parameter values and higher statistics compared to vacuum simulations. Typical lattices have currently N σ ≈ 32, N τ ≈ 12 in thermal studies and N σ ≈ 64, N τ ≈ 128 for vacuum simulations. Therefore a thermal lattice is typically simulated using less parallelism.
Ensemble generation
The standard simulation algorithm to generate QCD gauge configurations is the Hybrid MonteCarlo (HMC) algorithm [14] , where the effective action is embedded in a fictitious classical system governed by the Hamiltonian H = 1 2 P 2 +S eff [U, φ] by introducing Gaussian momenta P conjugate to U . Starting from a given configuration (U, P ), the system is evolved over a time τ to a new configuration (U , P ), according to the hamiltonian equations of motions with a force F,
This evolution is carried out by numerical integration, the most common integration schemes being the leapfrog-and the second order minimal (2MN) scheme. φ and P are chosen with a Gaussian distribution initially and φ is held constant throughout the evolution of the system. Since the numerical integration is not exact, a metropolis step is carried out in the end, thus ensuring detailed balance. This means that the new configuration is only accepted with a probability min (1, exp(
The latter has three contributions with our choice of lattice action:
where the gauge part of the action has been divided into plaquette and rectangle parts. These are proportional to sums of link products, so called staples. For example, (F gauge plaquette ) µ (x) ∼ µ =νP µν (x), where we denote the stapleP µν (x) as the plaquette product of links without U µ (x) (see fig. 1 ). Note that these sums are no longer elements of SU (N C ).
By means of the identity
, the fermion force can be evaluated. Note that this means that one needs to invert D for the input of the force calculation. In addition, this part of the force receives contributions from the D part of the fermion matrix only since our diagonal matrix does not depend on U .
Techniques exist to refine the integration of the equations of motions. τ is usually divided into smaller substeps and the inverter can be preconditioned by introducing additional pseudo fermions via det(D) = det(A)
, where A denotes a fermion matrix with a different mass than D (mass preconditioning). Furthermore, it is beneficial to integrate cheaper force contributions more often (multiple timescales). For details we refer to [15] . In many cases, one is interested in QCD-like systems without fermions, what is then called pure gauge theory. In principal, this gives det M = 1 in the equations above and the HMC algorithm is still applicable. Nevertheless, the so-called heatbath algorithm [16] [17] [18] provides an alternative and more direct way. It is based on an exact algorithm for SU (2) that updates a specific link according to its neighbours. This can be extended to the general SU (N C ) case by systematically reducing the SU (N C ) link to a number of SU (2) subgroups, which for the N C = 3 case is usually also 3. Figure 2 shows a sketch of the algorithm, where W, U, V, L denote SU (N C ) links and w, v SU (2) links, respectively. project and extend denote reduction and extension to and from the ith subgroup. The update routine corresponds to the aforementioned exact SU (2) update or to an overrelaxation update [19] , which serves to cover the whole configuration space more quickly. Only in the first update random numbers are necessary. Table 1 shows an overview of available GPUs and CPUs. One sees immediately that GPUs surpass CPUs in peak performance as well as in memory bandwidth. However, one also notes the drop in performance when going from single to double precision on the GPU. To take advantage of this performance an appropriate programming model and a certain understanding of the hardware architecture is required. One of these programming models is provided by OpenCL.
OpenCL and Graphic Cards

OpenCL
GPU applications generally consist of a controlling program ("host"), running on the CPU, that executes suitable smaller programs ("kernels") on the GPU. All the memory handling is performed by the host program. The most prominent library of such kind is NVIDIA's CUDA [6] . Virtually all existing LQCD applications are based on CUDA, at the disadvantage that these are destined to run on NVIDIA hardware exclusively [20] [21] [22] [23] [24] . A hardware independent approach to parallel computing is given by the Open Computing Language (OpenCL [7] ), which is an open standard to perform calculations on heterogeneous computing platforms. Thus, in OpenCL one is not determined to use CPUs or GPUs, but it is in principal possible to combine these and distribute the computations among all available computing "devices" (see fig. 3 ). For this OpenCL defines a programming language that is based on C99. Implementations of OpenCL can be found both from AMD (AMD Accelerated Parallel Processing (APP) [25] , formerly ATI Stream SDK) and NVIDIA (as part of CUDA), as well as other vendors. As LQCD applications are predominantly written in CUDA, OpenCL performance results are scarcely found. There is but one reported for a HMC using staggered fermions [22] , where a significantly lower performance of OpenCL compared to CUDA is reported.
GPU Architecture and OpenCL Terms
Like modern CPUs, GPUs are multi-core Single Instruction Multiple Data (SIMD) processors. However, where CPUs are tuned towards processing each thread as fast as possible, the GPU architecture is tuned towards high throughput over thousands of threads. While the SIMD model on CPUs is based on vector registers, where a single processor executes an instruction in parallel for multiple elements in the vector register, the GPUs implement a variant known as Single Instruction Multiple Threads (SIMT). On the GPU registers are seen as scalar by each thread, however the execution units, called "processing elements" in OpenCL terminology, share a common instruction decoder. Therefore a core, called "compute unit" in OpenCL, always executes a group of threads in lock-step. The group of lock-stepped threads is basically equivalent to the SIMD thread on a CPU. However, the memory access is more flexible due to the scalar nature of the thread execution. On CPUs the group executed on a compute unit in lock-step might contain only one processing element.
GPUs provide a larger set of registers than CPUs. The AMD Radeon HD 5870 provides 16384 registers, each 16bytes in size. The NVIDIA GTX 480 provide 32768 registers of 4bytes each. These registers are dynamically mapped to threads. Therefore, a compute unit can run either a smaller number of threads using even more than a hundred registers each, or more than a thousand threads, each using only a dozen registers each. Like hyper-threading on a CPU, running more threads will allow to hide memory latencies, increasing overall throughput. The scheduling of the thread groups is performed by a hardware scheduler with minimal overhead. Registers stay allocated to each thread from its creation until it finished execution.
The memory architecture of GPUs is more complex than that of CPUs, which appears uniform to the user. It is split into multiple logical regions. Global memory is the normal main memory of the GPU that can be read and written to by all threads running on the GPU. Private memory is a part of global memory that is partitioned among all threads running on the GPU. When addressing into private memory each thread accesses its own partition. This memory is also used as a swapping place for registers if the register file cannot hold a threads full working set. Registers swapped into the private memory are also known as scratch registers. As private memory is part of the global memory it shares the same performance characteristics. This means, large latencies can be incurred when accessing data from local memory. Therefore, usage of scratch registers usually comes with a large performance penalty. Another part of global memory is constant memory. That memory can only be written to from the host. GPUs usually are able to cache accesses to this memory and broadcast values from this memory to all threads very efficiently. In addition modern GPUs also provide a local memory. Just like registers, local memory is on-die and can be accessed with similar performance. Local memory is shared between threads running on the same compute unit and can be used as a user programmed explicit cache. Allocations of memory on the GPU are referred to as "Buffers" by OpenCL.
Stemming from their graphics tradition, GPUs originally only had dedicated read-only caches for constant memory and textures. Textures are images stored in memory in a special format. On the AMD Radeon HD 5870 and 6970, when keeping to some restrictions, the AMD OpenCL compiler is capable to automatically utilize the texture cache to access buffers which are only read by a kernel. More modern GPUs like the NVIDIA GTX 480 and the AMD Radeon HD 7970 provide multi-level read-write caches. While CPU caches target to minimize latencies in memory access for a single thread, GPU caches are shared by many threads. One of their main functions is to coalesce access by multiple threads to close-by addresses into single memory transactions.
Implementation strategy
Since OpenCL needs a quite different approach to LQCD than existing software, we decided to start from scratch instead of modifying an existing application. Consequently, all parts of the simulation code are carried out in OpenCL. We set up the host program in C++, which nicely allows for independent program parts using C++ classes and also naturally provides extension capabilities. The main object is the class gaugefield, where the initialization process of OpenCL is incorporated. It manages the physical gauge field when several devices are used and holds an application-specific number of opencl device- objects. This class in turn contains all compute device-related parts, such as kernels, and eventually executes the kernels on a specific device. Child classes of gaugefield and opencl module contain problem-related functionality and provide algorithmic logic. For hybrid applications, different "tasks" can be defined to carry out a physics problem. These may then again contain a number of device objects themselves. The OpenCL environment has to be initialized before the actual calculations can be carried out (Fig. 4) . To generate the kernels, the code files are collected, compiled and linked into an OpenCL "program" using the OpenCL compiler. To ease debugging we build each kernel as a stand-alone program. The binary files, which the compiler produces during the kernel compilation, can be reused at a later point and also provide information about the kernel, e.g. register usage statistics, important for benchmarking and optimization. Thus, the kernels are created only at runtime of the application, which allows to pass runtime parameters (e.g. NT, NS, . . . ) to them as compile-time parameters. We can avoid many kernel arguments like this, and parameters are "hard coded" into the kernel code. On the device all data types are implemented as structures, with all the required operations defined for them. Although this might in some cases require more registers than simply operating on arrays of scalars stored in main memory, we opted for this implementation strategy for its better code readability.
C++-Hostprogram
Gaugefield
It is not trivial to estimate how much the register overhead of the structure based implementation strategy is, as the exact register requirements highly depend on the optimizer. The possibly higher register requirements of the structure based approach can easily be seen when looking at the addition of two structures of four floats. When operating on arrays of scalars there only has to be space for three floats in registers. As addition is element-wise, only one float from each operand has to be loaded at the same time. Additionally there has to be room for one element of the result. Using actual structures four times this space is required, as the whole structure has to be stored for each operand and the result. In the case of spinors register requirements would increase from three floats to 72. This is, however, a worst case situation as for most operations, e.g. multiplication of a SU3 matrix with a spinor, more than one element of each structure is required at the same time anyway. In addition, given the high latency of GPU memory it does not make sense to completely serialize handling of each element in a structure. Therefore the register usage should be higher even when performing all operations using scalar types, as the optimizer will use different registers for different elements to enable the exploitation of instruction-level parallelism.
As up till now they have not been relevant for the overall runtime, all LAPACK operations have been implemented in a straightforward manner. The only optimization was implicitly given by the data type storage format which is described below. ILDG compatible I/O has been implemented as well as the Ranlux [26] PRNG (Pseudo Random Number Generator), as it is the standard choice for LQCD simulations. We use the original implementation on the host while on the device we use
RANLUXCL
1 , an open-source OpenCL implementation of Ranlux. For testing purposes we have also implemented the NR3 [27] generator, but it has not been used for the measurements in this paper. Initialization of the random number generator follows the usual Ranlux rules which are applied across the host and the device, where each OpenCL thread runs on its own Ranlux PRNG state.
Since on different GPU drivers we have observed multiple miscompilations of our code during development, we added regression tests for most of our OpenCL functionality. This allows us to quickly check new drivers for incorrect output. A special challenge is that the likelihood for compiler errors scales with code complexity. Thus, a function might work perfectly in a simple test case but will produce errors when integrated into a larger kernel. Therefore it is important to not only test each building block for regressions, but also repeatedly check whether they still work as expected when being used in larger kernels.
Memory requirements
LQCD calculations are always memory bound. To see that consider the theoretical peak performances collected in table 1 and the characteristics of some kernels displayed in table 2. Especially for the fermion related kernels, looking at the respective ratios of bandwidth to flops (numerical density) shows how the memory bandwidth dominates performance. One the other hand, the heatbath related kernels may be expected to perform well on GPUs, since here the bandwidth is somewhat less important, compared to the flops the kernels perform. However, the numerical density is still low compared to the peak flops per bandwidth ratio of the GPUs. Table 3 shows that the gauge field is the biggest object to store. In addition, one needs more than one field of each species during the simulations. However, there are possibilities to reduce these sizes, most prominent the aforementioned even-odd preconditioning for the spinorfields. In general, each gauge link may be displayed with N 2 C − 1 real numbers, one for each generator of the group. Since this representation induces additional computational overhead, other methods are more feasible. For N C = 3, one column of the matrix can be reconstructed from the other two exactly, since it has to obey c = a × b , so one can discard 6 floats of the full matrix (REC12). Additional restrictions on the left 12 numbers allow to save 2 or 4 more floats (REC10 and REC8, respectively). However, the last procedure may run into arithmetic problems during the reconstruction [20] . As the reconstruction adds additional complexity to the code we have not used it so far except for the D kernel. Most of our kernels are already at the edge of the register limit and additional code complexity tends to increase the register pressure, even if with perfect register reuse no additional registers would be required.
Global memory storage formats
Our first implementation of the D kernel used an array of structures (AoS) storage format for the gauge and spinor fields. Observing this kernel reaching only about 22GB/s of memory throughput on the AMD Radeon HD 5870 we started studying characteristics of the memory controller.
First we checked the effects of utilizing the texture cache, which on AMD hardware can be done with minimal code modification. While this provides a significant speedup, the achieved 46GB/s are still far from that GPUs theoretical bandwidth limit of 155GB/s.
Another speedup can be reached by specifying a proper alignment for the larger data types. The key is to use largest possible alignment that is still a divisor of the data type size. If no alignment is specified the compiler will use the alignment of the smallest contained data type, resulting in superfluous memory fetches. While the number of fetches in some cases could be reduced even further by using an alignment that is larger than the data type, it turns out that the additional bandwidth required overcompensates the benefit. Using proper alignment of all types the code reaches 68GB/s, which is still far from the performance limit. : Copy performance using different data types to copy a buffer of 100MiB. Note that the AMD Radeon HD 5870 was run using the Catalyst 11.7 driver, while all other measurements where performed using Catalyst 12.4. Figure 5 shows how using different data types for global memory access effects the speed of copying a fixed size buffer on the GPU. The float and double types are scalar data types of 4 bytes and 8 bytes size. The float4 type is a built-in data type of OpenCL that acts like a struct of four floats and therefore has a size of 16 bytes. The types su3 and spinor are complex structs of size 144 bytes and 192 bytes, respectively. The figure shows that the best way to access memory is using the double or float4 data type. Actually all other types with the same size and alignment will work, too. Therefore we now use a double precision floating point complex type as the base storage type for all our data in GPU global memory. From size and alignment this type is equivalent to the float4 type. All other types are stored in a SoA fashion based on this type. This allows to utilize memory bandwidths of up to 120GB/s on the AMD Radeon HD 5870. Care has to be taken, though, as SoA increases the register pressure. This can lead to register spilling and therefore cause a performance penalty.
Another important effect is given by the mapping of the gauge field indices to memory addresses. As we are using even-odd preconditioning, neighboring threads will always access either links originating in even or those originating in odd sites only. If links originating in even sites are stored interleaving with those originating in odd sites, which is the naive format, and a SoA pattern is used, a memory access will span data of both types of links. As neighboring threads will only require one type of link at that moment, half the memory bandwidth will be wasted. Therefore we split our gauge field such that links originating in even sites are stored separate by those originating in odd sites.
Common code for CPUs and GPUs
As OpenCL can be used both for CPU and GPU programming, we use a single source code for the CPU and the GPU implementation. To cater for the different architectures, we introduces some abstractions to achieve best performance in both worlds.
An important difference between the CPU and the GPU is the optimal looping strategy. On CPU loops perform best when each CPU works on its own consecutive block of memory. Thereby, it can best utilize its time-local cache to reduce the number of actual requests performed on the memory. On the GPU however, the best memory throughput is achieved if consecutive cores read consecutive elements from memory. Therefore, a loop should always move though the data in a strided way. We use a macro PARALLEL_FOR to implement these different looping strategies transparently. Table 2 : Read-Write (RW) sizes in Bytes and flop sizes for some HMC related kernels (for double precision). All numbers are per site (and direction). "saxpy" corresponds to the algebraic operation z = α x + y. Another important difference between CPU and GPU is that the GPU prefers a SoA pattern, while the CPU tends to prefer an AoS pattern. Therefore we encapsulated all memory accesses into separate functions which transparently perform the SoA conversion if required. When moving data onto or from a device the same automatism occurs, ensuring best memory access patterns on all devices.
While OpenCL provides vector data types, which the AMD platform uses for vectorization on the CPU, we did not use those in our code. Besides complication of the source code, they would increase the amount of registers required on the GPU which are already a sparse resource.
Performance results
In this section we test our implementations of the heatbath and HMC algorithm and in addition report on the performance of the D, the crucial time consuming part for fermionic observables and the HMC itself. We tested our LQCD implementations on different architectures with different lattice sizes and compared the results to existing applications and literature data. Since we are interested in thermal systems, we benchmarked our programs on lattice sizes ranging from N σ = 16, 24, 32, 48 and N τ = 4, 8, 12, 16. The D and heatbath kernels have also been benchmarked for some additional values of N τ to explore a wider range of used memory sizes.
As double precision results are desired for physical measurements we concentrated on such precision calculations throughout. However, single precision calculations do promise twice the performance for a given device's memory bandwidth and have provided good results in other applications running on the same GPUs [28] . Therefore our results are only a lower bound for the achievable performance. For the heatbath algorithm, where no summations over the whole lattice are required, single precision results are given.
The GPUs used are an AMD Radeon HD 5870, as it is used in the LOEWE-CSC, and an AMD FirePro V7800, which is the professional grade version of the AMD Radeon HD 5870. It is clocked somewhat lower and equipped with more memory. In addition we also used the newer AMD Radeon HD 6970 and the AMD Radeon HD 7970, which is the latest GPU available by AMD. For comparison, the D benchmarks have also been executed on the NVIDIA GeForce GTX 480, which is of comparable age to the AMD Radeon HD 5870, and the NVIDIA GeForce GTX 680, which is the latest GPU available by NVIDIA. To show the flexibility of OpenCL, we also simulated on Intel Xeon E5520, AMD Opteron 6172 and AMD Opteron 6278 CPUs. All of the benchmarks, except those on the AMD Radeon HD 7970, used Catalyst 12.4 which is the most current available AMD GPU driver available at the time of writing. On the AMD Radeon HD 7970 the Catalyst 12.2 was used. The NVIDIA GPUs were using version 295.41 of NVIDIA's GPU driver. 
Heatbath performance
First, we present results of our implementation of the heatbath algorithm, that is we concentrate on the heatbath and overrelax kernels. Figure 6 shows the performance in GFLOPS achieved in single precision. While we don't manage to saturate device bandwidth, in both kernels the Cypress and Cayman based GPUs provide at least 30GFLOPS for the heatbath kernel for any lattice size. The overrelax kernel performs a lot better, here we can achieve at least 100GFLOPS on the GPU, peaking up at about 160GFLOPS on the AMD Radeon HD 6970. This behavior can most likely be explained by how the compiler manages with the random number generator, since this is the essential difference between both kernels. One can also see the different performance of different card generations. CPU performances are also given, but they are only a factor 1/4 of those on the GPUs. However, the overrelax kernel again outperforms the heatbath kernel. The comparison is not entirely fair, as the CPU code did not receive any optimizations besides the proper looping strategy and is not using a SoA access pattern. Vectorization might be able to close the gap, although one should keep in mind that this is a comparison of two server CPUs to one consumer GPU, where the latter is a much less costly solution.
Not shown is the double precision case, where on the contrary, the performance is of the order of 10GFLOPS and thus poor for both kernels on all GPUs. Again, the overrelax kernel performs better than the heatbath kernel, however, this time the difference is small. The massive drop in performance can be understood when looking at the working set sizes of the kernels. Both kernels require storing of a relatively large amount of data throughout kernel execution, rendering register reuse difficult. Therefore, in double precision the working set for each thread gets too large to fit into the available registers, leading to massive spilling to scratch registers. This reduces performance by both, the additional latency when accessing the spilled data, as well as the increased memory bandwidth required to access the spilled data. Therefore, effective algorithmic density is less than what is given in Table 2 . However, this is also specific to heatbath and overrelax kernels, which can be seen in the performances reported below.
Despite performance, the limiting factor for physical studies here is clearly the GPU main memory. It is limited to 1GB on the AMD Radeon HD 5870 and to 2GB on the other two. This sets an upper limit for possible lattice sizes. Additionally, on AMD GPUs prior to the HD 7000 generation there is no official support to use more than half of the GPUs memory from an OpenCL application. The limit can however be circumvented by setting environment variables as specified by the AMD Knowledge Base [29] . Also note that an OpenCL runtime can do a multitude of things when it runs out of physical memory on the GPU. While current implementations all seem to give an error when running out of memory, from the API specification it is perfectly legal to swap buffers to host memory, which will come with a major performance penalty.
Comparing the heatbath kernel performance with [24] , where performances of a CUDA based heatbath implementation on NVIDIA GeForce GTX 295 and NVIDIA GeForce GTX 580 is given, we observe only half of the reported single precision performances for the heatbath kernel. The overrelax kernel performs at around 90% of the peak value of [24] for the AMD Radeon HD 5870 and a slightly better performance for the AMD Radeon HD 6970 is found. Note that we do not use memory bandwidth reducing storage techniques for the links as they are used in [24] (REC12). Other than the code in [24] , our implementation can work with any number of lattice points that fit into the GPU memory and is not limited by the number of threads used. This shows that all the tested GPUs can in principle be used to efficiently run the algorithm.
D performance
For simulations including fermions the inversion of the fermion matrix is essential and thus the performance of evaluating Dφ is crucial. Since for twisted mass fermions the diagonal part of D is not dependent on the gauge field, the D performance has to be considered most carefully. We measured it on various lattice sizes in double precision, the results can be seen in figure 7 .
The slowest of our GPUs, the AMD FirePro V7800 levels at nearly 60GFLOPS and is above 50GFLOPS even for small lattices like 16 3 × 4. The gamer GPUs AMD Radeon HD 5870 and AMD Radeon HD 6970 scale nicely with their higher peak memory bandwidth, achieving about 60GFLOPS and 80GFLOPS, respectively. In addition, one can observe better performances with the AMD Radeon HD 5870 for lattices sizes that have a spatial extent that is a multiple of 16, where the kernels performs at about 70GFLOPS. The performance of the AMD Radeon HD 7970 is not better than that of the AMD Radeon HD 6970, however this GPU was measured using a preview driver. As that driver does not provide any useful resource usage statistic we did not optimize our kernel for that GPU. As we cannot ensure proper register usage we simply reused the memory stride optimization found to be optimal on the AMD Radeon HD 5870.
Also shown is the bandwidth utilization, where apart from the unoptimized AMD Radeon HD 7970 at least 70% of the theoretical peak bandwidth are reached on all AMD GPUs. On the AMD Radeon HD 6970, when running at 140GB/s, 80% of the peak bandwidth are achieved, peaking up to 86% for lattices like 16 3 × 24. For the AMD Radeon HD 5870, the utilization peaks at 120GB/s (about 78% of the theoretical peak bandwidth), which is actually even higher than the about 105GB/s achieved on this GPU in a simple float4 copying kernel as shown in figure 5 . This can be explained by the loop-unrolled characteristic of the SoA memory access and the higher read-to-write ratio of the D kernel, as only reads can benefit from the cache of the AMD Radeon HD 5870.
Additionally, we give D performances obtained on the NVIDIA GPUs GeForce GTX 480 and GeForce GTX 680. Both cards constantly reach about 25GFLOPS and 40GB/s, respectively. The latter corresponds to about 20% of the theoretical peak. The NVIDIA GPUs show register spilling, which is probably the performance limiting factor. While the NVIDIA GPUs have a smaller register file than their AMD counterparts, that is only part of the problem. In CUDA it is possible to have the compiler increase the number of available registers at the cost of fewer threads being able to run on the GPU concurrently. In addition, it is possible to reconfigure the ratio of L1 cache to shared memory. A larger L1 cache would probably avoid some of the slowdown due to the spilled registers. Sadly neither of these features is exposed when using OpenCL on the NVIDIA hardware. With the current code, the AMD GPUs achieve 3-4 times more performance with the same kernel, but the comparison is not entirely fair, as the AMD GPUs have been the primary development platform. There are a couple of reported performances in the literature, which are all based on CUDA implementations. Using the AMD Radeon HD 5870 as our reference value, our D kernel is nearly twice as fast as the one shown in [20] , even though that kernel uses REC12 to reduce bandwidth requirements. One should however keep in mind that the NVIDIA GeForce GTX 280 is one generation older than the AMD Radeon HD 5870. Our D is also 40% faster than that shown in [23] . That performance was measured on a NVIDIA GeForce GTX 480 which, as mentioned above, is as old as the AMD Radeon HD 5870 used by us.
Furthermore, figure 8 shows the effect of REC12 on the D performance exemplarily for the AMD Radeon HD 5870 and the AMD FirePro V7800. This technique reduces the RW load of the kernel by 13% (see table 3), and thus should result in a visible speedup. An increase in performance up to 9% can indeed be observed and proves again that the performance is bandwidth limited. Note that we did not include the computational overhead induced by the reconstruction technique in the FLOP count. Compared to [20] , where actually a decrease in double precision performance is reported when decreasing the RW load, these results support the good performance picture of our D implementation. Table 4 : Setups used for benchmarking. β = 3.9 and κ = κc = 0.160856 were used throughout to simulate at maximal twist. The value of mπ is only approximate.
HMC performance
In order to test our HMC program under realistic conditions, we used setups corresponding to one heavy and two lighter pion masses, see table 4. The parameters were chosen according to [30] in order to simulate at maximal twist. In each setup, we started from a prior generated gauge configuration and performed 10 HMC steps with τ = 1 (Setup C) and τ = 0.1 (Setup A and B), respectively. We used separate time scales for the gauge and fermion parts and on each the 2MN integrator with 10 integration steps. To gain statistics, each run has been carried out O(10) times. The OpenCL application was run on the AMD FirePro V7800 and the AMD Radeon HD 6970. For comparison, we performed the same measurements using the CPU-based application tmlqcd [31] , running on different numbers of cores on the LOEWE-CSC (AMD Opteron 6172 CPUs). Statistical errors have been neglected since they were found to be below the percentage limit.
The results for setup C are shown in figure 9 , where runs for fixed N τ = 8 and fixed N σ = 24 have been performed. The reference tmlqcd was executed on one and two whole nodes, which means 16 and 32 and 24 and 48 CPU cores, respectively, for the different lattice sizes, since the lattice extent has to be divisible by the number of cores here.
One can see nice scaling in all reference runs, the additional MPI overhead is only visible for small lattices. The GPU runtimes are always comparable to or better than the reference values. Especially the AMD Radeon HD 6970 can achieve a better performance than 32 and 48 CPU cores throughout, respectively. Considering the N σ = 24 runs, we measure a speedup of the AMD FirePro V7800 of about 1.7 compared to the 24 core reference runs and a slow-down of 0.9 compared to the 48 core runs. The AMD Radeon HD 6970 is faster in both comparisons, about 2.5 and 1.3, respectively. Note that this is a comparison to two and four full CPUs, where each CPU is more expensive then the GPUs used. We did not perform reference runs on a single CPU core, but we observed a scaling of about 0.75 when running tmlqcd on more than one core, which can be explained by the additional MPI overhead. This scaling yields a speedup of about 30 for the AMD FirePro V7800 and a speedup of about 44 for the AMD Radeon HD 6970, compared to one AMD Opteron 6172 core. However, this comparison is a bit academic, since one would practically not use only one core of a multi core CPU.
A similar picture emerges also for the lighter pion masses (see figure 10) , where the runtime scales approximately the same on all systems. For these setups, the inversion of the fermion matrix takes longer. Therefore the force calculation, which only achieves about half the bandwidth utilization of the D kernel and takes 40% of the total execution time in setup C, becomes less important for the overall performance. The limited performance of the force calculation is also caused by register spilling, even though it is not as problematic as in the similar heatbath kernels. Not displayed are additional runs of the OpenCL code which we did on AMD Opteron 6172 CPUs. Since no specific optimizations for the CPUs have been carried out, these needed significantly more runtime. However, switching between both types of devices is smooth, showing that OpenCL code is absolutely versatile.
The comparison of these results to the extensive tests of the CUDA based HMC for the staggered fermion discretization reported in [22] is somewhat difficult because of the fundamental differences 0 1,000 2,000 3,000 4,000 5,000 6,000 7,000 between the two discretizations. Additionally, the authors of [22] mainly use single precision within the HMC, whereas we perform double precision calculations throughout. Although we did not test this, the results obtained for the D on NVIDIA cards strongly suggest that our code would currently perform significantly slower on these cards compared to AMD GPUs. This is the opposite picture to the significantly slower performance of the staggered HMC reported in [22] . Just like our NVIDIA results their AMD results are probably affected by the choice of the primary development platform, resulting in less optimization work in the other. In addition the staggered fermionic data types should be smaller, reducing the working set and therefore avoiding register spilling. This should especially benefit the NVIDIA GPUs used, as their register file is smaller than that of the AMD GPU.
Conclusions
We have presented the first implementation of LQCD using OpenCL and shown it is working on CPUs as well as the two major GPU platforms, NVIDIA and AMD.
In single precision our heatbath implementation is able to achieve a competitive 160GFLOPS on the AMD Radeon HD 6970, while the double precision variant is still work in progress. Fine tuning the applied memory optimizations, which are currently tuned for double precision codes, and the use of bandwidth-reducing techniques like REC12 should provide further speedups here.
For the Wilson D kernel we were able to show excellent performance, utilizing more than 70% of the available memory bandwidth for all lattice sizes on multiple AMD GPUs outperforming published performances of CUDA based codes. We also see a positive effect of REC12 on the performance. Extended to a full HMC for twisted-mass Wilson fermions we showed a speedup factor of four of our code running on an AMD Radeon HD 5870 compared to a reference code running on an AMD Opteron 6172.
Further speedups to our HMC will be reached by further optimizing the inverter performance. One way to reach this will be a mixed precision solver, which up till now we avoided due to the small memory of the AMD Radeon HD 5870. As proposed in [1] we will also start to use all compute resources on hybrid systems by performing parts of the calculations on the CPU. This will require the SIMD capabilities of the CPUs, too. One approach for this could be the redefinition of our base types for each device. In addition, we are currently investigating possible performance gains by using REC12 in the force calculation. More complex reduction techniques may also be of advantage here. Finally, to reduce processing time and circumvent memory limitations we will expand our code to run on multiple GPUs.
