Graphic Processing Units (GPUs) are getting increasingly important as target architectures in scientific High Performance Computing (HPC). NVIDIA established CUDA as a parallel computing architecture controlling and making use of the compute power of their GPUs. CUDA provides sufficient support for C++ language elements to enable the Expression Template (ET) technique in the device memory domain.
Introduction
GPUs are getting increasingly important in scientific HPC. Massively multicore architectures supported by high bandwidth memory buses make them an attractive target architecture for either floating point operation rich or memory access intensive applications.
NIVIDA established CUDA [1] as their parallel computing architecture. It enables HPC scientists to dramatically increase the computing performance by taking advantage of the compute power of GPUs. While former releases of CUDA mostly supported a C-like Application Programming Interface (API) with little support for C++ features, the upcoming release 4.0 supports C++ features in a much broader way attracting an even larger set of applications and libraries to be made subject to multicore acceleration.
Active libraries typically implemented utilising metaprogramming methods have proven to provide domainspecific abstractions to the user and on the other hand to provide compilers with sufficient freedom to optimise the code to a satisfactory level. They combine the benefits of built-in language abstractions (convenient API, translation to efficient code) with those of library-level abstractions (information hiding, adaptability) [2, 3] .
Quantum Chromodynamics (QCD) is the theory of the strong force between gluons and quarks. The formulation of the theory on a discrete space-time lattice is called lattice QCD and has opened the path to numerical calculations. Lattice QCD has received much attention as a "grand challenge" problem in scientific HPC. Current lattice calculations demand computational work of sustained peta-flops and compute resources have been and continue to be the main limiting factor to large scale lattice QCD calculations.
Although heavily dependent on the simulation parameters typically the largest portion of the compute time in lattice QCD calculations is spent solving a system of linear equations when inverting the so called fermion matrix. Typically most of the work invested in optimisation of lattice QCD applications is spent on this part leaving the remaining parts of the calculation fairly unoptimised.
GPUs form an attractive platform upon which to deploy large scale lattice QCD calculations [4] . To date a lot of effort has focussed on optimising the solver part of lattice QCD applications [5, 6, 7, 8] . Outstanding implementations of several inverters which achieve a very high sustained performance using a mixed precision approach combined with reliable updates became avail-able [9, 10] . Seamless integration for these solvers are provided for a number of lattice QCD software suites including Chroma which builds on top of QDP++ [11] .
However, as these solvers optimised for the CUDA architecture provide for a significant speed-up of the inversion of the fermion matrix, the remaining (unoptimised) parts of the calculation start to dominate the overall execution time.
Currently GPU enabled software packages typically have some short comings: A particular part of the calculation is either executed on the GPU or the CPU leaving the respective other system mostly idle. This is not ideal taking into account the roughly equal acquisition cost and power consumption of both systems.
On the other hand heterogeneous computing architectures offer the possibility to enable a cooperative computing environment where the general purpose and specialised processors are working together in an interleaved fashion. The idea is to split the calculation into several small tasks and to deploy multiple types of processing elements within a single workflow each assigned to the task its best suited for.
In order to install this "fine-grained" structure in Chroma acceleration is directly implemented in the underlying library QDP++ assigning the code parts to execute on the processor element it is best suited for, i.e. IO-intensive operations on the CPU, floating-point rich operations on the GPU. In this way not only a finite set of selected functions are executed on the accelerator but acceleration is applied in a more general fashion.
Benchmark measurements confirmed that solely accelerating the floating-point rich operations yet yields a significant speedup factor compared to unaccelerated execution.
The data layout, the order of access, the precision of the primitives are left unchanged just as prescribed by the QDP++ standard template order. Section 2 introduces briefly the CUDA architecture. The ET technique is briefly outlined in section 3. Section 4 introduces QDP++. Section 5 introduces the design elements introduced to QDP++ for accelerating the evaluation. Section 6 details on the new QDP++ API elements. An application example is detailed in section 7. Section 8 details on benchmarking results.
The CUDA Architecture
The CUDA architecture is built around a set of multithreaded Streaming Multiprocessors (SM) which provide the main compute power. Threads are organised in a hierarchical manner: A kernel grid is a collection of thread blocks. A thread block is a collection of threads and represents an indivisible unit allocatable by a multiprocessor. Whether a given thread block can be allocated by a SM depends on the resources (number of registers and shared memory) its threads collectively require and the resources available on the SM. The threads of a thread block execute concurrently on one SM, and multiple thread blocks can execute concurrently on one SM (one block active at a time).
The Streaming Multiprocessors implement the Single-Instruction, Multiple-Thread (SIMT) architecture. Threads resident on one SM are bundled into groups of 32 parallel threads, so called warps. The threads of exactly one warp are executed in SIMT fashion. Individual threads composing a warp start together at the same program address, but they feature their own instruction register and are free to branch independently. However, in this case execution is serialised.
The SM is able to hide latency to device memory by switching execution to a different warp whose instruction is ready to execute. It is therefore beneficial to organise the thread number per SM in such a way that a sufficient number of warps is resident to (ideally) completely hide the latency to device memory.
Expression Templates
C++ function and class templates together with function and operator overloading offer the possibility to represent expressions as C++ types. This technique is commonly referred to as Expression Templates (ETs) and was first introduced by Todd Veldhuizen [12] and David Vandevoorde.
ETs provide a means to eliminate the need for creating temporaries when implementing a C++ vector class library which features both a convenient API such as for domain specific languages desired and a high performance of the translated code. However, the latter feature relies on the optimisation abilities of the compiler and is not ensured in all ET applications, e.g. for Basic Linear Algebra Subprograms (BLAS) Level 2 and 3. Here different, but also ET based approaches may achieve a better performance [13] .
The Portable Expression Template Engine (PETE) pioneered the use of the ET technique for parallel physics computations [3] . It is an extensible implementation of the ET technique and achieves an exceptional level of abstraction without sacrificing performance and provides the core functionality (on the vector level) of QDP++.
Implementation of a vector library utilising the ET technique typically involves defining a template of the evaluate function. Upon assignment of an expression the compiler generates an instantiation of the evaluate function matching the expression. The evaluate function then typically implements a loop iterating over all vector components -no temporary vector objects are required.
C++ compilers offer the possibility to access the function template's arguments in a fully instantiated C++ type in form of a C string -so called pretty printing, which provides a means to access at runtime the expressions.
QDP++
QCD Data Parallel (QDP++) is a C++ vector class library suited for quantum field theory. It forms the basis for the widely used lattice QCD software suite Chroma and as such provides the lattice wide data types and expressions used in Chroma [11] . Chroma implements linear algebra operations which may include nearest neighbour communications utilising the QDP++ API.
Although not designed originally for multicore acceleration, this work demonstrates that design elements can be added to QDP++ in such a way that evaluations of arbitrary expressions are accelerated and executed on a GPU. This approach was previously established when deploying lattice QCD applications to the QPACE supercomputer [14, 15] .
Accelerating QDP++ evaluation on a GPU
Accelerating QDP++ expressions on a GPU relies on leveraging the ET technique to the device memory domain. A crucial component for ETs to work in a particular memory domain is (besides the compiler's ability to handle C++ templates) the ability to take memory addresses of functions and allowing for dynamically dereferencing function addresses (function pointers).
The new release 4.0 of CUDA on devices with compute capability no less than 2.0 meets the requirements for the ET technique to work on the device memory domain.
Unfortunately CUDA provides only a C-interface to kernel functions making it impossible to directly pass C++ constructed expressions as arguments to compute kernels.
This work circumvents the aforementioned lack of a C++ API to kernel arguments by first constructing in device memory domain an object of an equivalent C++ expression type as used during host code translation and second deploying the missing runtime configurable Plain Old Data (POD) parts by copying those from the host side expression object.
In order to construct the object in device memory domain a JIT compilation of CUDA kernel modules is triggered upon expression evaluation. Launching this kernel constructs the required object in device memory. Still the runtime configurable parts are missing.
To build the bridge between the two expression objects residing in different memory domains the POD part of the expression object on host side is copied into a C API compatible form which is allowed to be passed as CUDA kernel arguments.
Dynamic Code
The shared library is loaded via the dynamic linking loader and the kernel is executed on the device.
After evaluation the shared library is kept loaded until the application exits. This ensures that each kernel function is only generated once and subsequent calls branch to the already loaded shared library.
Flattening the Expression Tree
PETE [3] provides means to traverse the expression tree and execute custom operations on the tree nodes and leaves. This method is used on host side to collect the runtime configurable data (POD portion) of each operator and to store them into a linear storage container that can be passed (as a pointer) to the CUDA kernel as an argument, i.e. the expression tree is flattened.
The inverse operation restores the expression tree on device side 1 . Certain runtime configurable operators require special treatment. E.g. the shift operation requires readonly access to a site table index initialised at runtime.
A device storage container was added in such a way that the first call to a particular runtime configurable operation triggers copying of the required data to device memory. The storage container keeps track of already transferred memory regions and subsequent calls to the same operation do not trigger the device copy again but make usage of the already resident data in device memory. The site tables remain in device memory until they are explicitly freed by the user. This speeds up repeated calls to the same shift function.
Mixed Memory Domain Approach
Since device memory is (still) a scarce resource a mixed memory domain approach was favoured: Memory allocation for lattice wide objects utilise the host memory domain. Upon user request the object's data is pushed to the device memory domain.
A new feature coming with version 4.0 of CUDA provides the possibility to page-lock a memory range that was already allocated (4kB aligned) in the host memory domain and to add it to the tracking mechanism to automatically accelerate calls to device copy functions. This mechanism eliminates the previously required staging of data regions prior to the transfer to device memory and reduces pressure on host memory.
Thread Geometry
The evaluation function template in ET based vector libraries typically triggers execution of a loop iterating over all lattice sites. With CUDA, parallelisation of a loop is typically carried out unrolling the loop and starting one thread per loop iteration.
Since here the applied CUDA kernels not only consist of processing the lattice sites but also require prior reconstruction of the expression tree it was not clear that the typical approach leads to the best performance. A software configuration parameter N site was introduced which specifies the number of sites assigned to one thread.
CUDA kernel functions are launched with specifying the grid and block geometries. Thus a software configuration parameter N threads is introduced that specifies the number of threads per block.
Given the total number of lattice sites the grid geometry is a function of N threads and N site .
For each expression E i a separate CUDA kernel is generated. Thus the sustained performance P is a function of the number of threads per block, the number of lattice sites processed per thread, and the expression E i :
CUDA enabled software packages might be equipped with an auto-tuning mechanism that determines the optimal grid and block geometries for the particular set of installed devices. Auto-tuning is run prior to production and the geometry parameters that yield the best performance are stored for later inclusion.
New QDP++ API Elements
The QDP++ API was extended by the following elements:
Listing 1: Modified Chroma implementation for Jacobi smearing. Prior to any calculation the lattice wide objects are pushed to the device (first darker grey shaded region). After calculation the result object is copied back to host memory and device memory is freed (second shaded region). QDP++ expressions (line number): E 0 (16, 20) , E 1 (25), E 2 (27),E 3 (30),E 4 (33). • OLattice::pushToLattice() Allocates a memory region of the object's size in device memory and copies the object's data to the device.
• OLattice::popFromLattice() Copies the data from the device to host memory and frees device memory.
• OLattice::freeDeviceMem() Frees device memory.
• theDeviceStorage::freeAll() Frees device memory previously allocated for runtime configurable operators.
Application Example: Jacobi Smearing
Frequently used Chroma routines which are typically not subject to special optimisation and which can consume a significant amount of (wallclock) time when executed on a few CPU cores only are quark smearing routines. These are operations acting on lattice wide objects and are typically iterative prescriptions including nearest neighbour communications. One of these routines implements Jacobi smearing [16] and serves here for the benchmarking analysis.
The Jacobi smearing procedure is obtained by solving the Klein-Gordon equation
where
as a power series in κ S stopping at some finite power N smear . Listing. 1 shows the Chroma implementation of Jacobi smearing using the QDP++ API . Five QDP++ expressions are involved: E µ , with 0 ≤ µ ≤ 4 where the most compute intensive ones are E 1 and E 2 .
The code lines shaded darker grey were introduced to enable execution on the GPU. All lattice objects are pushed to the device. After calculation the device memory is freed and the result objects copied back to its original location in host memory.
Benchmark Results
Benchmarking analyses were carried out using a NVIDIA GeForce GTX 480. This device has 1. 
Figure 1: Performance dependence of expression E 2 on N threads for a L x /a = 32 lattice (single precision).
memory, compute capability 2.0, 15 Streaming Multiprocessors and a total of 480 CUDA cores. Double precision on this device is restricted as it utilises the GF100 chip designed for the consumer market. The NVIDIA CUDA 4.0 toolkit (Release Candidate 2) was used with the NVIDIA Linux kernel driver version 270.40. The Chroma Jacobi smearing routine was applied to lattice objects where lattice sizes ranged from N = 8 3 × 16 to N = 32 3 × 64 in single precision and from N = 8 3 × 16 to N = 24 3 × 48 in double precision. The first benchmark analysis studied the performance P(N threads , N site , E) for varying N threads keeping E = E 2 and N site = 1, 2, 4, 8 fixed for a 32 3 × 64 lattice in single precision. Fig. 1 shows the result. The best performance was achieved when setting the thread block size equal to the warp size N threads = 32.
The second benchmark analysis focused on the performance dependence on N site keeping E = E 2 and N threads = 32, 64, 128, 256 fixed for a 32 3 × 64 lattice in single precision. Fig. 2 shows the result. Only a moder- ate dependence was seen on this configuration parameter. Although when configuring for 32 threads a slightly better performance is achieved when using N site = 4 or 8 instead of N site = 1.
These benchmark analyses were repeated for the expressions E = E 0 , E 1 , E 3 , E 4 . The same characteristics were observed: Setting the parameter N threads = 32 resulted in all cases clearly to the best performance with only a moderate dependence on N site .
A final benchmark analysis was carried out for the whole smearing routine (including all five expressions) for different lattice sizes (single and double precision) in comparison to executing the same routine on the host CPU, an Intel Xeon CPU (E5507, 4 cores, 2.27GHz). Tab. 1 shows the benchmark results in numbers (sustained GFLOPS for single and double precision). Fig. 3 shows the result graphically. For the 32 2 × 64 lattice in single precision a speedup factor (compared to the CPU) of more than 14 was measured.
Conclusion, Outlook and Discussion
As a first step, QDP++ expression evaluation was accelerated on the GPU by leveraging the ET technique on the device memory domain. Solely with acceleration a significant speedup factor for the evaluation could be achieved.
Providing an auto-tuning mechanism for the approach described here is not straight forward since the individual expressions E i are not known prior to production. Establishing an auto-tuning mechanism forms part of the to-do list. However, as the benchmark measurements showed there seems to be a preferred thread geometry that leads to the best performance.
Optimisation would be a next major step. By changing the data layout in such a way that coalescing device memory access takes place an even higher speedup factor is expected.
A next step in another direction would be parallelisation to multiple GPUs per host and targeting for the parallel architecture of QDP++ to extend this approach to multiple hosts.
One might also want to move the shared linking loader to an external daemon program or the system loader. In this way the JIT compilation takes place only once -even across several Chroma runs.
Utilising QDP++ profiling would eliminate the dependence on an external code generator.
Software Availability
QDP++ and Chroma are available as open source software [11] . QDP++ configurable for GPU evaluation is available [17] .
The GPU portion of QDP++ requires the upcoming release 4.0 of NIVIDA CUDA [1] and devices with compute capability lo less than 2.0 are required.
