Accelerating QDP++ using GPUs by Winter, Frank
Frank Winter (a)
(a) School of Physics and Astronomy, University of Edinburgh, Edinburgh EH9 3JZ, UK
Abstract
Graphics Processing Units (GPUs) are becoming increasingly important as target architectures in scientific High Performance Computing (HPC). NVIDIA established CUDA as a parallel computing architecture for controlling and making use of the compute power of their GPUs. CUDA provides sufficient support for C++ language elements to enable the Expression Template (ET) technique in the device memory domain.
QDP++ is a C++ vector class library suited for quantum field theory which provides vector data types and expressions and forms the basis of the lattice QCD software suite Chroma.
In this work the evaluation of QDP++ expressions on a GPU was successfully implemented by leveraging the ET technique and using Just-In-Time (JIT) compilation. The Portable Expression Template Engine (PETE) and the C API for CUDA kernel arguments were used to build the bridge between the host and device memory domains. This makes it possible to accelerate on a GPU those Chroma routines which are typically not subject to special optimisation. As an application example a smearing routine was accelerated to execute on a GPU. A significant speed-up compared to normal CPU execution was measured.
Keywords: Lattice QCD, GPU, Expression Templates, Just-In-Time Compilation
1. Introduction
GPUs are getting increasingly important in scientific
HPC. Massively multicore architectures supported by
high bandwidth memory buses make them an attractive
target architecture for either floating point operation rich
or memory access intensive applications.
NVIDIA established CUDA [1] as their parallel computing architecture. It enables HPC scientists to dramatically increase computing performance by taking advantage of the compute power of GPUs. While former releases of CUDA mostly supported a C-like Application Programming Interface (API) with little support for C++ features, the upcoming release 4.0 supports C++ features much more broadly, making an even larger set of applications and libraries candidates for multicore acceleration.
Active libraries, typically implemented using meta-programming methods, have proven to provide domain-specific abstractions to the user while at the same time leaving compilers sufficient freedom to optimise the code to a satisfactory level. They combine the benefits of built-in language abstractions (convenient API, translation to efficient code) with those of library-level abstractions (information hiding, adaptability) [2, 3].
Quantum Chromodynamics (QCD) is the theory of
the strong force between gluons and quarks. The for-
mulation of the theory on a discrete space-time lattice is
called lattice QCD and has opened the path to numerical
calculations. Lattice QCD has received much attention
as a “grand challenge” problem in scientific HPC. Cur-
rent lattice calculations demand computational work of
sustained peta-flops and compute resources have been
and continue to be the main limiting factor to large scale
lattice QCD calculations.
Although heavily dependent on the simulation parameters, typically the largest portion of the compute time in lattice QCD calculations is spent solving a system of linear equations when inverting the so-called fermion matrix. Most of the work invested in optimising lattice QCD applications is typically spent on this part, leaving the remaining parts of the calculation fairly unoptimised.
GPUs form an attractive platform upon which to deploy large scale lattice QCD calculations [4]. To date a lot of effort has focussed on optimising the solver part of lattice QCD applications [5, 6, 7, 8]. Outstanding implementations of several inverters have become available which achieve a very high sustained performance using a mixed precision approach combined with reliable updates [9, 10]. Seamless integration of these solvers is provided for a number of lattice QCD software suites, including Chroma, which builds on top of QDP++ [11].
arXiv:1105.2279v1 [hep-lat] 11 May 2011
However, as these solvers optimised for the CUDA
architecture provide for a significant speed-up of the
inversion of the fermion matrix, the remaining (unop-
timised) parts of the calculation start to dominate the
overall execution time.
Currently GPU enabled software packages typically have a shortcoming: a particular part of the calculation is executed either on the GPU or on the CPU, leaving the respective other system mostly idle. This is not ideal taking into account the roughly equal acquisition cost and power consumption of both systems.
On the other hand, heterogeneous computing architectures offer the possibility of a cooperative computing environment where the general purpose and specialised processors work together in an interleaved fashion. The idea is to split the calculation into several small tasks and to deploy multiple types of processing elements within a single workflow, each assigned to the task it is best suited for.
In order to install this “fine-grained” structure in Chroma, acceleration is implemented directly in the underlying library QDP++, assigning each part of the code to the processing element it is best suited for, i.e. IO-intensive operations to the CPU and floating-point rich operations to the GPU. In this way acceleration is not restricted to a finite set of selected functions but is applied in a more general fashion.
Benchmark measurements confirmed that accelerating the floating-point rich operations alone already yields a significant speedup factor compared to unaccelerated execution.
The data layout, the order of access and the precision of the primitives are left unchanged, just as prescribed by the QDP++ standard template order.
Section 2 briefly introduces the CUDA architecture. The ET technique is outlined in section 3. Section 4 introduces QDP++. Section 5 describes the design elements added to QDP++ for accelerating the evaluation. Section 6 details the new QDP++ API elements. An application example is given in section 7. Section 8 presents benchmarking results.
2. The CUDA Architecture
The CUDA architecture is built around a set of multi-
threaded Streaming Multiprocessors (SM) which pro-
vide the main compute power. Threads are organised in
a hierarchical manner: A kernel grid is a collection of
thread blocks. A thread block is a collection of threads
and represents an indivisible unit allocatable by a mul-
tiprocessor. Whether a given thread block can be al-
located by a SM depends on the resources (number of
registers and shared memory) its threads collectively re-
quire and the resources available on the SM. The threads
of a thread block execute concurrently on one SM, and
multiple thread blocks can execute concurrently on one
SM (one block active at a time).
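The resource argument above can be sketched in a few lines of C++. This is only an illustration: the struct and function names are hypothetical, the limits are supplied by the caller rather than queried from a device, and real hardware additionally applies allocation granularities that are ignored here.

```cpp
#include <algorithm>

// Per-SM resource limits (hypothetical struct; values are supplied by the caller).
struct SmLimits {
  int registers;    // registers available on one SM
  int sharedBytes;  // shared memory available on one SM, in bytes
  int maxBlocks;    // cap on the number of resident blocks per SM
};

// Resources one thread block collectively requires (hypothetical struct).
struct BlockNeeds {
  int threads;
  int registersPerThread;
  int sharedBytes;
};

// Number of thread blocks that fit on one SM at the same time.
// A result of 0 means a single block already exceeds the SM's resources.
inline int residentBlocks(const SmLimits& sm, const BlockNeeds& block) {
  const int byRegisters =
      sm.registers / (block.threads * block.registersPerThread);
  const int byShared = block.sharedBytes > 0
                           ? sm.sharedBytes / block.sharedBytes
                           : sm.maxBlocks;
  return std::min({byRegisters, byShared, sm.maxBlocks});
}
```

With illustrative limits of 32768 registers, 48 kB of shared memory and at most 8 blocks per SM, a block of 256 threads using 32 registers per thread and 4 kB of shared memory would be limited by registers to 4 resident blocks.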
The Streaming Multiprocessors implement the Single-Instruction, Multiple-Thread (SIMT) architecture. Threads resident on one SM are bundled into groups of 32 parallel threads, so-called warps. The threads of exactly one warp are executed in SIMT fashion. Individual threads composing a warp start together at the same program address, but they feature their own instruction register and are free to branch independently. When threads of a warp take different branches, however, execution of these branches is serialised.
The SM is able to hide latency to device memory by
switching execution to a different warp whose instruc-
tion is ready to execute. It is therefore beneficial to or-
ganise the thread number per SM in such a way that a
sufficient number of warps is resident to (ideally) com-
pletely hide the latency to device memory.
3. Expression Templates
C++ function and class templates together with func-
tion and operator overloading offer the possibility to
represent expressions as C++ types. This technique is
commonly referred to as Expression Templates (ETs)
and was first introduced by Todd Veldhuizen [12] and
David Vandevoorde.
ETs provide a means to eliminate the need for creating temporaries when implementing a C++ vector class library. Such a library can thus offer both a convenient API, as desired for domain specific languages, and high performance of the translated code. However, the latter relies on the optimisation abilities of the compiler and is not ensured in all ET applications, e.g. for Basic Linear Algebra Subprograms (BLAS) Level 2 and 3. Here different, but also ET-based, approaches may achieve better performance [13].
The Portable Expression Template Engine (PETE) pioneered the use of the ET technique for parallel physics computations [3]. It is an extensible implementation of the ET technique which achieves an exceptional level of abstraction without sacrificing performance. PETE provides the core functionality (on the vector level) of QDP++.
Implementation of a vector library utilising the ET
technique typically involves defining a template of the
evaluate function. Upon assignment of an expression
the compiler generates an instantiation of the evaluate
function matching the expression. The evaluate function
then typically implements a loop iterating over all vector
components – no temporary vector objects are required.
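The mechanism described above can be illustrated with a deliberately small sketch. It is not PETE or QDP++ code: the namespace et_sketch, the class Vec and the node types Add and Mul are hypothetical names chosen for this example, and only element-wise addition and multiplication are covered.

```cpp
#include <cstddef>
#include <vector>

namespace et_sketch {  // hypothetical namespace, not part of PETE or QDP++

class Vec;
template <class Expr>
void evaluate(Vec& dest, const Expr& expr);

// Expression nodes: they store references to their operands and compute
// a single component on demand. No vector-sized temporary is created.
template <class L, class R>
struct Add {
  const L& lhs;
  const R& rhs;
  double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

template <class L, class R>
struct Mul {
  const L& lhs;
  const R& rhs;
  double operator[](std::size_t i) const { return lhs[i] * rhs[i]; }
};

class Vec {
 public:
  explicit Vec(std::size_t n, double value = 0.0) : data_(n, value) {}
  double operator[](std::size_t i) const { return data_[i]; }
  double& operator[](std::size_t i) { return data_[i]; }
  std::size_t size() const { return data_.size(); }

  // Assigning an expression instantiates evaluate() for its exact type.
  template <class Expr>
  Vec& operator=(const Expr& expr) {
    evaluate(*this, expr);
    return *this;
  }

 private:
  std::vector<double> data_;
};

// The operators only build the expression type; no arithmetic happens here.
template <class L, class R>
Add<L, R> operator+(const L& l, const R& r) { return Add<L, R>{l, r}; }

template <class L, class R>
Mul<L, R> operator*(const L& l, const R& r) { return Mul<L, R>{l, r}; }

// One loop over all components evaluates the whole expression tree.
template <class Expr>
void evaluate(Vec& dest, const Expr& expr) {
  for (std::size_t i = 0; i < dest.size(); ++i) dest[i] = expr[i];
}

}  // namespace et_sketch
```

For vectors a, b and c of equal length, the statement c = a + b * a builds the type Add<Vec, Mul<Vec, Vec>> and evaluates it in a single loop. The unconstrained operator templates are acceptable only because they are confined to their own namespace; a production library would restrict them to its own expression types.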
C++ compilers offer the possibility to obtain the template arguments of a fully instantiated function template in the form of a C string (so-called pretty printing). This provides a means to access the expression types at runtime.
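As a sketch of pretty printing, the following uses the `__PRETTY_FUNCTION__` identifier, which is a GCC and Clang extension rather than standard C++. The types BinaryNode and LatticeField and the function prettyName are hypothetical stand-ins, and the exact format of the returned string depends on the compiler.

```cpp
#include <string>

// Hypothetical stand-ins for an expression node and a lattice data type.
template <class L, class R>
struct BinaryNode {};
struct LatticeField {};

// Returns the compiler-generated signature of this instantiation, which
// spells out the template argument Expr. __PRETTY_FUNCTION__ is a GCC/Clang
// extension; its exact format is compiler specific.
template <class Expr>
std::string prettyName(const Expr&) {
  return __PRETTY_FUNCTION__;
}
```

Calling prettyName(BinaryNode<LatticeField, LatticeField>{}) yields a string that contains the full expression type, which a code generator could then parse.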
4. QDP++
QCD Data Parallel (QDP++) is a C++ vector class li-
brary suited for quantum field theory. It forms the basis
for the widely used lattice QCD software suite Chroma
and as such provides the lattice wide data types and ex-
pressions used in Chroma [11]. Chroma implements lin-
ear algebra operations which may include nearest neigh-
bour communications utilising the QDP++ API.
Although QDP++ was not originally designed for multicore acceleration, this work demonstrates that design elements can be added to it in such a way that evaluations of arbitrary expressions are accelerated and executed on a GPU. This approach was previously established when deploying lattice QCD applications to the QPACE supercomputer [14, 15].
5. Accelerating QDP++ evaluation on a GPU
Accelerating QDP++ expressions on a GPU relies on applying the ET technique in the device memory domain. A crucial requirement for ETs to work in a particular memory domain is, besides the compiler's ability to handle C++ templates, the ability to take the memory addresses of functions and to dereference them dynamically (function pointers).
Release 4.0 of CUDA, on devices with compute capability 2.0 or higher, meets the requirements for the ET technique to work in the device memory domain.
Unfortunately CUDA provides only a C-interface to
kernel functions making it impossible to directly pass
C++ constructed expressions as arguments to compute
kernels.
This work circumvents the aforementioned lack of a C++ API for kernel arguments in two steps: first, an object of a C++ expression type equivalent to the one used during host code translation is constructed in the device memory domain; second, the missing runtime configurable Plain Old Data (POD) parts are supplied by copying them from the host side expression object.
In order to construct the object in the device memory domain, JIT compilation of a CUDA kernel module is triggered upon expression evaluation. Launching this kernel constructs the required object in device memory. At this point the runtime configurable parts are still missing.
To build the bridge between the two expression objects residing in different memory domains, the POD part of the host side expression object is copied into a C API compatible form which can be passed as CUDA kernel arguments.
5.1. Dynamic Code Translation, Just-In-Time Compilation
Entering the evaluation function triggers pretty printing of its arguments and the execution of an external code generator which generates C++ device code leveraging the ET technique. The NVIDIA compiler NVCC is then invoked to build a shared library containing the CUDA kernel.
The shared library is loaded via the dynamic linking
loader and the kernel is executed on the device.
After evaluation the shared library is kept loaded until
the application exits. This ensures that each kernel func-
tion is only generated once and subsequent calls branch
to the already loaded shared library.
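The "generate once, reuse afterwards" behaviour can be sketched as a cache keyed by the pretty-printed expression string. The class KernelCache, the type KernelFn and the function noopKernel are hypothetical names for this illustration. The expensive step (code generation, NVCC invocation, loading the shared library, for instance with the POSIX functions dlopen and dlsym) is represented by an injected builder callback so that the sketch stays self-contained.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical signature of a kernel entry point taking flattened arguments.
using KernelFn = void (*)(void*);

// Stand-in kernel so that the sketch runs without a GPU or a compiler call.
inline void noopKernel(void*) {}

class KernelCache {
 public:
  // The builder represents "generate code, compile, load shared library".
  using Builder = std::function<KernelFn(const std::string&)>;

  explicit KernelCache(Builder build) : build_(std::move(build)) {}

  // Returns the kernel for an expression; builds it only on the first request.
  KernelFn get(const std::string& prettyName) {
    auto it = loaded_.find(prettyName);
    if (it != loaded_.end()) return it->second;
    KernelFn fn = build_(prettyName);
    loaded_.emplace(prettyName, fn);
    return fn;
  }

  std::size_t size() const { return loaded_.size(); }

 private:
  Builder build_;
  std::map<std::string, KernelFn> loaded_;  // kept until the object is destroyed
};
```

Keeping the map alive for the lifetime of the application mirrors the statement that the shared library stays loaded until the application exits.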
5.2. Flattening the Expression Tree
PETE [3] provides means to traverse the expression
tree and execute custom operations on the tree nodes
and leaves. This method is used on host side to collect
the runtime configurable data (POD portion) of each op-
erator and to store them into a linear storage container
that can be passed (as a pointer) to the CUDA kernel as
an argument, i.e. the expression tree is flattened.
The inverse operation restores the expression tree on the device side (see footnote 1).
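Flattening and restoring can be sketched on the host alone. The node types below (Leaf, Scale, UnaryNode, BinaryAdd) and the helper functions are hypothetical and much simpler than PETE's; the point is only that both traversals visit the nodes in the same order, so a plain byte buffer suffices to carry the POD parts from one expression object to another of the same type.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical node types standing in for PETE leaves and operator nodes.
struct Leaf  { const float* data; };  // runtime part: pointer to field data
struct Scale { float factor; };       // runtime part: a scalar parameter
template <class Op, class Child> struct UnaryNode { Op op; Child child; };
template <class L, class R>      struct BinaryAdd { L left; R right; };

using Buffer = std::vector<unsigned char>;

// Append the raw bytes of one POD value to the linear buffer.
template <class Pod>
void appendPod(Buffer& buf, const Pod& value) {
  static_assert(std::is_trivially_copyable<Pod>::value, "POD parts only");
  const unsigned char* p = reinterpret_cast<const unsigned char*>(&value);
  buf.insert(buf.end(), p, p + sizeof(Pod));
}

// Read one POD value back and advance the cursor.
template <class Pod>
void readPod(const unsigned char*& cursor, Pod& value) {
  std::memcpy(&value, cursor, sizeof(Pod));
  cursor += sizeof(Pod);
}

// Flatten: depth-first traversal, operator data first, then children.
inline void flatten(const Leaf& l, Buffer& buf) { appendPod(buf, l.data); }
template <class Op, class C>
void flatten(const UnaryNode<Op, C>& n, Buffer& buf) {
  appendPod(buf, n.op);
  flatten(n.child, buf);
}
template <class L, class R>
void flatten(const BinaryAdd<L, R>& n, Buffer& buf) {
  flatten(n.left, buf);
  flatten(n.right, buf);
}

// Restore: the same traversal order fills an object of the same type.
inline void restore(Leaf& l, const unsigned char*& c) { readPod(c, l.data); }
template <class Op, class C>
void restore(UnaryNode<Op, C>& n, const unsigned char*& c) {
  readPod(c, n.op);
  restore(n.child, c);
}
template <class L, class R>
void restore(BinaryAdd<L, R>& n, const unsigned char*& c) {
  restore(n.left, c);
  restore(n.right, c);
}
```

In the setting of this work the restore step would run inside the CUDA kernel on an expression object constructed in device memory; here both steps run on the host purely for illustration. If the two sides traversed the tree in different orders, the restore functions would have to mirror the order used by flatten.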
Certain runtime configurable operators require special treatment. For example, the shift operation requires read-only access to a site table index initialised at runtime.
A device storage container was added in such a way that the first call to a particular runtime configurable operation triggers copying of the required data to device memory. The storage container keeps track of memory regions already transferred, so that subsequent calls to the same operation do not trigger the device copy again but make use of the data already resident in device memory. The site tables remain in device memory until they are explicitly freed by the user. This speeds up repeated calls to the same shift function.
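The tracking behaviour of such a storage container can be sketched as follows. The class DeviceStorageSketch is a hypothetical illustration, not the container used in QDP++: ordinary host memory and memcpy stand in for device allocation and the host-to-device copy, and regions are identified by their host address only.

```cpp
#include <cstddef>
#include <cstring>
#include <map>
#include <memory>
#include <utility>

class DeviceStorageSketch {
 public:
  // Returns the resident copy of the region starting at host; the copy is
  // made only the first time a given host address is requested.
  const void* get(const void* host, std::size_t bytes) {
    auto it = resident_.find(host);
    if (it != resident_.end()) return it->second.get();
    std::unique_ptr<unsigned char[]> copy(new unsigned char[bytes]);
    std::memcpy(copy.get(), host, bytes);  // stand-in for the device copy
    ++copies_;
    const void* device = copy.get();
    resident_.emplace(host, std::move(copy));
    return device;
  }

  // Explicit release by the user, after which regions are copied again.
  void freeAll() { resident_.clear(); }

  std::size_t copies() const { return copies_; }

 private:
  std::map<const void*, std::unique_ptr<unsigned char[]>> resident_;
  std::size_t copies_ = 0;
};
```

Requesting the same site table twice performs one copy and returns the same resident pointer both times; only after freeAll() is a new copy made.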
Footnote 1: For unknown reasons the NVIDIA Frontend++ traverses the expression tree in a mirrored sense compared to the GNU C++ Compiler.
5.3. Mixed Memory Domain Approach
Since device memory is (still) a scarce resource, a mixed memory domain approach was favoured: memory for lattice wide objects is allocated in the host memory domain. Upon user request the object's data is pushed to the device memory domain.
A new feature of CUDA version 4.0 makes it possible to page-lock a memory range that was already allocated (4 kB aligned) in the host memory domain and to add it to the tracking mechanism that automatically accelerates calls to device copy functions. This mechanism eliminates the previously required staging of data regions prior to the transfer to device memory and reduces pressure on host memory.
5.4. Thread Geometry
The evaluation function template in ET based vector libraries typically triggers execution of a loop iterating over all lattice sites. With CUDA, such a loop is typically parallelised by unrolling it and starting one thread per loop iteration.
Since the CUDA kernels applied here not only process the lattice sites but also require prior reconstruction of the expression tree, it was not clear that the typical approach leads to the best performance. A software configuration parameter N_site was therefore introduced which specifies the number of sites assigned to one thread.
CUDA kernel functions are launched by specifying the grid and block geometries. Thus a software configuration parameter N_threads is introduced that specifies the number of threads per block.
Given the total number of lattice sites, the grid geometry is a function of N_threads and N_site.
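One plausible form of this dependence is sketched below. The text does not spell out the formula, so the rounding-up rule and the function name numBlocks are assumptions of this example; how the resulting blocks are arranged over the grid dimensions is left open.

```cpp
#include <cstddef>

// Total number of thread blocks needed so that every lattice site is
// covered, assuming each thread processes nSite sites and each block holds
// nThreads threads. Rounds up when the sites do not divide evenly, in which
// case the kernel would need a bounds check for the surplus threads.
inline std::size_t numBlocks(std::size_t sites, std::size_t nThreads,
                             std::size_t nSite) {
  const std::size_t sitesPerBlock = nThreads * nSite;
  return (sites + sitesPerBlock - 1) / sitesPerBlock;
}
```

For a 32^3 × 64 lattice (2097152 sites), 32 threads per block and one site per thread give 65536 blocks, while four sites per thread reduce this to 16384.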
For each expression E_i a separate CUDA kernel is generated. Thus the sustained performance P is a function of the number of threads per block, the number of lattice sites processed per thread, and the expression E_i: P(N_threads, N_site, E_i).
CUDA enabled software packages might be equipped
with an auto-tuning mechanism that determines the op-
timal grid and block geometries for the particular set of
installed devices. Auto-tuning is run prior to production
and the geometry parameters that yield the best perfor-
mance are stored for later inclusion.
6. New QDP++ API Elements
The QDP++ API was extended by the following ele-
ments:
Listing 1: Modified Chroma implementation for Jacobi smearing. Prior to any calculation the lattice wide objects are pushed to the device (first darker grey shaded region, lines 9-14). After calculation the result object is copied back to host memory and device memory is freed (second shaded region, lines 35-40). QDP++ expressions (line number): E_0 (16, 20), E_1 (25), E_2 (27), E_3 (30), E_4 (33).
 1 template <typename T>
 2 void jacobiSmear(const multi1d<LatticeColorMatrix>& u, T& chi,
 3     const Real& kappa, int iter, int no_smear_dir, const Real& _norm)
 4 {
 5   T psi;
 6   Real norm;
 7   T s_0, h_smear;
 8
 9   psi.pushToDevice();
10   for (int mu = 0; mu < Nd; ++mu)
11     u[mu].pushToDevice();
12   chi.pushToDevice();
13   h_smear.pushToDevice();
14   s_0.pushToDevice();
15
16   s_0 = chi;
17
18   for (int n = 0; n < iter; ++n)
19   {
20     psi = chi;
21     bool first = true;
22     for (int mu = 0; mu < Nd; ++mu)
23     {
24       if (first)
25         h_smear = u[mu]*shift(psi, FORWARD, mu) + shift(adj(u[mu])*psi, BACKWARD, mu);
26       else
27         h_smear += u[mu]*shift(psi, FORWARD, mu) + shift(adj(u[mu])*psi, BACKWARD, mu);
28       first = false;
29     }
30     chi = s_0 + kappa * h_smear;
31   }
32
33   chi /= _norm;
34
35   chi.popFromDevice();
36   for (int mu = 0; mu < Nd; ++mu)
37     u[mu].freeDeviceMem();
38   psi.freeDeviceMem();
39   h_smear.freeDeviceMem();
40   s_0.freeDeviceMem();
41
42 }
• OLattice::pushToDevice()
Allocates a memory region of the object’s size in
device memory and copies the object’s data to the
device.
• OLattice::popFromDevice()
Copies the data from the device to host memory
and frees device memory.
• OLattice::freeDeviceMem()
Frees device memory.
• theDeviceStorage::freeAll()
Frees device memory previously allocated for run-
time configurable operators.
7. Application Example: Jacobi Smearing
Frequently used Chroma routines which are typically
not subject to special optimisation and which can con-
sume a significant amount of (wallclock) time when ex-
ecuted on a few CPU cores only are quark smearing rou-
tines. These are operations acting on lattice wide ob-
jects and are typically iterative prescriptions including
nearest neighbour communications. One of these rou-
tines implements Jacobi smearing [16] and serves here
for the benchmarking analysis.
The Jacobi smearing procedure is obtained by solving the Klein-Gordon equation
$$ K(x, x')\, S(x', 0) = \delta_{x,0} \tag{1} $$
where
$$ K_{x,x'} = \delta_{x,x'} - \kappa_S \sum_{\mu} \left[ U_\mu(x)\, \delta_{x',x+\mu} + U^\dagger_\mu(x-\mu)\, \delta_{x',x-\mu} \right] \tag{2} $$
as a power series in κ_S stopping at some finite power N_smear.
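The power series can be written out explicitly. The following short derivation is a sketch: the shorthand H for the sum over directions in Eq. (2) and the superscript notation S^{(n)} for the n-th iterate are introduced here for convenience and do not appear in the text.

```latex
% Shorthand (assumed notation): H is the sum over directions in Eq. (2).
H_{x,x'} = \sum_{\mu} \left[ U_\mu(x)\,\delta_{x',x+\mu}
         + U^\dagger_\mu(x-\mu)\,\delta_{x',x-\mu} \right],
\qquad K = 1 - \kappa_S H .

% Formal inverse as a geometric series, truncated at the power N_smear:
S = K^{-1}\,\delta = \sum_{n=0}^{\infty} \kappa_S^{\,n} H^{n}\,\delta
  \;\approx\; \sum_{n=0}^{N_{\mathrm{smear}}} \kappa_S^{\,n} H^{n}\,\delta .

% The truncated sum is reproduced by a simple iteration:
S^{(0)} = \delta, \qquad
S^{(n+1)} = \delta + \kappa_S\, H\, S^{(n)}
\quad\Longrightarrow\quad
S^{(N_{\mathrm{smear}})} = \sum_{n=0}^{N_{\mathrm{smear}}} \kappa_S^{\,n} H^{n}\,\delta .
```

The iteration has the same shape as the update chi = s_0 + kappa * h_smear in Listing 1, with the source field s_0 taking the place of δ and h_smear holding H applied to the previous iterate.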
Listing 1 shows the Chroma implementation of Jacobi smearing using the QDP++ API. Five QDP++ expressions E_i, with 0 ≤ i ≤ 4, are involved, of which E_1 and E_2 are the most compute intensive.
The code lines shaded darker grey were introduced to enable execution on the GPU. All lattice objects are pushed to the device. After the calculation the result object is copied back to its original location in host memory and the device memory is freed.
8. Benchmark Results
Benchmarking analyses were carried out using an NVIDIA GeForce GTX 480. This device has 1.5 GB of memory, compute capability 2.0, 15 Streaming Multiprocessors and a total of 480 CUDA cores. Double precision on this device is restricted as it utilises the GF100 chip designed for the consumer market. The NVIDIA CUDA 4.0 toolkit (Release Candidate 2) was used with the NVIDIA Linux kernel driver version 270.40.
Table 1: Benchmark results for the Jacobi smearing routine executed on the CPU and the GPU. Numbers in sustained GFLOPS for the whole smearing routine. (∗) Double precision throughput on this device is restricted.
                      SP                DP(∗)
lattice size      CPU     GPU       CPU     GPU
8^3 × 16          1.55    1.64      1.45    1.49
12^3 × 24         1.52    6.53      1.40    4.41
16^3 × 32         1.50   11.86      1.40    6.26
20^3 × 40         1.52   16.56      1.41    7.51
24^3 × 48         1.52   19.09      1.38    7.91
32^3 × 64         1.51   21.31      -       -
Figure 1: Performance dependence of expression E_2 on N_threads for an L_x/a = 32 lattice (single precision). [Plot "Performance of expression E_2": GFLOPS versus N_threads, curves for N_site = 1, 2, 4, 8.]
The Chroma Jacobi smearing routine was applied to lattice objects with lattice sizes ranging from N = 8^3 × 16 to N = 32^3 × 64 in single precision and from N = 8^3 × 16 to N = 24^3 × 48 in double precision.
The first benchmark analysis studied the performance P(N_threads, N_site, E) for varying N_threads, keeping E = E_2 and N_site = 1, 2, 4, 8 fixed, for a 32^3 × 64 lattice in single precision. Fig. 1 shows the result. The best performance was achieved when setting the thread block size equal to the warp size, N_threads = 32.
The second benchmark analysis focused on the performance dependence on N_site, keeping E = E_2 and N_threads = 32, 64, 128, 256 fixed, for a 32^3 × 64 lattice in single precision. Fig. 2 shows the result. Only a moderate dependence on this configuration parameter was seen, although for 32 threads a slightly better performance is achieved when using N_site = 4 or 8 instead of N_site = 1.
Figure 2: Performance dependence of expression E_2 on N_site for an L_x/a = 32 lattice (single precision). [Plot "Performance of expression E_2": GFLOPS versus N_site, curves for N_threads = 32, 64, 128, 256.]
Figure 3: Benchmark result for Jacobi smearing on an NVIDIA GeForce GTX 480 in comparison to the Intel Xeon CPU. The number of lattice sites is given by N = 2 (L_x/a)^4. [Plot "Jacobi smearing": GFLOPS versus L_x/a, curves for CPU, GTX480, CPU DP, GTX480 DP.]
These benchmark analyses were repeated for the expressions E = E_0, E_1, E_3, E_4. The same characteristics were observed: setting the parameter N_threads = 32 clearly resulted in the best performance in all cases, with only a moderate dependence on N_site.
A final benchmark analysis was carried out for the whole smearing routine (including all five expressions) for different lattice sizes (single and double precision) in comparison to executing the same routine on the host CPU, an Intel Xeon CPU (E5507, 4 cores, 2.27 GHz). Tab. 1 shows the benchmark results in numbers (sustained GFLOPS for single and double precision). Fig. 3 shows the result graphically. For the 32^3 × 64 lattice in single precision a speedup factor (compared to the CPU) of more than 14 was measured.
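The quoted speedup factor follows directly from the single precision columns of Table 1. The small sketch below recomputes it; the struct BenchmarkRow and the function speedup are hypothetical names, and the numbers are copied from the table.

```cpp
#include <cstddef>

// One row of Table 1 (sustained GFLOPS for the whole smearing routine).
struct BenchmarkRow {
  const char* lattice;
  double cpuGflops;
  double gpuGflops;
};

// Single precision columns of Table 1.
const BenchmarkRow kSinglePrecision[] = {
    {"8^3 x 16", 1.55, 1.64},   {"12^3 x 24", 1.52, 6.53},
    {"16^3 x 32", 1.50, 11.86}, {"20^3 x 40", 1.52, 16.56},
    {"24^3 x 48", 1.52, 19.09}, {"32^3 x 64", 1.51, 21.31},
};
const std::size_t kNumRows =
    sizeof(kSinglePrecision) / sizeof(kSinglePrecision[0]);

// Speedup of GPU over CPU execution for one lattice size.
inline double speedup(const BenchmarkRow& row) {
  return row.gpuGflops / row.cpuGflops;
}
```

For the largest lattice the ratio is 21.31 / 1.51, which is slightly above 14, while for the smallest lattice the GPU is only marginally faster than the CPU.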
9. Conclusion, Outlook and Discussion
As a first step, QDP++ expression evaluation was accelerated on the GPU by applying the ET technique in the device memory domain. This acceleration alone already yields a significant speedup factor for the evaluation.
Providing an auto-tuning mechanism for the approach described here is not straightforward since the individual expressions E_i are not known prior to production. Establishing an auto-tuning mechanism forms part of the to-do list. However, as the benchmark measurements showed, there seems to be a preferred thread geometry that leads to the best performance.
Optimisation would be the next major step. Changing the data layout in such a way that device memory accesses are coalesced is expected to yield an even higher speedup factor.
A next step in another direction would be parallelisation over multiple GPUs per host and, building on the parallel architecture of QDP++, extending this approach to multiple hosts.
One might also want to move the loading of the shared libraries to an external daemon program or the system loader. In this way the JIT compilation takes place only once, even across several Chroma runs.
Utilising QDP++ profiling would eliminate the dependence on an external code generator.
Software Availability
QDP++ and Chroma are available as open source
software [11]. QDP++ configurable for GPU evalua-
tion is available [17].
The GPU portion of QDP++ requires the upcoming release 4.0 of NVIDIA CUDA [1] and devices with compute capability no less than 2.0.
Acknowledgements
FW is supported through a Marie Curie Early Stage
Researcher fellowship as part of STRONGnet (EU grant
238353).
[1] NVIDIA, CUDA Zone, accessed 2011/5/5. URL www.nvidia.com/object/cuda_home.html
[2] T. L. Veldhuizen, D. Gannon, Active Libraries: Rethinking the roles of compilers and libraries, in: Proceedings of the SIAM Workshop OO98, SIAM Press, 1998. arXiv:math.NA/9810022.
[3] S. Haney, J. Crotinger, S. Karmesin, S. Smith, Easy expression templates using PETE, the portable expression template engine, Technical Report LA-UR-99 (1999) 777.
[4] M. A. Clark, QCD on GPUs: Cost Effective Supercomputing, PoS LAT2009 (2009) 003. arXiv:0912.2268.
[5] A. Alexandru, C. Pelissier, B. Gamari, F. Lee, Multi-mass solvers for lattice QCD on GPUs. arXiv:1103.5103.
[6] N. Cardoso, M. Cardoso, P. Bicudo, Finite temperature lattice QCD with GPUs. arXiv:1104.5432.
[7] B. Walk, H. Wittig, E. Dranischnikow, E. Schomer, Implementation of the Neuberger-Dirac operator on GPUs, PoS LATTICE2010 (2010) 044. arXiv:1010.5636.
[8] T.-W. Chiu, T.-H. Hsieh, Y.-Y. Mao, K. Ogawa, GPU-Based Conjugate Gradient Solver for Lattice QCD with Domain-Wall Fermions, PoS LATTICE2010 (2010) 030. arXiv:1101.0423.
[9] M. A. Clark, R. Babich, K. Barros, R. C. Brower, C. Rebbi, Solving Lattice QCD systems of equations using mixed precision solvers on GPUs, Comput. Phys. Commun. 181 (2010) 1517–1528. arXiv:0911.3191, doi:10.1016/j.cpc.2010.05.002.
[10] R. Babich, M. A. Clark, B. Joo, Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics. arXiv:1011.0024.
[11] R. G. Edwards, B. Joo, The Chroma software system for lattice QCD, Nucl. Phys. Proc. Suppl. 140 (2005) 832. arXiv:hep-lat/0409003, doi:10.1016/j.nuclphysbps.2004.11.254.
[12] T. Veldhuizen, Expression Templates, C++ Report 7, 1995.
[13] K. Iglberger, G. Hager, J. Treibig, U. Ruede, Expression Templates Revisited: A Performance Analysis of the Current ET Methodology. arXiv:1104.1729.
[14] F. Winter, Investigation of Hadron Matter using Lattice QCD and Implementation of Lattice QCD Applications on Heterogeneous Multicore Acceleration Processors, Ph.D. thesis, Regensburg University (2011).
[15] Y. Nakamura, A. Nobile, D. Pleiter, H. Simma, T. Streuer, T. Wettig, F. Winter, Lattice QCD Applications on QPACE. arXiv:1103.1363.
[16] C. R. Allton, C. T. Sachrajda, R. M. Baxter, S. P. Booth, K. C. Bowler, S. Collins, D. S. Henty, R. D. Kenway, B. J. Pendleton, D. G. Richards, J. N. Simone, A. D. Simpson, B. E. Wilkes, C. Michael, Gauge-invariant smearing and matrix correlators using Wilson fermions at β = 6.2, Phys. Rev. D 47 (11) (1993) 5128–5137. doi:10.1103/PhysRevD.47.5128.
[17] F. Winter, GPU enabled QDP++. URL github.com/fwinter
