We present the first GPU-based conjugate gradient (CG) solver for lattice QCD with domain-wall fermions (DWF). It is well-known that CG is the most time-consuming part in the Hybrid Monte Carlo simulation of unquenched lattice QCD, which becomes even more computational demanding for lattice QCD with exact chiral symmetry. We have designed a CG solver for the general 5-dimensional DWF operator on NVIDIA R CUDA TM architecture with mixed-precision, using the defect correction as well as the reliable updates algorithms. We optimize our computation by even-odd preconditioning in the 4D space-time lattice, plus several innovative techniques for CUDA kernels. For NVIDIA GeForce R GTX 285/480, our CG solver attains 180/233 Gflops (sustained).
Introduction
Simulation of unquenched lattice QCD with exact chiral symmetry is a grand challenge among all sciences. Even for a modest 16 3 × 32 lattice with lattice spacing a ∼ 0.1 fm, it often requires a supercomputer with peak computing power more than 50 Teraflops (e.g., 10 racks of IBM BlueGene/L). Therefore, only 2-3 lattice QCD groups around the world could afford to perform the simulation of unquenched lattice QCD with the domain-wall fermion [1] , or the overlap-Dirac fermion [2] . However, this scenario has been undergoing a dramatic change during the last 12 months. With the emergence of low-cost and computationally powerful Graphic Processing Unit (GPU), now plugging a graphic card with NVIDIA GTX 285 (240 cores, one Teraflops peak) into a PC immediately turns the system into a powerful device, attaining a sustained 180 Gflops for the conjugate gradient (CG) with mixed precision.
Since 2009, the Taiwan Lattice QCD Collaboration (TWQCD) has been using a GPU cluster (currently constituting of 250 NVIDIA GPUs with 40 Teraflops (sustained)) to simulate unquenched lattice QCD with optimal domain-wall quarks [3, 4] . We have met the challenge of preserving the chiral symmetry to a high precision and sampling all topological sectors ergodically. For our recent physical results on the 2-flavors QCD, we refer the readers to [5] and [6] .
In this paper, we present our design of the first CG solver for the general 5-dimensional DWF operator on NVIDIA CUDA architecture with mixed-precision, using the defect correction as well as the reliable updates algorithms. Our CG solver is optimized with even-odd preconditioning on the 4-dimensional space-time lattice, plus several innovative tuning techniques for the CUDA kernels. For NVIDIA GeForce GTX 285/480, our CG solver achieves 180/233 Gflops (sustained).
Conjugate Gradient Solver for Domain-Wall Fermions

Optimal Domain-Wall Fermions
Mathematically, for a given N s (the number of sites in the fifth dimension), the maximal chiral symmetry can be attained by the optimal domain-wall fermion (ODWF) [3] with the operator
Here D w is the standard Wilson Dirac Operator plus a negative parameter −m 0 (0 < m 0 < 2),
where U µ (x) denotes the link varaible, d is the dimension of the space-time (d = 4 for QCD),
and
The weights {ω s } along the fifth dimension are fixed according to the formula derived in [3] such that the maximal chiral symmetry is attained. In general, for other DWF with non-maximal chiral symmetry, the weights {σ s } and {ω s } have different values, e.g., for the conventional (Shamir) DWF, σ s = 0, ω s = 1, ∀s, and for the Borici DWF [7] , σ s = ω s = 1, ∀s.
Even-Odd Preconditioning
Since D w commutes with (ω) ss ′ ≡ ω s δ ss ′ and (σ ) ss ′ = σ s δ ss ′ , Eq. (2.1) becomes
Separating the even and the odd sites on the 4D space-time lattice, Eq. (2.5) can be written as
where
Now we further rewrite it in a more symmetric form by defining
Obviously, the most time-consuming task in the HMC is to solve the linear system CC † |x = |b by the conjugate gradient (CG), namely, in the computation of the fermion force in the molecular dynamics. In this work, we implement the entire conjugate gradient inside the NVIDIA GPU, which can be used for the HMC as well as for computing the valence quark propagators.
Algorithm
Conjugate Gradient (CG) method [8] is a widely-used numerical algorithm for iteratively solving a linear system Ax = b to a certain precision ε, where A is a positive-definite Hermitian matrix. With the CG algorithm (see Algorithm 1), the problem is turned into a task dominated by the matrix-vector multiplication. In this work, we utilize CUDA to implement the 5D domain-wall fermion operator (2.10) matrix-vector multiplications of the CG on NVIDIA GPUs. For the GPU, the single-precision operations are several times faster than the double-precision ones, thus it is advantageous to use the mixed-precision CG. In the so-called defect correction algorithm, one solves x in the low-precision, and updates the solutionx and the residuer in the high-precision. (see Algorithm 2, where the hatted symbols represent variables in the high-precision). In this fashion, most of the floating-point operations are in the low-precision, thus it is advantageous for the GPU. Theoretically, the defect correction algorithm is mathematically sound, and it always works in practice. However, the seemingly drawback is that one has to build up the Krylov space every time it restarts the CG in the low precision. On the other hand, if one does not reset the low-precision p 
x := initial guesŝ r :=b −Âx while |r k | >ε|b| do r :=r p :=r x := 0 Use Algorithm 1 to solve x = A −1 r in the low-precision to a precision ε x :=x + x r :=b −Âx end while vector inside the while loop of Algorithm 2 (i.e., skipping the step (p :=r) except at the first time), the "warm-up" time in re-building the Krylov space could be reduced. This so-called reliable updates algorithm [9, 10] would run faster than the defect correction. Although the reliable updates in this fashion may not converge for all cases due to the non-orthogonality of p and r, in practice it seems to work for most cases. We have implemented both algorithms in our CG solver, and it automatically switches to the defect correction if the reliable updates does not converge in the first place.
CUDA Kernels and Optimization
The CUDA architecture developed by NVIDIA enables us to do parallel computations on NVIDIA's GPUs. (For more detailed programming models and strategies, see "CUDA Programming Guide for CUDA Toolkit" [11] .)
In CUDA, a thread is the smallest unit to execute a kernel, and a certain number of threads form a block. Each thread will be assigned a 3-dimensional index, and each block a 2-dimensional index. Inside a block, a warp of threads will execute the kernel concurrently. Each thread has its own register memory, and threads in the same block share a shared memory. The space of register and shared memory is very limited, but they have the fastest access speed. The global memory on the device can be accessed by any thread, but it has the slowest bandwidth.
To implement the mixed-precision CG for ODWF with CUDA, we perform all matrix-vector multiplication, vector reduction (inner product), and vector addition/subtraction on the GPU (device), while the CPU (host) is used to do the flow control, memory assignment, and device control.
The CUDA kernels in our CG solver can be divided into five different catalogs. We will discuss each catalog and their optimization in the following subsections.
Vector Index Conversion
These kernels are used to change the indexing schemes between the device (GPU) and the host (CPU). To store the Dirac spinors in an one-dimensional array, we need to map the multidimensional indices to the 1D array index. One needs 4 × 3 × 2 = 24 real numbers to store one Dirac spinor at each site of the 5D lattice. On the CPU, this color-spinor index c which runs from 0 to 23 is the inner-most (fastest-running) index, which is followed by the fifth-dimension index s, and then x, y, z,t indices, where t is the outer-most (slowest-running) index. If i host denotes the one-dimensional array index of the CPU scheme, then we have
However, for computation on the GPU, we assign each thread a five-dimensional site index. This implies that adjacent threads have consecutive s indices. Thus we want to arrange the data such that optimal coalescence is attained when loading the vector from device global memory to the register and/or the shared memory of the threads. Since the GPU provides vector data types such as float4 and double2 which allow us to move 4 single precision numbers (or 2 double precision numbers) at one time, a simple way to map to the one-dimensional array index on GPU (for singleprecision) is 2) and similarly for the double-precision. So every time when we transfer data between the host and the device, we convert the index accordingly.
Matrix-Vector Multiplication for
is similar to the usual Wilson-Dirac operator D w without the mass term,
From this expression we see that the multiplication of D OE w (D EO w ) with a vector involves the link variables U µ (x) and γ-matrices. We have used the following tricks for optimization.
Firstly, since the γ-matrices in Eq. (3.3) are in the combination (1 ± γ µ ), the left-handed and the right-handed Dirac components are related to each other. Also, since the link variables do not have Dirac indices, we can just multiply U µ (x) to the left-handed components, and then restore the right-handed components.
Secondly, U µ (x) has no fifth-dimension dependence, so threads having the same x but different s can share the same U µ (x). So we put the link variables in the shared memory.
Thirdly, because GPU computation is memory bandwidth bound, one must try to reuse the data. For example, the hopping term (δ x−aμ,x ′ ) in D w , all neighboring sites of x are involved in the calculation. If we assign each x to one thread, then there must be overlapping data loading for neighboring sites. To reduce this overlapping data transfer, we distribute each (x, y, z) to one thread, with a loop in the t-direction. Then the neighboring data in the t-direction can be reused, and the efficiency is enhanced.
Besides above tuning techniques, we also expand small loops, and to use the texture memory for caching data, Here texture is used for loading the vectors and link variables. We use Python [12] to expand small loops, and to generate the set of kernels for D w multiplication.
Matrix-Vector Multiplication for M 5
The matrix M 5 is given in Eq. (2.8). One can see that M 5 is block diagonal in the chiral basis and it does not depend on the space-time nor the color indices. In fact, it can be divided into two constant matrices in the fifth dimension, i.e., the left-handed and the right-handed ones. So the multiplication of M 5 with a vector can be regarded as
Here we use the shared memory to store the source vector (v). Since M 5 only takes 2N 2 s real numbers, we can put M 5 into the register of each thread (with different s and x, y, z). Again, a loop in t is inserted to reuse the M 5 data in the register. Also, we use Python to generate these kernels.
Vector Reduction
To calculate the norm of a vector, we use the well-known parallel reduction with the shared memory. However, due to the limitation on the number of threads per block, it is inevitable to use global memory when the size of a vector becomes very large. Our solution is to perform the block reduction in prior kernels, i.e., to sum up vector elements (already stored in registers/shared memory) within each block. Then these partial sums can be added with a parallel reduction.
Vector Addition and Subtraction
We can combine the simple addition/subtraction with other kernels in which one vector has been loaded. For example, to multiply C ≡ 1 − M 5 D OE w M 5 D EO w to a vector, we can combine the last subtraction with the last M 5 multiplication. 
Summary
We have implemented an efficient GPU-based CG solver for generalized domain-wall fermions in lattice QCD. Our CUDA kernels are tuned with several innovative techniques. On NVIDIA GeForce GTX 285/480, our CG solver achieves 180/233 Gflops (sustained). This efficient CG solver constitutes the most crucial part in TWQCD's HMC code for simulation of unquenched lattice QCD with the optimal domain-wall fermion.
