NVIDIA's new architecture, Kepler, significantly improves GPU performance with the new streaming multiprocessor, SMX. Along with the performance gain, NVIDIA has also introduced new technologies such as dynamic parallelism, Hyper-Q and GPU Direct with RDMA. In addition to its usual GPUs, NVIDIA has released another Kepler GeForce GPU named GTX Titan.
Introduction
In 2013, NVIDIA launched a new Kepler GPU, GTX Titan, named after the fastest supercomputer, a GPU cluster of NVIDIA Tesla K20X at Oak Ridge National Laboratory [1]. GeForce GPUs are designed for gaming. However, GTX Titan is also well suited to parallel computing with CUDA. In terms of computing capability, GTX Titan is comparable to Tesla K20X, yet it costs only about one third as much.
Meanwhile, in 2012, Intel announced the Xeon Phi system based on the Intel Many Integrated Core (MIC) architecture [2]. A Xeon Phi coprocessor integrates many CPU cores on a PCI Express card, much like a GPU, so that it could, in principle, provide theoretical performance similar to that of GTX Titan. Its merit is that most ordinary C code that runs on CPUs can run on a Xeon Phi system without much modification, because it is a CPU-based platform. However, it turns out that the performance obtained in practice is low, and that it is very hard to reach high performance on Xeon Phi.
Table 2: Changed properties related to thread and block scheduling.
Several optimization schemes are possible for GTX Titan. Table 2 shows the changes in GPUs regarding thread and block scheduling. An SMX (Kepler) has 6 times as many cores as an SM (Fermi). To keep these cores busy, an SMX can run twice as many blocks simultaneously and has twice as many warp schedulers as an SM. The maximum number of registers per thread is also increased to 255, so that a thread can keep more variables in registers and reuse them quickly. Therefore, we might obtain significantly better performance simply by adjusting the numbers of threads and blocks.
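As a minimal sketch of what adjusting the numbers of threads and blocks can look like in practice, the following program times one kernel for several block sizes. The kernel `axpy`, the problem size and the range of block sizes are hypothetical choices made for illustration; they are not taken from the code discussed in this paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical memory-bound kernel (y = a*x + y), used only to illustrate the sweep.
__global__ void axpy(int n, double a, const double* x, double* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 21;                 // assumed problem size
    const size_t bytes = n * sizeof(double);
    double *x = NULL, *y = NULL;
    if (cudaMalloc(&x, bytes) != cudaSuccess ||
        cudaMalloc(&y, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(x, 0, bytes);
    cudaMemset(y, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Try several block sizes; the grid size follows from the block size.
    for (int threads = 64; threads <= 1024; threads *= 2) {
        int blocks = (n + threads - 1) / threads;
        axpy<<<blocks, threads>>>(n, 2.0, x, y);   // warm-up launch, not timed
        cudaEventRecord(start);
        axpy<<<blocks, threads>>>(n, 2.0, x, y);   // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("threads/block = %4d  blocks = %6d  time = %.3f ms\n",
               threads, blocks, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

The best block size depends on the kernel's register and shared memory usage, so a sweep of this kind would have to be repeated for each kernel of interest.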
GTX Titan & Kepler Architecture
The change in memory bandwidth is also very important. In general, the main bottleneck in GPUs is unfortunately the limited data transfer speed between GPU registers and memory. The performance of a CUDA program is usually determined by the product of the CGMA (compute to global memory access) ratio and the amount of data transferred per unit time [3]. Here, the CGMA ratio is the number of floating point operations per single data transfer.
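The relation above can be written out as a short worked estimate. The symbols $P$, $R_{\mathrm{CGMA}}$, $B$ and $s$ are notation introduced here for illustration, and the bandwidth of 288 GB/s is the publicly listed figure for GTX Titan, assumed here rather than taken from this paper.

```latex
% P      : attainable performance of a memory-bound kernel (FLOPS)
% R_CGMA : floating point operations per data transfer
% B      : global memory bandwidth (bytes per second)
% s      : size of one transferred number (bytes)
\[
  P \approx R_{\mathrm{CGMA}} \times \frac{B}{s}
\]
% Assumed example: B = 288 GB/s, double precision (s = 8 bytes)
\[
  \frac{B}{s} = \frac{288 \times 10^{9}\ \mathrm{bytes/s}}{8\ \mathrm{bytes}}
              = 36 \times 10^{9}\ \mathrm{transfers/s},
  \qquad
  R_{\mathrm{CGMA}} = 1 \;\Rightarrow\; P \approx 36\ \mathrm{GFLOPS}
\]
```

Under these assumptions, a kernel would need a CGMA ratio of roughly 36 before the double precision peak of about 1.3 TFLOPS, rather than the memory bandwidth, becomes the limit.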
The Kepler architecture has the following new features to improve memory usage.
• An 8-byte shared memory bank mode is added. Fermi GPUs provide only a 4-byte (32-bit) mode, whereas Kepler GPUs also provide an 8-byte (64-bit) mode. When this mode is turned on, one gets about twice the effective bandwidth for double precision floating point numbers.
• The ratio of shared memory to L1 cache can be adjusted to get better performance, with the total memory size fixed. Fermi GPUs support only two modes: 16 KB (shared memory) + 48 KB (L1 cache) and 48 KB + 16 KB. Kepler GPUs can also allocate 32 KB to shared memory and 32 KB to L1 cache.
• A 48 KB read-only data cache is added. The texture memory can be used as an additional read-only cache memory on Kepler GPUs.
Fig. 2 is the schematic diagram of GPU Direct v2, by which a GPU can access the memory of another GPU on the same PCIe bus, or transfer data to the other GPU, without going through the CPU memory.
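The three memory features in the list above map onto CUDA runtime calls and qualifiers roughly as follows. This is a sketch under stated assumptions: the kernel `scale` and its sizes are hypothetical, `__ldg()` requires compute capability 3.5 or higher, and `cudaDeviceSetSharedMemConfig()` is marked as deprecated in recent CUDA toolkits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel. 'in' is read-only for the whole kernel, so it is
// declared const __restrict__ and loaded with __ldg(), which routes the
// load through the read-only data cache.
__global__ void scale(int n, double a,
                      const double* __restrict__ in,
                      double* __restrict__ out)
{
    // Tile of doubles in shared memory; in 8-byte bank mode, consecutive
    // doubles fall into consecutive banks.
    __shared__ double tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = __ldg(&in[i]);
        out[i] = a * tile[threadIdx.x];
    }
}

int main()
{
    // Feature 1: request the 8-byte shared memory bank mode.
    cudaError_t err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    printf("bank size request: %s\n", cudaGetErrorString(err));

    // Feature 2: request the equal shared memory / L1 split for this kernel.
    err = cudaFuncSetCacheConfig(scale, cudaFuncCachePreferEqual);
    printf("cache config request: %s\n", cudaGetErrorString(err));

    const int n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(double);
    double *in = NULL, *out = NULL;
    if (cudaMalloc(&in, bytes) != cudaSuccess ||
        cudaMalloc(&out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(in, 0, bytes);

    // Feature 3 is used inside the kernel through __ldg().
    scale<<<(n + 255) / 256, 256>>>(n, 2.0, in, out);
    err = cudaDeviceSynchronize();
    printf("kernel: %s\n", cudaGetErrorString(err));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Both configuration calls are requests: on hardware that does not support a given mode, the runtime may ignore the setting, so measured timings are the only reliable check.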
A usual MPI + CUDA program assigns one GPU to each MPI process. For simplicity, consider a cluster node with 2 GPUs (GPU0 and GPU1) on which we run an MPI job of 2 processes. The first MPI process (MPI0) is assigned CPU0 and GPU0, and the second one CPU1 and GPU1. Suppose we want to transfer some data stored in GPU0's memory to GPU1's memory. Without GPU Direct, we must follow a cumbersome procedure.
1. First copy the data from GPU0's memory to CPU0's memory by using CUDA.
2. Send the data in CPU0's memory to CPU1's memory through the memory of the InfiniBand network adapter by using MPI.
3. Copy the data to GPU1's memory by using CUDA. (Fig. 4)
With GPU Direct, the data in GPU0 are transferred to GPU1 at once by a CUDA function: cudaMemcpyPeer().
// GPU0 to GPU1
cudaMemcpyPeer( d_a1, 1, d_a0, 0, size );
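For context, the call above can be embedded in a small self-contained program. This is a sketch for a single process that owns both GPUs, which is simpler than the two-process MPI setup described in the text; the buffer size is an assumption, and the names `d_a0` and `d_a1` follow the snippet above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev < 2) {
        printf("This sketch needs at least 2 GPUs.\n");
        return 0;
    }

    const size_t size = (1 << 20) * sizeof(double);   // assumed buffer size
    double *d_a0 = NULL, *d_a1 = NULL;

    // Allocate one buffer on each GPU.
    cudaSetDevice(0);
    cudaMalloc(&d_a0, size);
    cudaMemset(d_a0, 0, size);
    cudaSetDevice(1);
    cudaMalloc(&d_a1, size);

    // Check whether the two GPUs can address each other's memory directly.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // GPU0 may access GPU1
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);   // GPU1 may access GPU0
    } else {
        printf("Peer access is not available between GPU0 and GPU1.\n");
    }

    // GPU0 to GPU1
    cudaError_t err = cudaMemcpyPeer(d_a1, 1, d_a0, 0, size);
    printf("cudaMemcpyPeer: %s\n", cudaGetErrorString(err));

    cudaSetDevice(0);
    cudaFree(d_a0);
    cudaSetDevice(1);
    cudaFree(d_a1);
    return 0;
}
```

If peer access is not available, `cudaMemcpyPeer()` still performs the copy, but the data may be staged through host memory, so the expected speed-up does not materialize.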
Xeon Phi

[Equation missing: theoretical peak performance of GTX Titan, in TFLOPS.]
Here, FMA means the fused multiply-add instruction, which computes a × x + y in one cycle. Similarly, that of Xeon Phi 5110P is
[Equation missing.]
In the case of double precision operations, Xeon Phi 5110P has about the same performance (≈1 TFLOPS) as GTX Titan (≈1.3 TFLOPS). However, in the case of the Xeon Phi coprocessor, one should note that the factor of 16 comes from vectorization. Vectorization means parallelized calculation using SIMD instructions. As already mentioned, Xeon Phi coprocessors support 512-bit SIMD operations, so that 16 single precision calculations can be performed simultaneously. This means that the actual performance of a Xeon Phi system depends strongly on how well the code is vectorized.
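The role of the factor 16 can be made explicit with a generic peak performance estimate. The formula below is the standard counting argument, with symbols introduced here for illustration. The core counts and clock frequencies are taken from publicly listed specifications and are assumptions, not values quoted from this paper; the authors may have used different numbers.

```latex
% N_cores : number of cores
% f       : clock frequency
% N_SIMD  : single precision operations per SIMD instruction
% factor 2: one FMA counts as two floating point operations
\[
  P_{\mathrm{peak}} = N_{\mathrm{cores}} \times f \times N_{\mathrm{SIMD}} \times 2
\]
% Assumed specifications: GTX Titan, 2688 cores at 0.837 GHz
\[
  P_{\mathrm{Titan}} \approx 2688 \times 0.837\ \mathrm{GHz} \times 1 \times 2
                     \approx 4.5\ \mathrm{TFLOPS}
\]
% Assumed specifications: Xeon Phi 5110P, 60 cores at 1.053 GHz
\[
  P_{\mathrm{Phi}} \approx 60 \times 1.053\ \mathrm{GHz} \times 16 \times 2
                   \approx 2.0\ \mathrm{TFLOPS}
\]
% Same Xeon Phi without vectorization (N_SIMD = 1)
\[
  P_{\mathrm{Phi}}^{\mathrm{scalar}} \approx 60 \times 1.053\ \mathrm{GHz} \times 1 \times 2
                                     \approx 0.13\ \mathrm{TFLOPS}
\]
```

With these assumed numbers, code that does not vectorize reaches at most about one sixteenth of the nominal single precision peak of the Xeon Phi.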
However, it is not always possible to convert the code into the vectorized form that Xeon Phi systems require. The first difficulty is that one must program at the level of assembly language to control the layout of data in the SIMD registers [4]. Unfortunately, the C compiler cannot do this job automatically to our satisfaction. The second difficulty is that our QCD code is not, in general, designed to fit the specific SIMD register layout required by vectorization. As a result, most of our code cannot be adapted to the vectorization scheme. Hence, in practice, the gain of 16 from vectorization is useless to us. In the end, we find that the real performance of Xeon Phi systems is inferior to that of GTX Titan by a factor of about 10. This result is quite discouraging.
Conclusion
GTX Titan provides 1.15 GFLOPS per USD of double precision performance, which is much better than the 0.35 of Tesla K20X and the 0.38 of Xeon Phi 5110P. Besides the theoretical performance, there are many other improvements in Kepler GPUs, such as dynamic parallelism, Hyper-Q and GPU Direct RDMA. By applying GPU Direct v2 to two GPUs on the same PCIe bus, we achieved a gain of about a factor of 3 in data transfer. We also investigated the Xeon Phi system. However, its performance is lower than that of GTX Titan by a factor of about 10, so we do not yet recommend using Xeon Phi systems for lattice QCD simulations.
Acknowledgement
