Abstract-Although General Purpose computation on GPU (GPGPU) is widely used for high-performance computing, standard programming frameworks such as CUDA and OpenCL are still difficult to use. They require low-level specifications and hand-optimization is a large burden. Therefore we are developing an easier framework named MESI-CUDA. Based on a virtual shared memory model, MESI-CUDA hides lowlevel memory management and data transfer from the user. The compiler generates low-level code and also optimizes memory accesses applying conventional hand-optimizing techniques. However, creating GPU threads is same as CUDA; thread mapping, i.e. thread indexing and the size of thread blocks run on each streaming multiprocessors (SM), are specified by the user. The mapping largely affects the execution performance and may obstruct automatic optimization of MESI-CUDA compiler. Therefore, the user must find optimal specification considering physical parameters. In this paper, we propose a new thread mapping scheme. We introduce new thread creation syntax specifying hardware-independent logical mapping, which is converted into optimized physical mapping at compile time. Making static analysis of array index expressions, we obtain groups of threads accessing the same or neighboring array elements. Mapping such threads into the same thread block and assigning consecutive thread indices, the physical mapping is determined to maximize the effect of memory access optimization. As the result of evaluation, our scheme could find optimal mapping strategy for three benchmark programs.
I. INTRODUCTION General Purpose computation on Graphics Processing
Unit (GPGPU) [1] is widely used for high-performance computing. However, current de facto programming frameworks such as CUDA [2] and OpenCL [3] provide APIs for low-level specifications such as memory allocation and data transfer. Although they enable the user to hand-optimize the program, it requires deep knowledge of GPU architecture. Thus, GPGPU programming is complicated and achieving high-performance is very difficult. Moreover, such optimization may not be portable to the different GPU models.
Therefore, we are developing a new programming framework named MESI-CUDA [4] , [5] , [6] . MESI-CUDA is a CUDA variation which aims to make GPGPU programming easier and more portable, but still achieving highperformance. Adopting simpler programming model, MESI-CUDA hides low-level GPU features. The low-level code is automatically generated and optimized by the compiler.
Current MESI-CUDA hides complicated GPU memory architecture under a virtual-shared memory model, eliminating low-level code for memory management/data transfer. The compiler makes memory access optimizations such as using shared memories as explicit caches [6] . On the other hand, creating/mapping GPU threads are same as CUDA; the user defines kernel functions for GPU and invokes them as threads, explicitly mapping to physical resources. We consider explicit multi-threading is necessary for practical highperformance computing, but physical mapping should be hidden because its hand-optimization is very difficult and not portable between GPU models. Furthermore, thread mapping also affects the memory usage thus inappropriate mapping also obstructs our existing memory access optimizations.
In this paper, we propose a new creation scheme of GPU threads. We introduce new syntax of thread creation specifying logical mapping. Such logical mapping is converted to CUDA's physical mapping at compile time. In our scheme, the user can create threads without considering low-level and device-dependent parameters such as the number of GPU cores. Because physical mapping is determined by the compiler, many automatic optimizations are possible. Based on static analysis of array index expressions, the compiler maps threads to improve the usage of physical resources and the effect of memory access optimization. This paper is organized as follows: Section II gives an introduction of GPU, CUDA, MESI-CUDA and optimization techniques. Section III, IV details our logical mapping and how to convert it to optimized physical mapping. Section V shows the evaluation results and in Section VI we discuss the related works. In Section VII, we state the conclusion.
II. BACKGROUND A. GPU Architecture
GPU is a collection of streaming multiprocessors (SM), which have certain number of CUDA cores. Although CUDA cores are simpler than typical CPU cores, a GPU has hundreds or thousands of CUDA cores. Thus the potential performance of a GPU is much higher than a CPU. Fig. 1 shows a typical architecture of a GPU card installed on a PC. Similarly as the CPU cores share the main memory (called host memory in CUDA programming), all CUDA cores share a large off-chip device memory. Furthermore, each SM has a small on-chip memory called shared memory, which is shared by all CUDA cores in the SM.
NVIDIA GPU architecture has been evolved in each generations Tesla, Fermi, Kepler, and the latest Maxwell, introducing new features and changing many hardware specifications such as the numbers of SMs and CUDA cores. Different GPU models often have different specifications even if they belong to the same generation.
B. CUDA
CUDA (Compute Unified Device Architecture) [2] , [7] , [8] is a GPGPU programming framework using extended C/C++ or Fortran. Fig. 2 shows a matrix multiplication program using CUDA. The additional code required for parallel programming in CUDA is shown in bold font.
In CUDA, CPU and GPU are called host and device, respectively. Functions, declared with the __device__ or __global__ qualifier, are called kernel functions and can be executed on the device (Fig. 2 . 5-13 ). The other functions (called host functions in this paper) are executed on the host ( . 14-32). To start computation on the GPU, any host function can invoke a __global__ kernel function specifying the number of threads ( . 26). Then, the created GPU threads execute the kernel function. In this paper, we simply call such GPU threads as threads.
CUDA uses grids and blocks for controlling thread mapping to data and physical resources. A block is a group of threads executed on the same SM, and a grid is a group of blocks of the same size. On the invocation of a kernel function, execution configuration must be specified using an expression of the form <<< , >>>. and are the size of the grid and blocks, respectively. They should be values of integer or a built-in 3D vector type dim3. For example, Fig. 2 program creates a grid of N/BX × N blocks and each block consists of BX threads (Fig. 2 . 18, 26) . BX is used because the block size is limited (1024 at maximum for Kepler), and also may be tuned for the performance.
Built-in variables shown in Table I the index expressions of arrays, each thread performs the same computation on the different array element. In the kernel function transpose() of Fig. 2 program, row and col are computed using block/thread indices so each thread computes different element of the array c ( . 7-12). The host/device memories are only accessible from CPU/GPU cores, respectively. For the data sharing between CPU and GPU, memory allocations on both memories and data transfers are required. In CUDA programming, the user must explicitly describe such low-level behaviors calling API functions: memory allocation/deallocation calling cudaMalloc()/cudaFree() ( . 19-21, 29-31), and data transfer calling cudaMemcpy() ( . 24-25, 27).
C. Hand-Optimization of CUDA program
Various hand-optimizations are possible in CUDA programming, considering architecture-level features of the target GPU model. Because thousands of CUDA cores perform computation using data on a single device memory, hiding the memory access latency is the most important strategy.
1) Memory Access Optimization:
A thread block is partitioned into groups of 32 threads called warps and a SM executes all threads of a warp in SIMD style. When the current warp is stalled on memory accesses, the execution is switched to other active warps. The device memory accesses in a warp are coalesced if the requested data are in the same L2 cache line of 128 bytes. Therefore, the access transactions are largely reduced if the threads in a warp simultaneously access neighboring memory addresses [8] .
Another optimization is to allocate frequently-used data in the shared memories on each SM. Because their access latency is much smaller, they can be used as the explicit caches of the device memory. A local variable of a kernel function can be allocated on the shared memory using __shared__ qualifier. However, explicit copy from/to the device memory is needed in the function. Such variable is shared among all threads in the block, thus explicit synchronizations calling __syncthreads() may be needed. Furthermore, the variables caching on the shared memories must be carefully determined because each available capacity is only 48KB.
2) Thread Mapping Optimization: The grid/block sizes are not limited by the numbers of SMs and CUDA cores; scheduling threads to the physical resources are managed by the hardware. However, physical mapping is not completely hidden. The execution configuration and the usage of buildin thread/block indices largely affect the performance.
The block size, determined by the execution configuration, should be an integral multiple of the warp size 32. It is better to have many warps in a block to hide the memory access latency. The number of blocks also should be large enough to keep all SMs busy. If multiple blocks are assigned to the same SM, some of the blocks may be executed concurrently to increase active warps on the SM. The block size also affects the possibility of explicit caching because the cached data size often depends on the number of threads in a block.
Mapping threads to data is determined by the user's usage of thread/block indices. On the other hand, the threads in the same block are mapped to the same SM and the threads of consecutive threadIdx.x are mapped to the same warp. Thus, the indices usage determines if the coalesced access is possible. It also determines the access locality within a block and may influence explicit caching.
To make the optimization difficult, the user must consider the trade-off between utilizing the shared memories and assigning multiple warps/blocks to a SM. Large block size increases active warps but may obstruct explicit caching because the required data size will also increase. Using shared memories suppresses concurrent execution of blocks because the total resources required by all concurrent blocks on a SM cannot exceed the physical resources. Fig. 3 shows the thread-data mapping of Fig. 2 program. Shaded and thick framed regions are the accessed elements in a block and in a thread of the block, respectively. In each thread, one element of the array c is updated BX times but it can be optimized to use a temporal scalar variable and write back once. For a, all threads in a block access the same element for each loop iteration, and the accessed elements are consecutive in the memory. Therefore, the access will be coalesced and explicit caching is also available. On the other hand, each thread in a block accesses exclusive range of b. Caching is difficult because × elements are needed. Although the accesses in a block will be coalesced, blocks have a thread accessing the same column of b and their accesses cannot be coalesced.
A hand-optimization technique using a square partitioning [7] is shown in Fig. 4 . All threads in the block accesses B rows/columns of a, b, respectively. On the loop iteration , all threads access the element of column/row of a, b, respectively. Thus the code is optimized to repeat caching B×B elements of a, b (dark regions in Fig. 4 ) and executing the iteration B times. The optimized code is shown in Fig. 5 .
D. MESI-CUDA
CUDA programming API directly reflects the complex GPU architecture. Although such low-level API enables hand-tuning considering hardware specifications, it is difficult and may not be efficient on other GPU models. Therefore we are developing an easier GPGPU programming framework MESI-CUDA [4] , [5] , [6] .
MESI-CUDA adopts a virtual shared memory model that all CPU/CUDA cores share a single global memory (Fig. 6) (B, B) ; On the other hand, basic parallelization scheme is same as CUDA: writing host/kernel functions for CPU/CUDA cores and invoking the latter from the former. We do not hide this explicit parallelization because the characteristics of CUDA cores are quite different from the CPU cores. For example, CUDA cores can run fine-grained threads with small overhead, but branch divergent code is inefficient. In high-performance computing using GPU, it would be impractical to write code ignoring such differences. Fig. 2 program can be simplified using MESI-CUDA as shown in Fig. 7 . The additional code required for parallel programming in MESI-CUDA is shown in bold font. The arrays for 2D matrices can be defined as VS variables and can be accessed from both host/kernel functions (Fig. 7 . 3). The user can concentrate on parallel algorithm without low-level API functions. However, MESI-CUDA is uppercompatible to CUDA and hand-optimizing using CUDA API is possible if the compiler's optimization is not sufficient.
The MESI-CUDA compiler is implemented as a translator to CUDA code. The low-level code such as memory management and data transfer is generated by the compiler. To achieve high-performance without user's handoptimization, the compiler performs optimization based on static analysis. For this purpose, we have developed automatic optimization schemes such as overlapping thread execution/data transfer [4] and explicit cache using shared [6] . However, thread mapping cannot be optimized because current MESI-CUDA adopts physical mapping like CUDA. Moreover, user's mapping may obstruct existing optimizations by the compiler.
For the Fig. 7 program, our compiler caches the array a. However, applying the optimization shown in Fig. 5 is impossible because thread-data mapping is not modified.
III. LOGICAL GRID OF THREADS
We propose a new logical thread mapping scheme for MESI-CUDA. It hides low-level GPU features from the user, device-dependent optimization can be left to the compiler, and more flexible automatic optimization is possible.
To keep the lower compatibility with CUDA, We introduce new execution configuration specified using an expression of the form <[< 0 , . . .,
The expression specifies the size of single and arbitrary dimensional thread grid, whose size is 0 × . . . × −1 . In this paper, we call the configuration of new form and the specified grid as logical execution configuration and logical grid, respectively. We also introduce new built-in variable shown in Table II .
The logical grid has no limitation on its dimension and size. They can be determined to naturally map the application's data structure. Thus the code will be independent from the physical resources. For example, the execution 
IV. CONVERSION TO OPTIMIZED PHYSICAL MAPPING A. Basic Strategy
We make static analysis of array index expressions to obtain groups of threads accessing the same or neighboring elements. Here we define the following terminology:
• target variable Suppose that a -dimensional logical grid of size = 0 × . . . × −1 is specified for invoking a kernel function, and an access to a -dimensional array occurs within −nested loops. On analyzing each index expressions 0 , . . . , − 
The normal form must be a polynomial expression of first-degree terms of target variables.
, are the coefficients of each term and must be loop-invariant. Kernel functions must satisfy the following assumptions: 1) All loop iteration numbers are fixed and known at compile time. 2) All array index expressions can be transformed to the normal form. Currently, we cannot handle unfixed loop iterations and non-linear index expressions. However, many kernel functions can be analyzed because irregular access patterns are inefficient and tend to be avoided in GPU code. Omitting the expressions which does not satisfy the assumptions may cause non-optimal mapping, but code generation is possible.
We statically analyze index expressions of each array variable occurrence, and obtain the relationship between threads and array indices. We also make range analysis of the index expressions [6] to obtain the accessed range in a thread for each array. Using the result, we make groups of threads maximizing the chance to apply memory access optimizations. The basic strategies are as follows:
1) Threads , access the same array element if each corresponding index expression 0 , . . . , −1 has the same value. Assigning and in the same warp makes their access for the occurrence coalesced. 2) If the values are the same for 0 , . . . , −2 and close to for −1 , assigning and in the same warp can expect (partially) coalesced access because they access neighboring addresses. Testing the strategies for every thread pair consumes large computation time. Additionally, generating efficient code is difficult if logical/physical threads have nonlinear relationship. Therefore, we partition the -dimensional logical grid to -dimensional tiles and assign each tile to a CUDA thread block. To select preferred dimensions from logical dimensions, dimensions are given partitioning-priority value, which are denoted as 0 , . . . , −1 . We only examine threads which are adjacent on one of the -dimensions and add priority values when they satisfy the strategy conditions.
Estimating advantage/disadvantage of explicit caching is very difficult because it reduces the memory access cost but suppresses concurrent blocks. Therefore, currently we use only coalescing detection for selecting dimensions and explicit caching is considered on determining the block size.
B. Static Analysis 1) Building Normal Forms of Index Expressions:
For each index expression in the kernel function, we transform it to the normal form. For non-target variables in the expression, we find the assignment whose value is valid at the occurrence of index expression. Then we replace the variable with the rhs expression of the assignment. For example, the normal forms of index expressions for the occurrence of a in Fig. 8 . 10 are lThreadIdx.y and k.
2) Detecting Access to Same Elements: To detect multiple accesses to the same element, index expressions are examined for each array variable occurrence. If all 0 , . . ., −1 do not have the term of lThread.Idx[ ], all thread with the same logical indices except dimension access the same element and can be coalesced. For each such occurrence, we add × 32 to , where is the product of loop iteration numbers 0 , . . . , −1 .
For example, all threads with the same lThreadIdx.y value access the same element on the occurrence of a in Fig. 8 . 10 because lThreadIdx.x does not appear in the index expressions. Thus N×32 is added to 0 . Similarly, the same value is added to 1 examining the b occurrence.
3) Detecting Access to Neighboring Elements: Even if all index terms exist, some priority value should be given if the threads adjacent on logical grid accesses the neighboring element on the memory. For this purpose, the rightmost index expression −1 is examined. if only one index term lThread.Idx[ ] appears and the coefficient is a constant value, × ⌈128/{ × sizeof(type of element)}⌉ is added to . This expression evaluates the efficiency of coalescing, i.e. how many cache lines are needed for the accesses in a warp.
C. Thread Mapping
The execution configuration is specified for each thread invocation and the possibility of coalescing/explicit caching is not influenced by the configuration of other invocation. Thus each thread mapping can be determined independently. However, using the shared memories for explicit caching may reduce the concurrent blocks of other invocation. . The threads which have the same logical indices except lThread.Idx[ 0 ] have the highest chance to apply coalescing. Therefore, the threads aligned along the 0 -direction should be grouped into a CUDA block. We call such grouping 0 -fusion.
2) Determining Size of Thread Blocks:
We select the block size from 64, 128, 256, 512 and 1024. Larger size can provide more active warps. However, explicit caching may require larger size, which can exceed the size of shared memories or suppress concurrent execution of blocks. Because the candidate dimensions to assign within CUDA blocks are determined, possible explicit caching is also identified. So we compute the required shared memory size for each block size and select the maximum size for which satisfies ≤ 16KB.
3) Mapping logical dimensions to CUDA dimensions: Using the determined block size , we fix the innerblock mapping. If 0 is greater than , the threads aligned along the 0 -direction must be partitioned to multiple blocks. CUDA grid/block sizes are configured as 0 / × / 0 and × 1 × 1, respectively. The logical index lThread.Idx[ 0 ] is converted to the expression blockDim.x * blockIdx.x + threadIdx.x. Other dimensions are all flattened to a single-dimension and the logical indices are converted to the block indices using blockIdx.y. We call this mapping 0 -fusion mapping.
If 0 , 0 × 1 , or so on is smaller than , the dimensions are mapped to threadIdx.x, threadIdx.y, and threadIdx.z. The grid/block sizes and the remaining logical indices are determined same as above.
To make optimization shown in Fig. 5 , block mapping using 2-dimensional tiles is needed. For this purpose, the 
V. EVALUATION
We evaluated our scheme using three benchmark programs of Fig. 7 matrix multiplication (N=8192), Jacobi kernel (1024 2 matrix size, 10000 iterations), and FDTD kernel (10240 2 matrix size, 1000 iterations). Because all programs use 2D logical grid, possible mappings are x, y, and xy-fusion. For each mapping, we evaluated the execution time with/without explicit caching. Fig. 9 shows the speedup compared to y-fusion without explicit caching.
While our conventional compiler used x-fusion mapping for Fig. 7 code, the mapping optimization detects the chance of coalescing in both x,y-directions in the Fig. 8 code. Thus, xy-fusion mapping is selected. Without explicit caching, xy-fusion mapping achieves 1.2-2 times speedup compared with x-fusion. The xy-fusion mapping enables the optimization shown in Fig. 5 , which achieves 3-4 times speedup compared with x-fusion mapping with explicit caching.
As for the Jacobi kernel, each thread reads adjacent 4 elements to update one element. The FDTD kernel uses multiple arrays but has similar access pattern. For each read in these programs, the chance of coalescing is detected along x-direction thus x-fusion mapping is selected.
In Jacobi/FDTD kernels, x-fusion mapping achieves 8.5-14 times speedup compared with y-fusion. However, applying explicit caching declines the performance for Jacobi. While reading 4 elements without cache is fully coalesced, caching ( + 2) 2 regions of the array causes non-coalesced accesses. Thus our scheme should be improved to consider the cost of memory accesses for caching.
VI. RELATED WORKS CUDA 6 [7] and Kepler GPUs introduced managed memory, which works similar to MESI-CUDA's VS variables. The large difference is that CUDA 6 implements managed memory in hardware/driver-level, while VS variables are implemented in compiler-level as a source code translation. Our advantage is that compile-time optimization is possible using static analysis, such as memory access optimizations and also thread mapping optimization proposed in this paper. The main purpose of managed memory is easier GPGPU programming and using low-level API is encouraged for the high-performance. In contrast, the goal of MESI-CUDA is to hide optimization under the compiler.
OpenACC [9] and OpenMP-to-CUDA translation [10] are another GPGPU approach without low-level specifications. A sequential program with some parallelizing directives is compiled into a parallel program executable on GPU. The compiler automatically generates threaded code and maps threads to data and physical resources. However, current performance of OpenACC is worse compared with handoptimized CUDA code [11] . Although pragmas for mapping control, such as gang and vector, are provided, handoptimizing is difficult and needs low-level and hardwaredependent knowledge like CUDA programming. Our scheme may be applicable to optimize OpenACC thread mapping.
Hoshino, et al. [11] report that the layout of data structure largely affects the performance of GPU computation. Our scheme selects thread mapping to maximize the chance of optimizing memory accesses in the kernel function. However, the performance may not be sufficient if conflicting mappings are optimal for different arrays. Thus introducing data layout optimization may improve our scheme.
VII. CONCLUSION
We are developing a GPGPU programming framework MESI-CUDA for easier high-performance computing than CUDA. Providing virtual shared variables, MESI-CUDA hides low-level memory management and data transfer. However, GPU thread creation and mapping are same as CUDA and device-dependent hand-optimization is needed for the performance. In this paper, we proposed a new thread creation/mapping scheme. We introduced logical mapping specification independent from the low-level architecture. The compiler converts the mapping to CUDA's physical mapping, optimizing memory accesses and controlling the concurrency of blocks and warps.
As the result of evaluation, our scheme selected the optimal mapping strategy for three small benchmarks. However, the current static analysis is still rough estimation. More evaluation and improving the scheme are our future works.
ACKNOWLEDGMENT
This work was supported by JSPS KAKENHI Grant Number 24500060.
