Abstract Recent development in graphic processing units (GPUs) has opened a new challenge in harnessing their computing power as a new general purpose computing paradigm. However, porting applications to CUDA remains a challenge to average programmers, which have to package code in separate functions, explicitly manage data transfers between the host and device memories, and manually optimize GPU memory utilization. In this paper, we propose a restructuring tool (RT-CUDA) that takes a C-like program and some user directives as compiler hints to produce an optimized CUDA code. The tool strategy is based on efficient management of the memory system to minimize data motion by managing the transfer between host and device, maximizing bandwidth for device memory accesses, and enhancing data locality and re-use of cached data using shared-memory and registers. Enhanced resource utilization is implemented by re-writing code as parametric kernels and use of efficient auto-tuning. The tool enables calling numerical libraries (CuBLAS, CuSPARSE, etc.) to help implement applications in science simulation like iterative linear algebra solvers. For the above applications, the tool implement an inter-block global synchronization which allow the execution overall among a few iterations which is helpful to balance load and to avoid polling. Evaluation of RT-CUDA has been performed using a variety of basic linear algebra operators (Madd, MM, MV, VV, etc.) as well as the programming of iterative solvers for systems of linear equations like Jacobi and Conju- gate Gradient algorithms. Significant speedup has been achieved over other compilers like PGI OpenACC and GPGPU compilers for the above applications. Evaluation shows that generated kernels efficiently call math libraries and enable implementing complete iterative solvers. The tool help scientists developing parallel simulators like reservoir simulators, molecular dynamics, etc. without exposing to complexity of GPU and CUDA programming. We have partnership with a group of researchers at the Saudi Aramco, a national company in Saudi Arabia. RT-CUDA is currently explored as a potential development tool for applications involving linear algebra solvers by the above group. In addition, RT-CUDA is being used by Senior and Graduate students at King Fahd University of Petroleum and Minerals in their projects as part of RT-CUDA continuous enhancement.
gate Gradient algorithms. Significant speedup has been achieved over other compilers
Introduction
Recent development in graphic processing units (GPUs) has opened a new challenge in harnessing their computing power as a new general purpose computing paradigm. Strong implications are expected on computational science and engineering, especially in the area of discrete numerical simulation.
Modern GPUs use multiple streaming multiprocessors (SMs) with potentially hundreds of cores, fast context switching, and high memory bandwidth to tolerate ever-increasing latencies to main memory by overlapping long-latency loads in stalled threads with useful computation in other threads. The compute unified device architecture (CUDA) is a simple C-like interface proposed for programming NVIDIA GPUs. However, porting applications to CUDA remains a challenge to average programmers. CUDA places on the programmer the burden of packaging GPU code in separate functions, of explicitly managing data transfer between the host and GPU memories, and of manually optimizing the utilization of the GPU memory.
There is a growing research for reducing the complexity of producing optimized CUDA kernels [4, 26, 34] . Even though the programming model of CUDA offer a more programmer friendly interface, programming GPUs is still considered error-prone and complex, in comparison to programming CPUs with standing parallel programming models, such as OpenMP [7, 33] . Most recently, quite a few directive-based GPU programming models have been proposed from both the research community (hiCUDA [14] , OpenMPC [23] , etc.) and industry (PGI Accelerator [35] , HMPP [38] , R-Stream [25] , OpenACC, OpenMP for Accelerators [1] , etc.). On the surface, these models appear to offer different levels of abstraction, and expected programming effort for code restructuring and optimization.
CUDA-lite [40] proposed a set of annotations to maximize the efficiency of the transformation. It takes as input a naïve CUDA code that treats the memory as a single entity instead of a hierarchical one. CUDA-lite performs the following transformations: inserts shared memory variables, loop tiling, memory coalesced loads and/or stores by replacing global memory access with corresponding shared memory access. hiCUDA [14] is a directive based language for programming NVIDIA GPUs that provides abstractions which are closely matched with the CUDA programming model. hiCUDA's goal is not to automate optimizations rather it makes it easier for the programmer to program CUDA. It still depends on explicit optimizations by the programmer, such as utilizing shared memory or constant memory. OpenMPC [23] proposed a set of directives to be considered along with OpenMP directives [1] . It optimizes data movement between CPU and GPU by applying the inter-procedural dataflow analysis to accomplish the following: (1) overriding transfers to GPU global memory when the global memory already have up to date values of relevant variables, and (2) overriding transfers to CPU from GPU memory that is written back results from GPU global memory to host memory if the relevant value wasn't used in the CPU part before it is written. It also performs parallel loop swap and loop collapsing to enhance the inter-thread locality. Furthermore, it also uses auto-tuning to obtain the final optimized CUDA code. PGI accelerator programming model [35] is a directive based model targeting general hardware accelerators, even though it currently supports only CUDA GPUs based on OpenACC standards. The PGI model allows very high-level abstraction similar to OpenMP. The user needs to insert directives into the host program to guide the compiler for particular set of kernel optimizations and code transformations. So, the user is required to be exposed to machine dependent GPU details such as kernel dimensionality, thread-block organization etc. OpenACC is the first standardization effort toward a directive-based, general accelerator programming model portable across device types and compiler vendors. HMPP [38] is another directive-based GPU programming model targeting both CUDA and OpenCL. It also provides very high-level abstraction on GPU programming similar to PGI accelerator. HMPP model is based on the concept of codelets, functions that can be remotely executed on hardware accelerators like GPUs. Because the codelet is a base unit containing computations to be offloaded to GPUs, porting existing applications using HMPP often requires manual modification of code structures, such as outlining of a code region or re-factoring existing functions. For optimized data management, HMPP uses the concept of a group of codelets. By grouping a set of codelets, data in the GPU memory can be shared among different codelets, without any additional data transfers between CPU and GPU. Combining the codelet grouping with HMPP advanced load/delegated store directives allows the same data-transfer optimizations as the data region in the PGI model. One unique feature in the HMPP model is that the same codelet can be optimized differently depending on its call sites. R-Stream [25] is a high-level, architecture-independent programming model that is based on the polyhedral model [2] . It targets various architectures, such as STI Cell, SMP with OpenMP directives, Tilera, and CUDA GPUs. R-Stream performs affine scheduling to transform the code into both fine-grained and coarse-grained parallelism. It performs following code optimizations: global memory coalescing, loop interchange, strip-mining, loop fusion, shared memory promotion and tiling. CUDACHiLL [22] is an another compiler framework that performs transformations using a transformation recipe that contains a set of commands for each required code transformation. Although it depends on command based transformations, applying these transformations still need some optimization heuristics [5] to be applied manually that constrain the optimization strategies considered and limit the possible values for the optimization parameters. Heuristics implemented in CUDA-CHiLL are dependences and Parallelization, Global Memory Coalescing, Shared Memory and Bank Conflicts, and Maximize Reuse in Registers. It also suggests an auto-tuning framework to choose among resulting codes from various possibilities of parameters.
To understand the level of abstraction that each programming model/compiler provides, Table 1 summarizes the features and optimization approaches acquired in these compilers. Here, explicit means that the feature is enabled by some user provided hints to the compiler, Implicit indicates that the compiler automatically handles the feature without user intervention, Indirect shows that the users can manually control the compiler to use the feature, and Imp-dep means that the feature is implementation dependent. The table shows that R-Stream provides the highest level of abstraction in comparison to other models/compilers found in literature as most of the features are handled implicitly in R-Stream. It also shows that hiCUDA and CUDA-CHiLL provide the lowest level of abstraction among other tools as the programmer has to control most of the features explicitly. However, lower level of abstraction may be beneficial in some cases that allow enough control over various optimizations and features specific to the underlying GPU architecture to achieve optimal performance. On the other hand, high level of abstraction sometimes limits the application coverage of the tool. RT-CUDA provides the same level of abstraction as R-Stream but also provides user-defined configurations to control various optimizations and features of the underlying GPU architecture to explore the effects of different kernel optimizations.
Background

GPU Architecture
Graphics processing units (GPUs) are organized into an array of highly threaded streaming multiprocessors (SMs). Each SM has a number of streaming processors (SPs) that share control logic and instruction cache. GPUs have their own memory hierarchy with unified memory request paths for loads and stores. This memory hierarchy has been developed from Level-0 (no hierarchy) to Level-3 (L1, L2, Global) with the advancement in the nVidia GPU generations (Tesla, Fermi, Kepler). Unlike CPU cache memory levels, GPU cache memories are read-only memories and doesn't have cache coherency protocols implemented. Table 2 shows the features of the GPU device (K20c based on nVidia Kepler Architecture) that we used in our experiments.
Like nVidia Fermi architecture, each SM in nVidia Kepler has 64 KB of on-chip memory that can be configured to be used as shared memory and L1 cache. Unlike nVidia Fermi, this memory can be configured as 32/32 KB partitions for shared memory and L1 cache respectively in addition to 16/48 and 48/16 KB configurations [30] . L1 cache in nVidia Kepler can store only local memory accesses such as register spills and stack data, global memory loads are cached in L2 [39] . In addition to L2 cache, global loads can also be cached in compiler directed 48 KB read-only data cache. The read-only path can be managed automatically by the compiler or explicitly by the programmer. The threads are grouped in blocks which are then scheduled to SMs dynamically on the availability of each SM. These threads follow the single-program multiple-data (SPMD) program execution model. Within a block, threads are grouped in 32-threads instruction called warps, where each warp is being executed in the single-instruction multiple-data (SIMD) manner. Each SP is responsible to execute one thread at a time. In nVidia Kepler, each SM contains four warp schedulers each with dual instruction dispatch units. So, each SM can select four warps and two independent instructions per warp to be dispatched in each cycle.
CUDA Kernel Optimizations
CUDA kernel can be optimized in many possible ways with good understanding of the needs of the application. NVIDIA suggests a cyclic process namely APOD (Asses, Parallelize, Optimize, Deploy) [31] to help application developers to rapidly identify the portions of their code that would most readily benefit from GPU acceleration, rapidly realize that benefit, and begin leveraging the resulting speedups in production as early as possible. Using APOD, a programmer can apply and test the optimization strategies incrementally as they are learned. Optimizations can be applied at various levels, from overlapping data transfers with computation all the way down to finetuning floating-point operation sequences.
We have explored several optimizations that can be categorized into three classes based on their application:
1. Manual This set of optimizations can only be applied by the programmer himself through manual code analysis and modifications. For example: -Vectorization can be applied by using vector data types provided in CUDA as structs and modify the code logic to use the correct element of the vector in other expressions.
-Texture memory can be used in kernels by invoking texture intrinsic provided in CUDA such as tex1D(), tex2D, and tex3D() for 1D, 2D, and 3D arrays respectively. Also, it needs to be bound explicitly to CUDA array allocated in device memory. -To perform coalesced global memory, the programmer has to modify the algorithmic logic to change the access patterns of arrays. Although, it can also be achieved as a result of some other optimizations such as prefetching using shared memory which can be performed automatically by some tool provided hints from the user regarding shared memory size and array to be loaded. -To perform faster data exchange among threads within a warp, the programmer has to use nVidia Kepler's Shuffle Instructions explicitly in the code and change the code logic based on the data exchange requirements. 
RT-CUDA
In this section, we are presenting RT-CUDA in detail including its design specifications, restructuring transformations, and infrastructure.
Design Specifications
To target the complex nature of the GPU architecture, programs often have to go through profound transformations. GPUs featured a massive number of active threads -Data motion between the GPU accelerator and the host CPU (I/O).
-The thread organization as the block-level and thread-level parallelism and the mapping of threads to results (CP). -A deep memory hierarchy ranging from a large register file, small shared memory, read-only memories, and a slow global memory (LO). -A zero-overhead thread switching augmented with parallel memories and wide buses (PMB). -User activation of some optimizations (CO/G): optimizations which are implemented using annotation-based language extensions, which enables the use of high-level languages while hiding to some extent the GPU details.
These optimizations are shown in top 5 rows of the Table 4 , which shows the estimated optimization level (low, medium, and high) and percentage addressing each optimization (24/24 = 100 %, 18/24 = 75 %, etc.). The overall usage rate of the optimization specifications is 47.4 % (91/192). In the following, we summarize the BA optimizations that were implemented in most of the existing compilers:
1. The Parallel Memory Bandwidth (PMB) aims at mapping threads within a warp to access data in distinct storages in the device memory to enable coalesced global memory access. For shared memory accesses, map threads within a warp to access data in distinct memory banks to avoid serialization. Although the BA optimizations are essential they are far from being sufficient to optimize simple domain specific applications. The target applications of our proposed code restructuring is the area of Iterative Linear Algebra Solvers (ILAS) for which the algorithms have the following features:
1. A spectrum of parallelism ranging from the abundance of data parallelism at the operator level (matrix multiplication (MM), Matrix vector (MV), inner product, vector scaling, etc.), tree like operations like sum of product, down to scalar execution. 2. A producer-consumer data layout in which data computed on in one iteration is mostly consumed in the same iteration with the possibility of forwarding to the next iteration for some other data. 3. Exact algorithm behavior requires enforcing some synchronization across all the thread blocks. However, more efficient execution can be made if iterations are allowed to overlap. This is especially useful when there is potential of load unbalancing, such as the case of sparse matrix vector operations. 4. The existence of highly optimized math libraries for basic dense and sparse linear algebra operations, which were developed by researchers and industry. Research based libraries are constantly enhanced to: cope with newer hardware features, improve data locality using operator code merging, and use low level programming in some critical code fragments.
To efficiently compile ILAS algorithms, we aims at a restructuring tool that embodies the BA optimizations and being able to address all the above domain specific (DS) optimizations. For this, we target integrating the above BA optimizations with the following additional compiler techniques: and machine occupancy which is subject to a complex set of constraints, (2) determine the optimal values of kernel parameters using constraints analysis and/or auto-tuning.
Inter-Block Synchronization (SYC) since GPUs have no global synchronization,
(1) use of customized inter-block synchronization when absolutely required for program correctness, (2) use of kernel entry/exit, lock-based, and lock free, and relaxed synchronization, (3) adapt synchronization to algorithm depending on expected degree of thread load unbalancing.
Invocation of Optimized external Libraries (ELB) some external libraries have been
optimized at lower level programming and may deliver substantial performance advantages over manually optimized regular code. Library details (parameters and error handing) are hidden or abstracted to the user. Efficient invocation of external libraries require full understanding of its parameters and related implementation logic to select proper parameter values.
With these enhancements in the compiler design, RT-CUDA improves the overall usage rating of the optimization specifications from 47.4 to 52.32 % (113/216) as shown in Table 4 . Figure 1 shows how all the above BA and DS optimizations can be integrated in the restructuring tool, see Sects. 3.2 and 3.3 for code transformation details.
Massively parallel GPUs have thousands of arithmetic units, capability to manage tens of thousands of hardware threads, and can achieve peak performance of several TFLOPS. Many highly computational workload having abundant data parallelism can be compiled into a large number of parallel threads to achieve high performance in terms of throughput on GPUs. Energy efficiency is becoming one key parameter to computing performance due to power scarcity in existing and future massively parallel systems [29, 46] .
Reading or writing a value from GM require a factor of a hundredth fold more power than a floating point operation [11, 12, 19] . Moving a FP value across a chip costs a factor of five to ten than computing with it. As a result, keeping power consumption low will require minimizing data movement and enhancing temporal and spatial localities. One objective might be to minimize power consumption, which would likely require using the most appropriate type of core for the workload at hand. Another might be to minimize execution time. A third might be to balance both power consumption and execution time.
Our approach is based on enhancing program data manipulation by the compiler to improve spatial and temporal locality of data accesses in code regions to improve power efficiency at runtime. Proposed restructuring technique aims at minimizing kernel execution time combined with the use of minimal data movement and enhanced temporal and spatial localities to improve power efficiency. RT-CUDA uses a set of six energy-aware rules (EORs), which are listed below together with their justifications:
1. Minimizing data transfer overhead between CPU and GPU The CPU (host) and GPU (device) have separate memories. All data read/written on the device must be copied to/from the device (over the PCIe bus). This is very expensive, so it is important to try to minimize copies wherever possible, and keep data resident on device. This may involve porting more routines to device, even if they are not computationally expensive.
Use of re-ordered synchronization to minimize the number of kernel invocations
Kernel exit-re-entry (KEE) transfers control from the device back to CPU by interacting through the PCIe. KEE is sometimes a simple solution for inter-block synchronization, which is needed for a reduction or a collective write (CWs) to GM. In GPUs, data cached by a kernel cannot be used in a subsequent kernel launch because the TBs are dynamically assigned to SMs and data locality is lost at kernel exit. This leads to additional overhead for saving the data back into GM and reloading it again after kernel re-entry. A thread reference to some data requires a load instruction that reads GM and store into ShM, loads data into a register prior to use, and a store instruction to write back the register data into ShM. L1 cache is accessed while executing the load instruction. However, once a data is stored into ShM, L1 cache is no more accessed for the data. Instead of the KEE, we use an in-kernel approach using a custom inter-block synchronization, called re-ordered synchronization (ROS), to avoid the overhead associated with multiple KEEs and the loss of data locality. ROS enables staying in kernel without increasing the synchronization overheads as compared to KEE. Also, we may combine reductions and/or CWs to further reduce the number of needed ROS, if the algorithm allow.
Load data from GM into ShM if there is sufficient data reuse, otherwise, data is loaded directly into Register File (RF)
Data load or store on GM or texture memory causes a 400-cycle long latency (LS) operation, while ALU operations, branch resolution, or accesses to ShM are short-latency (SL) events taking from 8 to 20 cycles. GPU put the burden on application programmers to explicitly manage the usage of shared and global memory. To use ShM we first must load data from GM to ShM, which is an LS. Next, every data invocation requires moving the data from ShM to the specific SP lane register, i.e. close to the execution unit. If there is no enough reuse of data that consumes more time and energy than simply loading data directly from GM into RF. Here, compiler stores short-lived working data in low energy structures such as RF, which is close to the execution units and recycle registers if data is not referenced any more. Data must be stored in RF just before its reference to avoid power consumption due to long latency register context switching. Effectively leveraging ShM requires increasing kernel and thread grain size by combining operations, whenever possible. Short kernels and threads must have realistic memory usage using direct data load from GM into RF. 4. Maximizing memory bandwidth using coalescing The memory bandwidth for the graphics memory on the GPU is high compared to the CPU, but there are many data-hungry cores so memory bandwidth is still a performance bottleneck. The maximum bandwidth is achieved when data is loaded for multiple threads in a single transaction, e.g. memory coalescing. This will happen when data access patterns meet certain conditions: 16 consecutive threads (i.e. a half-warp) must access data from within the same memory segment. This condition is met when consecutive threads read consecutive memory addresses within a warp. If the condition is not met, memory accesses are serialized, significantly degrading performance.
Most frequently read-only data is kept in CM Constant Memory (CM) is a
small read-only cache that caches a portion of GM. Maintaining some repeatedly accessed read-only (RO) data in CM helps to conserve memory bandwidth due to SRAM nature and its 20 cycles typical access time. Due to the small size of CM only RO data that has the highest read frequency should be saved in CM (and the rest in GM) before kernel entry. 6. Auto-tuning to maximize occupancy and improve latency hiding use kernels that consists of arrays of thread blocks so to keep all cores in all SM busy. When programming for the GPU architecture, the programmer maps each loop iteration to a thread. Obviously, there must be at least as many total threads as cores, otherwise cores will be left idle. For best performance, the number of threads must be much greater than the number of cores. Accesses to global memory have several hundred cycles of latency when a thread stalls waiting for data, another thread can switch in this time to hide latency. NVIDIA GPUs feature very fast thread switching, and support many concurrent threads. The nVidia Fermi architecture supports up to 1536 concurrent threads on an SM, e.g. if there are 256 threads per block, it can run up to 6 blocks concurrently per SM (and the remaining blocks will queue). The resources must be shared between threads because high use of on-chip memory and registers will limit the number of concurrent threads. Typically, it is best to use many thousands of threads in total. The optimal number of threads per block should be found using auto-tuning which consists of running a parametric kernel for a few combination of architectural parameters (grid size, block size, thread grain size, and some compiler flags) and select the one that sustain the best performance.
The application of the above EORs in optimizing a few numerical algorithms like the SpMV, BiCG-Stab, and QMR is presented in the evaluation Sect. 4.8. The energy consumption is compared to standard algorithm implementation using library calls with multiple-kernel invocations.
RTA-CUDA Description
RTA-CUDA (Restructuring Tool Algorithm for CUDA) has been developed to be implemented in a source-to-source transformation tool to convert a standard C function (input) containing computational loops into an optimized CUDA kernel (output) based on some user-defined variables. The algorithm consists of the following steps as shown in Fig. 2: 
C-Loop Optimizations (Loop Collapsing)
Merge the nested loops if they are independent and calculate array indices based on the new loop variable. For example, if i and j are two independent loops such that i = [0 to N] and j = [0 to N]. The new loop index (idx) will be equal to [0 to N * N] and i, j will be the quotient and remainder of the division of idx with N respectively such that 'i' represents row of the matrix and 'j' represents column of the matrix. So, instead of two nested loops, we will have now one main loop. 
Array Transformation (nD → 1D)
In GPUs, device memory can be allocated only as linear memory so CUDA arrays are restricted to be allocated as 1D arrays while standard C language supports multidimensional arrays. In this step, all the multi-dimensional array accesses in the expressions are converted to linear array representations. For example, C[i][j] will be represented as C[i * N + j] where N is the width of the array.
Loop Partitioning
Partition the main loop among all cuda threads. This obtains task distribution among all cuda threads based on its block id and thread id. Figure 3 shows the task distribution among 4 cuda blocks with 4 threads per block. Each cuda thread identifies its working element of the resultant array using the block id and thread id within the cuda thread block. At this stage, each thread is mapped to one element of the resultant array. So, the loop is replaced by the statements to calculate the loop index to generate a Naïve CUDA Kernel.
CUDA Kernel Optimizations
In this step, each of the generated Naïve CUDA Kernel is transformed into a Parameterized CUDA Kernel after applying a set of optimizations as shown in Fig. 4. 
Block Merging
At this stage, each thread block is mapped to one block of resultant matrix/vector. Each thread within the block calculates one element of the resultant. To increase the thread granularity, each thread block can be mapped to multiple resultant blocks vertically. The number of blocks to be merged is defined as an input parameter and the optimal value of block merging can be obtained after running the parameter tuning algorithm as explained in Sect. 3.2.5. Figure 5 shows the task distribution among 2 cuda blocks with 4 threads per block after merging 2 resultant blocks that is each thread calculates two resultant elements at the same offset in consecutive resultant blocks. This is done by the following steps:
1. Convert the resultant variable into an array stored in local memory to compute the multiple elements simultaneously in a pipelined fashion. 2. Replace the first index of resultant matrix with the increment of loop index m where m defines the number of elements to be calculated by each thread. 3. Replace the resultant variable into array with loop index m. 4. Update row index calculations with multiple of number of blocks to be merged.
Prefetching Using Shared Memory
To effectively use the shared memory and do coalesced access in global memory. The matrices in the loop that are going to be access by each thread in row-major can be tiled and loaded into shared memory with coalesced access. This is done by the following steps:
1. Declare shared variable for the tile of the given matrix to load the tiled rows of the number of blocks merged in the previous step. 2. Load the tile of the given matrix into shared memory by accessing the tiled rows in a coalesced manner such that each thread of the block access the consecutive elements in the same row. 3. Add barrier to synchronize all threads (__syncthreads()) after loading the tile. 4. Replace array access in the loop with the shared tile. 5. Add barrier to synchronize all threads (__syncthreads()) after calculating the tile. 6. Load the tile next consecutive tile of the given matrix into shared memory to be used in the computation of next iteration. 7. Add barrier to synchronize all threads (__syncthreads()) after calculating the tile. 8. Modify the main loop to traverse all tiles of the given matrix. 9. Calculate the remaining tile loaded in the last iteration. 10. unroll the loops to load and calculate the tile.
Block Skewing
To increase the thread access locality, each thread block can be mapped to multiple resultant blocks horizontally. The number of blocks to be skewed is defined as an input parameter and the optimal value of block skewing can be obtained after running the parameter tuning algorithm as explained in Sect. 3.2.5. Figure 6 shows the task distribution among 2 cuda blocks with 4 threads per block 6 Task distribution among all threads with block skewing applied after skewing 2 resultant blocks that is each thread calculates two resultant elements at the consecutive offset in the resultant blocks. This is done by the following steps:
1. Convert the resultant variable into a matrix stored in local memory to hold the results of merged and skewed elements. 2. Replace the second index of resultant matrix with the increment of loop variable n where n defines the number of elements skewed in a resultant block. 3. Add second dimension of the resultant variable with loop index n. 4. Update column index calculations with multiple of number of blocks to be skewed. These steps generate a parameterized CUDA kernel with three parameters that are BLOCKSIZE (number of threads per block), MERGE_LEVEL (number of blocks to be merged), and SKEW_LEVEL (number of blocks to be skewed). The optimal values of these parameters can be obtained using the parameter tuning algorithm as explained in the following Sect. 3.2.5.
Remove Redundant Array Access in Loop Body
Parameters Tuning Algorithm
Algorithm 1 determines the optimal parameters (BLOCKSIZE, MERGE_LEVEL, and SKEW_LEVEL) for the generated parametric CUDA kernel. The size of cuda grid will be determined by dividing the total number of elements in the resultant array with the product of all three parameters.
The algorithm evaluates the generated parametric kernel with various possible combinations of BLOCKSIZE, MERGE_LEVEL and SKEW_LEVEL. The pruning of the list of possible parameters is used at three levels to reduce the repeated compilation and execution of the kernel. The three levels of pruning are as follows: for ml=minML to maxML Step *2 do 7:
for sl=minSL to maxSL Step *2 do 8: KB = INT(N/bs/ml/sl) 9:
if KB = 0 then 10: 
RT-CUDA Design
RT-CUDA is a source-to-source transformation tool that is capable to convert a standard C-Program (input) into an Optimized CUDA Program (output). The overall transformation is driven by some user-defined directives and API function calls provided in the tool with automatic kernel optimizations. The user is also able to include/exclude any of the optimizations available in the transformations. The tool follows the steps as shown in Fig. 7 to generate an Optimized CUDA Program.
1. Configuration File A programmer defined text file containing configuration parameters to RT-CUDA for code transformations. This includes basic structure of the target kernel such as kernel name, semantics, used array dimensions and datatypes, selected optimizations, range of kernel parameters for auto-tuning, and target nVidia GPU architecture such as Fermi, Kepler, Maxwell, Tegra. 2. Pre-Processing In this step, the source program is partitioned into a DAG (Dynamic Acyclic Graph) of loops identified as candidate CUDA kernels to be executed on GPU while separating the set of scalar segments that will be executed on host. Data dependence among the loops is enforced in the generated code. By examining the DAG, loops that are data independent can be merged together to reduce the exit/entry of kernels, copying data between Global Memory (GM) and Shared Memory (ShM), and to reduce loop overhead. 3. RTA-CUDA This step will take each of the functions generated in the preprocessing step based on the loops identified as candidate CUDA kernels and apply RTA-CUDA algorithm as explained in Sect. 3.2 to generate the optimized CUDA kernels. invocation of the most optimized external math library functions such CUBLAS and cuSparse for some of the dense and sparse matrix operations respectively. It provides interfacing APIs, error handling interpretation, and user transparent programming. These can be included by calling a related API function defined in the tool. Table 5 shows the list of available functions. It reduces the programming effort in terms of lines of code, error handling, and hides complex parameters selection by the programmers while giving similar performance in terms of execution time, see Sect. 4.3 for code comparison and performance evaluations.
RT-CUDA also supports inter-block synchronization in three ways:
1. CPU Synchronization This is the simplest approach recommended by Nvidia [15] for inter-block synchronization by exiting and re-entering the kernel that is considered as an implicit synchronization. This is done by defining separate CUDA kernels for each of the dependent loops and calling them in sequence from host.
Lock-Free Synchronization This synchronization primitive is based on the work
in [45] . It uses two arrays, named Ain and Aout, of length N for synchronization. When all threads of a block finish their work, the first thread of each block increments its location in the Ain array. Then, the first N threads of the first block in parallel check whether all blocks have written to their corresponding location in the Ain array. If so, these N threads write in parallel to the Aout array to inform other threads that the threads of this block have reached the synchronization point. Meanwhile, the first thread of each block continuously checks its location in the Aout array until the value is set to k where k is the iteration number. 3. Relaxed Synchronization We have developed a new synchronization primitive that can be useful in implementing iterative solvers with block dependencies among Figure 8 shows the flowchart of the relaxed synchronization. This approach overlaps the computation of two consecutive iterations. After completion of iteration 'k', each block start the computation of the iteration 'k + 1' using the completed blocks of dependent array by the previous iteration. Each block updates its designated element in presence vector 'P' in global memory with the iteration number at the end of each iteration. So for the next iteration, it will call the relaxed synchronization primitive that will check the presence vector for the completed blocks of the previous iteration and return a work vector 'W' and the number of completed blocks 'bnum' that can be used to start the computation of the next iteration using the completed blocks of the previous iteration. The presence vector will be first loaded into shared memory from global memory with coalesced access to reduce global memory loads.
Performance Evaluation
We have run our experiments on Tesla K20c GPU (see Table 2 for specifications) with various applications including Demosaic, Histogram, Matrix Addition (Madd), Matrix Multiplication (MM), Matrix Vector Multiplication (MV), and Vector Vector Multiplication (VV). We have compared the implementations using RTA-CUDA with CUBLAS, GPGPU compiler, and OpenACC (PGI compiler) implementations. We have also evaluated the different inter-block synchronization primitives provided in the tool to be used in Jacobi Iterative Solver. Furthermore, the effects of calling external library functions for basic linear algebra operations and sparse matrices have also been evaluated. The correctness of the results of each converted application using RT-CUDA are guaranteed by comparing with the results of serial version of each application on CPU using the subset of the problem sizes. The result showed that our conversions produce correct resultant values. We have also trace the resultant matrix indices with the mapping of block and thread ids which are also found to be corrected.
Evaluation of Basic Linear Algebra Operations
This section shows the evaluation of the tool using a set of operators in LAPACK benchmark suite for basic linear algebra operations including Madd, MM, MV, and VV applications to compare with CUBLAS, GPGPU compiler, and OpenACC (PGI compiler). All four kernels are generated by the tool using RTA-CUDA algorithm as explained in Sect. 3.2 and applied different set of transformations/optimizations according to the specific code structure of each application. Table 6 shows the applied transformations/optimizations for each application, the table also shows the optimal parameters (<BLOCKSIZE,MERGE_LEVEL,SKEW_LEVEL>) for each application obtained through parameter tuning algorithm (Algorithm 1). The comparisons for different applications and tools have been shown with appropriate space size (N) for each application to show the execution times in a particular range. Figure 9 shows the execution time in seconds of different applications using CUBLAS and RTA-CUDA. The comparisons have been performed using following CUBLAS functions:
-cublasSgeam for Madd -cublasSgemm for MM and VV -cublasSgemv for MV The results show that RTA-CUDA obtained better performance for Madd and VV. But, for complex applications such as MM and MV, CUBLAS still has significant performance advantage over RTA-CUDA. This is because cublasSgemm and cublasSgemv functions have been developed with complex kernel optimizations at very low level of coding by hand while at this stage, we are focusing on high level CUDA kernel optimizations. However, with the proposed high level kernel optimizations, RTA-CUDA outperforms CUBLAS with 45 % improvement in case of Madd and 2 % improvement in case of VV. Figure 10 shows the execution time in seconds of different applications using GPGPU compiler and RTA-CUDA. The results show that RTA-CUDA outperforms GPGPU compiler with 17 % improvement in case of Demosaic, 30 % improvement in case of MM, 99 % improvement in case of MV and 50 % improvement in case of VV. Also, MV implementation in GPGPU compiler gives value errors in case of large space size while RTA-CUDA generates correct values with any space size. So, the 
Evaluation of Inter-Block Synchronization Primitives
This section shows the evaluation of inter-block synchronization primitives provided in the tool. The execution time of three variants of block Jacobi iterative solver have Figure 12 shows the execution time in seconds of SJ, AJ, and RJ implementations with different array dimensions using 128 (maximum concurrent threads blocks possible for these implementations) number of blocks and single-precision floating point operations. Here, SJ implementation shows a little overhead of synchronization among each iteration and reduces with the increase in the array dimension that for N = 16,384 with 128 thread blocks the synchronization overhead is just about 1.5 % over AJ implementation. RJ implementation further obtained little improvement over SJ implementation that is about 1 % reduction in execution time than SJ implementation except the case of N = 4096 where RJ improvement is about 6 %. This shows that all thread blocks complete its execution with little difference in time as the tasks among each thread block is distributed equally. Relaxed synchronization approach is expected to give more performance improvement if the tasks are not evenly distributed among thread blocks on GPU architectures. To analyze the behaviour of RJ in case of unbalanced thread blocks execution, we have performed an experiment of a naïve block-row partitioning in sparse matrix-vector operation used in an iterative solver (such as Sparse Jacobi with MV), which causes some load unbalancing over the iteration space. Figure 13 shows the execution time in seconds of SJ and RJ with unbalanced thread blocks. Here, RJ obtained about 8 % performance improvement in terms of execution time over SJ implementation. Figure 14 shows the execution time in seconds of SJ, AJ, and RJ implementations with different array dimensions using 64 number of blocks (optimal number of thread blocks for these implementations) and double-precision floating point operations. Here, SJ implementation shows a high overhead of synchronization among each iteration that is for N = 16,384 the synchronization overhead is about 12 % over AJ implementation. RJ implementation is shown performance improvement of about 4 % over SJ implementation. RJ also shows increasing trend in terms of performance improvement over SJ with the increase in the array dimension. 
Effects of Calling External CUBLAS Functions
As shown in Sect. 4.1, in most of the cases CUBLAS library functions obtained the highest performance in comparison to RT-CUDA and other tools. This is because CUBLAS functions have been developed with complex kernel optimizations at very low level of coding by hand. To get the benefit of the existing optimized CUDA libraries, we have provided a feature of calling external library functions as an optimization within the tool.
This section shows the performance evaluation of RT-CUDA using CUBLAS functions as an optimization instead of using RTA-CUDA algorithm. We applied the tool on Matrix Multiplication (MM), Matrix Vector Multiplication (MV), Matrix Transpose (MT), and Vector Vector Multiplication (VV). The execution time of each application has been compared with GPGPU compiler and PGI OpenACC compiler implementations.
RT-CUDA enables transparent invocation of the most optimized external math libraries like cuBLAS and cuSparse libraries. It provides interfacing APIs, error handling interpretation, and user transparent programming. An example of MV implementation in RT-CUDA (code available at RT-CUDA MV Example 1 ) reduces the programming efforts of about 93 % in terms of lines of code and hides complex parameters selection by the programmers while giving similar performance in terms of execution time. Figure 15 shows normalized execution time of MM and MV operations using CUBLAS library and RT-CUDA API. Table 7 shows the execution time in milliseconds of RT-CUDA, GPGPU compiler and PGI OpenACC implementations of different applications with different array dimensions. The results show that RT-CUDA significantly outperforms both GPGPU and PGI OpenACC implementations. It obtained performance improvements in terms of execution time of up to 80, 100, 4, and 56 % for MM, MV, MT, and VV respectively over GPGPU compiler with N =4096. It obtained performance improvements of up to 100, 33, 78, and 70 % for MM, MV, MT, and VV respectively over PGI OpenACC implementations with N = 4096.
Table 7
Comparing RT-CUDA with GPGPU compiler and PGI OpenACC compiler using different applications 
Effects of Sparse Matrix Operations Using CUDA Sparse Library Routines
We have evaluated various sparse matrix formats available in cuSparse library for Sparse-Sparse Matrix Multiplication (spMM), Sparse-Dense Matrix Multiplication (spdMM), and Sparse-Dense Matrix Vector Multiplication (spdMV) (provided in RT-CUDA API). The objective of these evaluations is to access the performance of various matrix operators and storage schemes available in cuSparse library to select the best storages as standard in RT-CUDA. The evaluation results show that the sparse matrix multiplication (both spMM and spdMM) is only profitable in terms of memory allocations but not for execution time for computations as the dense matrix multiplication in CUBLAS is highly optimized and provide the best performance independent on the sparsity (percentage of number of zeros in the matrix) of the matrix. Whereas, in case of matrix-vector multiplication (spdMV), the sparse implementations obtain better performance both in terms of memory allocations and execution time in comparison to dense implementation. Table 8 shows the memory allocations in MB for the sparse matrix in Dense, CSR, BSR (with 256 × 256 block dimensions), and HYB formats. Here, CSR and HYB formats show minimum memory requirements for matrix storage that is about 50-80 % less than the dense and BSR formats. Table 9 shows the execution time in seconds for the computations of spMM, spdMM, and spdMV in Dense, CSR, BSR (with 256 × 256 block dimensions), and HYB formats. For matrix multiplication, sparse operations show significant overhead of computations due to irregular memory access patterns of the randomly initialized sparse matrices. For matrix-vector multiplication, sparse operations obtained the speedup of about 4 and 2.5 over dense operations for N = 8192 in CSR and HYB formats respectively. Furthermore, spdMV in CSR format is more efficient in terms of performance for N ≤13,312 and spdMV in HYB format obtained more speedup for N ≥ 14,336 as shown in Fig. 16 .
Generation of API Functions for Efficient Calling of cuSparse Library Routines
Since CSR and HYB sparse matrix formats have shown optimized storages (Sect. 4.4), we decided to use them as implicit storage schemes in RT-CUDA. We have implemented API functions to call cuSparse library routines with CSR matrix format for spMM and spdMM operations, and with HYB matrix format for spdMV operation. Table 10 show the execution time in seconds for all three operations spMM, spdMM, and spdMV implemented in RT-CUDA with different matrix sparsity for N = 10,240. Furthermore, RT-CUDA provides ability to load standard sparse matrices available in a matrix market file format (an ASCII-based file format designed to facilitate the exchange of matrix data) [37] into a sparse matrix structure to be used in RT-CUDA API functions for sparse matrix operations. We have evaluated the sparse operations of RT-CUDA using various standard sparse matrices in the domain of computational fluid dynamics available in the repository of University of Florida [8] and extracted from the real applications. Table 11 shows the properties of these matrices. All of the selected matrices has about 99 % sparsity and are bend diagonal in nature. Figure 17 shows the obtained speedup of RTspDMM, RTspdDMM, and RTspdDMV API functions of RT-CUDA (see Table 5 for details) over Dense equivalent operation. For RTspDMM, the sparse operation obtained more speedup if the non-zero elements are closed to diagonal as in the case of matrices cavity17 and cdde1 while in the case of bcsstm13, cavity10, and coater1 the speedup is relatively less due to scattered non-zero elements. For RTspdMM, the obtained speedup is seem to be dependent on the sparsity of the matrix. The matrices having more sparsity show more speedup than the matrices having less sparsity. For RTspdMV, the sparse operation obtained more speedup if the non-zero elements are closed to diagonal but with large dimension as in the case of cavity17 but the speedup drops significantly for small dimension as in the case of cdde1.
We have also compared the sparse matrix multiplication implementation in RT-CUDA (RTspDMM) with an efficient sparse matrix multiplication recently proposed for x86-based many core processors [27] using a newly defined sparse storage format ESB (ELLPack Sparse Block format) with careful load balancing. Figures 18 and 19 show the execution time in milliseconds of matrix multiplication implementation in RT-CUDA API and an efficient sparse GEMM implementation with ESB format using standard sparse matrices as shown in Table 11 and using hetpa diagonal sparse matrix with different dimensions respectively.
Ease and Efficient Implementation of Large Scale Solvers
RT-CUDA hides architectural details of the underlying GPU device that helps traditional C programmers to develop parallel programs in a fast and efficient manner. RT-CUDA supports efficient development of sparse linear solvers such as conjugate gradient to be used in reservoir simulation softwares. It also includes API functions to allocate and initialize sparse matrices with random sparsity as well as reading matrix from matrix market file to be used as input for the solver. The implementation of such a solver can be optimized using the combination of user-defined functions and invoking highly optimized library functions including cuBLAS and cuSparse library functions. RT-CUDA CG Implementation 2 shows the inputs and outputs for the conjugate gradient algorithm implementation in RT-CUDA. The implementation uses the combination of user-defined functions for merged operations as an implementation optimization and RT-CUDA API functions for highly optimized sparse matrix-vector multiplication and dot product. Figure 20 shows the execution time in milliseconds for RT-CUDA Conjugate Gradient (CG) implementation in comparison to CG implementation in MAGMA library for hepta diagonal sparse matrix with different dimensions. The results show the handoptimized MAGMA CG implementation provides 50 % higher performance in terms of execution time than RT-CUDA implementation but increase in the array dimension to larger arrays reduces the performance gap between RT-CUDA and MAGMA implementations and provides same performance for large matrices to dimension N = 400,384. This iterative CG algorithm is written in C and converted into optimized CUDA code with customized merged operations by using RT-CUDA which shows comparable performance to optimized MAGMA CG.
Evaluation of the Tool Using More Elaborated Algorithms
Hybrid Matrix Multiplication Using Strassen and Winograd Recursive Algorithms
Using RT-CUDA, We implemented the hybrid matrix multiplication recursive algorithms based on Strassen (S-MM) and its Winograd (W-MM) variant. The above Figure 21 shows the speedup achieved by S-MM and W-MM over native CUBLAS Sgemm versus the matrix size using one recursion level. Note that W-MM shows the highest performance for the studied range of matrix size. CUBLAS is faster than our implementations only for matrices where N ≤ 3072 (for S-MM) and N ≤ 2048 (for W-MM). For matrices with N >2048, W-MM becomes up to twice as fast as CUBLAS. S-MM becomes also faster (speedup of 1-1.9) than CUBLAS for N >3072. Figure 22 shows the performance in FLOPS for Strassen/Winograd implementation with CUBLAS versus the matrix size. Sorting the above implementations in descending order of best achieved performance gives W-MM, S-MM, and CUBLAS with best score 2.35 TFLOPS, 1.95 TFLOPS, and 1.35 TFLOPS respectively. The above performance scores represent 67, 55, and 38 % of peak performance for nVidia Kepler K20c GPU (3.52 TFLOPS). These results confirm the effectiveness of proposed kernel optimizations because significant fraction of peak GPU performance is achieved. Figure 23 shows the speedup achieved by S-MM, W-MM, and CUBLAS Sgemm over NVIDIA SDK library MM(Matrix Multiplication) implementation. Notice that S-MM, W-MM, and CUBLAS are 80×, 60×, and 41× faster than NVIDIA SDK However, as we increase the level of recursion the execution time is also increased due to the overhead of additional operations in both S-MM and W-MM implementations. The reason is that the reduction in one matrix multiplication using Strassen and Winograd is not large enough to offset the overhead of the matrix additions and the stack overhead due to recursive invocations when two recursion levels or more are executed. Also note that the high-performance MM CUBLAS Sgemm is highly optimized at a very low level of programming.
Matrix Transpose
Matrix Transpose is an important linear algebra function that has various applications in many numerical computing applications. Several factors hinder achieving optimal performance due to its inherent memory access patterns which cause bank conflicts in shared memory and prohibit the use of coalesced global memory access. RT-CUDA allowed to implement a few address permutation functions (APF) [20] that resolve the above limitations. Evaluation (Fig. 24) shows that RT-CUDA implementation of APF produces comparable performance to NVIDIA best implementation when using 32 × 32 tiles. The advantage of RT-CUDA implementation is the elimination of bank conflicts while allocating exactly the tile size memory space as opposed to best reported result which allocates for T × T tile an N × (N + 1) storage.
Advanced Encryption Standard (AES-128)
Further to elaborate the generalization of RT-CUDA, we have converted the sequential implementation of the AES-128 ECB Encryption using RT-CUDA and evaluated its performance on two of the recent GPUs (nVidia Fermi GPU: nVidia Quadro FX 7000 and nVidia Kepler GPU: Tesla K20c). We obtained a speedup of up to 87× against an advanced CPU (Intel Xeon X5690) based implementation. The AES algorithm basically comprises of four phases: (1) Key Expansion, (2) Initial Round, (3) Intermediate Rounds, and (4) Final Round. Readers are requested to [6] and [Rijndael-ammended from NIST] for details. In ECB mode, the forward cipher function (encryption) is done on each of the chunk or block of input plaintext separately. Since each block is processed independently of the others, this mode is exploitable for parallel implementation.
The AES 128-bit key is expanded in the host device (CPU) and stored in a 32-bit array of 44 elements, to be used as 128-bit round keys for 11 rounds. The host machine also divides plaintext data input into 16 bytes blocks; each contains 4 32-bit columns, which are all stored linearly in one dimensional array. The implementation loads expanded keys in constant memory while the t-tables (four 1 KB loopkup tables) in shared memory. The optimal number of blocks (Fermi = 64, Kepler = 1024) and threads (1024) are identified by RTA-CUDA. Figure 25 shows the speedup over CPU (sequential) implementation with both nVidia Fermi and nVidia Kepler GPUs using different input sizes. The RT-CUDA generated parallel implementations obtained speedup of up to 84 (on nVidia Fermi) and 87 (on nVidia Kepler) varies with different input sizes. However, for input data of low sizes (less than 1MB), the CPU and GPU performance is very similar. Figure 26 shows the normalized energy consumption for the SpMV, BiCG-Stab, and QMR algorithms. For each of the SpMV, BiCG-Stab and QMR algorithms, the plot shows the ratio of SK/MC, where SK and MC denotes the average energy consumption for the single-kernel implementation and multiple-kernel invocation using the CuSparse library. A reference sparse matrix is used with 10 % non-zeros in each row. The Nvidia Management Library (NVML) and the utility (nvidia-smi) have been used to measure power and energy using a time function. MC uses a more optimized library operator coding than SK, but the implied data-CUDA implements SK using ROS global synchronization and in-kernel iteration to preserve the vector data locality across the iterations. Hence, SK is less optimized than CuSparse at the operator level, but its single-kernel structure significantly enhances the memory access bandwidth, data locality, and data reuse.
Energy Aware Code Optimization
Normally, a global synchronization (GS) is needed for each inner product (reduction) and after each collective write (CW) to GM. The application of EORs (Sect. 3.1) seems to be effective for combining GSs to minimize overhead in addition to enhancing algorithm data locality. For example, the BiCG-Stab algorithm has five inner-products and two CW. Hence, there are seven kernel exit-entry in each iteration when BiCG is implemented using CuSparse. However, the application of EORs for optimizing BiCG-Stab energy consumption allowed implementing the main loop using only four GSs and preserving data locality due to in-kernel coding. Proposed EOR memory management rules (Sect. 3.1) produces the best results when the algorithm has more in-range vector operations which makes the best use of cached vector data. Each thread block updates many range vectors which are kept in ShM, thus saving the energy of repeatedly storing them back on GM before kernel exit and reloading at next kernel entry. Hence, SK implementation produces a saving of 58, 41, and 38 % of energy consumption for QMR, BiCG-Stab, and SpMV respectively when the matrix size is large enough (N > 5 × 105) to achieve enough energy saving on GM accesses and to amortize the in-kernel synchronization overheads against a more optimized operator coding.
Conclusion
We have proposed a restructuring algorithm with best possible kernel optimizations and energy-aware rules. A design of a restructuring tool (RT-CUDA) has also been presented based on the proposed restructuring algorithm that is capable to convert a standard C-program (input) into an optimized CUDA program (output). Proposed restructuring technique aims at minimizing kernel execution time combined with the use of minimal data movement and enhanced temporal and spatial localities to improve power efficiency. The code transformations are driven by some user-defined configuration parameters and API function calls provided in the tool with automatic kernel optimizations. Evaluations have been performed using various numerical algebra kernels including matrix addition (Madd), matrix-matrix multiplication (MM), matrix-vector multiplication (MV), vector-vector multiplication (VV) and also two specialized kernels demosaic (Demo) and histogram (Hist). The generated optimized code gives significant improvement in performance in most of the cases. For Madd, we obtained 45 % improvement over CUBLAS, and 42 % over openACC. For VV, we obtained 2 % improvement over CUBLAS, 50 % over GPGPU compiler, and same performance as OpenACC. For Demo, we obtained 17 % improvement over GPGPU compiler, and 37 % over OpenACC. For MM, we obtained 30 % improvement over GPGPU compiler, and 99 % over openACC. For MV, we obtained 99 % improvement over GPPGU compiler. For Hist, we obtained 97 % improvement over OpenACC. RT-CUDA also provides the functionality of using BLAS routines of the highly optimized library CUBLAS. By using this feature, it obtained performance improvements in terms of execution time of up to 80, 100, 4, and 56 % for MM, MV, MT, and VV respectively over GPGPU compiler with N = 4096. And it obtained performance improvements of up to 100, 33, 78, and 70 % for MM, MV, MT, and VV respectively over PGI OpenACC implementations with N = 4096. The tool also supports sparse operations for Sparse-Sparse Matrix Multiplication (spMM), Sparse-Dense Matrix Multiplication (spdMM), and Sparse-Dense Matrix Vector Multiplication (spdMV) using the best possible sparse matrix formats available in cuSparse library. RT-CUDA enables transparent invocation of the most optimized external math libraries like cuBLAS and cuSparse. In this manner, RT-CUDA bridges the gap between optimized linear algebra libraries and high-level GPU programming. The code produced by RT-CUDA, in terms of readability, is much better than the code produced by other sophisticated compilation infrastructures for GPUs and it can be easily modified to apply complex optimizations specific to the underlying device architecture that bridges the performance gap between compiler and hand-optimized code. RT-CUDA facilitates the design of efficient parallel software for developing parallel simulators such as reservoir simulators, molecular dynamics, etc. RT-CUDA also has the capability to successfully generate GPU kernels for the Jetson TK1 (nVidia Tegra 4) which has 192 CUDA cores. Generated CUDA code is valid for all family on NVIDIA GPU including the Jetson TK1 mobile device. In future, we will evaluate RT-CUDA on other GPU architectures such as nVidia Maxwell and nVidia Tegra (Mobile GPUs) with some additional application classifications. Futhermore, we are intending to explore code transformations and optimizations for other vendors' architectures such as AMD GPUs that uses programming frameworks other than CUDA.
to improve global memory bandwidth utilization, the programmer should use vector data types in the case when consecutive data loads are not aligned to 128 bytes. Texture Fetching utilized the texture memory that is a read-only portion of memory in device memory (DRAM) that has been cached (off-chip cache) on access [32, 44] . It has been accessible by all threads and host. It is optimized for 2D spatial locality, so threads of the same warp that read texture addresses that are close together achieve best performance. Texture references that are bound to CUDA arrays can be written to via surface-write operations by binding a surface to the same underlying CUDA array storage. Reading from a texture while writing to its underlying global memory array in the same kernel launch should be avoided because the texture caches are read-only and are not invalidated when the associated global memory is modified. So, texture fetches from addresses that have been written via global stores in the same kernel call returned undefined data. The data in texture can consist of 1, 2, or 4 elements of any of the following types: (1) Signed or unsigned 8, 16, or 32-bit integers, (2) 16-bit floating point values, and (3) 32-bit floating point values. Arrays declared in texture memory can be used in kernels by invoking texture intrinsic provided in CUDA such as tex1D(), tex2D(), and tex3D() for 1D, 2D, and 3D CUDA arrays respectively. Before invoking a kernel that uses texture memory, the texture must be bound to a CUDA array or device memory by calling cudaBindTexture(), cudaBindTexture2D(), or cudaBindTexturetoArray(). Coalesced Global Memory Access refers to combining multiple memory accesses into a single transaction [13] . Global memory is the slowest memory on the GPU. Simultaneous global memory accesses by each thread of a halfwarp (16 threads) during the execution of a single read and write instruction are coalesced into a single access. This is achieved based on the following conditions: (1) the size of the memory element accessed by each thread is either 4, 8, or 16 bytes, (2) the elements to be accessed form a contiguous block of memory, (3) the N th element is accessed by the N th thread in the half-warp, does not affect if any thread in between not accessing the global memory that is divergent warp, and (4) the address of the first element is aligned to 16 times the element's size. nVidia Kepler's Shuffle Instructions perform data exchange between threads within a warp [30] . It is more faster than the use of shared memory. This feature allows the threads of a warp to exchange data with each other directly without going through shared (or global) memory. So, this can present an attractive way for applications to rapidly interchange data among threads. There are four variants of shuffle instructions in CUDA that are __shfl(), __shfl_up(), __shfl_down(), and __shfl_xor(). Shuffle instructions can be used to free up shared memory to be used for other data or to increase warp occupancy and to perform warpsynchronous optimizations (removing __syncthreads()). All the __shfl() intrinsics take an optional width parameter which permits sub-division of the warp into segments. For example, to exchange data between 4 groups of 8 lanes in a SIMD manner. If width is less than 32 then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. A thread may only exchange data with others in its own subsection. Width must have a value which is a power of 2 so that the warp can be subdivided equally; results are undefined if width is not a power of 2, or is a number greater than warpSize.
User Driven Optimizations
Loop Collapsing is a technique to transform some nested loops into a single-nested loop to reduce loop overhead and improve runtime performance [18] specifically for irregular applications such as sparse matrix vector multiplication (spMV). Such applications pose challenges in achieving high performance on GPU programs because stream architectures are optimized for regular program patterns. It improves the performance of the application in three ways: (1) the amount of parallel work (the number of iterations, to be executed by GPU threads) is increased (2) inter-thread locality is increased (3) control flow divergence is eliminated, such that adjacent threads can be executed concurrently in an SIMD manner [24] . Thread and Thread-Block Merging enhance the data sharing among thread blocks to reduce the number of global memory accesses [47, 48] . Thread-Block Merge determines the workload for each thread block while Thread Merge decides the workload for each thread. If data sharing among neighbouring blocks is due to a global to shared memory (G2S) access, Thread-Block Merge should be preferred to better utilization of the shared memory. When data sharing is from a global to register (G2R) access, Thread Merge from neighbouring blocks should be preferred due to the reuse of registers. If there are many G2R accesses, which lead to data sharing among different thread blocks, the register file is not large enough to hold all of the reused data. In this case, Thread-Block Merge should be used and shared memory variables should be introduced to hold the shared data. In addition, if a block does not have enough threads, Thread-Block Merge instead of Thread Merge should also be used to increase the number of threads in a block even if there is no data sharing. Thread Merge achieves the effects of loop unrolling. It combines several threads' workload into one thread (combining N neighbouring blocks along column direction into one). By doing this, they can share not only shared memory but also the registers in the register file. Furthermore, some control flow statements and address computation can be reused, thereby further reducing the overall instruction count. The limitation is that an increased workload typically requires a higher number of registers, which may reduce the number of active threads that can fit in the hardware. Parallel Loop Swap is used to improve the performance of regular data accesses in nested loops [3, 24] . It uses to transform non-continuous memory accesses within the loop nest to a continuous memory access which is a candidate for the coalesced global memory access optimization. Strip Mining splits a loop into two nested loops [16, [41] [42] [43] . The outer loop has stride equal to the strip size and the inner loop has strides of the original loop within a strip. This technique is also used in loop tiling. In loop tiling or loop blocking, loops are also interchanged after performing strip mining to improve the locality of memory references that is why loop tiling is also called strip-mine-and-interchange. Bank Conflict Free Shared Memory Access improves performance by reordering the data into shared memory such that the memory addresses requested by the consecutive threads in a half-warp should be mapped to different memory banks of shared memory [10] . Shared memory banks are organized such that successive 32-bit words are assigned to successive banks and the bandwidth is 32 bits per bank per clock cycle. In GPUs, the warp size is 32 threads and the number of banks is 16. So, a shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. However, no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads. Using Read-Only Data Cache, introduced in nVidia Kepler in addition to L1 cache, can benefit the performance of bandwidth-limited kernels [30, 39] . This is the same cache used by the texture pipeline via a standard pointer without the need to bind a texture beforehand and without the sizing limitations of standard textures. This feature is automatically enabled and managed by the compiler, access to any variable or data structure that is known to be constant through programmer use of the C99-standard "const __restrict__" keyword tagged by the compiler to be loaded through constant data cache.
Compiler Optimizations
Common Sub-Expression Elimination is a compiler optimization technique that searches for instances of identical expressions, evaluates to the same value, and replace them with a single variable holding the computed value [9, 49] . It enhances the application performance by reducing the number of floating point operations. In CUDA, common sub-expression elimination can be used to avoid redundant calculations for the initial address of an array. Loop Invariant Code Motion (also called hoisting or scalar promotion) is a compiler optimization that has been performed automatically [17, 36] . Loop invariant code is a set of statements or expressions within the body of a loop that can be moved outside of the body without affecting the semantics of the program. It makes loops faster by reducing the amount of code that executes in each iteration of the loop. The CUDA C compiler automatically applies this optimization technique to the PTX code. Loop Unrolling is a compiler optimization technique that is applied for the known trip counts at the compile time either by using the constants or templating the kernel [28] . NVIDIA compiler also provides a directive '#pragma unroll' to explicitly activate the loop unrolling on a particular loop.
