Abstract Parallel programs consist of series of code sections with different thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or high degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resource. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69 times and 2.01 times on average.
Introduction
With the high computation power, GPGPUs have been a popular platform for applications in a wide variety of domains such as linear algebra, computational finance, and machine learning. In the meantime, researchers have made efforts [1] [2] [3] to support or improve big data processing using GPGPUs. In order to achieve the high performance for these applications, developers have to write parallel programs based on program languages including CUDA and OpenCL. As a result, writing parallel programs is the key to utilize the full potential of GPGPUs.
Parallel programs consist of series of code sections with different thread-level parallelism (TLP). Depending on application characteristics and the parallelization strategy, a parallel thread itself may contain both serial code and parallel loops. Such parallel loops inside a thread are referred to as nested thread-level parallelism. In other words, if a thread contains a parallel loop, which can be further parallelized, we consider it as a candidate for nested thread-level parallelism.
To exploit such nested parallelism in GPGPU (general purpose computation on graphics processing units) applications, the latest NVIDIA Kepler architecture in-troduces the support for dynamic parallelism, which enables a GPU thread to invoke another kernel during execution. Although dynamic parallelism reduces the overhead of invoking a GPU kernel from a CPU, two key limitations remain. First, the communication between a parent thread and its child threads has to be through global memory variables. Second, launching a kernel from a GPU thread involves the device runtime 1 ○ and has non-trivial performance overhead [4] . In this paper, we first study a set of benchmarks to show that they contain parallel loops with relatively small loop counts. As a result, the benefits from parallelizing such loops using dynamic parallelism fail to overweigh its overhead. Then, we propose our solution, referred to as CUDA-NP, to exploit nested parallelism within GPGPU applications. Similar to dynamic parallelism, CUDA-NP faces two fundamental challenges: 1) how to have different numbers of threads running in different code sections, and 2) how to enable lowlatency data communication between a parent/master thread and its child/slave threads. To address these challenges, CUDA-NP first re-maps threads in a thread block (TB) into a one-dimensional (1D) organization. Then, for each thread, referred to as a master thread, CUDA-NP adds a set of slave threads along a different dimension. The purpose of the slave threads is to help their master thread on its parallel loops. To do so, CUDA-NP introduces control flow to disable slave threads during sequential code sections. In CUDA-NP, the low cost data communication between a master thread and its slave threads is achieved through registers or shared memory. In a way, CUDA-NP can be viewed as lightweight dynamic parallelism.
As CUDA-NP essentially exploits nested parallelism within a single parallel thread to further improve TLP, the applicability of CUDA-NP is for parallel applications that contain parallel loops with relatively small loop counts. In addition, for a given target device, if the available TLP is not sufficient to hide computation or memory access latencies, further improvement on TLP is likely to improve the performance. In other words, CUDA-NP is beneficial for applications containing parallel loops with relatively small loop counts and their TLP is not enough for the target device.
Our proposed CUDA-NP is implemented as a source-to-source compiler framework, which takes CUDA kernels with OpenMP-like directives as the input and outputs optimized CUDA kernels to exploit nested parallelism. In this way, a GPGPU application developer only needs to add pragmas to identify parallel loops within a kernel to take advantage of CUDA-NP.
Our experimental results on NVIDIA GTX 680 GPUs show our proposed CUDA-NP achieves remarkable performance gains, up to 6.69 times and 2.01 times on average. Our optimized code also consistently outperforms the highly optimized library CUBLAS v5.0 on the benchmarks matrix-vector multiplication and transpose-matrix-vector multiplication for different input sizes.
In summary, our work makes the following contributions: 1) we study a set of GPGPU applications and highlight the characteristics of their nested parallelism; 2) we propose simple pragmas and a set of optimization techniques to support nested parallelism; 3) we implement our CUDA-NP using a source-to-source compiler to relieve the programming complexity from application developers; and 4) we show that our proposed solution is highly effective and significantly improves the performance.
The remainder of the paper is organized as follows. Section 2 presents a brief background on GPGPU architecture with a focus on NVIDIA dynamic parallelism. We also analyze a set of GPGPU applications to show the characteristics of their parallel loops. Section 3 presents our compiler framework to exploit nested parallelism. The experimental methodology is addressed in Section 4 and the results are presented in Section 5. Related work is discussed in Section 6. Section 7 concludes the paper.
Background

GPGPU Architecture and Programming Model
In order to achieve high computational throughput and memory bandwidth, GPGPU exploits many-core architectures and organizes the cores in a two-level hierarchy. First, a GPU contains multiple streaming multiprocessors (SMs) in the NVIDIA GPU architecture. An SM is also called a next generation SM (SMX) in NVIDIA's latest Kepler architecture and is similar to a compute unit (CU) in AMD GPU architecture. Each SMX/CU in turn consists of multiple streaming processors (SPs) or thread processors (TPs). An SMX/CU can support thousands of threads running concurrently, following the single-program multipledata (SPMD) programming model. In the CUDA programming model, the threads are also managed in a two-level hierarchy: thread grids and thread blocks (TBs). A GPGPU program, also called a kernel, is launched as a grid of TBs. A TB in turn contains multiple threads, which can have up-to-threedimension thread identifiers (ids). In our compiler, we always map a multi-dimensional thread id into a 1D one using the approach presented in Subsection 3.7. Therefore, in our subsequent discussions, we assume the input kernel has only 1D threads in a TB.
GPGPU employs the single-instruction multipledata (SIMD) model to amortize the cost of instruction decode and fetch. A small group of threads, referred to as a warp, share the same instruction pointer. The latest NVIDIA Kepler architecture introduces a set of shuffle (shfl ) instructions to enable the data exchange through registers for threads in the same warp. One shfl instruction used in this paper is shfl (var, laneID, laneSize). For this instruction, a warp (32 threads) is partitioned into small groups with the group size as laneSize. Then, laneID is used to specify the relative thread id in a group, and var is the variable to be read. For example, the instruction shfl (var, 0, 4) means that a warp contains eight groups with a group size of 4 and all threads in the same group will read var from the first thread of the group. As a result, threads with id 0, 1, 2, and 3, belonging to the first group, read var from thread 0; threads with id 4, 5, 6, and 7, read var from thread 4; and so on. Compared to shared memory, which can be shared among all threads in the same TB, the shfl instructions have higher performance with the following two limitations. First, it can only be supported for the threads in the same warp. Second, threads can read a register from another thread in the same warp, but cannot write to a register of another thread.
The support for dynamic parallelism is introduced to NVIDIA GPUs with compute capability 3.5. With dynamic parallelism, a GPU thread can launch a kernel during execution. Dynamic parallelism provides an easy way to develop GPU kernels for a program that contains nested parallelism without involving the host CPU. However, in order to achieve the high performance, the kernel launched by a GPU thread must have a very high number of threads to offset the overhead of launching a kernel. To illustrate the overhead of dynamic parallelism, we use the memory-copy micro-benchmark in our experiment on an NVIDIA Tesla K20c GPU. To copy 64-million floats, the baseline micro-benchmark without dynamic parallelism achieves the bandwidth of 142 GB/s. Then, we observe that once we enable the compiler flag for dynamic parallelism, the original kernel without using dynamic parallelism can only achieve 63 GB/s. Such overhead is referred to as dynamic-parallelism-enabled kernel overhead 2 ○ . Next, we modify the benchmark to make use of dynamic parallelism. In the dynamic parallelism version, we have a parent kernel and a child kernel. The parent kernel is launched once, but each thread of the parent kernel will launch a child kernel. Therefore, the child kernel can be launched many times. In the child kernel, each thread simply copies a float from the input to the output. If the number of threads of the parent kernel is m, and the number of threads of every child kernel launch is n, then m × n is the overall floats to be copied from the input to the output. We fix the value of m × n to 64 million and show the bandwidths for different m in Fig.1 . Although the overall workload remains the same, the performance degrades rapidly when the number of child kernel launches increases. In other words, each kernel launch needs to have a high number of threads to achieve good performance. From Fig.1 , we can see that when each child kernel launch has 16k threads, the overall memory copy bandwidth only reaches 34 GB/s. This highlights the kernel launching overhead for dynamic-parallelism. Another limitation of dynamic parallelism is that the communication between the parent thread and its child threads has to be through global memory 64m  32m  16m  8m  4m  2m  1m  512k  256k  128k  64k  32k  16k  8k  4k  2k  1k  512  256 Size of n Bandwidth (GB/s) Fig.1 . Throughput of the memory-copy micro-benchmark using dynamic parallelism. n: the number of threads per kernel launch for the child kernel.
Nested Parallelism in GPGPU Programs
GPGPU applications are typically highly parallelized due to the required TLP to hide high memory access latencies. Still, there exist parallel loops First, developers tend to view that there is already sufficient TLP. In the TMV example, the output vector c may have a high number of elements and further parallelization of the loop in Fig.2 may be considered to be not necessary. However, based on the characteristics of the applications, each thread may require too much resource to limit the number of threads that can run concurrently on each SMX, even though the overall number of threads of the application is indeed high enough.
Second, if a loop contains loop-carried dependencies, application developers may choose not to parallelize it. As shown in Fig.2 , to parallelize the loop, the reduction primitive needs to be supported.
Third, if an application developer chooses to further parallelize some parallel loops in a kernel, he/she needs to understand the GPGPU architecture very well, in order to achieve good performance. For example, if a parallel loop is distributed among threads in the same warp, we should avoid workload imbalance due to the nature of SIMD execution. Another key factor is how to balance the resource usage among shared memory, register file, and local memory. Therefore, in this paper, we argue for a compiler approach to exploit nested parallelism to relieve the application developers from the associated complexity.
Overall, the loop in Fig.2 showcases a common example in many benchmarks that contain parallel loops. An outstanding feature of a good candidate for leveraging nested parallelism is limited TLP due to the nature of the application or the heavy resource usage of each GPU thread.
Characteristics of Nested Parallelism in GPGPU Programs
In order to understand available nested parallelism in GPGPU programs, we studied the benchmarks in NVIDIA SDK 3 ○ , Rodinia [5] , GPGPUSim [6] and Mars [2] . The detailed experimental methodology is discussed in Section 5. Since 1k threads per SMX can provide sufficient TLP for NVIDIA Kepler architecture, further exploring the nested parallelism does not help the performance if a benchmark can support 1k threads per SMX. We list in Table 1 the benchmarks that contain nested parallelism but cannot achieve 1k threads per SMX due to resource constraints. We also observe that the input sizes can affect the resource usages and TLPs for some benchmarks. Therefore we also list the resource usages and input sizes in the table. In the table, the first column (name) contains the name of the benchmarks; the second column (input) shows the input to each benchmark; the third column (PL) shows how many parallel loops exist in the kernel; the forth column (LC) shows the largest loop counts among these parallel loops; the fifth column (R/S) indicates if the loops contain the scan(S)/reduction(R) operations or not (X). We also show the resource usage including register file (REG), shared memory (SM) and local memory (LM) per thread for the baseline code (BL) and optimized versions (OPT) generated from CUDA-NP.
From Table 1 , we can see that the benchmarks LE, LIB, and CFD have intensive local memory usage to limit their performance, even though they do have relatively high numbers of concurrent threads running on each SMX; the benchmarks LU, MV, SS, BK, LR, and KM have intensive shared memory usage, which limits the number of concurrent threads on each SMX; the benchmark MC has intensive usage of both shared memory and local memory; the remaining benchmarks, TMV and NN, do not have intensive resource usage. From these benchmarks, we can see that the loop counts of parallel loops are relatively small. In this section, we present our CUDA-NP compiler framework to leverage nested parallelism. The input to our compiler is a CUDA kernel with OpenMP-like directives to denote parallel loops. The output of our compiler framework is the optimized kernel code. Fig.3 shows an example. The kernel with directives denoting the parallel loop is shown in Fig.3 (a) and the optimized kernel using our proposed compiler is shown in Fig.3 (b). The first transformation on the input kernel is to increase the TB size. Given the hardware limit of GTX 680 GPUs, the maximal number of threads in a TB is 1 024. Therefore, the TB size is increased up to 1 024 if the input kernel has a smaller TB size. As our compiler already re-maps the threads in a TB into a 1D organization, increasing the TB size is achieved by adding a new dimension. The newly introduced threads will be used to carry out the parallel loops in the input kernel. To differentiate the original threads from the newly added ones, we refer to the original threads as "master" threads and the newly added ones as "slave" threads and we use "master id " and "slave id " as their thread ids along different dimensions. If the master threads are aligned in the X/Y dimension, slave threads will be added along the Y /X dimension, as discussed in Subsection 3.4. For the kernel shown in Fig.3 , "master id" is actually "threadIdx.x" and "slave id" is "threadIdx.y". The "lud perimeter" kernel in Fig.3 (a) has only 32 threads in a TB and each TB needs 3KB shared memory. As a result, 16 thread blocks can run concurrently on each SMX on an NVIDIA GTX 680 GPU, for which the shared memory is configured to 48 KB per SMX. The line 13 in Fig.3(a) is the pragma indicating the subsequent loop can be parallelized. Then in the code optimized using CUDA-NP, as shown in Fig.3(b) , each TB has 32×8 threads, where the size of the X dimension is 32 and the size of the Y dimension is 8. However, we only allow 32 threads in a TB to be active for the sequential code such as line 10 in Fig.3 (a) in the kernel. For the parallel loops, such as those between line 14 and line 16 in Fig.3(a) , all 256 threads per TB are active, and each thread processes one or more iterations of the loop. In other words, for each master thread, seven slave threads are added to share its parallel workload. Note that the parameter "slave size" is 8 as eight threads (1 master + 7 slaves) will all work equally on the parallel loops.
Next, we use one example to illustrate why our optimized kernel improves performance, as shown in Fig.4 . In Fig.4(a) , we assume each TB of the input kernel has eight threads running through its whole lifetime. Then in the optimized kernel, each TB has 8×4 threads as shown in Fig.4 (b). However we have only eight threads running for sequential codes, such as line 10 in Fig.3(a) , and 8×4 threads for loop sections, such as line 14 to line 16 in Fig.3(a) . These active threads that execute the sequential sections are the master threads, as they are very similar to original threads in the input kernel. But in the parallel loop sections, the workload of each thread in the input kernel will be distributed to the corresponding master thread and its slave as shown in Fig.4 (b). The overall performance is improved due to higher TLP for parallel loops. In the example in Fig.3 , the master threads are aligned along the X dimension, i.e., the master id is "threadIdx.x" in the corresponding TB of the input kernel. When we generate the slave threads for a master thread, they share the same master id (i.e., "threadIdx.x") but have different slave ids (i.e., "threadIdx.y"). Therefore, we actually use threads in different warps to work collaboratively for the workload of a master thread in the input kernel, since the warps are formed using consecutive thread ids. For example, the threads with 2D ids (1, 0) to (1, 7) in a TB of our optimized kernel perform the workload of the thread with id 1 in the TB of the original kernel. The thread with id (1, 0) in the optimized kernel will be the master thread corresponding to thread 1 in the original kernel and threads with ids (1, 1) to (1, 7) are its slave threads. These threads will be in different warps as the TB dimension is 32×8. Therefore, we refer to this way of distributing parallel loop iterations as inter-warp NP. Besides inter-warp NP, we can map the "master id" to "threadIdx.y" and map "slave id" to "threadIdx.x". For example, we use threads with thread ids (0, 1) to (7, 1) in our optimized kernel to perform the workload of thread 1 in the original kernel. In this way, we use threads within a warp to distribute the parallel loops. We refer to this way of thread id mapping or workload distribution as intra-warp NP.
As illustrated in Fig.3(b) , several key code transformations are performed by our proposed CUDA-NP compiler framework. In the sequential code sections, it introduces the control flow in line 10 to only allow the master threads to compute the variable array offset. Since this variable is to be used in the parallel loop by slave threads, it needs to be broadcasted to the slave threads. A function called "read from master " is introduced for this purpose. In the parallel loop, it updates the loop iterator using slave ids and the number of slave threads (i.e., slave size) so that multiple slave threads can process multiple loop iterations in parallel. As the loop bound checking remains in the transformed code, this transformation is valid if the loop bound is determined at the runtime. Also, from this example, we can see that a key challenge of our code transformations is to handle the variables which are live cross a parallel section and a sequential section.
In Subsection 3.1, we discuss how our compiler handles scalar live-in variables. Subsection 3.2 addresses the scalar live-out variables. For live array-variables, if they reside in global memory or shared memory, they are already accessible by the master threads and their salve threads. Therefore, in Subsection 3.3, we discuss our compiler transformation to deal with live array-variables, which are located in local memory. In Subsection 3.4, we summarize the trade-offs between the two workload distribution schemes, intra-warp NP and inter-warp NP. In Subsection 3.5 we discuss how to make the choice of optimization parameters, and in Subsection 3.6 we present the overall compiler algorithm for code transformations. Subsection 3.7 lists our proposed pragmas for NP. Subsection 3.8 discusses the preprocessing step of our compiler.
Scalar Inputs/Live-Ins to Parallel Sections
For a scalar variable defined in a sequential section and used in a subsequent parallel section, i.e., a "livein" variable to the parallel section, we need to broadcast it from a master thread to its slave threads. The exception is that the variable is in the global memory which is already visible to all the threads. For example, as shown with line 15 in Fig.3(a) , the global memory array m can be directly accessed by slave threads. Shared memory has similar behavior and can be used directly by slave threads without additional code transformations.
The variables in the register file or local memory have to be broadcasted to slave threads as they are private to a master thread. We implement it using the function read from master. If we use intra-warp NP for Kepler GPUs, since the master and its slave threads are in the same warp, we can use the shfl instruction, shfl (var, 0, slave size), to implement the read from master function as shown in Fig.5(b) . As explained in Subsection 2.1, for such a shfl instruction, threads of a warp first are partitioned into small groups with the group size of slave size. Therefore, a group actually contains all the slave threads of a master thread for intra-warp NP. Then all threads within a group will read the value of var from the master thread, whose id (i.e., threadIdx.x % slave size) is 0, in the small group. For inter-warp NP or intra-warp NP on GPUs that do not support shfl instructions, read from master is implemented using shared memory as shown in Fig.5(a) . In this case, master threads first write values to shared memory, and then slave threads read them from shared memory. Instead of communicating through registers or shared memory, another way is to let all slave threads compute the live-in variables redundantly. Such redundant computation is also called uniform vector operations as they have the same input and output values for different threads [7] . If the overhead is only simple ALU computations like line 10 in Fig.3(a) , in general, redundant computation can deliver better performance due to eliminating the shared memory usage and control flow. In our compiler, if an instruction's inputs are constant values or the output of a uniform vector instruction, this instruction will be executed by all slave threads redundantly. Otherwise, we let the master thread execute it and broadcast the result to slave threads.
Scalar Outputs/Live-Outs of Parallel Sections
Similarly to the live-ins of a parallel section, if a scalar output of a parallel section is in global memory or shared memory, we can just leave as it is, as it is already visible to the master threads. If a variable is in the register file or local memory, we have different scenarios to handle. One common scenario is a reduction or scan variable, for which we can generate the parallel implementations to retrieve the results from slave threads. We implement the reduction using shared memory for inter-warp NP or using the shfl instruction for intrawarp NP. For the scan implementation, we also use a similar approach to NVIDIA CUDA SDK 4 ○ .
There are scenarios that an output of a parallel section is neither a reduction variable nor a scan variable. One such example is the code "if (i==3) x=a [i] ;" inside a parallel loop, where i is the loop iterator and the variable x is used in later sequential sections. The problem with this code is that in our CUDA-NP scheme, each slave thread has a local variable x and will execute the code. But it is supposed that only one slave thread will write the value to x, and the x of this slave thread needs to be transferred to the master thread. In such a case, we can make the initial value of x be 0 so that a reduction operation on x can be used to retrieve the value from the slave threads.
Live Array-Variables Residing in Local Memory
Since the register file has a limited size and cannot be accessed as an indexed array, array variables residing in the local memory are used in some CUDA programs.
As shown in Fig.6 , the array Grad has to be spilled into local memory due to the register file size limitation. Such local memory accesses incur high pressure on the L1 cache and lead to poor performance. We apply our CUDA-NP on the parallel loops marked with our CUDA-NP pragmas in Fig.6 , and Fig.7 shows the code after our optimization. From  Fig.7 , we can see all parallel loops are distributed to multiple slave threads. For the loop starting from line 6 in Fig.6 , each slave thread only needs to compute NPOINTS/slave size iterations as shown from line 6 in Fig.7 . As shown from line 7 in Fig.7 , each iteration of the loop in a slave thread is mapped to an iteration of the loop of the baseline kernel before our optimization. In this way, all iterations of the loop in the baseline are distributed to slave threads. The reduction or scan operations are also appended after the loops if the pragmas specify the reduction or scan clauses.
As we discussed in Subsection 3.1, a local array is private to a thread, and not visible to other threads. However, in order for slave threads to process a parallel loop, this array has to be shared among those threads. Therefore, we need to replace a local array with a shared memory array or a global memory so as to make it visible to all threads. One exception is that a local array is accessed based on the loop iterator. For example, the parallel loops in Fig.6 always access the array Grad using the loop iterators. In this case, since each slave thread only needs to access part of the local array without interleaving, we can partition the local array into small ones and distribute each small array to one slave thread. Therefore, for a live local array, we can replace it with a global memory array, a shared memory array, or partition it into small local arrays as shown in Fig.7 . Since these approaches only affect the accesses to local arrays, we differentiate them using two MACROs: DEF Grad and Grad (i), in Figs.7(a), 7(b) , and 7(c) so that the code in Fig.7(d) remains the same. 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29 
1) Replace a Local Array with a Global Memory Array.
We first define a new global array and partition it such that each partition corresponds to the local array of a master thread. As shown in Fig.7(a) , the MACRO DEF Grad partitions a new global memory array Grad g based on the id of a master thread so that all slave threads of the master thread access the same partition.
2) Replace a Local Memory with a Shared Memory Array. In this case, we first declare a shared memory array. The size of its first dimension is master size, i.e., the number of master threads in a TB, and the size of its second dimension is the size of the local array. Then slave threads can access the shared memory based on its master thread id and the index of original local array. Since many benchmarks already use shared memory intensively, the potential issue of this approach is the increased usage of shared memory.
3) Partition a Local Array into Smaller Local Arrays. In Fig.7(c) , each slave thread only requires a smaller local array whose size is NPOINTS/slave size. This approach that requires a slave thread must only read and write its own local array after the partition.
Our framework employs the following policy to decide which option is to be used to replace a local array. First, if the local array meets all conditions to be partitioned into smaller ones, we choose option 3. Otherwise, the size of the local array is checked, and the shared memory is used to replace the local array, if the size of the local array is less than 384 bytes. The reason for this choice is that assuming the local array size is 384 bytes and we can launch eight slave threads for each master thread, 48 KB shared memory can support 128 master threads and 896 slave threads after our optimizations, which provides enough TLP on each SMX. If the shared memory is already used in the baseline, we also need to subtract such shared memory usage from 384 bytes to ensure that shared memory will not be the resource bottleneck for TLP. Since the size of shared memory for different generations of GPUs can be different, we also allow the developers to provide compute capability information to decide the size of shared memory. The last choice is to replace the local array with one in global memory even with the high access latency. The reason is that local memory is limited to be accessed within a single thread while multiple slave threads need to share it after enabling nest parallelism.
Inter-Warp NP vs Intra-Warp NP
The choice between inter-warp NP and intra-warp NP may have significant performance impact. Here, we summarize their trade-offs. First, since threads in the same warp can use registers to exchange data, shfl instructions can be used for communication and also the scan and reduction operations for intra-warp NP.
As a result, the intra-warp NP may have less shared memory usage. Second, if the slave threads of a master thread have different workloads, the intra-warp NP will be worse than inter-warp NP due to control divergence. Third, intra-warp NP may have negative impact on memory coalescing as it changes the memory access pattern of the original kernel. In general, the master threads in the original kernel have adjacent thread ids and tend to access the global memory in a coalesced way. If we map these master thread ids into "threadIdx.y" as the intra-warp NP approach, these coalesced global memory accesses are broken. Forth, a similar issue may also happen for constant memory accesses when we use intra-warp NP. Considering line 11 in Fig.6 , if "Grad" is a constant array, threads in a warp will access the same address of "Grad" in the baseline. However, after intra-warp NP, slave threads of a master thread will access different addresses of the constant array. Such accesses cannot leverage the hardware broadcast logic and may hurt performance. Finally, to use the shfl instructions, the number of slave threads for a master thread has to be (a 2's power − 1 ), i.e., 1, 3, 7, 15. Otherwise, these slave threads might be in different warps.
Choice of Optimization Parameters
From previous discussion, we can see that when applying our CUDA-NP on an application, two important optimization decisions may affect the performance. The first one is the choice between intra-warp NP and interwarp NP, and the second one is the number of slave threads. One possible solution is to use an auto-tuning approach to explore different options and choose the best one based on the test runs. Since the search space is not larger than 10 if we limit the number of slave threads to be 2, 4, 8, 16 or 32 with intra-warp NP and inter-warp NP, the cost of auto-tuning is reasonable. However, one weakness of the auto-tuning approach is that it requires the runtime support and cannot guarantee whether the decision is optimal for different inputs as it can only cover specific inputs. Therefore, we develop a static analysis approach to select the best optimization strategy for applications.
For the choice between intra-warp NP and interwarp NP, our static analysis approach checks the memory access pattern of the input program based on previous techniques [8] . If the coalesced memory access requirement is satisfied for the input program, we should apply inter-warp NP as intra-warp NP will break the coalesced memory accesses. If the input program does not follow the coalesced memory access pattern while intra-warp NP can improve memory performance, intra-warp NP is used for the program. For other cases, inter-warp is preferred as it has more advantages as discussed in Subsection 3.4.
For the number of slave threads, we only consider the choice among 2, 4, 8, 16, and 32 as the reasons mentioned before. While having more slave threads can introduce more TLP, it may not have positive impact on the performance due to the following reasons. 1) The cost of the reduction and scan operations is also increased with more slave threads. 2) Since the workload of each master thread is distributed to slave threads, increasing the number of slave threads reduces the workload of each thread and may result in less instruction level parallelism (ILP) for each thread. 3) Having more slave threads improves performance only if the parallel loops are the bottleneck due to Amdahl's law. In order to simplify the problem, we choose to find a nearoptimal number of slave threads instead of an optimal number of slave threads. Our purpose is to use static analysis and empirical experience to decide the number of slave threads to achieve relative good performance for most of applications. First, among 2, 4, 8, 16, and 32, our compiler can determine the possible ones for an application based on resource usage and thread block configurations. For example, if an application has 128 threads per thread block, the maximum number of slave threads is 8 due to the limit of 1 024 threads per thread block. Other resources such as registers and shared memory also limit the number of slave threads. Second, our compiler uses the middle one, 8, as the target for the number of slave threads as the benefit of larger number is reduced while introducing more overhead. Overall we will use 8 if the application can support 8 as the number of slave threads. Otherwise we will use four slave threads if the application can support four slave threads or two slave threads if the application can only support two slave threads.
Although our static approach cannot achieve the optimal performance, it does not require additional autotuning and does not rely on specific inputs. We will evaluate our static approach in the experimental result section.
Compiler Algorithm
Here, we summarize our CUDA-NP compiler algorithm, as shown in Fig.8 . CUDA-NP takes a kernel as the input. It parses the kernel into a series of code sections. Each code section is either sequential or parallel. A parallel section is identified by the "np" pragma. First, we map the thread id of the input kernel to the master and the slave thread ids in the transformed kernel for either the inter-warp NP or the intra-warp NP approach. Then, if a code section is sequential, we generate the control flow to only allow the master threads to execute it. Redundant computations can be used in sequential sections depending on the characteristics of an instruction as discussed in Subsection 3.1. For parallel sections, all slave threads along with their master threads are active. For each parallel section, we also generate the code for its scalar input (Subsection 3.1) and the code for its scalar output (Subsection 3.2). The live local arrays have to be replaced with global/shared memory arrays, or partitioned into smaller local arrays, as discussed in Subsection 3.3.
NP_transformation(Kernel kernel) css = generateCodeSections(kernel) inter-warp or intra-warp thread map for kernel (Subsection 3.4) for cs in css:
if cs is sequential: cs is master thread model if cs is a parallel loop: map each slave thread id to iterations of cs for each input in of cs: insert broadcast function for in before cs (Subsection 3.1) for each output out of cs:
insert reduction or scan for out after cs (Subsection 3.2) for each live local memory array lm (Subection 3.3) map lm to global memory, shared memory or the register file 
Pragmas
In order to reduce the programming complexity to leverage nested parallelism, we adapt the OpenMP pragmas for our CUDA-NP framework.
Most of CUDA-NP grammars are designed to be very similar to OpenMP pragmas on purpose. A developer can add "#pragma np for" to denote a parallel loop, and can also specify different clauses of the pragma. A copyin clause defines the data which should be broadcasted from a master thread to its slave threads. If a copy-in clause is not available from users' pragmas, our compiler can automatically find the live-in variables defined before a parallel loop and make them broadcasted from a master thread to its slave threads. A reduction/scan clause defines the reduction or scan operations. Developers have the flexibility to specify the preferred number of slave threads (number threads), whether the inter-warp NP or intra-warp NP is preferred (NP type), and the targeted version of NVIDIA CUDA compute capability (sm version). Our current support for compute capability versions is mainly for the purpose of using shfl instructions. If the target version is less than 3, the shfl instruction cannot be used to guarantee correctness. If a developer does not provide such information, our compiler generates multiple versions to explore different numbers of slave threads, and different thread distribution approaches.
Preprocessors
The purpose of the preprocessors to our compiler is to generate the input source code suitable for our code optimizations.
1) Convert a TB with Multi-Dimensional Threads into a TB With 1D Threads. We use the mapping relationship shown in Fig.9 to map multi-dimensional thread ids to 1D ones and vice versa. This transformation has limited performance impact since it does not change thread organizations within warps. In other words, the threads in a warp remain in a warp after the transformation. Therefore, it does not affect memory coalescing or divergence. 2) Combine Unrolled Statements into a Loop. We find that sometimes the developers may manually unroll some loops. Since our compiler targets at parallel loops, for statements after unrolling, they can be combined into a parallel loop to take advantage of CUDA-NP. Fig.10(a) shows such an example, as the input of each statement cannot be mapped to an iterator of a loop directly. In our pre-processor, we put the nonlinear indexes in constant buffers, and then access these indexes using loop iterator. In this way, we can convert such sequential code into a parallel loop.
3) Pad Arrays. As shown in Fig.6 , the size of the local memory array Grad is 150, which is not multiple of 4, 8, 16, or 32. However, if we apply the inter-warp NP scheme, the number of slave threads of a master thread has to be (a 2's power number −1) and the loop count needs to be a multiple of slave size. In this case, we can pad the size of Grad to 160 and also increase the upper bound of the loop to 160 so that the loop counter is the multiple of 32. Then an additional control flow "if (i < 150)" is added in the loop body to skip the padding data, where i is the loop iterator. Such padding may introduce workload imbalance among slave threads due to some idle iterations. 
4)
Automatic CUDA-NP Pragma Insertion. In order to simplify the development, it is possible for our preprocessor to automatically insert the CUDA-NP pragma for some applications if their data dependencies can be accurately resolved at compilation time.
Experimental Methodology
To evaluate our proposed compiler, we perform our experiments on NVIDIA GTX 680 GPUs with CUDA SDK 5.0. We let the CUDA runtime determine the shared memory usage automatically based on the resource requirement of each benchmark. Most benchmarks used in the experiments are from NVIDIA SDK, GPGPUSim, and Rodinia benchmark suite. Among these benchmarks, MarchingCubes (MC) is from NVIDIA SDK, and Libor (LIB) is from GPGPUSim. Lud (LU), Leukocyte (LE), Streamcluster (SS), Computational Fluid Dynamics (CFD), BucketSort (BK), and Nearest Neighbor (NN) are from Rodinia. LE is the array order version [9] , and BK is in the Hybrid Sort package. Since NN only uses one thread in each TB and has very poor performance, we first modify the TB configuration so that each TB has 32 threads, which is 2.89 times faster than the original version. Then we use this modified version as the baseline in our experiments. We use the optimized matrixvector multiplication (MV) based on [10] . The TMV code is shown in Fig.1 . We also include two MapReduce GPGPU applications, KMean (KM) and LinearRegression (LR) from Mars [2] . In Table 1 , as a comparison, we show the resource usage of our optimized benchmarks. For these benchmarks, we manually add the NP pragma to identify parallel loops.
We implement our proposed CUDA-NP in a sourceto-source compiler using Cetus [11] . Our compiler has an auto-tuning mechanism to select from multiple choices, such as intra-warp NP or inter-warp NP, and different numbers of slave threads to be used to distribute parallel loop iterations.
Experimental Results
In Fig.11 , we report the speedups of the optimized kernel generated by our compiler over the baseline. As shown in the figure, our proposed CUDA-NP can achieve from 1.36 to 6.69 times speedups. On average using the geometric mean (GM), among the twelve benchmarks, our proposed CUDA-NP can achieve 2.01 times speedup using auto-tuning approach and 1.77 times speedup using our static approach. The autotuning approach will try both inter-warp NP and intrawarp NP with 2, 4, 8, 16, and 32 slave threads to find the best configuration for each application. The static approach will use memory accesses to determine one of intra-warp NP or inter-warp NP, and try to use eight slave threads if possible as discussed in Subsection 3.5. From the figure, the static approach cannot achieve the optimal performance, but it is only 12% less than the auto-tuning approach and does not need any runtime support.
In order to better understand the impact of our CUDA-NP on these benchmarks, we show results for different slave sizes coupled with either inter-warp NP or intra-warp NP in Fig.12 . Among these benchmarks, LU and NN are the only cases that intra-warp NP achieves better performance than inter-warp NP. The main reason for LU is that the loops of LU are in the control flow "master id < 16". The intra-warp NP approach allocates slave threads in the same warp for a master thread. Assuming each master thread has three slave threads, each warp will contain only eight master threads (eight master threads + 24 slaves). These eight master threads and their slaves will execute the same path. Therefore, control divergence disappears after intra-warp NP. Our static approach cannot capture such special case so it fails to find the intra-warp NP is the better one. Furthermore, when three slave threads are allocated to each master thread (i.e, slave size = 4), it achieves the best performance for both intra-warp NP and inter-warp NP, as it enables 2k threads per SMX based on the resource usage. For NN, the intra-warp NP version can access the global memory in a more coalesced manner while the impact of inter-warp NP is For other benchmarks, inter-warp NP always outperforms intra-warp NP. MC, LIB and LE have imbalanced workload among slave threads for intra-warp NP. For example, since the loop count of MC is 12, and if we allocate seven slave threads for each master thread, then some slave threads have to take two iterations and the others take one iteration. LIB and LE have similar behaviour to MC.
The difference between inter-warp NP and intrawarp NP for CFD is minor, because the memory accesses of the baseline are not all coalesced and the loop iterations can be evenly distributed to slave threads.
For the two MapReduce applications, LR and KM, the baselines have 256 threads per thread block. Therefore, we can only use four or two slave threads per master thread. The number of slave threads has small impact on the inter-warp NP but larger impact on intrawarp NP due to memory coalescing issues.
From Fig.12 , we can also see that in some cases, the performance degrades when the number of slave threads increases. This observation is consistent with the previous work [12] in that higher TLP is not always helpful.
As we discussed previously, if the loop count of a parallel loop counter is not power of 2, we need to pad it to the multiple of power of 2 for intra-warp NP. However, for inter-warp NP, such limitation can be removed. Therefore in Fig.13 , for the benchmark LE, we compare the results without padding to those with padding. The loop count of the baseline is 150. We choose to compare these results that have similar number of slave threads. If we compare 3 threads (no padding) to 2 threads (padding), 5 threads (no padding) to 4 threads (padding), 8 threads (padding) to 10 threads (no padding), 16 threads (padding) to 15 threads (no padding), we can see that the no-padding version (NP) always outperforms the padding version (P). The best optimized version can achieve 2.25 times speedup over the baseline.
Since the benchmarks TMV and MV are available in NVIDIA CUBLAS library, we also compare our optimized version to CUBLAS v5.0 5 ○ in Fig.14 and Fig.15 on GTX 680 GPUs. We also include the SMM version for MV [10] in Fig.15 as a reference. In Fig.14 , the height of input matrix is always 2k, and we vary the width of the input matrix. From this figure we can see that our baseline has similar performance to CUBLAS, and our CUDA-NP solution achieves significantly better performance. For matrices with smaller sizes, since the number of overall threads is determined by the width of input matrix, our approach enables more threads to occupy the SMXs and achieve even higher speedups. For example, if the input width is 1k, our CUDA-NP version delivers 4.9 times speedup over CUBLAS. We compare our result of MV to CUBLAS and SMM in Fig.15 . The shared memory configuration is set to 48 KB per SMX. We fix the width of input matrix to 2k and vary the height from 1k to 64k. The height determines the overall number of threads for the baseline. From Fig.15 , we can see that our solution always outperforms both SMM and CUBLAS.
As discussed in Subsection 3.3, for a local memory array, we may replace it with an array in global memory/shared memory or partition it so that the small partitions can be allocated as registers. Among all benchmarks, we are able to apply such optimizations to LE and LIB. We show the performance results when using global memory, shared memory, or register file to replace the local memory arrays for these two benchmarks in Fig.16 . As show in the figure, using global memory does not help the performance, as the global memory is off-chip memory and the local memory can be cached through L1 cache. The performance of using shared memory depends on the usage of shared memory. Since the size of the local memory array of LE is about twice of the size of local memory array of LIB, we can see heavy shared memory usage for LE actually hurts the performance while we can observe the performance gains for LIB. Since the register file is much larger than shared memory, it works the best for both benchmarks. The benefit of using the shfl instruction is shown in Fig.17 . In the figure, we use the best inter-warp NP version as the baseline to be normalized to. Then we show the speedup of the version of using shfl instruction for reduction or scan operations over the version of using shared memory. From this figure, we can see that the shfl instruction is very useful for MC and LU. These two benchmarks have intensive shared memory usage, and using shared memory for reduction or scan operations can make it even more intensive. For other benchmarks, the impact of using shfl instruction is minor as the scan or reduction primitive only takes a small amount of execution time. We also see some small slowdowns, and consider the reason may be due to memory behaviour changes.
Related Work and Discussion
Many compiler frameworks [8, [13] [14] [15] [16] have been developed to utilize high-throughput GPGPU architecture. For example, OpenMPC was proposed to transfer legacy OpenMP programs to GPGPU programs, and [8] translates the naïve GPU programs into optimized ones. Most of these studies focus on optimizing global memory accesses and enabling massive thread-level parallelism. To the best of our knowledge, nested parallelism has been overlooked and our solution is the first compiler approach to exploit it.
It is shown in [17] that using dynamic parallelism can improve the performance of the divide-and-conquer algorithms as the child kernels can be launched by the GPU instead of the CPU. However, the authors also observed slowdowns due to dynamic parallelism for regular applications such as K-mean.
Dynamic parallelism is more suitable for massive nested loops instead of benchmarks listed in the paper. Furthermore, dynamic parallelism may require additional development effort due to the communication between a child kernel and a parent kernel. For example, shared memory is used in the nested loops for benchmarks MC, LU, MV, SS, and BK. In order to utilize dynamic parallelism, developers have to write the code to copy the data from shared memory to global memory so that the child kernel can access the data, and then copy the data from global memory back to shared memory. Such expensive memory copy introduces both performance overhead and code complexity. In comparison, our CUDA-NP solution eliminates such redundant communications. For the benchmarks, NN, TMV, LE, LIB, and CFD, we implemented dynamic parallelism versions, which show 28.92, 7.61, 13.45, 125.67 and 52.29 times slower than their original versions, respectively, due to the communication cost and the dynamic-parallelism-enabled kernel overhead. Using NN as an example, which has a parallel loop with a large loop count, if we choose to let each thread of parent kernel launch a child kernel to perform the parallel loop, the dynamic parallelism version is about 28.92 times slower, compared to the version without dynamic parallelism. Then, we manually optimize it by only using the first thread in a TB to start a child kernel to reduce the number of kernel launches. This version is still 3.25 times slower. Overall, for benchmarks used in this paper, dynamic parallelism cannot help the performance for benchmarks even with manual optimizations, as the available NP is too limited to offset the high overhead of dynamic parallelism.
Some recent work focuses on identifying the best performing version in a large search space, using either analytical models [18] or auto-tuning [19] . Since CUDA-NP only generates a small number of versions, the optimal version can be found by testing these versions exhaustively. In other words, a simple auto-tuning can be used to find the optimal configuration. Furthermore, our experiments also reveal some key factors to find the optimal version for CUDA-NP. First, memory coalescing and intra-warp divergence can be used to determine the priority between intra-warp NP and inter-warp NP. Second, using three or seven slave threads achieves close-to-optimal performance for all benchmarks in our study.
While many architectural improvements [20] [21] [22] and application optimizations [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] have been proposed for GPGPU programs, only few observe the potential of nested TLP in GPU programs. In [22] , a hardware approach is proposed to create threads dynamically in the runtime to reduce the overhead of complex control flow, similar to NVIDIA dynamic parallelism. In [24] , a warp is used to handle one job instead of one thread so as to reduce control divergence. However, in this paper, we show that inter-warp NP, i.e., distributing one master thread's workload into different warps, may be suitable for many benchmarks. Furthermore, compared to [24] , our approach only requires an application developer to write pragmas instead of developing new programs. In addition, our compiler can also handle the reduction and scan variables and leverage the shfl instructions.
Nested parallelism is specified as an option in the OpenMP and is supported in some implementations such as [36] . Compared to previous OpenMP studies [37] [38] [39] [40] [41] [42] [43] [44] on nested parallelism for CPUs, our solution is the first language extension to support nested parallelism for GPGPU platforms, and we show different ways to handle the scan and reduction operations, the inter-warp and intra-warp NP schemes, as well as careful resource managements on GPGPUs.
Conclusions
In this paper, we proposed a novel compiler solution to leverage nested parallelism for GPGPU application. We observed that many benchmarks have relatively light nested parallelism, i.e., parallel loops with small loop counts, which cannot be effectively exploited using the recently introduced dynamic parallelism. Therefore, we proposed to partition the code sections into sequential sections and parallel sections, and then enabled different numbers of threads for sequential sections and parallel sections. In order to simplify application development, we implemented our approach as a compiler framework to support directive-based nested parallelism. In our proposed CUDA-NP compiler framework, an application developer only needs to add OpenMP-like pragmas for parallel loops, and our compiler will generate the optimized code. Our compiler can handle the reduction and scan variables, choose either the intra-warp NP or inter-warp NP approach to distribute the parallel loop iterations, and automatically handle different types of live variables across code sections. Our performance results show significant performance gains and demonstrate the effectiveness of our proposed solution.
