Contemporary GPUs allow concurrent execution of small computational kernels in order to prevent idling of GPU resources. Despite the potential concurrency between independent kernels, the order in which kernels are issued to the GPU significantly influences application performance. A technique for deriving suitable kernel launch orders is therefore presented, with the aim of reducing the total execution time. Experimental results indicate that the proposed method yields solutions that are well above the 90th percentile in the design space of all possible permutations of the kernel launch sequence.
Introduction:
Graphics processing units (GPUs) have experienced widespread adoption in the scientific computing community as application accelerators. Programmers encapsulate parts of their application as compute kernels for execution on the GPU co-processor, using language extensions such as NVIDIA's CUDA [9]. Frequently, these compute kernels cannot fully utilize the GPU resources. Vendors have therefore introduced support for concurrent kernel execution, enabling increased resource utilization and an overall reduction in GPU execution time. For NVIDIA GPUs, concurrency is achieved by queueing independent kernels into separate CUDA streams. When a limited number of streams is deployed, it is well known that the achieved parallelism is affected by the order in which kernels are enqueued into their respective streams, due to false dependencies arising from hardware and software limitations [11]. To avoid these false dependencies, users can dedicate one stream to every kernel, as long as the kernels are independent. However, researchers have overlooked the fact that even in this case, the order in which the streams are initiated can significantly influence the concurrency and thus the total execution time. For instance, a recent study [7] reported that the effect of kernel launch order on the total execution time is insignificant; however, that conclusion was erroneous because it was based on identical kernels differing only in the number of thread blocks within each experiment. As we shall see shortly, ordering does not matter in that case. Only very recently, Pai et al. [10] identified this issue of "non-commutative concurrency" for GPUs; their solution, however, follows a different approach through source-to-source transformation of kernels into elastic versions, whereas we propose reordering the kernel launches without any kernel modification.
Li et al. [5, 6, 2] also proposed several power-, energy-, and performance-aware scheduling techniques for concurrent GPU kernel execution. That work primarily aimed to support efficient GPU sharing [1, 3, 4] by improving the overall GPU resource utilization through efficient kernel scheduling algorithms.
Fundamental Concept of Reordering:
GPU cores, or streaming processors (SPs), are organized into groups known as streaming multiprocessors (SMs). Each SM executes one or more thread blocks. When several kernels are ready for execution, all thread blocks from the earliest issued kernel are allocated to the SMs first, followed by thread blocks from the next issued kernel [10]. If the total number of thread blocks does not exceed N_SM, kernels do not share any SM, and the launch order has no impact on the total execution time. With a larger number of thread blocks, on the other hand, multiple thread blocks from one or more kernels need to share an SM. For instance, if there are 2N_SM thread blocks in total, each SM is assigned two thread blocks. In general, additional thread blocks are mapped to SMs in a round-robin fashion until any one of the SM resource limits is reached: N_reg_SM, N_shm_SM, N_warp_SM and N_blk_SM, as defined in Table 1. When a kernel exhausts just one of the SM resources while leaving the others underutilized, it prevents additional thread blocks from being assigned to the SM, and those thread blocks are relegated to the next execution round. The thread blocks from a set of kernels are therefore split into multiple execution rounds, which are executed sequentially. Concurrency within each round depends on how fully the resources are utilized; an ill-suited launch order can leave just one of the SM resources heavily utilized, limiting the number of concurrent kernels within an execution round and thus reducing performance. Our goal is therefore to obtain a launch order that maximizes the utilization of all SM resources within an execution round.
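As a rough illustration of how the per-SM limits cap the number of co-resident thread blocks, the following Python sketch computes that cap and names the resource that runs out first. The limits are the GTX580 values quoted in the experimental section; the class and function names, the example kernels, and the assumption that each usage figure is given per thread block are illustrative and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-SM limits; defaults are the GTX580 values quoted in the text."""
    regs: int = 32 * 1024   # N_reg_SM
    shm: int = 48 * 1024    # N_shm_SM, in bytes
    warps: int = 48         # N_warp_SM
    blocks: int = 8         # N_blk_SM


@dataclass(frozen=True)
class BlockUsage:
    """Resource usage of ONE thread block (assumed granularity)."""
    regs: int
    shm: int
    warps: int


def max_blocks_per_sm(block: BlockUsage, sm: SMLimits = SMLimits()):
    """Return (number of blocks that fit on one SM, limiting resource)."""
    caps = {"blocks": sm.blocks}
    if block.regs > 0:
        caps["regs"] = sm.regs // block.regs
    if block.shm > 0:
        caps["shm"] = sm.shm // block.shm
    if block.warps > 0:
        caps["warps"] = sm.warps // block.warps
    limiting = min(caps, key=caps.get)
    return caps[limiting], limiting


# Hypothetical kernel that is heavy on shared memory only: one block per SM,
# even though registers and warps are mostly idle.
shm_heavy = BlockUsage(regs=2048, shm=40 * 1024, warps=4)
# Hypothetical kernel with a more even footprint.
balanced = BlockUsage(regs=4096, shm=4096, warps=8)

print(max_blocks_per_sm(shm_heavy))
print(max_blocks_per_sm(balanced))
```

The first kernel shows the situation described above: a single exhausted resource keeps further blocks out of the SM while the other resources stay underutilized.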
Scope and Applicability:
Reordering is useful only when the total number of thread blocks exceeds N_SM, which is normally the case. Even then, if the kernels are identical and differ only in the number of thread blocks, the composition of each execution round and the number of rounds are the same regardless of the order, because a thread block cannot be split across SMs. In this specific case, the order does not matter. Additionally, even if the kernels are not identical, it may happen that the thread block of every kernel is so resource-heavy that an SM can accommodate only one thread block at a time; in this case too, the order does not affect performance. Our work thus addresses the remaining cases, which are the most common.
Balancing Compute & Memory Accesses:
Apart from resource limitations, multi-kernel execution performance is affected by the balance between compute and memory accesses. As indicated by NVIDIA, even for a single kernel there exists a suitable target value R_B for the balanced instructions/bytes ratio, and we apply the same concept to multiple kernels. For each execution round, we aim to achieve a combined instructions/bytes ratio R_comb that is as close to R_B as possible. This translates to launching memory-bound kernels in close proximity to compute-bound kernels. Using CUDA profiler data from the individual kernels, we can compute R_comb = (total # of instructions) / (4 * (total # of global stores + total # of L1 cache global load misses)).
Proposed Algorithm:
Considering both factors, SM resources and balanced compute/memory, we propose and implement (in C) a greedy algorithm for scheduling GPU kernels. The basic idea is to select the kernel launch order such that the number of kernels within an execution round is maximized and the SM resources are progressively utilized in a balanced manner as kernels arrive. Kernels are selected sequentially based on a computed score. ScoreGen(K_X, K_Y) computes the score for every kernel pair taken from the sets K_X and K_Y, respectively. The resulting score matrix is two-dimensional or one-dimensional, depending on the input dimensions. For every kernel pair, the SM resources that remain available add to the score, lines 18-20 in Algorithm 1 (see Table 1 for symbol definitions). Kernel pairs that result in a balanced (and lower) usage of all three resources receive the highest score, allowing more subsequent kernels to co-execute within the execution round. Similarly, a higher score is given if the resulting instructions/bytes ratio for the execution round is closer to the target value R_B, line 22 in Algorithm 1.
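The combined ratio R_comb defined above can be computed directly from per-kernel profiler counts. The sketch below is a minimal illustration, assuming the three counts are available as plain integers; the class name, field names, and example counter values are hypothetical, and only R_B = 4.11 comes from the text.

```python
from dataclasses import dataclass
from typing import Iterable

R_B = 4.11  # balanced instructions/bytes target quoted for the GTX580


@dataclass(frozen=True)
class KernelCounters:
    """Profiler counts for one kernel (field names are illustrative)."""
    instructions: int
    global_stores: int
    l1_global_load_misses: int


def combined_ratio(kernels: Iterable[KernelCounters]) -> float:
    """R_comb = instructions / (4 * (global stores + L1 global load misses)),
    with every count summed over the kernels sharing an execution round."""
    ks = list(kernels)
    inst = sum(k.instructions for k in ks)
    mem = sum(k.global_stores + k.l1_global_load_misses for k in ks)
    if mem == 0:
        return float("inf")  # no global memory traffic at all
    return inst / (4 * mem)


# Made-up counts: one kernel below R_B (memory-bound), one above (compute-bound).
mem_bound = KernelCounters(800_000, 50_000, 50_000)
cmp_bound = KernelCounters(2_400_000, 50_000, 50_000)

for label, group in [("memory-bound", [mem_bound]),
                     ("compute-bound", [cmp_bound]),
                     ("combined", [mem_bound, cmp_bound])]:
    r = combined_ratio(group)
    print(f"{label}: R = {r:.2f}, |R - R_B| = {abs(r - R_B):.2f}")
```

With these made-up counts, the pair's combined ratio lands much closer to R_B than either kernel alone, which is the motivation for launching opposing kernel types next to each other.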
Note that the conditional statement in line 21 ensures that this score is added only if the kernels under consideration are of opposing types, i.e., compute-bound vs. memory-bound, because R_B is deemed to be the ratio for an ideal, balanced kernel that is neither compute-bound nor memory-bound. For each execution round r, the pair of kernels with the highest score is selected and inserted into the round, denoted by the set Rdr. The inserted pair is ordered by decreasing shared memory usage, since this lets kernels with larger N_shm_i finish earlier and thus release their shared memory sooner. The kernel pair is virtually combined, by profile, into a virtual kernel K_comb with the function ProfileCombine(), so that the overall resource usage of the current Rdr can be taken into account when choosing the next kernel for the execution round. Kernels continue to be added to round r as long as resources permit; otherwise a new round r+1 is opened.
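The selection loop described above can be sketched in Python as follows. This is a simplified illustration and not a transcription of Algorithm 1: the exact score terms and weights are not given in the text, so the score used here (the sum of remaining resource fractions, plus a closeness-to-R_B bonus for opposing kernel types) is an assumption, as are the per-round resource model (each kernel contributes one fixed footprint against a single SM budget), the names, and the example kernels.

```python
from dataclasses import dataclass
from itertools import combinations

R_B = 4.11                                              # from the text
LIMITS = {"regs": 32 * 1024, "shm": 48 * 1024, "warps": 48}  # GTX580 per-SM


@dataclass(frozen=True)
class Kernel:
    name: str
    regs: int
    shm: int
    warps: int
    inst: int   # instruction count
    mem: int    # global stores + L1 global load misses (assumed > 0)

    @property
    def ratio(self) -> float:
        return self.inst / (4 * self.mem)


def profile_combine(a: Kernel, b: Kernel) -> Kernel:
    """Virtual kernel whose profile is the sum of both inputs."""
    return Kernel(f"{a.name}+{b.name}", a.regs + b.regs, a.shm + b.shm,
                  a.warps + b.warps, a.inst + b.inst, a.mem + b.mem)


def score(a: Kernel, b: Kernel):
    """Assumed score: leftover resources, plus a balance bonus for
    opposing types. Returns None if the pair exceeds any limit."""
    c = profile_combine(a, b)
    if any(getattr(c, r) > LIMITS[r] for r in LIMITS):
        return None
    s = sum(1.0 - getattr(c, r) / LIMITS[r] for r in LIMITS)
    if (a.ratio - R_B) * (b.ratio - R_B) < 0:   # one on each side of R_B
        s += 1.0 / (1.0 + abs(c.ratio - R_B))   # hypothetical weighting
    return s


def launch_order(kernels):
    pool, order = list(kernels), []
    while pool:
        # Open a round with the best-scoring pair that fits.
        pairs = [(score(a, b), a, b) for a, b in combinations(pool, 2)]
        pairs = [p for p in pairs if p[0] is not None]
        if not pairs:                 # no two kernels can share a round
            order.append(pool.pop(0))
            continue
        _, a, b = max(pairs, key=lambda p: p[0])
        order += sorted((a, b), key=lambda k: k.shm, reverse=True)
        pool.remove(a)
        pool.remove(b)
        current = profile_combine(a, b)
        # Keep adding the best-scoring kernel while resources permit.
        while True:
            cands = [(score(current, k), k) for k in pool]
            cands = [c for c in cands if c[0] is not None]
            if not cands:
                break
            _, k = max(cands, key=lambda c: c[0])
            order.append(k)
            pool.remove(k)
            current = profile_combine(current, k)
    return [k.name for k in order]


kernels = [
    Kernel("mem_a", 8192, 32768, 16, 800_000, 100_000),
    Kernel("cmp_a", 8192, 4096, 16, 2_400_000, 100_000),
    Kernel("mem_b", 8192, 32768, 16, 800_000, 100_000),
    Kernel("cmp_b", 8192, 4096, 16, 2_400_000, 100_000),
]
print(launch_order(kernels))
```

In this toy input, the two shared-memory-heavy kernels cannot share a round, so the sketch pairs one of them with a compute-bound kernel, fills the round with the second compute-bound kernel, and leaves the remaining kernel for the next round.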
Algorithm 1 Concurrent Kernel Launch Order Algorithm

Experimental Results:
The experimental platform is a GPU computing node with dual Intel Xeon X5570 CPUs and an NVIDIA GTX580 GPU (16 SMs, R_B = 4.11, N_reg_SM = 32K, N_warp_SM = 48, N_shm_SM = 48K, N_blk_SM = 8). All benchmark results are collected under Ubuntu 11.10 with CUDA 5.0, while N_tblk_i, N_reg_i, N_shm_i, N_warp_i and R_i are obtained using the CUDA profiler. Our experiments measure the concurrent execution time of all possible kernel orderings (all permutations) and compare the performance of the ordering given by the algorithm with the optimal (best) result. The percentile rank among all permutations, the speedup over the worst case, and the deviation from the optimal result are also reported for the algorithm's ordering in Table 3. To demonstrate the effectiveness of our algorithm on different resource metrics, we first conduct six experiments, each consisting of six concurrent kernels. We use the NAS Parallel Benchmarks (NPB) kernel EP (M=24) (R_ep = 3.11 < R_B) [8] and the European option pricing benchmark BlackScholes (BS) (4M options) (R_bs = 11.1 > R_B) to represent memory-bound and compute-bound applications, respectively. The experiment parameters are summarized in Table 2. EP-6-shm consists of six EP kernels that vary only in shared memory usage, whereas EP-6-grid varies only the warp usage by changing just the kernel grid size. The experiment BS-6-blk again varies only the warps, but this time by changing the block size alone. Thus, EP-6-grid and BS-6-blk both demonstrate the effectiveness of the algorithm for varied N_warp_i, as shown in Table 3. The next experiment, EpBs-6, tests the same variation but with two different kernels with differing Inst/Mem ratios (R_i). The effect of varying the shared memory is further factored in by the EpBs-6-shm experiment. As the comparison in Table 3 shows, in all six experiments with specific variations in resource metrics, the kernel launch order produced by the algorithm gives close-to-optimal results.
We further conduct a more general experiment with four applications from different fields: the Electrostatics (ES) algorithm (40K atoms) from Visual Molecular Dynamics and the Smith-Waterman (SW) algorithm, plus BS and EP. The experiment EpBsEsSw-8 is composed of two kernels from each application, for a total of eight kernels. With four different applications, the kernels differ from one another in all of the N_tblk_i, N_reg_i, N_shm_i, N_warp_i and R_i metrics. Fig. 1 shows the performance ranking of all possible kernel orderings for EpBsEsSw-8, together with the near-optimal result of the algorithm at a percentile rank of 94.8%. It also shows the time distribution of all 40,320 permutations for EpBsEsSw-8. By comparing the median sequence against the one produced by the algorithm, we demonstrate that our algorithm has a 50% probability of providing at least a 16.1% performance gain over a randomly chosen order, and a speedup of up to 5.185 over the worst case.
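The evaluation metrics used above (percentile rank among all permutations, speedup over the worst case, and deviation from the optimum) can be computed from a table of measured times as sketched below. The percentile definition used here (the share of orderings that are no faster than the chosen one) is an assumption, since the text does not spell it out, and the timing values are synthetic placeholders rather than measurements.

```python
from math import factorial


def evaluate(times, chosen):
    """times: dict mapping an ordering (tuple of kernel names) to its
    execution time; chosen: the ordering to be ranked."""
    t = times[chosen]
    best, worst = min(times.values()), max(times.values())
    no_faster = sum(1 for v in times.values() if v >= t)
    return {
        "percentile": 100.0 * no_faster / len(times),   # assumed definition
        "speedup_over_worst": worst / t,
        "deviation_from_optimal": (t - best) / best,
    }


# Eight kernels give 8! orderings, matching the 40,320 permutations above.
assert factorial(8) == 40320

# Synthetic times for three hypothetical kernels (3! = 6 orderings).
times = {
    ("a", "b", "c"): 10.0,
    ("a", "c", "b"): 12.0,
    ("b", "a", "c"): 11.0,
    ("b", "c", "a"): 15.0,
    ("c", "a", "b"): 20.0,
    ("c", "b", "a"): 16.0,
}
result = evaluate(times, ("b", "a", "c"))
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

For a real experiment, the dictionary would hold one measured time per permutation (for example, generated with itertools.permutations over the kernel list), and the chosen ordering would be the one returned by the scheduling algorithm.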
