Reordering GPU Kernel Launches to Enable Efficient Concurrent Execution by Li, Teng et al.
arXiv:1511.07983v1 [cs.DC] 25 Nov 2015
Reordering GPU Kernel Launches to Enable
Efficient Concurrent Execution
Teng Li, Vikram K. Narayana and Tarek El-Ghazawi
Contemporary GPUs allow concurrent execution of small computational
kernels in order to prevent idling of GPU resources. Despite the
potential concurrency between independent kernels, the order in which
kernels are issued to the GPU will significantly influence the application
performance. A technique for deriving suitable kernel launch orders is
therefore presented, with the aim of reducing the total execution time.
Experimental results indicate that the proposed method yields solutions
that rank above the 90th percentile in the design space of all
possible permutations of the kernel launch sequence.
Introduction: Graphics processing units (GPUs) have experienced
widespread adoption in the scientific computing community as application
accelerators. Programmers encapsulate parts of their application as
compute kernels for execution on the GPU co-processor, by using
language extensions such as NVIDIA’s CUDA [9]. Frequently, these
compute kernels cannot fully utilize the GPU resources. Vendors
have therefore introduced support for concurrent kernel execution,
thereby enabling increased resource utilization and an overall reduction
in the GPU execution time. For NVIDIA GPUs, concurrency is achieved
by queueing independent kernels into separate CUDA streams. When a
limited number of streams is deployed, it is well known that the
parallelism achieved in practice is affected by the order in which kernels
are enqueued into their respective streams, due to false dependencies
arising from hardware and software limitations [11]. To avoid these false
dependencies, users can dedicate one stream for every kernel, as long as
the kernels are independent. However, researchers have overlooked the
fact that even in this case, the order in which the streams are initiated can
significantly influence the concurrency and thus the total execution time.
For instance, a recent study [7] reported that the effect of kernel launch
order on the total execution time is insignificant; however, their conclusion
was erroneous because it was based on identical kernels differing only
in the number of thread blocks within each experiment. As we shall see
shortly, ordering does not matter in that case. Only very recently, Pai et
al. [10] identified this issue of “non-commutative concurrency” for GPUs;
however, their solution takes a different approach, using source-to-source
transformation of kernels into elastic versions, whereas we propose
reordering the kernel launches without any kernel modification.
Li et al. [2, 5, 6] also proposed several power-, energy- and performance-aware
scheduling techniques for concurrent GPU kernel execution. That work
primarily aimed to support efficient GPU sharing [1, 3, 4] by improving
the overall GPU resource utilization through efficient kernel scheduling
algorithms.
Fundamental Concept of Reordering: GPU cores, or streaming processors
(SP), are organized into groups known as streaming multiprocessors (SM).
Each SM executes one or more thread blocks. When there are several
kernels ready for execution, all thread blocks from the earliest issued kernel
are first allocated to the SMs, followed by thread blocks from the next
issued kernel [10]. If the total number of thread blocks does not exceed
NSM, kernels do not share any SM. In this case the launch order does not
have an impact on the total execution time. On the other hand, with a larger
number of thread blocks, multiple thread blocks from one or more kernels
will need to share an SM. For instance, if there are 2NSM thread blocks in
total, each SM will be assigned two thread blocks. In general, additional
thread blocks are mapped to SMs in a round-robin fashion, until any one
of the SM resource limitations is met: Nreg_SM, Nshm_SM, Nwarp_SM and
Nblk_SM, as defined in Table 1.

Table 1: GPU and Kernel Parameters*
Symbol    Meaning                        Symbol    Meaning
NSM       # of SMs in the GPU            Nreg_SM   # of registers per SM
Nshm_SM   Shared mem size per SM         Nwarp_SM  Max # of warps per SM
Nblk_SM   Max # of blocks per SM         RB        Balanced Inst/Mem ratio
Ninst_i   # of inst for kernel i         Nreg_i    # of registers for kernel i
Nshm_i    Shared mem size for kernel i   Nwarp_i   # of warps for kernel i
Ntblk_i   # of blocks for kernel i       Ri        Inst/Mem ratio for kernel i
*The first three rows are constant for a GPU, whereas the remaining rows are kernel-specific.

When a kernel exhausts just one of the SM
resources and leaves the other resources underutilized, it prevents additional
thread blocks from being assigned to the SM, and those thread blocks are
relegated to the next execution round. Therefore, thread blocks from a set
of kernels are split into multiple execution rounds, which are sequentially
executed one after the other. Concurrency within each round depends on
how fully the SM resources are utilized; an ill-suited launch order can leave
just one of the SM resources heavily utilized, thereby limiting the
number of concurrent kernels within an execution round and
reducing performance. Our goal is thus to obtain a launch order that
maximizes the utilization of all SM resources within an execution round.
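
As a rough illustration of why the launch order changes the number of execution rounds, the following Python sketch packs kernels into rounds strictly in launch order. It is a simplified model, not the hardware's actual block scheduler: each kernel's profile is treated as its per-SM footprint, a kernel joins the current round only if all three limits still hold, and otherwise a new round is opened. The SM limits are the GTX580 values reported in the experimental section; the Kernel class, pack_rounds, the kernel names and the register count are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Kernel:
    name: str
    shm: int    # Nshm_i: shared memory footprint on an SM (bytes)
    regs: int   # Nreg_i: registers used on an SM
    warps: int  # Nwarp_i: warps resident on an SM

# GTX580 limits quoted in the experimental section.
SM_LIMITS = {"shm": 48 * 1024, "regs": 32 * 1024, "warps": 48}

def pack_rounds(order, limits=SM_LIMITS):
    """Split kernels into execution rounds, strictly following launch order."""
    rounds, used = [], None
    for k in order:
        need = {"shm": k.shm, "regs": k.regs, "warps": k.warps}
        fits = used is not None and all(
            used[r] + need[r] <= limits[r] for r in limits)
        if not fits:  # some SM resource would overflow: open a new round
            rounds.append([])
            used = dict.fromkeys(limits, 0)
        rounds[-1].append(k.name)
        for r in limits:
            used[r] += need[r]
    return rounds

# Six kernels differing only in shared memory, as in EP-6-shm
# (the register count of 2048 is a made-up placeholder).
ks = {s: Kernel(f"k{s}", s * 1024, 2048, 4) for s in (8, 16, 24, 32, 40, 48)}
poor = [ks[s] for s in (8, 48, 16, 40, 24, 32)]
good = [ks[s] for s in (48, 40, 8, 32, 16, 24)]
print(len(pack_rounds(poor)), len(pack_rounds(good)))  # prints: 6 4
```

Under this model the same six kernels need six rounds in the first order but only four in the second, because the second order places kernels next to each other whose shared memory demands add up to the SM limit.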
Scope and Applicability: Reordering is useful only when the total number
of thread blocks exceeds NSM, which is normally the case. Even in this
case, if the kernels are identical and differ only in the number of thread
blocks, the composition of each execution round and the number of rounds
is the same regardless of the order, because a thread block cannot split
across SMs. In this specific case, the order will not matter. Additionally,
even if the kernels are non-identical, the thread
block of every kernel may be so resource-heavy that an SM can accommodate only
one thread block at a time; in this case too, the order will not affect
performance. Our work thus targets the remaining, most common cases.
Balancing Compute & Memory Accesses: Apart from resource
limitations, multi-kernel execution performance is affected by the
balance of compute and memory accesses. As indicated by NVIDIA, even
for a single kernel there exists a suitable target value RB for the balanced
instructions/bytes ratio, and we use the same concept for multiple kernels.
For each execution round, we aim to achieve a combined instructions/bytes
ratio Rcomb that is as close to RB as possible. This translates to having
memory-bound kernels launching in close proximity to compute-bound
kernels. Using CUDA profiler data from the individual kernels, we can
compute Rcomb = (total # of instructions) / (4 × (total # of global stores +
total # of L1 cache global load misses)).
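
The ratio above can be sketched in a few lines of Python. The counter values below are hypothetical and were chosen only so that the two kernels reproduce an EP-like ratio (3.11) and a BS-like ratio (11.1); the function and field names are likewise illustrative. The last computation checks that summing the counters agrees with the instruction-weighted form used by ProfileCombine in Algorithm 1.

```python
RB = 4.11  # balanced instructions/bytes ratio reported for the GTX580

def inst_per_byte(instructions, global_stores, l1_global_load_misses):
    """Instructions/bytes ratio, counting 4 bytes per memory access."""
    return instructions / (4 * (global_stores + l1_global_load_misses))

def ratio_of(k):
    return inst_per_byte(k["inst"], k["stores"], k["misses"])

def combined_ratio(kernels):
    """Rcomb of a round: total instructions over total bytes moved."""
    inst = sum(k["inst"] for k in kernels)
    stores = sum(k["stores"] for k in kernels)
    misses = sum(k["misses"] for k in kernels)
    return inst_per_byte(inst, stores, misses)

# Hypothetical profiler counters: one memory-bound, one compute-bound kernel.
mem_bound = {"inst": 3_110_000, "stores": 100_000, "misses": 150_000}
compute_bound = {"inst": 11_100_000, "stores": 50_000, "misses": 200_000}

ra, rb = ratio_of(mem_bound), ratio_of(compute_bound)
rcomb = combined_ratio([mem_bound, compute_bound])

# Same quantity in the ProfileCombine form: bytes of kernel i = Ninst_i / Ri.
weighted = (mem_bound["inst"] + compute_bound["inst"]) / (
    mem_bound["inst"] / ra + compute_bound["inst"] / rb)

print(ra < RB < rb, round(rcomb, 3), round(weighted, 3))
```

Because the bytes moved by kernel i equal Ninst_i / Ri, both forms divide the same total instruction count by the same total byte count.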
Algorithm 1: Concurrent Kernel Launch Order Algorithm
Input: the set of Nknl kernels (K) with profiling results (PR): Ntblk_i, Nreg_i, Nshm_i, Nwarp_i, Ri
 1: Denote Rdr to be the set storing the kernel order within execution round r; r = 0
 2: ScoreMatrix[][] = ScoreGen(K, K, PR)
 3: while K is not empty do
 4:     r++    ⊲ Counting towards the next execution round
 5:     Within K, find the kernels Ka, Kb with the highest score in ScoreMatrix[][]
 6:     Push Ka, Kb into Rdr (in decreasing order of Nshm_a, Nshm_b) and remove them from K
 7:     Kcomb = ProfileCombine(Ka, Kb)
 8:     for all kernels Kr (from K) whose resources can fit within Rdr do
 9:         ScoreVec[] = ScoreGen(Kcomb, Kr, PR)
10:         Push Kc with the highest score in ScoreVec[] into Rdr (sort by Nshm_c, Nshm_comb)
11:         Kcomb = ProfileCombine(Kcomb, Kc) and remove Kc from K
12: Output: kernel launch order from Rd1 to Rdr

13: function SCOREGEN(KM, KN, PR)    ⊲ KM and KN are two kernel sets
14:     for all kernels Ki within KM do
15:         for all kernels Kj within KN do
16:             if Ki and Kj cannot fit within an execution round then S[i][j] = 0
17:             else
18:                 S[i][j] += max{(Nshm_SM - Nshm_i - Nshm_j)/Nshm_SM, 0}
19:                 S[i][j] += max{(Nreg_SM - Nreg_i - Nreg_j)/Nreg_SM, 0}
20:                 S[i][j] += max{(Nwarp_SM - Nwarp_i - Nwarp_j)/Nwarp_SM, 0}
21:                 if Ri ≤ RB ≤ Rj or Rj ≤ RB ≤ Ri then
22:                     S[i][j] += max{1 - (|Rcomb(i,j) - RB|/RB), 0}    ⊲ Rcomb(i,j) is the combined ratio
23:     return S[][]
24: end function

25: function PROFILECOMBINE(Ka, Kb)
26:     Nshm_comb = Nshm_a + Nshm_b; Nreg_comb = Nreg_a + Nreg_b; Nwarp_comb = Nwarp_a + Nwarp_b;
27:     Ntblk_comb = Ntblk_a + Ntblk_b; Rcomb = Rcomb(a,b) = (Ninst_a + Ninst_b)/(Ninst_a/Ra + Ninst_b/Rb)
28:     return Kcomb    ⊲ Virtual “kernel” with combined profile
29: end function
Proposed Algorithm: Considering both factors - SM resources and
balanced compute/memory - we propose and implement (using C) a greedy
algorithm for scheduling GPU kernels. The basic idea is to select the kernel
launch order such that the number of kernels within an execution round is
maximized, and the SM resources are progressively utilized in a balanced
manner as kernels arrive. Selection of kernels is made sequentially based
on a computed score. ScoreGen(KM, KN, PR) computes the score for every
kernel pair taken from the sets KM and KN, respectively. The resultant score
matrix is two-dimensional or one-dimensional depending on the input
dimensions. For every kernel pair, the SM resources that would remain
available add to the score, lines 18-20 in Algorithm 1 (see Table 1 for
symbol definitions). Kernel pairs that result in a balanced (and lower) usage
of all three resources receive the highest score, allowing more subsequent
kernels to co-execute within the execution round. Similarly, a higher score
is awarded if the resulting instructions/bytes ratio for the execution round
is closer to the target value RB, line 22 in Algorithm 1. Note that the
conditional statement in line 21 ensures that this score is added only if the
kernels under consideration are of opposing types, i.e., compute-bound vs
memory-bound, because RB is deemed to be the ratio for an ideal, balanced
kernel that is neither compute-bound nor memory-bound.

ELECTRONICS LETTERS 23rd November 2015 Vol. 00 No. 00

Fig. 1 Ranking and Distribution of GPU Execution Time in the Launch Order
Permutation Space for EpBsEsSw-8
[First panel: "Sorted GPU Time of All Possible Launch Orders", Time (ms) versus Launch Order Ranking, with the sequence from the algorithm marked. Second panel: "GPU Time Distribution", Count versus Time (ms), with the sequence from the algorithm and the sequence with median performance marked.]

Table 2: Experiment Parameters
Experiment   Constant Parameters                        Variables Across Kernels
EP-6-shm     Ri=3.11, Grid Size 16 x Block Size 128     Nshm_i = 8K, 16K, 24K, 32K, 40K, 48K
EP-6-grid    Ri=3.11, Nshm_i = 0, Block Size 128        Nwarp_i = 4, 8, 12, 16, 20, 24
                                                        (Grid Size = 16, 32, 48, 64, 80, 96)
BS-6-blk     Ri=11.1, Nshm_i = 0, Grid Size 32          Nwarp_i = 4, 8, 12, 16, 32, 64
                                                        (Block Size = 64, 128, 256, 512, 768, 1024)
EpBs-6       Nshm_i = 0                                 3 EP kernels w/ Nwarp_i=4, Ri=3.11
                                                        3 BS kernels w/ Nwarp_i=12, Ri=11.1
EpBs-6-shm   -                                          3 EP w/ Nwarp_i=4, Nshm_i=16K,24K,48K
                                                        3 BS w/ Nwarp_i=12, Nshm_i=16K,24K,48K
EpBsEsSw-8   -                                          EP, BS, ES and SW kernels, 2 each

For each execution round r, the pair of kernels with the highest score is
selected and inserted into the round, denoted by the set Rdr. The inserted
pair is ordered by decreasing shared memory usage, since this
allows kernels with larger Nshm_i to finish earlier and thus release their
shared memory sooner. The kernel pair is then combined by profile into a virtual
kernel Kcomb with the function ProfileCombine(), so that the overall resource
usage of the current Rdr can be taken into account when choosing the next kernel for
the execution round. Kernels continue to be incorporated into round r
as long as resources permit, until a new round r+1 needs to be opened.
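
The greedy procedure described above can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, not the authors' C implementation, and it rests on several assumptions: each kernel profile is treated as a per-SM footprint, a round that cannot be seeded with a fitting pair is given to the single kernel with the largest shared memory usage, and a newly added kernel is placed at the front of the round only if it uses more shared memory than the round's combined profile. The register and instruction counts in the usage example are hypothetical placeholders; the shared memory, warp and ratio values follow the EpBs-6-shm row of Table 2.

```python
from dataclasses import dataclass

# GTX580 limits and balanced ratio from the experimental section.
NSHM_SM, NREG_SM, NWARP_SM, RB = 48 * 1024, 32 * 1024, 48, 4.11

@dataclass(frozen=True)
class Kernel:
    name: str
    shm: int      # Nshm_i (bytes)
    regs: int     # Nreg_i
    warps: int    # Nwarp_i
    inst: float   # Ninst_i
    ratio: float  # Ri (instructions/bytes)

def fits(a, b):
    """True if both profiles fit on an SM within one execution round."""
    return (a.shm + b.shm <= NSHM_SM and a.regs + b.regs <= NREG_SM
            and a.warps + b.warps <= NWARP_SM)

def combine(a, b):
    """ProfileCombine: virtual kernel carrying the summed profile."""
    ratio = (a.inst + b.inst) / (a.inst / a.ratio + b.inst / b.ratio)
    return Kernel(f"{a.name}+{b.name}", a.shm + b.shm, a.regs + b.regs,
                  a.warps + b.warps, a.inst + b.inst, ratio)

def score(a, b):
    """ScoreGen for a single pair (lines 16-22 of Algorithm 1)."""
    if not fits(a, b):
        return 0.0
    s = max((NSHM_SM - a.shm - b.shm) / NSHM_SM, 0)
    s += max((NREG_SM - a.regs - b.regs) / NREG_SM, 0)
    s += max((NWARP_SM - a.warps - b.warps) / NWARP_SM, 0)
    if min(a.ratio, b.ratio) <= RB <= max(a.ratio, b.ratio):
        s += max(1 - abs(combine(a, b).ratio - RB) / RB, 0)
    return s

def launch_rounds(kernels):
    """Return the launch order as a list of rounds (lists of kernel names)."""
    remaining, rounds = list(kernels), []
    while remaining:
        pairs = [(a, b) for i, a in enumerate(remaining)
                 for b in remaining[i + 1:] if fits(a, b)]
        if pairs:  # seed the round with the best pair, larger Nshm first
            best = max(pairs, key=lambda p: score(*p))
            rnd = sorted(best, key=lambda k: k.shm, reverse=True)
            comb = combine(*rnd)
        else:      # assumption: no pair fits, so one kernel runs alone
            rnd = [max(remaining, key=lambda k: k.shm)]
            comb = rnd[0]
        for k in rnd:
            remaining.remove(k)
        while True:  # keep adding the best-scoring kernel that still fits
            cands = [k for k in remaining if fits(comb, k)]
            if not cands:
                break
            c = max(cands, key=lambda k: score(comb, k))
            if c.shm > comb.shm:
                rnd.insert(0, c)
            else:
                rnd.append(c)
            comb = combine(comb, c)
            remaining.remove(c)
        rounds.append([k.name for k in rnd])
    return rounds

K = 1024
kernels = [Kernel(f"ep{s}", s * K, 4096, 4, 1e6, 3.11) for s in (16, 24, 48)]
kernels += [Kernel(f"bs{s}", s * K, 4096, 12, 1e6, 11.1) for s in (16, 24, 48)]
print(launch_rounds(kernels))
```

With these inputs the sketch groups the 16K and 24K kernels into EP/BS pairs, whereas each 48K kernel occupies a round by itself because it already exhausts the shared memory of an SM.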
Experimental Results: The experimental platform is a GPU computing
node with dual Intel Xeon X5570 CPUs and an NVIDIA GTX580 GPU (16
SMs, RB=4.11, Nreg_SM=32K, Nwarp_SM=48, Nshm_SM=48K, Nblk_SM=8). All
benchmark results are collected under Ubuntu 11.10 with CUDA 5.0, while
Ntblk_i, Nreg_i, Nshm_i, Nwarp_i and Ri are obtained using the CUDA profiler.
Our experiments evaluate the concurrent execution time of all possible
kernel orderings (all permutations) and compare the performance of the
kernel ordering given by the algorithm with the optimal (best) result. The
percentile rank among all permutations, the speedup over the worst case
and the deviation from the optimal result for the algorithm results are
also presented, as shown in Table 3. To demonstrate the effectiveness
of our algorithm on different resource metrics, we initially conduct five
experiments, each of which consists of six concurrent kernels. We use the NAS
Parallel Benchmarks (NPB) kernel EP (M=24) (Rep=3.11 < RB) [8] and
the European option pricing benchmark BlackScholes (BS) (4M options)
(Rbs=11.1 > RB) to represent memory-bound and
compute-bound applications, respectively. The experiment parameters are summarized
in Table 2. EP-6-shm consists of six EP kernels that vary only in
shared memory usage, whereas EP-6-grid varies only the warp usage by
changing just the kernel grid size. The experiment BS-6-blk again varies
only the warps, but this time by changing the block size alone. Thus,
EP-6-grid and BS-6-blk both demonstrate the effectiveness of the algorithm on
varied Nwarp_i, as shown in Table 3. The next experiment, EpBs-6, tests
the same but with two different kernels with different Inst/Mem ratios (Ri).
The effect of varying the shared memory is further factored in by running
the EpBs-6-shm experiment. The comparison in Table 3 shows that, for all five
experiments with specific variations in resource metrics, the
kernel launch order from the algorithm provides close-to-optimal results.
We further conduct a more general experiment with four applications from
different fields: the Electrostatics (ES) algorithm (40K atoms) from Visual
Molecular Dynamics and the Smith-Waterman (SW) algorithm, plus BS and EP.

Table 3: Experimental Results (GPU execution time) and Comparisons
Experiment   Optimal (ms)  Worst (ms)  Algorithm (ms)  Percentile rank  Speedup over worst  Deviation from optimal
EP-6-shm     140.46        249.15      146.38          91.5%            1.702               4.21%
EP-6-grid    123.39        156.03      123.45          96.3%            1.264               0.049%
BS-6-blk     699.29        1699.04     702.29          96.5%            2.419               0.43%
EpBs-6       100.03        167.47      100.20          96.1%            1.671               0.17%
EpBs-6-shm   251.90        311.79      251.95          99.4%            1.238               0.02%
EpBsEsSw-8   109.21        597.43      115.23          94.8%            5.185               5.51%

The experiment EpBsEsSw-8 is composed of 2 kernels from each application,
for a total of 8 kernels. With 4 different applications, the kernels differ
from one another in all of the Ntblk_i, Nreg_i, Nshm_i, Nwarp_i and Ri metrics. Fig. 1
shows the performance ranking of all possible kernel orderings
for EpBsEsSw-8, in which the algorithm's near-optimal result has
a percentile ranking of 94.8%. It also shows the time distribution of all
40,320 permutations for EpBsEsSw-8. By comparing the median sequence
against the one from the algorithm, we demonstrate that our algorithm has
a 50% probability of providing at least a 16.1% performance gain over
a randomly chosen order, and a speedup of up to 5.185 over the worst case.
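
The speedup and deviation columns of Table 3 follow directly from the measured times, which the short Python check below recomputes for two of the rows. The function names are illustrative, and the formulas (worst time divided by algorithm time, and the relative excess over the optimal time) are inferred from the table values rather than stated in the text.

```python
import math

# (optimal, worst, algorithm) GPU execution times in ms, copied from Table 3.
TABLE3 = {
    "EP-6-shm": (140.46, 249.15, 146.38),
    "EpBsEsSw-8": (109.21, 597.43, 115.23),
}

def speedup_over_worst(worst, algo):
    """How many times faster the algorithm's order is than the worst order."""
    return worst / algo

def deviation_from_optimal(optimal, algo):
    """Relative slowdown versus the best permutation, in percent."""
    return (algo - optimal) / optimal * 100

for name, (opt, worst, algo) in TABLE3.items():
    print(f"{name}: speedup {speedup_over_worst(worst, algo):.3f}, "
          f"deviation {deviation_from_optimal(opt, algo):.2f}%")

# Sizes of the exhaustively evaluated permutation spaces (6 and 8 kernels).
print(math.factorial(6), math.factorial(8))  # prints: 720 40320
```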
Acknowledgment: This work was supported in part by the I/UCRC
Program of the NSF under Grant Nos. IIP-1161014 and IIP-1230815.
Teng Li, Vikram K. Narayana and Tarek El-Ghazawi (Department of
Electrical and Computer Engineering, The George Washington University,
801 22nd St NW, Washington, DC, 20052, United States)
E-mail: {tengli, tarek}@gwu.edu; vikramkn@ieee.org
References
1 T. Li, V. K. Narayana, E. El-Araby, and T. El-Ghazawi. GPU resource sharing and virtualization on high performance computing systems. In Parallel Processing (ICPP), 2011 International Conference on, pages 733–742. IEEE, Sept 2011.
2 T. Li, V. K. Narayana, and T. El-Ghazawi. A static task scheduling framework for independent tasks accelerated using a shared graphics processing unit. In Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, pages 88–95. IEEE, Dec 2011.
3 T. Li, V. K. Narayana, and T. El-Ghazawi. Accelerated high-performance computing through efficient multi-process GPU resource sharing. In Proceedings of the 9th Conference on Computing Frontiers, CF ’12, pages 269–272, New York, NY, USA, 2012. ACM.
4 T. Li, V. K. Narayana, and T. El-Ghazawi. Exploring graphics processing unit (GPU) resource sharing efficiency for high performance computing. Computers, 2(4):176–214, 2013.
5 T. Li, V. K. Narayana, and T. El-Ghazawi. Symbiotic scheduling of concurrent GPU kernels for performance and energy optimizations. In Proceedings of the 11th ACM Conference on Computing Frontiers, CF ’14, pages 36:1–36:10, New York, NY, USA, 2014. ACM.
6 T. Li, V. K. Narayana, and T. El-Ghazawi. A power-aware symbiotic scheduling algorithm for concurrent GPU kernels. In The 21st IEEE International Conference on Parallel and Distributed Systems (ICPADS 2015). IEEE, 2015.
7 F. Lu, J. Song, F. Yin, and X. Zhu. GPU computing using concurrent kernels: A case study. In Proceedings of 2nd World Congress on Computer Science and Information Engineering (CSIE 2011), pages 173–181, 2011.
8 M. Malik, T. Li, U. Sharif, R. Shahid, T. El-Ghazawi, and G. Newby. Productivity of GPUs under different programming paradigms. Concurrency and Computation: Practice and Experience, 24(2):179–191, 2012.
9 NVIDIA. NVIDIA CUDA C-Programming Guide V6.0, Feb 2014.
10 S. Pai, M. J. Thazhuthaveetil, and R. Govindarajan. Improving GPGPU concurrency with elastic kernels. In Proceedings of 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 407–418, 2013.
11 S. Rennich. CUDA C/C++ Streams and Concurrency, NVIDIA Webinar, Jan. 2012. http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf.
