Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and
  Scheduling by Zhong, Jianlong & He, Bingsheng
ar
X
iv
:1
30
3.
51
64
v1
  [
cs
.D
C]
  2
1 M
ar 
20
13
1
Kernelet: High-Throughput GPU Kernel
Executions with Dynamic Slicing and Scheduling
Jianlong Zhong, Bingsheng He
Abstract—Graphics processors, or GPUs, have recently been widely used as accelerators in the shared environments such as clusters
and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an important
metric for performance and total ownership cost. Despite the recently improved runtime support for concurrent GPU kernel executions,
the GPU can be severely underutilized, resulting in suboptimal throughput. In this paper, we propose Kernelet, a runtime system with
dynamic slicing and scheduling techniques to improve the throughput of concurrent kernel executions on the GPU. With slicing, Kernelet
divides a GPU kernel into multiple sub-kernels (namely slices). Each slice has tunable occupancy to allow co-scheduling with other
slices and to fully utilize the GPU resources. We develop a novel and effective Markov chain based performance model to guide the
scheduling decision. Our experimental results demonstrate up to 31.1% and 23.4% performance improvement on NVIDIA Tesla C2050
and GTX680 GPUs, respectively.
Index Terms—Graphics processors, Dynamic scheduling, Concurrent kernel executions, Dynamic Slicing, Performance models
✦
1 INTRODUCTION
The graphics processing unit (or GPU) has become an ef-
fective accelerator for a wide range of applications from
computation-intensive applications (e.g., [25], [26], [43])
to data-intensive applications (e.g, [10], [18]). Compared
with multicore CPUs, new-generation GPUs can have
much higher computation power in terms of FLOPS
and memory bandwidth. For example, an NVIDIA Tesla
C2050 GPU can deliver the peak single precision float-
ing point performance of over one Tera FLOPS, and
memory bandwidth of 144 GB/s. Due to their immense
computation power and memory bandwidth, GPUs have
been integrated into clusters and cloud computing in-
frastructures. In Top500 list of November 2012, two out
of the top ten supercomputers are with GPUs integrated.
Amazon and Penguin have provided virtual machines
with GPUs. In both cluster and cloud environments,
GPUs are often shared by many concurrent GPU pro-
grams (or kernels) (most likely submitted by multiple
users). Additionally, to enable sharing GPUs remotely,
a number of software frameworks such as rCUDA [7]
and V-GPU [45] have been developed. This paper studies
whether and how we can improve the throughput of
concurrent kernel executions on the GPU.
Throughput is an important optimization metric for
efficiency and the total ownership cost of GPUs in such
shared environments. First, many GPGPU applications
such as scientific and financial computing tasks are usu-
ally throughput oriented [9]. A high throughput leads
to the high performance and productivity for users.
Second, compared with CPUs, GPUs are still expensive
• J. Zhong and B. He are with School of Computer Engineering, Nanyang
Technological University, Singapore, 639798.
E-mail: jzhong2@ntu.edu.sg, bshe@ntu.edu.sg
devices. Therefore, a high throughput not only means
a high utilization on GPU resources but also the total
ownership cost of running the application on the GPU.
That might also be one of the reasons that GPUs are
usually deployed and shared to handle many kernels
from users.
Recently, we have witnessed the success of GPGPU
research. However, most studies focus on single-kernel
optimizations (e.g., new data structures [17] and GPU
friendly computing patterns [10], [11]). Despite the fruit-
ful research, a single kernel usually severely under-
utilizes the GPU. This severe underutilization is mainly
due to the inherent memory and computation behavior
of a single kernel (e.g., irregular memory accesses and
execution pipeline stalls). In our experiments, we have
studied eight common kernels (details are presented
in Section 5). On C2050, their average IPC is 0.52,
which is far from the optimal value (1.0). Their memory
bandwidth utilization is only ranging from 0.02% to 14%.
Recent GPU architectures like NVIDIA Fermi [27] ar-
chitecture supports concurrent kernel executions, which
allows multiple kernels to be executed on the GPU
simultaneously if resources are allowed. In particular,
Fermi adopts a form of cooperative kernel scheduling.
Other kernels requesting the GPU must wait until the
kernel occupying the GPU voluntarily yields control.
Here, we use NVIDIA CUDA’s terminology, simply be-
cause CUDA is nowadays widely adopted in GPGPU
applications. A kernel consists of multiple executions
of thread blocks with the same program on different
data, where the execution order of thread blocks is not
defined. On the Fermi GPUs, one kernel can take the
entire GPU as long as it has sufficient thread blocks
to occupy all the multi-processors (even though it can
have severely low resource utilization). Concurrent exe-
cution of two such kernels almost degrades to sequential
2execution on individual kernels. Recent studies [34] on
scheduling the concurrent kernels mainly focus on the
kernels with low occupancy (i.e., the thread blocks of
a single kernel cannot fully utilize all GPU multipro-
cessors). However, the occupancy of kernels (with large
input data sizes in practice) is usually high after single-
kernel optimizations.
Individual kernels as a whole cannot realize the real
sharing of the GPU resources. A natural question is:
can we slice the kernel into small pieces and then
co-schedule slices from different kernels in order to
improve the GPU resource utilization? The answer is
yes. One observation is that GPU kernels (e.g., those
written in CUDA or OpenCL) conform to the SPMD
(Single Program Multiple Data) execution model. In
such data-parallel executions, a kernel execution can
usually be divided into multiple slices, each consisting
of multiple thread blocks. Slices can be viewed as low-
occupancy kernels and can be executed simultaneously
with slices from other kernels. The GPGPU concurrent
kernel scheduling problem is thus converted to the slice
scheduling problem.
With slicing, we have two more issues to address. The
first issue is on the slicing itself: what is the suitable
slice size? How to perform the slicing in a transparent
manner? The smallest granularity of a slice is one thread
block, which can lead to significant runtime overhead of
submitting too many such small slices onto the GPUs for
execution. To the other extreme, the largest granularity
of the slice is the entire kernel, which degrades to the
non-sliced execution. The second issue is how to select
the slices for co-schedule in order to maximize the GPU
utilization.
To address those issues, we develop Kernelet, a runtime
system with dynamic slicing and scheduling techniques
to improve the GPU utilization. Targeting at the concur-
rent kernel executions in the shared GPU environment,
Kernelet dynamically performs slicing on the kernels,
and the slices are carefully designed with tunable oc-
cupancy to allow slices from other kernels to utilize the
GPU resources in a complementary way. For example,
one slice utilizes the computation units and the other
one on memory bandwidth. We develop a novel and ef-
fective Markov chain based performance model to guide
kernel slicing and scheduling in order to maximize the
GPU resource utilization. Compared with existing GPU
performance models which are limited to a single kernel
only, our model are designed to handle heterogeneous
workloads (i.e., slices from different kernels). We further
develop a greedy co-scheduling algorithm to always co-
schedule the slices from the two kernels with the highest
performance gain according to our performance model.
We have evaluated Kernelet on two latest GPU ar-
chitectures (Tesla C2050 and GTX680). The GPU kernels
under study have different memory and computational
characteristics. Experimental results show that 1) our an-
alytical model can accurately capture the performance of
heterogeneous workloads on the GPU, 2) our scheduling
increase the GPU throughput by up to 31.1% and 23.4%
on C2050 and GTX680, respectively.
Organization. The rest of the paper is organized as
follows. We introduce the background and definition of
our problem in Section 2. Section 3 presents the system
overview, followed by detailed design and implementa-
tion in Section 4. The experimental results are presented
in Section 5. We review the related work in Section 6 and
conclude this paper in Section 7.
2 BACKGROUND AND PROBLEM DEFINITION
In this section, we briefly introduce the background
on GPU architectures, and next present our problem
definition.
2.1 GPU Architectures
GPUs have rapidly evolved into a powerful accel-
erator for many applications, especially after CUDA
was released by NVIDIA [30]. The top tags in
http://gpgpu.org/ show that a wide range of appli-
cations have been accelerated with GPGPU techniques,
including vision and graphics, image processing, linear
algebra, molecular dynamics, physics simulation and
scientific computing etc. Those applications cover quite
a wide range of computation and memory intensiveness.
In the shared environment like clusters and cloud, it
is very likely that users submit kernels from different
applications to the same GPU. Thus, it is feasible to
schedule kernels with different memory and computa-
tion characteristics to better utilize GPU resources.
This paper focuses on the design and implementation
with NVIDIA CUDA. Since OpenCL and CUDA have
very similar designs, our design and implementation
can be extended to OpenCL with little modification.
Kernelet takes advantage of concurrent kernel execution
capability of new-generation GPUs like NVIDIA Fermi
GPUs. With the introduction of CUDA, a GPU can be
viewed as a many-core processor with a set of streaming
multi-processors (SM). Each SM has a set of scalar cores,
which executes the instructions in the SIMD (Single
Instruction Multiple Data) manner. The SMs are in turn
executed in the SPMD manner. The program is called
kernel.
In CUDA’s abstraction, GPU threads are organized in a
hierarchical configuration: usually 32 threads are firstly
grouped into a warp; warps are further grouped into
thread blocks. The CUDA runtime performs mapping
and scheduling at the granularity of thread blocks. Each
thread block is mapped and scheduled on an SM, and
cannot be split among multiple SMs. Once a thread block
is scheduled, its warps become active on the SM. Warp is
the smallest scheduling unit on the GPU. Each SM uses
one or more warp schedulers to issue instructions of the
active warps. Another important resource of the GPU
is shared memory, which is a small piece of scratchpad
memory at the scope of a thread block. It is small and
3User N
(VM N)
ĂĂ
ĂĂ
Nvidia GPUs
GPU Virtualization Layer
User 1
(VM 1)
GPU Server
ĂĂ
Ă
Kernelet
(a) GPU shared within one ma-
chine.
ĂĂ
Nvidia GPUs
ĂĂ
GPU Server
network
r
ĂĂ
Remote
Node 1
API Inception API Inception
Remote
Node N
Ă
Kernelet
API reception framwork
(like rCUDA)
(b) GPU shared by remote
clients.
Fig. 1: Application scenarios of concurrent kernel execu-
tions on the GPUs.
has very low latency. Shared memory is visible for all
the threads in the same thread block.
We define the SM occupancy as the ratio of active
warps to the maximum active warps that are allowed to
run on the SM. Higher occupancy means higher thread
parallelism. The aggregated register and shared memory
usage of all warps should not exceed the total amount
of available registers and shared memory on an SM.
Block Scheduling. CUDA runtime system maps
thread blocks to SMs in a round-robin manner. If the
number of thread blocks in a kernel is less than the
number of SMs on the GPU, each thread block will be
executed on a dedicated SM; otherwise, multiple thread
blocks will be mapped to the same SM. The number of
thread blocks that can be concurrently executed on the SM
depends on their total resource requirements (in terms
of registers and shared memory). If the kernel currently
being executed on the GPU cannot fully utilize the GPU,
the GPU allows to schedule thread blocks from other
kernels for execution.
GPU Code Compilation. In the process of CUDA
compilation, the compiler first compiles the CUDA C
code to PTX (Parallel Thread Execution) code. PTX is
a virtual machine assembly language and offers a sta-
ble programming model for evolving hardware archi-
tectures. PTX code is further compiled to native GPU
instructions (SASS). The GPU can only execute the SASS
code. CUDA executables and libraries usually provide
either PTX code or SASS code or both. In the shared
environments, the source code is usually not available
for kernel scheduling upon users submit the kernels.
Thus, Kernelet should be designed to work on both PTX
and SASS code.
2.2 Problem Definition
Application scenario. We consider two typical applica-
tion scenarios in the shared environments as shown in
Figure 1. One is sharing the GPUs among multiple ten-
ants in the virtualized environment (e.g., cloud). As illus-
trated in Figure 1a, there is usually a GPU virtualization
layer integrated with the hypervisor. Figure 1b shows the
other scenario, in which GPU servers offer API reception
softwares (like rCUDA [7], [45]) to support local/remote
CUDA kernel launches. In both scenarios, the GPU faces
multiple pending kernel launch requests. Kernelet can be
applied to schedule those pending kernels.
Our study mainly considers the throughput issues of
sharing a single GPU. Kernelet can be extended to multi-
ple GPUs with a workload dispatcher to each individual
GPU.
We have made the following assumptions on the
kernels.
1. We target at the concurrent kernel executions on
the shared GPU. The kernels are usually through-
put oriented, with flexibility on the response time
for scheduling. Still, we do not assume the a priori
knowledge of the order of the kernel arrival.
2. Thread blocks in a kernel are independent with
each other. This assumption is mostly true for the
GPGPU kernels due to SPMD programming model.
Most kernels in NVIDIA SDK and benchmarks like
Parboil [13] do not have dependency among thread
blocks in the same kernel. This assumption ensures
that our slicing technique on a given kernel is safe.
The data dependency among thread blocks can be
identified with standard static or dynamic program
analysis.
We formally define the terminology in Kernelet.
Kernel. A kernel K consists of k thread blocks with
IDs, 0, 1, 2, ..., (k − 1).
Slice. A slice is a subset of the thread blocks of a
launched kernel. Block IDs of a slice is continuous in
the grid index space. The size of a slice s is defined as
the number of thread blocks contained in the slice.
Slicing plan. Given a kernel K, a slicing plan S(K) is
a scheme slicing K into a sequence of n slices (s0, s1, s2,
..., sn−1). We denote the slicing plan to be K=s0, s1, s2,
..., sn−1.
Co-schedule. Co-schedule cs defines concurrent exe-
cution of n (n ≥ 1) slices, denoted as s0, ..., sn−1. All the
n slices are active on the GPU.
Scheduling plan. Given a set of n kernels K0, K1, ...,
Kn, a scheduling plan C (cs0, cs1, ..., csn−1) determines
a sequence of co-schedules in their order of execution.
csi is launched before csj if i < j. All thread blocks of
the n kernels occur in one of the co-schedules once and
only once. A scheduling plan embodies a slicing plan for
each kernel.
We define the performance benefit of co-scheduling n
kernels to be the co-scheduling profit (CP ) in Eq. (1).
IPCi and cIPCi are IPC (Instruction Per Cycle) for
sequential execution and concurrent execution of kernel i
respectively. Our definition is similar to those in the pre-
vious studies on CPU multi-threaded co-scheduling [20],
[39].
CP = 1−
1
n∑
i=1
cIPCi
IPCi
(1)
4Problem definition. Given a set of kernels for execu-
tion, the problem is to determine the optimal scheduling
plan (and slicing) so that the total execution time for
those kernels is minimized. That corresponds to the
maximized throughput. Given a set of n kernels K0,
K1, ..., Kn−1, we aim at finding the optimal scheduling
plan C for a minimized total execution time of C(S0(K0),
S1(K1), ..., Sn−1(Kn−1)), or maximized co-scheduling
profit. Note, in the shared GPU environment, the arrival
of new kernels trigger the recalculation of the optimiza-
tion on the kernel residuals and new kernels.
3 SYSTEM OVERVIEW
In this section, we present the rationales on the Kernelet
design, followed by an overview of the Kernelet runtime.
3.1 Design Rationale
Since kernels are submitted in an ad-hoc manner in our
application scenarios, the scheduling decision has to be
made at real time. The optimization process should take
the newly arrived kernels into consideration. Moreover,
our runtime system of slicing and scheduling should be
designed with light overhead. That is, the overhead of
slicing and scheduling should be small compared with
their performance gain.
Unfortunately, finding the optimal slicing plans and
scheduling plan is a challenging task. The solution space
for such candidate plans is large. For slicing a kernel, we
have the factors including the number of slices as well
as the slice size. For scheduling a set of slices, we can
generate different scheduling plan with co-scheduling
slices from different kernels. All those factors are added
up into a large solution space. Considering newly arrival
kernels makes the huge scheduling space even larger.
Due to the real-time decision making and light-weight
requirements, it is impossible to search the entire space
to get a globally optimal solution. The classic Monte
Carlo simulation methods are not feasible because they
usually exceed our budget on the runtime overhead and
violate real-time decision making requirement. There
must be a more elegant compromise between the op-
timality and runtime efficiency. Our solution to the
scheduling problem is detailed in Section 4.2.
Given the complexity of dynamic slicing and schedul-
ing in concurrent kernel executions, we make the follow-
ing considerations.
First, the scheduling considers two kernels only. Previ-
ous studies [20] on the CPU have shown that when there
are more than two co-running jobs, finding the optimal
scheduling plan becomes an NP-complete problem even
ignoring the job length differences and rescheduling.
Following previous studies [20], [34], we make our
scheduling decision co-scheduling two kernels only.
Second, once we choose two kernels to schedule, their
slice sizes keep unchanged until either kernel finishes.
The suitable slice size is determined according to our
performance model 4.4.
ĂĂ
ĂĂ
Ă
Kernelet
Scheduler
Kernel
Slicer
Performance
Model
kernels
GPU
slices
coscheduled
slices
Fig. 2: Design overview of Kernelet.
3.2 System Overview
We develop Kernelet as a runtime system to generate
the slicing plan and scheduling plans for the optimized
throughput.
Figure 2 shows an overview of Kernelet. Kernels
are submitted and are temporarily buffered in a ker-
nel queue for further scheduling. Usually, a kernel is
submitted in the form of binary (SASS) or PTX code.
Submitted kernels are first preprocessed by kernel slicer
to determine the smallest slice size for a given overhead
limit. If the kernel has been submitted before, we simply
use the smallest slice size in the previous execution. Ker-
nelet’s scheduler determines the scheduling plan based
on the results of performance model, which estimates
the performance of slices from two different kernels
in a probabilistic manner. Once the scheduling plan is
obtained, slices are dispatched for execution on the GPU.
We will describe the detailed design and implementation
of each component in the next section.
4 KERNELET METHODOLOGY
In this section, we first describe our kernel slicing mecha-
nism. Next, we present our greedy scheduling algorithm,
followed by our pruning techniques on the co-schedule
space and the description of the performance model.
4.1 Kernel Slicing
The purpose of kernel slicing is to divide a kernel into
multiple slices so that the finer granularity of each slice
as a kernel in the scheduling can create more opportu-
nities for time sharing. Moreover, we need to determine
the suitable slice size for minimizing the slicing overhead
(i.e., between the total execution time of all slices and the
kernel execution time) . Particularly, we experimentally
determine the suitable slice size to be the minimum slice
so that the overhead is not greater than p% of the kernel
execution time. In this study, p% is set to be 2% by
default. We focus on the implementation of slicing on
PTX or SASS code. Note that, warps within the same
thread block usually have data dependency with each
other, e.g., with the usage of shared memory. That is
why we choose thread blocks as the units for slicing,
instead of warps.
5void MatrixAdd(float *MatrixA, float *MatrixB, int 
width){
    int row = blockID_X*blockDim_X+threadID_X;
    int col = blockID_Y*blockDim_Y+threadID_Y;
    /* Each thread process one element */
    A[row+col*width] += B[row+col*width];
}
(a) Each thread access the cor-
responding matrix element us-
ing block and thread indices.
int 
;
/* Grid dimension: 16×16 */
dim3 gridConf(16,16);
/* Block dimension: 16×16 */
dim3 blockConf(16,16);
MatrixAdd<<<gridConf, blockConf>>>(...);
(b) Launch the same number
threads as the number of ma-
trix elements.
void MatrixAdd(float *MatrixA, float *MatrixB, int 
width, dim3 blockOffset, dim3 gridConf){
    /* Compute rectified  indices*/
    int rBlockID_X = blockID_X+blockOffset.x;
    int rBlockID_Y = blockID_Y+blockOffset.y;
    /* Process carry bit*/
    while (rBlockID_X > gridConf.x){
        rBlockID_X -= gridConf.x;
        rBlockID_Y ++;
    }
    /* Replace all subsequent access to blockID_X 
(blockID_Y) with rBlockID_X (rBlockID_Y) */
    ...
}
(c) Sliced kernel with rectified
thread block indices.
/* Sliced grid dimension: 8×1 */
dim3 sGridConf(8,1);
dim3 blockOffset = (0,0);
while(blockOffset.x < gridConf.x && \
                         blockOffset.y < gridConf.y){
    MatrixAdd<<<sGridConf, blockConf>>>(..., 
blockOffset, gridConf);
    blockOffset.x += sliceGrid.x;
    while (blockOffset.x > gridConf.x){
        blockOffset.x -= gridConf.x;
        blockOffset.y ++;
    }
}
int 
 
(d) Launch all slices of a kernel
in a loop.
Fig. 3: An example of kernel slicing.
Figure 3 illustrates the sliced execution of MatrixAdd
with pseudo code. In the example, MatrixAdd is a GPU
kernel to add two 256 × 256 matrices and each thread
adds one element pair from the input matrices. Based on
the matrix size, MatrixAdd is configured to launch with
16× 16 thread blocks in total and each block has 16× 16
threads. Figures 3a and 3b illustrate the definition and
launch of the original kernel, respectively. In comparison,
the sliced version of the kernel launches a slice with 8
thread blocks each time. The built-in thread block indices
(denoted as blockID X and blockID Y ) of the sliced
kernel are in a smaller index space ({(x, 0)|0 ≤ x < 8})
compared with the original index space ({(x, y)|0 ≤ x <
16 and 0 ≤ y < 16}). To make the slices execute as
individual kernels, we apply a procedure called index
rectification. As shown in Figure 3c, we add an offset
value to the thread block indices and obtain the rectified
index values. The rectified indices are used to replace all
subsequent accesses to the built-in indices. On the CPU
side, we launch the slices in a loop and adjust the offset
values for each slice launch (Figure 3d).
Kernelet automatically implements slice rectification
without any user intervention. With PTX or SASS code
as input, Kernelet does not require source code. Kernelet
interprets and modifies the PTX/SASS code at runtime.
The resulting PTX code is compiled to GPU executables
by the GPU driver, and the SASS code is assembled using
the open source Fermi assembler Asfermi [1]. Kernelet
stores the rectified block index in registers, and replaces
all references to those built-in variables with the new
registers. Since this process may use more registers.
Kernelet tries to minimize the register usage by adopting
the classic register minimization techniques [5], [33], e.g.,
variable liveness analysis. With register optimizations,
register usage by slicing keeps unchanged in most of
our test cases in experiments. Note, kernel slicing only
requires a single scan on the input code and the runtime
overhead is negligible.
4.2 Scheduling
According to our design rationales, our scheduling de-
cision is made on the basis of two kernels, to avoid
the complexity of scheduling three or more kernels as
a whole; the slice sizes of the two kernels are tuned
for GPU utilization so that their execution times in the
concurrent kernel execution are close. Thus, we develop
a greedy scheduling algorithm, as shown in Algorithm 1.
The scheduling algorithm considers new arrival kernels
in Lines 2–3 in the main algorithm. The main procedure
calls the procedure FindCoSchedule to obtain the optimal
co-schedule in Line 5. The co-schedule is represented
in four parameters < K1, K2, size1, size2 >, where K1
and K2 denotes the two selected kernels; size1 and size2
represents the slice sizes accordingly. We use the same
co-schedule if the kernels pending for execution do not
change, or both kernels still have thread blocks.
Algorithm 1 Scheduling algorithm of Kernelet
1: Denote R to be the set of kernels pending for execu-
tions;
2: if A new kernel K comes then
3: Add K into R;
4: while R!=null do
5: < K1, K2, size1, size2 >=FindCoSchedule(R);
6: Denote the co-schedule to be c;
7: Execute c on the GPU;
8: while R does not change, or K1 and K2 both still
have thread blocks do
9: Generate co-schedule according to c and execute
it on the GPU;
Proc. FindCoSchedule(R)
Function: generate the optimal co-schedule from
R.
1: Generate the candidate space for co-schedules C;
2: Perform pruning on C according to the computation
and memory characteristics of input kernels;
3: Apply the performance model (Section 4.4) to com-
pute CP for all the co-schedule in C;
4: Obtain the optimal co-schedule with the maximized
CP ;
5: Return the result co-schedule;
In Procedure FindCoSchedule, we first consider the
entire candidate space consisting of co-schedules on
pair-wise kernel combinations. Because the space may
consider of N(N−1)2 co-schedules (N is the number of
kernels for consideration), it is desirable to reduce the
search space. Therefore, we perform pruning according
to the computation and memory characteristics of input
kernels. We present the details of pruning mechanisms
in Section 4.3. After pruning, we apply the perfor-
mance model (Section 4.4) to estimate the CP for all co-
6schedules, and pick the one with the maximized CP for
executing on the GPU.
4.3 Co-scheduling Space Pruning
Given a set of co-schedules as input, we aim at devel-
oping pruning techniques to remove the co-schedules
that are not “promising” to deliver performance gain. So
that the overhead of running the performance model is
avoided. The basic idea to identify the key performance
factor of a single kernel that affects the throughput of
concurrent kernel executions on the GPU.
There are many factors affecting the GPU perfor-
mance. According to the CUDA profiler, there are around
100 profiler counters and statistics for the performance
tuning. For an effective scheduling algorithm, there is no
way of considering all these factors. Following a previ-
ous work [24], we use the regression model to explore
the correlation between the above mentioned factors and
CP . The input values are obtained from single kernel
executions. Through detailed performance studies, we
find that instruction throughput and memory bandwidth
utilization are the most correlated performance factors
with co-scheduling friendliness. We define PUR (Pipeline
Utilization Ratio) and MUR (Memory-bandwidth Uti-
lization Ratio) to characterize user submitted kernels.
High PUR means the instruction pipeline is highly uti-
lized and there is little room for performance improve-
ment. High MUR means a large number of memory
requests are being processed and memory latency is
high. PUR and MUR is calculated as follows,
PUR =
Instruction Executed
T ime× Frequency × Peak IPC
MUR =
Dram Reads+Dram Writes
T ime× Frequency × Peak MPC
Peak IPC and Peak MPC represent peak number of
instructions and memory requests per cycle respectively.
Instruction Executed is the total number of instructions
executed. Dram Reads and Dram Writes are the num-
bers of read and write requests to DRAM respectively.
We build a set of testing kernels to demonstrate the
correlation between PUR/MUR and CP . A testing kernel
is a mixture of memory and computation instructions.
We tune the respective instruction ratios to obtain ker-
nels with different memory and computational charac-
teristics. The single kernel execution PURs and MURs
of the testing kernels are in the range of [0.26, 0.83] and
[0.07, 0.84] respectively. Figure 4 shows the strong cor-
relation between MUR/PUR and CP . The observation
conforms to our expectation. First, if one kernel has high
PUR while the other kernel has low PUR, the former
kernel is able to utilize the idle cycles exposed by the
latter kernel when co-scheduled. Second, co-scheduling
kernels with complementary memory requirements (one
kernel has low MUR and the other kernel has high MUR)
(a) MUR and CP (b) PUR and CP
Fig. 4: Correlation between MUR/PUR and CP
will alleviate memory contention and reduce idle cycles
exposed by long latency memory operations.
In summary, our pruning rule is to remove the co-
schedules where the two kernels have close PUR or MUR
values. We set two threshold values αp and αm for PUR
and MUR, respectively. That means, we prune the co-
schedule if the two kernels have PUR difference lower
than αp, or have MUR difference lower than αm. Note, if
all the co-schedules are pruned, we need to increase αp
or αm. We experimentally evaluate the impact of those
two threshold values in Section 5.
4.4 Performance Model
We need a performance model for two purposes: firstly,
to select the two kernels for co-schedule; secondly, to
determine the number of thread blocks for each kernel
in the co-schedule (i.e., the slice size). Previous perfor-
mance models on the GPU [19], [38], [44] assume a single
kernel on the GPU, and are not applicable to concurrent
kernel executions. They generally assume that the thread
blocks execute the same instruction in a round-robin
manner on an SM. However, this is no longer true on
concurrent kernel executions. The thread blocks from dif-
ferent kernels have interleaving executions, which cause
non-determinism on the instruction execution flow. It is
not feasible to statically predict the interleaving warp ex-
ecutions for multi-kernel executions. To capture the non-
determinism, we develop a probabilistic performance
model to estimate the performance of co-schedule. Our
cost model has very low runtime overhead, because it
uses a series of simple parameters as input and leverages
the Markov chain theory to get the performance of
concurrent kernel executions.
Table 1 summarizes the notations used for our perfor-
mance model.
Since the GPU adopts SPMDmodel, we use the perfor-
mance estimation of one SM to represent the aggregate
performance of all SMs on the GPU. We model the pro-
cess of kernel instruction issuing as a stochastic process
and devise a set of states for an SM during execution.
By modeling the SM state, we first develop our Markov
chain based models for single-kernel executions (homo-
7TABLE 1: Parameters and notations in the performance
model
Para. Description
W Maximum number of active warps
Round A warp scheduling cycle that all ready warps
are served by the warp scheduler
Rm Memory instruction ratio
Pr→r Probability that a ready warp remains ready
Pr→i Probability that a ready warp transits to idle
Pi→r Probability that an idle warp transits to ready
Pi→i Probability that an idle warp remains in idle
Nr→r Number of ready warps that remain ready
Nr→i Number of ready warps that transit to idle
Ni→r Number of idle warps that transit to ready
Ni→i Number of idle warps that remain idle
L Average memory latency (cycle)
B GPU global memory bandwidth (requests/cy-
cle)
Si Si corresponds the state where i warps are idle
on the SM (i = 0, 1, . . . ,W )
Pij P is the Markov chain transit matrix. Entry
Pij of P represents the probability of transiting
from state Si to state Sj .
geneous workloads), and then extend it to concurrent
kernel executions (heterogeneous workloads).
For presentation clarity, we begin with our description
on the model with the following assumptions, and relax
those assumptions at the end of this section. First, we
assume that all the memory requests are coalesced.
This is the best case for memory performance. We will
relax this assumption by considering both coalesced and
uncoalesced memory accesses. Second, we assume that
the GPU has a single warp scheduler. We will extend it
to the GPU with multiple warp schedulers.
Homogeneous Workloads. We first investigate the
performance of a single kernel executed on the GPU and
each SM accommodates W active warps at most.
A warp can be in two states: idle or ready. An idle
warp is stalled by memory accesses, and a ready warp
has one instruction ready for execution. Its transition is
illustrated in Figure 5. When a warp is currently in the
ready state, we have two cases for state transitions by
definition:
• remaining in the ready state with the probability of
Pr→r = 1−Rm.
• transiting to the idle state with the probability of
Pr→i = Rm.
When a warp is currently in the idle state, we also
have two cases for state transitions:
• transiting to the ready state with the probability of
Pi→r =
1
L
W−I
= W−I
L
, where I is the number of idle
warps on the SM.
• remaining in the idle state with the probability of
Pi→i = 1− Pi→r .
We specifically define the time step and state tran-
sition of the Markov chain model to capture the GPU
architectural features. GPU adopts a round-robin style
scheduling [19]. In each round, the warp scheduler polls
each warp to issue its ready instructions so all ready
ĂĂ
ĂĂ
Ă
Dependency
Resolved
Issue long latency
memory operation
Issue computation
instruction
ReadyIdle
Dependency
Unresolved
Fig. 5: Warp state transition diagram.
warps can make progress. We model the SM state with
the number of idle warps. We denote Si to be the SM
state where i warps are idle on the SM (i = 0, 1, . . . ,W ).
Thus, we consider the state change of the SM in one
round and use round as time step in our Markov chain
model. In each round, every ready warp has an equal
chance to issue instructions. In contrast, models for the
CPU assume that the CPU will keep executing one
thread until this thread is suspended.
We use IPC to represent the throughput of the SM.
Thus, the number of idle warps on the SM is a key
parameter for IPC . Thus, we define the state of SM as
the number of idle warps on the SM (i.e., the state is Si
when the number of idle warps is i). More outstanding
memory requests usually lead to higher latency because
of memory contention [3]. We adopt a linear memory
model to account for the memory contention effects. We
calculate L as L = L0 +
B
a0·Si
+ b0, where a0 and b0 are
the constant parameters in the linear model. We obtain
L0 and B according to hardware specifications.
For homogeneous workload, the probabilities of state
transitions are the same for all ready warps in a round.
We assume when SM transits from Si to Sj , Ni→r idle
warps transit to the ready state and Nr→i ready warps
transit to idle state. The following conditions hold by
definition. 

0 ≤ Ni→r ≤ Si
0 ≤ Nr→i ≤W − Si
Nr→i −Ni→r = Sj − Si
(2)
With those constraints, there are multiple possible
transitions to transit from Si to Sj . Since the possible
transitions are mutually exclusive events, the probability
of each state transition Pij is calculated as the sum of the
probabilities of all possible transitions. With all entries
of the transition matrix P obtained, we can calculate
the steady-state vector of the Markov chain. This is
done by finding the eigenvector pi corresponding to the
eigenvalue one for matrix P [31].
pi = (γ0, γ1, ..., γW ) (3)
In Equation 3, γi is the probabilities that the SM stays
in state Si in each round, i.e., the probability there are i
idle warps in one round. The duration of the time step is
(W−i) cycles since each of the (W−i) ready warps issue
one instruction within the round. In the case i = W , the
round duration is one, indicating no warp is ready and
the SM experiences an idle cycle. Hence, the estimated
IPC is the ratio of non-idle cycles given in equation 4,
8where
∑W−1
i=0 γi(W − i) is the total non-idle cycles and
γW is the total idle cycles.
IPCK =
∑W−1
i=0 γi(W − i)∑W−1
i=0 γi(W − i) + γW
(4)
Heterogeneous Workloads. When there are multiple
kernels running concurrently, the model needs to keep
track of the state of each workload. Although we only
consider two concurrent kernels (K1 and K2) in schedul-
ing, our model can be used to handle more than two
kernels.
Suppose there are two kernels K1 and K2, and K1 has
w1 active warps and K2 has w2 active warps (w1 +w2 =
W ). The number of possible states of the SMwill be (w1+
1)×(w2+1). The state space is represented as a pair (p, q)
with 0 ≤ p ≤ w1 and 0 ≤ q ≤ w2, where p and q are the
numbers of idle warps of K1 and K2 respectively. We can
calculate the probability of transiting from state (pi, qi′)
to state (pj , qj′) by first considering individual workload
state transition probability using the single kernel model,
and then calculating the SM state transition probability.
The state transitions of different kernels are independent
with each other, because the kernels are independent.
Then the SM state transition probability is the product
of the individual transition probabilities.
With Markov chain approach, we obtain the steady
state vector pi = {γ(0,0), γ(0,1), ..., γ(w1,w2)}. Next, we can
obtain the IPC of each workload using the same method
as the model in single-kernel executions, except the
parameters are defined and calculated in the context of
two kernels. For example, the round duration is equal
to the total number of ready warps of both kernels.
Individual IPCs of K1 and K2 is calculated as the ratio
of non-idle cycles for each workload, as shown in Eq. (5)
and (6), respectively. The concurrent IPC is the sum of
individual IPCs Eq. (7). Then CP can be obtained using
Eq. (1).
IPCK1 =
∑w1−1
i=0
∑w2
i′=0 γ(i,i′) × (w1 − i)∑w1
i=0
∑w2
j=0 γ(i,j) ×R(i,j)
(5)
IPCK2 =
∑w1
i=0
∑w2−1
i′=0 γ(i,i′) × (w2 − i
′)∑w1
i=0
∑w2
i′=0 γ(i,i′) ×R(i,i′)
(6)
C = IPCK1 + IPCK2 (7)
With the estimated IPC and CP , we now discuss how
to estimate the optimal slice size ratio for two kernels.
We define the slice ratio which minimizes the execution
time difference of co-scheduled slices as the balanced
slice ratio. By minimizing the execution time difference,
the kernel-level parallelism is maximized. The execution
time difference is calculated as ∆T in Eq. (8).
∆T = | 1
IPCK1
× IK1 × PK1 −
1
IPCK2
× IK2 × PK2 | (8)
IKi and PKi represent the number of instruction per
block and the slice size of kernel Ki (i = 1, 2) in number
of thread blocks. Since PKi is less than the maximal
number of active thread blocks, only a limited number
of slice ratios need to be evaluated to get the balanced
ratio.
Uncoalesced Access. So far, we assume that all mem-
ory accesses are coalesced and each memory instruc-
tion results in the same number of memory requests.
However, due to the different address patterns, memory
instructions may result in a different amount of memory
requests. On Fermi GPUs, one memory instruction can
generate 1 to 32 memory requests. Here we consider
the two most common access patterns: fully coalesced
access, and fully uncoalesced access. We extend our
model to handle both coalesced and uncoalesced ac-
cesses by defining three states for a warp: ready, stalled
on coalesced access (uncoalesced idle), and stalled on
uncoalesced access (coalesced idle). The memory opera-
tion latency depends on the memory access type. Since
uncoalesced access generates more memory traffic, its
latency is higher than coalesced access. We also use the
linear model to estimate the latency. By identifying the
ratio of coalesced and uncoalesced memory instructions,
we can easily extend the two-state model to handle
three states and their state transitions. The Markov chain
performance model can be developed in a similar way.
Distinguishing between coalesced and uncoalesced ac-
cesses increases the accuracy of our model.
Adaptation to GPUs with multiple warp schedulers.
Our model assumes there is only one warp scheduler.
New-generation GPUs can support more than one warp
schedulers. The latest Kepler GPU features four warp
schedulers per SMX (SMX is the Kepler terminology
for SM) [28]. We extend our model to handle this case
by deriving a single pipeline virtual SM based on the
parameters of the SMX. The virtual SM has one warp
scheduler, and its parameters such as active thread
blocks and memory bandwidth are obtained by dividing
the corresponding parameters of the SMX by the number
of warp schedulers. This virtual SM can still capture the
memory and computation features of a kernel running
on the SMX. Experimental results in Section 5 show
that performance modeling on the virtual SM provides
a good estimation on the Kepler architectures.
There are two more issues that are worthwhile to
discuss.
The first issue is on the efficiency of executing our
model at runtime. We have developed mechanisms to
make our model more efficient without significantly
sacrificing the model accuracy. The O(N3) complexity
of calculating the steady state in Markov chain makes it
hard to meet the online requirement (N is the dimension
of the transition matrix). To reduce the computational
complexity, we consider the thread block as a schedul-
ing unit, instead of considering individual warps. In
this way, the computational complexity is significantly
reduced, and time cost of our model is negligible to the
GPU kernel execution time.
The second issue is on getting the input for the model.
Our current approach is based on hardware profiling of
9TABLE 2: GPU configurations.
C2050 GTX680
Architecture Fermi GF110 Kepler GK104
Number of SMs 14 8
Number of cores per SM 32 192
Core frequency (MHz) 1147 706
Global memory size (MB) 3072 2048
Global memory bandwidth (GB/s) 144 192
a small number of thread blocks from a single kernel.
Thus, the pre-execution is only a very small part of
the kernel execution. From profiling, we can obtain the
number of memory instructions issued and the total
number of instructions executed, and calculate Rm as
their ratio.
In summary, our probabilistic model has captured the
inherent non-determinism in concurrent kernel execu-
tions. First, it simply requires only a small set of profiling
inputs on the memory and computation characteristics
of individual kernels. Second, with careful probabilistic
modeling, we develop a performance model that is
sufficiently accurate to guide our scheduling decision.
The effectiveness of our model will be evaluated in the
experiments (Section 5).
5 EVALUATION
In this section, we present the experimental results on
evaluating Kernelet on latest GPU architectures.
5.1 Experimental Setup
We have conducted experiments on a workstation
equipped with one NVIDIA Tesla C2050 GPU, one
NVIDIA GTX680 GPU, two Intel Xeon E5645 CPUs and
24GB RAM. Table 2 shows some architectural features of
C2050 and GTX680. We note that C2050 and GTX680 are
based on Fermi and Kepler architectures, respectively.
One C2050 SM has two warp schedulers, and each can
serve half a warp per cycle (with a theoretical IPC of
one). In contrast, one GTX680 SMX features four warp
schedulers and each warp scheduler can serve one warp
per cycle (with a theoretical IPC of eight considering its
dual-issue capability). Our implementation is based on
GCC 4.6.2 and NVIDIA CUDA toolkit 4.2.
Workloads. We choose eight benchmark applications
with different memory and computation intensivenesses.
Sources of the benchmarks include the CUDA SDK, the
Parboil Benchmark [13], the CUSP library [4] and our
home grown applications. Table 3 describes the details
of each application, including input settings and thread
configurations of the most time-consuming kernel on
C2050.
Table 4 shows the memory and computation char-
acteristics of the most time-consuming kernel of each
application on both C2050 and GTX680. We observed
that the PUR/MUR values are stable as we vary the
input sizes (as long as the input size is sufficiently large
to keep the GPU occupancy high).
TABLE 3: Specification of benchmark applications and
thread configuration (#threads per thread block ×
#thread blocks).
Name Description Input settings Thread configu-
ration on C2050
Pointer Chas-
ing (PC)
Traversing an
array randomly
Index values for 40
million accesses
256 × 16384
Sum of Ab-
solute Differ-
ences (SAD)
An operation
used in MPEG
encoding
Image with 1920×
1072 pixels
32 × 8048
Sparse
Matrix Vector
Multiplica-
tion (SPMV)
Multiplying a
sparse matrix
with a dense
vector.
A 131072×81200
matrix with 16
non-zero elements
per row on average
256 × 16384
Stencil (ST) Stencil
operation on a
regular 3-D grid
3D grid with
134217728 points
128 × 16384
Matrix Mul-
tiplication
(MM)
Multiplying two
dense matrices
One 8192 × 2048
matrix, the other
2048 × 2048
256 × 16384
Magnetic
Resonance
Imaging - Q
(MRIQ)
A matrix opera-
tion in magnetic
resonance imag-
ing
2097152 elements 256 × 8192
Black Scholes
(BS)
Black-Scholes
Option Pricing
40 million 128 × 16384
Tiny
Encryption
Algorithm
(TEA)
A thread block
cipher
20971520 elements 128 × 16384
TABLE 4: Memory and computational characteristics of
benchmark applications.
Benchmarks
C2050 GTX680
PUR MUR Occupancy PUR MUR Occupancy
PC 0.0096 0.1404 100% 0.0072 0.1746 100%
SAD 0.1498 0.1120 16.7% 0.1062 0.1351 25%
SPMV 0.3464 0.003 100% 0.3027 0.0043 100%
ST 0.3629 0.1156 66.7% 0.2016 0.1179 100%
MM 0.5804 0.0161 67.7% 0.5321 0.0569 100%
MRIQ 0.8539 0.0002 83.3% 1.6784 0.0007 100%
BS 0.8642 0.0604 67.7% 1.2007 0.1323 100%
TEA 0.9978 0.0196 67.7% 1.1417 0.0353 100%
To assess the impact of kernel scheduling under dif-
ferent mixes of kernels, we create four groups of kernels
namely CI, MI, MIX and ALL (as shown in Table 5).
CI represents the computation-intensive workloads in-
cluding kernels with high PUR, whereas MI represents
workloads with intensive memory accesses. MIX and
ALL include a mix of CI and MI kernels. ALL has
more kernels than MIX. In each workload, we assume
the application arrival conforms to Poisson distribution.
The parameter λ in the Poisson distribution affects the
weight of the application in the workload. For simplicity,
we assume that all application has the same λ. We
also assume λ is sufficiently large so that at least two
kernels are pending for execution at any time for a high
utilization of the GPU.
Comparisons. To evaluate the effectiveness of kernel
scheduling in Kernelet, we have implemented the fol-
TABLE 5: Workload configurations.
Workload Applications
CI BS, MM, TEA, MRIQ
MI PC, SPMV, ST, SAD
MIX PC, BS, TEA, SAD
ALL PC, SPMV, ST, BS, MM, TE, MRIQ, SAD
10
0
10
20
30
40
50
60
70
14 28 42 56 70 84 98 112
S
li
c
e
d
 E
x
e
c
u
ti
o
n
 O
v
e
rh
e
a
d
 (
%
)
Slice Size
PC
SPMV
ST
BS
MM
TEA
MRIQ
SAD
(a) C2050
-1
0
1
2
3
4
5
6
8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128
S
li
c
e
d
 E
x
e
c
u
ti
o
n
 O
v
e
rh
e
a
d
 (
%
)
Slice Size
PC
SPMV
ST
BS
MM
TEA
MRIQ
SAD
(b) GTX680
Fig. 6: Sliced execution overhead with varying slice size on both C2050 and GTX680.
lowing scheduling techniques:
• Kernel Consolidation (BASE): the kernel consolida-
tion approach of concurrent kernel execution [34].
• Oracle (OPT): OPT uses the same scheduling al-
gorithm as Kernelet, except that it pre-executes all
possible slice ratios for all combinations to obtain
the CP and then determines the best slice ratio and
kernel combination. In another word, OPT is an
offline algorithm and provides the optimal through-
put for the greedy scheduling algorithm.
• Monte Carlo-based co-schedule (MC): we develop
a Monte Carlo approach to generate the distribution
of performance of different co-schedule methods in
the solution space. In each Monte Carlo simulation,
we randomly pick the kernel pairs and slice ra-
tios for co-scheduling. Through many Monte Carlo
simulations, we can quantitatively understand how
different co-schedules affect the performance. We
denote the result of MC to be MC(s), where s is
the number of Monte Carlo simulations.
5.2 Results on Kernel Slicing
We first evaluate the overhead of sliced execution, which
is defined as Ts
Tns
−1, where Ts and Tns are the sliced and
unsliced execution time, respectively.
Figure 6 shows the overhead for executing individual
kernels with varying slice sizes on C2050 and GTX680.
Slice sizes are set to multiples of the number of SMs
on the GPU and ranges from |SM | to the maximum
number under the occupancy limit. Overall, as the slice
size increases, the slicing overhead decreases. However,
we observe quite different performance behaviors on
C2050 and GTX680, due to their architectural differences.
On C2050, when the size is small, the slicing overhead is
very high (up to 66.7% for SAD). When the slice is larger
than or equal to 42 (three thread blocks per SM), the
overhead is ignorable for most kernels. Sliced execution
overhead on GTX680 is much smaller than on C2050.
Almost all slice sizes lead to overhead less than 2% on
GTX680. Regardless of the architectural differences, the
ignorable overhead of kernel slicing allows us to exploit
kernel slicing for co-scheduling slices from different
kernels with little additional cost.
0
0.2
0.4
0.6
0.8
1
1.2
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8
P
r
e
d
ic
te
d
 I
P
C
 
Measured IPC 
(a) C2050
 
0
0.8
1.6
2.4
3.2
4
4.8
0 0.8 1.6 2.4 3.2 4 4.8 5.6 6.4 7.2
P
r
e
d
ic
te
d
 I
P
C
 
Measured IPC 
(b) GTX680
Fig. 7: Comparison between predicted and measured
single kernel execution IPCs on two GPUs.
5.3 Results on Model Prediction
We evaluate the accuracy of our performance model in
different aspects, including the estimation of IPC s for
single kernels and concurrent kernel executions, and CP
prediction for concurrent kernel executions.
Single Kernel Performance Prediction. Figure 7 com-
pares the measured and estimated IPC values for the
eight benchmark applications on C2050 and GTX680. We
also show the two lines ( y = x ± 0.2 for C2050 and
y = x±1.6 for GTX680) to highlight the scope where dif-
ference between measurement and estimation is within
± 20% of the peak IPC. Note, the theoretical IPCs for
C2050 and GTX680 are one and eight respectively. If
the result falls in this scope, we consider the estimation
well captures the trend of the measurement. We can
see that, most results are within the scope. We further
define the absolute error to be |e− e′|, where e and e′ are
the measured and estimated IPC values, respectively.
The average absolute error for the eight benchmark
applications is 0.08 and 0.21 on C2050 and GTX680,
respectively. Our probabilistic model has achieved a
reasonable accuracy in estimating the performance of
single-kernel executions on the GPU.
Concurrent Kernel Performance Prediction. For the
eight benchmark applications, we run every possible
combination of kernel pairs and measure the IPC for
each combination. Figure 8 compares the measured and
predicted IPCs with the suitable slice ratio given by our
model. We have also studied other slicing ratios. Figure 9
compares the measured and predicted IPCs with fixed
11
 
0
0.2
0.4
0.6
0.8
1
1.2
0 0.2 0.4 0.6 0.8 1 1.2
P
r
e
d
ic
te
d
 I
P
C
 
Measured IPC 
(a) C2050
 
0
1
2
3
4
5
6
7
8
0 1 2 3 4 5 6 7 8
P
r
e
d
ic
te
d
 I
P
C
 
Measured IPC 
(b) GTX680
Fig. 8: Comparison between predicted and measured
concurrent kernel execution IPCs on two GPUs with
optimal slice ratio.
 
0
0.2
0.4
0.6
0.8
1
1.2
0 0.2 0.4 0.6 0.8 1
P
r
e
d
ic
te
d
 I
P
C
 
Measured IPC 
(a) C2050
 
0
1
2
3
4
5
6
7
8
0 1 2 3 4 5 6 7 8
P
r
e
d
ic
te
d
 I
P
C
 
Measured IPC 
(b) GTX680
Fig. 9: Comparison between predicted and measured
concurrent kernel execution IPCs on two GPUs with
fixed one-to-one slice ratio.
ratio = 1 : 1. We observed similar results on other fixed
ratios. Regardless of different kernel combinations and
slicing ratios, our model is able to well capture the trend
of concurrent executions for both dynamic and static
slice ratios.
Model Optimizations. We further evaluate the im-
pact of incorporating coalesced/uncoalesced memory
accesses and the number of warp schedulers on the GPU.
Only two applications (PC and SPMV) in our bench-
mark have uncoalesced memory accesses. We conduct
the single kernel execution prediction experiments by
(wrongly) assuming those two kernels with coalesced
memory accesses only. The results are shown in Fig-
ure 10. Without considering uncoalesced access, the pre-
dicted IPC values are much larger than measurements
since the assumption of coalesced access only underes-
timates the memory contention effects.
Figure 11 shows the results of concurrent execution
IPC prediction on GTX680 without considering the mul-
tiple warp schedulers. The estimation without consider-
ing the number of warp schedulers severely underesti-
mates the IPC on GTX680, in comparison with the results
in Figure 8.
CP Prediction. We further evaluate the accuracy of
CP prediction. Figure 12 shows the comparison between
measured and predicted CP on C2050. We observe sim-
ilar results on GTX680. The prediction is close to the
measurement. With accurate prediction on IPC , the CP
difference between prediction and measurement is small.
 
0
0.1
0.2
0.3
0.4
0.5
0.6
PC SPMV
I
P
C
 
measurement
w/ uncoalesced
access
w/o uncoalesced
access
Fig. 10: Comparison between predicted and measured
concurrent kernel execution IPCs with/without consid-
ering uncoalesced access on C2050.
 
0
1
2
3
4
5
6
7
8
0 1 2 3 4 5 6 7 8
P
r
e
d
ic
te
d
 I
P
C
 
Measured IPC 
Fig. 11: Comparison between predicted and measured
concurrent kernel execution IPCs without considering
multiple warp schedulers on GTX680.
The results are sufficiently good to guide the scheduling
decision as shown in the next section.
5.4 Results on Kernel Scheduling
In this section, we evaluate the effectiveness of our
kernel scheduling algorithm by comparing with BASE
and OPT. To simulate the continuous kernel submission
process, we initiate 1000 instances for each of each
kernel mix and submit them for execution according
to Poisson distributions. Different scheduling algorithms
are applied and the total kernel execution time is re-
ported. Figure 13 shows the total execution time of those
kernels on C2050 and GTX680. On all the four workloads
with different memory and computation characteristics,
 
0
0.1
0.2
0.3
0.4
0.5
-0.1 0 0.1 0.2 0.3 0.4 0.5
P
r
e
d
ic
te
d
 C
P
 
Measured CP 
Fig. 12: Comparison between predicted and measured
CP on C2050.
12
 
0
500000
1000000
1500000
2000000
2500000
3000000
3500000
CI MI MIX ALL
E
x
e
c
u
ti
o
n
 T
im
e
 (
m
s)
 
Workload 
Base
Kernelet
Oracle
(a) C2050
 
0
500000
1000000
1500000
2000000
2500000
CI MI MIX ALL
E
x
e
c
u
ti
o
n
 T
im
e
 (
m
s)
 
Workload 
Base
Kernelet
Oracle
(b) GTX680
Fig. 13: Comparison between different scheduling methods on both C2050 and GTX680.
TABLE 6: Number of kernels pruned with varying αp
and αm on C2050.
P
P
P
P
P
αm
αp 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0.015 0 0 2 2 3 4 4 4 4 4
0.03 0 2 5 6 7 8 9 9 9 9
0.045 0 3 7 8 9 10 11 11 11 11
0.06 1 4 8 9 12 13 15 15 15 15
0.075 1 4 8 9 12 13 15 15 15 15
0.09 1 4 8 9 12 13 15 15 16 16
0.105 1 4 9 10 14 15 18 18 20 20
0.12 2 6 11 12 17 18 21 22 24 24
0.135 2 6 11 12 17 19 22 23 25 26
0.15 2 6 11 13 18 20 23 24 27 28
Kernelet outperforms Base (with the improvement 5.0–
31.1% for C2050 and 6.7–23.4% for GTX680). Kernelet
achieves similar performance to OPT (with the difference
0.7-3.1% for C2050 and 4.0–15.0% for GTX680). The
performance improvement of Kernelet over Base is more
significant on MIX and ALL, because Kernelet have more
chances to select kernel pairs with complementary re-
source usage. Still, Kernelet outperforms Base in CI and
MI, because slicing exposes the scheduling opportunities
(even though they are small on CI and MI).
Table 6 shows the number of kernels pruned with
different pruning parameters αp and αm on C2050. In-
creasing αp and αm leads to more kernel combinations
being pruned. Similar pruning result is oberserved for
GTX680. Varying those two parameters can affect the
pruning power and also the optimization opportunities.
Thus, we choose the default values for αp and αm as
0.4 and 0.1 on C2050, and 0.4 and 0.105 on GTX680,
respectively, as a tradeoff between pruning power and
optimization opportunities.
We finally study the execution time distribution of the
scheduling candidate space. Figure 14 shows the CDF
(cumulative distribution function) of the execution time
of the MC(1000). As we can see from the figure, none of
the random schedules is better than Kernelet. It demon-
strates that random co-schedules hurt the performance
in a high probability due to the huge space of schedule
plans.
 
0
10
20
30
40
50
60
70
80
90
100
2665000 2670000 2675000 2680000 2685000 2690000 2695000
P
ro
b
a
b
il
it
y
 (
%
) 
Execution Time (ms) 
Fig. 14: CDF (cumulative distribution function) of exe-
cution time of MC(1000).
6 RELATED WORK
In this section, we review the related work in two cate-
gories: 1) scheduling algorithms on CPUs, especially for
CPUs with SMT (Simultaneous Multi-Threading), and 2)
multi-kernel executions on GPUs.
6.1 CPU Scheduling and Performance Modeling
SMT has been an effective hardware technology to
better utilize CPU resources. SMT allows instructions
from multiple threads to be executed on the instruction
pipeline at the same time. Various SMT aware schedul-
ing techniques have been proposed to increase the CPU
utilization [8], [24], [39]–[42]. The core idea of SMT aware
scheduling is to co-schedule threads with complemen-
tary resource requirements. Several mechanisms have
been proposed to select the threads to be executed on
the same SMT core, such as hardware counters feed-
back [24], [41], pre-execution of all combinations [39]
and probabilistic job symbiosis model [8]. Performance
models including the Markov chain based ones have
also been adopted for concurrent tasks modeling on
the CPUs. Serrano et al. [36], [37] developed a model
to estimate the instruction throughput of super-scalar
processors executing multiple instruction streams. Chen
13
et al. [6] proposed an analytical model for estimating
throughput of multi-threaded CPUs.
Despite the fruitful research results on SMT schedul-
ing, they are not applicable to GPUs due to the architec-
ture differences. First, main performance affecting issues
are different for CPU and GPU applications. L2 data
cache is a key consideration issue for SMT-aware thread
scheduling, whereas thread parallelism is usually more
important for the performance of GPGPU programs [2],
[19], [38]. Second, scheduling on GPUs is not as flexible
as that on CPUs. Current GPUs do not support task
preemption. Third, unlike CPUs supporting the concur-
rent execution of a relatively small number of threads,
each GPGPU kernel launches thousands of threads.
Additionally, the maximum number of co-scheduling
threads equals to the number of hardware context on
the CPU, while the number of active warps on GPUs is
dynamic, depending on the resource usage of the thread
blocks. The slicing, scheduling and performance models
in Kernelet are specifically designed for GPUs, taking
those issues into consideration.
6.2 GPU Multiple Kernel Execution and Sharing
In the past few years, GPU architectures have under-
gone significant and rapid improvements for GPGPU
support. Due to lack of concurrent kernel support in
early GPU architectures, researchers initially proposed
to merge two kernels at the source code level [12], [14].
In those methods, two kernels are combined into a single
kernel with if-else branches on different granularities
(e.g., thread blocks). They have three major disadvan-
tages compared with our approach. First, combining the
code of two kernels will increase the resource usage of
each thread block, leading to lower SM occupancy and
performance degradation [29]. Second, those approaches
require source code, which may not be always available
in the shared environments. Third, it requires two ker-
nels with different block sizes avoiding using barriers
within the thread block, otherwise deadlock may occur.
Recently, new-generation GPUs like NVIDIA Fermi
GPUs support concurrent kernel executions. Taking ad-
vantage of this new capability, a number of multi-
kernel optimization techniques [15], [23], [35] have been
developed to improve the utilization of GPUs. Ravi et
al. [34] proposed kernel consolidation to enable space
sharing (different kernels run on different SMs) and time
sharing (multiple kernels reside on the same SM) on
GPUs. Space sharing happens when the total number of
thread blocks of all kernels does not exceed the number
of SMs and each block can be executed on a dedicated
SM. If the total number of thread blocks is larger than the
number of SMs, while SMs have sufficient resources to
accommodate more thread blocks from different kernels,
time sharing happens. That means, kernel consolidation
does not have space sharing and have little time sharing
when the launched kernels have sufficient thread blocks
to occupy the GPU. Furthermore, they determined the
kernel to be consolidated with heuristics based on the
number of thread blocks. In contrast, Kernelet utilizes
slicing to create more opportunities for time sharing, and
develops a performance model to guide the scheduling
decision. Peters et al. [32] used a persistently running
kernel to handle requests from multiple applications.
GPU virtualization has also been investigated [15], [23].
Recent studies also address the problem of GPU
scheduling when multiple users co-reside in one ma-
chine. Pegasus [16] coordinates computing resources like
accelerators and CPU and provides a uniform resource
usage model. Timegraph [22] and PTask [35] manage the
GPU at the operating system level. Kato et al. [21] intro-
duced the responsive GPGPU execution model (RGEM).
All those scheduling methods do not consider how to
schedule concurrent kernels in order to fully utilize the
GPU resources.
As for performance models on GPUs, Hong [19]
and Kim [38] proposed analytical models based on the
round-robin warp scheduling assumption. Baghsorkhi et
al. [2] introduced the work flow graph interpretation
of GPU kernels to estimate their execution time. All
those models are designed for a single kernel. More-
over, they usually require extensive hardware profiling
and/or simulation processes. In contrast, our perfor-
mance model is designed for concurrent kernel execu-
tions on the GPU, and relies on a small set of key
performance factors of individual kernel to predict the
performance of concurrent kernel executions.
7 CONCLUSION
Recently, GPUs have been more and more widely used
in clusters and cloud environments, where many kernels
are submitted and executed on the shared GPUs. This
paper proposes Kernelet to improve the throughput
of concurrent kernel executions for such shared envi-
ronments. Kernelet creates more sharing opportunities
with kernel slicing, and uses a probabilistic performance
model to capture the non-deterministic performance fea-
tures of multiple-kernel executions. We evaluate Kernelet
on two NVIDIA GPUs, Tesla C2050 and GTX680, with
Fermi and Kepler architectures respectively. Our exper-
iments demonstrate the accuracy of our performance
model, and the effectiveness of Kernelet by improving
the concurrent kernel executions by 5.0–31.1% and 6.7–
23.4% on C2050 and GTX680 on our workloads, respec-
tively.
REFERENCES
[1] asfermi: An assembler for the nvidia fermi instruction set.
http://code.google.com/p/asfermi/, accessed on Dec 17th, 2012.
[2] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W.-
m. W. Hwu. An adaptive performance modeling tool for GPU
architectures. SIGPLAN Not., 45(5):105–114, Jan. 2010.
[3] S. S. Baghsorkhi, I. Gelado, M. Delahaye, and W.-m. W. Hwu.
Efficient performance evaluation of memory hierarchy for highly
multithreaded graphics processors. In Proc. of PPoPP ’12.
[4] N. Bell and M. Garland. Cusp: Generic parallel algo-
rithms for sparse matrix and graph computations version 0.3.0.
http://cusp-library.googlecode.com, accessed on Dec 17th, 2012.
14
[5] G. Chaitin. Register allocation & spilling via graph coloring. In
ACM Sigplan Notices, volume 17, pages 98–105. ACM, 1982.
[6] X. Chen and T. Aamodt. A first-order fine-grained multithreaded
throughput model.
[7] J. Duato, A. Pena, F. Silla, R. Mayo, and E. Quintana-Orti.
[8] S. Eyerman and L. Eeckhout. Probabilistic job symbiosis modeling
for smt processor scheduling. In Proc. of ASPLOS ’10.
[9] M. Garland and D. B. Kirk. Understanding throughput-oriented
architectures. Commun. ACM, 53(11), Nov. 2010.
[10] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTera-
Sort: high performance graphics co-processor sorting for large
database management. In Proc. of SIGMOD’06.
[11] N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manfer-
delli. High performance discrete fourier transforms on graphics
processors. In ACM/IEEE SuperComputing ’08, 2008.
[12] C. Gregg, J. Dorn, K. Hazelwood, and K. Skadron. Fine-grained
resource sharing for concurrent GPGPU kernels. 0:389–398.
[13] I. R. Group et al. Parboil benchmark suite, 2007.
[14] M. Guevara, C. Gregg, K. Hazelwood, and K. Skadron. Enabling
task parallelism in the cuda scheduler. In Workshop on Program-
ming Models for Emerging Architectures (PMEA), 2009, page 69C76.
[15] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche, N. Tolia,
V. Talwar, and P. Ranganathan. Gvim: GPU-accelerated virtual
machines. In Proceedings of the 3rd ACM Workshop on System-level
Virtualization for High Performance Computing, HPCVirt ’09, pages
17–24, New York, NY, USA, 2009. ACM.
[16] V. Gupta, K. Schwan, N. Tolia, V. Talwar, and P. Ranganathan.
Pegasus: coordinated scheduling for virtualized accelerator-based
systems. In Proc. of USENIXATC’11.
[17] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo,
and P. V. Sander. Relational query coprocessing on graphics
processors. ACM Trans. Database Syst., 34(4):1–39, 2009.
[18] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and
P. Sander. Relational joins on graphics processors. In Proc. of
SIGMOD ’08.
[19] S. Hong and H. Kim. An analytical model for a GPU architecture
with memory-level and thread-level parallelism awareness. In
Proc. of ISCA’09.
[20] Y. Jiang, X. Shen, J. Chen, and R. Tripathi. Analysis and approxi-
mation of optimal co-scheduling on chip multiprocessors. In Proc.
of PACT’08.
[21] S. Kato, K. Lakshmanan, A. Kumar, M. Kelkar, Y. Ishikawa, and
R. R. Rajkumar. Rgem: A responsive gpgpu execution model for
runtime engines. In Proc. of RTSS’11.
[22] S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa. Time-
Graph: GPU scheduling for real-time multi-tasking environments.
In Proc. of USENIXATC’11.
[23] T. Li, V. Narayana, E. El-Araby, and T. El-Ghazawi. In Proc. of
ICPP’11.
[24] T. Moseley, D. Grunwald, J. L. Kihm, and D. A. Connors. Methods
for modeling resource contention on simultaneous multithreading
processors. In Proc. of ICCD’05.
[25] R. Nath, S. Tomov, T. T. Dong, and J. Dongarra. Optimizing
symmetric dense matrix-vector multiplication on GPUs. In Proc.
of SC’11.
[26] A. Nukada, Y. Ogata, T. Endo, and S. Matsuoka. Bandwidth
intensive 3-D FFT kernel for gpus using cuda. In Proc. of SC’08.
[27] NVIDIA. NVIDIA Fermi Compute Architecture Whiltepaper, v1.1
edition.
[28] NVIDIA. NVIDIA GeForce GTX 680 Whiltepaper, v1.0 edition.
[29] NVIDIA. NVIDIA CUDA C Programming Guide 4.2, 2012.
[30] NVIDIA CUDA. http://developer.nvidia.com/object/cuda.html.
[31] A. Papoulis and R. Probability. Stochastic processes, volume 3.
McGraw-hill New York, 1991.
[32] H. Peters, M. Koper, and N. Luttenberger. Efficiently using a
CUDA-enabled GPU as shared resource. In Proc. of CIT’10.
[33] M. Poletto and V. Sarkar. Linear scan register allocation. ACM
Transactions on Programming Languages and Systems (TOPLAS),
21(5):895–913, 1999.
[34] V. T. Ravi, M. Becchi, G. Agrawal, and S. Chakradhar. Supporting
GPU sharing in cloud environments with a transparent runtime
consolidation framework. In Proc. of HPDC’11.
[35] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel.
PTask: operating system abstractions to manage gpus as compute
devices. In Proc. of SOSP’11.
[36] M. Serrano. Performance estimation in a simultaneous multi-
threading processor. In Proc. of MASCOTS’96.
[37] M. Serrano, W. Yamamoto, R. Wood, and M. Nemirovsky. A
model for performance estimation in a multistreamed superscalar
processor. In Computer Performance Evaluation Modelling Techniques
and Tools, volume 794 of Lecture Notes in Computer Science, pages
213–230. 1994.
[38] J. Sim, A. Dasgupta, H. Kim, and R. Vuduc. A performance
analysis framework for identifying potential benefits in GPGPU
applications. In Proc. of PPoPP’12.
[39] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a
simultaneous multithreaded processor. In Proceedings of the ninth
international conference on Architectural support for programming
languages and operating systems, ASPLOS-IX, 2000.
[40] A. Snavely, D. M. Tullsen, and G. Voelker. Symbiotic jobschedul-
ing with priorities for a simultaneous multithreading processor.
In Proc. of SIGMETRICS’02.
[41] S. E. Sujay Parekh and H. Levy. Thread-sensitive scheduling
for SMT processors. Technical report 2000-04-02, University of
Washington.
[42] D. Tam, R. Azimi, and M. Stumm. Thread clustering: sharing-
aware scheduling on smp-cmp-smt multiprocessors. In Pro-
ceedings of the 2nd ACM SIGOPS/EuroSys European Conference on
Computer Systems 2007, EuroSys ’07, 2007.
[43] V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense
linear algebra. In Proc. of SC’08.
[44] Y. Zhang and J. D. Owens. A quantitative performance analysis
model for gpu architectures. In Proc. of HPCA’11, 2011.
[45] Zillians. V-GPU: GPU virtualization.
http://www.zillians.com/products/vgpu-gpu-virtualization/,
accessed on Dec 17th, 2012.
