KLARAPTOR: A Tool for Dynamically Finding Optimal Kernel Launch
  Parameters Targeting CUDA Programs by Brandt, Alexander et al.
KLARAPTOR: A Tool for Dynamically Finding
Optimal Kernel Launch Parameters Targeting
CUDA Programs
Alexander Brandt∗, Davood Mohajerani∗, Marc Moreno Maza∗, Jeeva Paudel† and Linxiao Wang∗
∗Department of Computer Science, University of Western Ontario, London, Canada
†IBM Canada Software Labratory, Markham, Canada
Email: abrandt5@uwo.ca, dmohajer@uwo.ca, moreno@csd.uwo.ca, pjeeva01@ca.ibm.com, lwang739@uwo.ca
Abstract—In this paper we present KLARAPTOR (Kernel
LAunch parameters RAtional Program estimaTOR), a new tool
built on top of the LLVM Pass Framework and NVIDIA CUPTI
API to dynamically determine the optimal values of kernel launch
parameters of a CUDA program P . To be precise, we describe
a novel technique to statically build (at the compile time of P)
a so-called rational program R. Using a performance prediction
model, and knowing particular data and hardware parameters of
P at runtime, the program R can automatically and dynamically
determine the values of launch parameters of P that will yield
optimal performance. Our technique can be applied to parallel
programs in general, as well as to generic performance prediction
models which account for program and hardware parameters.
We are particularly interested in programs targeting manycore
accelerators. We have implemented and successfully tested our
technique in the context of GPU kernels written in CUDA using
the MWP-CWP performance prediction model.
Index Terms—Performance estimation, Performance portabil-
ity, CUDA, Program Parameters, Kernel Launch Parameters,
Manycore accelerators, LLVM Pass Framework
I. INTRODUCTION
Programming for high-performance parallel computing is
a notoriously difficult task. Programmers must be conscious
of many factors impacting performance including scheduling,
synchronization, and data locality. Of course, program code
itself impacts the program’s performance, however, there are
still further parameters which are independent from the code
and greatly influence performance. For parallel programs three
types of parameters influence performance: (i) data parame-
ters, such as input data and its size, (ii) hardware parameters,
such as cache capacity and number of available registers, and
(iii) program parameters, such as granularity of tasks and the
quantities that characterize how tasks are mapped to processors
(e.g. dimension sizes of a thread block for a CUDA kernel).
Data and hardware parameters are independent from pro-
gram parameters and are determined by the needs of the user
and available hardware resources. Program parameters, how-
ever, are intimately related to data and hardware parameters.
The choice of program parameters can largely influence the
performance of a parallel program, even up to an order of
magnitude difference (see Section VI). Therefore, it is critical
This work has been supported in part by IBM Canada Ltd (CAS project
880) and in part by NSERC of Canada (CRD grant CRDPJ500717-16, award
PGSD3-535362-2019).
to determine optimal values of program parameters that yield
the best program performance for a given set of hardware and
data parameter values.
In the CUDA programming model the kernel launch pa-
rameters, and thus the size and shape of thread blocks, greatly
impact performance. This should be obvious considering that
the memory accesses pattern of threads in a thread block
can depend on the block’s dimension sizes. The same could
be said about multithreaded programs on CPU where paral-
lel performance depends on task granularity and number of
threads. We dedicate, however, this paper to the discussion
of GPU programs where optimizing performance is much
more difficult. An important consequence of the impact of
kernel launch parameters on performance is that optimal thread
block format (that is, dimension sizes) for one GPU (Graphics
Processing Unit) architecture may not be optimal for another,
as illustrated in [1]. This emphasizes not only the impact
of hardware parameters on program parameters, but also the
need for performance portability. That is to say, enabling users
to efficiently execute the same parallel program on different
architectures that belong to the same hardware platform.
In this paper, we describe the development of KLARAP-
TOR (Kernel LAunch parameters RAtional Program estima-
TOR), a tool for automatically and dynamically determin-
ing the values of CUDA kernel launch parameters which
optimize the kernel’s performance, for each kernel launch
independently. That is to say, based on the actual data and
target device of a kernel invocation. The source code of
KLARAPTOR is freely available at https://github.com/orcca-
uwo/KLARAPTOR.
The accuracy of KLARAPTOR’s optimal prediction is il-
lustrated in Fig. 1 where kernel execution times are compared
between KLARAPTOR’s choice of thread block configuration
and the optimal thread block configuration determined by an
exhaustive search.
KLARAPTOR applies to CUDA a generic and novel tech-
nique, also described herein in Section IV, to statically build
a so-called rational program which is then used dynamically
to determine optimal program parameters for a given multi-
threaded program on specific data and hardware parameters.
The key principle underpinning the proposed technique can be
summarized as follows.
ar
X
iv
:1
91
1.
02
37
3v
1 
 [c
s.D
C]
  5
 N
ov
 20
19
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Preformance Ratio
(Exhaustive / KLARAPTOR Predicted)C
on
vo
lut
ion
2Dmm
3 K
1mm
3 K
2mm
3 K
3ata
x K
1ata
x K
2bic
g K
1bic
g K
2
cor
rm
ea
nred
uce
std
cov
arm
ea
nred
uce
ge
mmge
sum
mvm
vt 
K1m
vt 
K2
syr
2k
syr
k
Ke
rn
el
0.155
60.711
60.662
59.705
1.744
0.565
0.714
2.207
2966.500
0.577
0.134
0.590
2971.677
0.579
0.011
59.681
4.659
1.752
0.575
468.965
240.844
Kernel Performance Ratios
Fig. 1. Comparing kernel execution times with optimal launch parameters
as determined by an exhaustive search versus KLARAPTOR’s prediction.
Kernels are part of the PolyBench/GPU benchmark suite and executed on
a GTX 1080Ti with a data size of N = 2048. Ratios greater than 85% are
considered good. Absolute timings given next to each bar are in ms. Due to
memory limitations not all benchmarks could not run for N = 2048.
In most performance predictions models, high-level perfor-
mance metrics, such as execution time, memory consumption,
and hardware occupancy, are piece-wise rational functions
(PRFs) of low-level performance metrics, such as memory
bandwidth, cache miss rate, and the number of clock cycles per
instruction. These lower-level metrics are themselves PRFs of
program, data, and hardware parameters. As such, a high-level
performance metric can be estimated by a piece-wise rational
function of the program, data, and hardware parameters. For
a fixed machine, the hardware parameters are also fixed
and performance metrics depend only on program and data
parameters. Henceforth, we regard a computer program that
computes such a piece-wise rational function as a rational
program, a technical notion fully defined in Section II.
In most cases, the values of the data parameters are only
given at runtime, which makes it difficult to determine optimal
values of the program parameters at an earlier stage. On
another hand, a bad choice of program parameters can have
drastic consequences. Hence, it is crucial to be able to deter-
mine the optimal program parameters at runtime without much
overhead added to the program execution. This is precisely the
intention of the approach proposed in this paper.
A. Contributions
The goal of this work is to determine values of program
parameters which optimize a multithreaded program’s perfor-
mance. Towards that goal, the method by which such values
are found must be receptive to changing data and changing
hardware parameters. Our contributions encapsulate this re-
quirement through the dynamic use of a rational program. Our
specific contributions include:
(i) a technique for devising a mathematical expression in
the form of a rational program to evaluate a high-level
performance metric from a set of program and data
parameters;
(ii) KLARAPTOR, a tool implementing the rational program
technique to dynamically optimize CUDA kernels by
choosing optimal launch parameters; and
(iii) an empirical and comprehensive evaluation of our tool
on kernels from the Polybench/GPU benchmark suite.
B. Related Works
The Parallel Random Access Machine (PRAM) model [2],
[3], including PRAM models tailored to GPU code analysis
such as TMM [4] and MCM [5] analyze the performance of
parallel programs at an abstract level. More detailed GPU
performance models are proposed such as MWP-CWP by
Hong et al [6], [7], which estimates the execution time of GPU
kernels based on the profiling information of the kernels.
In the context of improving CUDA program performance,
other research groups have used techniques such as loop
transformation [8], auto-tuning [9], [10], [11], [12], dynamic
instrumentation [13], or a combination of the latter two [14].
Auto-tuning techniques have achieved great results in projects
such as ATLAS [15], FFTW [16], and SPIRAL [17] in which
multiple kernel versions are generated off-line and then applied
and refined on-line once the runtime parameters are known.
Although much research has been devoted to optimiz-
ing a kernel algorithmically, previous works such as [18]
and [1] suggest that kernel launch parameters (i.e. thread
block configurations) have a large impact on performance
and must be considered as a target for optimization. In [19]
the authors claim to have a static analysis method which
does not require running the code for determining the best
configuration for CUDA kernels. Through static code analysis,
IPC (instructions per clock-cycle) is estimated, however, there
is no analysis of memory access patterns. The authors assume
“the execution time of a CUDA program is proportional to the
input problem size N” [19, pp.527]. This is, obviously, a very
strong assumption, and impractical for real world applications
where even simple operations such as matrix multiplication
are not proportional to their input size.
In [20], the authors present an input-adaptive GPU code
optimization framework G-ADAPT, which uses statistical
learning to find a relation between the input sizes and the
thread block sizes. At runtime or linking time, the framework
predicts the best block size for a given input size using the
linear model obtained from compile time. This approach only
considers the total size of the thread blocks and not their
configuration. Meanwhile, the authors of [11] use a linear re-
gression model to predict optimal thread block configurations
(that is, dimension sizes and not just the total size). However,
like the authors of [19], they assume that the execution time
of a kernel is proportional to its data size. Moreover, only two
benchmarks are provided. In [21], machine learning techniques
are used in combination with auto-tuning to search for optimal
configurations of OpenCL kernels. Their examples are limited
to stencil computations.
C. Structure of the Paper
The remainder of this paper is organized as follows. Sec-
tion II formalizes and exemplifies the notion of rational
programs and their relation to piece-wise rational functions
and performance prediction. Section III highlights our two
main results. Firstly, that rational programs can be determined
as a mathematical expression for evaluating a performance
metric from program and data parameters and, secondly, our
success in applying this technique to CUDA kernels via the
KLARAPTOR tool.
The remaining sections examine our generic technique for
building and using a rational program to estimate program
performance (Section IV), the specifics of the implementation
of that technique as KLARAPTOR (Section V), and further
experimentation and evaluation of our implementation (Sec-
tion VI). We draw conclusions and explore future work in
Section VII.
II. THEORETICAL FOUNDATIONS
Let P be a multithreaded program to be executed on a
targeted multiprocessor. By fixing the target architecture, the
hardware parameters, say, H = (H1, . . . ,Hh) then become
fixed and we can assume that the performance of P depends
only on data parameters D = (D1, . . . , Dd) and program
parameters P = (P1, . . . , Pp). Also, an optimal choice of P
depends on a specific choice of D. For example, in programs
targeting GPUs the parametersD are typically dimension sizes
of data structures, like arrays, while P typically specify the
formats of thread blocks.
Let E be a high-level performance metric for P that we
want to optimize. More precisely, given the values of the data
parametersD, the goal is to find values of the program param-
eters P such that the execution of P optimizes E . Performance
prediction models attempt to estimate E from a combination
of P , D, H , and some model- or platform-specific low-level
metrics L = (L1, . . . , L`). It is natural to assume that these
low-level performance metrics are themselves combinations
of P , D, H . This is an obvious observation from models
based on PRAM such as TMM [4] and MCM [5]. Hence,
regardless of the particular performance prediction model, we
may drop direct dependency on L and look to estimate E
exclusively from some (unknown) combination of program and
data parameters.
To address our goal, we compute a mathematical expression,
parameterized by data and program parameters, in the format
of a rational program R at compile-time. At runtime, given
the specific values of D, we can efficiently evaluate E using
R. After that, we can feasibly pick a value of P that optimizes
E , and feed that to P for execution. We refer to such P which
optimizes E as optimal values of the program parameters. This
method is detailed in Section IV.
A. Rational Programs
Let X1, . . . , Xn, Y be pairwise different variables1. Let S
be a sequence of three-address code (TAC [22]) instructions
such that the set of the variables that occur in S and are
never assigned a value by an instruction of S is exactly
{X1, . . . , Xn}.
Definition 1: We say that the sequence S is rational if
every arithmetic operation used in S is either an addition,
a subtraction, a multiplication, or a comparison of integer
numbers in either fixed or arbitrary precision. Moreover, we
say that the sequence S is a rational program in X1, . . . , Xn
evaluating Y if the following two conditions hold:
1) S is rational, and
2) after specializing each of X1, . . . , Xn to an arbitrary inte-
ger value in S, the execution of the specialized sequence
S always terminates and the last executed instruction
assigns an integer value to Y .
It is worth noting that the above definition can easily
be extended to include Euclidean division, the integer part
operations floor and ceiling, and arithmetic over rational
numbers. For Euclidean division one can write a rational
program evaluating the quotient q of integer a by b, leaving
the remainder r to be simply calculated as a− qb. Then, floor
and ceiling can be computed via Euclidean division. Rational
numbers and their associated arithmetic is easily implemented
using only integer arithmetic.
Notice then, that by adding these operations to Defini-
tion 1, the class of rational programs does not change. Thus,
adding rational numbers, Euclidean division, and integer part
operations to the definition of a rational sequence yields an
equivalent definition of a rational program. We adopt this
definition henceforth.
B. Rational Programs as flowcharts
For any sequence S of computer program instructions, it is
convenient to associate S with a control flow graph (CFG). In
the CFG of S, the nodes are the basic blocks of S, that is, the
sub-sequences of S such that
(i) each instruction except the last one is not a branch, and
(ii) which are maximum in length with property (i).
Moreover, in the CFG of S , there is a directed edge from
a basic block B1 to a basic block B2 whenever, during the
execution of S, one can jump from the last instruction of B1
to the first instruction of B2. Recall that a flowchart is another
graphic representation of a sequence of computer program
instructions. In fact, CFGs can be seen as particular flowcharts.
If, in a given flowchart C, every arithmetic operation occur-
ring in every (process or decision) node is either an addition,
subtraction, multiplication, or comparison of integers in either
fixed or arbitrary precision then C is the flowchart of a rational
1Here variable refers to both the algebraic meaning of a polynomial variable
and the programming language concept.
T Bmax ≤ 32Wmax and
RT Bmax ≤ Rmax and
Z Bmax ≤ Zmax?
32Wmax ≤ T Bmax, and
32WmaxR ≤ Rmax and
32Wmax Z ≤ Zmax T ?
Rmax ≤ RT Bmax and
Rmax ≤ R 32Wmax and
Rmax Z ≤ RT Zmax?
Zmax ≤ Bmax Z and
Zmax T ≤ 32Wmax Z and
ZmaxRT ≤ Z Rmax?
Bactive = Bmax
Bactive = b(32Wmax)/T c
Bactive = b(Rmax/(RT )c
Bactive = bZmax/Zc
Bactive = 0 (Failure to Launch)
No
No
No
No
Yes
Yes
Yes
Yes
Fig. 2. Rational program (presented as a flow chart) for the calculation of
hardware occupancy in CUDA.
sequence of computer program instructions. Therefore, it is
meaningful to depict rational programs using flowcharts. Two
meaningful examples of rational programs are presented in
the following section while Fig. 2 depicts one of them as a
flowchart.
C. Examples
Example 1: Hardware occupancy, as defined in the CUDA
programming model, is a measure of a program’s effectiveness
in using the Streaming Multiprocessors (SMs) of a GPU.
Hardware occupancy is calculated from a number of hardware
parameters, namely:
- the maximum number Rmax of registers per thread block,
- the maximum number Zmax of shared memory words per
thread block,
- the maximum number Tmax of threads per thread block,
- the maximum number Bmax of thread blocks per SM and
- the maximum number Wmax of warps per SM,
as well as low-level kernel-dependent performance metrics,
namely:
- the number R of registers used per thread and
- the number Z of shared memory words used per thread
block,
and a program parameter, namely the number T of threads per
thread block. The occupancy of a CUDA kernel is defined as
the ratio between the number of active warps per SM and the
maximum number of warps per SM, namely Wactive/Wmax,
where
Wactive = min (bBactiveT/32c,Wmax) (1)
and Bactive is given as a flowchart by Fig. 2. This flowchart
shows how one can derive a rational program computing
Bactive from Rmax, Zmax, Tmax, Bmax, Wmax, R, Z, T . It
follows from Equation (1) that Wactive can also be computed
by a rational program from Rmax, Zmax, Tmax, Bmax, Wmax,
R, Z, T . Finally, the same is true for the occupancy of a
CUDA kernel using Wactive and Wmax.
Example 2: The execution time of a GPU kernel is a high-
level performance metric defined in the MWP- model [6], [7]
which is calculated from hardware parameters including:
- clock frequency of a SM,
- the number of SMs on the device, etc.,
as well as low-level kernel-dependent performance metrics,
including:
- the total number #Mem insts of memory instructions
per thread,
- the total number #Comp inst of computation instruc-
tions per thread, etc.,
and a program parameter, namely the number T of threads per
thread block. Thus, this high-level metric can be computed as
rational program of program, data, and hardware parameters.
Indeed, we have realized such a rational program making us of
the MWP-CWP model and its metrics in our implementation
of KLARAPTOR (See Sections III-B and V).
D. Piece-Wise Rational Functions in Rational Programs
Observation 1: Let S be a rational program in X1, . . . , Xn
evaluating Y . Let s be any instruction of S other than a branch
or an integer part instruction. Hence, this instruction can be
of the form C = −A, C = A+B, C = A−B, C = A×B,
where A and B can be any machine-representable rational
numbers. Let V1, . . . , Vv be the variables that are defined at
the entry point of the basic block of the instruction s. An
elementary proof by induction yields the following fact. There
exists a rational function2 in V1, . . . , Vv that we denote by
fs(V1, . . . , Vv) such that C = fs(V1, . . . , Vv) for all possible
values of V1, . . . , Vv . From there, one derives the following
observation. There exists a partition T = {T1, T2, . . .} of Qn
(where Q denotes the field of rational numbers) and rational
functions f1(X1, . . . , Xn), f2(X1, . . . , Xn), . . . such that, if
X1, . . . , Xn receive respectively the values x1, . . . , xn, then
the value of Y returned by S is one of fi(x1, . . . , xn) where
i is such that (x1, . . . , xn) ∈ Ti holds. In other words, S
computes Y as a piece-wise rational function (PRF).
Example 1 shows that the hardware occupancy of a CUDA
kernel is given as a piece-wise rational function in the variables
Rmax, Zmax, Tmax, Bmax, Wmax, R, Z, T . Hence, in this
example, we have n = 8. Moreover, Fig. 2 shows that its
partition of Qn contains 5 parts as there are 5 terminating
nodes in the flowchart.
2Here, rational function is in the sense of algebra, see [23] and section V-E.
III. MAIN RESULTS
A. Rational Programs for Performance Prediction
In Section II, we saw how a rational program could be
seen as a piece-wise rational function. We now show how a
rational program can encode, and thus evaluate, a performance
prediction model using only program and data parameters. A
technique for constructing such a rational program is given in
Section IV.
Observation 2: We assume that the high-level performance
metric E is computable as a rational program depending on
program, data, and hardware parameters, and some low-level
performance metrics. Then, it is natural to assume that these
low-level metrics also depend only rationally (as opposed
to some more complex dependency) on program, data, and
hardware parameters. Hence, we replace the direct dependency
of E on low-level metrics with a further dependency on
the program, data, and hardware parameters. When applied
to a multithreaded program to be executed on a particular
multiprocessor, the hardware parameters become specialized
and we can assume that E can be expressed as a rational
program depending only on program and data parameters.
Observations 1 and 2 culminate into this final result on
rational programs. Suppose that a flow chart C representing the
rational program R is partially known; to be precise, suppose
that the decision nodes are known (that is, mathematical
expressions defining them are known) while the process nodes
are not. From Observation 1, each process node can be given
by a series of rational functions. Determining each of those
rational functions can be achieved by solving an interpolation
or curve fitting problem.
It follows that a performance prediction model which can be
depicted as a flow chart can also be depicted as a piece-wise
rational function and thus a rational program. For example,
the MWP-CWP performance model on top of CUDA (see
Examples 1 and 2) gives us an estimate of E from program,
data, and hardware parameters and some low-level metrics.
Thus, by determining rational functions describing these low-
level metrics, which are parameterized by program and data
parameters, we can fully determine a rational program estimat-
ing E which depends only on program and data parameters.
Recall that building such a rational program R is our goal.
Once R is known, it can be used at runtime (that is, when the
program P is run on specific input data) to compute optimal
values for the program parameters. This is exactly what is
achieved in our tool implementing this technique. Before
exploring this tool we make one final remark. We assumed
that the decision nodes in the flowchart of the rational program
were known, however, we could relax this assumption. Indeed,
each decision node is given by a series of rational functions.
Hence, those could be determined by solving curve fitting
problems as well. We do not discuss this direction further since
this is not needed in our proposed technique or implementation
presented in the remainder of this paper.
B. KLARAPTOR: A Dynamic Optimization tool for CUDA
The theory of rational programs is put into practice for
the CUDA programming model by our tool KLARAPTOR.
KLARAPTOR is a compile-time tool implemented using the
LLVM Pass Framework and the MWP-CWP performance
model to dynamically choose a CUDA kernel’s launch param-
eters (thread block configuration) which optimize its perfor-
mance. Most high-performance computing applications require
computations be as fast as possible and so kernel performance
is simply measured as its execution time.
As mentioned in Section I, thread block configurations
drastically affect the running time of a kernel. Determining
optimal thread block configurations typically follows some
heuristics, for example, constraining block size to be a multiple
of 32 [24]. However, it is known that the dimension sizes of a
thread block, not only its total size, affect performance [1],
[18]. Moreover, since thread block configurations are inti-
mately tied to the size of data being operated on, it is very
unlikely that a static thread block configuration optimizes the
performance of all data sizes. Our tool effectively uses a
rational program to dynamically consider the particular data
size of every kernel invocation and determines the thread
block configuration which minimizes the execution time of
that particular invocation. This is achieved in two main steps.
1) At the compile-time of a CUDA program, its kernels are
analyzed in order to build a rational program estimating
the performance of each individual kernel under the
MWP-CWP model. Each rational program, written as
code in the C language, is inserted into the code of the
CUDA program so that it is called before the execution
of the corresponding kernel.
2) At runtime, immediately preceding the launch of a kernel,
where data parameters have specific values, the rational
program is evaluated to determine the kernel launch
parameters (thread block configuration) which optimize
the performance of the kernel. The kernel is then launched
using this thread block configuration.
The practical benefits of KLARAPTOR are numerous, but
in particular we are concerned with kernel performance and
programmer performance. Of course, by choice of optimal
kernel launch parameters, the execution time of a kernel is
minimized without any modifications to the kernel. By pro-
grammer performance we mean the efficiency of a programmer
to produce optimal code. When a programmer is attempting
to optimize a kernel, choosing optimal launch parameters
can either be completely ignored, performed heuristically,
determined by trial and error, or determined by an exhaustive
search. The latter two options quickly become infeasible as
data sizes grow large. Regardless, any choice of optimal thread
block configuration is likely to optimize only a single data size.
For KLARAPTOR to be practical, not only does the choice
of optimal kernel launch parameters need to be correct, but
the two main steps defined previously must also be performed
in an way which is more efficient than trial and error or
exhaustive search. Namely, the compile-time analysis cannot
Co
nv
olu
tio
n2
D
bi
cg
 K
2
bi
cg
 K
1
at
ax
 K
2
at
ax
 K
1
m
vt
 K
2
m
vt
 K
1
ge
su
m
m
v
sy
rk
m
m
2 
K1
m
m
2 
K2
m
m
3 
K2
m
m
3 
K1
m
m
3 
K3
sy
r2
k
0
10
20
30
40
fd
td
_s
te
p3
fd
td
_s
te
p1
fd
td
_s
te
p2
re
du
ce
m
ea
n
co
rr st
d
m
ea
n
co
va
r
re
du
ce
ge
m
m
co
nv
olu
tio
n3
D
gr
am
sc
hm
id
t K
1
gr
am
sc
hm
id
t K
2
gr
am
sc
hm
id
t K
3
0
100
200
300
400
Polybench/GPU Kernel
Op
tim
iza
tio
n 
Ti
m
e 
(s
)
Time to Determine Optimal Launch Parameters KLARAPTOR Exhaustive Search
Fig. 3. Comparing cumulative times to determine optimal launch parameters for data sizes 32 ≤ N ≤ 2048 for each kernel in Polybench/GPU.
overwhelm the compilation time and the runtime decision of
the kernel launch parameters cannot overwhelm the kernel
execution time. For the former our analysis is performed very
quickly using only small data sizes whereas for the later the
rational program evaluation is fast and simple, maintaining a
runtime history to instantly provide results for future kernel
launches. The full details of this process is presented in
Section V. In the remainder of this section we highlight the
performance of KLARAPTOR.
The Polybench/GPU benchmark suite provides an em-
pirical evaluation of our tool on a range of CUDA pro-
grams. In Fig. 1 we have already seen that KLARAPTOR
accurately predicts the optimal or near-optimal thread block
configuration. In Fig. 3 we compare the so-called system per-
formance. System performance describes the time required to
determine the optimal thread block configuration by either (a)
completely running KLARAPTOR, including rational program
construction, evaluation, and running the CUDA program over
many data sizes; or (b) invoking the kernel on all possible
thread block configurations in an exhaustive search over those
same data sizes. Evidently, using KLARAPTOR is orders
of magnitude faster than an exhaustive search while also
maintaining the fact that it dynamically adjusts for any data
size. More detailed results and experimentation is presented
in Section VI.
IV. AN ALGORITHM TO BUILD AND DEPLOY A
PERFORMANCE-PREDICTING RATIONAL PROGRAM
In this section, the notations and hypotheses are the same
as in Section II. Namely, E is a high-level performance metric
for the multithreaded program P (e.g. execution time, memory
consumption, or hardware occupancy), L is a set of low-level
metrics given by the performance prediction model or the
particular platform of P , and P , D, H are sets of program,
data, and hardware parameters, respectively.
Let us assume that the values of H are known at compile-
time of P while the values ofD are known at runtime. Further,
let us assume that P and D take integer values. Hence the
values of P belong to a finite set F ⊂ Zp. That is to say, the
possible values of P are tuples of the form (pi1, . . . , pip) ∈
F , with each pii being an integer. Let us call such a tuple a
configuration of the program parameters. Due to the nature of
program parameters, those are not necessarily all independent
variables (i.e. a program parameter may depend on the value
of another program parameter). For example, in CUDA, the
product of the dimension sizes of a thread block is usually
(i) a multiple of the warp size (32) and, (ii) bounded over by
the maximum number of threads per block.
Following Section III-A, we can compute a rational program
R estimating E as a function of D and P . To do so, we shall
first compute rational functions g1(D,P ), . . . , g`(D,P ), es-
timating L1, . . . , L` respectively. Once R has been computed,
it can be used when P is being executed (thus at runtime) in
order to determine optimal values for P , on given values of
D. The entire process is decomposed into six steps: the first
three occur at compile-time and the next three at runtime.
1) Data collection: In order to perform a curve fitting of the
rational functions gi(D,P ), 1 ≤ i ≤ `, we require data
points with which to perform the fit. These are collected
by (i) selecting a set of points K which is a subset of the
space of possible values of (D, P ); and (ii) we execute
the program P , recording the values of the low-level
performance metrics L as V = (V1, . . . , V`), at each
point in K.
2) Rational function estimation: For each low-level metric
Li we use the set of points K and the value Vi measured
for each point to estimate the rational function gi(D,P ).
We observe that if the values of Vi were known exactly
(that is, without error) the rational function gi(D,P )
could be determined exactly via rational function in-
terpolation. In practice, however, these empirical values
are likely to be noisy from profiling, and/or numerical
approximations. Consequently, we determine a rational
function gˆi(D,P ) which approximates gi(D,P ).
3) Code generation: In order to generate the rational pro-
gram R, we proceed as follows:
(i) we convert the rational program representing E into
code, say in the C programming language, essentially
encoding the CFG for computing E ;
(ii) we convert each gˆi(D,P ) into code, specifically sub-
routines, estimating Li; and
(iii) we include those sub-routines into the code computing
E , which yields the desired rational program R.
At this point the rational program R is fully determined.
4) Rational program evaluation: At the runtime of P ,
the data parameters D are given particular values, say
δ1, . . . , δd. For those specified values of D and for all
practically meaningful3 values of P from the set F , we
compute an estimate of E using R. The evaluation of E
over so many different possible program parameters is
feasible for three reasons:
(i) The number of program parameters is small, typically
p ≤ 3, see Section V;
(ii) the set of meaningful values for P is small (consider
that in CUDA the product of thread block dimensions
should be a multiple of 32 less than 1024); and
(iii) the program R simply evaluates a few formulae and
thus runs almost instantaneously.
5) Selection of optimal values of program parameters:
When the search space of values of the program param-
eters P is large, a numerical optimization technique is
required for this step. However, as just explained, the
number of evaluations to sort through is quite small and
thus an exhaustive search is feasible. However, due to
inaccuracies in the performance prediction model being
used, and in the estimation of the rational functions
gi(D,P ) as gˆi(D,P ), there may be several configura-
tions which, up to some margin, optimize E . Then, a
secondary performance metric or some heuristic specific
to the platform of P may be used to refine the choice of
optimal configuration.
6) Program execution: Once an optimal configuration is
selected, the program P can finally be executed using
this configuration along with the values δ1, . . . , δd of D.
V. THE IMPLEMENTATION OF KLARAPTOR
In this section we give an overview of the implementation
of our previously presented technique (Section IV) applied
to CUDA using the MWP-CWP performance model. The
KLARAPTOR tool is built in the C language, making use of
the LLVM Pass Framework (see Section V-C) and the NVIDIA
CUPTI API (see Section V-D). KLARAPTOR is freely avail-
able in source at https://github.com/orcca-uwo/KLARAPTOR.
In the case of a CUDA kernel, the data parameters specify
the input data size. In many examples this is a single param-
eter, say N , describing the size of an array (or the order of
a multi-dimensional array), the values of which are usually
powers of 2. Program parameters describe the kernel launch
parameters, i.e. grid and thread block dimensions, and are also
typically powers of 2. For example, a possible thread block
configuration may be 1024× 1× 1 (a one-dimensional thread
3“Practically meaningful” refers to the fact that the values of the P are
likely to be constrained by the values D. For example, if P1, P2 are the
two dimension sizes of a two-dimensional thread block of a CUDA kernel
operating on a square matrix of order D1, then P1P2 ≤ D21 is meaningful.
block), or 16 × 16 × 2 (a three-dimensional thread block).
Lastly, the hardware parameters are values specific to the target
GPU, for example, memory bandwidth, the number of SMs
available, and their clock frequency.
This section is broken into subsections. Sections V-A,
V-B, and V-C are specific to our implementation and do not
correspond to any step of Section IV. The compile time steps
1 (data collection) and 2 (rational function estimation) are
reflected in Sections V-D and V-E, while step 3 requires no
additional explanation. The runtime steps 4 (rational program
evaluation), 5 (selection of optimal configuration), and 6 (pro-
gram execution) are also straight forward from the algorithm
in Section IV. Throughout this section we refer to the notion
of a driver program as the code which includes the rational
program and uses it to select an optimal configuration.
A. Annotating the source code
Beginning with a CUDA program written in C/C++, we
minimally annotate the host code to make it compatible with
our pre-processor. We take into account the following points:
(i) the code targets CUDA Compute Capability 3.x or higher;
(ii) there should be no CUDA runtime API calls as such calls
will interfere with later CUDA driver API calls used by
our tool, for example, cudaSetDevice;
(iii) the block dimensions and grid dimensions must be de-
clared as the typical CUDA dim3 structs.
For each kernel in the CUDA code, we add two pragmas,
one defining the index of the kernel input arguments corre-
sponding to the data size N and one specifying the dimension
of the kernel (1, 2, or 3). As an example, consider the
code segment below for the Sample CUDA kernel and the
associated pragmas. This kernel operates of a two-dimensional
kernel of order N , making it a two-dimensional kernel.
#pragma kernel_info_size_param_idx_Sample = 1;
#pragma kernel_info_dim_sample_kernel = 2;
__global__ void Sample (int *A, int N) {
int tid_x = threadIdx.x + blockIdx.x*blockDim.x;
int tid_y = threadIdx.y + blockIdx.y*blockDim.y;
...
}
Lastly, for each kernel, the user must fill two formatted
configuration files which follow Python syntax. One speci-
fies the constraints on the thread block configuration while the
other specifies the grid dimensions. For example, for the 2D
kernel Sample above, one could specify that its thread block
configuration (bx, by, bz) must satisfy bx < by2, bx < N and
by < N . Similarly, the grid dimensions (gx, gy, gz), could be
specified as gx = ceil
[
N
bx
]
and gy = ceil
[
N
by
]
.
B. Preprocessing
The preprocessor processes the annotated source replacing
CUDA runtime API calls to driver API kernel launches. This
step includes source code analysis in order to extract the list of
kernels, the list of kernel calls in the host code, and finally, the
body of each kernel (which will be used for further analysis)
and building a so-called “PTX lookup table” to store specific
kernel information. This table will be inserted into the so-
called “instrumented binary”—the compiled CUDA program,
augmented by our driver programs.
C. Input/Output Builder
The Input/Output builder Pass, or IO-builder, is a compiler
pass written in the LLVM Pass Framework to build the previ-
ously mentioned “instrumented binary”. This pass embeds IO
mechanisms to communicate with the driver program of each
kernel at runtime. There is one such function call embedded
per kernel call. Thus, at the runtime of the CUDA program
being analyzed (step 6 of Section IV), an IO function is
called before each kernel invocation to return six integers,
(gx, gx, gz, bx, by, bz), the optimal kernel launch parameters.
The IO-builder pass goes through the following steps:
(i) obtain the LLVM IR (intermediate representation) of the
instrumented source code and find all CUDA driver API
kernel calls;
(ii) relying on the annotated information for each kernel,
determine which variables in the IR contain the value
of N for a corresponding kernel call; and
(iii) insert a call to an IO function immediately before each
kernel call in order to pass the runtime values of N to the
corresponding driver program and retrieve the optimal
kernel launch parameters.
D. Building a Driver Program: Data collection
For the MWP-CWP model as well as our eventual rational
function estimation, we must collect data and statistics regard-
ing certain performance counters and runtime metrics (these
are thoroughly defined in [6] and [25]). These metrics can be
partitioned into three categories.
Firstly, architecture-specific performance counters of a ker-
nel, basically, characteristics dictated by the CC of the target
device. Such characteristics can be obtained at compile-time,
as the target CC is specified at this time. These characteristics
include the number of registers used in each thread, the amount
of static shared memory per thread block, and the number of
(arithmetic and memory) instructions per thread.
Secondly, runtime-specific performance counters that de-
pend on the behavior of the kernel at runtime. This in-
cludes values impacted by memory access patterns, namely,
the number of memory accesses per warp, the number of
memory instructions in each thread, and eventually, the total
number of warps that are being executed. We have developed
a customized profiler using NVIDIA’s EVENT API within
the CUPTI API to collect the required runtime performance
counters. The profiler is customized to collect exactly the
information required for the MWP-CWP model and nothing
else, allowing it to be very lightweight and avoid the huge
overheads of a typical profiler (e.g. NVIDIA’s NVPROF [26]).
Thirdly, device-specific parameters, which describe an ac-
tual GPU card, allow us to compute a more precise perfor-
mance estimate. A subset of such parameters can be deter-
mined by microbenchmarking the device (see [27] and [28]),
this includes the memory bandwidth, and the departure delay
for memory accesses. The remaining parameters can easily
obtained by consulting the vendor’s guide [29], or by querying
the device itself via the CUDA driver API. Such parameters
include the number of SMs on the card, the clock frequency
of SM cores, and the instruction delay.
E. Building a Driver Program: Rational function estimation
Using the runtime data collected in the previous step,
we look to determine the rational functions gˆi(D,P ) (see
Section IV) which estimate the low-level metrics or other
intermediate values in the performance model. For simplicity
of discussion, we replace the parameters D and P with the
generic variables X1, . . . , Xn.
A rational function is simply a fraction of two polynomials.
With a degree bound (an upper limit on the exponent) on each
variable Xk in the numerator and the denominator, say uk
and vk, respectively, these polynomials can be defined up to
some parameters (using the language of parameter estimation),
namely the coefficients of the polynomials, say α1, . . . , αi and
β1, . . . , βj . Through analysis of the MWP-CWP model these
degree bounds, and thus i and j, are relatively small.
fb(X1, . . . , Xn) = pb(X1, . . . , Xn) / qb(X1, . . . , Xn)
=
α1 · (X01 · · ·X0n) + . . . + αi · (Xu11 · · ·Xunn )
β1 · (X01 · · ·X0n) + . . . + βj · (Xv11 · · ·Xvnn )
We perform a parameter estimation (for each rational func-
tion) on the coefficients α1, . . . , αi, β1, . . . , βj . to determine
the rational function. This is a relatively simple linear regres-
sion which can be solved as an over-determined system of
linear equations, say by the method of linear least squares.
However, our system is constructed from the evaluation
of monomial terms, resulting in essentially a Vandermonde
matrix. Such matrices are very ill-conditioned. Since we are
interested in using our fitted model for extrapolation (i.e.
estimating program parameters for new data parameters) any
numerical errors in the model fitting will grow very quickly
[30]. Thus, our solution must be as numerically stable as
possible. Moreover, since we have constructed the system from
monomials, it suffers from multicollinearity (see [31, Chapter
23]) and can become rank-deficient. The typical method to
solve the system using QR-factorization is thus impossible
and we use the computationally more intensive yet more
numerically stable method of singular value decomposition
(for details see [30, Chapter 4]).
Our implementation uses the LAPACK (Linear Algebra
PACKage) library [32] for linear algebra and the Basic Poly-
nomial Algebra Subprograms (BPAS) library [33] for efficient
rational function and polynomial implementations.
VI. EXPERIMENTATION
In this section we explore the performance of KLARA-
PTOR by applying it to the CUDA programs of the
Polybench/GPU benchmark suite [9]. Throughout this sec-
tion data was collected using a GTX 1080Ti.
We note here that many of the kernels in this suite perform
relatively low amounts of work; they are best suited to being
executed many times from a loop in the host code. These
kernels with extremely fast execution create a large variance in
relative timings from trial to trial where the absolute time scale
is on the order of a fraction of a millisecond. Performance
is further varied by dynamic frequency and voltage scaling
despite setting the performance mode to peak performance.
However, when it comes to choosing thread block configura-
tions which give optimal execution times, we need only to look
at trends and relative performance. Fig. 4 shows these trends
for three different kernels: atax, corr, and gramschmidt.
While the predicted values are not exact, they do show that
the overall shape is accurate and that predicted minima align
with actual minima in execution time.
Table I provides experimentation data for every kernel in
the Polybench/GPU benchmark suite. Namely, this table
compares the execution times of the thread block configuration
chosen by KLARAPTOR against the optimal thread block
configuration found though exhaustive search. The table shows
a couple of data sizes in order to highlight that the best
configuration can change for different input sizes.
VII. CONCLUSIONS AND FUTURE WORK
The performance of a single CUDA program can vary
wildly depending on the target GPU device, the input data size,
and the kernel launch parameters. Moreover, a thread block
configuration yielding optimal performance for a particular
data size or a particular target device will not necessarily be
optimal for a different data size or different target device.
In this paper we have presented the KLARAPTOR tool for
determining optimal CUDA thread block configurations which
is adaptive to each kernel invocation and input data, allowing
for dynamic data-dependent performance and portable perfor-
mance. This tool is based upon our technique of encoding
a performance prediction model as a rational program. The
process of constructing such a rational program is a fast and
automatic compile-time process which occurs simultaneously
to compiling the CUDA program by use of the LLVM Pass
framework. Our tool was tested using the kernels of the
Polybench/GPU benchmark suite with good results.
That same experimentation has lead us to consider the limi-
tations our chosen performance prediction model, MWP-CWP.
Recently, the author of [34] and [35] has suggested a new
GPU performance model which relies on Little’s law; it
measures concurrency as a product of latency and throughput.
This model considers both warp and instruction concurrency
while the other models such as [29], [6], [7], and [36],
consider only warp concurrency. The author’s analysis of those
models suggests their limitation is the significant underesti-
mation of occupancy when arithmetic intensity (the number
of arithmetic instructions per memory access) is intermediate.
This is exactly the type of kernels on which KLARAPTOR
underperforms. In future work we look to apply an improved
performance prediction in order to achieve even better results.
REFERENCES
[1] Y. Torres, A. Gonza´lez-Escribano, and D. R. Llanos, “ubench: exposing
the impact of CUDA block geometry in terms of performance,” The
Journal of Supercomputing, vol. 65, no. 3, pp. 1150–1163, 2013.
TABLE I
COMPARING KLARAPTOR AGAINST EXHAUSTIVE SEARCH FOR THREAD
BLOCK CONFIGURATION CHOICE OF KERNELS IN POLYBENCH/GPU.
Kernel N Chosen Time Best TimeConfig. (ms) Config. (ms)
Convolution2D 8192 32, 16 2.42 32, 4 2.3416384 32, 16 11.15 128, 1 9.34
mm2 K1 256 32, 16 0.20 512, 1 0.14512 32, 16 0.97 32, 2 0.80
mm2 K2 256 32, 16 0.20 16, 2 0.17512 32, 16 0.96 32, 2 0.80
convolution3D 512 32, 16 0.03 32, 8 0.031024 32, 16 0.07 32, 8 0.07
mm3 K1 1024 32, 16 7.48 32, 8 7.442048 32, 16 60.71 32, 8 61.03
mm3 K2 1024 32, 16 7.46 32, 8 7.572048 32, 16 60.66 128, 8 56.58
mm3 K3 1024 32, 16 7.44 32, 8 7.432048 32, 16 59.70 32, 8 55.80
atax K1 1024 32, 16 0.89 8, 4 0.202048 64, 8 1.74 16, 2 0.36
atax K2 1024 32, 16 0.28 128, 4 0.252048 64, 8 0.56 32, 1 0.49
bicg K1 4096 128, 8 1.74 256, 1 1.048192 256, 4 2.81 128, 1 2.21
bicg K2 4096 128, 8 8.43 32, 1 0.858192 256, 4 13.19 16, 2 4.30
corr 1024 32, 16 778.66 4, 1 621.802048 64, 8 2966.50 4, 1 3632.81
mean 1024 32, 16 0.41 4, 1 0.282048 64, 8 0.58 32, 1 0.55
reduce 1024 32, 16 0.04 1, 8 0.222048 32, 16 0.13 1, 8 0.83
std 1024 32, 32 0.88 4, 1 0.282048 64, 8 0.59 32, 1 0.56
covar 1024 32, 16 789.44 4, 1 622.182048 64, 8 2971.68 4, 1 3656.70
mean 1024 32, 16 0.40 4, 1 0.292048 64, 8 0.58 128, 1 0.55
reduce 1024 128, 8 0.01 32, 1 0.012048 256, 4 0.01 64, 1 0.01
fdtd step1 512 32, 16 0.02 32, 8 0.011024 32, 16 0.05 64, 8 0.04
fdtd step2 512 32, 16 0.02 64, 8 0.011024 32, 16 0.05 64, 8 0.04
fdtd step3 512 32, 16 0.02 32, 8 0.011024 32, 16 0.06 32, 8 0.05
gemm 1024 32, 16 7.49 32, 8 7.342048 32, 16 59.68 32, 8 61.43
gesummv 1024 32, 16 2.39 8, 4 0.342048 64, 8 4.66 16, 2 0.73
gramschmidt K1 512 512, 1 0.02 8, 1 0.021024 512, 1 0.04 1, 1 0.03
gramschmidt K2 512 1024, 1 0.01 32, 1 0.011024 32, 16 0.01 16, 1 0.01
gramschmidt K3 512 256, 4 0.26 16, 1 0.201024 32, 16 0.66 32, 1 0.54
mvt K1 4096 128, 8 6.64 32, 1 0.858192 256, 4 12.70 16, 2 4.32
mvt K2 4096 128, 8 1.39 32, 1 1.038192 256, 4 2.71 64, 1 2.19
syr2k 1024 32, 16 59.59 8, 8 58.642048 32, 16 468.96 4, 8 670.23
syrk 1024 32, 16 30.42 16, 8 30.222048 32, 16 240.84 16, 8 247.60
[2] L. J. Stockmeyer and U. Vishkin, “Simulation of parallel random access
machines by circuits,” SIAM J. Comput., vol. 13, no. 2, pp. 409–422,
1984.
[3] P. B. Gibbons, “A more practical PRAM model,” in Proceedings of the
ACM Symposium on Parallel Algorithms and Architectures. ACM,
1989, pp. 158–168.
[4] L. Ma, K. Agrawal, and R. D. Chamberlain, “A memory access model
for highly-threaded many-core architectures,” Future Generation Comp.
Syst., vol. 30, pp. 202–215, 2014.
[5] S. A. Haque, M. Moreno Maza, and N. Xie, “A many-core machine
model for designing algorithms with minimum parallelism overheads,”
0.0
0.2
0.4
0.6
0.8
1.0
at
ax
_k
er
ne
l1
(1
,3
2,
1)
(2
,1
6,
1)
(4
,8
,1
)
(8
,4
,1
)
(1
6,
2,
1)
(3
2,
1,
1)
(1
,6
4,
1)
(2
,3
2,
1)
(4
,1
6,
1)
(8
,8
,1
)
(1
6,
4,
1)
(3
2,
2,
1)
(6
4,
1,
1)
(1
,1
28
,1
)
(2
,6
4,
1)
(4
,3
2,
1)
(8
,1
6,
1)
(1
6,
8,
1)
(3
2,
4,
1)
(6
4,
2,
1)
(1
28
,1
,1
)
(1
,2
56
,1
)
(2
,1
28
,1
)
(4
,6
4,
1)
(8
,3
2,
1)
(1
6,
16
,1
)
(3
2,
8,
1)
(6
4,
4,
1)
(1
28
,2
,1
)
(2
56
,1
,1
)
(1
,5
12
,1
)
(2
,2
56
,1
)
(4
,1
28
,1
)
(8
,6
4,
1)
(1
6,
32
,1
)
(3
2,
16
,1
)
(6
4,
8,
1)
(1
28
,4
,1
)
(2
56
,2
,1
)
(5
12
,1
,1
)
(1
,1
02
4,
1)
(2
,5
12
,1
)
(4
,2
56
,1
)
(8
,1
28
,1
)
(1
6,
64
,1
)
(3
2,
32
,1
)
(6
4,
16
,1
)
(1
28
,8
,1
)
(2
56
,4
,1
)
(5
12
,2
,1
)
(1
02
4,
1,
1)
Thread Block Dimensions
0.0
0.2
0.4
0.6
0.8
1.0
gr
am
sc
hm
id
t_
ke
rn
el
2
Predicted
Actual
No
rm
al
ize
d 
Ti
m
e 128256
512
1024
128
256
512
1024
Fig. 4. The predicted and actual execution time of three kernels for various input sizes. The trends show that predicted minima occur at actual minima.
in Parallel Computing: On the Road to Exascale, Proceedings of the
International Conference on Parallel Computing, ParCo 2015, ser.
Advances in Parallel Computing, vol. 27. IOS Press, 2015, pp. 35–44.
[6] S. Hong and H. Kim, “An analytical model for a GPU architecture
with memory-level and thread-level parallelism awareness,” in 36th ISCA
2009, 2009, pp. 152–163.
[7] J. Sim, A. Dasgupta, H. Kim, and R. W. Vuduc, “A performance analysis
framework for identifying potential benefits in GPGPU applications,” in
PPOPP 2012, New Orleans, LA, USA, February 25-29, 2012.
[8] M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam,
A. Rountev, and P. Sadayappan, “A compiler framework for optimization
of affine loop nests for GPGPUs,” in ICS 2008, Island of Kos, Greece,
June 7-12, 2008, pp. 225–234.
[9] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos,
“Auto-tuning a high-level language targeted to GPU codes,” in Innova-
tive Parallel Computing (InPar), 2012. IEEE, 2012, pp. 1–10.
[10] M. Khan, P. Basu, G. Rudy, M. Hall, C. Chen, and J. Chame, “A script-
based autotuning compiler system to generate high-performance CUDA
code,” ACM Trans. Archit. Code Optim., vol. 9, no. 4, pp. 1–25, 2013.
[11] K. Sato, H. Takizawa, K. Komatsu, and H. Kobayashi, “Automatic tuning
of CUDA execution parameters for stencil processing,” in Software
Automatic Tuning, From Concepts to State-of-the-Art Results, 2010.
[12] J. Kurzak, Y. Tsai, M. Gates, A. Abdelfattah, and J. J. Dongarra,
“Massively parallel automated software tuning,” in ICPP 2019, Kyoto,
Japan, August 05-08, 2019, 2019, pp. 92:1–92:10.
[13] T. Kistler and M. Franz, “Continuous program optimization: A case
study,” (TOPLAS), vol. 25, no. 4, pp. 500–548, 2003.
[14] C. Song, L.-P. Wang, and T. J. Martı´nez, “Automated code engine for
graphical processing units: Application to the effective core potential
integrals and gradients,” Journal of chemical theory and computation,
vol. 12, no. 1, pp. 92–106, 2015.
[15] R. C. Whaley and J. Dongarra, “Automatically tuned linear algebra
software,” in Proceedings of the 1998 IEEE/ACM Conference on Su-
percomputing, 1998.
[16] M. Frigo and S. G. Johnson, “FFTW: an adaptive software architecture
for the FFT,” in IEEE,ICASSP ’98, Seattle, Washington, USA, May 12-
15, 1998, 1998, pp. 1381–1384.
[17] M. Pu¨schel, J. M. F. Moura, B. Singer, J. Xiong, J. R. Johnson, D. A.
Padua, M. M. Veloso, and R. W. Johnson, “Spiral: A generator for
platform-adapted libraries of signal processing alogorithms,” IJHPCA,
vol. 18, no. 1, pp. 21–45, 2004.
[18] C. Chen, X. Chen, A. Keita, M. Moreno Maza, and N. Xie, “MetaFork:
A compilation framework for concurrency models targeting hardware
accelerators and its application to the generation of parametric CUDA
kernels,” in Proceedings of CASCON 2015, 2015, pp. 70–79.
[19] R. V. Lim, B. Norris, and A. D. Malony, “Autotuning GPU kernels via
static and predictive analysis,” in ICPP 2017, Bristol, U.K, August 14-17,
2017, pp. 523–532.
[20] Y. Liu, E. Z. Zhang, and X. Shen, “A cross-input adaptive framework for
GPU program optimizations,” in 2009 IEEE International Symposium
on Parallel & Distributed Processing. IEEE, 2009, pp. 1–10.
[21] J. D. Garvey and T. S. Abdelrahman, “Automatic performance tuning of
stencil computations on gpus,” in ICPP 2015, Beijing, China, September
1-4, 2015, 2015, pp. 300–309.
[22] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Tech-
niques, and Tools. Boston, MA, USA: Addison-Wesley Longman
Publishing Co., Inc., 1986.
[23] D. Eisenbud, Commutative Algebra: with a view toward algebraic
geometry. Springer Science & Business Media, 2013, vol. 150.
[24] “CUDA C Best Practices Guide, v10.1.105,” NVIDIA Corporation,
March 2019, https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
index.html.
[25] “CUDA runtime API: v10.0,” NVIDIA Corporation, September 2018,
http://docs.nvidia.com/cuda/pdf/CUDA Runtime API.pdf.
[26] “Profiler User’s Guide, v10.1.105,” NVIDIA Corporation, March 2019,
https://docs.nvidia.com/cuda/profiler-users-guide/index.html.
[27] X. Mei and X. Chu, “Dissecting GPU memory hierarchy through
microbenchmarking,” IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 1,
pp. 72–86, 2017.
[28] H. Wong, M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos,
“Demystifying GPU microarchitecture through microbenchmarking,” in
IEEE ISPASS 2010, 28-30 March 2010, White Plains, NY, USA. IEEE
Computer Society, 2010, pp. 235–246.
[29] “CUDA C Programming Guide, v10.1.105,” NVIDIA Corporation,
March 2019, https://docs.nvidia.com/cuda/cuda-c-programming-guide/
index.html.
[30] R. M. Corless and N. Fillion, A graduate introduction to numerical
methods. Springer, 2013.
[31] J. E. Gentle, W. K. Ha¨rdle, and Y. Mori, Handbook of computational
statistics: concepts and methods. Springer Science & Business Media,
2012.
[32] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Don-
garra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and
D. Sorensen, LAPACK Users’ Guide, 3rd ed. Philadelphia, PA: Society
for Industrial and Applied Mathematics, 1999.
[33] M. Asadi, A. Brandt, C. Chen, S. Covanov, F. Mansouri, D. Mohajerani,
R. Moir, M. Moreno Maza, L. Wang, N. Xie, and Y. Xie, “Basic Poly-
nomial Algebra Subprograms (BPAS),” 2019, http://www.bpaslib.org.
[34] V. Volkov, “A microbenchmark to study GPU performance models,”
in Proceedings of the 23rd ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, PPoPP 2018, Vienna, Austria,
February 24-28, 2018, 2018, pp. 421–422.
[35] ——, “Understanding latency hiding on GPUs,” Ph.D. dissertation,
EECS Department, University of California, Berkeley, Aug 2016.
[36] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W.-m. W.
Hwu, “An adaptive performance modeling tool for GPU architectures,”
ser. PPoPP ’10. ACM, 2010, pp. 105–114.
