A Technique for Finding Optimal Program Launch Parameters Targeting
  Manycore Accelerators by Brandt, Alexander et al.
ar
X
iv
:1
90
6.
00
14
2v
1 
 [c
s.D
C]
  1
 Ju
n 2
01
9
A Technique for Finding Optimal Program Launch Parameters
Targeting Manycore Accelerators
Alexander Brandt
University of Western Ontario, Canada
abrandt5@uwo.ca
Davood Mohajerani
University of Western Ontario, Canada
dmohajer@uwo.ca
Marc Moreno Maza
University of Western Ontario, Canada
moreno@csd.uwo.ca
Jeeva Paudel
IBM Canada Software Laboratory, Markham,
Canada
pjeeva01@ca.ibm.com
Lin-Xiao Wang
University of Western Ontario, Canada
lwang739@uwo.ca
ABSTRACT
In this paper, we present a new technique to dynamically deter-
mine the values of program parameters in order to optimize the
performance of a multithreaded program P . To be precise, we de-
scribe a novel technique to statically build another program, say,
R , that can dynamically determine the optimal values of program
parameters to yield the best program performance for P given val-
ues for its data and hardware parameters. While this technique
can be applied to parallel programs in general, we are particularly
interested in programs targeting manycore accelerators. Our tech-
nique has successfully been employed for GPU kernels using the
MWP-CWP performance model for CUDA.
KEYWORDS
Performance estimation, Program Parameters, Portable performance,
Manycore accelerators, CUDA
Reference Format:
Alexander Brandt, Davood Mohajerani, Marc Moreno Maza, Jeeva Paudel,
and Lin-XiaoWang. 2019. ATechnique for FindingOptimal Program Launch
Parameters Targeting Manycore Accelerators. 11 pages.
1 INTRODUCTION
Three types of parameters influence the performance of parallel
programs on multiprocessors: (i) data parameters, such as input
data and its size, (ii) hardware parameters, such as cache capac-
ity and number of available registers, and (iii) program parameters,
such as granularity of tasks and the quantities that characterize
how tasks are mapped to processors.
Data and hardware parameters are independent from program
parameters, and are determined by users’ needs and available hard-
ware resources. Program parameters, however, are intimately re-
lated to data and hardware parameters. Meanwhile, the choice of
program parameters can largely effect the performance of the pro-
gram. Therefore, determining optimal values of program parame-
ters that yield the best program performance for a given conflu-
ence of hardware and data parameter values is critical. Further, de-
termining such values automatically is important to enable users
to execute the same parallel program efficiently on different hard-
ware platforms.
This paper describes a novel technique to statically build a pro-
gram R that can dynamically determine the optimal values of pro-
gram parameters to yield the best program performance for given
values of the data and hardware parameters of a given multithreaded
program P . The key principle underpinning the proposed tech-
nique can be summarized as follows.
Inmost program executionmodels, high-level performancemet-
rics, such as execution time, memory consumption, and hardware
occupancy, are piece-wise rational functions (PRFs) of lower-level
metrics, which include the number of cachemisses and the number
of cycles per instruction. These lower-level metrics are themselves
PRFs of program, hardware, and data parameters. As such, for a
fixed machine, a high-level performance metric can be estimated
by a piece-wise rational function of the program and data param-
eters. Henceforth, we regard a computer program that computes
such a piece-wise rational function as a special sort of rational pro-
gram, a technical notion defined in Section 2.
1.1 Problem Statement
LetP be amultithreaded program to be executed on a targetedmul-
tiprocessor. By fixing the target architecture, the hardware param-
eters, say, H1, . . . ,Hh then become fixed and we can assume that
the performance ofP depends only on data parametersD1, . . . ,Dd
and program parameters P1, . . . , Pp . Also, the optimal choice of
P1, . . . , Pp depends on a specific choice ofD1, . . . ,Dd . For example,
in programs targeting GPUs (e.g. programs written in CUDA), the
parameters D1, . . . ,Dd are typically dimension sizes of data struc-
tures, like arrays, while P1, . . . , Pp typically specify the formats of
thread blocks (see Section 4 for further details of this technique
applied to CUDA).
In most cases, data parameters are only given at runtime, which
makes it difficult to determine the optimal program parameters
before that. However, a bad choice of program parameters can
be very inefficient, especially for programs that consume a large
amount of resources (such as running time and memory consump-
tion). Hence, it is crucial to be able to determine the optimal pro-
gram parameters at runtime without much overhead added to the
program execution.
Let E be a performance metric for P that we want to optimize.
More precisely, given the values of the data parametersD1, . . . ,Dd ,
the goal is to find values of the program parameters P1, . . . , Pp such
that the execution of P optimizes E. Here, optimizing could mean
maximizing, as in the case of a performance metric such as hard-
ware occupancy, or minimizing, as in the case of a performance
metric like execution time. To address our goal, we compute a
mathematical expression, parameterized by data and program pa-
rameters, in the format of a rational program R (see Section 2) at
Brandt, Mohajerani, Moreno Maza, Paudel, Wang
compile-time. At runtime, given the specific values of D1, . . . ,Dd ,
we can efficiently evaluate E using R . After that, we can feasibly
pick the P1, . . . , Pp that optimize E, and feed that to P for execu-
tion.
1.2 Contributions
Towards the aforementioned goal, our specific contributions are:
(i) a technique for devising a mathematical expression in the
form of a rational program R (Section 2) that evaluates E as a
function of D1, . . . ,Dd and P1, . . . , Pp ,
(ii) an executable example of the rational program in the form of
a C program using the MWP-CWP model [20], and
(iii) an empirical evaluation of this example on kernels in the Polybench/GPU
benchmark suite.
1.3 Structure of the Paper
The rest of this paper is organized as follows. Section 2 formal-
izes and exemplifies the notion of rational programs. Section 3.1
states two observations about the targeted rational program R ,
from which Section 3.2 derives consequences. Such consequences
will lead to a step-by-step description, in Section 3.3, of our pro-
posed approach for building R .
We shall see in Section 4 that R can be built when the program
P is being compiled, that is, at compile-time. Then, when P is ex-
ecuted on a specific input data set, that is, at runtime, the rational
program R can be used to make an optimal choice for the program
parameters of P . The embodiment of the propped techniques, as
reported in Section 4, targets Graphics Processing Units (GPUs).
Section 5 presents empirical results from evaluation of this tech-
nique on the PolyBench test suite. Section 6 compares and con-
trasts the proposed idea with previous and related works.
Note:While the idea and the technique described herein are ap-
plicable to parallel programs in general, this document extensively
uses GPU architecture and the CUDA (Compute Unified Device Ar-
chitecture, see [1, 29]) programming model for ease of illustration
and description.
2 RATIONAL PROGRAM
Let X1, . . . ,Xn ,Y be pairwise different variables
1. Let S be a se-
quence of three-address code (TAC [3]) instructions such that the
set of the variables that occur in S and are never assigned a value
by an instruction of S is exactly {X1, . . . ,Xn}.
Definition 1. We say that the sequence S is rational if every
arithmetic operation used in S is either an addition, a subtraction,
a multiplication, or a comparison (=, <), for integer numbers either
in fixed precision or in arbitrary precision. Moreover, we say that the
sequence S is a rational program in X1, . . . ,Xn evaluating Y if the
following two conditions hold:
(1) S is rational, and
(2) after specializing each of X1, . . . ,Xn to an arbitrary integer
value inS, the execution of the specialized sequenceS always
terminates and the last executed instruction assigns an integer
value to Y .
1Here variable refers to both the algebraic meaning of a polynomial variable and the
programming language concept.
The above definition calls for a few natural remarks.
Remark 1. One can easily extend Definition 1 by allowing the
use of the Euclidean division for integers, in both fixed and arbi-
trary precision. Indeed, one can write a rational program evaluat-
ing the integer quotientQ of a signed integer A by a signed integer
B. Then, the remainder R in the Euclidean division ofA by B is sim-
ply A −QB.
Remark 2. One can also extend Definition 1 by allowing addi-
tion, subtraction, multiplication and comparison (=, <) of rational
numbers in arbitrary precision. Indeed, each of these operations
can easily be implemented by rational programs using addition,
subtraction, multiplication and comparison (=, <) for integer num-
bers in arbitrary precision.
Remark 3. Next, one can extend Definition 1 by allowing the
integer part operations F 7−→ ⌈F ⌉ and F 7−→ ⌊F ⌋ where F is an ar-
bitrary rational number. Indeed, for a rational numberA/B (where
A and B are signed integers) the integer parts ⌈A/B⌉ and ⌊A/B⌋ can
be computed by performing the Euclidean division ofA by B. Con-
sequently, one can allow TAC instructions of the formsQ = ⌈A/B⌉
and Q = ⌊A/B⌋, where = denotes the assignment as in the C pro-
gramming language.
Remark 4. Extending Definition 1 according to Remarks 1–3
does not change the class of rational programs. Thus, adding Eu-
clidean division, rational number arithmetic and integer part com-
putation to the definition of a rational sequence yields an equiva-
lent definition of rational program. We adopt this definition hence-
forth.
Remark 5. Recall that it is convenient to associate any sequence
S of computer program instructionswith a control flow graph (CFG).
In the CFG ofS, the nodes are the basic blocks ofS, that is, the sub-
sequences of S such that (i) each instruction except the last one is
not a branch, and (ii) which are maximum in length with property
(i). Moreover, in the CFG ofS, there is a directed edge from a basic
block B1 to a basic block B2 whenever, during the execution of S,
one can jump from the last instruction of B1 to the first instruction
ofB2. Recall also that a flow chart is another graphic representation
of a sequence of computer program instructions. In fact, CFGs can
be seen as particular flow charts.
If, in a given flow chart C, every arithmetic operation occur-
ring in every (process or decision) node is either an addition, sub-
traction, multiplication, or comparison, for integers in either fixed
or arbitrary precision (or any other operation as explained in Re-
marks 1-3), then C is the flow chart of a rational sequence of com-
puter program instructions. Therefore, it is meaningful to depict
rational programs using flow charts. An example is given by Fig-
ure 1.
2.1 Examples
Example 1. Hardware occupancy, as defined in the CUDA pro-
gramming model, is a measure of how effective a program is mak-
ing use of the hardware’s processors (Streaming Multiprocessors
in case of GPUs). Hardware occupancy is calculated from hardware
parameters, namely:
- the maximum number Rmax of registers per thread block,
A Technique for Finding Optimal Program Launch Parameters Targeting Manycore Accelerators
T Bmax ≤ 32Wmax and
RT Bmax ≤ Rmax and
Z Bmax ≤ Zmax?
32Wmax ≤ T Bmax, and
32WmaxR ≤ Rmax and
32Wmax Z ≤ Zmax T?
Rmax ≤ RT Bmax and
Rmax ≤ R 32Wmax and
Rmax Z ≤ RT Zmax?
Zmax ≤ Bmax Z and
Zmax T ≤ 32Wmax Z and
ZmaxRT ≤ Z Rmax?
Bactive = Bmax
Bactive = ⌊(32Wmax)/T ⌋
Bactive = ⌊(Rmax/(RT )⌋
Bactive = ⌊Zmax/Z⌋
Bactive = 0 (Failure to Launch)
No
No
No
No
Yes
Yes
Yes
Yes
Figure 1: Rational program (presented as a flow chart) for
the calculation of hardware occupancy in CUDA.
- the maximum number Zmax of shared memory words per
thread block,
- the maximum number Tmax of threads per thread block,
- the maximum number Bmax of thread blocks per Streaming
Multiprocessor (SM) and
- the maximum numberWmax of warps per SM,
as well as low-level performance metrics, namely:
- the number R of registers used per thread and
- the number Z of shared memory words used per thread
block,
and a program parameter, namely the number T of threads per
thread block. The hardware occupancy of aCUDA kernel is defined
as the ratio between the number of active warps per SM and the
maximum number of warps per SM, namelyWactive/Wmax, where
Wactive = min (⌊BactiveT /32⌋,Wmax) (1)
and Bactive is given as a flow chart by Figure 1. Figure 1 shows how
one can derive a rational program computing Bactive from Rmax,
Zmax, Tmax, Bmax,Wmax, R, Z , T . It follows from Equation (1) that
Wactive can also be computed by a rational program from Rmax,
Zmax, Tmax, Bmax,Wmax, R, Z , T . Finally, the same is true for the
hardware occupancy of a CUDA kernel.
Example 2. The execution time of a GPU kernel is defined in
the MWP-CWP execution model, see [21, 34], and calculated from
hardware parameters including:
- clock frequency of a SM,
- the number of SMs on the device, etc.,
as well as low-level performance metrics, including:
- the total number #Mem_insts of memory instructions per
thread,
- the total number #Comp_inst of computation instructions
per thread, etc.,
and a program parameter, namely the number T of threads per
thread block. We have implemented this model in the C program-
ming language, see Section 4. This program computes the execu-
tion time (variable clockcycles) as a function of the hardware pa-
rameters, low-level performance metrics, and program parameters
considered by the MWP-CWP execution model.
3 TECHNIQUE PRINCIPLE
3.1 Key observations
Section 3.3 presents a technique for constructing a rational pro-
gram R in P1, . . . , Pp , D1, . . . ,Dd evaluating E. This technique is
based on the following two observations. The first one is a general
remark about rational programs while the second one is specific
to rational programs estimating performance metrics.
Observation 1. LetS be a rational program inX1, . . . ,Xn eval-
uating Y . Let s be any instruction of S other than a branch or an
integer part instruction. Hence, this instruction can be of the form
Z = −U , Z = U + V , Z = U −V , Z = U × V , where U and V can
be any machine-representable rational numbers. Let V1, . . . ,Vv be
the variables that are defined at the entry point of the basic block
of the instruction s . An elementary proof by induction yields the
following fact. There exists a rational function2 in V1, . . . ,Vv that
we denote by fs (V1, . . . ,Vv ) such that Z = fs (V1, . . . ,Vv ) for all
possible values ofV1, . . . ,Vv .
From there, one derives the following observation. There exists
a partition T = {T1,T2, . . .} ofQ
n (where Q denotes the field of ra-
tional numbers) and rational functions f1(X1, . . . ,Xn), f2(X1, . . . ,Xn),
. . . such that, ifX1, . . . ,Xn receive respectively the values x1, . . . ,xn ,
then the value of Y returned by S is one of fi (x1, . . . , xn ) where i
is such that (x1, . . . ,xn ) ∈ Ti holds. In other words, S computes Y
as a piece-wise rational function. Figure 1 shows that the hardware
occupancy of a CUDA kernel is given as a piece-wise rational func-
tion in the variables Rmax, Zmax,Tmax, Bmax,Wmax, R, Z ,T . Hence,
in this example, we have n = 8. Moreover, Figure 1 shows that its
partition of Qn contains 5 parts.
Observation 2. It is natural to assume that low-level perfor-
mance metrics depend rationally on program parameters and data
parameters. While this latter fact is not explicitly discussed in the
CUDA and MWP-CWP models, it is an obvious observation for
very similar models which are based on the Parallel Random Ac-
cess Machine (PRAM) [16, 36], including PRAM-models tailored to
GPU code analysis such as TMM [27] and MCM [19].
Given our assumption that the high-level performance metric
E is computable as a rational program depending on hardware pa-
rameters, low-level performancemetrics, and program parameters,
we can therefore assume that E can be expressed as a rational pro-
gram of the hardware parameters, the program parameters, and
the data parameters. That is, we replace the direct dependency on
low-level metrics with a dependency on the data and program pa-
rameters. Moreover, when applied to the multithreaded program
P to be executed on a particular multiprocessor, we can assume
2Here, rational function is in the sense of algebra, see [14] and section 4.2 below.
Brandt, Mohajerani, Moreno Maza, Paudel, Wang
that E can be expressed as a rational program depending only on
program and data parameters.
3.2 Consequences
Conseqence 1. It follows from Observation 2 that we can use
the MWP-CWP performance model on top of CUDA (see Exam-
ples 1 and 2) to determine a rational program estimating E de-
pending on only program parameters and data parameters. Recall
that building such a rational program R is our goal. Once R is
known, it can be used at runtime (that is, when the program P
is run on specific input data) to compute optimal values for the
program paramters P1, . . . , Pp .
Conseqence 2. Observation 1 suggests an approach for deter-
mining R . Suppose that a flow chart C representing R is partially
known; to be precise, suppose that the decision nodes are known
(that is, mathematical expressions defining them are known) while
the process nodes are not. From Observation 1, each process node
that includes no integer part instructions is given by a series of ra-
tional functions. Determining each of those rational functions can
be achieved by solving an interpolation or curve fitting problem.
Knowing that a function to-be-determined is rational allows us to
perform parameter estimation techniques (e.g. the method of least
squares) to define the fitting curve and thus the rational function.
Conseqence 3. In the above consequence, one could even con-
sider relaxing the assumption that the decision nodes are known.
Indeed, each decision node is given by a series of rational func-
tions. Hence, those could be determined by solving curve fitting
problems as well. However, we shall not discuss this direction fur-
ther since this is not needed in the proposed technique presented
in Sections 3.3 and 4.
3.3 Algorithm
In this section, the notations and hypotheses are the same as in
Sections 2, 3.1 and 3.2. In particular, we assume that:
(i) E is a high-level performance metric for the multithreaded
program P (e.g. execution time, memory consumption, and
hardware occupancy),
(ii) E is given (by a program execution model, e.g. CUDA or
MWP-CWP) as a rational program depending on hardware
parametersH1, . . . ,Hh , low-level performancemetrics L1, . . . , Lℓ ,
and program parameters P1, . . . , Pp (see Examples 1 and 2),
(iii) the values of the hardware parameters are known at compile-
time forP while the values of the data parametersD1, . . . ,Dd
are known at runtime for P ,
(iv) the data and program parameters take integer values.
Extending (iv) we further assume that the possible values of the
program parameters P1, . . . , Pp belong to a finite set F ⊂ Z
p . That
is to say, the possible values of (P1, . . . , Pp ) are a tuple of the form
(π1, . . . , πp ) ∈ F , with each πi being an integer. Due to the nature
of program parameters they are not necessarily all free variables
(i.e. a program parameter may depend on the value of another pro-
gram parameter). See Section 4 for such an example.
Following Consequence 1, we can compute a rational program
R estimating E as a function ofD1, . . . ,Dd and P1, . . . , Pp . To do so,
we shall first compute rational functionsд1(D1, . . . ,Dd , P1, . . . , Pp ),
. . . , дℓ (D1, . . . ,Dd , P1, . . . , Pp ), estimating L1, . . . ,Lℓ respectively.
Once R has been computed, it can be used when P is being exe-
cuted (thus at runtime) in order to determine optimal values for
P1, . . . , Pp , on given values ofD1, . . . ,Dd . The entire process is de-
composed below into six steps: the first three take place at compile-
time while the other three are performed at runtime.
(1) Data collection: In order to determine each of the rational
functions
д1(D1, . . . ,Dd , P1, . . . , Pp ), . . . ,дℓ(D1, . . . ,Dd , P1, . . . , Pp )
by solving a curve fitting problem, we proceed as follows:
(a) we select a set of points K1, . . . ,Kk in the space of the
possible values of (D1, . . . ,Dd , P1, . . . , Pp );
(b) we run (or emulate) the programP at each pointK1, . . . ,Kk
andmeasure the low-level performancemetrics L1, . . . , Lℓ ;
and
(c) for each 1 ≤ i ≤ ℓ, we record the values (vi,1, . . . ,vi,k )
measured for Li at the respective points K1, . . . ,Kk .
(2) Rational function estimation: For each 1 ≤ i ≤ ℓ, we use
the values (vi,1, . . . ,vi,k ) measured for Li at the respective
pointsK1, . . . ,Kk to estimate the rational functionдi (D1, . . . ,Dd , P1, . . . , Pp ).
We observe that ifvi,1, . . . ,vi,k were known exactly (that is,
without error) the rational functionдi (D1, . . . ,Dd , P1, . . . , Pp )
could be determined exactly via rational function interpola-
tion. However, in practice, the valuesvi,1, . . . ,vi,k are likely
to be “noisy”. Hence, techniques from numerical analysis,
like the method of least squares, must be used instead. Con-
sequently, we compute a rational function дˆi (D1, . . . ,Dd , P1, . . . , Pp )
which approximates дi (D1, . . . ,Dd , P1, . . . , Pp )when evalu-
ated at the points K1, . . . ,Kk .
(3) Code generation: In order to generate the rational pro-
gram R , we proceed as follows:
(a) we convert the rational program representing E into code,
say in the C programming language, essentially encoding
the CFG for computing E;
(b) we convert each of дˆi (D1, . . . ,Dd , P1, . . . , Pp ), (for 1 ≤
i ≤ ℓ) into code as well, more precisely, into sub-routines
evaluating L1, . . . , Lℓ , respectively; and
(c) we include those sub-routines into the code computing E,
which yields the desired rational program R .
(4) Rational program evaluation: At this step, the rational
program R is completely determined and can be used when
the program P is executed. At the time of execution, the
data parameters D1, . . . ,Dd are given particular values, say
δ1, . . . ,δd , respectively. For those specified values ofD1, . . . ,Dd
and for all practically meaningful values of P1, . . . , Pp from
the set F , we compute an estimate of E using R . Here, “prac-
tically meaningful” refers to the fact that the values of the
program parameters P1, . . . , Pp are likely to be constrained
by the values of the data parameters D1, . . . ,Dd . For in-
stance, if P1, P2 are the two dimension sizes of a two-dimensional
thread-block of a CUDA kernel performing the transposi-
tion of a squarematrix of orderD1, then the inequality P1P2 ≤
D21 is meaningful. Despite of this remark, this step can still
be seen as an exhaustive search, and, it is practically feasible
to do so for three reasons:
(a) p is small, typically p = 1, p = 2 or p = 3, see Section 4;
A Technique for Finding Optimal Program Launch Parameters Targeting Manycore Accelerators
(b) the set of values in F are small (typically on the order of
10 elements in case of MWP-CWP model); and
(c) the program R simply evaluates a few formulas and thus
runs almost instantaneously.
(5) Selection of optimal values of program parameters: In
general, when the search space of values of the program pa-
rameters P1, . . . , Pp is large, a numerical optimization tech-
nique is required for this step. For CUDA kernels, however,
selecting a value point (also called a configuration in the rest
of this report) which is optimal can be done with an exhaus-
tive search, as seen in the previous step, and is hence triv-
ial. The only possible challenge in this case is that several
configurations may optimize E. When this happens, using
a second performance metric can help refine the choice of a
best configuration.
(6) Program execution: Once a best configuration is selected,
the program P can finally be executed using this configura-
tion of program parameters P1, . . . , Pp along with with the
values δ1, . . . , δd of D1, . . . ,Dd .
4 IMPLEMENTATION
In the previous sections we gave an overview of our technique for
general multithreaded programs P . For the embodiment of this
technique, and the resulting experimentation and performance anal-
ysis, we focus on programs for GPU architectures using the pro-
gramming model CUDA. These programs interleave (i) serial code
which is executed on the host (the CPU) and, multithreaded code
which is executed on the device (the GPU). The host launches a de-
vice code execution by calling a particular type of function, called
a kernel. Optimizing a CUDA program for better usage of the com-
puting resources on a device essentially means optimizing each of
its kernels. Therefore, we apply the ideas of Sections 3.1 through
3.3 to CUDA kernels.
In the case of a CUDA kernel the data parameters are a descrip-
tion of the input data size. In many examples this is a single pa-
rameter, say N , describing the size of an array (or the order of a
two-dimensional array), the values of which are usually powers of
2. On the other hand, the program parameters are typically an en-
coding of the program configuration, namely the thread block con-
figuration. In CUDA the thread block configuration defines both
the number of dimensions (1, 2, or 3) and the size in each dimen-
sion of each thread block — a contiguous set of threads mapped
together to a single SM processor. For example, a possible thread
block configuration may be 1024× 1× 1 (a one-dimensional thread
block), or 16 × 16 × 2 (a three-dimensional thread block). In all
cases the CUDA model restricts the number of threads per block
(i.e. the product of the size in each dimension) to be 1024 on devices
of Compute Capability (CC) 2.x, or newer. This limits the possible
thread block configurations, and, moreover, limits the set F from
which the program parameters are chosen.
Throughout the current and the following sections our discus-
sionsmake use of these specialized terms, input data size and thread
block configuration, for data and programparameters, respectively,
in order to make clear explanations and associations between the-
ory and practice.
4.1 Data collection
As detailed in Step 1 of Section 3.3, we must collect data and statis-
tics regarding certain performance counters and runtime metrics
(which are thoroughly defined in [21] and [1]). These metrics to-
gether allow us to estimate the execution time of the program and
can be partitioned into three categories.
First, we need architecture-specific performance counters, basi-
cally, characteristics dictated by the CC of the target device. Such
hardware characteristics can be obtained at compile time, as the
target CC is specified at this time. These characteristics include
the number of registers used in each thread, the amount of static
shared memory per thread block, and the number of (arithmetic,
memory, and synchronization) instructions per thread. This infor-
mation can easily be obtained fromaCUDA compiler (e.g. NVIDIA’s
NVCC).
Second, we have the values that depend on the behavior of the
kernel code at runtime which we will refer to as kernel-specific
runtime values. This includes values impacted by memory access
patterns, namely, the number of coalesced and non-coalesced mem-
ory accesses per warp, the number of memory instructions in each
thread that cause coalesced and non-coalesced accesses, and even-
tually, the total number of warps that are being executed. For this
step we have two choices. We can use an emulator, which can
mimic the behavior of a GPU on a typical Central Processing Unit
(CPU). This is important if we cannot guarantee that a device is
available at compile time. For this purpose, we have configured
Ocelot [13], a dynamic compilation framework for GPU comput-
ing as well as an emulator for low-level CUDA (PTX [2]) code, to
meet our needs. Alternatively, we can use a profiler to collect the
required metrics on an actual device. For this solution, we have
used NVIDIA’s nvprof[12].
Third, in order to compute a more precise estimate of the clock
cycles, we need device-specific parameters which describe an ac-
tual GPU card. One subset of such parameters can be determined
by microbenchmarking the device (see [28] and [40]), this includes
thememory latency, thememory bandwidth, and the departure de-
lay for coalesced/non-coalesced accesses. Another subset of such
parameters can be easily obtained by consulting the vendor’s guide
[11], or by querying the device itself via the CUDA API. This in-
cludes the number of SMs on the card, the clock frequency of SM
cores, and the instruction delay.
In summary, our general method proceeds as follows:
(1) Beginning with a CUDA program, we minimally annotate
the host code to make it compatible with our pre-processor
program, specifying the code fragment in the host codewhich
calls the kernel. We also specify the CC of the target device.
(2) The pre-processor program prepares the code for collect-
ing kernel-specific runtime values. This step includes source
code analysis in order to extract the list of kernels, the list
of kernel calls in the host code, and finally, the body of each
kernel (which will be used for further analysis). In the case
of Ocelot emulation, the pre-processor replaces the device
kernel calls with equivalent Ocelot kernel calls, while in the
case of actual profiling, the device kernel calls are left un-
touched. Finally, the pre-processor program uses the speci-
fiedCC in the host code to determine the architecture-specific
Brandt, Mohajerani, Moreno Maza, Paudel, Wang
performance counters for each kernel. The result of this step
is an executable file, which we will refer to as the instru-
mentor. The instrumentor takes as input the same program
parameters as the original CUDA code.
(3) A driver program orchestrates the combination of device-
specific characteristics (i.e. a device profile) and various con-
figurations of program and data parameters to be passed
to the instrumentor. Running the instrumentor (either emu-
lated on top of Ocelot, or running via a profiler on a GPU)
measures and records the required performance metrics.
4.2 Rational function estimation
Once data collection is completed, we are able to move on to the
estimation step, see Step 2 of Section 3.3. That is, determining pre-
cisely the rational functions fb (X1, . . . ,Xn), for each desired code
block b , which are to be used within the rational program (see Sec-
tion 2). These Xi variables are simply combinations of program pa-
rameters, data parameters, and intermediate values, obtained dur-
ing the execution of the rational program. These intermediate val-
ues are, in turn, are rational functions of the program and data
parameters. Hence, the variables X1, . . . ,Xn are used only to sim-
plify notations.
fb (X1, . . . ,Xn) =
pb (X1, . . . ,Xn)
qb (X1, . . . ,Xn)
=
α1 · (X
0
1 . . .X
0
n) + . . . + αi · (X
u1
1 . . .X
un
n )
β1 · (X
0
1 . . .X
0
n) + . . . + βj · (X
v1
1 . . .X
vn
n )
A rational function is simply a fraction of two polynomials.With
a degree bound (an upper limit on the exponent) on each variable
Xk in the numerator and the denominator, say uk and vk , respec-
tively, these polynomials can be defined up to some parameters,
namely the coefficients of the polynomials, sayα1, . . . ,αi and β1, . . . , βj .
Using the previously collected data (see Section 4.1) we per-
form a parameter estimation (for each rational function) on the
polynomial coefficients α1, . . . ,αi , β1, . . . , βj . in order to uniquely
determine the rational function. We note that while our model
is non-linear, it is indeed linear in the model-fitting parameters
α1, . . . ,αi , β1, . . . , βj .
The system of linear equations defined by the model-fitting pa-
rameters α1, . . . ,αi , β1, . . . , βj can very classically be defined as an
equation of the formAx = b, whereA ∈ Rm×n (the sample matrix)
and b ∈ Rm×1 (the right-hand side vector) encode the collected
data, while the solution vector x ∈ Rn×1 encodes the model-fitting
parameters3. An exact solution is rarely defined if m < n (where
an infinite number of solutions is possible) or ifm > n. Therefore,
we wish to get a solution in the “least squares sense”, that is, find
x such that residual is minimized:
x = min
x
| |r| |22 = min
x
| |b − Ax| |22
Many different methods exist for solving this so-called linear
least squares problem, such as the normal-equations, orQR-factorization,
however, these methods are either numerically unstable (normal-
equations), or will fail if the sample matrix is rank-deficient (both
3Keen observers will notice that, for rational functions, we must actually solve a sys-
tem of homogeneous equations. Such details are omitted here, but we refer the reader
to [8, Chapter 5].
normal-equations and QR) [10]. We rely then on the singular value
decomposition (SVD) of A to solve this problem. This decomposi-
tion is very computationally intensive, much more than that of
normal-equations or QR, but is also much more numerically sta-
ble, as wel as being capable of producing solutions with a rank-
deficient sample matrix.
We are highly concerned with the robustness of ourmethod due
to three problems present in our particular situation:
(1) the sample matrix is very ill-conditioned;
(2) the samplematrixwill often be (numerically) rank-deficient;
(3) we are interested in using our fittedmodel for extrapolation,
meaning any numerical error in the model fitting will grow
very quickly [10].
While (3) is an issue inherent to our model fitting problem, (1) and
(2) result from our choice of model, and how the sample points
(X1, . . . ,Xn) are chosen, respectively. Using a rational function (or
polynomial) as the model for which we wish to estimate parame-
ters presents numerical problems. The resulting sample matrix is
essentially a Vandermonde matrix. These matrices, while theoreti-
cally of full rank, are extremely ill-conditioned [7, 10]. To discuss
(2) we must now consider the so-called “poised-ness” of the sample
points.
For a univariate polynomial model, a sample matrix is guaran-
teed to have full rank if the sample points are all distinct. We say
that the points producing the sample matrix are poised is the ma-
trix is of full rank. In the multivariate case, it is much more difficult
to make such a guarantee [9, 30]. Given a multivariate polynomial
model with a total degree d , a set of points is poised if, for each
point xi , there exists d hyperplanes corresponding to this point
such that all other points lie on at least one of these hyperplanes
but xi does not [9]. For example, with 2 variables and a total degree
3, the sample points are poised if, for each sample point, there ex-
ists 3 lines on which all other points lie but the selected point does
not.
This geometric restriction on sample points is troublesome for
our purposes. For example, if we are trying to model a function
which has, in part, CUDA thread block dimensions as variables,
then these dimensions should be chosen such that their product
is a multiple of 32 [32]. However, this corresponds very closely
with the geometric restriction we are trying to avoid. Consider the
possible 2D thread block dimensions (B0,B1) for a CUDA kernel
(Figure 2). Many subsets of these points are collinear, presenting
challenges for obtaining poised sample points. As a result of all of
this we are highly likely to obtain a rank-deficient sample matrix.
Despite all of these challenges our parameter estimation tech-
niques are well-implemented in optimized C code. We use opti-
mized algorithms from LAPACK (Linear Algebra PACKage) [4] for
singular value decomposition and linear least squares solvingwhile
rational function and polynomial implementations are similarly
highly optimized thanks to the Basic Polynomial Algebra Subpro-
grams (BPAS) library [5, 8]. With parameter estimation complete
the rational functions required for the rational program are fully
specified and we can finally construct it.
A Technique for Finding Optimal Program Launch Parameters Targeting Manycore Accelerators
20 21 22 23 24 25 26 27 28 29
20
21
22
23
24
25
26
27
28
29
B0
B
1
2D Thread Block Configurations
Figure 2: Possible 2D thread block configurations for a
CUDA kernel.
4.3 Rational programs
In practice, the use of rational programs is split into two parts: the
generation of the rational program at the compile-time of the mul-
tithreaded program P , and the use of the rational program during
the runtime of P .
4.3.1 Compile-time code generation. We are now at Step 3 of Sec-
tion 3.3. We look to define a rational program which evaluates the
high-level metric E of the program P using the MWP-CWPmodel.
In implementation, this is achieved by using a previously defined
rational program template which contains the formulas and case
discussion of the MWP-CWP model, independent of the particu-
lar program being investigated. Using simple regular expression
matching and text manipulation we combine the rational program
template with the rational functions determined in the previous
step to obtain a rational program specialized to the multithreaded
program P . The generation of this rational program is performed
completely during compile-time, before the execution of the pro-
gram itself.
4.3.2 Runtime optimization. At runtime, the input data sizes (data
parameters) are well known. In combination with the known hard-
ware parameters, since the program is actually being executed on
a specific device, we are able to specialize every parameter in the
rational program and obtain an estimate for the high-level met-
ric E. This rational program is then easily and quickly evaluated
during (or immediately prior to) the execution of P . Evaluating
the rational program for each possible thread block configuration,
as restricted by our data parameters and the CUDA programming
model itself, we determine a thread block configuration which op-
timizes E. The program P is finally executed using this optimal
thread block configuration. Therefore, we have completed Steps 4
to 6 of Section 3.3.
5 EXPERIMENTATION
In this section we highlight the performance and experimentation
of our technique applied to example CUDA programs from the
PolyBench benchmarking suite [17]. For each PolyBench example,
we collect statistical data (see Section 4) with various input sizes
N (powers of 2 typically between 64 and 512). These are relatively
small sizes to allow for fast data collection. The collected data is
used to produce the rational functions and, lastly, the rational func-
tions are combined to form the rational programs. Table 1 gives a
short description for each of the benchmark examples, as well as
an index for each kernel which we use for referring to that kernel
in the experimental results.
The ability of the rational programs we generate to effectively
optimize the program parameters of each example CUDA program
is summarized in Tables 2 and 3. The definitions of the notations
used in these tables are as follows:
• N : the input size of the problem, usually a power of 2,
• Ec: estimated execution time of the kernel measured in clock-
cycles,
• Cr : best thread block configuration as decided by the ratio-
nal program,
• Cd : default thread block configuration given by the bench-
mark,
• Ecr : Ec of configuration Cr as decided by the rational pro-
gram,
• Ci : best thread block configuration as decided by instrumen-
tation,
• Eci : Ec of configuration Ci as decided by instrumentation,
• Best_time (B_t): the best CUDA running time (i.e. the time
to actually run the code on a GPU) of the kernel among all
possible configurations,
• Worst_time (W_t): the worstCUDA running time of the ker-
nel among all possible configurations.
To evaluate the effectiveness of our rational program, we com-
pare the results obtained from it with an actual analysis of the pro-
gram using the performance metrics collected by the instrumen-
tor. This comparison gives us estimated clock cycles as shown in
Table 2. The table shows the best configurations (Ci and Cr ) for
N = 256 and N = 512 for each example, as decided from both the
data collected during instrumentation and the resulting rational
program. This table is meant to highlight that the construction of
the rational program from the instrumentation data is performed
correctly. Here, either the configurations should be the same or, if
they differ, then the estimates of the clock-cycles should be very
similar; indeed, it is possible that different thread block configu-
rations (program parameters) result in the same execution time.
Hence, the rational programmust choose one of the possiblymany
optimal configurations. Moreover, we supply the value “Collected
Ec” which gives the Ec given by the instrumentation for the thread
block configurationCr . This value shows that the rational program
is correctly calculating Ec for a given thread block configuration.
The values “Collected Ec” and Ecr should be very close.
Figure 3 illustrates detailed results for Kernel 8.2 when N = 512.
The “CUDA” plot shows the real runtime of the kernel launched
on a device while the “RP-Estimated” plot shows the estimates ob-
tained from our rational program. The x axis shows the various
thread block configurations. The runtime values on y axis are nor-
malized to values between 0 and 100.
Table 3 shows the best configuration (Cr ) for N = 1024 and
N = 2048 estimated by the rational program of each example.
Notice these values of N are much larger than those used dur-
ing the instrumentation and construction of the rational program.
Brandt, Mohajerani, Moreno Maza, Paudel, Wang
This shows that the rational program can extrapolate on the data
used for its construction and accurately estimate optimal program
parameters for different data parameters. To compare the thread
block configuration deemed optimal by the rational program (Cr )
and the optimal thread block configuration determined by actu-
ally executing the program with various thread block configura-
tions we calculate the value named “Error”. This is a percentage
given by (Cr t - Best_time)/(Worst_time - Best_time) * 100%, where
“Cr t” is the actual running time of the program using the thread
block configuration Cr in millisecond. Best_time and Worst_time
are as defined above. The error percentage shows the quality of our
estimation; it clearly shows the difference between the estimated
configuration and the actual best configuration. This percentage
should be as small as possible to indicate a good estimation.
As an exercise in the practicality of our technique, we also show
the default thread block configuration (Cd ) given by the bench-
marks and “Cd t”, the actual running time of the program using the
thread block configuration Cd in the same table. Compared to the
default configurations, the rational program determines a better
configuration formore than half of the kernels. Notice that, in most
cases, the same default configuration is used for all kernels within
a single example. In contrast, our rational program can produce
different configurations optimized for each kernel individually.
Asmentioned in Section 4.1, we use bothOcelot [13] andNVIDIA’s
nvprof[12] as instrumentors in the data collection step. The “Er-
ror” of the implementation based on data collected by Ocelot is
shown in the column “Ocel” in Table 3. We can see that the overall
performance is better with nvprof, but for some benchmarks, for
example Kernel 12, the best configuration picked by the rational
program is not as expected. We attribute this to the limitations of
the underlying MWP-CWP model we use.
More experimental results may be found on our GitHub reposi-
tory: https://github.com/orcca-uwo/Parametric_Estimation/tree/master/plots
Benchmark Description Kernel name Kernel ID
2D Convolution 2-D convolution Convolution2D_kernel 1
FDTD_2D
2-D Finite Different fdtd_step1_kernel 2.1
Time Domain
fdtd_step2_kernel 2.2
fdtd_step3_kernel 2.3
2MM 2 Matrix Multiplications mm2_kernel1 3
3MM 3 Matrix Multiplications mm3_kernel1 4
BICG BiCGStab Linear Solver
bicg_kernel1 5.1
bicg_kernel2 5.2
GEMM Matrix Multiplication gemm_kernel 6
3D_Convolution 3-D Convolution convolution3D_kernel 7
ATAX
Matrix Transpose and atax_kernel1 8.1
Vector Multiplication atax_kernel2 8.2
GESUMMV
Scalar, Vector and
gesummv_kernel 9
Matrix Multiplication
SYRK Symmetric rank-k operations syrk_kernel 10
MVT
Matrix Vector Product and mvt_kernel1 11.1
Transpose mvt_kernel2 11.2
SYR2K Symmetric rank-2k operations syr2k_kernel 12
CORR Correlation Computation
corr_kernel 13.1
mean_kernel 13.2
reduce_kernel 13.3
std_kernel 13.4
COVAR Covariance Computation
covar_kernel 14.1
mean_kernel 14.2
reduce_kernel 14.3
GRAMSCHM Gram-Schmidt decomposition
gramschmidt_kernel1 15.1
gramschmidt_kernel2 15.2
gramschmidt_kernel3 15.3
Table 1: Benchmark names and descriptions
(
1
,3
2
)
(
2
,1
6
)
(
4
,8
)
(
8
,4
)
(
1
6
,2
)
(
3
2
,1
)
(
1
,6
4
)
(
2
,3
2
)
(
4
,1
6
)
(
8
,8
)
(
1
6
,4
)
(
3
2
,2
)
(
6
4
,1
)
(
1
,1
2
8
)
(
2
,6
4
)
(
4
,3
2
)
(
8
,1
6
)
(
1
6
,8
)
(
3
2
,4
)
(
6
4
,2
)
(
1
2
8
,1
)
(
1
,2
5
6
)
(
2
,1
2
8
)
(
4
,6
4
)
(
8
,3
2
)
(
1
6
,1
6
)
(
3
2
,8
)
(
6
4
,4
)
(
1
2
8
,2
)
(
2
5
6
,1
)
(
1
,5
1
2
)
(
2
,2
5
6
)
(
4
,1
2
8
)
(
8
,6
4
)
(
1
6
,3
2
)
(
3
2
,1
6
)
(
6
4
,8
)
(
1
2
8
,4
)
(
2
5
6
,2
)
(
5
1
2
,1
)
(
2
,5
1
2
)
(
4
,2
5
6
)
0
20
40
60
80
100
Block dimensions (x, y)
T
im
e
(n
o
rm
a
li
ze
d
)
RP-Estimated (n=512)
CUDA (n=512)
Figure 3: Example of Kernel atax_kernel2 from Polybench suite.
ID N
Data collection Rational Program
Ci Eci Cr Ecr Collected Ec
1
256 16 × 2 198070 16 × 2 310463 198070
512 16 × 4 794084 16 × 16 947799 889378
2.1
256 16 × 2 44336 32 × 1 40213 44336
512 16 × 4 147622 32 × 2 135821 147622
2.2
256 32 × 1 19430 32 × 1 52096 19430
512 32 × 2 64475 64 × 1 135810 64475
2.3
256 32 × 1 19476 16 × 2 97964 19640
512 32 × 2 64643 16 × 4 326474 65191
3
256 16 × 2 351845192 32 × 1 1365006816 351845192
512 16 × 4 4610508174 16 × 64 18148896630 4610508174
4
256 16 × 2 351845192 16 × 2 1364711040 351845192
512 16 × 4 4610508174 16 × 64 18147912950 4610508174
5.1
256 8 × 4 6730198 16 × 2 13105400 6927829
512 16 × 2 27224663 32 × 1 56699908 28801623
5.2
256 8 × 4 6729688 16 × 2 13105400 6927319
512 16 × 2 27223641 32 × 1 56699908 28800601
6
256 16 × 2 352450569 32 × 1 1365746112 352450569
512 16 × 4 4614492176 16 × 4 18091801600 4614492176
7
256 16 × 2 412448 8 × 4 419355 423967
512 16 × 4 1382453 16 × 4 1382446 1382453
8.1
256 8 × 4 6741283 16 × 2 13836355 6938145
512 16 × 2 27245283 32 × 1 57486243 28819169
8.2
256 8 × 4 6741791 16 × 2 13836672 6938653
512 16 × 2 27246303 32 × 1 57486880 28820189
9
256 8 × 4 52142036 16 × 2 104085531 53194198
512 16 × 2 211244951 32 × 1 438726895 219647900
10
256 16 × 2 352450567 32 × 1 1365746112 352450567
512 16 × 4 4614492172 16 × 4 18091801600 4614492172
11.1
256 8 × 4 6741790 16 × 2 13450360 6938652
512 16 × 2 27246302 32 × 1 56715288 28820188
11.2
256 8 × 4 6741283 16 × 2 13450360 6938145
512 16 × 2 27245283 32 × 1 56715288 28819169
12
256 16 × 2 1090910216 16 × 2 4293143808 1090910216
512 16 × 4 14402949134 32 × 16 57121538780 14402949134
13.1
256 8 × 4 29762 16 × 2 431621481191 34578
512 1 × 32 81924 32 × 1 7075540802475 226841
13.2
256 8 × 4 159563703552 16 × 2 5176052 222555249312
512 16 × 2 3428474954532 32 × 1 22396088 3534048733179
13.3
256 8 × 4 6938660 1 × 32 5355 33687503
512 16 × 2 28036520 1 × 64 22447 502559959
13.4
256 8 × 4 2526616 16 × 2 14040368 2659225
512 16 × 2 10298524 32 × 1 58770372 11353249
14.1
256 8 × 4 2526616 32 × 1 474507050858 2924449
512 16 × 2 10298524 32 × 1 7115480129956 11353249
14.2
256 8 × 4 162030833356 16 × 2 5147212 225920838132
512 16 × 2 3454820728440 32 × 1 22398140 3560802413900
14.3
256 64 × 1 1060 32 × 1 1484 1069
512 128 × 1 1085 128 × 1 1677 1085
15.1
256 16 × 2 77754074 256 × 1 70701 646426704
512 32 × 1 324346910 512 × 1 136765 5169397920
15.2
256 256 × 1 68119 8 × 4 1871 1145070
512 512 × 1 136231 8 × 8 1873 8409096
15.3
256 16 × 4 1492 16 × 2 78089365 1505
512 32 × 2 1492 32 × 1 324813912 1505
Table 2: Sanity check (nvprof)
A Technique for Finding Optimal Program Launch Parameters Targeting Manycore Accelerators
ID
N Default Config Rational Program
B_t W_t Error Ocel
Cd Cd t Cr Cr t
1
1024 32 × 8 0.04 32 × 8 0.04 0.04 0.34 0.00 6.11
2048 32 × 8 0.16 128 × 4 0.16 0.15 1.31 0.86 6.43
2.1 1024 32 × 8 18.2 32 × 1 23.4 18.1 84.0 7.93 0.06
2.2 1024 32 × 8 19.9 64 × 2 19.7 19.5 82.3 0.26 0.04
2.3 1024 32 × 8 25.7 16 × 2 27.2 25.7 109 1.79 0.02
3
1024 32 × 8 5.73 16 × 2 9.67 5.73 89.0 4.73 0.57
2048 32 × 8 46.7 16 × 2 78.6 46.7 707 4.83 1.10
4
1024 32 × 8 5.77 16 × 2 9.68 5.73 89.8 4.69 0.63
2048 32 × 8 47.3 16 × 2 78.8 47.1 731 4.63 1.30
5.1
1024 256 × 1 0.25 32 × 1 0.25 0.25 9.27 0.01 0.08
2048 256 × 1 0.51 32 × 1 0.50 0.50 37.2 0.00 1.10
5.2
1024 256 × 1 0.19 32 × 1 0.18 0.16 9.48 0.17 0.00
2048 256 × 1 0.38 32 × 1 0.36 0.36 38.5 0.01 0.02
6
1024 32 × 8 5.76 32 × 1 9.69 5.76 95.6 4.37 0.02
2048 32 × 8 46.3 32 × 1 78.3 46.3 725 4.70 1.26
7 1024 32 × 8 56.0 8 × 4 81.6 55.2 369 8.38 1.49
8.1
1024 256 × 1 0.20 32 × 1 0.19 0.17 9.87 0.15 0.36
2048 256 × 1 0.39 32 × 1 0.37 0.37 42.5 0.00 0.28
8.2
1024 256 × 1 0.26 32 × 1 0.24 0.24 9.35 0.03 0.42
2048 256 × 1 0.50 32 × 1 0.49 0.49 40.3 0.01 0.67
9
1024 256 × 1 0.40 32 × 1 0.37 0.34 2.10 1.53 1.04
2048 256 × 1 0.80 32 × 1 0.74 0.73 6.32 0.08 0.44
10
1024 32 × 8 0.02 32 × 1 34.3 17.9 87.9 23.38 0.00
2048 32 × 8 239 32 × 1 968 143 2802 31.03 1.39
11.1
1024 256 × 1 0.20 32 × 1 0.19 0.17 9.83 0.17 0.00
2048 256 × 1 0.39 32 × 1 0.38 0.36 39.3 0.03 0.03
11.2
1024 256 × 1 0.25 32 × 1 0.24 0.24 9.55 0.02 0.05
2048 256 × 1 0.50 32 × 1 0.49 0.49 37.4 0.00 0.15
12
1024 32 × 8 90.5 32 × 1 338 32.0 338 100.0 0.25
2048 32 × 8 2295 32 × 1 5276 251 6723 77.64 0.89
13.1
1024 256 × 1 320 8 × 8 256 224 2461 1.42 0.00
2048 256 × 1 1359 1 × 128 7410 1186 18361 36.23 0.63
13.2
1024 256 × 1 0.26 32 × 1 0.26 0.25 1.98 0.17 6.21
2048 256 × 1 0.52 32 × 1 0.51 0.51 7.22 0.00 1.85
13.3
1024 32 × 8 0.05 1 × 128 0.02 0.02 0.21 0.00 3.02
2048 32 × 8 0.21 1 × 128 0.10 0.10 0.85 0.13 0.46
13.4
1024 256 × 1 0.26 32 × 1 0.25 0.25 2.18 0.15 0.09
2048 256 × 1 0.53 32 × 1 0.52 0.52 8.16 0.00 1.43
14.1
1024 256 × 1 1065 32 × 8 394 226 2462 7.54 0.00
2048 256 × 1 2116 64 × 16 6119 1182 18343 28.76 7.18
14.2
1024 256 × 1 1.63 64 × 1 0.26 0.25 2.30 0.39 0.11
2048 256 × 1 3.29 32 × 1 0.51 0.51 7.58 0.05 1.39
14.3
1024 32 × 8 0.03 128 × 1 0.01 0.01 1.48 0.00 0.07
2048 32 × 8 0.11 128 × 1 0.01 0.01 5.94 0.00 0.00
15.1
1024 256 × 1 17.3 512 × 1 17.5 17.1 23.1 7.27 6.91
2048 256 × 1 82.8 512 × 1 83.1 82.7 87.3 7.98 24.15
15.2
1024 256 × 1 3.03 64 × 1 2.93 2.93 4.34 0.00 27.98
2048 256 × 1 6.53 32 × 1 6.64 6.36 11.5 5.47 24.56
15.3
1024 256 × 1 389 32 × 1 375 375 522 0.00 10.42
2048 256 × 1 2000 32 × 1 1947 1944 3818 0.11 8.63
Table 3: Proof of Concept (nvprof)
6 RELATED WORK
We briefly review a number of previous works on the performance
estimation of parallel programs in the context of GPUs and CUDA
programming model. Ryoo et al [33] present a performance tun-
ing approach for general purpose applications on GPUs which is
limited to pruning the optimization space of the applications and
lacks support for memory-bound cases. Hong et al [21] describe
an analytical model that estimates the execution time of GPU ker-
nels. They use this model to identify the performance bottlenecks
(via static analysis and emulation) and optimize the performance
of a given CUDA kernel. Baghsorkhi et al [6] present a dynam-
ically adaptive approach for modeling performance of GPU pro-
grams which relies on dynamic instrumentation of the program at
the cost of adding runtime overheads. This model cannot statically
determine the optimal program parameters at compile time. In [25]
the authors claim to have a static analysis method which does not
require running the code for determining the best configuration
for CUDA kernels. They analyze the assembly code generated by
the NVCC compiler and estimate the IPC of each instruction, but,
there is no analysis of memory access pattern. The authors assume
that each CUDA kernel consists of a for-loop nest and that the ex-
ecution time is proportional to the input problem size. These are,
obviously, very strong assumptions, and impractical for real world
applications. Moreover, the paper only reports on 4 test-cases (un-
like a standard test suite such as PolyBench which includes 15 ex-
amples).
More recently, the author of [37, 38] has suggested a new GPU
performance model which relies on Little’s law [26], that is, mea-
suring concurrency as a product of latency and throughput. The
suggested model takes bothwarp and instruction concurrency into
account. This approach stands in opposition to the common view
of many models, including those of [11], [21], [34], and [6], which
only consider warps as the unit of concurrency. The model mea-
sures the performance in terms of required concurrency for the
latency-bound cases (when occupancy is small and throughput grows
with occupancy) and throughput-bound cases (when occupancy is
large while throughput is at a maximum and is constant w.r.t. oc-
cupancy). The author’s analysis of models in [6, 11, 21, 34] indi-
cates that the most common limitation among all of them is the
significant underestimation of occupancy when the arithmetic in-
tensity (number of arithmetic instructions per memory access) of
the code is intermediate. Other major drawbacks include the low
accuracy of the prediction when the code is memory intensive (i.e.,
low arithmetic intensity), and underestimation, or overestimation,
of the throughput (in the case of [6] and [22], respectively). Notice-
ably, the author emphasizes the importance of cusp behavior ; that
is, "hiding both arithmetic and memory latency at the same time
may require more warps than hiding either of them alone" [37].
Cusp behavior is observed in practice, however, it is not indicated
in the previously mentioned performance prediction models.
7 CONCLUSIONS AND FUTUREWORK
Ourmain objective is to provide a solution for portable performance
optimization of multithreaded programs. In the context of CUDA
programs, other research groups have used techniques like auto-
tuning [18, 23], dynamic instrumentation [24], or both [35]. Auto-
tuning techniques have achieved great results in projects such as
ATLAS [39], FFTW [15], and SPIRAL [31] in which part of the op-
timization process occurs off-line and then it is applied and refined
on-line. In contrast, we propose to build a completely off-line tool,
namely, the rational program.
For this purpose, we present an end-to-end novel technique for
compile-time estimation of program parameters for optimizing a
chosen performance metric, targeting manycore accelerators, par-
ticularly,CUDA kernels. Ourmethod can operatewithout precisely
knowing the numerical values of any of the data or program param-
eters when the optimization code is being constructed. This natu-
rally leads to a case discussion depending on the values of those
parameters, which is precisely what can be accomplished by our
notion of a rational program.Moreover, our idea is verifiably exten-
sible to any architectural additions, owing to the principle that low
level performance metrics are piece-wise rational functions of data
parameters, hardware parameters, and program parameters. Our
Brandt, Mohajerani, Moreno Maza, Paudel, Wang
approach relies on the instrumentation of the program4, and col-
lection of certain runtime values for few selected small input data
sizes, to approximate the runtime performance characteristics of
the program at larger input data sizes. Program instrumentations,
in conjunction with source code analysis and theoretical modeling,
are central to avoid costly dynamic analysis. Consequently, com-
pilers will be able to tune applications for the target hardware at
compile-time, rather than having to offload that responsibility to
the end users.
As it is observed from Tables 2 and 3, our current implementa-
tion suffers from inaccuracy for certain examples. We attribute this
problem to usage of the MWP-CWP of [21, 34] as our performance
prediction model. In future work, we expect that by switching to
the model of [37, 38], we will be able to improve the accuracy of
the rational program estimation.
Acknowledgements
The authors would like to thank IBMCanada Ltd (CAS project 880)
and NSERC of Canada (CRD grant CRDPJ500717-16).
REFERENCES
[1] 2018. CUDA Runtime API: v10.0. NVIDIA Corporation.
http://docs.nvidia.com/cuda/pdf/CUDA_Runtime_API.pdf .
[2] 2018. CUDA Toolkit Documentation, version 10.0. NVIDIA Corporation.
https://docs.nvidia.com/cuda/.
[3] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. 1986. Compilers: Principles,
Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., Boston,
MA, USA.
[4] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz,
A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. 1999. LAPACK
Users’Guide (third ed.). Society for Industrial and AppliedMathematics, Philadel-
phia, PA.
[5] Mohammadali Asadi, Alexander Brandt, Changbo Chen, Svyatoslav Covanov,
Farnam Mansouri, Davood Mohajerani, Robert Moir, Marc Moreno Maza, Ning
Xie, and Yuzhen Xie. 2018. Basic Polynomial Algebra Subprograms (BPAS).
http://www.bpaslib.org.
[6] Sara S. Baghsorkhi, Matthieu Delahaye, Sanjay J. Patel, William D. Gropp, and
Wen-mei W. Hwu. 2010. An Adaptive Performance Modeling Tool for GPU Ar-
chitectures. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming (PPoPP ’10). ACM, 105–114.
[7] Bernhard Beckermann. 2000. The condition number of real Vandermonde,
Krylov and positive definite Hankel matrices. Numer. Math. 85, 4 (2000), 553–
577.
[8] Alexander Brandt. 2018. High Performance Sparse Multivariate Polynomials: Fun-
damental Data Structures and Algorithms. Master’s thesis. University of Western
Ontario, London, ON, Canada.
[9] Kwok Chiu Chung and Te Hai Yao. 1977. On lattices admitting unique Lagrange
interpolations. SIAM J. Numer. Anal. 14, 4 (1977), 735–743.
[10] Robert M Corless and Nicolas Fillion. 2013. A graduate introduction to numerical
methods. Springer.
[11] NVIDIA Corporation. 2019. CUDA C Programming Guide, v10.1.105.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[12] NVIDIA Corporation. 2019. Profiler User’s Guide, v10.1.105.
https://docs.nvidia.com/cuda/profiler-users-guide/index.html.
[13] Gregory Frederick Diamos, Andrew Kerr, Sudhakar Yalamanchili, and Nathan
Clark. 2010. Ocelot: a dynamic optimization framework for bulk-synchronous
applications in heterogeneous systems. In 19th International Conference on Paral-
lel Architecture and Compilation Techniques, PACT 2010, Vienna, Austria, Septem-
ber 11-15, 2010, Valentina Salapura, Michael Gschwind, and Jens Knoop (Eds.).
ACM, 353–364.
[14] David Eisenbud. 2013. Commutative Algebra: with a view toward algebraic geom-
etry. Vol. 150. Springer Science & Business Media.
[15] Matteo Frigo and Steven G. Johnson. 1998. FFTW: an adaptive software archi-
tecture for the FFT. In Proceedings of the 1998 IEEE International Conference on
4We observe that instrumentation (either via profiling or based on emulation) is un-
avoidable unless the multithreaded kernel to be optimized has a very specific struc-
ture, say it simply consists of a static for-loop nest. This observation is an immediate
consequence of the fact that, for Turingmachines, the halting problem is undecidable.
Acoustics, Speech and Signal Processing, ICASSP ’98, Seattle, Washington, USA,
May 12-15, 1998. IEEE, 1381–1384.
[16] P. B. Gibbons. 1989. A more practical PRAM model. In Proceedings of the ACM
Symposium on Parallel Algorithms and Architectures. ACM, 158–168.
[17] Scott Grauer-Gray and Louis-Noel Pouchet. 2012. Im-
plementation of Polybench codes GPU processing.
http://web.cse.ohio-state.edu/~pouchet.2/software/polybench/GPU/index.html.
[18] Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John
Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In
Innovative Parallel Computing (InPar), 2012. IEEE, 1–10.
[19] Sardar Anisul Haque, MarcMorenoMaza, and Ning Xie. 2015. AMany-CoreMa-
chine Model for Designing Algorithms with Minimum Parallelism Overheads.
In Parallel Computing: On the Road to Exascale, Proceedings of the International
Conference on Parallel Computing, ParCo 2015 (Advances in Parallel Computing),
Vol. 27. IOS Press, 35–44.
[20] H. Hong. 1990. An improvement of the projection operator in cylindrical alge-
braic decomposition. In ISSAC ’90. ACM, 261–264.
[21] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architec-
ture with memory-level and thread-level parallelism awareness. In 36th Interna-
tional Symposium on Computer Architecture (ISCA 2009). 152–163.
[22] Jen-Cheng Huang, Joo Hwan Lee, Hyesoon Kim, and Hsien-Hsin S. Lee. 2014.
GPUMech: GPU Performance Modeling Technique Based on Interval Analysis.
In 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO
2014, Cambridge, United Kingdom, December 13-17, 2014. IEEE Computer Society,
268–279.
[23] Malik Khan, Protonu Basu, Gabe Rudy, Mary Hall, Chun Chen, and Jacqueline
Chame. 2013. A Script-based Autotuning Compiler System to Generate High-
performance CUDA Code. ACM Trans. Archit. Code Optim. 9, 4, Article 31 (Jan.
2013), 25 pages.
[24] Thomas Kistler and Michael Franz. 2003. Continuous program optimization: A
case study. ACM Transactions on Programming Languages and Systems (TOPLAS)
25, 4 (2003), 500–548.
[25] Robert V. Lim, Boyana Norris, and Allen D. Malony. 2017. Autotuning GPU
Kernels via Static and Predictive Analysis. In 46th International Conference on
Parallel Processing, ICPP 2017, Bristol, United Kingdom, August 14-17, 2017. 523–
532.
[26] John DC Little. 1961. A proof for the queuing formula: L= λ W. Operations
research 9, 3 (1961), 383–387.
[27] L. Ma, K. Agrawal, and R. D. Chamberlain. 2014. A memory access model for
highly-threaded many-core architectures. Future Generation Computer Systems
30 (2014), 202–215.
[28] Xinxin Mei and Xiaowen Chu. 2017. Dissecting GPU Memory Hierarchy
Through Microbenchmarking. IEEE Trans. Parallel Distrib. Syst. 28, 1 (2017), 72–
86.
[29] J. Nickolls, I. Buck, M. Garland, and K. Skadron. 2008. Scalable Parallel Program-
ming with CUDA. Queue 6, 2 (2008), 40–53.
[30] Peter J Olver. 2006. Onmultivariate interpolation. Studies in Applied Mathematics
116, 2 (2006), 201–240.
[31] Markus Püschel, José M. F. Moura, Bryan Singer, Jianxin Xiong, Jeremy R. John-
son, David A. Padua, Manuela M. Veloso, and Robert W. Johnson. 2004. Spiral:
A Generator for Platform-Adapted Libraries of Signal Processing Alogorithms.
IJHPCA 18, 1 (2004), 21–45.
[32] Shane Ryoo, Christopher I Rodrigues, Sara S Baghsorkhi, Sam S Stone, David B
Kirk, and Wen-mei W Hwu. 2008. Optimization principles and application per-
formance evaluation of a multithreaded GPU using CUDA. In Proceedings of the
13th ACM SIGPLAN Symposium on Principles and practice of parallel program-
ming. ACM, 73–82.
[33] S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S. Ueng, J. A. Stratton, and
W.W. Hwu. 2008. Program optimization space pruning for amultithreaded GPU.
In Proc. of CGO. ACM, 195–204.
[34] Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, and RichardW. Vuduc. 2012.
A performance analysis framework for identifying potential benefits in GPGPU
applications. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, PPOPP 2012, New Orleans, LA, USA, Febru-
ary 25-29, 2012. 11–22.
[35] Chenchen Song, Lee-Ping Wang, and Todd J Martínez. 2015. Automated Code
Engine for Graphical Processing Units: Application to the Effective Core Poten-
tial Integrals and Gradients. Journal of chemical theory and computation (2015).
[36] L. J. Stockmeyer and U. Vishkin. 1984. Simulation of Parallel Random Access
Machines by Circuits. SIAM J. Comput. 13, 2 (1984), 409–422.
[37] Vasily Volkov. 2016. Understanding Latency Hiding on GPUs. Ph.D. Dissertation.
EECS Department, University of California, Berkeley.
[38] Vasily Volkov. 2018. A microbenchmark to study GPU performance models. In
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming, PPoPP 2018, Vienna, Austria, February 24-28, 2018, Andreas
Krall and Thomas R. Gross (Eds.). ACM, 421–422.
A Technique for Finding Optimal Program Launch Parameters Targeting Manycore Accelerators
[39] R. Clinton Whaley and Jack Dongarra. 1998. Automatically Tuned Linear Alge-
bra Software. In Proceedings of the 1998 IEEE/ACM Conference on Supercomput-
ing.
[40] Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and An-
dreas Moshovos. 2010. Demystifying GPU microarchitecture through mi-
crobenchmarking. In IEEE International Symposium on Performance Analysis of
Systems and Software, ISPASS 2010, 28-30March 2010,White Plains, NY, USA. IEEE
Computer Society, 235–246.
