In this paper we present KLARAPTOR (Kernel LAunch parameters RAtional Program estimaTOR), a new tool built on top of the LLVM Pass Framework and NVIDIA CUPTI API to dynamically determine the optimal values of kernel launch parameters of a CUDA program P. To be precise, we describe a novel technique to statically build (at the compile time of P) a so-called rational program R. Using a performance prediction model, and knowing particular data and hardware parameters of P at runtime, the program R can automatically and dynamically determine the values of launch parameters of P that will yield optimal performance. Our technique can be applied to parallel programs in general, as well as to generic performance prediction models which account for program and hardware parameters. We are particularly interested in programs targeting manycore accelerators. We have implemented and successfully tested our technique in the context of GPU kernels written in CUDA using the MWP-CWP performance prediction model.
I. INTRODUCTION
Programming for high-performance parallel computing is a notoriously difficult task. Programmers must be conscious of many factors impacting performance including scheduling, synchronization, and data locality. Of course, program code itself impacts the program's performance, however, there are still further parameters which are independent from the code and greatly influence performance. For parallel programs three types of parameters influence performance: (i) data parameters, such as input data and its size, (ii) hardware parameters, such as cache capacity and number of available registers, and (iii) program parameters, such as granularity of tasks and the quantities that characterize how tasks are mapped to processors (e.g. dimension sizes of a thread block for a CUDA kernel).
Data and hardware parameters are independent from program parameters and are determined by the needs of the user and available hardware resources. Program parameters, however, are intimately related to data and hardware parameters. The choice of program parameters can largely influence the performance of a parallel program, even up to an order of magnitude difference (see Section VI). Therefore, it is critical This work has been supported in part by IBM Canada Ltd (CAS project 880) and in part by NSERC of Canada (CRD grant CRDPJ500717-16, award PGSD3-535362-2019).
to determine optimal values of program parameters that yield the best program performance for a given set of hardware and data parameter values.
In the CUDA programming model the kernel launch parameters, and thus the size and shape of thread blocks, greatly impact performance. This should be obvious considering that the memory accesses pattern of threads in a thread block can depend on the block's dimension sizes. The same could be said about multithreaded programs on CPU where parallel performance depends on task granularity and number of threads. We dedicate, however, this paper to the discussion of GPU programs where optimizing performance is much more difficult. An important consequence of the impact of kernel launch parameters on performance is that optimal thread block format (that is, dimension sizes) for one GPU (Graphics Processing Unit) architecture may not be optimal for another, as illustrated in [1] . This emphasizes not only the impact of hardware parameters on program parameters, but also the need for performance portability. That is to say, enabling users to efficiently execute the same parallel program on different architectures that belong to the same hardware platform.
In this paper, we describe the development of KLARAP-TOR (Kernel LAunch parameters RAtional Program estima-TOR), a tool for automatically and dynamically determining the values of CUDA kernel launch parameters which optimize the kernel's performance, for each kernel launch independently. That is to say, based on the actual data and target device of a kernel invocation. The source code of KLARAPTOR is freely available at https://github.com/orccauwo/KLARAPTOR.
The accuracy of KLARAPTOR's optimal prediction is illustrated in Fig. 1 where kernel execution times are compared between KLARAPTOR's choice of thread block configuration and the optimal thread block configuration determined by an exhaustive search.
KLARAPTOR applies to CUDA a generic and novel technique, also described herein in Section IV, to statically build a so-called rational program which is then used dynamically to determine optimal program parameters for a given multithreaded program on specific data and hardware parameters.
The key principle underpinning the proposed technique can be summarized as follows. Comparing kernel execution times with optimal launch parameters as determined by an exhaustive search versus KLARAPTOR's prediction. Kernels are part of the PolyBench/GPU benchmark suite and executed on a GTX 1080Ti with a data size of N = 2048. Ratios greater than 85% are considered good. Absolute timings given next to each bar are in ms. Due to memory limitations not all benchmarks could not run for N = 2048.
In most performance predictions models, high-level performance metrics, such as execution time, memory consumption, and hardware occupancy, are piece-wise rational functions (PRFs) of low-level performance metrics, such as memory bandwidth, cache miss rate, and the number of clock cycles per instruction. These lower-level metrics are themselves PRFs of program, data, and hardware parameters. As such, a high-level performance metric can be estimated by a piece-wise rational function of the program, data, and hardware parameters. For a fixed machine, the hardware parameters are also fixed and performance metrics depend only on program and data parameters. Henceforth, we regard a computer program that computes such a piece-wise rational function as a rational program, a technical notion fully defined in Section II.
In most cases, the values of the data parameters are only given at runtime, which makes it difficult to determine optimal values of the program parameters at an earlier stage. On another hand, a bad choice of program parameters can have drastic consequences. Hence, it is crucial to be able to determine the optimal program parameters at runtime without much overhead added to the program execution. This is precisely the intention of the approach proposed in this paper.
A. Contributions
The goal of this work is to determine values of program parameters which optimize a multithreaded program's perfor-mance. Towards that goal, the method by which such values are found must be receptive to changing data and changing hardware parameters. Our contributions encapsulate this requirement through the dynamic use of a rational program. Our specific contributions include:
(i) a technique for devising a mathematical expression in the form of a rational program to evaluate a high-level performance metric from a set of program and data parameters; (ii) KLARAPTOR, a tool implementing the rational program technique to dynamically optimize CUDA kernels by choosing optimal launch parameters; and (iii) an empirical and comprehensive evaluation of our tool on kernels from the Polybench/GPU benchmark suite.
B. Related Works
The Parallel Random Access Machine (PRAM) model [2] , [3] , including PRAM models tailored to GPU code analysis such as TMM [4] and MCM [5] analyze the performance of parallel programs at an abstract level. More detailed GPU performance models are proposed such as MWP-CWP by Hong et al [6] , [7] , which estimates the execution time of GPU kernels based on the profiling information of the kernels.
In the context of improving CUDA program performance, other research groups have used techniques such as loop transformation [8] , auto-tuning [9] , [10] , [11] , [12] , dynamic instrumentation [13] , or a combination of the latter two [14] . Auto-tuning techniques have achieved great results in projects such as ATLAS [15] , FFTW [16] , and SPIRAL [17] in which multiple kernel versions are generated off-line and then applied and refined on-line once the runtime parameters are known.
Although much research has been devoted to optimizing a kernel algorithmically, previous works such as [18] and [1] suggest that kernel launch parameters (i.e. thread block configurations) have a large impact on performance and must be considered as a target for optimization. In [19] the authors claim to have a static analysis method which does not require running the code for determining the best configuration for CUDA kernels. Through static code analysis, IPC (instructions per clock-cycle) is estimated, however, there is no analysis of memory access patterns. The authors assume "the execution time of a CUDA program is proportional to the input problem size N " [19, pp.527 ]. This is, obviously, a very strong assumption, and impractical for real world applications where even simple operations such as matrix multiplication are not proportional to their input size.
In [20] , the authors present an input-adaptive GPU code optimization framework G-ADAPT, which uses statistical learning to find a relation between the input sizes and the thread block sizes. At runtime or linking time, the framework predicts the best block size for a given input size using the linear model obtained from compile time. This approach only considers the total size of the thread blocks and not their configuration. Meanwhile, the authors of [11] use a linear regression model to predict optimal thread block configurations (that is, dimension sizes and not just the total size). However, like the authors of [19] , they assume that the execution time of a kernel is proportional to its data size. Moreover, only two benchmarks are provided. In [21] , machine learning techniques are used in combination with auto-tuning to search for optimal configurations of OpenCL kernels. Their examples are limited to stencil computations.
C. Structure of the Paper
The remainder of this paper is organized as follows. Section II formalizes and exemplifies the notion of rational programs and their relation to piece-wise rational functions and performance prediction. Section III highlights our two main results. Firstly, that rational programs can be determined as a mathematical expression for evaluating a performance metric from program and data parameters and, secondly, our success in applying this technique to CUDA kernels via the KLARAPTOR tool.
The remaining sections examine our generic technique for building and using a rational program to estimate program performance (Section IV), the specifics of the implementation of that technique as KLARAPTOR (Section V), and further experimentation and evaluation of our implementation (Section VI). We draw conclusions and explore future work in Section VII.
II. THEORETICAL FOUNDATIONS
Let P be a multithreaded program to be executed on a targeted multiprocessor. By fixing the target architecture, the hardware parameters, say, H = (H 1 , . . . , H h ) then become fixed and we can assume that the performance of P depends only on data parameters D = (D 1 , . . . , D d ) and program parameters P = (P 1 , . . . , P p ). Also, an optimal choice of P depends on a specific choice of D. For example, in programs targeting GPUs the parameters D are typically dimension sizes of data structures, like arrays, while P typically specify the formats of thread blocks.
Let E be a high-level performance metric for P that we want to optimize. More precisely, given the values of the data parameters D, the goal is to find values of the program parameters P such that the execution of P optimizes E. Performance prediction models attempt to estimate E from a combination of P , D, H, and some model-or platform-specific low-level metrics L = (L 1 , . . . , L ). It is natural to assume that these low-level performance metrics are themselves combinations of P , D, H. This is an obvious observation from models based on PRAM such as TMM [4] and MCM [5] . Hence, regardless of the particular performance prediction model, we may drop direct dependency on L and look to estimate E exclusively from some (unknown) combination of program and data parameters.
To address our goal, we compute a mathematical expression, parameterized by data and program parameters, in the format of a rational program R at compile-time. At runtime, given the specific values of D, we can efficiently evaluate E using R. After that, we can feasibly pick a value of P that optimizes E, and feed that to P for execution. We refer to such P which optimizes E as optimal values of the program parameters. This method is detailed in Section IV.
A. Rational Programs
Let X 1 , . . . , X n , Y be pairwise different variables 1 . Let S be a sequence of three-address code (TAC [22] ) instructions such that the set of the variables that occur in S and are never assigned a value by an instruction of S is exactly {X 1 , . . . , X n }.
Definition 1: We say that the sequence S is rational if every arithmetic operation used in S is either an addition, a subtraction, a multiplication, or a comparison of integer numbers in either fixed or arbitrary precision. Moreover, we say that the sequence S is a rational program in X 1 , . . . , X n evaluating Y if the following two conditions hold: 1) S is rational, and 2) after specializing each of X 1 , . . . , X n to an arbitrary integer value in S, the execution of the specialized sequence S always terminates and the last executed instruction assigns an integer value to Y . It is worth noting that the above definition can easily be extended to include Euclidean division, the integer part operations floor and ceiling, and arithmetic over rational numbers. For Euclidean division one can write a rational program evaluating the quotient q of integer a by b, leaving the remainder r to be simply calculated as a − qb. Then, floor and ceiling can be computed via Euclidean division. Rational numbers and their associated arithmetic is easily implemented using only integer arithmetic.
Notice then, that by adding these operations to Definition 1, the class of rational programs does not change. Thus, adding rational numbers, Euclidean division, and integer part operations to the definition of a rational sequence yields an equivalent definition of a rational program. We adopt this definition henceforth.
B. Rational Programs as flowcharts
For any sequence S of computer program instructions, it is convenient to associate S with a control flow graph (CFG). In the CFG of S, the nodes are the basic blocks of S, that is, the sub-sequences of S such that (i) each instruction except the last one is not a branch, and (ii) which are maximum in length with property (i). Moreover, in the CFG of S, there is a directed edge from a basic block B 1 to a basic block B 2 whenever, during the execution of S, one can jump from the last instruction of B 1 to the first instruction of B 2 . Recall that a flowchart is another graphic representation of a sequence of computer program instructions. In fact, CFGs can be seen as particular flowcharts.
If, in a given flowchart C, every arithmetic operation occurring in every (process or decision) node is either an addition, subtraction, multiplication, or comparison of integers in either fixed or arbitrary precision then C is the flowchart of a rational The execution time of a GPU kernel is a highlevel performance metric defined in the MWP-model [6] , [7] which is calculated from hardware parameters including:
-clock frequency of a SM, -the number of SMs on the device, etc., as well as low-level kernel-dependent performance metrics, including:
-the total number #Mem insts of memory instructions per thread, -the total number #Comp inst of computation instructions per thread, etc., and a program parameter, namely the number T of threads per thread block. Thus, this high-level metric can be computed as rational program of program, data, and hardware parameters. Indeed, we have realized such a rational program making us of the MWP-CWP model and its metrics in our implementation of KLARAPTOR (See Sections III-B and V).
D. Piece-Wise Rational Functions in Rational Programs
Observation 1: Let S be a rational program in X 1 , . . . , X n evaluating Y . Let s be any instruction of S other than a branch or an integer part instruction. Hence, this instruction can be of the form
where A and B can be any machine-representable rational numbers. Let V 1 , . . . , V v be the variables that are defined at the entry point of the basic block of the instruction s. An elementary proof by induction yields the following fact. There exists a rational function 2 
From there, one derives the following observation. There exists a partition T = {T 1 , T 2 , . . .} of Q n (where Q denotes the field of rational numbers) and rational functions f 1 (X 1 , . . . , X n ), f 2 (X 1 , . . . , X n ), . . . such that, if X 1 , . . . , X n receive respectively the values x 1 , . . . , x n , then the value of Y returned by S is one of f i (x 1 , . . . , x n ) where i is such that (x 1 , . . . , x n ) ∈ T i holds. In other words, S computes Y as a piece-wise rational function (PRF).
Example 1 shows that the hardware occupancy of a CUDA kernel is given as a piece-wise rational function in the variables R max , Z max , T max , B max , W max , R, Z, T . Hence, in this example, we have n = 8. Moreover, Fig. 2 shows that its partition of Q n contains 5 parts as there are 5 terminating nodes in the flowchart.
III. MAIN RESULTS

A. Rational Programs for Performance Prediction
In Section II, we saw how a rational program could be seen as a piece-wise rational function. We now show how a rational program can encode, and thus evaluate, a performance prediction model using only program and data parameters. A technique for constructing such a rational program is given in Section IV.
Observation 2: We assume that the high-level performance metric E is computable as a rational program depending on program, data, and hardware parameters, and some low-level performance metrics. Then, it is natural to assume that these low-level metrics also depend only rationally (as opposed to some more complex dependency) on program, data, and hardware parameters. Hence, we replace the direct dependency of E on low-level metrics with a further dependency on the program, data, and hardware parameters. When applied to a multithreaded program to be executed on a particular multiprocessor, the hardware parameters become specialized and we can assume that E can be expressed as a rational program depending only on program and data parameters.
Observations 1 and 2 culminate into this final result on rational programs. Suppose that a flow chart C representing the rational program R is partially known; to be precise, suppose that the decision nodes are known (that is, mathematical expressions defining them are known) while the process nodes are not. From Observation 1, each process node can be given by a series of rational functions. Determining each of those rational functions can be achieved by solving an interpolation or curve fitting problem.
It follows that a performance prediction model which can be depicted as a flow chart can also be depicted as a piece-wise rational function and thus a rational program. For example, the MWP-CWP performance model on top of CUDA (see Examples 1 and 2) gives us an estimate of E from program, data, and hardware parameters and some low-level metrics. Thus, by determining rational functions describing these lowlevel metrics, which are parameterized by program and data parameters, we can fully determine a rational program estimating E which depends only on program and data parameters.
Recall that building such a rational program R is our goal. Once R is known, it can be used at runtime (that is, when the program P is run on specific input data) to compute optimal values for the program parameters. This is exactly what is achieved in our tool implementing this technique. Before exploring this tool we make one final remark. We assumed that the decision nodes in the flowchart of the rational program were known, however, we could relax this assumption. Indeed, each decision node is given by a series of rational functions. Hence, those could be determined by solving curve fitting problems as well. We do not discuss this direction further since this is not needed in our proposed technique or implementation presented in the remainder of this paper.
B. KLARAPTOR: A Dynamic Optimization tool for CUDA
The theory of rational programs is put into practice for the CUDA programming model by our tool KLARAPTOR. KLARAPTOR is a compile-time tool implemented using the LLVM Pass Framework and the MWP-CWP performance model to dynamically choose a CUDA kernel's launch parameters (thread block configuration) which optimize its performance. Most high-performance computing applications require computations be as fast as possible and so kernel performance is simply measured as its execution time.
As mentioned in Section I, thread block configurations drastically affect the running time of a kernel. Determining optimal thread block configurations typically follows some heuristics, for example, constraining block size to be a multiple of 32 [24] . However, it is known that the dimension sizes of a thread block, not only its total size, affect performance [1] , [18] . Moreover, since thread block configurations are intimately tied to the size of data being operated on, it is very unlikely that a static thread block configuration optimizes the performance of all data sizes. Our tool effectively uses a rational program to dynamically consider the particular data size of every kernel invocation and determines the thread block configuration which minimizes the execution time of that particular invocation. This is achieved in two main steps. 1) At the compile-time of a CUDA program, its kernels are analyzed in order to build a rational program estimating the performance of each individual kernel under the MWP-CWP model. Each rational program, written as code in the C language, is inserted into the code of the CUDA program so that it is called before the execution of the corresponding kernel. 2) At runtime, immediately preceding the launch of a kernel, where data parameters have specific values, the rational program is evaluated to determine the kernel launch parameters (thread block configuration) which optimize the performance of the kernel. The kernel is then launched using this thread block configuration.
The practical benefits of KLARAPTOR are numerous, but in particular we are concerned with kernel performance and programmer performance. Of course, by choice of optimal kernel launch parameters, the execution time of a kernel is minimized without any modifications to the kernel. By programmer performance we mean the efficiency of a programmer to produce optimal code. When a programmer is attempting to optimize a kernel, choosing optimal launch parameters can either be completely ignored, performed heuristically, determined by trial and error, or determined by an exhaustive search. The latter two options quickly become infeasible as data sizes grow large. Regardless, any choice of optimal thread block configuration is likely to optimize only a single data size.
For KLARAPTOR to be practical, not only does the choice of optimal kernel launch parameters need to be correct, but the two main steps defined previously must also be performed in an way which is more efficient than trial and error or exhaustive search. Namely, the compile-time analysis cannot overwhelm the compilation time and the runtime decision of the kernel launch parameters cannot overwhelm the kernel execution time. For the former our analysis is performed very quickly using only small data sizes whereas for the later the rational program evaluation is fast and simple, maintaining a runtime history to instantly provide results for future kernel launches. The full details of this process is presented in Section V. In the remainder of this section we highlight the performance of KLARAPTOR. The Polybench/GPU benchmark suite provides an empirical evaluation of our tool on a range of CUDA programs. In Fig. 1 we have already seen that KLARAPTOR accurately predicts the optimal or near-optimal thread block configuration. In Fig. 3 we compare the so-called system performance. System performance describes the time required to determine the optimal thread block configuration by either (a) completely running KLARAPTOR, including rational program construction, evaluation, and running the CUDA program over many data sizes; or (b) invoking the kernel on all possible thread block configurations in an exhaustive search over those same data sizes. Evidently, using KLARAPTOR is orders of magnitude faster than an exhaustive search while also maintaining the fact that it dynamically adjusts for any data size. More detailed results and experimentation is presented in Section VI.
IV. AN ALGORITHM TO BUILD AND DEPLOY A PERFORMANCE-PREDICTING RATIONAL PROGRAM
In this section, the notations and hypotheses are the same as in Section II. Namely, E is a high-level performance metric for the multithreaded program P (e.g. execution time, memory consumption, or hardware occupancy), L is a set of low-level metrics given by the performance prediction model or the particular platform of P, and P , D, H are sets of program, data, and hardware parameters, respectively.
Let us assume that the values of H are known at compiletime of P while the values of D are known at runtime. Further, let us assume that P and D take integer values. Hence the values of P belong to a finite set F ⊂ Z p . That is to say, the possible values of P are tuples of the form (π 1 , . . . , π p ) ∈ F , with each π i being an integer. Let us call such a tuple a configuration of the program parameters. Due to the nature of program parameters, those are not necessarily all independent variables (i.e. a program parameter may depend on the value of another program parameter). For example, in CUDA, the product of the dimension sizes of a thread block is usually (i) a multiple of the warp size (32) and, (ii) bounded over by the maximum number of threads per block.
Following Section III-A, we can compute a rational program R estimating E as a function of D and P . To do so, we shall first compute rational functions g 1 (D, P ), . . . , g (D, P ), estimating L 1 , . . . , L respectively. Once R has been computed, it can be used when P is being executed (thus at runtime) in order to determine optimal values for P , on given values of D. The entire process is decomposed into six steps: the first three occur at compile-time and the next three at runtime.
1) Data collection:
In order to perform a curve fitting of the rational functions g i (D, P ), 1 ≤ i ≤ , we require data points with which to perform the fit. These are collected by (i) selecting a set of points K which is a subset of the space of possible values of (D, P ); and (ii) we execute the program P, recording the values of the low-level performance metrics L as V = (V 1 , . . . , V ), at each point in K.
2) Rational function estimation: For each low-level metric
L i we use the set of points K and the value V i measured for each point to estimate the rational function g i (D, P ). We observe that if the values of V i were known exactly (that is, without error) the rational function g i (D, P ) could be determined exactly via rational function interpolation. In practice, however, these empirical values are likely to be noisy from profiling, and/or numerical approximations. Consequently, we determine a rational functionĝ i (D, P ) which approximates g i (D, P ).
3) Code generation:
In order to generate the rational program R, we proceed as follows: (i) we convert the rational program representing E into code, say in the C programming language, essentially encoding the CFG for computing E;
(ii) we convert eachĝ i (D, P ) into code, specifically subroutines, estimating L i ; and (iii) we include those sub-routines into the code computing E, which yields the desired rational program R. At this point the rational program R is fully determined. 4) Rational program evaluation: At the runtime of P, the data parameters D are given particular values, say δ 1 , . . . , δ d . For those specified values of D and for all practically meaningful 3 values of P from the set F , we compute an estimate of E using R. The evaluation of E over so many different possible program parameters is feasible for three reasons: (i) The number of program parameters is small, typically p ≤ 3, see Section V; (ii) the set of meaningful values for P is small (consider that in CUDA the product of thread block dimensions should be a multiple of 32 less than 1024); and (iii) the program R simply evaluates a few formulae and thus runs almost instantaneously.
5) Selection of optimal values of program parameters:
When the search space of values of the program parameters P is large, a numerical optimization technique is required for this step. However, as just explained, the number of evaluations to sort through is quite small and thus an exhaustive search is feasible. However, due to inaccuracies in the performance prediction model being used, and in the estimation of the rational functions g i (D, P ) asĝ i (D, P ), there may be several configurations which, up to some margin, optimize E. Then, a secondary performance metric or some heuristic specific to the platform of P may be used to refine the choice of optimal configuration. 6) Program execution: Once an optimal configuration is selected, the program P can finally be executed using this configuration along with the values δ 1 , . . . , δ d of D.
V. THE IMPLEMENTATION OF KLARAPTOR
In this section we give an overview of the implementation of our previously presented technique (Section IV) applied to CUDA using the MWP-CWP performance model. The KLARAPTOR tool is built in the C language, making use of the LLVM Pass Framework (see Section V-C) and the NVIDIA CUPTI API (see Section V-D). KLARAPTOR is freely available in source at https://github.com/orcca-uwo/KLARAPTOR.
In the case of a CUDA kernel, the data parameters specify the input data size. In many examples this is a single parameter, say N , describing the size of an array (or the order of a multi-dimensional array), the values of which are usually powers of 2. Program parameters describe the kernel launch parameters, i.e. grid and thread block dimensions, and are also typically powers of 2. For example, a possible thread block configuration may be 1024 × 1 × 1 (a one-dimensional thread block), or 16 × 16 × 2 (a three-dimensional thread block). Lastly, the hardware parameters are values specific to the target GPU, for example, memory bandwidth, the number of SMs available, and their clock frequency.
This section is broken into subsections. Sections V-A, V-B, and V-C are specific to our implementation and do not correspond to any step of Section IV. The compile time steps 1 (data collection) and 2 (rational function estimation) are reflected in Sections V-D and V-E, while step 3 requires no additional explanation. The runtime steps 4 (rational program evaluation), 5 (selection of optimal configuration), and 6 (program execution) are also straight forward from the algorithm in Section IV. Throughout this section we refer to the notion of a driver program as the code which includes the rational program and uses it to select an optimal configuration.
A. Annotating the source code
Beginning with a CUDA program written in C/C++, we minimally annotate the host code to make it compatible with our pre-processor. We take into account the following points:
(i) the code targets CUDA Compute Capability 3.x or higher;
(ii) there should be no CUDA runtime API calls as such calls will interfere with later CUDA driver API calls used by our tool, for example, cudaSetDevice; (iii) the block dimensions and grid dimensions must be declared as the typical CUDA dim3 structs. For each kernel in the CUDA code, we add two pragmas, one defining the index of the kernel input arguments corresponding to the data size N and one specifying the dimension of the kernel (1, 2, or 3). As an example, consider the code segment below for the Sample CUDA kernel and the associated pragmas. This kernel operates of a two-dimensional kernel of order N , making it a two-dimensional kernel. Lastly, for each kernel, the user must fill two formatted configuration files which follow Python syntax. One specifies the constraints on the thread block configuration while the other specifies the grid dimensions. For example, for the 2D kernel Sample above, one could specify that its thread block configuration (bx, by, bz) must satisfy bx < by 2 , bx < N and by < N . Similarly, the grid dimensions (gx, gy, gz), could be specified as gx = ceil N bx and gy = ceil N by .
B. Preprocessing
The preprocessor processes the annotated source replacing CUDA runtime API calls to driver API kernel launches. This step includes source code analysis in order to extract the list of kernels, the list of kernel calls in the host code, and finally, the body of each kernel (which will be used for further analysis) and building a so-called "PTX lookup table" to store specific kernel information. This table will be inserted into the socalled "instrumented binary"-the compiled CUDA program, augmented by our driver programs.
C. Input/Output Builder
The Input/Output builder Pass, or IO-builder, is a compiler pass written in the LLVM Pass Framework to build the previously mentioned "instrumented binary". This pass embeds IO mechanisms to communicate with the driver program of each kernel at runtime. There is one such function call embedded per kernel call. Thus, at the runtime of the CUDA program being analyzed (step 6 of Section IV), an IO function is called before each kernel invocation to return six integers, (gx, gx, gz, bx, by, bz), the optimal kernel launch parameters.
The IO-builder pass goes through the following steps: (i) obtain the LLVM IR (intermediate representation) of the instrumented source code and find all CUDA driver API kernel calls; (ii) relying on the annotated information for each kernel, determine which variables in the IR contain the value of N for a corresponding kernel call; and (iii) insert a call to an IO function immediately before each kernel call in order to pass the runtime values of N to the corresponding driver program and retrieve the optimal kernel launch parameters.
D. Building a Driver Program: Data collection
For the MWP-CWP model as well as our eventual rational function estimation, we must collect data and statistics regarding certain performance counters and runtime metrics (these are thoroughly defined in [6] and [25] ). These metrics can be partitioned into three categories.
Firstly, architecture-specific performance counters of a kernel, basically, characteristics dictated by the CC of the target device. Such characteristics can be obtained at compile-time, as the target CC is specified at this time. These characteristics include the number of registers used in each thread, the amount of static shared memory per thread block, and the number of (arithmetic and memory) instructions per thread.
Secondly, runtime-specific performance counters that depend on the behavior of the kernel at runtime. This includes values impacted by memory access patterns, namely, the number of memory accesses per warp, the number of memory instructions in each thread, and eventually, the total number of warps that are being executed. We have developed a customized profiler using NVIDIA's EVENT API within the CUPTI API to collect the required runtime performance counters. The profiler is customized to collect exactly the information required for the MWP-CWP model and nothing else, allowing it to be very lightweight and avoid the huge overheads of a typical profiler (e.g. NVIDIA's NVPROF [26] ).
Thirdly, device-specific parameters, which describe an actual GPU card, allow us to compute a more precise performance estimate. A subset of such parameters can be determined by microbenchmarking the device (see [27] and [28] ), this includes the memory bandwidth, and the departure delay for memory accesses. The remaining parameters can easily obtained by consulting the vendor's guide [29] , or by querying the device itself via the CUDA driver API. Such parameters include the number of SMs on the card, the clock frequency of SM cores, and the instruction delay.
E. Building a Driver Program: Rational function estimation
Using the runtime data collected in the previous step, we look to determine the rational functionsĝ i (D, P ) (see Section IV) which estimate the low-level metrics or other intermediate values in the performance model. For simplicity of discussion, we replace the parameters D and P with the generic variables X 1 , . . . , X n .
A rational function is simply a fraction of two polynomials. With a degree bound (an upper limit on the exponent) on each variable X k in the numerator and the denominator, say u k and v k , respectively, these polynomials can be defined up to some parameters (using the language of parameter estimation), namely the coefficients of the polynomials, say α 1 , . . . , α i and β 1 , . . . , β j . Through analysis of the MWP-CWP model these degree bounds, and thus i and j, are relatively small.
We perform a parameter estimation (for each rational function) on the coefficients α 1 , . . . , α i , β 1 , . . . , β j . to determine the rational function. This is a relatively simple linear regression which can be solved as an over-determined system of linear equations, say by the method of linear least squares. However, our system is constructed from the evaluation of monomial terms, resulting in essentially a Vandermonde matrix. Such matrices are very ill-conditioned. Since we are interested in using our fitted model for extrapolation (i.e. estimating program parameters for new data parameters) any numerical errors in the model fitting will grow very quickly [30] . Thus, our solution must be as numerically stable as possible. Moreover, since we have constructed the system from monomials, it suffers from multicollinearity (see [31, Chapter 23] ) and can become rank-deficient. The typical method to solve the system using QR-factorization is thus impossible and we use the computationally more intensive yet more numerically stable method of singular value decomposition (for details see [30, Chapter 4] ).
Our implementation uses the LAPACK (Linear Algebra PACKage) library [32] for linear algebra and the Basic Polynomial Algebra Subprograms (BPAS) library [33] for efficient rational function and polynomial implementations.
VI. EXPERIMENTATION
In this section we explore the performance of KLARA-PTOR by applying it to the CUDA programs of the Polybench/GPU benchmark suite [9] . Throughout this section data was collected using a GTX 1080Ti.
We note here that many of the kernels in this suite perform relatively low amounts of work; they are best suited to being executed many times from a loop in the host code. These kernels with extremely fast execution create a large variance in relative timings from trial to trial where the absolute time scale is on the order of a fraction of a millisecond. Performance is further varied by dynamic frequency and voltage scaling despite setting the performance mode to peak performance.
However, when it comes to choosing thread block configurations which give optimal execution times, we need only to look at trends and relative performance. Fig. 4 shows these trends for three different kernels: atax, corr, and gramschmidt. While the predicted values are not exact, they do show that the overall shape is accurate and that predicted minima align with actual minima in execution time. Table I provides experimentation data for every kernel in the Polybench/GPU benchmark suite. Namely, this table compares the execution times of the thread block configuration chosen by KLARAPTOR against the optimal thread block configuration found though exhaustive search. The table shows a couple of data sizes in order to highlight that the best configuration can change for different input sizes.
VII. CONCLUSIONS AND FUTURE WORK
The performance of a single CUDA program can vary wildly depending on the target GPU device, the input data size, and the kernel launch parameters. Moreover, a thread block configuration yielding optimal performance for a particular data size or a particular target device will not necessarily be optimal for a different data size or different target device. In this paper we have presented the KLARAPTOR tool for determining optimal CUDA thread block configurations which is adaptive to each kernel invocation and input data, allowing for dynamic data-dependent performance and portable performance. This tool is based upon our technique of encoding a performance prediction model as a rational program. The process of constructing such a rational program is a fast and automatic compile-time process which occurs simultaneously to compiling the CUDA program by use of the LLVM Pass framework. Our tool was tested using the kernels of the Polybench/GPU benchmark suite with good results.
That same experimentation has lead us to consider the limitations our chosen performance prediction model, MWP-CWP. Recently, the author of [34] and [35] has suggested a new GPU performance model which relies on Little's law; it measures concurrency as a product of latency and throughput. This model considers both warp and instruction concurrency while the other models such as [29] , [6] , [7] , and [36] , consider only warp concurrency. The author's analysis of those models suggests their limitation is the significant underestimation of occupancy when arithmetic intensity (the number of arithmetic instructions per memory access) is intermediate. This is exactly the type of kernels on which KLARAPTOR underperforms. In future work we look to apply an improved performance prediction in order to achieve even better results. 
