Abstract-This paper presents an integrated analytical and profile-based CUDA performance modeling approach to accurately predict the kernel execution times of sparse matrix-vector multiplication for CSR, ELL, COO, and HYB SpMV CUDA kernels. Based on our experiments conducted on a collection of 8 widely-used testing matrices on NVIDIA Tesla C2050, the execution times predicted by our model match the measured execution times of NVIDIA's SpMV implementations very well. Specifically, for 29 out of 32 test cases, the performance differences are under or around 7%. For the rest 3 test cases, the differences are between 8% and 10%. For CSR, ELL, COO, and HYB SpMV kernels, the differences are 4.2%, 5.2%, 1.0%, and 5.7% on the average, respectively.
I. INTRODUCTION
Sparse matrix-vector multiplication (SpMV) is an essential operation in solving linear systems and partial differential equations. For many scientific and engineering applications, the matrices can be very large and sparse. It is a challenging problem to accurately and effectively predict the execution time of SpMV CUDA kernel for a matrix with any scale and sparsity. The CUDA performance modeling approach proposed by this paper addresses this challenge.
In this paper, we propose an integrated analytical and profile-based performance modeling approach to accurately predict the CUDA kernel execution time of SpMV. Our modeling approach consists of two phases: instrumenting and modeling. In the phase of instrumenting, benchmark matrices are generated according to a GPU's architecture features, then SpMV computations with these benchmark matrices are conducted on the GPU to obtain the execution times. The properties and the execution times of benchmark matrices are recorded as the input in the phase of modeling. In the phase of modeling, we instantiate our parameterized performance models according to the experimental results of benchmark matrices. Finally, we utilize the instantiated models to estimate the CUDA kernel execution time of SpMV for any target sparse matrix. In this paper, we use SpMV CUDA kernels developed by NVIDIA [1] and NVIDIA Tesla C2050 for our performance modeling and experiments. However, the proposed approach is general and can be applied to any SpMV kernel and NVIDIA GPU architecture. For different SpMV kernels, only the execution times for benchmark matrices need to be retested; for different GPU architectures, the benchmark matrices also need to be regenerated. However, these changes only occur in the phase of instrumenting. Our parameterized performance models can be reused in the phase of modeling.
To predict the performance of SpMV for a target sparse matrix on a GPU, we partition the target matrix into strips. A strip is a maximum submatrix that can be handled by a GPU within one iteration. The size of matrix strip is determined by the physical limitations of a GPU and SpMV kernel granularity. The CUDA kernel execution time of SpMV for a target sparse matrix can be predicted by comparing the given target matrix and benchmark matrices.
Our innovative approach combines two major techniques used for performance modeling: profiling and analytics. In our approach, dividing modeling into two phases follows the profile-based technique; and it follows the analytical technique to generate benchmark matrices and performance models according to hardware properties. The integration of both analytical and profile-based modeling has the following advantages: (1) Compared to analytical models, our model is simple and easy to use. (2) Compared to traditional profile-based models, which are usually inaccurate for parallel architectures [2] , our model can accurately and effectively capture the performance effects of GPUs.
We evaluated our performance modeling on 8 matrices using NVIDIA SpMV CUDA kernels [1] . For 29 out of 32 test cases, the performance differences are under or around 7%. For the rest 3 test cases, the performance differences are between 8% and 10%. Specifically, the differences are 4.2%, 5.2%, 1.0%, and 5.7% on the average for CSR, ELL, COO, and HYB SpMV kernels, respectively.
The rest of this paper is organized as follows: Section II surveys the related work about sparse matrix-vector multiplication (SpMV) and the recent performance modeling techniques. Section III presents the details of our CUDA performance modeling. Section IV evaluates the accuracy of our performance modeling by comparing the measured and predicted execution times, and the difference rates. Section V gives the conclusion and future work. Choi et al. [12] designed a blocked ELLPACK format and proposed a CUDA performance model to predict matrixdependent tuning parameters. Xu et al. [13] proposed the optimized SpMV based on ELL format and a SpMV CUDA performance model. Zhang and Owens [14] adopted a microbenchmark-based approach to develop a throughput model for three major components of GPU execution time: instruction pipeline, shared memory access, and global memory access. Their model focuses on identifying performance bottlenecks and guiding programmers for optimization; our model focuses on predicting the execution time, which is similar to [15] - [17] . Baghsorkhi et al. [15] presented a compiler-based GPU performance modeling approach with accurate prediction using program analysis and symbolic evaluation techniques. Hong and Kim [16] proposed a simple analytical GPU model to estimate the execution time of massively parallel programs. Their model estimates the number of parallel memory requests by taking into account the number of running threads and memory bandwidth. Kothapalli et al. [17] presented a performance model by combining several known models of parallel computation: BSP, PRAM, and QRQW. However, their proposed analytical models are based on the abstraction of GPU architecture. Unlike these analytical performance models, our model is based on both analytical and profilebased modeling techniques.
III. CUDA PERFORMANCE MODELING FOR SPMV

A. The workflow of our modeling
The modeling workflows for CSR and ELL, COO SpMV CUDA kernels are shown in Figure 1 and Figure 2 , respectively.
1) The phase of instrumenting:
• Compute the size of matrix strip (Section III-B).
• Generate the benchmark matrices (Section III-C).
• Test the execution times of the benchmark matrices (Section III-D).
• Compute the number of matrix strips and non-zero elements per row (it is not applicable to COO) for a target matrix (Section III-E).
2) The phase of modeling:
• Instantiate parameterized performance models according to the experimental results of benchmark matrices.
• Estimate the kernel execution time of SpMV for a target matrix using CUDA performance models (Section III-F).
The symbols used in our model are shown in Table I .
GPU hardware features
The size of matrix strip
The execution times of benchmark matrices 
TABLE I SYMBOLS USED IN OUR PERFORMANCE MODELING
N SM
The number of streaming multiprocessors (SMs), which is 14 in NVIDIA Tesla C2050.
R
The number of rows of a benchmark matrix.
S
The size of a matrix strip, which is the maximum number of rows (for CSR and ELL) or non-zero elements (for COO) that can be processed by a GPU within one iteration.
I
The number of strips of a benchmark or target sparse matrix.
N
The set of natural numbers.
C
The number of columns of a benchmark matrix.
P N Z
The number of non-zero elements per row of a benchmark or target sparse matrix.
G M
The size (bytes) of GPU global memory.
M R×C A benchmark matrix, where R × C indicates the dimension of the benchmark matrix.
V C A random vector, where C indicates the dimension of the random vector.
The execution time of matrix-vector multiplication, where
M indicates the benchmark matrix and V indicates the random vector.
α, β Natural numbers, where α < β is required.
T
The execution time of a benchmark matrix.
N R The number of rows of a target sparse matrix.
N N Z The number of non-zero elements of a target sparse matrix.
D S
A data set consisting of the number of non-zero elements in each row of a target matrix. 
B. The size of matrix strip (S)
• A strip is a maximum submatrix that can be handled by a GPU within one iteration. For a large matrix, it may need multiple iterations to handle the whole matrix. Thus, a matrix may contain multiple strips.
• The size of matrix strip is determined by the physical limitations of a GPU and SpMV kernel granularity, which are shown in Table II and III, respectively. The physical limitations of a GPU are determined by its compute capability. Our experiments are based on NVIDIA Tesla C2050, whose compute capability is 2.0.
1) CSR kernel:
2) ELL kernel:
3) COO kernel:
C. The benchmark matrices 1) The criteria for generating the benchmark matrices:
• The number of rows (R):
-The value of C does not affect the performance since the sparse matrices are stored in compressed formats.
• The number of non-zero elements per row (P N Z ):
We assume that the non-zero elements are in singleprecision (f loat) and each row has the same number of non-zero elements. The maximum P N Z is derived according to the maximum non-zero elements that can be stored in the GPU global memory in the corresponding sparse matrix format. The values used in our experiments are introduced in Section III-C2.
• The value of the matrix entry: random value
2) The experimental setup: To obtain accurate performance models, we generate a series of benchmark matrices. A benchmark matrix is determined by R and P N Z . Since R = S × I, where S is fixed, we just enumerate values of I and P N Z according to the above criteria to obtain combinations. Each combination indicates a benchmark matrix.
• The number of strips (I):
-CSR and ELL: Let I = 1, 2, 3...10 * In our experiment, the largest benchmark matrix contains 10 strips, which is accurate enough to measure the performance. 
D. The execution times of benchmark matrices (T )
• For each benchmark matrix, a random vector is generated for measuring the execution time of SpMV kernel.
• We remove the effect of long initialization delay and average the execution time of a benchmark matrix as follows:
E. The target matrix 1) The number of strips (I):
• Given a target matrix with N R rows and N N Z nonzero elements, the number of strips can be computed as follows: 
2) The number of non-zero elements per row (P N Z ):
• CSR: P N Z is set to be mode (statistics) of a data set D S .
• ELL: P N Z is set to be the Max. value of a data set D S .
F. Performance Modeling and Estimating
There exists relationships between the number of strips, the number of non-zero elements per row, and the execution times of the benchmark matrices. Hence, we can estimate the CUDA kernel execution time of SpMV for a target matrix according to these relationships.
1) CSR kernel: Our method contains the following steps:
Step 1: Establish the following relationships:
• Relationship-1 (T = A * x + B): For a set of benchmark matrices with the same number of strips (it can be any arbitrary value within the range defined in Section III-C), we establish the relationship between the number of nonzero elements per row (x) and the execution time of the benchmark matrices (T ).
For a set of benchmark matrices with the same number of non-zero elements per row, we establish the relationship between the number of strips (y) and the execution time of the benchmark matrices (T ′ ). By studying the physical limitations of NVIDIA Tesla C2050, we were surprised to discover that its number of max threads per block, i.e.1024, is exactly a threshold: when the number of non-zero elements per row is smaller or larger than it, there are two different linear relationships. The two linear equations in Relationship-1 are shown in Figure 3 and Figure  4 . The two linear equations in Relationship-2 are shown in Figure 5 and Figure 6 .
Step 2: Estimate the execution time of a target matrix:
• According to the number of non-zero elements per row of the target matrix (denoted by x 0 ), find t 0 = A * x 0 + B from Relationship-1, and the execution time t 1 of any previously tested benchmark matrix M . Here, we assume that Z is the number of non-zero elements per row of matrix M .
• According to the number of strips of the target matrix (denoted by y 0 ), find t 2 = C * y 0 + D from a corresponding linear equation in Relationship-2. Note that, the number of non-zero elements per row of matrix M (denoted by Z) is set to be the number of non-zero elements per row in Relationship-2.
• Estimate the execution time of the target matrix by
2) ELL kernel: Our method works as follows: Step 1: Establish the following relationships:
• Relationship-1 (T = f (y 1 ) * x + g(y 1 )): For a set of benchmark matrices with the same number of strips (it can be any arbitrary value within the range defined in Section III-C and denoted by y 1 ), we establish the relationship between the number of non-zero elements per row (x) and the execution time of the benchmark matrices (T ), as shown in Figure 7 .
• Relationship-2 (f (y)): For sets of benchmark matrices with different number of strips, we establish the relationship between the number of strips of the benchmark matrices (y) and the corresponding coefficient of the linear equations (f ) in Relationship-1, as shown in Figure  8 .
For a set of benchmark matrices with the same number of non-zero elements per row (it can be any arbitrary value within the range defined in Section III-C and denoted by x 1 ), we establish the relationship between the number of strips (y) and the execution time of the benchmark matrices (e), as shown in Figure 9 . Thus, g(y) = e(y) − f (y) * x 1 .
• Given a target matrix, in order to estimate its execution time, we need to obtain the coefficient f (Y ) and the 3) COO kernel: Our method contains the following steps:
• Relationship-1: We establish the relationship between the number of strips and the execution time of the benchmark matrices, as shown in Figure 10 .
• Count the total number of non-zero elements of the target matrix, then calculate the number of strips according to Section III-E1.
• Estimate the execution time of the target matrix using Relationship-1.
4) HYB kernel:
Our method works as follows:
• Since HYB kernel is the combination of ELL and COO kernels, we can reuse the relationships in Sections III-F2 and III-F3.
• Compute HYB threshold [1] to divide the target matrix into two parts: ELL and COO.
• Calculate the number of strips of ELL and COO parts of the target matrix, respectively.
• Use HYB threshold as the number of non-zero elements per row of ELL part of the target matrix.
• Count the total number of non-zero elements of COO part of the target matrix.
• Estimate the execution times of ELL and COO parts of target matrix, respectively.
• Sum above two parts of execution times.
IV. EXPERIMENTAL EVALUATION
Our experiments are performed on NVIDIA Tesla C2050 with 3GB global memory. We evaluated our performance models on the 14 widely-used testing matrices [18] . However, NVIDIA's SpMV implementations [1] cannot execute ELL SpMV kernel on 6 sparse matrices on NVIDIA Tesla C2050 ("Wind Tunnel", "Economics", "FEM/Accelerator", "Circuit", "Webbase", and "LP") because of the limitation of "num cols per row". Hence, we conducted experiments on the rest 8 unstructured sparse matrices [18] , as shown in Table  IV .
To show the prediction accuracy, we compare our four performance prediction models with NVIDIA's SpMV implementations by two aspects: the execution times and the performance difference rates. Figure 11 , 12, 13, and 14 show the comparisons between the measured execution times of NVIDIA's CSR, ELL, COO, and HYB SpMV implementations and the execution times predicted by our CSR, ELL, COO, and HYB performance prediction models, respectively. Figure 15 shows the performance difference rates between the measured and predicted execution times. The experiments evaluate CSR, ELL, COO, and HYB SpMV CUDA kernels on a collection of 8 unstructured sparse matrices. Hence, there are 32 test cases in total. The execution times of SpMV CUDA kernels predicted by our model match the measured execution times of NVIDIA's SpMV implementations very well. Specifically, for 29 out of 32 test cases, the performance differences are under or around 7%. For the rest 3 test cases, the differences are between 8% and 10%. For CSR, ELL, COO, and HYB SpMV kernels, the differences are 4.2%, 5.2%, 1.0%, and 5.7% on the average, respectively. In this paper, we proposed four CUDA kernel performance prediction models: CSR, ELL, COO, and HYB SpMV performance prediction models, which are based on CSR, ELL, COO, and HYB SpMV CUDA kernels, respectively, to accurately predict the kernel execution time of sparse matrix-vector multiplication for a target sparse matrix. Our proposed models utilize an integrated analytical and profilebased CUDA performance modeling approach. Compared to analytical models, our model is simple and easy to use; compared to traditional profile-based models, our model can effectively capture the performance effects of GPUs. The experimental results show that the execution times of SpMV kernels predicted by our models match the measured execution times of NVIDIA's SpMV implementations very well.
In the future, we will extend our CUDA SpMV performance modeling to predict the execution times of other SpMV CUDA kernels (e.g. DIA). In addition, we will also propose and design a performance modeling to predict the execution time of dense matrix-vector multiplication.
ACKNOWLEDGMENT
The work was supported in part by NSF under Grants 0941735, CAREER-1054834, and by the Graduate Assistantship of the School of Energy Resources at the University of Wyoming.
