A new format for storing sparse matrices is proposed for efficient sparse matrix-vector (SpMV) product calculation on modern throughput-oriented computer architectures. This format extends the standard compressed row storage (CRS) format and is easily convertible to and from it without any memory overhead. Computational performance of an SpMV kernel for the new format is determined for over 140 sparse matrices on two Fermi-class graphics processing units (GPUs) and the efficiency of the kernel, which peaks at 36 and 25 GFLOPS at single and double precision, respectively, is compared with that of five existing generic algorithms and industrial implementations. The efficiency of the new format is also measured as a function of the mean (µ) and of the standard deviation (σ) of the number of matrix nonzero elements per row. The largest speedup is found for matrices with µ 20 and µ σ 1.5 and can be as high as 43%.
Introduction
The sparse matrix-vector (SpMV) multiplication is one of the most important kernels in scientific computing with numerous applications ranging from sparse linear solvers to the PageRank algorithm used by Google in its Web search engine [1] . However, its high memory bandwidth requirements combined with the poor data locality exhibited by typical sparse matrices result in poor performance on general purpose processors which usually attain only a small fraction of their peak performance in this kernel. The literature devoted to SpMV optimization techniques on traditional, cache-based processor designs is ample (see [2] for an extensive overview), and currently one of the major issues is how to exploit new features available in multicore hardware. Several SpMV optimization strategies were already proposed for various designs of chip multiprocessors, including multicore central processing units (CPUs) from Intel and AMD [3] , single-and dual-socket STI Cell [3] , Sun UltraSPARC T2 [3] , field programmable gate arrays (FPGAs) [4, 5] , and graphics processing units (GPUs) [6] [7] [8] [9] [10] [11] [12] [13] [14] .
In just a few years GPUs have evolved from fixedpipeline application-specific integrated circuits into highly programmable, versatile computing devices. Under conditions often met in scientific computing, modern GPUs surpass CPUs in computational power, data throughput and computational efficiency per watt by almost an order of magnitude. They are not only the core ingrediEmail address: zkoza@ift.uni.wroc.pl (Zbigniew Koza) ent of many today's and tomorrow's petaflop supercomputers, but also an affordable mathematical accelerator for everyday usage, with peak computational performance matching that of the most powerful supercomputers of only a decade ago. These devices can be programmed in high-level programming languages, e.g. Nvidia's C++ for CUDA (proprietary) or OpenCL (open standard), and several high-quality mathematical libraries as well as bindings to many high-level programming languages are available, including C, C++, Python and Matlab. The major problem in their usage is software: any new hardware architecture will succeed only if appropriate software can be developed to exploit the parallelism in the hardware efficiently. However, the current multicore architectures are so diverse and are subject to so frequent changes that to exploit their potential the applications must be highly specialized and use architecture-specific optimization strategies. Moreover, massively parallel architectures, like the one utilized in modern GPUs, require to use a completely new programming paradigm, which in turn requires radical rethinking of how numerical computations should be performed. Since the vast majority of the existing scientific software adheres to "old" programming paradigms, and since development of this software took many, many millions man-hours, perhaps the best way to introduce new concepts is by concentrating on the most important kernels and showing their usability in the "old" environment.
In this paper we present an efficient SpMV algorithm optimized for throughput-oriented processors and examine its implementation for Nvidia's CUDA-enabled GPUs. The algorithm is based on a new sparse matrix storage for-mat designed as an extension of a popular compressed row storage (CRS, also known as compressed sparse row, CSR) format. It is characterized by a small memory footprint and small conversion times to other storage formats, which should facilitate its adoption in existing applications. It also turns out to be among the fastest SpMV algorithms available for GPUs. These features make it a candidate for a universal method of doing the SpMV multiplication efficiently on a wide range of throughput-oriented multicore architectures.
We compared the efficiency of our GPU implementation of the new format with that of several generic GPU SpMV algorithms as well as industrial libraries. However, in contrast to many recent studies on GPU SpMV, which were based on small sets of sparse matrices, which in turn almost reduces the papers to case studies, here we use a far larger set from the University of Florida Sparse Matrix Collection (UF SMC) [15] . This enabled us to make some general statements not only about the absolute performance of our CMRS-based implementation, but also about relative performance of several other alternative solutions. We also try to exploit on a broader scale some general characteristics of any SpMV product implementation, e.g. the bandwidth efficiency, in the hope that our results will be valuable even after the next-generation GPU architectures have been released.
The structure of the paper is as follows. In Sec. 2 we introduce the problem of SpMV product calculation and briefly describe specific computer architecture features we use to solve it. Section 3 contains the definition of the new sparse matrix storage format, and Sec. 4 describes its implementation for Fermi-class GPUs. Results of extensive performance tests of the new implementation and its comparison with other, alternative solutions is given in Sec. 5. Finally, Sec. 6 is devoted to conclusions.
Problem statement

SpMV multiplication
The aim of the SpMV multiplication algorithm is to calculate the product y =Âx, whereÂ is a large sparse matrix and x, y are dense vectors. Typical matrices involved in the SpMV product have thousands or even millions of rows and columns and are very sparse, with an average number of nonzero elements per row rarely exceeding 100. In some cases the pattern of nonzero elements is quite regular (e.g. for problems resulting from discretization of partial differential equations on regular grids), but the most interesting-and difficult-matrices are those whose distribution of nonzero elements appears to be random.
Nonzero elements ofÂ are usually stored in a onedimensional auxiliary array, and additional information is needed to uniquely map the values in the array to their locations inÂ. The way this information is stored is called a sparse matrix format. Calculation of a sparse matrixvector product essentially reduces to many "multiply and add" operations, which in modern GPUs are implemented as a single fused multiple-add (FMA) instruction. Thus, SpMV multiplication involves several memory accesses per each arithmetic instruction, which means that the SpMV kernel is inherently memory-bound. For example, serverclass Tesla M2090 GPU can perform ≈ 3.3 × 10 11 FMA operations per second and can access its main memory at ≈ 1.7×10 11 B/s, which yields the ratio of about two operations per byte. Therefore, if four eight-byte long operands of each FMA instructions were to be read from and written to the off-chip memory, the memory bottleneck would restrict the processor computational efficiency to 1/64 of its peak theoretical value. Although this value can be increased by using fast on-chip memory, other factors, like additional memory transactions necessary to read sparse matrix format data or reduced off-chip memory throughput due to poor data locality and indirect addressing, can decrease it to even smaller values. The main challenge in designing an efficient and universal SpMV algorithm is thus how to exploit and balance all the performancerelated features available in hardware, focusing on the memory subsystem utilization.
GPU architecture and the CUDA programming model
The architecture of modern GPUs is a massively parallel design which excels in computing-intensive stream data processing. Here we briefly discuss the main properties of Nvidia's "Fermi" GPU architecture [16] introduced in 2010.
A GPU contains a number of units called multiprocessors, each one containing a set of relatively simple computing cores called CUDA cores. From the programmer's perspective, multiprocessors are essentially independent single-instruction multiple data (SIMD) devices in which a group of 32 CUDA threads, called a warp, executes the same instruction on multiple data simultaneously. Multiprocessors are connected to high bandwidth (up to ≈ 200 GB/s), high latency (≈ 800 clock cycles), limited size (≤6 GB) external dynamic random access memory through a coherent L2 cache (768 KB). Each multiprocessor has an L1 cache (16 or 48 KB), a read-only texture cache (6 to 8 kB) and a constant cache (8 KB) . Beside these hardwaremanaged caches, GPUs contain also several fast on-chip memories managed in software: so called shared memory and registers.
While CPUs are equipped with sophisticated hardware mechanisms, like complex cache memories, branch prediction and out-of-order execution, which allow them to execute efficiently even a poorly written code, GPUs consist of relatively simple units which must be controlled on much lower level to achieve good performance. This makes GPU programming harder, but at the same time paves the way for new optimization strategies. These strategies require detailed information on physical capabilities, limitations and mutual dependencies of various GPU components [16, 17] . For example, the size of the L1 cache is guaranteed to be configurable at runtime (48 or 16 KB), the constant cache is bound to a 64 KB read-only memory space residing in global memory, and the texture cache can be assigned to any area in global memory at runtime. The shared memory (16 or 48 KB) is shared by all threads belonging to well-defined groups of threads (called blocks) executing on the same multiprocessor. In addition, each multiprocessor has a pool of 2 15 32-bit registers that can be combined in pairs for double precision arithmetic. Registers form the fastest memory (≈ 8 TB/s for GTX 480 [17, 18] ), as the CUDA cores can operate directly on them. Next comes the shared memory (≈ 1.3 TB/s), then the global memory (≈ 170 GB/s), and finally the host CPU memory connected to the GPU through a PCI-Express bus (≈ 5 GB/s). Various memories available in GPUs differ not only in their speed, but also in latencies. For example, the latency of registers (which is caused by register dependencies, e.g. when an input operand is written by the previous instruction) is ≈ 20 clock cycles, whereas the latency of the global memory accesses can be as high as 800 clock cycles. To hide such high latencies, each multiprocessor loads into its registers the states of up to 1536 threads (forming up to 48 warps) and attempts to execute the warp that has all operands ready for execution. This leads to massive parallelism with up to 24 576 threads being processed on-chip simultaneously. Another factor crucial for GPU efficiency is the memory access pattern. For example, shared memory is divided into 32 equally-sized banks and whenever two or more addresses of a memory request fall in the same memory bank, the access has to be serialized, which can degrade the shared memory bandwidth by an order of magnitude. Similarly, the global memory can be read or written at full speed only when the memory accesses can be coalesced into 128-byte transactions matching 128-byte aligned segments in device memory.
CUDA is an abstract general purpose parallel computing architecture, programming model, and programming environment designed for Nvidia's GPUs [17] . It is based on a few key concepts such as groups of threads (arranged hierarchically in warps, blocks and a grid), shared memories and barrier synchronization. These concepts are exposed to the programmer through a minimal set of extensions to a high-level programming language (C or C++). Warps within a block of threads can be executed in any order; similarly, each block of threads can be run on any of the available multiprocessors in an arbitrary order, sequentially or in parallel. These features guarantee scalability of CUDA software-once compiled, it can be executed on various devices with different numbers of multiprocessors and can be easily ported to GPU clusters.
To cope with frequent changes in the GPU architecture, Nvidia introduced the concept of compute capability (cc), which is a specification of the core hardware properties of a given device, e.g. the maximum number of threads in a block or support for atomic operations. The "Fermi" architecture corresponds to cc 2.0 and 2.1. Similarly, there are several versions of CUDA programming environments, the current being CUDA 4.1, which differ in their support for compute capabilities and C++ language constructs, quality of libraries, etc. Even though subsequent CUDA versions are backward compatible with older ones, frequent improvements in the CUDA environment and hardware compute capabilities make it difficult to maintain the highest optimization level of the software-a strategy developed for cc 1.1 may turn out counterproductive for cc 2.0. This implies that while the CUDA technology has a great potential, it should be still considered experimental.
Existing matrix formats for SpMV multiplication
The simplest sparse matrix format is the coordinate (COO) format, in which the information about the row index, column index, and the value of each non-zero matrix element is stored in three one-dimensional arrays, RowInd, ColInd, and Val, respectively. As an example, consider a 
Its COO representation (with zero-based indexing) reads To complete the matrix definition, one also needs to supply three integers: the number of matrix rows (rows), columns (cols) and non-zero elements (nnz).
In the above example the row-major ordering was used, i.e., the matrix index arrays were first sorted by row indices and then by column indices. In such a case array RowInd will typically contain sequences of many identical entries. This property is utilized in the CRS format to reduce the memory footprint by replacing array RowInd with a shorter array RowPtr. In the most general case this array is defined by the requirement that RowPtr 
While the ELL format belongs to the most efficient sparse matrix formats for vector architectures, it may involve a costly storage overhead. Several attempts have been made to modify this format so as to extend its practical usability for general sparse matrices. One such attempt is the hybrid (HYB) format [6, 7] , which is a combination of the ELL and COO formats. Another interesting idea is to divide the matrix into several slices, each represented separately in the ELL format, and/or use some kind of preprocessing, e.g. permutation of rows and columns [11, 12] to reduce padding.
Existing GPU implementations
One of the first efficient implementations of SpMV on GPU architecture was proposed by Bell and Garland [6] from Nvidia. They implemented SpMV kernels for several sparse matrix formats, including COO, ELL, and HYB. In addition, two SpMV kernels for the CRS format were provided: scalar and vector. The scalar kernel uses a naive approach by assigning one thread per matrix row, which leads to non-coalesced access to memory and poor performance. The vector kernel uses a 32-thread warp for each row which, on the one hand, provides contiguous access to the memory, but on the other hand it leads to a large bandwidth waste when the number of nonzero elements in a row is much smaller than the warp size. The tests indicated that the COO is not flexible enough to handle unstructured matrices efficiently. The ELL format is usually faster than others, but fails whenever the number of nonzero elements in matrix rows varies significantly. The best overall performance was achieved for the HYB format, which offered the speed of ELL and flexibility of COO. Therefore the authors recommended HYB to be used as the fastest format for a broad selection of unstructured matrices. Their best kernel implementing SpMV on structured matrices achieved the performance of 36 GFLOPS in single precision and 16 GFLOPS in double precision on a GeForce GTX 280 GPU. For unstructured matrices the authors reported the performance in excess of 15 GFLOPS and 10 GFLOPS in single and double precision, respectively. A slightly better performance was achieved for SpMV kernels tested on a GeForce GTX 285 GPU for the same set of test matrices [7] .
Baskaran and Bordawekar [8] proposed a few optimization techniques to improve Bell's and Garland's SpMV kernels. By exploiting synchronization-free parallelism, optimized off-chip memory access, different thread mapping based on affinity, and by caching input vector elements in texture memory, they were able to outperform the HYB format in 9 out of 19 matrices on GeForce GTX 280 (with cache) and 7 matrices (without cache).
The kernels investigated in [7] served as building blocks for CUSP [19] , an open-source library of sparse linear algebra and graph computations on CUDA. To improve coalescing of matrix data reads for sparse matrices in the CRS representation, if the average number of nonzeros per row is less than 32 this library virtually divides each warp into 2, 4, 8 or 16 smaller parts, called vectors, and assigns them to different rows.
Another direction of research on improving the efficiency of the SpMV kernel on GPUs focuses on various extensions and modifications of the ELL or CRS formats. This resulted in the development of the ELL-R [10] , sliced-ELL [11] , ELLR-T [9] , Sliced ELLR-T [12] and rgCSR [13] formats, as well as tiling and composite storage [14] . Their major drawback, however, is that their usage involves a large memory overhead, either directly through zero-padding or indirectly through matrix preprocessing. This is hardly acceptable in serious numerical applications, as they typically require that the solver has the ability to handle large data sets. Since the GPU memory is relatively small and the connection to a larger CPU memory is slow, each of the bytes residing on the GPU must be used very cautiously.
Compressed Multi-Row Sparse Format
Efficiency of existing GPU implementations of the SpMV product is often significantly better if an ELL-based formats, e.g. the HYB format, is used instead of the CRS. The main reason for this is that GPUs are SIMD-like machines with relatively wide SIMD units, often far wider than the average number of nonzero elements per matrix row. Since efficient utilization of the CRS format requires the matrix elements to be accessed row by row, processing short rows in long SIMD-like units leads to wasting of the computational capability of the device. Therefore, our main idea is to process a sparse matrix in chunks larger than individual rows, at the same time preserving the overall structure of the matrix representation typical of the CRS format. A group of rows processed by an individual SIMD unit shall be called 'strip', and the number of rows in a strip shall be called 'strip height' and denoted height. The number of strips, strips is thus equal to ceil(rows/height).
The new sparse matrix format, which we call compressed multi-row storage (CMRS) format, comprises one integer parameter height and four arrays: data array Val and three auxiliary integer arrays, ColInd, StripPtr, and 
Implementation
Our GPU implementation of the SpMV product for matrices in the CMRS format is based on the vector kernel by Garland and Bell [7] , with rows replaced by strips. Each SIMD unit, or warp, is assigned a strip to process. The method of doing the SpMV product in a strip is presented in Algorithm 1.
This algorithm is assumed to be executed in parallel by all threads in a warp. Implicit synchronization of the threads forming a warp is assumed. All auxiliary buffers and temporary variables are local to a warp, so no explicit synchronization of different warps is necessary, which allows for massively parallel processing of strips. The values of matrix elements and the corresponding column indices are read in parallel directly from arrays Val and ColInd, whereas row indices are assembled from the information hold in arrays StripPtr and RowInStrip. For sufficiently long strips these memory operations are coalesced to a high degree (j runs through consecutive matrix elements), and hence are very fast. Then the necessary elements of the input vector are fetched from the memory. This is the most sensitive part of each parallel SpMV implementation, as the vector elements required by a SIMD unit are often stored in memory locations scattered almost randomly in the memory, and hence their parallel processing is very problematic. Individual products of the matrix by vector elements are computed and stored in a buffer buf allocated in the fast on-chip shared memory. The buffer size is quite large, height · WARP SIZE, as each thread needs its own memory buffer for each row. The exact mapping of thread lanes and row numbers into buf is arbitrary, as long as it is one-to-one, and affects the number of shared memory bank conflicts and the efficiency of the parallel reduction buf i,r , r = 0, . . . , height − 1 row ← strip id · height + thread lane {height elements of buf 0 contain row sums} if thread lane < height and row < num rows then y row ← buf 0,thread lane end if step. The mapping presented in Algorithm 1 is designed to minimize the latter factor. For example, for height = 8 one can reduce 32 × 8 partial row sums in buf into 8 row sums using just 9 instructions.
Our implementation needs height-fold more shared memory than in the vector kernel of Ref. [7] . On the one hand this is beneficial for the parallel reduction, but on the other hand it imposes a severe limit on acceptable values of height, as the buffer size in Fermi-class GPUs is restricted to only 48 KB. For example, if we take height = 8 and store the data as 8-byte numbers, 64 bytes of the shared memory will be needed for each thread, and so the maximum number of resident threads per multiprocessor will be limited to 768, which is only half the number a multiprocessor can handle. Taking into account that a large number of resident threads is necessary to hide large memory latencies, we can safely assume that the maximum value of height in an efficient implementation for Fermi-class GPUs does not exceed 16. This, in turn, implies that the values stored in array RowInStrip are in the range 0. . . 15 and hence can be encoded in just four bits.
Recall that modern GPUs can support only up to 6 GB of memory. If we restrict our attention to square matrices that have at least one nonzero element per row and column, and if we take into account the storage necessary for the input and result vectors in double precision, we will find that the number of rows and columns of a sparse matrix for a problem that fits into 6 GB of memory is bounded by 2 28 . Therefore in our implementation we used 28 bits of ColInd[j] to store a column index and the remaining 4 bits to store the corresponding value of RowInStrip. This made array RowInStrip superfluous and explains 'uncompression' steps in Algorithm 1. In this way the CMRS format turns out to be even more compressed than the CRS one, i.e., it requires fewer bytes of storage. We believe that the SpMV kernel on GPUs is so much memory-bound that it is of utmost importance to reduce its memory footprint, even at the cost of several arithmetic operations, which in this kernel are almost free.
We also implemented several optional speed optimizations. The first one consists in buffering the input vector in the texture cache [7] rather than in the L1 cache. The second one consists in enlarging the shared memory size from 16 to 48 KB, at the cost of the L1 cache size. The third optimization strategy, adapted form [8] , aims at improving the effective memory bandwidth for arrays Val and ColInd, for large µ = nnz/rows, by first accessing the non-aligned portion of a strip and then accessing the remaining, aligned portion at full speed. The fourth one consists in reordering the elements of CMRS arrays so that the index array ColInd is first sorted by strip indices and then within the same strip by column indices. The idea behind such ordering is the same as for the rowmajor ordering in the CRS format: enhance the frequency of coalesced or cache-buffered accesses of a SIMD unit to the elements of the input vector. Note that most matrices coming from real problems have some internal structure and the locations of nonzero elements in neighboring rows are correlated. In such cases reordering the entries in the CMRS arrays can have a pronounced impact on SpMV efficiency. Assuming again height = 2, the CMRS representation ofM after data reordering would read: Note that the reordering affects only the matrix internal representation and does not involve any actions on the input and output vectors. Moreover, reordering is local to strips and hence is prone to parallelization.
Results
Hardware and software specification
The tests were performed on two Nvidia devices, GTX 480 (1.5 GB) and Tesla C2070 (6 GB). The ECC memory support in the Tesla device was switched off for a larger bandwidth. In both cases the operating system was a 64-bit Linux with Nvidia GPU driver v. 290.10 and CUDA 4.1. Theoretical peak capabilities of these devices are listed in Table 1 .
Test matrices and methodology
The tests were performed using 138 real matrices from the University of Florida Sparse Matrix Collection satisfying 10 6 ≤ nnz ≤ 10 8 , including all matrices with nnz ≥ 5 × 10 6 , and three additional sparse matrices of our choice. The matrices are of various sizes and represent a wide spectrum of applications and structure patterns. In particular, our test case includes all matrices used in several recent studies on GPU SpMV performance [7, 9-11, 20, 21] . The additional matrices include an artificial matrix, p7, which is a large (10 7 × 10 7 ) random permutation matrix, dense10000, which is a dense matrices, and aorta, which is a sparse matrix representing the pressure equation in the problem of the flow through the human abdominal aorta [22] . We included p7 to get a better insight into the role played by structural correlations between adjacent rows and the impact of (un)coalesced accesses to the input vector; dense10000 is an example of a matrix which can be processed at the highest speed; and aorta is an example of a sparse matrix for which efficiency of the SpMV kernel is of utmost importance, as it directly affects the time to solution in biomedical applications. Except for rail4284 used in [3, 7] , all test matrices are square.
For each test matrix and each combination of optimization parameters, our CMRS implementation of the SpMV product was called 10 times and the execution times were recorded. The largest time was omitted (the GPUs were connected to the display, which could occasionally interfere with the measurement) and the remaining 9 results were analyzed to find their average and standard deviation. The optimization parameters included the strip hight (height = 2, 4, 8, 16 ), block size (BS = 64j, j = 1, . . . , 8) , and four boolean parameters referring to the optimization techniques described in Sec. 4, so that the total number parameter combinations for each matrix was 512. We then searched for the parameters that give the shortest SpMV execution time (brute-force search), τ . These computations were then repeated using three freely available library implementations of the SpMV routine on GPUs: Nvidia CUDA cuSparse 4.1 implementations for CRS and HYB formats and CUSP 0.2.0 implementation (CSR-tex) for the CRS format. Each library function was called using the default library configuration. Finally, we used the bruteforce method to find the best possible SpMV times for two SpMV kernels: scalar and vector, as described in Sec. 2.4. The scalar kernel was optimized with respect to the value of the block size and the usage of the texture cache. The vector kernel was optimized with respect to all parameters used for CMRS optimization.
For each matrix the computational efficiency of the CMRS SpMV kernel was determined as the ratio of the number of elementary arithmetic operations to the SpMV kernel time, i.e. (2nnz − rows)/τ . The bandwidth was calculated as the total number of bytes that had to be transferred to or from the GPU main memory, β, divided by τ . We considered two extreme cases: the input vector either is not cached or is fully cached. The bytes transferred in each case, β − and β + , respectively, are calculated from β − = 20nnz + 4strips + 8rows,
and the bandwidth can be defined either as β − /τ or β + /τ . This leads to two definitions of memory utilization efficiency, η ± :
where B is the hardware theoretical bandwidth of the device. Clearly, η + < 1 for all CMRS SpMV implementations and a value of η − ≥ 1 indicates that the device efficiently buffers the input vector in its caches.
SpMV multiplication results
Selected results for the CMRS SpMV kernel running in double precision are presented in Tab. 2; results for the remaining matrices can be found in the additional material. This table contains the names and main properties of the test matrices, the best CMRS SpMV time, comparison of this time with the ones obtained using existing libraries or other well known algorithms, and analysis of the kernel efficiency in terms of the computational efficiency and memory bandwidth. The table is ordered according to the increasing number of nonzero elements per row, µ. Relative speedups against 5 other implementation are expressed as dimensionless parameters. These 5 implementations include two closed-source cuSparse 4.1 library implementations (CRS and HYB), one open-source library CUSP 0.2.0, and our implementations of two well-known CRS kernels, vector and scalar, as described in Sec. 2.4.
Artificial matrices
We start our analysis from two artificial matrices: p7 and dense10000. The SpMV kernels for these matrices could be implemented much more efficiently using other methods, but we use them as well-defined benchmarks for real-world sparse matrices.
Matrix dense10000 is a dense 10 000 × 10 000 matrix represented as a sparse matrix. The structure of dense matrices allows for fully coalesced memory accesses to all data, including the input vector, and efficient utilization of caches. This matrix yields the highest effective bandwidth, 258 GB/s, which is almost 50% larger than the theoretical peak hardware bandwidth (η − = 1.48 ≫ 1), a phenomenon resulting from buffering the input vector in device caches. Note that the previous generation GPUs, e.g. GTX 285, which had neither L1 nor L2 caches, allowed for far less efficient data caching (η − ≤ 1.08) [7] .
If one assumes perfect caching of the input vector, the bandwidth for dense10000 drops to 155 GB/s, which is below the theoretical peak of 177 GB/s, but above 148 GB/s reported by the bandwidthtest utility program from the CUDA 4.1 SDK. This has several interesting consequences. First, the SpMV kernel is clearly memory-bound and the contribution to its execution time from arithmetic operations, e.g. parallel reduction, is practically negligible. Second, the performance metrics obtained for this matrix (η + = 0.87, 26 GFLOPS, 258 GB/s) can be regarded as upper bounds for any CMRS format implementations on the Fermi architecture and are very close to the physical limit (η + = 1). Third, even though Nvidia artificially decreased four-fold the rate of double precision operations in consumer GeForce cards, this has practically no impact on the SpMV kernel performance.
Two alternative SpMV kernels, HYB and scalar, perform rather poorly for dense matrices. One reason for this is that these kernels assign one thread per matrix row. As the number of rows is rather small, the GPU cannot launch enough threads to hide memory latency. Moreover, the scalar kernel cannot coalesce memory accesses for matrices with high µ.
Another artificial matrix, p7, constitutes another extreme case in which accesses to the input vector are totally uncoalesced. In such a case a warp must issue 32 (128-byte long) memory requests to fetch 32 doubles from the memory, whereas a fully coalesced memory access would require only 2 requests. This leads to the memory bandwidth 10 times smaller than for dense10000. Note that HYB and scalar kernels perform much better than CMRS, probably due to their better coalescing of matrix data fetches.
Sparse matrices
Of all test matrices from the University of Florida Sparse Matrix Collection, TSOPF RS b2383 yields the highest bandwidth, 245 GB/s, and the highest computational performance, 24.5 GFLOPS. These results are very close to those obtained for dense10000 and hence to the physical limit (η + very close to 1), which means that for some realworld sparse matrices the CMRS format allows for almost full data coalescence and very efficient cache utilization (η − = 1.39). It's interesting to notice that Nvidia HYB is over 3 time less efficient for this matrix.
Matrix lp1 is the other extreme-the memory bandwidth obtained for this matrix, 12 GB/s, is over two times worse than for matrix p7, which indicates that our implementation cannot coalesce memory accesses to the data stored in this matrix, let alone the input vector. Even more striking is the comparison of our implementation with Nvidia HYB and the scalar kernels-the former is 9 times faster, and the latter runs 12.5 more slowly than our implementation. The reason for this strange behavior, leading to over a 100-fold (!) difference in the computational efficiency between the HYB and scalar kernels, is the structure of lp1. The first row in this matrix has 249 643 nonzero elements (≈ nnz/6), while the next longest row has only 99 nonzero elements. In the HYB algorithm most of nonzero elements in the very long row is represented in the COO format and processed by a group of warps. In our implementation of the CMRS format this row is processed by a single warp and in the scalar CRS algorithm it is processed by only a single thread. Therefore the efficiency of both CMRS and scalar kernels suffer from GPU memory latency, which in this case is very high (recall that a GPU needs to launch tens or even hundreds of warps to hide various memory latencies).
Poor performance of the scalar kernel for lp1 rises a question whether this kernel is of any practical relevance. We found only one matrix, asia osm, for which the scalar kernel gives the shortest SpMV time. For this matrix it is ≈ 3.5 times faster than our CMRS-based implementation and ≈ 15% faster than the HYB implementation.
Matrix Chebyshev4 is another interesting example. It contains 16 rows which are fully or almost fully filled with nonzero values, arranged in 8 pairs of adjacent rows. On the one hand, grouping a few adjacent rows into strips would lead to the same latency problems as with lp1. On the other hand, in the HYB algorithm these rows are processed in the COO format, which is less efficient than the vector CRS kernel. For this reason the best SpMV time for this matrix is achieved by the vector kernel.
Matrix cage15, which requires ≈ 1.3 GB of storage, is the largest test matrix that fitted into the 1.5 GB of memory available in GTX 480. Our implementation deals with such a large matrix well and outperforms other implementations by a factor of 2 or more. In particular, since the CMRS format is an extension of the CRS format, it is possible to implement conversions between CMRS and CRS that use essentially the same storage. In contrast to this, the HYB and CRS format are completely different from each other and conversion between them requires two large, separate storages on the device. Since the only way to use CUDA 4.1 implementation of the HYB format is to load a matrix in the CRS format into the device and then convert it to the HYB format, this implementation cannot be used for matrices that occupy more than half of the available GPU memory.
Finally, matrix aorta, which comes from a real problem where the SpMV kernel efficiency is an essential issue, shows that the best results are obtained for three implementations, including ours. However, the computational performance for this matrix, is rather disappointing (4.1 GFLOPS, η + = 0.20) and calls for further optimization.
Comparison with other implementations
The data collected in Tab. 2 are not representative for general sparse matrices and cannot be used for evaluation of different GPU SpMV algorithms and implementations. Instead, we used statistical analysis of the results obtained for 138 sparse matrices from UF SMC, assuming that this collection contains a representative sample of sparse matrices.
A fair comparison of implementations alternative to ours must take into account that CRS, HYB and CUSP are complex implementations which exploit various simple SpMV kernels. Which kernel with which internal optimization parameters is actually used depends on the matrix structure, which may require some matrix preprocessing. These complex implementations have, essentially, no free parameters, but their efficiency could be increased by adjusting some internal parameters, e.g. the thread block size, to the particular matrix. On the other hand, scalar, vector and CMRS kernels are simple, irreducible algorithms, which can be considered as building blocks for more advanced implementations; they contain several optimization parameters whose optimal values are found here by a brute-search method.
We found no matrix for which the cuSparse 4.1 CRS implementation would give the shortest SpMV product time. Although for 15 matrices it turns out significantly faster than our CMRS implementation, each of these matrices is characterized by a low number of nonzero elements per row (µ < 5) and for each of them either the HYB or the scalar kernel turns out significantly faster (we adopt a convention that implementation A is significantly faster than B if and only if its execution time is at least 10% shorter). The CUSP implementation is generally even less efficient. The vector kernel could be easily included as a special case into our CMRS implementation, with height = 1. The fact that for some matrices, e.g. Chebyshev4, this kernel gives the fastest SpMV implementation, implies that for some matrices the optimal value of height is 1. It is also interesting to notice that the CMRS format can improve the efficiency of the vector kernel it is based upon by as much as 4.3 times (mc2dpi) and even by 4.7 times for the artificial matrix p7.
The scalar kernel needs more attention. We found 21 matrices for which it is significantly faster than our CMRS implementation, but only for 1 of them this kernel turned out to be faster than HYB. The scalar kernel exhibits good performance for matrices with a small number of nonzero elements per row (µ 7) and its performance can be as good as 12.6 GFLOPS and 166 GB/s. However, in most of such cases the HYB implementation gives even better results.
The most interesting is comparison of our algorithm with the cuSparse 4.1 HYB implementation. We found HYB to be significantly faster than our implementation for 48 matrices, and it is the fastest implementation for 46 of them. On the other hand, our implementation is significantly faster than HYB for 55 matrices and for 41 of them it is the fastest implementation. Although for 5 matrices the HYB implementation is faster than ours from 4 to 9 times, all these matrices have a special structure, i.e. they contain one or just a few rows with a very large number of nonzero elements (we will call them 'lp1-like matrices'). HYB implementations deals with such cases efficiently because it preprocesses each matrix before the first SpMV routine can be called on it. Similar preprocessing could be implemented in the CMRS format, most likely with a similar performance boost. On the other hand, the HYB format could not be used for 16 largest matrices because of its memory requirements; for each of these matrices our SpMV implementation turned out to be the most efficient one.
It is difficult to find a good criterion for an SpMV algorithm to be the most efficient for general sparse matrices. One such criterion could be the sum of single SpMV product execution times for all tested matrices. If we neglect in such comparison 5 lp1-like matrices and 16 largest matrices for which the HYB algorithm could not be run, we will get that these times are in the ratio 1.3 : 1.0 : 1.3 : 2.2 : 3.7 : 1 for the CRS, HYB, CUSP, vector, scalar and CMRS kernels, respectively. This could be loosely interpreted that our CMRS implementation is on average as fast as the HYB kernel and faster than other kernels. Note that the CMRS format with its adjustable parameter height allows for, on average, ≈ 2.2 times faster SpMV implementation than the vector kernel with height fixed at 1. Figure 1 depicts the speedup of our CMRS SpMV implementation against scalar, vector and HYB kernels as a function of the mean number of nonzero elements per row (µ). Clearly µ is a relevant parameter for determining relative performance of various SpMV implementations. The scalar and hybrid kernels give the shortest SpMV times for small µ, but their efficiency decreases as µ is increased, with the scalar kernel being very inefficient for large µ. This property of the scalar kernel is related to its inability to coalesce data transfers if µ is large. The vector kernel behaves in just the opposite way: its efficiency relative to other kernels is very good for large µ, but it decreases as µ drops below ≈ 100. This is related to the fact that the vector kernel processes each row with a warp of 32 threads, and hence is most efficient for sparse matrices with µ far larger than 32. Interestingly, for µ 120 the speedup of the CMRS over the vector kernel can be approximated as 6µ −0.35 , i.e. by a power law. From Fig. 1 it can be immediately seen that the CMRS format generally does not yield much improvement over the CRS format for µ 150, and tends to be systematically slower than the HYB format for µ 20. Hence one expects that the advantages of the CMRS will be most pronounced for moderate values of µ. This is confirmed by Fig. 2 , which depicts the speedup of our CMRS SpMV implementation against the best of all five alternative SpMV implementations considered here, calculated individually for each matrix. Using CMRS can improve the SpMV performance by up to 43%. However, at the moment there is no simple criterion that could be used to tell in advance whether using it can be of any benefit. Nevertheless, since the CMRS format introduces practically no overhead over the vector kernel, we suggest that for the Fermi GPU architecture one should use the CMRS format for matrices with µ 70 and test its efficiency against the HYB format for 20 µ 70.
A quite large dispersion of the data in Fig. 1 suggests that µ is not the only parameter controlling the relative performance of various implementations. As a second such parameter we propose the standard deviation (σ) of the number of nonzero elements (n i ) per row Figure 3 depicts the σ-dependence of the efficiency of the CMRS-based implementation of the SpMV kernel relative to that of HYB. As could be expected, the HYB kernel is the most efficient for small values of σ. In particular, for σ = 0 all rows have the same number of nonzero elements and the HYB format reduces to the ELL format without any memory overhead. As σ grows, the memory overhead of ELL begins to reduce the efficiency of HYB and for very large σ an increasing part of the matrix is processed using the COO format, which is computationally inefficient. Our results suggest that as long as σ < µ, the HYB format should be used if σ 1.5 and the CMRS is more efficient for σ 3. Matrices for which σ > µ are 'atypical', powerlaw [14] or lp1-like matrices for which the HYB format is preferable.
Optimization parameter choice
Our analysis is based on a brute-force search in a fairly large parameter space, which is time consuming and certainly impractical. This rises the question of whether it is possible to find the optimal set of parameters for the CMRS algorithm quickly? While a detailed answer is beyond the scope of this research, there are several conclusions that can be already drawn from our results. First of all, the parameters are not only mutually correlated, but also dependent on other factors, like µ, which makes the problem hard. For example, since the value of height controls the shared memory required by the CMRS kernel, different settings for the shared memory size are required for height = 2 and for height = 16. While this particular correlation can be easily related to the GPU architecture, other correlations are less obvious. Second, there are some regular patterns in the best-case sets of optimization parameters. For example, for all test matrices in double precision we found that the input vector should be cached in the texture cache rather than in L1. It is also clear that proper sorting of data elements, as described in Sec. 4, will never degrade the SpMV efficiency. Third, the CMRS format was already implemented in the SpeedIT 2.0 library, and the average performance of the SpMV kernel, which uses some very crude heuristics to determine the optimization parameters, already lies above the performance of the CRS implementations from Nvidia. Finally, really challenging applications of GPUs are in the area of high performance computing (HPC), where individual calculations often take from hours to weeks. If choosing between the HYB or CMRS format and then using appropriate internal optimization parameters can decrease the computation time by a factor of 3, it should be fully acceptable to devote several seconds or minutes to analyze a matrix (or a group of matrices) to find these parameters. It is also very likely that a particular sparse matrix format, CMRS or HYB, together with its best optimization parameters, will be adequate to all matrices appearing in a given problem, so that detailed matrix analysis would have to be done only once.
Comparison with Tesla C2070
Tesla C2070 is a GPU designed specifically for applications in high performance computing. Compared to GTX 480, which is a high-end gaming card, it delivers almost 3 times faster theoretical computational performance in double precision and has almost 4 times larger memory, at the price of ≈ 20% smaller memory bandwidth (see Tab. 1). We calculated η + for all test matrices and found its value for C2070 to lay between 0.85 and 1.00 of the corresponding value for GTX 480. This shows to what extent the SpMV kernel is memory-bound: the awesome computing power of C2070 can not be exploited in this kernel and the only factor that really matters is the memory utilization, which turns out a bit better in GTX 480.
Single precision performance
Since it is more difficult for GPUs to coalesce scattered memory transfers for single precision data [7] , the peak effective bandwidth achieved by GTX 480 in the single precision mode is lower than for double precision (216 GB/s for matrix nd24k). However, since doing SpMV in single precision requires far less bytes to be transferred, the computational power is in this case generally larger, peaking at 36 GFLOPS for nd24k. The performance of our CMRS implementation relative to others tends to be even better in single than in double precision. Our method turned out the fastest by at least 10% in 58 cases, the maximum speedup is 63%, and it is on average 20% faster than Nvidia HYB implementation. We also found that in single precision the CMRS format can give a significant speedup of the SpMV kernel even for matrices with µ as small as 6.
Conclusions and Outlook
The new format for storing sparse matrices, CMRS, is designed specifically for optimizing the SpMV operation on modern throughput-oriented computer architectures. It has several features distinguishing it from other formats developed for the same purpose: (i) it is an extension of a popular format, CRS, with a quick and in-place conversion to and from it; (ii) it does not impose any artificial pressure on the device memory, in particular, it does not require zero-padding; (iii) it does not require any row or column reordering nor any other matrix preprocessing; (iv) for Fermi-class high-end GPUs it gives the most efficient SpMV product implementation for typical sparse matrices if the average number of their nonzero elements per row, µ, is greater than ≈ 70 or when the standard deviation of the number of nonzero elements per row, σ, is greater than 3; it can allow for the most efficient implementation for some matrices satisfying µ 70 and 1.5 µ 3 . These features should facilitate its adoption in existing software and open new possibilities for its further optimization, e.g. by combining it with other formats for better handling of atypical sparse matrices, e.g. lp1. Moreover, the fact that our CMRS-based implementation of the SpMV kernel often approaches the hardware limit suggests that this format will scale well into future GPU architectures and can prove efficient also in other throughput-oriented computer architectures.
Our results reveal the importance of developing a representative collection (or collections) of sparse matrices for which the SpMV product is a truly relevant operation. We are aware that the two simple criteria we applied here (a matrix must be square and have at lest 10 6 nonzero elements) do not lead to a 'representative' collection, as the UF SMC contains also very unusual matrices for which the SpMV product is very unlikely to be applicable, e.g. matrices with empty rows or columns. Such atypical matrices can obfuscate the general picture of the SpMV per-formance and its dependence on the matrix format and techniques used to implement it.
We also demonstrated the importance of general characteristics of SpMV product efficiency, e.g. the bandwidth efficiency. Such parameters enable to identify matrices for which further optimization is required as well as help determine the quality of hardware support for the SpMV operation.
Additional material
This section contains an extended version of Tab. 2, with the data for all 138 sparse matrices from the Florida University Sparse Matrix Collection used in our study, including the value of σ, obtained for Nvidia GTX 480 graphics card in double precision. As some of the test matrices in the original collection contain explicit zeroes, in their cases we provide the data for the matrix both without and with the explicit zeros included in the computer representation of the matrix. Matrices with explicit zeros are distinguished by adding (*) to their names, e.g. "ohne2 (*)". Matrices are ordered according to their "kind", as given in the original collection. The following short forms of the kind were used for brevity: 
