Abstract
Introduction
Sparse matrix-vector multiplication (SpMV) is a key linear algebra algorithm used in many application domains. Dozens of compressed storage formats (e.g., COO, CSR, ELL, JDS, and HYB) and their variants have been proposed for storing sparse matrices in the past decades.
GPUs have been widely used as efficient accelerators for general computing due to their massive computational power. GPUs perform quite well in dense linear algebra algorithms (e.g., Dense General Matrix-Matrix multiplication (DGEMM)), since those computations have well-structured datasets which fit GPU's the vectorized singleinstruction multiple-data (SIMD) architecture well. While facing sparse linear algebra algorithms (e.g., SpMV), the performance is greatly degraded for three reasons: (1) since sparse matrices are stored in compressed storage format, many indirect and irregular memory accesses occur thus causing irregular and non-coalesced memory accesses; (2) due to the inhomogeneity of input sparse matrices, load imbalance at warp-level could cause performance bottleneck; (3) intra-warp load imbalance could cause thread divergence and result in underutilization of SIMD resources.
Many compressed storage formats have been proposed. The most common formats include Compressed Sparse Row (CSR), ELLPACK (ELL), Diagonal (DIA), Coordinate (COO), and also Hybrid (HYB) which combines ELL and COO. HYB performs better than other formats in most cases, it splits the matrix into two submatrices: a compressed part contains fixed number of columns per row well-suited to ELL, and the other part stored in COO. However HYB suffers from redundant computations when row lengths vary significantly in a matrix. Besides, the COO part could also be a performance bottleneck. Besides those traditional storage formats, quite a few variant formats have been proposed to store sparse matrices based on original formats, e.g., BELLPACK [12] . Obviously, these blocked formats only fit for matrices with dense sub-blocks which is beyond our concern since we work on general sparse matrices.
In this work, we propose a new hybrid storage format Blocked CSR and JDS (BCJ) which combines the advantages of row-major storage and column-major storage, therefore we modified row-major CSR and column-major JDS to satisfy the hardware features on GPU. Our strategy is based on following reasons:
1, CSR stores non-zeros in row granularity. When CSR kernel assigns each row with a warp, the kernel performs best for matrices with long rows since the threads within a warp access the DRAM in coalesced manner thus reducing memory requests, further, parallel reduction within a warp could also accelerate the reduction step especially for rows with more non-zeros.
2, ELL-like formats store non-zero mostly in column-major method. ELL-like formats assigns each row with a single thread. Since non-zero elements are store in column major manner, the memory accesses could also be coalesced. However, the non-zeros within a row are processed by a single thread, the reduction step could become performance bottleneck while processing longer rows. However ELL-like formats outperform others when rows are comparatively short.
Our strategy involves reordering the data which incurs a run-time cost. However, this cost is quite common in auto-tuning frameworks for SpMV and it could be amortized during many iterations since SpMV is often used in iterative methods.
The rest of the paper is organized as follows: Section 2 introduces some background about GPUs, CUDA programming model and some storage formats. Section 3 defines our hybrid auto-tuning strategy. Section 4 describes our evaluation methodology and analysis of the experimental results. Section 5 summarizes related work. We conclude in Section 6.
Background

GPUs and CUDA Programming Model
GPUs have been a very popular and powerful parallel accelerator in many areas. NVIDIA GPUs consist of an array of Streaming Multiprocessors (SM). Each SM contains a set of CUDA cores or Streaming Processors (SP). A group of 32 SPs within a SM share a single instruction unit and execute programs in a SIMD manner.
CUDA programming model is provided by GPUs vendor NVIDIA for programming on NVIDIA GPUs, thus makes GPUs general purpose accelerators. CUDA is a warp-centric programming model, a set of 32 threads run in SIMD fashion is called a warp in CUDA. A CUDA program generally consists of CPU code and GPU code, the GPU code is also called kernel. A kernel can launch a grid of threads. Grid is organized in a hierarchical way, a grid consists of a scalable array of blocks and a block consists of an array of threads. Threads within a block can cooperate through shared memory and barriers mechanism.
Threads within a warp always execute the same instruction at the same time. If workload imbalance or branch divergence occurs, the program have to be serialized due to the feature of SIMD architecture. Unlike CPUs, GPUs hide memory access latency through warp scheduling mechanism which require many threads (or warps) running concurrently on GPU. In order to alleviate the impact of memory access, GPU provides memory coalescing mechanism to compress many memory requests into fewer transactions within a half-warp (16 threads) if the memory requests satisfy certain conditions.
Formats for Sparse Matrix
Dozens of formats have been proposed for storing sparse matrix. Besides typical compressed formats, there are also some variant formats [1-3, 8, 9] have been also presented in last decade. In this work, we focus on SpMV problem in the form of y = y + Ax where y and x are one-dimensional vectors and A is two-dimensional sparse matrix.
COO
Coordinate (COO) is the simplest format in which the sparse matrix is stored in three vectors: one vector stores the non-zero elements, the other two store the corresponding indices of row and column of each non-zero respectively. COO SpMV kernel assigns each non-zero to a thread, and the performance is greatly degraded due to the use of atomic writes for updating vector y. In addition, load imbalance occurs when row length vary significantly.
CSR
Compressed Sparse Row (CSR) is the most popular format for sparse matrix and graph. CSR store non-zeros at row granularity. Based on COO, CSR does not need to keep the row indices for every element, it keeps only the row offsets instead. Both coalesced memory accesses and high data parallelism could be achieved when rows have large size.
ELL
ELLPACK (ELL) uses two dense matrices storing the non-zeros and column indices respectively. All rows within ELL have the same size after padding each row with zeros. ELL works at the granularity of thread per row and only performs well when matrices are quite structured. At most cases, ELL performance is degraded due to the expense of redundant zero padding and computation. So ELL is often combined with COO format in HYB format.
JDS
JDS (Jagged Diagonal Storage) is a much more adaptive storage format than ELLPACK. JDS eliminates the redundant padding in ELLPACK. In JDS, non-zeros are reordered in decreasing order of row lengths. JDS stores the beginning index of each column and number of non-zeros of each row. JDS kernel also assigns each row with one thread. Since rows are reordered, consecutive rows have comparatively same row lengths, thread divergence is eliminated within a warp. However, JDS suffers when processing rows with large size which require more steps to get a sum result.
Hybrid auto-tuning Framework
Unlike compiler optimization techniques, auto-tuning technique is usually used for some key algorithms in specific domains, e.g., SpMV. SpMV is the most important core kernel in linear algebra system. Since the inhomogeneity of different sparse matrices, we proposed an adaptive hybrid format and corresponding auto-tuning framework for SpMV on GPU. Our format combines the advantages of row-major storage and column-major storage. We reorder the rows of a given matrix according to row sizes in decreasing order like JDS, then split the matrix into denser part and sparser part using an empirical partitioning threshold. These two sub matrices are stored in variant CSR and JDS separately. Inspired by previous work [4, 6] , we further block both CSR and JDS along column dimension in order to mitigate warp-level load imbalance. We call this format Blocked CSR and JDS (BCJ).
For denser sub matrix, warp is the basic task assigning unit, memory accesses are coalesced due to row-wise storage. For sparser sub matrix, threads within a warp are assigned to neighboring consecutive rows. Since row have been reordered and non-zeros are stored in column-major manner, memory coalescing can be achieved and thread divergence can be reduced. Furthermore, we solve warp-level load imbalance through blocking along column dimension. BCJ consists of B-CSR and B-JDS which will be described later in this section. In order to illustrate our hybrid storage format, we use following matrix A with 10 non-zero elements distributed in 4 rows and 6 columns as an example. 
Blocked and Padded CSR (B-CSR)
CSR-vector kernel assigns each row with a warp, and CSR stores non-zero in rowmajor manner, the memory accesses within a warp could be coalesced. To further improve memory coalescing, we pad each row with zeros to ensure the length is multiple of 64Byte. In CUDA, 64Byte is basic DRAM segment, better alignment can help improve memory coalescing. We pad each row with zeros to be multiple of 16 for single precision computation, or to be multiple of 8 for double precision.
The top rows are comparatively longer, thus load imbalance could occur at warp-level. Ashari [4] blocked their format BRC along column dimension. Liu [6] also employed blocking strategy into their ESB format. Inspired by their work, we further block the CSR to ensure every warp have almost the same workload. Figure 1 -a shows the storage of B-CSR for example matrix A. Some extra meta-data is needed to keep the information. Besides non-zero values and their column indices, we use B_CSR_ptr to track the row offsets after blocking. Row_perm keeps the information of reordering. And the kernel code for B_CSR is shown in Figure 3 -a. After blocking, a row may be processed by multiple warps. Therefore, we use atomic operation (atomicAdd())to update vector y when each warp finishes parallel reduction.
Blocked and Padded JDS (B-JDS)
ELLPACK only works well when non-zero in each row are identical. However, it suffers when row lengths vary significantly. JDS was proposed to eliminate the expense of padding and computation within ELLPACK. Based on JDS, we pad each column with zeros to be multiple of 32 since consecutive 32 threads within a warp are assigned to adjacent rows. We use different padding strategy here with B-CSR (multiple of 64B) due to our execution pattern which will be described later. Warp-level load imbalance is inherent in JDS format since top rows are always larger. We block JDS along column dimension to B-JDS. Figure 1-b shows the storage of the example matrix A. We keep the information of row sizes after blocking in Non_zeros per row. Rows_pre_sum is the prefix sum of row numbers in each block, this short vector is used to calculate the starting position of each thread. Unlike JDS where each row is processed by only one thread, B-JDS may require a row to be processed by multiple threads. Therefore, atomic operation (atomicAdd()) is also needed for updating vector y, Figure 3 -b shows the pseudo kernel for B_JDS SpMV. 
Parameters to be Modeled
For a given sparse matrix, we reorder the rows according to the lengths in decreasing order, then split it to an upper and lower part. We use an empirical value as the partitioning threshold. We manually generate structured matrices with different row lengths to show the advantages of row-major storage and column major storage. The matrices generated are quite well structured since ELL only works well on structured matrices and the performance comparison can reveal the advantages of row-major and column-major storage. Figure 2 depicts the speedup of CSR over ELL with different row lengths. The number of rows is set to be 10000 here. Figure 2 shows that CSR vector kernel outperforms ELL when row sizes are comparatively large. When row sizes are quite small (e.g. less than 32), CSR performance is degraded either due to uncoalesced memory accesses or SIMD resources underutilization. We chose 256 as the empirical threshold which partition a given matrix into a row-major stored and a column-major stored sub matrix.
We use Bc to stand for blocking size in CSR part and Bj for blocking size in JDS part. Since those two parameters are both relative to non-zeros distribution. Since Bc and Bj are mostly relevant with the average non-zeros per row. For B_CSR part, we set Bc to be multiple of 32 since the basic task assigning unit is a warp. We use b_csr_ave to denote the average number of non-zeros in B_CSR part. Then Bc is set using following simple equation:
For B_JDS part, we use b_jds_ave to denote the average number of non_zeros in this part. Then Bj is set through:
Bj = b_jds_ave
Both b_csr_ave and b_jds_ave is calculated through transforming matrix into BCJ format when scanning a given sparse matrix. 
Experimental Results
In order to evaluate the performance of our hybrid BCJ format, we selected 13 sparse matrices which are also used in previous work from a broad range of domains that shows different sparsity characteristics. Those matrices are all from University of Florida sparse matrix collection. Table 1 shows the basic information about them. In order to show the advantages of BCJ format, we compared BCJ performance with original formats which include CSR, JDS, and HYB both in single and double float precision. Our experiments were done on an NVIDIA K20 card in Kepler architecture which consists of 14 SMs each with 192 CUDA cores. We compiled our kernel using compute capability 3.5. Since texture memory can always improve the performance, we implicitly use texture memory to load the vector x and other short vectors in our auto-tuning framework.
We plot the performance results of BCJ compared with CSR, JDS, and HYB both in single precision and double precision in Figure 4 .
BCJ format partitions a given sparse matrix into two sub matrices, since in some cases the row sizes are all larger or smaller than the empirical threshold. Then BCJ format degrades into B-CSR or B-JDS. In our selected matrices, some matrices degrade into B-JDS, and some degrade into B-CSR but still with padding modification.
The results show that for most cases, BCJ format outperforms the original formats significantly. For matrix dense which is actually not a sparse matrix, BCJ format degrades to original CSR format since every row contains the same 2000 non-zeros. For matrix epidemiology, our format degrades into JDS since this matrix contains almost the same non-zeros in each row in a quite small number (which is 4), then there is no need for blocking along column dimension, thus BCJ can't outperform HYB for this case because HYB format works quite well in structured matrices. 
Related Work
There is much literature on SpMV optimization both on multi-core CPU platform and emerging many-core GPU platform since its importance in linear algebra systems and other algorithms. Bell and Garland [7] implemented SpMV in CUDA for several formats which include DIA, ELL, CSR, COO and hybrid format HYB (ELL+COO). Alexander proposed a variant format Sliced ELL (SELL) based on ELL. SELL split matrix into several slices and pad individual slice into ELL in order to reduce padding zeros. Liu [6] proposed ELLPACK Sparse Block (ESB) for Intel Xeon Phi Co-processor which is also a vector processor with a SIMD width of 8.
For matrices with dense sub blocks. Choi [12] implemented the BCSR and BELL formats on GPUs trying to reduce memory usage and Yan [9] proposed a BCCOO format. Su [16] also proposed a hybrid format COCKTAIL which uses different formats to store different parts of a matrix. This hybrid format use several formats which may not address the drawbacks within the original format. We improve our hybrid format through blocking that make our format more balanced and structured.
Besides formats adjusting, there are also some operation techniques have been proposed for SpMV. Tang [15] used bit-representations to compress index arrays. Blelloch [14] proposed the segmented operations to SpMV.
Conclusions
In this paper, we presented yet another hybrid storage format BCJ for auto-tuning SpMV on GPU. Our strategy combines the advantages of row-major storage and columnmajor storage. For rows with larger size, we assign each row with a warp, thus both coalesced memory accesses can be achieved without intra-warp thread divergence since all rows are in comparatively large size. For sparser part of a given matrix we assign each row with a thread. Since matrix is stored in column-major manner and consecutive rows have similar row sizes, both memory coalescing and intra-warp load balance can be achieved.
In order to mitigate the work load imbalance at warp level. We introduced blocking scheme into CSR and JDS in our format. BCJ is an adaptive method for sparse matrix. And the experiments show that our BCJ format outperforms the original formats significantly in most cases.
