Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is insensitive to the sparsity structure of the input matrix. Thus the single format can support an SpMV algorithm that is efficient both for regular matrices and for irregular matrices. Furthermore, we show that the overhead of the format conversion from the CSR to the CSR5 can be as low as the cost of a few SpMV operations.
INTRODUCTION
Over the past few decades, sparse matrix-vector multiplication (SpMV) has probably been the most studied sparse BLAS routine because of its importance in many scientific applications. The SpMV operation multiplies a sparse matrix A of size m×n by a dense vector x of size n and obtains a dense vector y of size m. Its naïve sequential implementation can be very simple, and can be easily parallelized by adding a few pragma directives for the compilers. But to accelerate large-scale computation, parallel SpMV is still required to be hand-optimized with specific data storage formats and algorithms [1, 2, 3, 4, 6, 7, 10, 13, 16, 17, 19, 21, 22, 24, 29, 30, 31, 33, 34] .
As a result, a conflict may emerge between the requirements of SpMV and other sparse matrix operations such as preconditioning operations [22] and sparse matrix-matrix multiplication [23] . The reason is that those operations commonly require matrices stored in the basic formats such as the compressed sparse row (CSR). Therefore, when users construct a real-world application, they need to consider a cost of format conversion between the SpMV-oriented formats and the basic formats. Unfortunately, this conversion overhead may offset the benefits of using these specialized formats, in particular when only tens of iterations are needed in a solver.
The conversion cost is mainly from the expensive structuredependent parameter tuning of a storage format. For example, some block-based formats require finding a good 2D block size [6, 7, 10, 31, 32, 34] . Moreover, some hybrid formats [4, 29] may need completely different partitioning parameters for distinct input matrices.
To avoid the format conversion overhead, a few algorithms have concentrated on accelerating CSR-based SpMV with either row block methods [1, 17] or segmented sum methods [5, 16] . However, each of the two types of methods has its own drawbacks. As for the row block methods, despite their good performance for regular matrices, they may provide very low performance for irregular matrices due to unavoidable load imbalance. In contrast, the segmented sum methods can achieve near perfect load balance, but suffer from high overhead due to more global synchronizations and global memory accesses. Furthermore, none of the above work can avoid an overhead from preprocessing, since certain auxiliary data for the basic CSR format have to be generated for better load balancing [1, 17] or established primitives [5, 16] .
Therefore, to be practical, an efficient format must satisfy two criteria: (1) it should limit format conversion cost by avoiding structure-dependent parameter tuning, and (2) it should support fast SpMV for both regular and irregular matrices.
To meet these two criteria, in this paper, we have designed CSR5 (Compressed Sparse Row 5) 1 , a new format directly extending the classic CSR format. The CSR5 format leaves one of the three arrays of the CSR format unchanged, stores the other two arrays in an in-place tile-transposed order, and adds two groups of extra auxiliary information. The format conversion from the CSR to the CSR5 merely needs two tuning parameters: one is hardware-dependent and the other is sparsity-dependent (but structure-independent). Because the added two groups of information are usually much shorter than the original three in the CSR format, very limited extra space is required. Furthermore, the CSR5 format is SIMD-friendly and thus can be easily implemented on all mainstream processors with the SIMD units. Because of the structure-independence and the SIMD utilization, the CSR5-based SpMV algorithm can bring stable high throughput for both regular and irregular matrices.
In this paper, we make the following contributions:
• We propose CSR5, an efficient storage format with low conversion cost and high degree of parallelism.
• We present a CSR5-based SpMV algorithm based on a redesigned low-overhead segmented sum algorithm.
• We implement the work on four mainstream devices: CPU, nVidia GPU, AMD GPU and Intel Xeon Phi.
• We evaluate the CSR5 format in both isolated SpMV tests and iteration-based scenarios.
We compare the CSR5 with 11 state-of-the-art formats and algorithms on dual-socket Intel CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi. By using 14 regular and 10 irregular matrices as a benchmark suite, we show that the CSR5 obtains comparable or better performance over the previous work for the regular matrices, and can greatly outperform the prior work for the irregular matrices. As for the 10 irregular matrices, the CSR5 obtains average performance improvement of 17.6%, 28.5%, 173.0% and 293.3% (up to 213.3%, 153.6%, 405.1% and 943.3%) over the second best work on the four platforms, respectively. Moreover, for iteration-based real-world scenarios, the CSR5 format achieves higher speedups because of the fast format conversion. To the best of our knowledge, this is the first time that a single storage format can outperform state-of-the-art work on all four modern multicore and manycore processors.
PRELIMINARIES

The CSR Format
The CSR format for sparse matrices consists of three arrays: (1) row_ptr array which saves the start and end pointers of the nonzeros of the rows. It has size m + 1, where m 1 The reason we call the storage format CSR5 is that it has five groups of data, instead of three in the classic CSR.
is the number of rows of the matrix, (2) col_idx array of size nnz stores column indices of the nonzeros, where nnz is the number of nonzeros of the matrix, and (3) val array of size nnz stores values of the nonzeros. Figure 1 shows an example.
Figure 1:
A sparse matrix and its CSR format.
Parallel Algorithms for CSR-based SpMV
Row Block Methods
In a given sparse matrix, rows are independent from each other. Therefore an SpMV operation can be parallelized on decomposed row blocks. A logical processing unit is responsible for a row block and stores dot product results of the matrix rows with the vector x to corresponding locations in the result y. When the SIMD units of a physical processing unit are available, the SIMD reduction sum operation can be utilized for higher efficiency. These two methods are respectively known as the CSR-scalar and the CSRvector algorithms, and have been implemented on CPUs [33] and GPUs [4, 29] . Algorithm 1 shows a parallel CSR-scalar method.
Algorithm 1 SpMV using the CSR-scalar method.
1: for i = 0 to m − 1 in parallel do 2:
for j = row_ptr[i] to row_ptr[i + 1]−1 do 4:
end for 6: end for Despite the good parallelism, exploiting the scalability in modern multi-processors is not trivial for the row block methods. The performance problems mainly come from load imbalance for matrices which consist of rows with uneven lengths. Specifically, if one single row of a matrix is significantly longer than the other rows, only a single core can be fully used while the other cores in the same chip may be completely idle. Although various strategies, such as data streaming [11, 17] , memory coalescing [13] , data reordering or reconstruction [3, 18, 25] , static or dynamic binning [1, 17] and Dynamic Parallelism [1] , have been developed, none of those can fundamentally solve the problem of load imbalance, and thus provide relatively low SpMV performance for the CSR format.
Segmented Sum Methods
Blelloch et al. [5] pointed out that the segmented sum may be more attractive for the CSR-based SpMV, since it is SIMD friendly and insensitive to the sparsity structure of the input matrix, thus overcoming the shortcomings of the row block methods.
Segmented sum (which is a special case of the backward segmented scan) performs a reduction sum operation for the entries in each segment in an array. A segment has its first entry flagged as TRUE and the other entries flagged as FALSE. Algorithm 2 lists a serial segmented sum algorithm. Vectorized parallel segmented sum algorithms can be found in [9, 15, 28] .
Algorithm 2 Serial segmented sum operation.
1: function segmented sum(*in, *flag) 2: length ← sizeof(*in) 3:
for i = 0 to length − 1 do 4:
if flag[i] = TRUE then 5:
j ← j + 1 9: end while 10:
end for 14: end function
In the SpMV operation, the segmented sum treats each matrix row as a segment and calculates a partial sum for the entry-wise products generated in each row. The SpMV operation using the segmented sum methods consists of four steps: (1) generating an auxiliary bit_flag array of size nnz from the row_ptr array. An entry in bit_flag is flagged as TRUE if its location matches the first nonzero entry of a row, otherwise it is flagged as FALSE, (2) calculating all intermediate entries (i.e., entry-wise products) to an array of size nnz, (3) executing the parallel segmented sum for the array, and (4) collecting all partial sums to the result vector y if a row is not empty. Algorithm 3 lists the pseudocode. Figure 2 illustrates an example using the matrix A plotted in Figure 1 .
Algorithm 3 Segmented sum method CSR-based SpMV.
1: malloc(*bit_flag, nnz) 2: memset(*bit_flag, FALSE) 3: for i = 0 to m − 1 in parallel do
Step 1 4:
bit_flag[row_ptr[i]] ← TRUE 5: end for 6: malloc(*product, nnz) 7: for j = 0 to nnz − 1 in parallel do
Step 2 8:
: end for 10: segmented sum(*product, *bit_flag)
Step 3 11: for k = 0 to m − 1 in parallel do
Step 4 We can see that once the heaviest workload, i.e., step 3, is parallelized through a fast segmented sum method described in [9, 15, 28] , nearly perfect load balance can be expected in all steps of Algorithm 3. However, in this context, the load balanced computation does not mean high performance. Figure 3 shows that the row block method in cuSPARSE v6.5 can significantly outperform the segmented sum method in cuDPP v2.2 [16, 28] , while doing SpMV on relatively regular matrices (see Table 2 for the used benchmark suite). Why is this the case? We can see that the step 1 is a scatter operation and the step 4 is a gather operation, both from the row space of size m. This prevents the two steps from fusing with the steps 2 and 3 in the nonzero entry space of size nnz. In this case, more global synchronizations and global memory accesses may degrade the overall performance. Previous research [4, 29] has found that the segmented sum may be more suitable for the COO (coordinate storage format) based SpMV, since the fully stored row index data can convert the steps 1 and 4 to the nonzero entry space: the bit_flag array can be generated by comparison of neighbor row indices, and the partial sums in the product array can be directly saved to y since their final locations are easily known from the row index array. Further, Yan et al. [34] and Tang et al. [30] reported that some variants of the COO format can also benefit from the segmented sum. However, it is well known that accessing row indices in the COO pattern brings higher off-chip memory pressure, which is just what the CSR format tries to avoid.
In the following, we will show that the CSR5-based SpMV can utilize both the segmented sum for load balance and the compressed row data for better load/store efficiency. In this way, the CSR5-based SpMV can obtain up to 4x speedup (see Figure 3 ) over the CSR-based SpMV using the segmented sum primitive [28] . 
THE CSR5 STORAGE FORMAT
Basic Data Layout
To achieve near-optimal load balance for matrices with any sparsity structures, we first evenly partition all nonzero entries to multiple 2D tiles of the same size. Thus when executing parallel SpMV operation, a compute core can consume one or more 2D tiles, and each SIMD lane of the core can deal with one column of a tile. Then the main skeleton of the CSR5 format is simply a group of 2D tiles. The CSR5 format has two tuning parameters: ω and σ, where ω is a tile's width and σ is its height. In fact, the CSR5 format only has these two tuning parameters.
Further, we need extra information to efficiently compute SpMV. For each tile, we introduce a tile pointer tile_ptr and a tile descriptor tile_desc. Meanwhile, the three arrays, i.e., row pointer row_ptr, column index col_idx and value val, of the classic CSR format are directly integrated. The only difference is that the col_idx data and the val data in each complete tile are in-place transposed (i.e., from row-major order to column-major order) for coalesced memory access from contiguous SIMD lanes. If the last entries of the matrix do not fill up a complete 2D tile (i.e., nnz mod (ωσ) = 0), they just remain unchanged and discard their tile_desc.
In Figure 4 , an example matrix A of size 8 × 8 with 34 nonzero entries is stored in the CSR5 format. When ω = 4 and σ = 4, the matrix is divided into three tiles including two complete tiles of size 16 and one incomplete tile of size 2. The arrays col_idx and val in the two complete tiles are stored in tile-level column-major order now. Moreover, only the first two tiles have tile_desc, since they are complete.
Auto-Tuned Parameters ω and σ
Because the computational power of the modern multicore or manycore processors is mainly from the SIMD units, we design an auto-tuning strategy for high SIMD utilization.
First, the tile width ω is set to the size of the SIMD execution unit of the used processor. Then an SIMD unit can consume a 2D tile in σ steps without any explicit synchronization, and the vector registers can be fully utilized. For the double precision SpMV, we always set ω = 4 for CPUs with 256-bit SIMD units, ω = 32 for the nVidia GPUs, ω = 64 for the AMD GPUs, and ω = 8 for Intel Xeon Phi with 512-bit SIMD units. Therefore, ω can be automatically decided once the processor type used is known.
The other parameter σ is decided by a slightly more complex process. For a given processor, we consider its on-chip memory strategy such as cache capacity and prefetching mechanism. If a 2D tile of size ω × σ can empirically bring better performance than using the other sizes, the σ is simply chosen. We found that the x86 processors fall into this category. For the double precision SpMV on CPUs and Xeon Phi, we always set σ to 16 and 12, respectively.
As for GPUs, the tile height σ further depends on the sparsity of the matrix. Note that the "sparsity" is not equal to "sparsity structure". We define "sparsity" to be the average number of nonzero entries per row (or nnz/row for short). In contrast, "sparsity structure" is much more complex because it includes 2D space layout of all nonzero entries.
On GPUs, we have several performance considerations on mapping the value nnz/row to σ. First, σ should be large enough to expose more thread-level local work and to amortize a basic cost of the segmented sum algorithm. Second, it should not be too large since a larger tile potentially generates more partial sums (i.e., entries to store to y), which bring higher pressure to last level cache write. Moreover, for the matrices with large nnz/row, σ may need to be small. The reason is that once the whole tile is located inside a matrix row (i.e., only one segment is in the tile), the segmented sum converts to a fast reduction sum.
Therefore, for the nnz/row to σ mapping on GPUs, we define three simple bounds: r, s and t. The first bound r is designed to prevent a too small σ. The second bound s is used for preventing a too large σ. But when nnz/row is further larger than the third bound t, σ is set to a small value u. Then we have
The three bounds, r, s and t, and the value u are hardwaredependent, meaning that for a given processor, they can be fixed for use. For example, to execute double precision SpMV on nVidia Maxwell GPUs and AMD GCN GPUs, we always set <r, s, t, u> = <4, 32, 256, 4> and <4, 7, 256, 4>, respectively. As for future processors with new architectures, we can obtain the four values through some simple benchmarks during initialization, and then use them for later runs. So the parameter σ can be decided once the very basic information of a matrix and a low-level hardware are known.
Therefore, we can see that the parameter tuning time becomes negligible because ω and σ are easily obtained. This can save a great deal of preprocessing time.
Tile Pointer Information
The added tile pointer information tile_ptr stores the row index of the first matrix row in each tile, indicating the starting position for storing its partial sums to the vector y. By introducing tile_ptr, each tile can find its own starting position, allowing tiles to execute in parallel. The size of the tile_ptr array is p + 1, where p = nnz/(ωσ) is the number of tiles in the matrix. For the example in Figure 4 , the first entry of Tile 1 is located in the 4th row of the matrix, and thus 4 is set as its tile pointer. To build the array, we binary search the index of the first nonzero entry of each tile on the row_ptr array. Lines 1-4 in Algorithm 4 show this procedure.
Recall that an empty row has exactly the same row pointer information as its first non-empty right neighbor row (see the second row in the matrix A in Figure 1 ). Thus for the nonempty rows with an empty left neighbor, we need a specific process (which is similar to lines 12-16 in Algorithm 3) to store their partial sums to correct positions in y. To recognize whether the specific process is required, we give a hint to the other parts (i.e., tile descriptor data) of the CSR5 format and the CSR5-based SpMV algorithm. Here we set an entry in tile_ptr to its negative value, if its corresponding tile includes any empty rows. Lines 5-12 in Algorithm 4 show this operation. If the first tile has any empty rows, we need to store a −0 (negative zero) for it. To record −0, here we use unsigned 32-or 64-bit integer as data type of the tile_ptr array. Therefore, we have 1 bit for explicitly storing the sign and 31 or 63 bits for an index. For example, in our design, tile pointer −0 is represented as a binary style '1000 ... 000', and tile pointer 0 is stored as '0000 ... 000'. To the best of our knowledge, the index of 31 or 63 bits is completely compatible to most numerical libraries such as Intel MKL. Moreover, reference implementation of the recent high performance conjugate gradient (HPCG) benchmark [14] also uses 32-bit signed integer for problem dimension no more than 2 31 and 64-bit signed integer for problem dimension larger than that. Thus it is safe to save 1 bit as the empty row hint and the other 31 or 63 bits as a 'real' row index.
Tile Descriptor Information
Only having the tile pointer is not enough for a fast SpMV operation. For each tile, we also need four extra hints: (1) bit_flag of size ω × σ, which indicates whether an entry is the first nonzero of a matrix row, (2) y_offset of size ω used to further let each column know where the starting point to store its local partial sums is, (3) seg_offset of size ω used to accelerate the local segmented sum inside a tile, and (4) empty_offset of unfixed size (but no longer than ω × σ) constructed to help the partial sums to find correct locations in y if the tile includes any empty rows. The tile descriptor tile_desc is defined to denote a combination of the above four groups of data.
Generating bit_flag is straightforward. The procedure is very similar to lines 3-5 in Algorithm 3. The main difference is that the bit flags are saved in column-major order, which matches the in-place transposed col_idx and val. Additionally, the first entry of each tile's bit_flag is set to TRUE for sealing the first segment from the top and letting 2D tiles to be independent from each other.
The array y_offset of size ω is used to help the columns in each tile knowing where the starting points to store their partial sums to y are. In other words, each column has one entry in the array y_offset as a starting point offset for all segments in the same column. We save a row index offset (i.e., relative row index) for each column in y_offset. Thus for the ith column in the tidth tile, by calculating tile_ptr[tid] + y_offset[i], the column knows where its own starting position in y is. Thus the columns can work in a high degree of parallelism without waiting for a synchronization. Generating y_offset is simple: each column counts the number of TRUEs in its previous columns' bit_flag array. Consider Tile 1 in Figure 4 as an example: because there are 3 TRUEs in the 1st column, the 2nd column's corresponding value in y_offset is 3. In addition, since there are in total 4 TRUEs in the 1st, 2nd and 3rd columns' bit_flag, Tile 1's y_offset[3]= 4. Algorithm 5 lists how to generate y_offset for a single 2D tile in an SIMD-friendly way.
Algorithm 5 Generating y_offset and seg_offset.
1: malloc(*tmp_bit, ω) 2: memset(*tmp_bit, FALSE) 3: for i = 0 to ω − 1 in parallel do 4:
end for 9:
seg_offset[i] ← 1− tmp_bit[i] 10: end for 11: exclusive prefix sum scan(*y_offset) 12: segmented sum(*seg_offset, *tmp_bit) 13: free(*tmp_bit)
The third array seg_offset of size ω is used for accelerating a local segmented sum in the workload of each tile. The local segmented sum is an essential step that synchronizes partial sums in a 2D tile (imagine multiple columns in the tile come from the same matrix row). In the previous segmented sum (or segmented scan) method [5, 9, 28, 15] , the local segmented sum is complex and not efficient enough. Thus we prepare seg_offset as an auxiliary array to facilitate implementation of segmented sum by way of the prefix-sum scan, which is a well optimized fundamental primitive for the SIMD units.
To generate seg_offset, we let each column search its right neighbor columns and count the number of contiguous columns without any TRUEs in their bit_flag. Using Tile 0 in Figure 4 as an example, its 2nd column has one and only one right neighbor column (the 3rd column) without any TRUEs in its bit_flag. Thus the 2nd column's seg_offset value is 1. In contrast, because the other three columns (the 1st, 3rd and 4th) do not have any 'all FALSE' right neighbors, their values in seg_offset is 0. Algorithm 5 shows how to generate seg_offset using an SIMD-friendly method. Algorithm 6 and Figure 5 show the fast segmented sum using seg_offset and an inclusive prefix-sum scan. The principle of this operation is that the prefix-sum scan is essentially an increment operation. Once a segment knows the distance (i.e., offset) between its head and its tail, its partial sum can be deduced from its prefix-sum scan results. Therefore, the more complex segmented sum operation in [5, 9, 28, 15] can be converted to a faster prefix-sum scan operation (line 5) and a few arithmetic operations (lines 6-8).
Algorithm 6 Fast segmented sum using seg_offset.
1: function fast segmented sum(*in, *seg_offset) 2: length ← sizeof(*in) 3:
malloc(*tmp, length) 4:
memcpy(*tmp, *in) 5:
inclusive prefix sum scan(*in) 6:
for i = 0 to length − 1 in parallel do 7:
in
free(*tmp) 10: end function The last array empty_offset occurs when and only when a 2D tile includes any empty rows (i.e., its tile pointer is negative). Because an empty row of a matrix has the same row pointer with its rightmost non-empty neighbor row (recall the second row in the matrix A in Figure 1 ), y_offset will record an incorrect offset for it. We correct for this by storing correct offsets for segments within a tile. Thus the length of empty_offset is the number of segments (i.e., the total number of TRUEs in bit_flag) in a tile. For example, Tile 0 in Figure 4 has 4 entries in its empty_offset since its bit_flag includes 4 TRUEs. Algorithm 7 lists the pseudocode that generates empty_offset for a tile that contains at least one empty row. 
Storage Details
To store the tile_desc arrays in a space-efficient way, we find upper bounds to the entries and utilize the bit-field pattern. First, since entries in y_offset store offset distances inside a 2D tile, they have an upper bound of ωσ. So log 2 (ωσ) bits are enough for each entry in y_offset. For example, when ω = 32 and σ = 16, 9 bits are enough for each entry. Second, since seg_offset includes offsets less than ω, log 2 (ω) bits are enough for an entry in this array. For example, when ω = 32, 5 bits are enough for each entry. Third, bit_flag stores σ 1-bit flags for each column of a 2D tile. When σ = 16, each column needs 16 bits. So 30 (i.e., 9 + 5 + 16) bits are enough for each column in the example. Therefore, for a tile, the three arrays can be stored in a compact bit-field composed of ω 32-bit unsigned integers. If the above example matrix has 32-bit integer row index and 64-bit double precision values, only around 2% extra space is required by the three newly added arrays. The size of empty_offset depends on the number of groups of contiguous empty rows, since we only record one offset for the rightmost non-empty row with any number of empty rows as its left neighbors.
The CSR5 for Other Matrix Operations
Since we in-place transposed the CSR arrays col_idx and val, a conversion from the CSR5 to the CSR is required for doing other sparse matrix operations using the CSR format. This conversion is simply removing tile_ptr and tile_desc and transposing col_idx and val back to row-major order. Thus the conversion can be very fast. Further, since the CSR5 is a superset of the CSR, any entry accesses or slight changes can be done directly in the CSR5 format, without any need to convert it to the CSR format. Additionally, some applications such as finite element methods can directly assemble sparse matrices in the CSR5 format from data sources.
are independent of each other as well. So we assign a thread on GPU cores or an SIMD lane on x86 cores to each column in a tile.
While running the CSR5-based SpMV, each column in a tile can extract information from bit_flag and label the segments in its local data to three colors: (1) red means a sub-segment unsealed from its top, (2) green means a completely sealed segment existed in the middle, and (3) blue means a sub-segment unsealed from its bottom. There is an exception that if a column is unsealed both from its top and from its bottom, it is colored to red.
Algorithm 8 shows the pseudocode of the CSR5-based SpMV algorithm. Figure 6 plots an example of this procedure. We can see that the green segments can directly save their partial sums to y without any synchronization, since the indices can be calculated by using tile_ptr and y_offset. In contrast, the red and the blue sub-segments have to further add their partial sums together, since they are not complete segments. For example, the sub-segments B2, R2 and R3 in Figure 6 have contributions to the same row, thus an addition is required. This addition operation needs the fast segmented sum shown in Algorithm 6 and Figure 5 . Furthermore, if a tile has any empty rows, the empty_offset array is accessed to get correct global indices in y. Figure 6 : The CSR5-based SpMV in a tile. Partial sums of the green segments are directly stored to y. The red and the blue sub-segments require an extra segmented sum before issuing off-chip write.
Consider the synchronization among the tiles, since the same matrix row can be influenced by multiple 2D tiles running concurrently, the first and the last segments of a tile need to store to y by atomic add (or a global auxiliary array used in device-level reduction, scan or segmented scan [15, 28] ). In Figure 6 , the atomic add operations are highlighted by arrow lines with plus signs.
For the last entries not in a complete tile (e.g., the last two nonzero entries of the matrix in Figure 4 ), we execute a conventional CSR-vector method after all of the complete 2D tiles have been consumed. Note that even though the last tile (i.e., the incomplete one) does not have tile_desc arrays, it can extract a starting position from tile_ptr.
In Algorithm 8, we can see that the main computation (lines 5-21) only contains very basic arithmetic and logic operations that can be easily programmed on all mainstream processors with SIMD units. As the most complex part in our algorithm, the fast segmented sum operation (line 22) only requires a prefix-sum scan, which has been well-studied and can be efficiently implemented by using CUDA, OpenCL or x86 SIMD intrinsics.
EXPERIMENTAL RESULTS
Experimental Setup
We evaluate the CSR5-based SpMV and 11 state-of-theart formats and algorithms on four mainstream platforms: dual-socket Intel CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi. The platforms and participating approaches are shown in Table 1 .
Host of the two GPUs is a machine with AMD A10-7850K APU, dual-channel DDR3-1600 memory and 64-bit Ubuntu Linux v14.04 installed. Host of the Xeon Phi is a machine with Intel Xeon E5-2680 v2 CPU, quad-channel DDR3-1600 memory and 64-bit Red Hat Enterprise Linux v6.5 installed. The two GPU platforms use the g++ compiler v4.8.2. The two Intel machines always set the Intel C/C++ complier 15.0.1 as default.
Here we evaluate double precision SpMV. So cuDPP library [16, 28] , clSpMV [29] and yaSpMV [34] are not included since they only support single precision floating point as data type. Two recently published methods [20, 30] are not tested since the source code is not available to us yet.
We use OpenCL profiling scheme for timing SpMV on the AMD platform and record wall-clock time on the other three platforms. For all participating formats and algorithms, we evaluate SpMV 10 times (each time contains 1000 runs and records the average) and report the best observed result.
The testbeds
The participating formats and algorithms Dual-socket Intel Xeon E5-2667 v3 (Haswell, 2×8 cores @ 3.2 GHz, 1.64 SP TFlops, 819.2 DP GFlops, 64 GB GDDR4, ECC-on, 2×68.3 GB/s bandwidth).
(1) The CSR-based SpMV in Intel MKL 11.2 Update 1.
(2) BiCSB v1.2 using CSB [7] with bitmasked register block [6] .
(3) pOSKI v1.0.0 [8] using OSKI v1.0.1h [31, 32] kernels. (4) The CSR5-based SpMV implemented by using OpenMP and AVX2 intrinsics.
An nVidia GeForce GTX 980 (Maxwell GM204, 2048 CUDA cores @ 1.13 GHz, 4.61 SP TFlops, 144.1 DP GFlops, 4 GB GDDR5, 224 GB/s bandwidth, driver v344. 16 ).
(1) The best CSR-based SpMV [4] from cuSPARSE v6.5 and CUSP v0.4.0 [11] .
(2) The best HYB [4] from the above two libraries.
(3) BRC [2] with texture cache enabled. (4) ACSR [1] with texture cache enabled. (5) The CSR5-based SpMV implemented by using CUDA v6.5. An AMD Radeon R9 290X (GCN Hawaii, 2816 Radeon cores @ 1.05 GHz, 5.91 SP TFlops, 739.2 DP GFlops, 4 GB GDDR5, 345.6 GB/s bandwidth, driver v14.41).
(1) The CSR-vector method [4] extracted from CUSP v0.4.0 [11] .
(2) The CSR-Adaptive algorithm [17] implemented in ViennaCL v1.6.2 [26] .
(3) The CSR5-based SpMV implemented by using OpenCL v1.2.
An Intel Xeon Phi 5110p (Knights Corner, 60 x86 cores @ 1.05 GHz, 2.02 SP TFlops, 1.01 DP TFlops, 8 GB GDDR5, ECC-on, 320 GB/s bandwidth, driver v3.4-1, µOS v2.6.38.8).
(2) The ESB [24] with dynamic scheduling enabled.
(3) The CSR5-based SpMV implemented by using OpenMP and MIC-KNC intrinsics. Table 1 : The testbeds and participating formats and algorithms.
Benchmark Suite
In Table 2 , we list 24 sparse matrices as our benchmark suite for all platforms. The first 20 matrices have been widely adopted in previous SpMV research [2, 4, 17, 24, 29, 33, 34] . The other 4 matrices are chosen since they have more diverse sparsity structures. All matrices except Dense are downloadable at the University of Florida Sparse Matrix Collection [12] .
To achieve a high degree of differentiation, we categorize the 24 matrices in Table 2 into two groups: (1) regular group with the upper 14 matrices, (2) irregular group with the lower 10 matrices. This classification is mainly based on the minimum, average and maximum lengths of the rows. Matrix dc2 is a representative of the group of irregular matrices. Its longest single row contains 114K nonzero entries, i.e., 15% nonzero entries of the whole matrix with 117K rows. This sparsity pattern challenges the design of efficient storage format and SpMV algorithm. Figure 7 shows double precision SpMV performance of the 14 regular matrices on the four platforms. We can see that, on average, all participating algorithms deliver comparable performance. On the CPU platform, Intel MKL obtains the best performance on average and the other 3 methods behave similar. On the nVidia GPU, the CSR5 delivers the highest throughput. The ACSR format is slower than the others, because its binning strategy leads to non-coalesced memory access. On the AMD GPU, the CSR5 achieves the best performance. Although the dynamic assigning in the CSR-Adaptive method can obtain better scalability than the CSR-vector method, it still cannot achieve near perfect load balance. On the Xeon Phi, the CSR5 is slower than Intel MKL and the ESB format. The main reason is that the current generation of Xeon Phi can only issue up to 4 relatively slow threads per core (i.e., up to 4×60 threads in total on the used device), and thus the latency of gathering entries from vector x becomes the main bottleneck. Then reordering or partitioning nonzero entries based on the column index for better cache locality behaves well in the ESB-based SpMV. However, in Section 5.6 we will show that this strategy leads to very high preprocessing cost. Figure 8 shows double precision SpMV performance of the 10 irregular matrices. We can see that the irregularity can dramatically impact SpMV throughput of some approaches. On the CPU platform, the row block method based Intel MKL is now slower than the other methods. The CSR5 outperforms the others because of better SIMD efficiency from the AVX2 intrinsics. On the nVidia GPU, the CSR5 brings the best performance because of the near perfect load balance. The other two irregularity-oriented formats, HYB and ACSR, behave well but still suffer from imbalanced work decomposition. Note that the ACSR format is based on Dynamic Parallelism, a technical feature only available on recently released nVidia GPUs. On the AMD GPU, the CSR5 greatly outperforms the other two algorithms using the row block methods. Because the minimum work unit of the CSR-Adaptive method is one row, the method delivers degraded Table 3 : Preprocessing cost and its impact on the iteration-based scenarios.
Isolated SpMV Performance
Legend (i1) Webbase (i2) LP (i3) Circuit5M (i4) eu-2005 (i5) in-2004 (i6) mip1 (i7) ASIC 680k (i8) dc2 (i9) FullChip (i10) ins2 (i1-i10) Harmonic mean
Effects of Auto-Tuning
In section 3.2, we discussed a simple auto-tuning scheme for the parameter σ on GPUs. Figure 9 shows its effects (the x axis is the matrix id s). We can see that compared to the best performance chosen from a range of σ = 4 to 48, the auto-tuned σ does not have obvious performance loss. On the nVidia GPU, the performance loss is on average -4.2%. On the AMD GPU, the value is on average -2.5%. 
Format Conversion Cost
The format conversion from the CSR to the CSR5 includes four steps: (1) memory allocation, (2) generating tile_ptr, (3) generating tile_desc, and (4) transposition of col_idx and val arrays. Figure 10 shows the cost of the four steps for the 24 matrices (the x axis is the matrix id s) on the four used platforms. Cost of one single SpMV operation is used for normalizing format conversion cost on each platform. We can see that the conversion cost can be on average as low as the overhead of a few SpMV operations on the two GPUs. On the two x86 platforms, the conversion time is longer (up to cost of around 10-20 SpMV operations). The reason is that the conversion code is manually SIMDized using CUDA or OpenCL on GPUs, but only auto-parallelized by OpenMP on x86 processors. 
Iteration-Based Scenarios
Since both the preprocessing (i.e., format conversion from a basic format) time and the SpMV time are important for real-world applications, we have designed an iteration-based benchmark. This benchmark measures the overall performance of a solver with n iterations. We assume the input matrix is already stored in the CSR format. So the overall cost of using the CSR format for the scenarios is Table 3 shows the new formats' preprocessing cost (i.e., T new pre /T new spmv ) and their speedups over the CSR format in the iteration-based scenarios when n = 50 and n = 500. The emboldened font in the table shows the highest positive speedups on each platform. The compared baseline is the fastest CSR-based SpMV implementation (i.e., Intel MKL, nVidia cuSPARSE/CUSP, CSR-vector from CUSP, and Intel MKL, respectively) on each platform. We can see that because of the very low preprocessing overhead, the CSR5 can further outperform the previous methods when doing 50 iterations and 500 iterations. Although two GPU methods, the ACSR format and the CSR-Adaptive approach, in general have shorter preprocessing time, they suffer from lower SpMV performance and thus cannot obtain the best speedups. On all platforms, the CSR5 always achieves the highest overall speedups. Moreover, the CSR5 is the only format that obtains higher performance than the CSR format when only 50 iterations are required.
RELATED WORK
A great deal of work has been published on accelerating the SpMV operation. The block-based sparse matrix construction has received most attention [2, 6, 7, 10, 25, 31, 34] because of two main reasons: (1) sparse matrices generated by some real-world problems (e.g., finite element discretization) naturally have the block sub-structures, and (2) off-chip load operations may be decreased by using the block indices instead of the entry indices. However, for many matrices that do not exhibit a natural block structure, trying to extract the block information is time consuming and has limited effects.
On the other hand, the hybrid formats [4, 29] , such as HYB, have been designed for irregular matrices. However, higher kernel launch overhead and invalidated cache among kernel launches tend to decrease their overall performance. Moreover, it is hard to guarantee that every sub-matrix can saturate the whole device. In addition, some relatively simple operations such as solving triangular systems become complex while the input matrix is stored in two or more separate parts.
The recent row block methods showed good performance either for regular matrices [17] or for irregular matrices [1] , but not for both. In contrast, the CSR5 can deliver higher throughput both for regular matrices and for irregular matrices.
The segmented sum methods have been used in two recently published papers [30, 34] for the SpMV on either GPUs or Xeon Phi. However, both of them need to store the matrix in COO-like formats to utilize the segmented sum. In contrast, the CSR5 format saves useful row index information in a compact way, and thus can be more efficient both for the format conversion and for the SpMV operation.
Sedaghati et al. [27] constructed machine learning classifiers for automatic selection of the best format for a given sparse matrix on a target GPU. The CSR5 format described in this work can further simplify such a selection process because it is insensitive to the sparsity structure of the input sparse matrix.
Moreover, to the best of our knowledge, the CSR5 is the only format that supports high throughput cross-platform SpMV on CPUs, nVidia GPUs, AMD GPUs and Xeon Phi at the same time. This advantage may simplify the development of scientific software for processors with massive onchip parallelism.
CONCLUSIONS
In this paper, we proposed the CSR5 format for efficient cross-platform SpMV on CPUs, GPUs and Xeon Phi. The format conversion from the CSR to the CSR5 was very fast because of the format's insensitivity to sparsity structure of the input matrix. The CSR5-based SpMV was implemented by a redesigned segmented sum algorithm with higher SIMD utilization compared to the classic methods. The experimental results showed that the CSR5 delivered high throughput both in the isolated SpMV tests and in the iteration-based scenarios.
ACKNOWLEDGMENTS
