Abstract. The Recursive Sparse Blocks (RSB ) is a sparse matrix layout designed for coarse grained parallelism and reduced cache misses when operating with matrices, which are larger than a computer's cache. By laying out the matrix in sparse, non overlapping blocks, we allow for the shared memory parallel execution of transposed SParse Matrix-Vector multiply (SpMV ), with higher efficiency than the traditional Compressed Sparse Rows (CSR) format. In this note we cover two issues. First, we propose two improvements to our original approach. Second, we look at the performance of standard and transposed shared memory parallel SpMV for unsymmetric matrices, using the proposed approach. We find that our implementation's performance is competitive with that of both the highly optimized, proprietary Intel MKL Sparse BLAS library's CSR routines, and the Compressed Sparse Blocks (CSB) research prototype.
Introduction and Related Work
Many scientific/computational problems require the solution of systems of partial differential equations (PDEs). Often, discretization of these problems result in sparse matrices. A common approach for the solution of sparse linear systems is through the use of iterative methods, whose computational core requires sparse matrix-vector multiplication. In this document, we focus on the efficient implementation of sparse matrix-vector multiplication, on cache based, shared memory computers. In this context, we have recently proposed a sparse matrix format, called Recursive Sparse Blocks (RSB ) [3, 4] . The central idea of RSB is a recursive partitioning-based organization of matrices, with either Compressed Sparse Rows (CSR) or Coordinate (COO) format leaves of a quad-tree structure over matrices. In this paper, we present some optimizations to our RSB-based SpMV implementation, and compare performance of the modified approach to that of the Intel's MKL proprietary Sparse BLAS implementation, and the publicly available CSB (see [1] ) prototype. To this end, we briefly recall the way that the SpMV /SpMV T computational kernels work and behave on computers of our interest in § 2. Next, we introduce the proposed optimizations in § 3. Finally in § 5, we discuss the efficiency of our prototype, by comparing it to the mentioned highly efficient MKL's CSR and CSB implementations. 
SpMV and Transposed SpMV
We define the sparse matrix-vector multiply (SpMV ) operation as "y ← A x" and its transposed version (SpMV T ) as "y ← A T x" (where A is a sparse matrix, while x, y are vectors). With RSB, A is recursively partitioned into submatrices, and then the individual N leaf submatrices s 1 : s N are represented in either COO or CSR format (eventually using 16 bits for the local indices); for details, see [3, 4] . Clearly, while in the untransposed case the requirement for avoiding race conditions is on the rows interval, in the transposed case the columns intervals of the participating submatrices shall be disjoint. Our basic shared memory parallel algorithm for RSB/SpMV is outlined in Fig. 5 (see also [4] ); in the listing, at line 8, assume for the time being s.r h = 0, s.r t = 0. Workload is partitioned among threads by means of a parallel section (lines 5-15). Repeatedly, each participating thread picks up a submatrix and updates the y array with its contribution to the product. When picking up a submatrix, a thread locks y array's interval corresponding to the submatrix rows interval. The same listing is suitable for SpMV T, after a slight modification; namely, the pairwise exchange in all occurrences of: s.roff and s.coff, s.rows and s.cols. The bulk of computation is executed at the leaf level, when either COO or CSR submatrices are multiplied by the corresponding subvector. See Fig. 3 for the CSR submatrices code (again, assume s.r h = 0, s.r t = 0), and Fig. 1 for the COO version of the SpMV. For the SpMV T, see listings Fig. 4 and Fig. 2 . We use the most common variant of CSR storing rows in ascending order, and column indices in ascending order within each row. The COO submatrices of RSB are organized exactly in the same way. A consequence of this layout ordered by rows is that for most real world matrices, for each given nonzero coefficient a i,j , it is likely that the next stored nonzero a i,j is quite near, i.e. with Δ = j − j reasonably small. If Δ < C l /N s , with C l the cache line length, and N s the floating point number size, both expressed in bytes, then after computing the contribution y i ← y i + a ij x j (line 4 in Fig. 1 , line 4 in Fig. 3 ), loading of element x j will, with very high probability, only require a single fetch from cache memory, with no further cache misses. Normally both CSR and COO SpMV algorithms are written such that the compiler uses registers for the accumulation of the y i contribution, while updating the y i memory location no more than once per submatrix. In the case of CSR and listing Fig. 3 , this is straightforward to achieve -it requires referencing a local variable instead y i in the inner loop, and update of the y i location right after the inner loop. In the case of COO listing, Fig. 1 , one should reorganize in two loops: an outer one cycling on rows, and an inner one cycling on a single row nonzeroes. If the number of nonzeroes of a COO submatrix is likely to be less than the number of rows, as is the case for our leaf matrices because of the criteria for COO in [3] , and discussion of hypersparsity in [2] , we clearly have a problem, since such a double loop would perform O(s.nnz + s.rows) control instructions, which is always more than O(s.nnz) as in Fig. 1 . In the case of CSR submatrices, we are constrained to a O(s.nnz + s.rows) loop complexity by the nature of CSR. However, here we have a guarantee that the number of nonzeroes exceeds the number of rows (by the definition of RSB leaves-see [3] ), so the double loop is not a concern.
3 Two Optimizations to RSB SpMV /SpMV T In this section we introduce two closely related modifications to our SpMV /SpMV T algorithms for RSB. First, we note that our implementations for CSR (Fig. 3, Fig. 4 Fig. 3 and Fig. 4 to work on the non-empty interval of rows only, and simply skipping iterations on the empty ones. For a square submatrix s having r (s) def = s.r h + s.r t , the use of this row skipping technique allows reducing the amount of indices read up to 50% (consider a square s with s.nnz = s.rows+1 nonzeroes distributed on two rows) with 4 byte indices and up to 66% with 2 bytes indices. In a more common situation, say r (s) = s.rows/ν s (for some ν s ), one would save s.rows/ν s accesses out of s.rows(κ s + 1) + 1. For s.rows 1, the saved fraction is 1 κsνs ; for realistic cases, e.g.: κ s = 2, ν s = 2, this amounts to 25%, which is not bad. A good property of this optimization is that in the case of no empty rows, there is no runtime performance loss compared to the base implementation. We also observe that this optimization is valid for both SpMV and SpMV T. The second optimization also relies on the computation of s.r h /s.r t , but is applied to the outer parallel algorithm shown in Fig. 5 . The base version of this algorithm considers to be zero both s.r h , s.r t , on all submatrices; thus when a submatrix is picked up by a thread, the entire s.roff...s.roff + s.rows interval of the output vector y may have to be updated, and therefore is locked. But if a submatrix s has s.r h empty heading rows, and s.r t empty tail rows, then rows outside the s.roff + s.r h ...s.roff + s.rows − s.r t range will not be modified; therefore we observe that only the corresponding subvector of y must be really locked. We apply our optimization by locking only the effective rows interval, thus allowing for a reduced degree of resource contention among threads and enhancing potential parallelism. The worst case is when s.r h = 0 and s.r t = 0, and this is no worse than (indeed, identical to) without the optimization. The best case may be when N s submatrices extending on the same rows range exists, but 
Experimental Setup and Methodology
For space reasons we report only a limited set of experimental data. We chose to use a sample of large (exceeding hardware cache), sparse square matrices obtained from the University of Florida Sparse Matrix Collection (see [5] ). Matrices information is summarized in Table 1 . We report results of experiments performed on an Intel Xeon 5670, supporting up to 12 hardware threads, with 3 levels of cache memory (sized respectively 32KB/256KB/12MB). Our codes were implemented in C99, and compiled with Intel's ICC v.12.0.2 compiler, with the optimization -O3 flag (no machine specific optimization flags were used). Our parallel RSB implementation using OpenMP is compared against the CSR implementation present in the proprietary Intel MKL 10.3-2 library, and the publicly available CSB (see Buluç et al. [1] ) prototype (compiled with the special purpose CILK++ compiler, version 8503, with -O3 -fno-rtti -fno-exceptions flags). Here we also use explicit loop unrolling (four-fold) on the COO and inner CSR loops of our codes. We express the performance of a computation in MFLOPS (that is, time efficiency); we count 2 operations for each nonzero of a matrix involved in the multiplication by a vector.
Results
Let us now discuss the experimental results. In Fig. 6 we report results obtained using 12 threads; in Fig. 7 results for a single thread. The first thing we note, is that the MKL (CSR) results for SpMV T are consistently lower than for SpMV. This performance gap is due to the row-major layout of CSR, which requires the transposed update of the results array to be written at random locations (unlike the normal update, which reads random locations of the multiplicand vector, but updates a sequentially accessed array). This gap is almost absent in the case of CSB: recall (see [1] ) the unbiased Z-ordering in its sparse blocks. Regarding the SpMV /SpMV T gap, RSB falls in between MKL and CSB, due to its storage of consecutive rows of sparse submatrices. We notice that running in parallel (Fig. 6) , the aforementioned efficiency gap is much more pronounced for CSR. The reason for this is the lack, in the CSR format, of immediate information for the serialization of the threads write instructions in updating the result vector. CSB and RSB are structured in blocks, offering a coarse grained way to parallelization of SpMV T. Summarizing, for the chosen set of matrices, RSB performs always better than CSB and MKL with a single thread, and almost always, in parallel runs. In interpreting performance results for the three formats (especially when running in parallel, when the CPU-memory communications channels are likely to be saturated), we may consider the bytes per indexing nonzero metric. Since RSB's index usage depends on the subdivisions/threads count, in Fig. 8 we report this value for both single and 12 threaded runs. For CSB, we report the size of the arrays allocated in the source code, although we do not take into account here the block pointers array (see [1] ), which also contributes as index-related memory traffic. It is straightforward to see that usually, RSB's higher performance cases in Fig. 6 coincide with the shortest index usage cases, and vice-versa. We also note that the relative performance of CSB/MKL seems related to the average indexing usage.
Concluding Remarks
In this paper, we have proposed two simple optimizations to our RSB algorithms for SpMV /SpMV T, and performed experiments on large sparse matrices.
Since the two optimizations have no potential negative impact, and since their benefit is difficult to quantify in advance, we skipped the comparison to the past code versions, and compared the code directly to two different, efficient SpMV implementations: Intel MKL's CSR and CSB (see Buluç et al. [1] ). Our main finding is that the block structure of RSB allows the parallel implementations of both SpMV /SpMV T to be efficient without the prominent performance gap which is inherent in a CSR implementation (in this case, MKL's). We also find confirmation that RSB's hybrid structure (a recursive layout on outside, with a row-major layout on the inside) is advantageous when performing SpMV /SpMV T on large matrices serially. Furthermore, the technique of enhancing RSB's parallelism by using empty rows information applies to triangular solve and symmetric SpMV kernels as well.
We wish to thank Pawe l Gepner and Jamie Wilcox at Intel Corporation for giving us access and technical support for the machine used in the experiments.
