Large-scale scientific applications frequently compute sparse matrix-vector products in their computational core. For this reason, techniques for computing sparse matrix-vector products efficiently on modern architectures are important. In this paper we describe a strategy for improving the performance of sparse matrix-vector product computations using a loop transformation known as unroll-and-jam. We describe a novel sparse matrix representation that enables us to apply this transformation. Our approach is best suited for sparse matrices whose rows have a small number of predictable lengths. This work was motivated by sparse matrices that arise in SAGE, an application from Los Alamos National Laboratory. We evaluate the performance benefits of our approach using sparse matrices produced by SAGE for a pair of sample inputs. We show that our strategy is effective for improving sparse matrix-vector product performance using these matrices on MIPS R12000, Alpha Ev67, IBM Power3, and Itanium 2 processors. Our measurements show that for this class of sparse matrices, our strategy improves sparse matrix-vector product performance by amounts ranging from 41% on the MIPS R12000 to well over a factor of 2 on the Itanium 2.
1 Introduction
Sparse matrix-vector product computations are at the heart of many modern scientific applications; however, they often perform poorly on modern computer systems. Poor performance results from several factors. First, unlike dense matrices, sparse matrices require an explicit representation of the coordinates of non-zero elements. Manipulating this coordinate representation when computing a sparse matrix-vector product consumes memory bandwidth, which is a scarce commodity in computer systems based on modern microprocessors. Second, sparse matrix computations have less temporal reuse of values than dense matrix computations; this makes it hard to use registers and caches effectively. Third, the loop structure of a matrix-vector product computation using a sparse matrix is significantly less regular than one using a dense matrix; this leads to less efficient compiler-generated code. Im and Yelick (2001) report a factor of 2 performance penalty for computing a matrix-vector product on dense matrices using a sparse matrix representation rather than a dense matrix representation.
The inefficiency of computing sparse matrix-vector products has prompted research into strategies for improving their performance. Techniques that have been explored include a variety of sparse matrix representations, matrix reordering, the use of multiple representations in concert, cache and register blocking, and combinations of these approaches. In Section 5 we present a summary of these strategies.
SAIC's Adaptive Grid Eulerian hydrodynamics code (SAGE) is a computationally intensive, multi-dimensional, multi-material code that employs cell-by-cell adaptive refinement on a hexahedral mesh of cells (Kerbyson et al., 2001). In working with this code, we found that a significant fraction of its execution time was spent computing sparse matrix-vector products. Furthermore, we found that SAGE's adaptive mesh refinement scheme yields sparse matrices with properties that make it difficult to multiply them efficiently with a vector on modern microprocessors. SAGE's sparse matrices typically have only a few elements per row and contain no patterns of local density, even after row and column reordering. For this reason, neither register nor cache blocking algorithms were effective for improving sparse matrix-vector product performance with SAGE's sparse matrices.
In this paper, we explore a new strategy for improving the performance of sparse matrix-vector product computations for a class of matrices that includes those used by SAGE. We describe a new sparse matrix organization that enables us to optimize sparse matrix-vector product computations by using a loop transformation known as unroll-and-jam (Allen and Cocke, 1972). We found that this data structure and computation reorganization improves performance by amounts ranging from 41% to well over a factor of 2 on the architectures we studied.
In the next section we briefly describe SAGE, its representation for sparse matrices and the performance results that motivated our investigation of this problem. In Section 3 we describe our data and computation transformation for accelerating sparse matrix-vector product computations for SAGE on modern microprocessor-based systems. In Section 4 we present an experimental evaluation of our approach. In Section 5 we describe related work, and in Section 6 we summarize our findings and conclusions.
2 Background

2.1 SAGE
SAGE is a parallel, multi-dimensional, multi-material, Eulerian hydrodynamics code that employs cell-by-cell adaptive refinement of a hexahedral mesh of cells (Kerbyson et al., 2001). SAGE was developed by Scientific Applications International Corporation (SAIC) and Los Alamos National Laboratory (LANL) as part of the Department of Energy's Accelerated Strategic Computing Initiative (ASCI). In LANL's Crestone project, SAGE is being used to explore the utility of continuous adaptive Eulerian techniques for simulation of the nation's nuclear stockpile. SAGE represents a large class of ASCI applications at LANL that run for months at a time on 2000-4000 processors (Kerbyson et al., 2001). In addition to stockpile stewardship, SAGE is used for simulating a variety of phenomena that arise in scientific and engineering problem domains.
SAGE's computational core consists of a conjugate gradient solver that requires each processor to compute a sparse matrix-vector product in each iteration. Each time the solver is invoked, a sparse matrix representation on each node is built from data structures that represent the arrangement of mesh cells. The cost of building the sparse matrix representation on each processor is insignificant with respect to the amount of time spent performing sparse matrix-vector multiplication in the solver. On the problems of interest at LANL, the solver typically runs hundreds to thousands of iterations before its convergence criteria are satisfied. To guide our investigation of techniques for improving the node performance of sparse matrix-vector product computations that arise in SAGE, we studied sparse matrices formed by SAGE in single-processor executions of two test problems. We used the timing_c and timing_b test problems provided to us by scientists at LANL as the basis for our study. Both inputs trigger three-dimensional simulations of two materials using adaptation and heat conduction. The timing_c input triggers a simulation in a long, skinny domain and yields a symmetric sparse matrix with 72,000 rows containing 467,568 non-zero elements. The timing_b input triggers a smaller simulation in a cubical domain and yields a symmetric sparse matrix with 38,464 rows containing 246,634 non-zero elements.
2.2 COMPRESSED SPARSE ROW FORMAT
SAGE uses a Compressed Sparse Row (CSR) representation for sparse matrices. Figure 1 illustrates the organization of a conventional CSR matrix representation. The matrix is represented by three vectors: cols, which contains the column indices of all non-zeroes in the matrix (ordered by row); nz, which contains the non-zero value corresponding to each entry in cols; and rowstart, which contains indices indicating the location in cols and nz of the first element of each row. On microprocessor-based systems, CSR format is commonly used for sparse matrix-vector product computations; during the computation, a register can be used to accumulate the result of each dot product of a row's non-zero entries with the vector.

Figure 2 shows a canonical procedure for performing sparse matrix-vector multiplication using a matrix in CSR format. The first argument to the procedure csr_mvmult specifies the number of matrix rows. The next three arguments specify the sparse matrix data structure components rowstart, cols, and nz shown in Figure 1. The final two arguments specify an input vector x and a result vector result. The outer loop in csr_mvmult computes the length of the current row as the difference between the start of the next row and the start of the current row. The inner loop accumulates a dot product of the non-zeroes in the row and their corresponding entries in the vector x. In this procedure, the rowstart, cols, nz, and result vectors are accessed in sequential order with spatial but not temporal reuse. The values of x are accessed in the order dictated by the sequence of values in cols; this order is potentially arbitrary and may result in spatial and/or temporal reuse of values in x.
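As a concrete illustration, the following is a minimal Fortran sketch of the canonical procedure (Figure 2 itself is not reproduced here); it assumes 1-based indexing and a rowstart vector with nrows + 1 entries whose last entry points one past the final non-zero:

subroutine csr_mvmult(nrows, rowstart, cols, nz, x, result)
  integer, intent(in)  :: nrows
  integer, intent(in)  :: rowstart(nrows+1)   ! start of each row; rowstart(nrows+1) = nnz + 1
  integer, intent(in)  :: cols(*)             ! column index of each non-zero, row by row
  real(8), intent(in)  :: nz(*)               ! non-zero values, parallel to cols
  real(8), intent(in)  :: x(*)                ! input vector
  real(8), intent(out) :: result(nrows)       ! output vector
  integer :: i, j
  real(8) :: acc

  do i = 1, nrows
     acc = 0.0d0
     ! Dot product of row i's non-zeroes with the corresponding entries of x.
     do j = rowstart(i), rowstart(i+1) - 1
        acc = acc + nz(j) * x(cols(j))
     end do
     result(i) = acc
  end do
end subroutine csr_mvmult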
2.3 MOTIVATION
A preliminary investigation of the performance of sparse matrix-vector product computations in SAGE found them to be quite slow, but suggested several promising directions for further investigation.
Performance measurements of the CSR-based sparse matrix-vector product code shown in Figure 2 on several hardware platforms showed that it was less efficient than expected. An analysis of the assembly code generated by SGI's MIPSPro Fortran 90 compiler (version 7.3.1.3m, -O3 optimization) for this CSR-based sparse matrix-vector product code showed that, while memory references issue at 100% of the peak rate in the inner loop, in the outer loop they issue at only 41% of the peak rate. Using SAGE's timing_c sparse matrix, we found that on average a floating-point operation completes only every 7.2 cycles with this code on a MIPS processor. On an Itanium, performance is even worse, with a floating-point operation completing only every 13.6 cycles.
We define a row's length to be the number of non-zero values it contains. In an effort to understand the relative contribution of the outer loop to overall performance, we generated histograms of the matrix row lengths for the SAGE timing_c and timing_b test cases. Figure 3 shows the prevalence of various row lengths in both matrices. These measurements show two things. First, row lengths in SAGE's sparse matrices are very small; the average row length is about 6.5, and no row has more than 16 non-zeroes. This means that a significant fraction of the time in the sparse matrix-vector multiplication routine will be spent in the outer loop or in fill and drain code surrounding a software-pipelined inner loop. Second, a few row lengths dominate: rows of length 1, 6, 7, or 10 make up over 97% of the rows in both matrices. In the next section, we describe an approach that exploits the structure of these matrices to boost performance.
3 Approach
Besides the problem of short rows, the performance of the CSR-based sparse matrix-vector product code shown in Figure 2 suffers from a more fundamental problem: each multiply-add in the inner loop depends on the value computed by the previous iteration. Since floating-point operations on modern microprocessors are multi-stage pipelined operations, this dependence between iterations prevents the floating-point pipeline from filling, even with unrolling. To improve pipeline utilization in dense matrix codes that have a recurrence in the inner loop, the unroll-and-jam transformation (Allen and Cocke, 1972) is often used.
Applying unroll-and-jam, also known as loop jamming, to a loop nest involves unrolling an outer loop and fusing the multiple inner loop bodies that result. This transformation has three potential benefits: (1) it increases the size of inner loop bodies to aid in optimization and reduce loop overhead; (2) it can increase floating-point pipeline utilization by interleaving the computation of multiple independent recurrences (Callahan et al., 1988); and (3) in dense matrix computations, it can increase close temporal reuse, some of which can then be converted into register reuse by applying scalar replacement (Callahan et al., 1990; Carr and Kennedy, 1994).
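To illustrate the transformation, consider a dense matrix-vector product y = Ax with the same inner-loop recurrence; the following Fortran sketch (our schematic example, not code from the paper) unrolls the outer loop by 2 and fuses the resulting inner loops:

subroutine dense_mv_uj2(n, a, x, y)
  integer, intent(in)  :: n
  real(8), intent(in)  :: a(n,n), x(n)
  real(8), intent(out) :: y(n)
  integer :: i, j
  real(8) :: acc0, acc1

  ! Outer loop unrolled by 2; the fused inner loop carries two
  ! independent recurrences (acc0 and acc1), so their multiply-adds
  ! can overlap in the floating-point pipeline.
  do i = 1, n - mod(n, 2), 2
     acc0 = 0.0d0
     acc1 = 0.0d0
     do j = 1, n
        acc0 = acc0 + a(i,   j) * x(j)
        acc1 = acc1 + a(i+1, j) * x(j)
     end do
     y(i)   = acc0
     y(i+1) = acc1
  end do
  ! Cleanup code for an odd trailing row.
  if (mod(n, 2) == 1) then
     acc0 = 0.0d0
     do j = 1, n
        acc0 = acc0 + a(n, j) * x(j)
     end do
     y(n) = acc0
  end if
end subroutine dense_mv_uj2

Note that the fusion is legal here only because both inner loops have identical bounds; this is exactly the property that ordinary CSR matrices lack.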
Unroll-and-jam, of course, cannot be applied if the inner loop bodies that result from unrolling cannot legally be fused. Thus, it cannot be employed directly to improve the efficiency of the CSR-based matrix-vector product computation shown in Figure 2 . If we unrolled the outer loop over rows, it would not be legal to fuse the multiple inner loop bodies that result because their rows may be of different lengths. However, the row length analysis of SAGE's sparse matrices shown in Figure 3 shows that the rows generally have one of a small number of frequently occurring lengths. If we reorganized the matrix so that rows with the same length were together, then we could unroll and jam the sparse matrix-vector product loop nest to interleave the product computations for a group of rows of the same length.
We introduce Length-grouped CSR (L-CSR), a new sparse matrix representation that enables us to apply unroll-and-jam to a CSR-style sparse matrix-vector product. Figure 4 shows the structure of this representation. Like CSR format, L-CSR contains a pair of vectors, cols and nz, that contain, respectively, the column index and value of each non-zero in the matrix. As in CSR, a matrix row is represented by a contiguous sequence of entries in cols and nz. However, in L-CSR, rows are grouped by length, whereas in CSR rows appear in normal matrix order. In L-CSR, the row vector and a vector of (length, lenstart) pairs substitute for the vector rowstart in CSR. Each entry in row indicates the matrix row that corresponds to a contiguous sequence of elements in cols and nz. A (length, lenstart) pair represents a group of rows of a particular length; length is the row length and lenstart is the index in row where rows of that length begin. The offset in cols and nz of the first row in a length group can be determined from the lengths and numbers of rows in the length groups to the left of the current group in the pair vector.

Figure 5 shows machine-generated code for performing a sparse matrix-vector multiply using L-CSR format. It exploits the L-CSR representation by applying unroll-and-jam. The code shown was generated using an unroll-and-jam factor of 2. Lines 32-43 show unroll-and-jammed code for handling pairs of rows at a time. Interleaving the computation for multiple rows leads to better floating-point pipeline utilization. Lines 47-54 show code for any leftover rows that cannot be handled in bundles in the unroll-and-jammed loop. When rows of some particular short length occur with high frequency, it may be more efficient to fully unroll the inner loop over elements in a row. Lines 20-30 show special-case code for handling rows of length 2. In general, special-case, fully-unrolled code can be generated for any small set of short row lengths.
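Figure 5 is machine-generated and not reproduced here, but the following hand-written Fortran sketch conveys its structure for an unroll-and-jam factor of 2 (the names, the split of the (length, lenstart) pair vector into glen and gstart arrays, and the sentinel entry in gstart are our assumptions; the special-case fully-unrolled code for common lengths is omitted for brevity):

subroutine lcsr_mvmult(ngroups, glen, gstart, nrows, row, cols, nz, x, result)
  integer, intent(in)  :: ngroups, nrows
  integer, intent(in)  :: glen(ngroups)       ! row length of each length group
  integer, intent(in)  :: gstart(ngroups+1)   ! index in row(:) where each group begins;
                                              ! the last entry is a sentinel (one past the end)
  integer, intent(in)  :: row(nrows)          ! original matrix row of each stored row
  integer, intent(in)  :: cols(*)             ! column indices, grouped by row length
  real(8), intent(in)  :: nz(*), x(*)
  real(8), intent(out) :: result(nrows)
  integer :: g, rlen, r, rlast, j, p
  real(8) :: acc0, acc1

  p = 1                          ! running offset into cols/nz, advanced incrementally
  do g = 1, ngroups
     rlen  = glen(g)
     r     = gstart(g)
     rlast = gstart(g+1) - 1
     ! Unroll-and-jammed loop: two rows of identical length at a time,
     ! interleaving two independent dot-product recurrences.
     do while (r + 1 <= rlast)
        acc0 = 0.0d0
        acc1 = 0.0d0
        do j = 0, rlen - 1
           acc0 = acc0 + nz(p + j)        * x(cols(p + j))
           acc1 = acc1 + nz(p + rlen + j) * x(cols(p + rlen + j))
        end do
        result(row(r))     = acc0
        result(row(r + 1)) = acc1
        p = p + 2 * rlen
        r = r + 2
     end do
     ! Leftover row when the group contains an odd number of rows.
     if (r <= rlast) then
        acc0 = 0.0d0
        do j = 0, rlen - 1
           acc0 = acc0 + nz(p + j) * x(cols(p + j))
        end do
        result(row(r)) = acc0
        p = p + rlen
     end if
  end do
end subroutine lcsr_mvmult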
To generate tailored code for the L-CSR matrix-vector product for SAGE's sparse matrices, we generated special-case code for rows of length 1, 6, 7, and 10. Figure 3 shows that, in SAGE's sparse matrices, these lengths account for over 97% of the rows. The high frequency of these lengths in the two sample matrices is not an accident. In SAGE's hexahedral mesh, a typical cell depends upon itself and a pair of adjacent cells along each coordinate dimension; this results in a sparse matrix row of length 7. Rows of length less than 7 correspond to cells in various configurations along the domain boundary. Rows of length 10 correspond to a typical cell that has an extra three neighbors on one side because a neighboring cell has been cut into octants by adaptation. In the next section, we compare the performance of the CSR matrix-vector product with that of an L-CSR matrix-vector product using an unroll-and-jam factor of 5 with special-case code for rows of length 1, 6, 7, and 10.
There are three disadvantages of using an L-CSR representation. First, L-CSR format is not a random-access representation: the index in cols and nz at which each row starts is computed incrementally from the start position and length of the prior row. Second, since rows are reordered by length in L-CSR format, the reordering could reduce locality of accesses to x, since it will tend to move non-zeroes away from the diagonal of the sparse matrix (assuming the matrix was ordered so that they were near the diagonal originally). Third, writes into the result vector using L-CSR format may no longer be stride one. Our experiments in the next section show that for SAGE's sparse matrices, the benefits of unroll-and-jam and special-case full unrolling using L-CSR outweigh these potential costs.
4 Experimental Results
In this section, we describe an experimental evaluation of sparse matrix-vector multiplication using CSR and L-CSR formats with sparse matrices from SAGE's timing_b and timing_c sample problems. In Section 4.1, we describe our measurement methodology. In Section 4.2, we briefly describe the hardware platforms used for our experiments. Our findings are presented in Sections 4.3 and 4.4, which describe, respectively, experiments on a collection of hardware platforms and the results of detailed simulations.
4.1 METHODOLOGY
On each hardware platform, we measured 100 iterations of sparse matrix-vector multiplication using each strategy. We took certain precautions to ensure that our measurements were not contaminated by confounding effects. First, to avoid cache conflicts that can arise when arrays used in the same loop are placed in non-consecutive memory locations, for each experiment we allocated all components of a sparse matrix data structure, along with the input and output vectors used with them, at one time, placing them consecutively in memory. Second, to prevent the confounding effect of compulsory page faults and TLB misses on the first iteration, we performed one iteration of a sparse matrix-vector product computation to warm the memory system immediately before timing multiple iterations of the computation.
Experiments that attempted to exploit the symmetry of symmetric sparse matrices to improve performance yielded lower performance than L-CSR. We explored a sparse symmetric representation that explicitly represented only the lower or upper triangular portion of a sparse matrix in CSR format. Sparse matrix-vector multiply with such a representation incurs additional overhead because when processing each (row, col) pair, it must load, update, and store result(col) as well as result(row). This has the overall effect that each entry in the result vector is updated multiple times rather than computed in its entirety and stored only once, yielding lower performance.
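The following Fortran sketch (our illustration, not SAGE's code) shows where the overhead comes from: for every off-diagonal non-zero, result(c) must be read, updated, and rewritten in addition to the accumulation into result(i):

subroutine sym_csr_mvmult(nrows, rowstart, cols, nz, x, result)
  ! Symmetric product using only the stored lower triangle (diagonal
  ! included): each off-diagonal non-zero a(i,c) contributes to both
  ! result(i) and result(c), so entries of result are updated repeatedly
  ! rather than computed once and stored.
  integer, intent(in)  :: nrows
  integer, intent(in)  :: rowstart(nrows+1), cols(*)
  real(8), intent(in)  :: nz(*), x(*)
  real(8), intent(out) :: result(nrows)
  integer :: i, j, c

  result(1:nrows) = 0.0d0
  do i = 1, nrows
     do j = rowstart(i), rowstart(i+1) - 1
        c = cols(j)
        result(i) = result(i) + nz(j) * x(c)
        if (c /= i) result(c) = result(c) + nz(j) * x(i)   ! extra load/store traffic
     end do
  end do
end subroutine sym_csr_mvmult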
Sparse matrix-vector multiplication involves computing the dot product of each row of the matrix and the input vector. In the most straightforward case, this operation involves one floating-point multiplication and one addition per row element. For this reason, we computed "useful floating-point operations per second" as twice the number of non-zeroes in the matrix times the number of iterations, divided by the CPU time taken by the matrix multiplication. This counts the rate of essential floating-point instructions while disregarding non-productive operations such as register-to-register copies of floating-point values.
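Expressed as a formula (notation ours), with nnz the number of non-zero elements, iters the number of timed iterations, and t the measured CPU time in seconds:

    useful MFLOP/S = (2 * nnz * iters) / (10^6 * t)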
For comparison with the CSR and L-CSR strategies, we also measured sparse matrix-vector product computations using Im and Yelick's Sparsity package for register and cache blocking (Im, 2000; Im and Yelick, 2001). Our study of Sparsity using SAGE's timing_c matrix showed that Sparsity's exhaustive two-dimensional search of tile sizes (register blocking from 1 × 1 to 12 × 12 and cache blocking in powers of two from 1 × 1 to 65,536 × 65,536) found no cache or register blocking size that improved performance. We believe that cache and register blocking are ineffective for SAGE's sparse matrices because they lack patterns of local density; the majority of non-zero values in the matrices are clustered near the diagonal, with a few non-zeroes along outlying diagonals.
We also compared our results with those of IBM's sparse matrix-vector multiplication routine DSMMX from its Engineering and Scientific Subroutine Library (ESSL). In DSMMX, the sparse matrix is stored in a form known as the Ellpack-Itpack format (Oppe et al., 1988). In this form, the matrix is stored in a pair of two-dimensional panels whose major dimension size is the number of rows and whose minor dimension size is the maximum number of non-zeroes per row. One panel holds the column indices; the second holds the non-zero values. The fixed width of the panels is an advantage, but if most rows are not near the maximum length, the large number of zero values in the panels leads to inefficient memory usage and computation. Results obtained with DSMMX are presented in Section 4.3.
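To make the layout concrete, the following Fortran sketch (an illustration of the format as described above, not of ESSL's DSMMX interface; the array names and the convention that padding entries hold a zero value and a valid column index are our assumptions) performs a matrix-vector product over the two panels:

subroutine ellpack_mvmult(nrows, maxnz, cols, nz, x, result)
  integer, intent(in)  :: nrows, maxnz
  integer, intent(in)  :: cols(nrows, maxnz)  ! column indices; padding entries point at column 1
  real(8), intent(in)  :: nz(nrows, maxnz)    ! values; padding entries are 0.0
  real(8), intent(in)  :: x(*)
  real(8), intent(out) :: result(nrows)
  integer :: i, k

  result(1:nrows) = 0.0d0
  ! Fixed loop bounds: every row performs maxnz multiply-adds, so short
  ! rows waste memory bandwidth and arithmetic on the zero padding.
  do k = 1, maxnz
     do i = 1, nrows
        result(i) = result(i) + nz(i, k) * x(cols(i, k))
     end do
  end do
end subroutine ellpack_mvmult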
4.2 HARDWARE PLATFORMS
Below we enumerate architectures used in our experiments. On each platform, we performed experiments on a single processor.
• A system based on MIPS R12000 processors. Each node in the system contains a pair of processors, which share an 8MB unified secondary cache. Each R12000 can issue four instructions per cycle, which may include at most one memory access and one floating-point multiply-accumulate. Each processor has a 32KB two-way set-associative primary cache. Latency penalties for L1 and L2 misses are approximately 8 and 100 cycles, respectively. Code was compiled with the MIPSpro Fortran and C compilers V7.3.1.3m using the optimizations "-O3 -mips4 -r12000". Performance results were measured using the PAPI interface to hardware performance counters.
• A dual-processor 900MHz Itanium 2 workstation. Each processor has a 1.5MB L3 cache, a 256KB L2 cache, and a 32KB L1 cache. Code was compiled with Intel's efc and ecc V7.0 compilers at optimization level "-O3". Performance results were measured using the PAPI interface to hardware performance counters.
4.3 HARDWARE PERFORMANCE RESULTS
We computed useful MFLOP/S rates as the expected number of floating-point operations divided by the time spent performing them. Due to the structure of the sparse matrix-vector multiplication, one floating-point multiply and one floating-point add are expected for each non-zero element. Time was measured on each platform as cycle counts using hardware performance counters as described in Section 4.2.

Figure 6 shows measurements for 100 iterations of a sparse matrix-vector product using SAGE's timing_c and timing_b matrices in CSR and L-CSR format. The L-CSR code used in these experiments was tailored for SAGE's matrices as described in Section 3. Our results show that using L-CSR format improves the performance of sparse matrix-vector multiplication on all hardware platforms. The smallest improvement was roughly 41%, on the MIPS R12000. On the Alpha and Power3-II, improvements of 47-71% were measured. On the Itanium 2, performance improved by more than a factor of 2. Using IBM's ESSL sparse matrix-vector product routine DSMMX on the Power3, we achieved only 46 MFLOP/S with timing_c and 40 MFLOP/S with timing_b; both the CSR and L-CSR routines achieved better performance by a wide margin.

For all architectures, results for timing_b were quite similar to those for timing_c. This is because the two matrices, although derived from different problem specifications, have very similar structures that arise from the characteristics of the SAGE application. For that reason, in the remainder of the paper we present only the analysis of timing_c.
We performed a series of experiments to evaluate the impact of two aspects of the L-CSR computation: (1) unroll-and-jam to interleave the computation of multiple rows and (2) fully-unrolled special-case code for short, common row lengths. The results for the Alpha and Itanium 2 architectures are shown in detail in Table 1. We evaluated unroll-and-jam factors of 1 (no unroll-and-jam), 2, 4, 6, and 8, with and without full unrolling for common row lengths. Our results show that both optimizations yield important performance improvements. On the Alpha, unroll-and-jam (using the best unroll-and-jam factor) and special-case full unrolling improve performance by 17% and 16%, respectively. The two optimizations together give a 26% improvement over plain L-CSR.
The improvements are especially dramatic on the Itanium 2. Unroll-and-jam improves performance by 145% in the best case; special-case full unrolling improves performance by 230%. Special-case unrolling is especially beneficial on the Itanium because of the way its compiler implements software pipelining for CSR. On the Itanium, software pipelining is accomplished by generating a single loop body that uses predicated execution to represent fill, steady state, and drain phases with one block of code. This approach decreases code size but results in wasted issue slots during execution of fill and drain phases. If the loop has a small trip count, as in the short rows of the SAGE matrices, the fill and drain overhead can severely degrade performance, taking as much time as several iterations of the main loop. L-CSR improves performance by increasing loop granularity, making software pipelining unnecessary. Interestingly, once code is unrolled for special cases, applying unroll-and-jam actually degrades performance on the Itanium 2. This may be due to a more harmful effect of larger code size on VLIW architectures.
4.4 SIMULATION RESULTS
We used the Rice Simulator for ILP Multiprocessors (RSIM; Pai et al., 1997) to understand the dynamic flow of instructions in the processor and to get more detailed data about memory system utilization. RSIM is an execution-driven simulator that uses executable files compiled for the SPARC architecture. It models both superscalar processor behavior and multi-level memory hierarchies to a high degree of detail. The RSIM processor model takes advantage of instruction-level parallelism (ILP) in a way similar to modern superscalar processors. RSIM is commonly used to model multiprocessor systems; we used it in single-processor mode with the default RSIM parameters. This configuration includes two ALUs, two FPUs, and two address generation units; four instruction retire slots; and eight register windows. The memory system includes a 16KB write-through L1 cache with 1-cycle latency, a 64KB four-way set-associative L2 cache with 5-cycle latency, and DRAM with 18-cycle latency. These parameters are generally consistent with a 300MHz SPARC processor. RSIM enabled us to precisely measure dynamic instruction counts, functional unit usage, and memory performance.

Figure 7 shows the number of instructions of different types measured in simulations of a single iteration of CSR and L-CSR multiplication on SAGE's timing_c matrix. Our simulation results show that L-CSR dramatically reduces the number of instructions executed, largely by eliminating arithmetic overhead. ALU instructions account for more than 50% of the total instructions in CSR multiplication; L-CSR multiplication eliminated 72% of these instructions. Floating-point and branch instructions were also eliminated. The principal difference in floating-point operations observed here was due to the lack of floating-point register copies in L-CSR and the slightly smaller number of useful floating-point operations needed (as explained in Section 4.1). Larger loop bodies eliminated branch instructions.
Reductions in total instruction count do not necessarily improve execution time. To understand how our simulations differ in performance, we measured detailed cycle usage in addition to instruction counts. RSIM divides the processor time into parts: time spent handling different kinds of stalls, such as cache misses and interlocks, and busy time spent graduating instructions at the processor's maximum rate. Figure 8 shows the breakdown of time spent during each simulation.
Our intuition was that the performance of sparse matrix-vector multiplication should be limited by memory stalls, since there is no temporal locality in loading elements of the sparse matrix. In Figure 8, we see that read stalls are the dominant stall cost. The memory stall cost measured for L-CSR was 10.9% greater than that of CSR. A slight increase was expected: CSR achieves good spatial locality for the input vector as the multiplication progresses through the rows because most of the non-zero values are near the diagonal of the matrix, and this locality property can be destroyed by the length-based row reordering performed by L-CSR. In measurements of memory hierarchy utilization on our hardware systems, we found that on the Itanium 2, L-CSR decreases the rate of first-level and second-level cache misses but increases the rate of third-level cache misses and translation lookaside buffer (TLB) misses. On the MIPS, L-CSR increases the rate of TLB misses and both primary and secondary cache misses relative to CSR.
In an effort to improve memory locality, we tried a blocking strategy to reduce memory stalls: applying the L-CSR reordering strategy within groups of k rows, for different values of k. The effect on performance turned out to be minuscule.

Figure 8 shows that "busy time" for CSR is more than double that for L-CSR. For CSR, the high busy time indicates that the processor is bogged down with ALU overhead, causing the memory system to be used at less than full capacity. In CSR multiplication, the processor was busy 37.43% of the time. L-CSR reduced the busy time by 74% without changing stall time significantly. The reduction in instructions issued and the dramatic reduction in busy time result from the fact that L-CSR code presents larger loop bodies to the compiler, revealing more opportunities to reduce address arithmetic overhead.
L-CSR reduced the number of floating-point instructions as well as ALU instructions. In the SPARC/RSIM assembly code for CSR sparse matrix-vector multiplication, the floating-point operations include the following: (1) for each row, loading a zero into an accumulator register; (2) for each row element, multiplying an element of the matrix by an element of the input vector; (3) for each element, adding the product computed in step 2 to the accumulator register; and (4) performing register-to-register moves. In L-CSR, since the row length is known ahead of time, the accumulator register can be initialized with the first product of each row instead of zero. In this way, steps 1 and 3 can be completely eliminated for the first element of each row. For matrices with small average row lengths, the elimination of a few instructions per row can be quite significant. The unroll-and-jam optimization enabled by L-CSR can also eliminate many of the register-to-register move instructions in step 4, although this effect was negligible on the SPARC/RSIM platform. The details of eliminating floating-point operations differed greatly across platforms. On the MIPS, almost no floating-point instructions could be eliminated. On the Power3, however, over 50% of the floating-point instructions could be eliminated; most were register-to-register moves.
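A minimal Fortran sketch of this initialization trick, in the style of the L-CSR sketch from Section 3 (our illustration, not the actual generated code), for a row of known length rlen starting at offset p:

real(8) function row_dot(rlen, p, cols, nz, x)
  integer, intent(in) :: rlen, p     ! row length (at least 1) and offset into cols/nz
  integer, intent(in) :: cols(*)
  real(8), intent(in) :: nz(*), x(*)
  integer :: j

  ! Seed the accumulator with the first product: the explicit zero
  ! initialization (step 1) and the first add (step 3) disappear.
  row_dot = nz(p) * x(cols(p))
  do j = 1, rlen - 1
     row_dot = row_dot + nz(p + j) * x(cols(p + j))
  end do
end function row_dot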
L-CSR's unroll-and-jam in the outer loop and full unrolling of certain inner loops eliminated the vast majority of CSR's branch instructions. While branch prediction in CSR was good enough to nearly eliminate branch stall time, branch instructions still took up valuable issue slots, further increasing ALU overhead.
In addition to reducing the number of instructions, L-CSR improves performance by overlapping instructions to fill the floating-point pipeline. With the unroll-and-jam optimization, floating-point operations in several consecutive rows can be interleaved, reducing floating-point stalls. In RSIM, floating-point stall time was reduced by 82%.
5 Related Work
A variety of different data and computation reorganization strategies have been explored in an effort to increase the performance of sparse matrix-vector product computations.
The Ellpack-Itpack format (Oppe et al., 1988) collapses sparse matrices into a pair of dense matrices of size nrows × maxnz, where nrows is the number of rows and maxnz is the maximum number of non-zeroes per row; one matrix contains coefficients and the other column indices. Ellpack-Itpack works well for matrices in which the maximum number of non-zeroes in any row is not large and the number of non-zeroes per row is fairly uniform. If not, this representation can waste a great deal of computation and memory bandwidth on zero-fill entries.

Paolini and Radicati di Brozolo (1989) and Saad (1988) proposed similar variants of the Ellpack-Itpack data structure known as jagged diagonals. This data structure reorders rows by length and then stores a dense vector with a column entry from each row. Each column runs from the first (longest) row down to the row just prior to the one in which the first non-zero appears in this column. While this representation facilitates vectorization along columns, for sparse matrix-vector multiply it sacrifices reuse of the output vector, which it loads and updates repeatedly as each diagonal is processed.

Agarwal et al. (1992) proposed a feature-extraction-based algorithm that extracts certain kinds of regular structure so they can be processed separately in a more efficient manner. They dissect the matrix into three separate representations: one for nearly dense blocks, one for nearly dense diagonals, and one for the remaining non-zeroes. Once the nearly dense blocks and diagonals are extracted, the remaining rows of non-zero elements are sorted by row length and stored in a "ladder scheme" matrix format similar to the Ellpack-Itpack format. Unlike Itpack, however, this format reduces zero padding by partitioning the rows with a constant number of rows per partition. The width of each partition is the maximum row length within the partition. Thus, the ladder scheme may still result in excessive zero padding when the difference between row lengths within a partition is large, though the amount of padding is almost always much less than that of Itpack.

Das et al. (1992) applied a breadth-first reordering strategy known as Reverse Cuthill-McKee (George and Liu, 1981) to reorder elements in an unstructured mesh. Applying this reordering, originally devised to minimize the bandwidth and profile of sparse matrices, has the effect of improving the locality of indirect accesses to the input vector when computing its product with the sparse matrix. The reordering improves both cache and TLB locality.

Im and Yelick (2001) designed the Sparsity system, which tiles the matrix into uniform block sizes rather than searching for dense blocks within a matrix. An exhaustive search with speed profiling is used to find the optimal block size. This performs well for some patterns of non-zeroes, but tiling is less effective on matrices with irregular non-zero patterns.

Vuduc et al. (2002) developed upper and lower bounds for the effectiveness of performance improvements for sparse matrix-vector multiply based on calculated memory hierarchy performance. The upper bound is a loose one, since it applies only to improvements that work by improving memory locality. In practice, many computations have performance bottlenecks other than memory bandwidth. In particular, our work improves performance by reducing computational overhead, with little effect on memory locality.

Toledo (1997) used a combination of bandwidth reduction, blocking, and prefetching to improve the performance of sparse matrix-vector multiplication on RISC processors.
In Toledo's scheme, the matrix is scanned for completely dense 1 × 2 and 2 × 2 blocks. The intent was to find as many opportunities for blocking as possible, even with small blocks, rather than finding a small number of large semi-dense blocks. This strategy reduces the number of load instructions required, since some row and column index loads and input vector loads become unnecessary. The number of arithmetic instructions, however, remains the same or increases with this system. For matrices with small average row length, the large number of arithmetic instructions prevents the memory system from being used to full capacity.

Gropp et al. (2000) describe a strategy for improving the performance of a code that needs to multiply a sparse matrix by multiple independent vectors. By multiplying a sparse matrix by multiple independent vectors at a time, the sparse matrix data structure components can be reused in registers rather than streamed through the memory hierarchy once per vector. Like our unroll-and-jam strategy, interleaving the computations on multiple independent vectors boosts floating-point pipeline efficiency.
6 Conclusions
Previous research has shown that sparse matrix-vector product computations can be accelerated by exploiting local patterns of density to enable register blocking (Toledo, 1997; Im, 2000; Im and Yelick, 2001). Register blocking strategies for sparse matrix-vector product computations reduce the need to load indirection vectors and provide temporal register reuse of both non-zero matrix elements and the corresponding vector elements by which they are multiplied. Memory hierarchy blocking for cache and TLB reuse has also been shown to be profitable (Im, 2000).
In this paper, we have described a new technique for accelerating sparse matrix-vector product calculations for matrices in which rows occur in a small number of distinct lengths. Matrices with this property are characteristic of those that arise in SAGE, an important code that uses cell-by-cell adaptive refinement of a structured hexahedral mesh. Reorganizing the matrix representation to group together rows of identical length enabled us to apply unroll-and-jam to the sparse matrix-vector product computation. This format yielded performance improvements from 41% to over a factor of 2 for the platforms we tested.
Our work is unique in that it focuses on reducing computational overhead rather than improving memory hierarchy performance. For sparse matrix-vector multiplication on the SAGE matrices, we found that computation was as serious a bottleneck as memory bandwidth. In general, reducing computation overhead can be as important as improving memory locality for boosting performance.
Preliminary experiments on general sparse matrices from the NIST Matrix Market catalogue (NIST, 2003) have shown that having a high frequency of rows with a small number of fixed lengths is not a common property. Thus, we expect that our strategy will be most useful for matrices that arise out of codes such as SAGE that perform structured adaptive mesh refinement. Despite this somewhat narrow range of applicability, our technique is significant because of its benefits for long-running adaptive computations with SAGE and SAGE's applicability to a broad class of problems.
