Abstract
Introduction
Sparse matrix-vector multiplication (SpMxV) is one of the most important computational kernels, lying at the heart of effective and popular iterative solution methods such as CG and GMRES [15], which are used to solve sparse systems arising from the simulation of a large variety of physical, financial, social and other problems. SpMxV is generally reported to perform poorly on modern microprocessors (e.g. at 10% of peak performance [19]), mainly because it is a memory-bound kernel [6]. Matrix-vector multiplication exhibits more intense memory access needs than other traditional linear algebra kernels like matrix multiplication (MxM) or LU decomposition, which are more computationally intensive. MxM and LU benefit from the so-called surface-to-volume effect, since for a problem size of n they perform O(n^3) operations on O(n^2) data. On the contrary, matrix-vector multiplication performs O(n^2) operations on O(n^2) data, which means that the ratio of memory accesses to floating-point operations is significantly higher. Seen from another point of view, there is little data reuse in matrix-vector multiplication, i.e. very restricted temporal locality. When sparsity comes into play, performance is further degraded. In order to avoid the extra computation and storage overheads imposed by the large majority of zero elements contained in the sparse matrix, the non-zero elements of the matrix are stored contiguously in memory, while additional data structures assist in the proper traversal of the matrix and vector elements. For example, the classic Compressed Storage Row (CSR) format [2] uses the row_ptr structure to index the start of each row within the non-zero element array a, and the col_ind structure to index the column each element is associated with. These additional indexing structures greatly affect the kernel's performance, since they add memory access operations, memory bandwidth pressure and cache interference. Sparse matrices also incur irregular accesses to the input vector x (the CSR format is assumed) that follow the sparsity pattern of the matrix. This irregularity complicates the exploitation of data reuse on vector x and increases the number of cache misses on this vector. Finally, there is also a non-obvious implication of sparsity: the rows of sparse matrices have varying lengths which are frequently small. This increases the loop overhead, since only a small number of computations is performed in each loop iteration.
This research is supported by the PENED 2003 Project (EPAN), co-funded by the European Social Fund (75%) and National Resources (25%).
A large number of research papers [1, 3, 5, 8-13, 16-20] have proposed optimization techniques to improve the performance of SpMxV (see Section 2 for details). A general conclusion is that SpMxV can be efficiently optimized by exploiting information regarding the matrix structure and the processor's architectural characteristics. In general, previous research focuses on a subset of the reported problems and proposes optimizations applied to a limited number of sparse matrices. This fact, along with the different CPUs used in previous works, may lead to contradictory conclusions and, perhaps, to confusion regarding the problems and candidate solutions for SpMxV optimization. In addition, the exact reason for the performance gain after the application of the proposed optimizations is rarely investigated. For example, blocking implemented with the Block Compressed Storage Row (BCSR) format was proposed by Im and Yelick [8] as a transformation to tame irregular accesses to the input vector and exploit its inherent reuse, as in dense matrix optimizations. One-dimensional blocking is also proposed by Pinar and Heath [13] in order to reduce indirect memory references, while, quite recently, Buttari et al. [3] and Vuduc and Moon [19] accentuate the merit of blocking (the latter with variably sized blocks) as a transformation to reduce indirect references and enable register-level blocking and unrolling. However, it is not clarified whether the benefits of blocking can actually be attributed to better cache utilization, memory access reduction or ILP improvement. Furthermore, White and Sadayappan [20] report that the lack of locality is not a crucial issue in SpMxV, whereas many important previous works exploit reuse of the input vector in order to improve performance [5, 11, 12, 17].
The goal of this paper is to assist in understanding the performance issues of SpMxV on modern microprocessors. To our knowledge, there are no experimental results concerning the performance behavior of this kernel, or any of its optimized versions, on current microarchitectures. In order to achieve this goal, we have categorized the problems of the algorithm as reported in the literature. For each problem we conduct a series of experiments in order to quantify its effect on performance. Our experimental results provide valuable insight into the performance of SpMxV on modern microprocessors and reveal issues that can prove particularly useful in the process of optimization. Our experiments are performed on a large suite of 100 matrices selected from Tim Davis' collection [4]. Based on the conclusions drawn from these experiments, we propose guidelines that can aid the optimization process.
The rest of the paper is organized as follows: Section 2 presents related work on SpMxV and optimization methods and Section 3 presents the basic algorithm and the reported problems. In Section 4 we present various experimental results that illuminate the performance issues of SpMxV, while Section 5 summarizes our conclusions and discusses future research work.
Related work
Sparse matrix-vector multiplication has attracted intensive scientific attention in the last two decades. The proposal of efficient storage formats for sparse matrices, like CSR, BCSR, CDS, Ellpack-Itpack and JAD [2, 10, 15], was one of the primary concerns. Temam and Jalby [16] perform a thorough analysis of the cache behavior of the algorithm, pointing out the problem of the irregular access pattern on the input vector x. Toledo [17] deals with this problem by proposing a permutation of the matrix that favors cache reuse in the access of x. Furthermore, the application of blocking is also proposed in that work, in order to both exploit temporal locality on x and reduce the need for indirect indexing through col_ind. Software prefetching for a and col_ind is also used to improve memory access performance. The proposed techniques were evaluated over 13 sparse matrices on a Power2 processor and achieved a significant performance gain for the majority of them. White and Sadayappan [20] state that data locality is not the most crucial issue in sparse matrix-vector multiplication. Instead, short row lengths, which are frequently encountered in sparse matrices, may drastically degrade performance due to the reduction of ILP. For this reason, the authors propose alternative storage schemes that enable unrolling. Their experimental results exhibited performance gains on an HP PA-RISC processor for all 10 sparse matrices used. Pinar and Heath [13] refer to irregular and indirect accesses to x as the main factors responsible for performance degradation. Focusing on indirect accesses, the application of one-dimensional blocking with the BCSR storage format is proposed, in order to drastically reduce the number of indirect memory references. In addition, a column reordering technique is also proposed, which enables the construction of larger dense sub-blocks. An average speedup of 1.21 is reported for 11 matrices on a Sun Ultra-SPARC II processor.
With a primary goal to exploit reuse of vector x, Im and Yelick propose the application of register blocking, cache blocking and reordering [7, 8]. Additionally, their blocked versions of the algorithm are capable of reducing loop overheads and indirect referencing, while increasing the degree of ILP. Register blocking is the most promising of the above techniques. The authors also propose a heuristic to determine an efficient block size. They perform their experiments on four different processors (UltraSPARC I, MIPS R10000, Alpha 21164, PowerPC 604e) for a wide matrix suite involving 46 matrices. For almost a quarter of these matrices, register blocking achieved significant performance benefits. Geus and Röllin [5] apply software pipelining to increase ILP, register blocking to reduce indirect references and matrix reordering to exploit the reuse of x. They perform a set of experiments on a variety of processors (Pentium III, UltraSPARC, Alpha 21164, PA-8000, PA-8500, Power2, i860 XP) and report significant performance gains on two matrices originating from the discretization of the 3-D Maxwell's equations with FEM. Vuduc et al. [18] estimate the performance bounds of the algorithm and evaluate the register-blocked code with respect to these bounds. Furthermore, they propose a new approach to select near-optimal register block sizes. Mellor-Crummey and Garvin [9] accentuate the problem of short row lengths and propose the application of the well-known unroll-and-jam compiler optimization in order to deal with it. Unroll-and-jam achieves a 1.11-2.3 speedup for two matrices taken from the SAGE package, measured on MIPS R12000, Alpha 21264A, Power3-II and Itanium processors. Pichel et al. [11] model the inherent locality of a specific matrix with the use of distance functions and improve this locality by applying reordering to the original matrix. The same group also proposes the use of register blocking to further increase performance [12]. The authors report an average improvement of 15% for 15 sparse matrices on MIPS R10000, UltraSPARC II, UltraSPARC III and Pentium III processors.
Buttari et al. [3] provide a performance model for the blocked version of the algorithm based on the BCSR format, and propose a method to select dense blocks efficiently. Their experiments are performed on K6, Power3 and Itanium II processors for a suite of 20 sparse matrices and validate the accuracy of the proposed performance model. Vuduc et al. [19] extend the notion of blocking in order to exploit variable block shapes; to achieve this, they decompose the original matrix into a sum of submatrices, each stored in the BCSR format. Their approach is tested on Ultra2i, Pentium III-M, Power4 and Itanium II processors for a suite of 10 FEM matrices that contain dense sub-blocks. The proposed method achieves better performance than pure BCSR on all processors except Itanium II. Finally, Willcock and Lumsdaine [21] mitigate the memory bandwidth pressure by compressing the indexing structure of the sparse matrix, sacrificing in this way some CPU cycles. They perform their experiments on a PowerPC 970 and an Opteron processor for 20 matrices, achieving an average speedup of 15%.
Summarizing the results of previous research in the field, the following conclusions may be drawn: (a) the matrix suites used in the experimental evaluations are usually quite small, (b) the evaluation platforms in most cases include previous-generation microarchitectures, (c) the conclusions are sometimes contradictory and (d) the performance gains attained by the proposed methods are not thoroughly analyzed in relation to the specific problems attacked. The goal of this work is to understand the performance issues of the SpMxV kernel on modern microprocessors and provide solid optimization guidelines. For this reason we employ a wide suite of 100 matrices, perform a large variety of experiments and report performance data and information collected from the performance monitoring facilities provided by the microprocessors.
Basic algorithm and problems
The most frequently used storage format for sparse matrices is Compressed Storage Row (CSR) [2]. According to this format, the nnz non-zero elements of a sparse matrix with n rows are stored contiguously in memory in row-major order in the array a. The col_ind array of size nnz stores the column index of each element in the original matrix, and the row_ptr array of size n + 1 stores the beginning of each row within a and col_ind. Fig. 1 shows an example of the CSR format for a sparse 6 × 6 matrix (a), along with the implementation of matrix-vector multiplication for a dense N × M matrix (b) and for a sparse matrix stored in CSR (c).
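The following C sketch corresponds to the two kernels referenced in Fig. 1(b) and 1(c): matrix-vector multiplication for a dense N × M matrix and for a sparse matrix stored in CSR. Names follow the CSR description above; 64-bit integer indices (long on the x86_64 platforms used here) are assumed, as in the experiments reported later.

```c
/* Dense matrix-vector multiplication: y = A*x, A stored row-major in a[]. */
void dmv(const double *a, const double *x, double *y, long N, long M)
{
    for (long i = 0; i < N; i++) {
        double yi = 0.0;
        for (long j = 0; j < M; j++)
            yi += a[i * M + j] * x[j];   /* sequential access to x */
        y[i] = yi;
    }
}

/* SpMxV over CSR: a[] and col_ind[] hold nnz entries, row_ptr[] holds n+1. */
void spmv_csr(const double *a, const long *col_ind, const long *row_ptr,
              const double *x, double *y, long n)
{
    for (long i = 0; i < n; i++) {
        double yi = 0.0;
        for (long j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            yi += a[j] * x[col_ind[j]];  /* indirect, irregular access to x */
        y[i] = yi;
    }
}
```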
According to literature, SpMxV presents the following problems that can potentially affect its performance:
-Memory intensity (no temporal locality in the matrix) [3, 6, 9]. This is an inherent problem of the algorithm, regardless of whether the matrix is sparse or dense. Unlike other important numerical codes like matrix multiplication (MxM) or LU decomposition, the kernel is memory bound, and the elements of the matrix are used only once in matrix-vector multiplication.
-Indirect memory references [13]. This is the most apparent implication of sparsity. In order to save memory space and reduce floating-point operations, only the non-zero elements of the matrix are stored. To achieve this, the indices of the matrix elements need to be stored and accessed from memory, via the col_ind and row_ptr data structures. This implies additional load operations, traffic for the memory subsystem and cache interference.
-Irregular memory accesses for vector x [5, 7, 11] . Unlike the case of dense matrices where the access to the vector x is sequential, in sparse matrices this access is irregular and dependent on the sparsity structure of the matrix. This fact complicates the process of exploiting the inherent reuse in the access of vector x.
-Short row lengths [3, 9, 20] : Although this problem is not obvious, it is very often met in practice. Many sparse matrices exhibit a large number of rows of short length. This fact may degrade performance due to the significant overhead of the loops, when the trip count of the inner loop is small.
Experimental evaluation

Experimental preliminaries
Our experiments were performed on a set of 100 matrices (see Table 4). The majority of them were selected from Tim Davis' collection [4]. The first matrix is a dense 1000 × 1000 matrix, matrices 2-45 are also used in SPARSITY [7], matrix #46 is a 10000 × 10000 random sparse matrix, matrix #87 is a 5-point stencil finite-difference matrix created by SPARSKIT [14], while the rest are the largest rectangular matrices of the collection, both in terms of non-zero elements and number of rows. All matrices are stored in CSR format.
The experimental platform includes three different microprocessors: an Intel Core 2 Xeon (clock speed: 2.6 GHz, 4 MB L2 cache - Woodcrest), an Intel Pentium 4 Xeon (clock speed: 2.8 GHz, 1 MB L2 cache - Netburst) and an AMD Opteron (clock speed: 1.8 GHz, 1 MB L2 cache - Opteron). These processors are a representative set of the commodity processors currently available. The systems run Linux (kernel version 2.6) for the x86_64 ISA, and the programs were compiled using gcc version 4.1 with the -O3 -funroll-loops optimization flags. The latter switch causes the compiler to apply loop unrolling to all loops of the program.
The experiments were conducted by measuring the execution time of 128 consecutive SpMxV operations with randomly created x vectors, for every matrix in the set and for each microprocessor. The floating point operations per second (FLOPS) metric of each run was calculated by dividing the total number of operations (2 × nnz) by the execution time. We used 64-bit integers for the representation of indices in col_ind and applied double precision arithmetic. It should be noted that we made no attempt to artificially pollute the cache after each iteration, in order to better simulate the behavior of iterative scientific applications, where the data of the matrices are present in the cache either because they have just been produced, or because they were recently accessed.
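A minimal sketch of the measurement loop described above, assuming the spmv_csr kernel sketched earlier; the timing routine and the MFLOPS computation are illustrative rather than the exact harness used in the experiments.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

void spmv_csr(const double *a, const long *col_ind, const long *row_ptr,
              const double *x, double *y, long n);

#define NUM_LOOPS 128   /* consecutive SpMxV operations, as described above */

double benchmark_spmv(const double *a, const long *col_ind, const long *row_ptr,
                      const double *x, double *y, long n, long nnz)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int k = 0; k < NUM_LOOPS; k++)
        spmv_csr(a, col_ind, row_ptr, x, y, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* 2 floating-point operations (multiply and add) per non-zero element. */
    return (2.0 * nnz * NUM_LOOPS) / secs / 1e6;   /* MFLOPS */
}
```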
One of the most prominent characteristics of modern processors is hardware prefetching, a technique that mitigates the ever-growing memory wall problem by hiding memory latency. It is based on a simple hardware predictor that detects reference patterns (e.g. serialized accesses) and transparently prefetches cache lines from main memory into the CPU cache hierarchy. In order to gain better insight into the performance issues involved, we conducted experiments that evaluate the effect of hardware prefetching on the SpMxV kernel by disabling it. We present results for the Intel processors only, since there does not seem to be a (documented) way to disable hardware prefetching on AMD processors. A summary of the results obtained is presented in Table 1.
Experimental evaluation of serial SpMxV
Basic performance of serial SpMxV
Fig. 2 shows the detailed performance results for the SpMxV kernel in terms of FLOPS for each matrix and architecture in the experimental set. To gain a better understanding of the results, we consider the benchmark of a Dense Matrix-Vector multiplication (DMxV) for a dense 1024 × 1024 matrix as an upper bound for the peak performance of the SpMxV kernel. Summarized results are presented in Table 2. As expected, the more recent Woodcrest processor outperforms the other two across the whole matrix set. Moreover, while Netburst and Opteron exhibit similar behavior for each matrix, Woodcrest in some cases deviates greatly. This is apparent, for example, in matrices #14, #16 and #54, where the performance of Woodcrest increases by a large factor, most probably due to its larger L2 cache. Furthermore, it is clear from Fig. 2 that performance varies greatly across the matrix set. In order to further elaborate on this observation, we make a distinction between two classes in the matrix set: matrices whose working set fits entirely in the L2 cache, and thus experience only compulsory misses, and matrices whose working set is larger than the L2 cache size and may also experience capacity misses. The working set (ws) in bytes is calculated as ws = (nnz × 2 + nrows × 2 + ncols) × 8. In Fig. 3 we present the performance attained by each matrix, with its working set marked on the x axis. The vertical line in each graph designates the L2 cache size of each architecture. The figure makes clear that the large performance differences across matrices are due to the size of their working sets: if the working set of a matrix fits in the cache, significantly higher performance is to be expected. It is evident that the performance issues involved in each category are different, and comparing the performance of matrices from different classes may lead to false conclusions.
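The formula can be read term by term as follows; attributing each term to a particular data structure is our interpretation, consistent with the 8-byte values and 8-byte indices used throughout (row_ptr actually holds nrows + 1 entries, approximated here by nrows).

```latex
ws = \underbrace{8\,nnz}_{a} + \underbrace{8\,nnz}_{col\_ind}
   + \underbrace{8\,nrows}_{row\_ptr} + \underbrace{8\,nrows}_{y}
   + \underbrace{8\,ncols}_{x}
   = (nnz \times 2 + nrows \times 2 + ncols) \times 8 \ \text{bytes}.
```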
Additionally, Fig. 4 presents the performance of each matrix with respect to the L2 cache miss-rate as measured from the performance counters of each processor. As anticipated, working sets that are smaller than the cache size exhibit a close to zero L2 miss-rate. At a coarser level, there seems to be a correlation between the performance in FLOPS and the L2 misses. Nevertheless, the L2 miss-rate alone does not suffice to explain the performance of the kernel. For example, there are cases where a great increase in the miss-rate does not have an equivalent effect on performance, whereas matrices with similar miss-rates achieve significantly different MFLOPS.
Table 2. Summarized results (MFLOPS) for the performance of the SpMxV kernel.
Irregular accesses
In order to evaluate the performance impact of irregular accesses on x, we have developed a benchmark, henceforth called noxmiss, which tries to eliminate cache misses on vector x. More precisely, noxmiss zeroes out the col_ind array, so that each reference to x accesses only x[0], resulting in an almost perfect access pattern on x. Note that the noxmiss version of the algorithm differs from the standard one only in the values contained in the col_ind array, and thus executes exactly the same operations. Obviously, its calculations are incorrect, but it is quite safe to assume that any performance deviation observed between the two versions is due to the effect of irregular accesses on the input vector x. Results of the experiments for the noxmiss benchmark are presented in Table 3.
Table 3. Summarized results for the noxmiss benchmark: speedup and number of matrices that encountered a minimum performance gain of 10%, 20% and 30%.
It is worth noticing that only a small percentage of the matrices (no more than 1/3 of the total matrix set) encountered a significant performance speedup of over 10% on all processors. This means that the irregular access pattern of SpMxV is not the prevailing performance problem. For the large majority of matrices, the access to x seems to present some regularity that either favors data reuse from the caches or exhibits patterns that can be detected by the hardware prefetching mechanisms. However, the majority of the matrices that performed rather poorly on the standard benchmark encountered quite significant speedups with the noxmiss benchmark. This leads to the conclusion that there exists a subset of matrices for which the irregular accesses to x pose a considerable impediment to performance. These matrices have a rather irregular non-zero element pattern, which leads to poor access and low reuse of x and tends to degrade performance.
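A minimal sketch of how the noxmiss variant described above can be set up; the function name is illustrative, and the kernel itself is the unmodified CSR code.

```c
#include <string.h>

/* noxmiss setup: zero out col_ind so that every access to the input vector
 * hits x[0]. The kernel then executes exactly the same instruction stream as
 * the standard CSR version, but with an almost perfect access pattern on x
 * (the numerical result is, of course, meaningless). */
void make_noxmiss(long *col_ind, long nnz)
{
    memset(col_ind, 0, nnz * sizeof(*col_ind));
}
```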
Short row lengths
Short row lengths, which are frequently met in sparse matrices, lead to a small trip count in the inner loop, a fact that may degrade performance due to the increased loop overhead. In order to evaluate the impact of short row lengths on the performance of SpMxV, we focus on matrices that include a large percentage of short rows. Fig. 5 shows the performance of matrices in which more than 80% of the rows contain fewer than eight elements. These matrices are sorted by their ws along the x axis. The vertical line represents the cache size of each processor and the horizontal line represents the average performance across all matrices (see Table 2). The obvious conclusion that can be drawn from Fig. 5 is that matrices with large working sets and many short rows exhibit performance significantly lower than the average. This performance degradation could be attributed to the loop overhead. However, the fact that matrices with many short rows and small working sets achieve remarkably good performance provides a hint that loop overhead is not the only factor. Another important observation that supports this point is that the matrices reported in Fig. 5 coincide (with few exceptions) with the matrices that benefited from the noxmiss benchmark. These facts guide us to the conclusion that short row lengths may indicate a large number of cache misses on the x vector. This can be explained by the fact that short row lengths increase the probability of accessing completely different elements of x in subsequent rows.
Indirect memory references
There are two indirect memory accesses in the SpMxV kernel: one through row_ptr to determine the bounds of the inner loop and one through col_ind for the access to x. To investigate the effect of the indirect memory references on the performance of the kernel, we used synthetic matrices with a constant number of contiguous elements per row. These matrices enable us to eliminate either case of indirect accesses by replacing them with sequential ones (noind-rowptr, noind-colind). We then compare the performance of the new versions with the standard one in order to obtain a qualitative view of the performance impact of the indirect references. We applied the original SpMxV kernel and the modified versions to a number of synthetic matrices with 1,048,576 elements and varying row length. Fig. 6 summarizes the performance measured for a subset of the row lengths applied. The performance does not deviate significantly for different row lengths. It is clear that the indirect memory references through row_ptr do not affect performance. This is quite predictable, since these references are rare and replace an already existing overhead in the initialization of the inner loop. On the other hand, the indirect access to x through col_ind leads to a dramatic degradation in performance. In this case, each memory access to x is burdened by one additional memory access, which increases the ws of the problem, adds extra instructions to the code and limits the IPC of the kernel.
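A sketch of the noind-colind variant for the synthetic matrices described above, under the assumption that the contiguous non-zero elements of each row start at column 0; function and variable names are illustrative. The noind-rowptr variant analogously computes the inner-loop bounds as i × row_len instead of reading row_ptr.

```c
/* noind-colind: for a synthetic matrix with row_len contiguous non-zero
 * elements per row, the column index is implied by the loop counter, so the
 * col_ind array is never read. */
void spmv_noind_colind(const double *a, const double *x, double *y,
                       long n, long row_len)
{
    for (long i = 0; i < n; i++) {
        double yi = 0.0;
        const double *row = a + i * row_len;   /* non-zeros of row i */
        for (long j = 0; j < row_len; j++)
            yi += row[j] * x[j];               /* sequential access to x */
        y[i] = yi;
    }
}
```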
Memory intensity (no temporal locality in the matrix)
The intense memory requirements and the lack of temporal locality are two strongly related issues. In SpMxV each matrix element participates in only two FP operations. This increases the ratio of memory accesses to FP operations and significantly affects SpMxV's performance. Thus, the performance of the kernel is not determined by the processor speed, but by the ability of the memory subsystem to provide data to the CPU [6]. In order to further illuminate this feature of the kernel, we performed a simple, comparative set of experiments: we used 32-bit instead of 64-bit integers for the col_ind structure, in order to reduce the total size of the working set. Woodcrest exhibited a 1.20 speedup, Netburst 1.29 and Opteron 1.17. It is quite impressive that a 22.4% decrease of the ws led to such a significant increase in performance. Based on these results, we can conclude that the improvement observed in the case of indirect memory accesses (see Fig. 6) is mainly due to the reduction of the ws. The same applies to the benefits of the BCSR format [3, 8, 19], which uses one index for each dense sub-block, thus saving valuable memory space in col_ind.
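A rough, back-of-the-envelope sketch of why this gain is plausible: switching col_ind from 64-bit to 32-bit integers saves 4 bytes per non-zero element, so with the ws formula given earlier,

```latex
\frac{\Delta ws}{ws}
  = \frac{4\,nnz}{(nnz \times 2 + nrows \times 2 + ncols) \times 8}
  \approx \frac{4\,nnz}{16\,nnz} = 25\% \qquad (nnz \gg nrows,\ ncols),
```

an upper bound consistent with the 22.4% average reduction observed across the suite, where the row_ptr, x and y terms are not negligible.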
Furthermore, the kernel exhibits no reuse in the data structures a, col_ind and row_ptr that represent the matrix. The lack of temporal locality is traditionally believed to affect performance. However, as seen in Fig. 1, all the aforementioned structures are accessed in a very regular, streaming pattern with unit stride. The hardware prefetcher is able to detect these simple access patterns and transparently fetch the corresponding cache lines from memory (see Section 4.1 for experimental information on the effect of hardware prefetching). Thus, it is quite safe to conclude that the lack of temporal locality in the matrix causes an insignificant number of cache misses, and therefore performance is not affected by this particular factor.
Conclusions on the experiments and optimization guidelines
Based on the experimental evaluation of the previous sections, a number of interesting conclusions can be drawn. First, the performance of the kernel is greatly affected by the matrix working set. As shown in Fig. 3, matrices with working sets that fit entirely in the L2 cache exhibit significantly higher performance. However, since these matrices correspond to small problems, their optimization is of limited importance, and thus we focus on matrices with large working sets that do not fit in the L2 cache. In addition, reducing the ws for the same problem releases memory bus resources and leads to significant execution speedups. The memory intensity of the algorithm, together with the effect of the indirect memory references needed to access x, are the most crucial factors for the poor performance of SpMxV and affect all matrices. On the other hand, the irregularity in the access of x and the existence of many short rows affect performance to a smaller extent and concern a rather limited subset of the matrices. Finally, the lack of temporal locality in the matrix structures does not affect performance through issues that could be optimized (e.g. cache misses), but inherently increases the number of memory accesses.
In an attempt to quantify the effect of each of the aforementioned issues, we performed a statistical analysis of our results, summarized in Fig. 7, where several bars are included for each architecture. The first three bars represent the performance of a matrix-vector multiplication for a dense matrix (1024 × 1024) stored in dense format (dmv) and stored in CSR format for the standard case (csr-dense) and the noind-colind benchmark (csr-dense-noind-colind). The csr-avg-nosr-reg bar represents the average performance across all matrices in the suite with working sets larger than the L2 size, while the rest of the bars correspond to all possible subsets of these matrices based on their regularity (-irregular/-regular) and on whether they are dominated by short rows or not (-sr/-nosr). The criterion for irregularity is a significant speedup (> 10%) with the noxmiss benchmark, while the criterion for the dominance of short rows is a large percentage (> 80%) of short rows (< 8 elements). Note that all matrices involved in this graph have working sets larger than the L2 size. The numbers over the bars indicate the number of matrices that belong to the particular subset. Note, for example, that there exist very few matrices that are dominated by short rows and do not face performance degradation due to irregularity. This observation further supports our assumption that short rows increase the possibility of irregular accesses to x.
The most important observation from the figure is that one can identify three levels of performance: the level determined by DMxV, the average performance level, and the lowest level determined by "bad" matrices with irregularity and a dominating number of short rows. Roughly speaking, the dramatic performance degradation (a slowdown by a factor of about 2) between DMxV and the average level is due to the indirect references through col_ind. From that level, if a matrix exhibits poor characteristics such as irregularity and many short rows, performance may further drop by a factor of about 1.35. On the other hand, if a matrix is not dominated by short rows and accesses x in a regular manner, its performance may exhibit a speedup of about 1.1 over the average and reach that of dense matrices stored in CSR. Note also that the majority of the matrices fall into this last category. Thus, our experimental results lead us to the following guidelines for optimization:
1. Reduce the ws size by using the smallest possible data types (e.g. 32-bit or 16-bit integers for col_ind, single precision storage for x) in order to reduce the pressure on the memory subsystem. Even sacrificing CPU cycles to reduce the ws size (e.g. by applying compression) can lead to performance improvement (as in [21]).
2. Try to group the non-zero elements into dense blocks (e.g. BCSR format as in [3, 8, 19]); see the sketch after this list.
3. Pad sparingly. Adding many non-zero elements in order to apply an optimization may dramatically affect performance, mainly due to the increase in the ws. The extra floating-point operations should not create a big problem, since the CPU has idle cycles to spare. Thus, the BCSR format used in [3, 8, 19] is expected to be beneficial only for the subset of matrices that contain many dense sub-blocks.
4. Beware of short row lengths and loop overheads. Some optimization approaches split the matrix into a sum of submatrices (as in [1, 19]). In this case one should take care that the submatrices do not fall into the category of matrices with short row lengths. Alternatively, one may insert an additional outer loop in the multiplication kernel (as in [13]). This may also incur significant overheads, especially for matrices with short rows.
5. Identify matrices with problematic access on the x vector and apply cache reuse optimizations only to them.
6. There is no need to apply software prefetching to attack the problem of the lack of temporal locality as long as the CPU supports hardware prefetching.
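To illustrate guidelines 2 and 3, the following is a minimal sketch of an SpMxV kernel over fixed 2 × 2 BCSR blocks; it is not the specific implementation of [3, 8, 19], and the array names (ba, bcol_ind, brow_ptr) and block layout are assumptions. One column index is stored per block, so indexing overhead shrinks, but blocks that are not fully dense must be padded with explicit zeros, which is why padding should be applied sparingly.

```c
/* y = A*x for a matrix stored in 2x2 BCSR: ba[] holds the blocks (row-major
 * within each block), bcol_ind[] holds one column index per block, and
 * brow_ptr[] holds nbrows + 1 entries delimiting the block rows. */
void spmv_bcsr_2x2(const double *ba, const long *bcol_ind, const long *brow_ptr,
                   const double *x, double *y, long nbrows)
{
    for (long bi = 0; bi < nbrows; bi++) {
        double y0 = 0.0, y1 = 0.0;
        for (long b = brow_ptr[bi]; b < brow_ptr[bi + 1]; b++) {
            const double *blk = ba + 4 * b;   /* the 2x2 block */
            long col = bcol_ind[b];           /* one index per block */
            double x0 = x[col], x1 = x[col + 1];
            y0 += blk[0] * x0 + blk[1] * x1;
            y1 += blk[2] * x0 + blk[3] * x1;
        }
        y[2 * bi]     = y0;
        y[2 * bi + 1] = y1;
    }
}
```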
Conclusions - Future work
In this paper we presented extensive experimental results regarding the performance issues of sparse matrix-vector multiplication on modern microprocessor architectures. Our results illuminate and quantify the effect of the reported problems on the kernel's performance and can aid in forming guidelines to optimize the code. For future work, we will apply the knowledge gained from this paper in order to optimize the kernel using a short-vectorization approach, which we believe will provide performance benefits from the reduction of the working set and of the indirect referencing, and from the utilization of vector memory loads and floating-point operations.
