The sparse matrix-vector product is an important computational kernel that runs ineffectively on many computers with super-scalar RISC processors. In this paper we analyse the performance of the sparse matrix-vector product with symmetric matrices originating from the FEM and describe techniques that lead to a fast implementation. It is shown how these optimisations can be incorporated into an efficient parallel implementation using messagepassing.
Performance analysis of the sparse matrix-vector product
In this paper we focus on large symmetric sparse matrices, that do not fit into the memory cache. While our matrices are stored in symmetric sparse skyline format (SSS), our ideas can be applied to general sparse matrices stored in other formats as well.
The SSS format [1] is an extension to the commonly used compressed sparse row format (CSR). For matrices in SSS format the strictly lower triangle is stored in CSR format (arrays ia, ja, va). The double precision array va of length n nz contains the non-zero values of the strictly lower triangle, stored row by row. The integer array ja of length n nz contains the column indices of the nonzero elements as stored in the array va. The integer array ia of length n + 1 contains pointers to the beginning of each row in the arrays va and ja. The diagonal is stored separately in a double precision array da of length n, where n is the order of the matrix and n nz is the number of non-zero values in the strictly lower triangle.
Algorithm 1 shows a matrix-vector multiplication code y := Ax for a matrix stored in SSS format. The accesses to the matrix data structure are in a stride-1 loop, the access pattern on x and y is irregular and depends on the sparsity structure of A. To compute an upper bound for the performance of the sparse matrix-vector product for a given architecture, profound knowledge of the design of the processor and the memory subsystem is required.
In [2] it is shown how the optimal out-of-cache performance of the daxpy() Routine can be computed for the HP PA-8500 architecture. For this computation one needs to know the cache line size, the cache miss penalty and how many outstanding memory requests the processor supports. To compute the optimal out-of-cache performance of the sparse matrix-vector product in this manner is much more difficult as its performance depends on the sparsity pattern of the matrix.
Our straight-forward approach is more portable and gives comparable results. We compute the ratio η of the number of floating point operations to the number of bytes of memory traffic. For the best case scenario where x and y are read only once from memory and then kept in-cache we get η ≈ 0.31 flops/byte for our test matrices described in Tab. 2. For this approximation we assume a very large write-back cache and double precision arithmetic. If we multiply η by the memory bandwidth of the system we get an upper bound of the performance of the sparse matrix-vector product.
We determine the memory bandwidth by benchmarking highly optimised computational kernels tuned for the given hardware. For this benchmark we use the vendor supplied BLAS daxpy and other routines that have a similar ratio of read to write operations as the sparse matrix-vector product. This gives more realistic results than using the peak memory bandwidth reported by the vendor.
From Tab. 1 one can observe that the measured performance of the sparse matrixvector multiplication code is far below the optimal performance, which is limited only by the memory bandwidth. The compiler is unable to generate efficient code, mainly because of data dependencies and irregular loops. Additional cache misses generated by accesses on x and y further degrade performance. On the other hand the upper bound computed for the IBM SP2 is unrealistic because its memory cache is too small (128 KBytes) to keep the vectors x and y in-cache.
For all numerical experiments published in this paper we used the standard vendor supplied C compilers with all (safe) optimisations turned on. On the Intel Linux PC we used the GNU C compiler for our benchmarks.
Design of a fast sparse matrix vector product for one processor
We applied three techniques to improve the implementation of Alg. 1.
Software Pipelining
By use of software pipelining (reorganising the source code in such a way that the processor pipelines are better filled) we are able to load data into registers earlier (data prefetching) and reduce data-dependencies in the innermost loop iteration. This increases the instruction level parallelism.
Most compilers are unable to perform these optimisations in a satisfactory way. Often compilers do not have the information necessary to move up load instructions safely. Due to the lack of type information, they have to generate code conservatively. Also most compilers have trouble unrolling more complex loops, e.g., revisiting Alg. 1, none of the vendor supplied compilers we used were able to unroll the inner loop over k.
Alg. 2 is an optimized version of Alg. 1. Here all data is loaded from memory one loop iteration before it is actually needed, such that the processor can better overlap computation and memory transfers. 
Register Blocking
We reduce the number of memory accesses by register blocking, i.e. splitting the matrix A into a sum of matrices A = A 1 + . . . + A m , consisting of small dense blocks of a fixed size [3] . When multiplying with such a matrix consisting of small dense blocks the code has to load fewer indices j because only one is needed per block. When multiplying with a dense block, elements of x and y can be loaded once and reused several times.
In our approach we store at least two matrices: one contains the small dense blocks of equal size and the other contains the remaining non-zero elements. In [4] another approach is presented: the authors store the whole matrix in small dense blocks, at the expense of having to store some zero entries explicitly.
To store the matrix A i of small dense blocks we use the same data structure as for the original matrix (SSS format), with the exception that we store a whole block for each coordinate pair (i, j) instead of just one value. We build this data structure using a linear-time greedy algorithm that scans the matrix row by row.
Register blocking can be implemented in several ways:
• Multiplying matrix-by-matrix: Multiply each matrix A i with vector x and sum the results.
• Multiplying row-by-row: Multiply with all A i at the same time, row-by-row.
When using this variant one can optionally store the nonzero elements in the same sequence as they are accessed, i. e. store dense blocks of different size in each row instead of storing the matrices A i separately.
We implemented all above variants. None of them was superior. The optimal routine has to be chosen depending on the matrix and the machine.
Matrix Reordering
We use Cuthill-McKee reordering [5] on the matrix to reduce its bandwidth. Because of the smaller bandwidth it is likely that during matrix-vector multiplication vector elements that are accessed in a particular matrix row will be accessed again in the following row. Thus matrix reordering can reduce cache misses that accesses to x and y generate [3] and more importantly also lowers the number of messages to be sent in the parallel implementation (see section 3). 
Numerical Experiments
For the numerical experiments we use the matrices listed in Tab. 2. These matrices originate from a FEM code used for the design of particle accelerator cavities, which essentially solves Maxwell's Equations in 3D [6] . The matrices have a large amount of small dense blocks, because three node variables are located at each grid point of the FEM mesh. The exact amount of dense blocks are listed in Fig. 2 .
The experiments are carried out on seven different machines as listed in Tab. 1. First we benchmark the three optimisations separately (Figs. 1-3 ), then we measure the best performance by applying all optimisations at once (Fig. 4 ). Fig. 2 . Performance of the codes that use register blocking. Fig. (a) shows the results for matrix cav1, Fig. (b) Fig. (a) shows the results for matrix cav1, Fig. (b) shows the results for matrix cav2. These experiments are carried out on a single processor. Fig. 2 shows the impact of the block size on the performance of the register-blocked code. For the matrix cav2 the maximal improvement is 69% on the Intel Linux PC. On the other platforms the improvement lies between 6% and 48%. As can be seen by the lighter colored bars in Fig. 2 the performance of the code can be substantially higher for matrices consisting solely of small dense blocks. Fig. 4 . Performance of the best code and the unoptimised code. Fig. (a) shows the results for matrix cav1, Fig. (b) type reorderings is more substantial [3] .
For each platform we choose the fastest code that takes all discussed optimisations into account and compare it with the corresponding unoptimised version. The results are shown in Fig. 4 . On the SP2 we achieve an overall improvement of 151%, on the Intel Linux PC we still get an improvement of 97%, while on the HP X-Class, Sun Enterprise Server, HP V-Class and DEC Alpha Workstation we get performance increases of 80%, 73%, 51% and 43%.
3 Parallelisation using message-passing
Parallel Implementation
For the parallel implementation we distribute the lower triangular part of the matrix A by block-rows (see Fig. 5 ). To balance the load we assign the same number of nonzeros to each processor. The distribution of the vectors x and y corresponds to the distribution of the matrix rows.
In a preprocessing step each processor collects the necessary information for the actual matrix-vector multiplication. This is done in the following way:
• The storage format of the matrix implies that processor i needs only elements of the local parts of the x-vector from processor j where j < i. For this purpose, the smallest block containing all the needed elements is determined. These informations are exchanged.
• During the actual matrix-vector multiplication processor i receives the same portion of the y-vector from processor j as it sends to the same processor of the x-vector. This is due to the symmetry of the matrix. As a consequence, no information has to be exchanged for the y-vector in the preprocessing step.
For the actual parallel matrix-vector code we implemented three slightly different routines:
(1) Without latency-hiding: Exchange parts of x-vector, then multiply with local part of matrix, then exchange parts of y-vector and form resulting vector. with local diagonal block of the matrix [7] . Upon arrival of the remote parts of the x-vector, multiply with the remaining local part of the matrix, then exchange parts of y-vector and form resulting vector.
Routine 1 is a reasonable choice for machines that do not support latency-hiding. Routine 2 has the disadvantage that it does not exploit the symmetry of the matrix well, since the matrix has to be read from memory twice. Routine 3 has the disadvantage that the diagonal block of the matrix has to be stored separately to make the implementation efficient. This has to be done in a preprocessing step.
The parallel algorithm benefits from the matrix reordering done for the optimisation of the serial code. As can be seen from Fig. 5 the number of messages is reduced because of the smaller matrix bandwidth. Fig. 5 a) shows the worst case: the last processor (P3) needs the local x-vector parts from all other processors for the multiplication. Whereas in Fig. 5 b) the same processor needs only the local parts of the neighbour processor. The results in Fig. 11 show how crucial the matrix reordering is.
Parallel Numerical Experiments
We carried out the parallel experiments on five platforms: the HP X-Class, the HP V-Class, the Intel Paragon, the Intel Pentium III Beowulf Cluster and the IBM SP2. The HP X-Class (HP Exemplar SPP2000/X-32) and the HP V-Class (HP Exemplar V2500) are both shared memory machines with 32 processors and a crossbarswitch interconnection network. The Intel Paragon has 128 processors with distributed memory arranged in a 2D-grid. The Intel Beowulf Cluster consists of 251 dual CPU Pentium III processors. These 251 computing nodes are grouped into frames of 24 nodes. An Ethernet network connects the computing nodes. The IBM SP2 is a distributed memory machine with 64 processors connected through a multistage network.
The software-pipelining optimisation described in section 2 was incorporated into the parallel version. Although it would be entirely possible to implement register blocking also for the parallel version, we did not do that, mainly because of the limited time available. Unless otherwise mentioned the matrices are reordered using the Cuthill-McKee algorithm. linear speedup for matrix cav1. For this matrix the speedup is 21 with 8 processors. The speedups for matrix cav2 are higher compared to the HP X-Class. The speedup for 8 processors is 6.7. Fig. 8 shows the measured speedups for the Intel Paragon. The code scales well, especially for the matrix cav2. Apart from the super-linear speedups on the HP X-Class and the HP V-Class, this machine gives the best speedups, because of its fast network compared to the performance of its processors. The results for the Intel Beowulf Cluster are depicted in Fig. 9 . For the numerical experiments we used 1 CPU per node. Up to 8 CPU's, the measured speedups for the matrix cav2 are comparable to the speedups on the HP V-Class. The irregularities above 16 processors show up because in this case some processors have to communicate with processors located in another frame. the measured speedups for the IBM SP2. The code does not scale as well as on the Intel Paragon, because the SP2 has much faster processors and the slower interconnection network. Fig. 11 shows the influence of the reordering on the performance. When the matrices are left in their original ordering the performance is unacceptably low. Even for a small number of processors, where the number of messages is low, the performance is worse. These results are conducted on the IBM SP2, but this behavior can also be observed on the other platforms. 
Conclusions
We computed an upper bound for the performance of the sparse matrix-vector product and showed that straight-forward implementations perform poorly. The three techniques we presented in this paper improved the performance by up to 151%. Our message-passing implementation also benefits from these optimisations and scales reasonably. We think that future work should go into automatic generation of sparse matrix-vector multiplication codes which are optimised to a given matrix and a given target architecture. This approach has been successfully applied to other applications, such as the FFT [8] and the dense BLAS [9] .
Acknowledgments
We would like to thank Peter Arbenz and Oscar Chinellato for improvements and corrections of a draft of this paper. We would also like to thank Rolf Strebel for explanatory discussions on the subject of software pipelining.
