Abstract. When solving a system of PDEs, discretized on 9-point stencils over a nonrectangular domain, the linear systems that arise will have matrices with an irregular block structure, In this paper we discuss the vectorization of the matrix-vector multiply and of the Incomplete LU factorization and backsolve for these types of matrices. The performance of the matrix-vector multiply is already optimal for a small number of grid points (one result per clock cycle). For the ILU factorization and backsolve the vector performance is not as satisfying, partly because the resulting vector length is generally small and partly because of the heavy use of indirect addressing. A comparison with the general-purpose routines from the SLAP library shows a significant gain in computational time.
Introduction
In previous work the use of an adaptive-grid finite-difference method to solve time-dependent two-dimensional systems of partial differential equations (PDEs) was examined [Trompert and Verwer 1991 , 1993a , 1993b Trompert et al. 1993; Verwer and Trompert 1992] . Among others, a research code was developed that uses an implicit time-stepping method. This poses the task of solving at each time step large systems of nonlinear equations. In work by Trompert and others [1993] a modified version of Newton's method, combined with a direct sparse matrix solver, was used to solve these systems. However, it was shown that Krylov-type iterative solvers combined with standard Incomplete LU preconditioning were much faster . In this paper we used the ILU preconditioner and the iterative solvers GMRES [Saad and Schultz 1986] and CGS [Sonneveld 1989 ] from the Sparse Linear Algebra Package (SLAP), and a CGS variant, BiCGStab [Vorst 1992] . SLAP is a public domain code written by Anne Greenbaum and Mark K. Seager, with contributions from several other authors, that is available from Netlib [Dongarra and Grosse 1987] .
Most of the iterative solvers are highly vectorizable (triads, dot products, etc.) . In the SLAP code, which is intended for general sparse matrices, the required matrix operations, however, are much less amenable to vectorization. Since no matrix structure is assumed, the ILU factorization is more of a searching process than a computational process. The ILU backsolve and the matrix-vector multiply do vectorize, but the resulting vector lengths are relatively small. On the other hand, ILU preconditioning, combined with Krylov-type methods, has proven itself very robust and efficient for our problems with respect to the convergence rate. In addition, later experiments with the iterative solver GMRESR [Vorst and Vuik 1991] showed that in various cases for this solver the expensive ILU preconditioning can be replaced by a simple diagonal scaling, although at the cost of many more iterations. In these cases the computational time will be dominated by the matrix-vector multiplications. We therefore decided to replace the matrix-vector multiply and the ILU preconditioning routines of the SLAP library with routines written especially for our type of problem: systems of PDEs discretized on a 9-point stencil and on a nonrectangular grid, by which we mean a grid that is bounded by an arbitrary right-angled polygon.
There are well-known techniques to vectorize the matrix-vector multiply and the ILU preconditioning process for a scalar PDE defined on a rectangular domain and discretized using a 5-or "/-point stencil. The objective of this paper is to report on the vectorization of the matrix operations that occur when solving a system of PDEs discretized on a 9-point stencil and on a nonrectangular domain. We will discuss vectorizable implementations for the matrix-vector multiplication and the standard Incomplete LU preconditioning that perform significantly better than the SLAP implementations meant for general matrices. Due to the irregular domain structure, we have to resort to indirect addressing in our implementations. The matrices are stored in block diagonals and the actual location of the blocks in the matrix is given by pointers. The performance of the vectorized matrix-vector multiplication is already optimal for a small number of grid points, that is, one result per clock cycle. For the ILU preconditioning, concurrency is obtained by reordering the computations using a variant of the hyperplane method [Ashcraft and Grimes 1988] . For these routines the megaflop rate is lower, partly because the resulting vector length is generally small and partly because of the heavy use of indirect addressing. The ideas behind the implementation, that is, the block-diagonal storage mode and the indirect variant of the hyperplane method, are both easily transferable to PDEs in three dimensions or discretization on other stencils, or both.
The computations were performed on a CRAY Y-MP but our findings will also hold on comparable vector computers (comparable with respect to nl/2 values and gather/scatter operations).
Preliminaries
Spatial discretization of two-space dimensional PDE systems containing at most secondorder derivatives frequently occurs on a 9-point stencil. If the domain is rectangular and the nodes are in natural order (i.e., stored along grid lines in, say, the x-direction), the Jacobian matrices arising from those systems are sparse and very regular. For a scalar PDE the matrix is then block tridiagonal and each block itself is a tridiagonal matrix. For systems, and if the components are stored alternately, each block itself is a block-tridiagonal matrix (see Figure 1 ). For such systems much research has been done on the vectorization of the matrix operations. For example, the matrix-vector multiply will be speeded up when the matrix is stored diagonal-wise instead of column-wise as in SLAP. For the incomplete factorization several vectorization possibilities are available; one of these, first published in [Ashcraft and Grimes 1988] for a scalar PDE and a 5-or 7-point stencil, reorganizes the order of the computations, although the resulting ILU factorization actually treats the nodes in natural order. This appears to have a favorable influence on the stability of the operation and on the degree to which the eigenspectrum of the matrix is approximated. In methods that reorder the nodes, like the multicolor methods (see, e.g., [Fujino and Doi 1991] ), one often has to find the balance between a good preconditioner and an optimal vector performance. In this paper we will discuss the vectorization of the matrix-vector multiplication and the ILU preconditioning of less regular matrices. First of all, if the boundary equations of the PDE contain at most first-order derivatives and are discretized using second-order one-sided differences, some irregularity arises in the matrix when compared with firstorder one-sided differences, as depicted in Figure 2 . These irregularities in the first and last row of each diagonal block and the extra block in the first and last block row make a diagonal matrix-vector multiply very inefficient due to the large number of unnecessary operations. Second, if the domain of the PDE is not rectangular, then the diagonal Nx x Nx blocks will be replaced by blocks that have as their dimension the row length of the relevant row of the grid, and the off-diagonal blocks will be rectangular instead of square (cf. Figure 3 ). Note that the structure of the matrix is still symmetric around the diagonal. However, in this case direct diagonal storage is no longer feasible, and it is necessary to use some kind of indirect addressing. With Dirichlet boundaries or, as in the case of locally uniform grid refinement, with internal boundaries consisting of interpolated coarse grid values, the structure of the matrix is no longer symmetric (see Figure 4) . 
Matrix-Vector Multiplication
The matrices we need to consider (cf. Figure 4 ) have in the case of a system of PDEs with NPDE components, different NPDE-block row structures for the internal points and for the specific boundary points (see Figure 5 ; the boundary points that are not displayed have analogous structures). In all cases the maximum number of blocks in a row is nine, but the placement and the distance from the block to the main diagonal vary. In this section we will discuss some storage modes, namely, the SLAP column format and various types of diagonal storage modes. The inner loop of a MATVEC operation will consist in all cases of one store, one multiplication, and one addition, but different storage modes can influence the number of loads, the length of the vector, and chaining possibilities.
,~rl~l,l,,,,Jl~, Figure 4 . Locally refined grid. The internal boundary points are marked by X; the internal grid points and nonDirichlet boundary points, by 0. Note that the domain is equivalent to that used in Figure 3 . The difference in the structure of the matrix is merely caused by the replacement of the physical boundary conditions by Dirichlet boundary conditions at internal boundary points. 
SLAP
The standard storage mode used in the SLAP library is the SLAP column format. In this format the nonzeros are stored counting down columns, except for the diagonal entry, which appears first in each column. One integer array holds the row index for each nonzero and a second one holds the offsets of each column. The matrix-vector multiply is then coded as follows: The inner loop will vectorize using gather/scatter operations, resulting in three loads (A, I ROWA, Y). The advantage of this storage mode is its flexibility. However, for a scalar PDE the vector length of the inner loop will be only nine.
Diagonal Storage
The usual way ~ increase vector length is ~ s~re the matrix diagonaH~ In the case of a rectangular domain and a second-order discretizafion of the boundary equations this means th~ one g~s 14 9 NPDE --3 vec~rs of length N = (Nx / Ny) 9 NPDE (the block diagonal is 5 9 NPDE because of the left and r~ht boundaries) and 2 9 NPDE --2 vec~rs of length N• = Nx " NPDE for the lower and upper boundaries (cf. F~ure 6). The matrix-vec~r multiply looks like the code given in Algorithm 1. The inner loops will vectorize, without gather/scatter operations, using three loads and having a much larger vector length than in SLAP mode, namely, approxirfiately N for the full diagonals and N• for the diagonals resulting from the second-order discretization of the upper and lower boundary equations. If the domain is not rectangular it is impossible to avoid indirect addressing. The most "direct" translation of the diagonal storage mode for the matrix with block rows (see Figure 5) is shown in Figure 6 . However, in contrast to the implementation on a rectangular domain, now only the inner 6 9 NPDE --1 diagonals will translate in a loop with direct addressing. The other ones need indirect addressing of the multiplicand vector, resulting in one extra load. However, this should not have much effect on the resulting CPU time or megaflop rate on the CRAY Y-MP.
In the above storage mode a total of 14 9 NPDE --3 diagonals of length N and 4 9 NPDE --2 of length N• are used and the maximum number of elements on a row is only 9" NPDE. One can also store the matrix such that a minimum number of diagonals (10 9 NPDE --1) results (see Figure 7 ). Assuming that I NDM contains pointers to the left blocks in Figure 7 and I NDP to the blocks right of the diagonal, the matrix-vector multiply will look like the code given in Algorithm 2:
Algorithm 2. BDIAG: Matrix-vector multiply for block-diagonal storage mode. Note that in both cases it is also possible to implement a second-order discretization of boundary equations with mixed derivatives, or any other discretization of boundary equations using a maximum of 9 points, as long as the total number of entries on a row is not larger than 9 9 NPDE. The occurrence of nonexisting entries in the last inner loop can be prevented by letting, whenever appropriate, the indices I NDM and I NDP point to I ROW and by storing zeros in A.
ILU
In this section we describe two different ways to perform the standard no-fillin Incomplete LU decomposition and the backsolve needed for the solution of a system. With standard no-fillin Incomplete LU decomposition we mean a factorization,
obtained by Gaussian elimination, omitting all resulting entries at places where the original matrix had zeros.
SLAP
The SLAP package is intended for general sparse matrices, and therefore the ILU factorization contains more sorting and searching processes than arithmetical computations. The factorization of A in LDU format in the SLAP library is as follows. First, A, being in SLAP column format, is copied into L (the lower triangle of A) in SLAP row format, and into D (the diagonal) and into U (the upper triangle of A) in SLAP column format. The standard incomplete LU decomposition is then performed, and finally the diagonal D is replaced by its inverse. The whole process, apart from the inverse of D, is not vectorized.
Assuming that the matrices L, D, and U are stored as described above, the solution of the system Ax = b is implemented as follows: All loops will vectorize quite well, but the second and fourth are hampered by the fact that the loop length will be approximately 4 9 NPDE and, possibly, by indirect addressing.
The Hyperplane Method
The structure of the matrix A, and consequently the no-fillin factorization, is of course dependent on the ordering of the nodes. So a node-reordering strategy with the aim of getting good vectorizing properties may affect the convergence of the iterative solver. For example, the multicolor method of Fujino and Doi [1991] , in which nodes with the same color are handled in the same sweep (compare the red-black ordering for a scalar 5-point stencil), results in good vectorizing properties for a small number of colors but requires a large number of colors for comparable convergence results. The no-fillin factorization corresponding to node ordering along grid lines and the alternating storage of the PDE components proved to be a robust preconditioner for our class of PDEs. Since in many applications a good preconditioner is necessary to get convergence anyhow, we decided to restrict ourselves to this preconditioner. In [Ashcraft and Grimes 1988] it is shown that the ILU factorization and backsolve can be vectorized by reordering the computations, not the nodes. It should be emphasized that the thus-obtained factorization and backsolve are, using exact arithmetic, equivalent to the recursive one. In their paper Ashcraft and Grimes perform a data dependence analysis for the factorization and backsolve of a matrix resulting from discretization on a rectangular domain of a scalar PDE using 5-or 7-point stencils. The resulting sets of nodes in which the computation can proceed simultaneously are in the case of a 5-point stencil grid diagonal and in the case of a 7-point stencil with entries on the upper left and lower right skew grid diagonals (knight's move in chess). It should be noted, however, that a 7-point stencil with entries on the upper right and lower left results, like a 5-point stencil, in ordinary grid diagonals.
where, assuming that L and U have the same sparsity structure as the lower, respectively, the upper triangle of A, L ij 9 U is given by the following:
Z ij " U : (0, ..., O, ZiJui_lj_l, ZiJui_lj, ziJui_lj+l, ZiJui_lj+2 0 ..... 0, LiJuij_2, LiJuij_l, LiJuij, ZiJuij+l, LiJuij+2, O, ..., O, LiJui+lj_2, LO'Ui+lj_l, LO'Ui+lj, LO'ui+lj+l, O, ..., 0) . Of these, the entries LiJui_lj+2, LiJuij_2, LiJuij+2, and LiJui+lj_2 do not correspond with entries in A and thus are ignored. The remaining lead to the set of equations given below:
The data dependence analysis for the factorization goes as follows: One considers the equation for one grid node, which corresponds to the entries in one row of the matrix, and analyzes from which entries these computations are dependent. Let the numbering of the grid be as in Figure 8 . Let the partitioning of a matrix A in columns, respectively rows, be given by 
Vi j
where the row indices ij are omitted. One can see that, like for the "/-point stencil with entries at the upper left and lower right, the dependence of the computations is as shown in Figure 8 . The analysis of the forward and backward solve is analogous and results in the dependences as also shown in Figure 8 . Note that the modification of ILU from Gustafsson [1978] results in the same dependence relation. In this modification the entries in LU that do not correspond with the entries in A, LiJui_lj+2, and so on, are added to the main diagonal multiplied by a factor 0 _< o~ < 1
For systems of PDEs the same can be done with submatrices of NPDE X NPDE instead of matrix elements. The element vff must then be replaced by a lower and an upper submatrix.
For a nonrectangular domain it would be very inefficient to order the computations along skew grid diagonals as in [Ashcraft and Grimes 1988] . For irregular domains the sets of nodes for which the computations can be performed concurrently have to be built up. The first set for the factorizafion and the forward solve, Ly = b, consists of the lower-left comer and all the Dirichlet boundary points. The other nodes are processed along horizontal grid lines and are placed in the set with a number that is one higher than the maximum set number of the dependence nodes. The sets of nodes for the backward solves, Ux = y, are obtained analogously, but starting from the upper right and proceeding backwards. Note that the sets of nodes for the forward and backward solve are, in general, different. This is in contrast with the method of Ashcraft and Grimes [1988] , in which the sets are the same, but are processed from the first to the last (for the forward solve) and from the last to the first (for the backward solve). The determination of these sets is an essentially sequential but simple process, which has to be done only once per grid.
The implementation of this hyperplane method for nonrectangular domains is as simple as for rectangular domains. We give the code for the scalar case.
Let the matrix A be stored in the array A ( N, -4 : 4), the lower diagonals in A (., -4 : -1 ), the main diagonal in A (., 0), and the upper diagonals in A (., 1 : 4). Let N I M (., -4 : -2) contain the column index of the lower three diagonals and NIP (., 2:4) the column index of the upper three diagonals. Finally, let IS(0) be the number of sets, and let S ( I S (m-1) +1 : I S (m)) contain the indices of the nodes in the m-th set (Sm) for which the computations can be done concurrently assuming that the computations for the nodes in $1 ..... m-1 are done. The code is given in Algorithm 3. First, we have a loop over the independent nodes S1 to store the inverted diagonal entries of U. The rest of the hyperplane sets can be handled in one loop since the pointers N I M are set to the node index itself if the corresponding lower diagonal does not exist.
Algorithm 3. ILU decomposition using the hyperplane method for nonrectangular domains.
Compute upper diagonals
All inner loops can be vectorized. For rectangular domains the loop length varies from 1, in the first and last loop, to min(Nx/2, Ny) in the middle loops. Note that for rectangular domains the algorithm described above is algebraically equivalent to the algorithm of Ashcraft and Grimes, although in the implementation direct addressing is replaced by pointers. For the backsolve we need to store two sets of nodes for which the computations can be performed concurrently, one for the forward solve and one for the backward solve. For the forward solve we use the arrays (Sk, I SL), the equivalent of the arrays (S, IS) of the factorization. For the backward solve we store the backward analogue of (S, I S) in two additional arrays (SU, I SU). The backsolve is then implemented as shown in Algorithm 4.
Algorithm 4. ILU backsolve using the hyperplane method for nonrectangular domains. The loop length will, of course, be the same. Note that it is important to have the indirect addressing as the first index since otherwise extra vector multiplications are needed just for the indexing of an array.
B(K) = (B(K) -A(K,1).B(K+I) -A(K,2).B(NIP(
K
Performance Measurements
In this section we compare the different MATVEC and ILU implementations. The timings were done on one processor of a CRAY Y-MP, which has a clock cycle time of 6 ns. This gives a theoretical peak performance on one processor of 167 megaflops and 333 megaflops when chaining an add and a multiply. Since during one cycle time one store and two loads can be performed, indirect addressing of (one of) the vector operands of a triad will reduce the performance by at least a factor of two, with bank conflicts left out of consideration. Each processor of the CRAY Y-MP addresses 32 memory banks, or subsections, so if two consecutive memory accesses have a stride of 32, 16, or 8, memory access conflicts will occur. With random addressing 32 memory subsections will generally be sufficient to avoid significant performance degradation caused by memory conflicts. To measure the megaflop rate and the CPU time of a routine we used the Cray utility Perftrace [Cray Research 1991] , which gives the hardware performance by program unit. We first discuss the megaflop rate, for increasing order of the matrix, of the vectorized implementations of the MATVEC operation and the ILU factorization and backsolve, using both direct and indirect addressing. Since the performance of the SLAP implementations is independent of the order of the matrix, we did not include the SLAP routines in this subsection. Next, we compare the general-purpose SLAP library codes with our routines on matrices that typically occur in adaptive grid environments.
Megaflop Rates
To get an idea of the actual performance of our routines we called each routine 200 times with matrices arising from square grids with Nx ranging from 20 to 200, both for scalar PDEs and for a system with two components. For these matrices both the direct and the indirect addressing variants can be used.
5.LL Performance of Matrix-Vector Multiplication.
For the matrix-vector multiply we tested the variant using fixed strides, called DIAG (Algorithm 1), and the indirect-addressing variant BDIAG (Algorithm 2). DIAG needs three (independent) loads for one multiply and an add. So unless the stride N• in the last two loops causes bank conflicts one can expect to reach the peak performance of 167 megaflops on the CRAY Y-MP, although at the cost of some superfluous computations (multiplication with zero elements). For the indirectly addressed loops BDIAG needs seven loads (two of which are dependent) for two multiplications and two adds, which again gives, apart from bank conflicts, an optimal peak performance of 167 megaflops. Note that only the latter can be used for irregular domains.
Thus for both versions the maximum performance is 1 flop per cycle. The Perftrace report in Table 1 shows that this peak performance is almost achieved, even for the indirect addressing mode. The latter is even faster, since in the block-diagonal storage mode no superfluous computations are done, although at the expense of more storage. In contrast with the matrix-vector multiply we find here a significant difference between the versions using direct addressing and indirect addressing. This difference was not foreseen. For instance, if we inspect the forward solve, Ly = b in Algorithm 4, we see that a typical loop needs 13 loads for 9 flops. Inspection of the assembly code shows that the loads are arranged such that bank conflicts will not occur and a total of 9 cycles is needed, provided that chaining indeed occurs and is not prevented by an overly large number of intervening instructions. However, the Cray Private Systems Programmers Reference Manual gives as one of the hold-issue conditions for the gather/scatter instructions a gather/ scatter instruction in progress, which means that one can have only 1 gathered load or scattered store at a time. Since 12 of the 13 loads and the store are indirect-addressed, this means that the forward solve loop, Ly = b, takes at least 13 clock cycles for the 9 flops, which makes the performance measurements in Table 3 understandable. It may be possible to improve the performance of the ILU factorization and the forward solve loop by reordering the rows of the matrix and of the right-hand side in advance, which results in the following loop for the forward solve:
Indeed, measurements show that this loop is about 1.75 times faster than the original loop. One problem, however, is that the backward solve needs a different ordering of the matrix rows, thereby introducing in the backsolve procedure either a reordering after the forward solve or extra indirect addressing in the backward solve. In the experiments that follow we will not consider this adaptation of the ILU preconditioning.
Comparison with SLAP
We now compare the SLAP routines with our implementations of the matrix-vector multiplication and the ILU preconditioning process on matrices resulting from systems of PDEs with 1, 2, and 3 components, discretized on three levels generated by the adaptive grid code. At internal nodes entries are generated assuming 9-point stencils, even if the PDE does not contain mixed derivatives. The discretization used is a second-order central difference scheme on the interior of the domain and a second-order one-sided difference formula for the boundary equations. Since the latter does not fit into the matrix structure needed for the vectorized implementation of the ILU factorization and backsolve, we replaced it, for the ILU timings, by a first-order discretization. We assume that this will have little influence on the convergence rate of the iterative process that solves the linear system. Timings are done on a uniform grid of 81 x 81 nodes and on the grids generated by the adaptive grid code starting on a uniform 21 x 21 grid and using three levels.
5.2.L Test Matrices.
The first set of matrices comes from a scalar PDE example used in [Trompert and Verwer 1991] . The corresponding grids generated for three levels are shown in Figure 9 . The number of nodes in these grids are 441, 1013, and 2421. The second test set originates from a system of two PDEs . The corresponding grids at the three levels are shown in Figure 10 and contain 441, 797, and 719 nodes. From the PDE system with three components as used in we generated threelevel grids at two different times, T1 and T2 (cf. Figure 11) , to time for different numbers of nodes and different grid connectivities. For the first, the number of nodes is 441, 462, Figure 9 . Grids for Problem I, NPDE = 1. and 966, and for the second, 441, 929, and 2323. If a grid at a level contains disjunct subgrids, one matrix is generated, but, of course, the set of Dirichlet points at the grid interfaces allows concurrency in the computations performed with the hyperplane method.
Performance of Matrix-Vector Multiplication.
In this subsection we compare the performance of the matrix-vector multiply routine SSMV (from SLAP) with our diagonally stored implementations DIAG (direct addressing, Algorithm 1) and BDIAG (indirectly addressed variant, Algorithm 2). For the performance of SSMV only the number of components, NPDE, is important; for the diagonally stored implementations DIAG and BDIAG, the number of nodes and NPDE. Since the grid configuration is in neither case relevant, it is sufficient to discuss for each test problem only the configuration with the smallest (441) and largest (6561) number of nodes, corresponding with the uniform 21 x 21 grid and the uniform 81 x 81 grid. The results are given in Table 4 . As expected, the SLAP routine does not attain a high megaflop rate since the vector length is just 9 9 NPDE. The two diagonal storage modes result in an almost optimal megaflop rate of 167, as was mentioned in Section 5.1.1. Again, one is reminded that the DIAG implementation cannot be used for nonrectangular domains. The conclusion from these figures is that we can use the same matrix-vector multiply routine BDIAG for both the rectangular domains as well as those with irregular boundaries.
Performance of fLU Factorization and Backsolve. For the ILU factorization and
backsolve we compare the SLAP routines SSILUS and SSLUI2, respectively, with the implemented hyperplane ordering in the (indirectly addressed) version as given in Algorithms 3 and 4 (called HP_set in the tables). If possible, we also include the directly addressed variant, in which the loops are along skew grid diagonals (HP). For the multicomponent PDEs we also timed the unrolled version HP_set u (for HP_set) and HP u (for HP). As expected, the performance of the SLAP ILU factorization routine SSILUS is very poor since it contains more searching and sorting processes than arithmetic computations. For the backsolve the vector length is about 4 9 NPDE, resulting in better megaflop rates. To judge the results for the hyperplane ordering in Tables 5-10 , one should recall that the matrices generated are from relatively small test problems and so the ILU preconditioning for these problems will not attain a very good vector performance. Recall that the longest loop length for the 81 x 81 grids is 41 and for the 21 x 21 grids, 11. For the grids on levels 2 and 3 it ranges from 14 up to 26. As can be seen from the Perftrace reports the megaflop rate is indeed not very impressive, but the gain in CPU time in comparison with the SLAP code is large. For the factorization it ranges from a factor of 100 for the scalar In this paper we have discussed vectorizable implementations for the matrix-vector multiply, the standard Incomplete LU factorization and backsolve, for matrices that arise when solving systems of PDEs on a two-dimensional domain bounded by an arbitrary right-angled polygon and discretized in space using a 9-point stencil. The ideas behind these implementations can be used equally well with a three-dimensional domain or discretization on other stencils, or both. The implemented matrix-vector multiply uses a block-diagonal storage with indirect addressing. The performance of this routine for a small number of nodes is already almost optimal, that is, one result per clock cycle.
For the ILU factorization and backsolve, concurrency is obtained by reordering the computations using a variant of the hyperplane method [Ashcraft and Grimes 1988] . The performance of the standard hyperplane method reaches its optimum only for large systems, because the resulting vector length is half the number of nodes in the x-direction. The performance of the indirect addressing version has a megaflop rate that is smaller by a factor of 1.5. The reason for this unexpected performance drop is that the CRAY Y-MP does not allow the chaining of gather/scatter memory accesses. With the explicit reordering of the rows of the matrix and the elements of the right-hand-side vector, performance can be improved because the number of indirect memory accesses will be halved. One difficulty with explicit reordering is that the order of computations for the backward solve is independent from the order of the factorization and the forward solve. Therefore, explicit reordering will only speed up the factorization and the forward solve, but the backward solve will need even more indirect addressing. However, we can conclude that, even for the original, not explicitly reordered, version the gain in computational time compared with the generalpurpose routines of the SLAP library is significant, especially for the factorization.
