In this document, we are concerned with the effects of data layouts for nonsquare processor meshes on the implementation of common dense linear algebra kernels such as matrix-matrix multiplication, LU factorization, or eigenvalue solvers. In particular, we address ease of programming and tunability of the resulting software. We introduce a generalization of the torus wrap data layout that decouples the "local" and "global" views of the data layout. As a result, it allows for intuitive programming of linear algebra algorithms and for tuning of an algorithm to a particular mesh aspect ratio or to particular machine characteristics. This layout is as simple as the proposed HPF layout but, in our opinion, enhances ease of programming as well as ease of performance tuning. We emphasize that we do not advocate that all users need be concerned with these issues. We do, however, believe that for the foreseeable future "assembler coding" (as message-passing code is likely to be viewed from an HPF programmer's perspective) will be needed to deliver high performance for computationally intensive kernels. We therefore believe that the adoption of this approach would accelerate not only the generation of efficient linear algebra software libraries but also the adoption of HPF itself. We point out, however, that the adoption of this new layout would require an HPF compiler to ensure that data objects are operated on in a consistent fashion across subroutine and function calls.
Introduction
In this document, we present a generalization of the torus wrap data layout for processor arrays of arbitrary aspect ratio. This data layout is currently being investigated as part of the PRISM (Parallel Research on Invariant Subspace Methods) project and has been effectively utilized in the development of a highly efficient general code for matrix multiplication on the Intel Touchstone Delta [7]. We propose that this data layout, which we call virtual 2-D torus wrap, be considered for inclusion in the High Performance Fortran standard, because we believe it to have great potential in the design and development of dense linear algebra algorithms. This data layout is a natural extension of the virtual contiguous block data layout described in papers such as [8] and subsumes the block scattered decomposition (e.g., [4]) currently being suggested for the HPF standard.
Virtual 2-D torus wrap was motivated by our desire to exploit the advantageous features of block torus wrap (see, for example, [2, 6]), while at the same time maintaining the simplicity and cost effectiveness of the communication patterns found in many algorithms for square meshes. In particular, virtual 2-D torus wrap has the following desirable properties:
- Virtual 2-D torus wrap decouples the processor view from the physical mesh configuration, allowing the programmer to view a processor array of arbitrary aspect ratio as a square mesh of processors. This simplifies algorithm design, reduces coding and maintenance effort, and facilitates flexibility in multiprocessor utilization.

- Virtual 2-D torus wrap allows a variety of useful virtual-to-physical mappings of data, including contiguous blocking and the block scattered decomposition currently supported by the draft of the HPF standard. Code tuning can be achieved by simply varying this mapping, without affecting the "node" code.

- The mapping used in our work, like the block scattered decomposition, is advantageous for row- or column-oriented algorithms because it allows physical spreading of row or column blocks across processor rows or columns. At the same time, symmetric operations such as transposition and tridiagonalization are greatly simplified.

- Virtual 2-D torus wrap provides data layouts that maximize the granularity of local computations by ensuring proper maximal alignment in algorithms such as the distributed matrix multiplication algorithms found in [8, 7]. Hence, this layout can make full use of, for example, assembler-coded node BLAS or LAPACK routines.

In the remainder of this document, we first review torus wrap and the block scattered decomposition and then describe the virtual 2-D torus wrap data layout. We will use simple examples to describe each of the data layouts presented.

We now describe two generalizations of torus wrap for non-square processor meshes: the block scattered decomposition, which is currently suggested for inclusion in the HPF standard, and the virtual 2-D torus wrap, which we suggest as an alternative.
Block scattered decomposition. Panel the matrix A in both dimensions to obtain $r \times s$ blocks. We illustrate using a matrix having 12 panels (0, 1, ..., 11) in each dimension, i.e., A is a $12r \times 12s$ matrix. If we distribute the panels in torus wrap fashion, i.e., the $6 \times 6$ template in Figure 5 is replicated over the blocks of A, we have a virtual analog of torus wrap. The latter results in the matrix and processor points of view depicted in Figure 6, for the same example used in the block scattered decomposition. The matrix is arranged in memory so that the data corresponding to all virtual processors within a node is stored in a contiguous buffer. This allows the user to achieve maximal granularity in block algorithms. Notice that if $r = (\text{\# rows of } A)/6$ and $s = (\text{\# columns of } A)/6$, we have the contiguous block data layout [8].
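The torus wrap assignment of panels to the $6 \times 6$ template can be sketched in a few lines of Python. This is only an illustration of the example above (12 panels per dimension, a $6 \times 6$ virtual mesh); the function names `torus_wrap_owner` and `panels_owned` are hypothetical and do not come from any library or from the PRISM code.

```python
# Illustrative sketch only; names are hypothetical.
V = 6          # the virtual mesh is V x V (the 6 x 6 template)
N_PANELS = 12  # panels per dimension in the running example


def torus_wrap_owner(i, j, v=V):
    """Virtual processor (row, col) that owns panel (i, j) under torus wrap."""
    return (i % v, j % v)


def panels_owned(a, b, n_panels=N_PANELS, v=V):
    """All panels (i, j) held by virtual processor (a, b)."""
    return [(i, j)
            for i in range(a, n_panels, v)
            for j in range(b, n_panels, v)]


print(torus_wrap_owner(7, 10))         # (1, 4)
print(panels_owned(1, 4))              # [(1, 4), (1, 10), (7, 4), (7, 10)]
# With only 6 panels per dimension, i.e. r = (# rows)/6 and s = (# columns)/6,
# every virtual processor holds exactly one block: the contiguous block layout.
print(panels_owned(1, 4, n_panels=6))  # [(1, 4)]
```

The last call shows the special case noted at the end of the paragraph: when the panel size is chosen so that the template covers the matrix exactly once, each virtual processor owns a single contiguous block.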
An important point to realize is that successive panels in each dimension can be assigned to virtual processors arbitrarily. A very useful case occurs when panels are assigned with a "virtual panel spacing" in each dimension, $s_r$ and $s_c$. Panels are assigned consecutively so that panel $i$ in the row dimension is assigned to a row virtual processor given by an expression in $i$ and $s_r$ [...].

Some algorithms are row-oriented (column-oriented) and achieve better performance and load balancing when successive row (column) panels lie in different physical processors. This physical spreading of rows (columns) across physical rows (columns) of the mesh can be accommodated by choosing the virtual panel spacing in the row (column) dimension to be the number of virtual processors per node in that direction. In fact, the "local processor view" in Figures 6 and 7 is identical, even though the overall load balancing properties of the algorithms are likely to be different. We see three main advantages resulting from this property:
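The effect of the virtual panel spacing on physical spreading can be illustrated in one dimension. The sketch below rests on two assumptions that are not taken from the text: the spacing rule in `spaced_virtual_index` is one plausible rule in which consecutive panels land on virtual processors that are `s` apart, and `physical_index` assumes that consecutive virtual processors are grouped on the same node. Both function names are hypothetical.

```python
def spaced_virtual_index(i, v, s):
    """Virtual processor index for panel i in one dimension.

    v is the number of virtual processors in this dimension and s is the
    virtual panel spacing (s must divide v). ASSUMED rule, for illustration:
    consecutive panels go to virtual processors 0, s, 2s, ..., then 1, 1+s, ...
    """
    if v % s != 0:
        raise ValueError("spacing must divide the virtual mesh dimension")
    per_sweep = v // s  # panels placed before the starting offset advances
    return s * (i % per_sweep) + (i // per_sweep) % s


def physical_index(vp, per_node):
    """Physical node index, ASSUMING per_node consecutive virtual processors
    are stored together on each node."""
    return vp // per_node


v, per_node = 6, 2  # 6 virtual processors, 2 per node, hence 3 physical nodes
for s in (1, per_node):
    nodes = [physical_index(spaced_virtual_index(i, v, s), per_node)
             for i in range(12)]
    print(s, nodes)
# 1 [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]
# 2 [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
```

Under these assumptions, a spacing of one keeps successive panels on the same node, while a spacing equal to the number of virtual processors per node places successive panels on different physical processors.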
- Global data reorderings, or changes in block size, as motivated by considerations relating the behavior of an algorithm to the particular mesh aspect ratio, do not affect the "node code".

- Symmetric operations such as transposition and tridiagonalization are straightforward, since each virtual processor behaves as if it were a physical processor on a square mesh.

- Row and column indices are naturally aligned for matrix operands in algorithms such as the distributed matrix multiplication algorithms in [8, 7].

Notice that our first example of virtual 2-D torus wrap in Figure 6 has a virtual panel spacing equal to one; furthermore, if $s_r = 2$ and $s_c = 3$, or in general, $s_r = [\ldots]/q$ and $s_c = [\ldots]/p$, we obtain the block scattered decomposition (up to local storage differences).
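The claim that spacings of 2 and 3 reproduce the block scattered decomposition can be checked numerically for the 12-panel example. The sketch below is hedged in three ways: the spacing rule in `spaced_index` is an assumed rule, not the expression from the text; virtual processors are assumed to be grouped contiguously on nodes; and the $3 \times 2$ physical mesh is inferred from a spacing of 2 in the row dimension and 3 in the column dimension on a $6 \times 6$ virtual mesh. All names are hypothetical.

```python
from collections import Counter

V = 6                                # the virtual mesh is V x V
ROWS_PER_NODE, COLS_PER_NODE = 2, 3  # virtual processors per node (assumed)
N_PANELS = 12                        # panels per dimension in the example


def spaced_index(i, v, s):
    """ASSUMED spacing rule: consecutive panels go to virtual processors
    that are s apart, wrapping to the next offset after v // s panels."""
    return s * (i % (v // s)) + (i // (v // s)) % s


def virtual_owner(i, j, s_r, s_c):
    """Virtual processor (row, col) owning panel (i, j)."""
    return (spaced_index(i, V, s_r), spaced_index(j, V, s_c))


def node_owner(i, j, s_r, s_c):
    """Physical node owning panel (i, j), with contiguous grouping."""
    a, b = virtual_owner(i, j, s_r, s_c)
    return (a // ROWS_PER_NODE, b // COLS_PER_NODE)


def matches_block_scattered(s_r, s_c):
    """True if every panel lands on the node that a cyclic distribution
    over the 3 x 2 physical mesh would choose."""
    n_rows, n_cols = V // ROWS_PER_NODE, V // COLS_PER_NODE
    return all(node_owner(i, j, s_r, s_c) == (i % n_rows, j % n_cols)
               for i in range(N_PANELS) for j in range(N_PANELS))


def panels_per_virtual(s_r, s_c):
    """Number of panels held by each virtual processor."""
    return Counter(virtual_owner(i, j, s_r, s_c)
                   for i in range(N_PANELS) for j in range(N_PANELS))


print(matches_block_scattered(2, 3))  # True
print(matches_block_scattered(1, 1))  # False
print(panels_per_virtual(1, 1) == panels_per_virtual(2, 3))  # True
```

The last line illustrates the first advantage in the list: changing the spacing changes which panels a virtual processor holds, but not how many it holds or how the virtual processors are arranged on a node, so the node code sees the same local structure.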
The four special cases of virtual 2-D torus wrap discussed in this document are the only ones that currently appear to be useful, but further study is required. Note that the adoption of "spaced" 2-D torus wrap would require an HPF compiler to ensure that data objects laid out with different panel spacings are operated on in a consistent fashion across subroutine and function calls.
As we have seen, the extension of the HPF draft discussed in this document has several desirable attributes. Since "spaced" virtual 2-D torus wrap is a superset of the proposed HPF block scattered decomposition, adoption of this proposal would provide the functionality of all the special cases discussed above under a single general framework.
