The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability.
Introduction
High-performance dense linear algebra codes, whether sequential or parallel, rely on good spatial and temporal locality of reference for their performance. Matrix multiplication is an important kernel in linear algebraic algorithms, and has been enshrined in the dgemm routine in the BLAS 3 library [10] . There is an intimate relationship between the layout of the arrays in memory and the performance of the routine. On modern shared-memory multiprocessors with multi-level memory hierarchies, the column-major layout assumed in the BLAS 3 library can result in performance anomalies as the matrix size is varied smoothly. These anomalies result from unfavorable access patterns in the memory hierarchy that cause interference misses and false sharing and increase memory system overheads experienced by the code. In this paper, we investigate recursive array layouts accompanied by recursive control structures as a means of delivering high and robust performance for parallel dense linear algebra.
The use of quad-or oct-trees (or, in a dual interpretation, space-filling curves) is known in parallel computing [1, 22, 23, 37, 39, 44] for improving both load balance and locality. They have also been applied for bandwidth reduction in information theory [2] , for graphics applications [16, 30] , and for database applications [25] . The computations thus parallelized or restructured are reasonably coarse-grained, thus making the overheads of maintaining and accessing the data structures insignificant. A recent paper by Frens and Wise [13] champions the use of quad-trees to represent matrices and explores its use in matrix multiplication. While this paper presented some excellent ideas, we felt that the authors carried them to an extreme. Based on this previous work, we were left with several questions that we address in this paper.
Previous work using such layouts did not worry greatly about the overhead of address com-putations. The algorithms described in the literature [35] follow from the basic definitions and are not particularly optimized for performance. Are there fast algorithms, perhaps involving bit manipulation, for maintaining the "dope vectors" that would enable such data structures to be used for fine-grained parallelism? Or, even better, can the address computations be embedded implicitly in the control structure of the program?
Frens and Wise carried out their quad-tree layout of matrices down to the level of matrix elements, disregarding the result of Lam, Rothberg, and Wolf [29] that a tile that fits in cache and is contiguous in memory space can be organized in a canonical order without compromising locality of reference. This suggests the need for the quadtree decomposition to terminate well before the element level and to co-exist with tiles organized in a canonical manner. Could this interfacing of two layout functions be accomplished without increasing the cost of addressing? How does absolute performance relate to the choice of tile size?
Frens and Wise assumed that all matrices would be organized in quad-tree fashion, and therefore did not quantify the cost of converting to and from a canonical order at the routine interface. This is an optimistic assumption that is currently unrealistic. We feel that an honest accounting of costs needs to quantify the overhead of format conversion. Would the performance benefits of quad-tree data structures be lost in the cost of building them in the first place?
There are several variants of recursive orderings other than the U-ordering that Frens and Wise used. Some of these orderings, such as Gray-Morton [31] and Hilbert [20] , are supposedly better for load balancing, albeit at the expense of greater addressing overhead. How would these variants compare in terms of complexity vs. performance improvement for finegrained parallel computations?
Frens and Wise speculated about the "attractive hybrid composed of Strassen's recurrence and this one" [13, p. 215] . This is an interesting variation on the problem for two reasons.
First, Strassen's algorithm [41] achieves a lower operation count at the expense of more data accesses in patterns that have worse algorithmic locality of reference. Second, the control structure of Strassen's algorithm is not as simple as that of the standard recursive algorithm, making it trickier to work with quad-tree layouts. Could this combination be made to work, and how would it perform?
Our major contributions are as follows. First, we provide improved performance over that reported by Frens and Wise by stopping their recursive splitting well before the level of single elements. Second, we integrate recursive data layouts into Strassen's algorithm and provide some surprising performance results. Third, we test five different recursive layouts and characterize their relative performance. We provide efficient addressing routines for these layout functions that would be useful to implementors wishing to incorporate such layout functions into fine-grained parallel computations. Finally, as a side effect, we provide an evaluation of the strengths and weaknesses of the Cilk system [4] , which we used to parallelize our code.
The remainder of this paper is organized as follows. Section 2 introduces the algorithms for fast matrix multiplication that we discuss in this paper. Section 3 introduces recursive data layouts for multi-dimensional arrays. Section 4 describes the implementation issues that arose in weaving together the recursive data layouts with the recursive control structures of the algorithms.
Section 5 offers measurement results that support the claim that these layouts improve the overall performance. Section 6 compares our approach with previous related work. Section 7 presents conclusions and future work.
Algorithms for fast matrix multiplication
The computation we discuss in this paper is that of matrix multiplication. Let and be two Ò ¢ Ò matrices, where we initially assume that Ò ¾ . Let ¯ , where the symbolr epresents matrix multiplication.
Rather than formulate the matrix product in terms of row or column operations, we will proceed by quadrant or sub-matrix operations. Partition the two input matrices A and B and the result matrix C into quadrants as follows. 
The standard algorithm that performs Ç´Ò ¿ µ operations proceeds as shown in Figure 1 (a), performing eight recursive matrix multiplication calls and four matrix additions.
Strassen's original algorithm [41] uses algebraic identities to reduce the number of recursive matrix multiplication calls from eight to seven at the cost of 18 matrix additions/subtractions. This change reduces the operation count to Ç´Ò Ð µ. It proceeds as shown in Figure 1( Compared to Strassen's original algorithm, the noteworthy feature of Winograd's variant is its identification and reuse of common subexpressions. These shared computations are responsible for reducing the number of additions, but also contribute to worse locality of reference unless special attention is given to this aspect of the computation. We do not discuss in this paper numerical issues concerning the fast algorithms, as they are covered elsewhere [19] .
We use Cilk [4] to implement parallel versions of these algorithms. The parallelism is exposed in the recursive matrix multiplication calls. Each of the seven or eight calls are spawned in parallel, and these in turn invoke other recursive calls in parallel. Cilk supports this nested parallelism, providing a very clean implementation.
Interface
In order to stay consistent with previous work and to permit meaningful comparisons, all our implementations follow the same calling conventions as the dgemm subroutine in the Level 3 BLAS library [10] . Thus, the implementation computes « £ op´ µ¯op´ µ · ¬ £ , where « and ¬ are scalars, op´ µ is an Ñ ¢ double precision real matrix, op´ µ is a ¢ Ò double precision real matrix, is an Ñ ¢ Ò double precision real matrix, and op´ µ is either or
The matrices , , and are stored in column-major order, with leading dimensions ldA, ldB, and ldC respectively.
Recursive array layouts
Programming languages that support multi-dimensional arrays must also provide a function (the layout function Ä) to map the array index space into the linear memory address space. For ease of exposition, we assume a two-dimensional array with Ñ rows and Ò columns, which we index using a zero-based scheme. The results we discuss generalize to higher-dimensional arrays and other indexing schemes. We define Ä such that Ä´ µ is the memory location of the array element in row and column relative to the starting memory location of the array. We list near the end of the argument list of Ä, following a semicolon, any "structural" parameters (such as Ñ and Ò) of Ä, thus: Ä´ Ñ Òµ. In the shared-memory parallel environments in which we experimented, the elements of a quadrant of a matrix are spread out in shared memory, and a single shared memory block can contain elements from two quadrants, and thus be written by the two processors computing those quadrants. This leads to false sharing [9] .
In a message-passing parallel environment such as those used in implementations of High Performance Fortran [28] , typical array distributions would again spread a matrix quadrant over many processors, thereby increasing communication costs.
The dilation effect can compromise single-node memory system performance in the following ways: by reducing or even nullifying the effectiveness of multi-word cache lines; by reducing the effectiveness of translation lookaside buffers (TLBs) for large matrix sizes; and by causing cache misses due to self-interference even when a tiled loop repeatedly accesses a small array tile.
The above considerations have led various authors to investigate a class of array layout functions variously described as being based either on quadtrees [13] or on space-filling curves [20, 36, 38] . Instances of this family of layout functions are familiar in parallel computing under the names Morton ordering and Hilbert ordering, and have been used for load balancing purposes [1, 22, 23, 37, 39, 44] . They have also been applied for bandwidth reduction in information theory [2] , for graphics applications [16, 30] , and for database applications [25] .
Despite the dilation effect described above, canonical layout functions have one major advantage that is heavily exploited for efficiency in address computation. Simple algebraic manipulation of the defining formulas, as shown below in equations (2)- (3), reveals that these layouts allow incremental computation of memory locations of elements that are adjacent in array index space. (c) Z-Morton layout function Ä 000000000000000 000000000000000 111111111111111 111111111111111 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 000000000000000 111111111111111 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 
This idiom is understood by compilers and is one of the keys to high performance in libraries such as native BLAS. Unlike Frens and Wise, therefore, we will not carry the recursive layout down to the level of individual elements, but will terminate the recursion when we reach a Ø Ê ¢Ø submatrix that fits in cache. We will also describe these layouts in terms of space-filling curves rather than in terms of quad-trees. (The quad-tree is simply an aid to conceptualizing the layout, and none of its internal nodes need to be physically represented in memory. The space-filling curve makes this clear.)
Assume for the moment that we know how to choose Ø Ê and Ø , and that Ø Ê and Ø simultane- 
where Ë´ µ gives the position along the space-filling curve (i.e., the pre-image) of the element at rectangular co-ordinates´ µ.
We are, of course, using the term "space-filling curve" somewhat loosely. Technically, a spacefilling curve is a curve (i.e., a continuous map from the unit interval into the Ò-dimensional Euclidean space) that passes through every point of an Ò-dimensional region with positive Jordan content [38] . Following Peano's proof of the existence of such a curve [36] , Hilbert [20] gave a geometric generating procedure to construct a class of space-filling curves as the limit of a sequence of nested discrete approximations to it. It is precisely this geometric interpretation that we use in our recursive layout functions. In the terminology of Sagan [38, p. 21] , the ¾ ¢ ¾ elements of the matrix correspond to the nodal points of the Ø approximating polygon for the space-filling curve.
Equation (5) defines a family of layout functions parameterized by the function Ë characterizing the space-filling curve. All of these recursive layout functions have the following operational interpretation following from Hilbert's construction [20] . Divide the original matrix into four quadrants, and lay out these submatrices in memory in an order specified by Ä Ì . A Ê ¢ submatrix with Ê Ø Ê and Ø is laid out recursively using Ä Ì ; a Ø Ê ¢ Ø tile is laid out using Ä .
Space-filling curves are based on the idea of threading a region with self-similar line segments at multiple scales. The two fundamental operations involved are scaling and orienting (rotating) the line segments. We classify the five recursive layouts we consider in this paper into three classes based on the number of orientations needed. Three layouts (U-Morton, X-Morton, and Z-Morton) require a single orientation; one layout (Gray-Morton) requires two orientations; and one layout (Hilbert) requires four orientations. We now discuss the structure of these layouts and the computations involved in calculating their Ë functions.
We need the following notation to discuss the computational aspects of the recursive layouts.
For any non-negative integer , let ´ µ be the bit string corresponding to its standard binary encoding, and let ´ µ be the bit string corresponding to its Gray code [34] encoding. Correspondingly, for any bit string ×, let ½´× µ be the non-negative integer such that ´ µ ×, and let ½´× µ be the non-negative integer such that 
Recursive layouts with a single orientation
The three layouts U-Morton (Ä Í ), X-Morton (Ä ), and Z-Morton (Ä ), illustrated in Figure 3 The Ë functions for these layouts are easily computed with bit operations, as follows.
For Ä : Ë´ µ ½´ ´ µ º» ´ µµ
Recursive layouts with two orientations
The Gray-Morton layout [31] (Ä ) is based on a C-shaped line segment and its counterpart that is rotated by 180 degrees. 
Recursive layouts with four orientations
The Hilbert layout [20] (Ä À ) is based on a C-shaped line segment and its three counterparts rotated by 90, 180, and 270 degrees. Figure 3 (g) illustrates this layout. The Ë function for this layout is computationally more complex than any of the others. The fastest method we know of is based on an informal description by Bially [2] , which works by driving a finite state machine with pairs of bits from and , delivering two bits of Ë´ µ at each step.
Summary
We have described five recursive layout functions in terms of space-filling curves. These layouts grow in complexity from the ones based on Lebesgue's space-filling curve to the one based on
Hilbert's space-filling curve. We now state several facts of interest regarding the mathematical and computational properties of these layout functions.
¯It follows from the pigeonhole principle that only two of the four cardinal neighbors of´ µ can be adjacent to Ë´ µ. Thus, recursive layouts are prone to a dilation effect similar to that the canonical layouts experience. The important difference is that the dilation occurs at multiple scales. This dilation effect is seen in the abrupt jumps in the curves of Figure 3 . We note that these jumps get less pronounced as the number of orientations increases.
We do not know of a recursive layout with three orientations. There are, however, spacefilling curves appropriate for triangular or trapezoidal regions. We do not know whether such curves would be useful for laying out triangular matrices.
The Ä layout function has a useful symmetry that is easiest to appreciate visually. Refer to In terms of computational complexity of the Ë functions of the different layout functions, we observe that bits ¾Ù · ½ and ¾Ù of Ë´ µ depend only on bit Ù of and for the layouts with a single orientation, while they depend on bits through ½ for the Ä and Ä À layouts.
Implementation issues
Section 2 described the parallel recursive control structure of the matrix multiplication algorithms, and Section 3 described the recursive array layouts we wanted to combine with the algorithms.
This section links these two aspects together, addressing a number of implementation issues in the process.
A naive strategy
A naive but correct implementation strategy is to follow equation (5) and to replace every reference to array element ´ µ in the code with a call to the address computation routine for the appropriate recursive layout. However, this requires integer division and remainder operations to computé Ø Ø µ from´ µ, which imposes an unreasonably large overhead. To gain efficiency, we need to exploit the decoupling of the layout function Ä shown in equation (5): once we have located a tile and are accessing elements within it, we do not recompute the starting offset of the tile and instead use the incremental addressing techniques supported by the canonical layout Ä .
Integration of address computation into control structure
For the matrix multiplication algorithms of Section 2, we can further integrate the computation of the Ë function into the control structure in an incremental manner, as follows. Observe that the actual work of matrix multiplication happens on Ø Ê ¢Ø tiles when the recursion terminates. At each recursive step, we need to locate the quadrants of the current trio of matrices, perform pre-additions on quadrants, spawn the parallel recursive calls, and perform the post-additions on quadrants. The additions have no temporal locality and are ideally suited to streaming through the memory hi-erarchy, which is aided by the fact that recursive layouts keep quadrants contiguous in memory.
Therefore, all we need is the ability to quickly locate the starting points of the four quadrants of a (sub)matrix. This produces the correct Ë number of the tiles when the recursion terminates, which are then converted to memory addresses and passed to the leaf matrix multiplication routine.
For the Ä and Ä À layout functions, which have multiple orientations, we need to retain both the location and the orientation of quadrants. We encode orientation in one or two most significant bits of the integers.
Relaxing the constraint of equation (4)
The definitions of the recursive matrix layouts in Section 3 assumed that Ø Ê and Ø were constrained as described in equation (4). This assumption does not hold in general, and the conceptual way of fixing this problem is to pad the matrix to an Ñ ¼ ¢Ò ¼ matrix which satisfies equation (4) . There are two concrete ways to implement this padding process.
Frens and Wise keep a flag at internal nodes of their quad-tree representation to indicate empty or nearly full subtrees, which "directs the algebra around zeroes (as additive identities and multiplicative annihilators)" [13, p. 208 ].
Maintaining such flags makes their solution insensitive to the amount of padding, but requires maintaining the internal nodes of the quad-tree. This scheme is particularly useful for sparse matrices, where patches of zeros can occur in arbitrary portions of the matrices. Note that if one carries the quad-tree decomposition down to individual elements, then Ñ ¼ ¾Ñ and Ò ¼ ¾Ò in the worst case.
We choose the strategy of picking Ø Ê and Ø from an architecture-dependent range, explicitly inserting the zero padding, and blindly performing all the arithmetic on the zeros. We choose the range of acceptable tile sizes so that the tiles are neither too small (which would increase the overhead of recursive control) nor overflow the cache (which would result in capacity misses). Since tiles are contiguous, there are no self-interference misses. This makes the performance of the leaf-level matrix multiplications almost insensitive to the tile size [42] .
Our scheme is very sensitive to the amount of padding, since it performs redundant computations on the padded portions of the matrices. However, if we choose tile sizes from the range Ì Ñ Ò Ì Ñ Ü , the maximum ratio of pad to matrix size is ½ Ì Ñ Ò .
Our imposition of the range Ì Ñ Ò Ì Ñ Ü of tile sizes is guided by cache considerations, but causes a problem for rectangular matrices whose aspect ratio Ñ Ò is either too large or too small.
Let « Ì Ñ Ü Ì Ñ Ò , and call a matrix wide, squat, or lean depending on whether « Ñ Ò, ½ « Ñ Ò «, or ½ « Ñ Ò. For wide and lean matrices, it is not possible to find tile sizes that simultaneously satisfy the constraints of equation (4) and lie in the prescribed range of tile sizes. 2 This problem can easily be understood by considering the following case: Ñ ½¼¾ , Ò ¾ , Ì Ñ Ò ½ , and Ì Ñ Ü ¿ ¾ .
We overcome this limitation quite simply, by dividing the wide or lean matrix into squat submatrices, and reconstructing the matrix product in terms of the submatrix products. 
Conversion and transposition issues
In order to stay compatible with dgemm, we assume that all matrices are presented in columnmajor layout. Our implementation internally allocates additional storage and converts the matrices from column-major to the recursive layout. The remapping of the individual tiles is again amenable to parallel execution. We incorporate any matrix transposition operations required by the operation into this remapping step. This is handy, because it requires only a single routine for the core matrix multiplication algorithm. The alternative solution requires multiple code versions or indirection using pointers to handle these cases correctly.
Issues with pre-and post-additions
There is one final implementation issue arising from the interaction of the pre-and post-additions with the recursive layouts with more than one orientation. Consider, for example, the addition For Ä À , the situation is more complicated, because there is no simple pattern to the ordering of tiles. Instead, we simply keep global mapping arrays of these orders for the various orientations, and use these arrays to identify corresponding tiles in pre-and post-additions. The added cost in loop control appears to be insignificant.
Experimental results
Our experimental platform was a Sun Enterprise 3000 SMP with four 170 MHz UltraSPARC processors, and 384 MB of main memory, running SunOS 5.5.1. We used version 5.2.1 of the Cilk system [4] compiled with critical path tracking turned off. The Cilk system requires the use of gcc to compile the C files that its cilk2c compiler generates from Cilk source. We used the gcc-2.7.2 compiler with optimization level -O3. The experimental machine was otherwise idle. We also took multiple measurements of every data point to further reduce measurement uncertainty.
We timed the full cross-product of the three algorithms (standard, Strassen, Winograd) and the six layout functions (Ä , Ä Í , Ä , Ä , Ä , Ä À ) running on 1, 2, and 4 processors on square matrices with Ò ranging from 500 through 1500. We verified correctness of our codes by comparing their outputs with the output of vendor-supplied native version of dgemm, However, we could not perform the leaf-level multiplications in our codes by calling the vendor-supplied native version of dgemm, since we could not get Cilk to support such linkage of external libraries. Instead, we coded up a C version of a 6-loop tiled matrix multiplication routine with the innermost accumulation loop unrolled four-way. We report all our results in terms of execution time, rather than megaflop/s (which would not correctly account for the padding we introduce) or speedup (since the values are sensitive to the baseline).
In another set of experiments, we evaluated a sequential implementation in which we see a trade-off between memory and execution time. We also compare this sequential implementation with two state-of-the-art fast matrix multiplication implementations.
General comments
As predicted by theory, we observed the two fast algorithms consistently outperforming the standard algorithm. This is apparent from the different y-axis extents in the subgraphs of Figure 7 .
From the same figure, we observe virtually no difference between the execution times of the two fast algorithms. This suggests to us that the worse algorithmic locality of reference of Winograd's algorithm compared to Strassen's (see Figure 2 ) offsets its advantage of lower operation count.
We observed near-perfect scalability for all the codes, as evident from Figures 6 and 7. By running under a different version of Cilk in which critical path tracing was enabled, we determined that, for Ò ½¼¼¼, there is sufficient parallelism in the standard algorithm to keep about 40 processors busy; the corresponding number for the two fast algorithms is around 23. This is as expected, since the total work of the algorithms is Ç´Ò ¾·AE µ, while the critical path is Ç´Ð ¾ Òµ.
Choice of tile size
To back our claim that, for best performance, the recursive layouts should be stopped before the matrix element level, we timed a version of the standard algorithm with the Ä layout in which we explicitly controlled the tile size at which the recursive layout stopped. Figure 5 shows the single- Table 1 ]. The numbers for Ò ½ ¿ are 61.555 seconds for native dgemm, 96.1996 seconds for our best time, and a slowdown factor of 1.56.
Robustness of performance
To study the robustness of the performance of the various algorithms, we timed the standard and Strassen algorithms using the Ä and Ä layouts for Ò ¾ ½¼¼¼ ½¼ on 1-4 processors. Figure 6 shows the results, which are unlike what we had originally expected. The standard algorithm with Ä layout exhibits large performance swings which are totally reproducible. The Ä layout greatly Figure 7 shows the relative performance of the various layout functions at two problem sizes: Ò ½¼¼¼ and Ò ½¾¼¼. The figures reveal two major points. First, compared to the Ä layout, the effect of recursive layouts on the standard algorithm is dramatic, while their effect on the two fast algorithms is marginal. We offer our explanation of this effect in Section 5.6. Second, at least for these problem sizes, the performance of all the recursive layouts is approximately the same. We interpret this to mean that our implementation of the layouts is sufficiently efficient to control the addressing overheads even of Ä À . An alternate explanation is that the purported benefits of, say, Ä À over Ä , do not manifest themselves until we reach larger problem sizes. 
Relative performance of different layouts

A critique of Cilk
Overall, we were favorably impressed with the capabilities of the Cilk system [4] that we used to parallelize our code. For a research system, it was quite robust. The simplicity of the language extensions made it possible for us to parallelize our codes in an afternoon. The restriction of not being able to call Cilk functions from C functions, while sound in its motivation, was the one feature that we found annoying, for a simple reason: it required to annotate several intervening C functions in a call chain into Cilk functions, which appeared to us to be spurious. This problem is avoidable by using the library version of Cilk.
The intimate connections between Cilk and gcc, and the limitations on linking non-Cilk libraries, are serious barriers to getting performance out of the system. In order to quantify these losses, we compiled our serial C codes with three different sets of compile and link options: costs us a factor of 1.2-1.4, while the switch to gcc costs us a factor of 1.5-1.9. We find it interesting that the incremental loss in performance due to switching compilers is comparable to the loss in performance due to the non-availability of native BLAS. The single-processor Cilk running times are indistinguishable from the running times of version (iii) above, indicating an extremely 
Why the fast algorithms behave differently than the standard algorithm
Our explanation for the qualitative difference in the behavior of the fast algorithms compared to the standard algorithm is an algorithmic one. Observe that both the Strassen and Winograd algorithms perform pre-additions, which require the allocation of quadrant-sized temporary storage, while this is not the case with the standard algorithm. Therefore, when performing the leaf-level matrix products, the standard algorithm works with tiles of the original input matrices, which have leading dimension equal to Ò. In contrast, every level of recursion in the fast algorithms reduces the leading dimension by a factor of approximately two, even if we do not re-structure the matrix at the top level. This intrinsic feature of the fast algorithms makes them less sensitive to idiosyncrasies of the memory system. One implementation, referred to as DGEFMM in the rest of the paper, uses dynamic peeling.
This approach peels off the extra row or column at each level, and separately adds their contributions to the overall solution in a later fix-up computation [24] . This eliminates the need for extra padding, but reduces the portion of the matrix to which Strassen's algorithm applies, thus reducing the potential benefits of the recursive strategy. The fix-up computations are matrix-vector operations rather than matrix-matrix operations, which limits the amount of reuse and reduces performance.
The other implementation [11] , referred to as DGEMMW in the rest of the paper, uses dynamic overlap. This approach finesses the problem by subdividing the matrix into submatrices that (conceptually) overlap by one row or column, computing the results for the shared row or column in both subproblems, and ignoring one of the copies. This is an interesting solution, but it complicates the control structure and performs some extra computations.
We measure the execution time of the various implementations on a 500 MHz DEC Alpha Miata and a 300 MHz Sun Ultra 60. The Alpha machine has a 21164 processor with an 8KB direct-mapped level 1 cache, a 96KB 3-way associative level 2 cache, a 2MB direct-mapped level 3 cache, and 512MB of main memory. The Ultra has two UltraSPARC II processors, each with a 16 KB level 1 cache, a 2MB level 2 cache, and 512MB of main memory. We use only one processor on the Ultra 60.
We timed the execution of each implementation using the UNIX system call getrusage The results are quite different on the Ultra (see Figure 10) . The most striking difference is the performance of DGEMMW (see Figure 10b) , which outperforms both MODGEMM and DGEFMM for most matrix sizes on the Ultra. Another significant difference is that MODGEMM is generally faster than DGEFMM for large matrices (500 and larger), while DGEFMM is generally faster for small matrices.
Related work
We categorize related work into two categories: previous application of recursive array layout functions in scientific libraries and applications, and work in the parallel systems community related to language design and iteration space tiling for parallelism.
Scientific libraries and applications
Several projects emphasize the generation of self-tuning libraries for specific problems. We are aware of three such efforts: PHiPAC [3] , ATLAS [45] , and FFTW [14] . The PHiPAC project aims at producing highly tuned code for specific BLAS 3 [10] kernels such as matrix multiplication that are tiled for multiple levels of the memory hierarchy. Their approach to generating an efficient code is to explicitly search the space of possible programs, to test the performance of each candidate code by running it on the target machine, and selecting the code with highest performance. It appears that the code they generate is specialized not only for a specific memory architecture but also for a specific matrix size. The ATLAS project generates code for BLAS 3 routines based on the result that all of these routines can be implemented efficiently given a fast matrix multiplication routine. The FFTW project explores fast routines for one-and multi-dimensional fast Fourier transforms. None of these projects explicitly use data restructuring, although the FFTW project recognizes their importance.
Gustavson [18] and Toledo [43] investigate the issue of locality of reference in dense linear algebra, and recommend the use of recursive control structures as a means of improving locality. Stals and Rüde [40] investigate algorithmic restructuring techniques for improving the cache behavior of iterative methods. None of these authors investigate nonlinear data reorganization.
Frens and Wise [13] provide an implementation of recursive matrix multiplication. We adopted their idea of computation restructuring by recursion unfolding. They appear to carry the recursion down to the level of single array elements, which causes a dramatic loss of performance.
The goal of out-of-core algorithms [15, 32] is related to ours. However, the problems differ in two fundamental ways: the limited associativity of caches and their fixed replacement policies are not relevant for virtual memory systems; and the access latencies of disks are far greater than that of caches.
The application of space-filling curves is not new to parallel processing, although most of the applications of the techniques have been tailored to specific application domains [1, 22, 23, 37, 39, 44] . They have also been applied for bandwidth reduction in information theory [2] , for graphics applications [16, 30] , and for database applications [25] . Most of these applications have far coarser granularity than our test codes. We have shown that the overheads of these layouts can be reduced enough to make them useful for fine-grained computations.
Parallel languages and compilers
The parallel compiler literature contains much work on iteration space tiling for gaining parallelism [47] and improving cache performance [5, 46] . Carter et al. [6] discuss hierarchical tiling schemes for a hierarchical shared memory model. Lam, Rothberg, and Wolf [29] discuss the importance of cache optimizations for blocked algorithms. A major conclusion of their paper was that "it is beneficial to copy non-contiguous reused data into consecutive locations". Our recursive data layouts can be viewed as an early binding version of this recommendation, where the copying is done possibly as early as compile time.
The class of data-parallel languages exemplified by High Performance Fortran [28] recognizes the fact that co-location of data with processors is important for parallel performance, and provides user directives such as align and distribute to re-structure array storage into forms suitable for parallel computing. The recursive layout functions described in this paper can be fitted into this memory model using the mapped distribution supported in HPF 2.0. Hu et al.'s implementation [22] of a tree-structured AE-body simulation algorithm manually incorporated Ä within HPF in a similar manner. Support for the recursive layouts could be formally added to HPF without much trouble. The more critical question is how well the corresponding control structures (which are most naturally described using recursion and nested dynamic spawning of computations) would fit within the HPF framework.
A substantial body of work in the parallel computing literature deals with layout optimization of arrays. Representative work includes that of Mace [33] for vector machines; of various authors investigating automatic array alignment and distribution for distributed memory machines [7, 17, 26, 27] ; and of Cierniak and Li [8] for DSM environments. The last paper also recognizes the importance of joint control and data optimization.
Conclusions
We have examined the combination of five recursive layout functions with three parallel matrix multiplication algorithms. We have demonstrated that addressing using these layout functions can be accomplished cheaply, and that these address computations can be performed implicitly by embedding them in the control structure of the algorithms. We have shown that, to realize maximum performance, such recursive layouts need to co-exist with canonical layouts, and that this interfacing can be performed efficiently. We observed no significant performance variations among the different layout functions. Finally, we observed a fundamental qualitative difference between the standard algorithm and the fast ones in terms of the benefits of recursive layouts; we attribute this difference to the algorithmic feature of pre-additions.
