Two-dimensional arrays are generally arranged in memory in row-major order or column-major order. Traversing a row-major array in column-major order, or vice-versa, leads to poor spatial locality. With large arrays the performance loss can be a factor of 10 or more. This paper explores the Morton storage layout, which has substantial spatial locality whether traversed in row-major or column-major order.
Introduction
Two-dimensional arrays are generally arranged in memory in row-major order (for C, Pascal, etc.) or column-major order (for Fortran). Modern processors rely heavily on caches and prefetching, which work well when the access pattern matches the storage layout. Sophisticated programmers, or occasionally sophisticated compilers, match the loop structure to the language's storage layout in order to maximise spatial locality. Unsophisticated programmers do not, and the performance loss is often dramatic: a factor of 10 or more. In this paper we study the Morton storage layout (for background and history see [3, 20]). Morton layout is a compromise between row-major and column-major, with some spatial locality whether traversed in row-major or column-major order, but in neither case is spatial locality as high as the best case for row-major or column-major. Further, the way that array elements are stored requires fairly complicated address calculation. So, should language implementors still consider providing support for Morton layout for multidimensional arrays? In this paper, we explore and analyse this question and provide some qualified answers.
Perhaps controversially, we confine our attention to "naively" written codes, where a mismatch between access order and layout is reasonably likely. It is of course possible that the compiler might help by adjusting storage layouts or by interchanging loops; however, in the examples studied in this paper, we have seen no evidence of the compiler performing either transformation. Providing a clear performance programming model is an important aspect of research into compilation; the severe performance differences that can be observed depending on the traversal order of large two-dimensional arrays represent a failure to provide such a model. Even if production compilers did eliminate some loops accessing large arrays with poor stride by suitable transformations, the behaviour of the compiler with respect to this optimisation would have to become part of the performance model presented to the application programmer. In this paper, we carry out an extensive study of whether Morton order, when used as the default layout for large two-dimensional arrays, can deliver predictable performance, and we quantify the performance penalty that such a choice would incur when compared with an optimal choice of lexicographic layout.
Contributions of this paper
• We show that using lookup tables to calculate Morton layout addresses is remarkably effective. It compares well with the dilated arithmetic scheme proposed by Wise et al. [20] , and offers useful flexibility (Section 4.2).
• We evaluate the hypothesis that Morton layout, implemented using lookup tables, is a useful compromise between row-major and column-major layout. We present extensive experimental results using five simple numerical kernels, running on five different processors (Section 6).
• For each processor and each kernel, we calculate the slowdown, over a range of problem sizes, of using Morton layout compared with the better of the two lexicographic layouts (Section 6.2).
• We show that the effectiveness of Morton layout can often be significantly improved if the base address of the array is page-aligned (Section 5).
• We use the best available compilers for each of the five processors, using the compiler flags chosen by the vendors for their SPEC CFP2000 (base) reports [17] (see Table III in Section 6).
• We present an extensive and systematic study using random problem sizes. This shows a number of interesting effects, and Morton layout appears less attractive. However, as we discuss at the end of the paper, further improvements to the performance of Morton layout are possible.
• We analyse the importance of aligning the base address of a Morton array to a cache line or page boundary.
Compiler techniques. Locality can be enhanced by restructuring loops to traverse the data in an appropriate order [13, 21]. Tiling can suffer disappointing performance due to associativity conflicts, which, in turn, can be avoided by copying the data accessed by the tile into contiguous memory [12]. Copying can be avoided by building the array in this layout in the first place. More generally, storage layout can be selected to match execution order [11]. While loop restructuring is limited by what the compiler can infer about the dependence structure of the loops, adjusting the storage layout is always valid. However, each array is generally traversed by more than one loop, which may impose conflicting layout constraints that can be resolved only with foreknowledge of program behaviour. Anderson, Amarasinghe and Lam [2] describe a compilation system for automatically parallelising sequential code for shared-memory multiprocessors which makes use of data layout transformations in order to reduce the adverse effects of false sharing of cache lines between processors. Cierniak and Li [4] present a unified algorithm for applying both iteration space and data layout transformations, which is shown to perform better than applying the two transformations separately. For a compiler to change the storage layout of an array in order to improve locality, it has to determine the impact of that transformation through the entire program. Cierniak and Li [5] discuss compiler algorithms for deciding whether changing the layout of an array is valid. Procedure cloning [6] can be used to disambiguate array accesses, making it easier to change the layout of individual arrays. Locally reshaped arrays are arrays that are accessed inside a procedure in a different layout from the calling context. O'Boyle and Knijnenburg [14] describe a static approach to calculating the effect of global data transformations, such as partitioning for parallelisation, on locally reshaped arrays, as well as loop transformations that can be used to undo locally adverse effects, such as poor stride, of such global data transformations.
Blocked and recursively-blocked array layout. Wise et al. [20] advocate Morton layout for multidimensional arrays, and present a prototype compiler that implements the dilated arithmetic address calculation scheme which we evaluate in Section 4. They found it hard to overcome the overheads of Morton address calculation, and achieve convincing results only with recursive formulations of the loop nests. Chatterjee et al. [3] study Morton layout and a blocked "4D" layout (explained in Section 3.3). They focus on tiled implementations, for which they find that the 4D layout achieves higher performance than Morton layout because the address calculation problem is easier, while much or all of the spatial locality is still exploited. Their work has similar goals to ours, but all their benchmark applications are tiled (or "shackled") for temporal locality; they show impressive performance, with the further advantage that performance is less sensitive to small changes in tile size and problem size, which can cause cache associativity conflicts with conventional layouts. In contrast, the goal of our work is to evaluate whether Morton layout can simplify the performance programming model for unsophisticated programmers, without relying on very powerful compiler technology.
Background
In this section, we briefly review various array mappings and their resulting spatial locality.
Lexicographic array storage
For an M × N two-dimensional array A, a mapping S(i, j) is needed, which gives the memory offset at which array element A_{i,j} will be stored. Conventional solutions are the row-major (for example in C and Pascal) and column-major (as used by Fortran) mappings

    S_rm(i, j) = N · i + j        and        S_cm(i, j) = i + M · j

respectively. We refer to row-major and column-major as lexicographic, i.e. elements are arranged by the sort order of the two indices (another term is "canonical").
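In C, each mapping amounts to a single multiply-add; the following is an illustrative sketch (the helper names are ours, not from the paper):

    #include <stddef.h>

    /* Memory offset of element (i, j) of an M x N array. */
    static size_t s_rm(size_t i, size_t j, size_t N) {
        return N * i + j;      /* row-major: rows are contiguous       */
    }
    static size_t s_cm(size_t i, size_t j, size_t M) {
        return i + M * j;      /* column-major: columns are contiguous */
    }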
Opaque array storage: array descriptors
In more modern languages, such as Fortran 90 (and, notably, in earlier designs such as Algol 68 and APL), arrays are represented by a descriptor which provides run-time information on how the address calculation should be done [9]. This is needed to support multidimensional array slicing: the array descriptor hides the actual array representation, and allows the implementor freedom to select storage layout at will. Using a descriptor allows a single fragment of source code to operate on arrays whose layout varies from call to call, a form of "shape" polymorphism [10]. This raises performance problems, since the storage layout is not known at compile time: the stride of successive memory accesses depends on how a function is called. For optimal performance, different variants of each function need to be generated for each combination of array operand layouts, and there may be many distinct combinations requiring distinct code variants. The variants can be selected by run-time dispatch; more aggressively, the appropriate procedure "clone" can be called according to call-site context [6].
Blocked array storage
How can we reduce the number of code variants needed to achieve high performance? An attractive strategy is to choose a storage layout which offers a compromise between row-major and column-major. For example, we could break the M × N array into small, P × Q row-major subarrays, arranged as an M/P × N/Q row-major array. We define the blocked row-major mapping function (this is the 4D layout discussed in [3]) as:

    S_brm(i, j) = (P · Q) · S_rm^{M/P, N/Q}(i div P, j div Q) + S_rm^{P, Q}(i mod P, j mod Q)
For example, consider 16-word cache blocks and P = Q = 4, as illustrated in Figure 1. Each block holds a P × Q = 16-word subarray. In row-major traversal, the four iterations (0, 0), (0, 1), (0, 2) and (0, 3) access locations on the same block. The remaining 12 locations on this block are not accessed until later iterations of the outer loop. Thus, for a large array, the expected cache hit rate is 75%, since each block has to be loaded four times to satisfy 16 accesses. The same hit rate results with column-major traversal, i.e. when the loop structure is "do j...do i" rather than the "do i...do j" loop of row-major traversal.

Figure 1. Blocked row-major ("4D") layout S_brm^{8,8}(i, j) with block-size parameters P = Q = 4. The diagram illustrates that with 16-word cache lines, illustrated by different shadings, the cache hit rate is 75% whether the array is traversed in row-major or column-major order.
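A sketch of the 4D mapping in C, assuming P divides M and Q divides N (the function name is ours):

    #include <stddef.h>

    /* Blocked row-major ("4D") offset of element (i, j) in an M x N
       array of P x Q row-major blocks, themselves arranged row-major. */
    static size_t s_brm(size_t i, size_t j, size_t N, size_t P, size_t Q) {
        size_t block  = (i / P) * (N / Q) + (j / Q);  /* which block         */
        size_t within = (i % P) * Q + (j % Q);        /* offset inside block */
        return block * (P * Q) + within;
    }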
Recursive blocking
The impact of virtual memory pages on blocked array storage. Modern computer systems rely on a TLB to cache address translations: a typical 64-entry data TLB with 8 KB pages has an effective span of 64 × 8 KB = 512 KB. Unfortunately, as illustrated in Figure 2, if a blocked row-major array is traversed in column-major order, only one subarray per page is usable. Thus, we find that the blocked row-major layout is still biased towards row-major traversal. We can overcome this by applying the blocking recursively.

Figure 2. Blocked row-major layout for a large array. If a large blocked row-major array is traversed in column-major order, only one subarray per page is usable. The diagram shows an array with rows of 2048 doubles, using the blocked row-major layout with 4 × 4 blocks. Each 8 KB page holds 1024 doubles, in 64 blocks. When traversed in row-major order, one fresh page is accessed every 256 accesses (a hit rate of 1 − 1/256 = 99.6%), but when traversed in column-major order, a fresh page is accessed every 4 accesses (a hit rate of 1 − 1/4 = 75%).
Bit-interleaving
Assume that for an M × N array, M = 2^m and N = 2^n. Write the array indices i and j in binary as

    B(i) = i_{m−1} i_{m−2} … i_1 i_0        and        B(j) = j_{n−1} j_{n−2} … j_1 j_0
respectively. From this point onwards, we restrict our analysis to square arrays (where M = N; we address non-square arrays in Section 4.2). Now the lexicographic mappings can be expressed as bit-concatenation (written "∥"):

    S_rm(i, j) = i_{n−1} … i_1 i_0 ∥ j_{n−1} … j_1 j_0
    S_cm(i, j) = j_{n−1} … j_1 j_0 ∥ i_{n−1} … i_1 i_0
If P = 2^p and Q = 2^q, the blocked row-major mapping is

    S_brm(i, j) = i_{n−1} … i_p ∥ j_{n−1} … j_q ∥ i_{p−1} … i_0 ∥ j_{q−1} … j_0
Now, choose P = Q = 2, and apply the blocking recursively:

    S_mz(i, j) = i_{n−1} j_{n−1} i_{n−2} j_{n−2} … i_1 j_1 i_0 j_0

This mapping is called the Morton Z-order, and is illustrated in Figure 3.
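For illustration, here is a direct (unoptimised) C rendering of this mapping; the function name is ours:

    #include <stdint.h>

    /* Morton Z-order offset of (i, j) in a 2^n x 2^n array (n < 16):
       bit b of j lands in position 2b, bit b of i in position 2b + 1. */
    static uint32_t s_mz(uint32_t i, uint32_t j, unsigned n) {
        uint32_t s = 0;
        for (unsigned b = 0; b < n; b++) {
            s |= ((j >> b) & 1u) << (2 * b);
            s |= ((i >> b) & 1u) << (2 * b + 1);
        }
        return s;
    }

Section 4 discusses how to avoid paying for this bit manipulation on every access.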
Morton-order layout is an unbiased compromise between row-major and column-major
The key property which motivates our study of Morton layout is the following:
• Given a cache with any even power-of-two block size, and an array mapped according to the Morton order mapping S_mz, the cache hit rate of a row-major traversal is the same as the cache hit rate of a column-major traversal.
• This applies given any cache hierarchy with even power-of-two block size at each level. This is illustrated in Figure 3 .
• The cache hit rate for a cache with block size 2^{2k} words is 1 − 2^{−k}: each 2^{2k}-word block holds a 2^k × 2^k Morton subarray, so a row-major traversal must load the block once for each of the 2^k rows it spans, i.e. 2^k misses per 2^{2k} accesses.
For cache blocks of 32 bytes (4 double words, k = 1) this gives a hit rate of 50%. For cache blocks of 128 bytes (16 double words, k = 2) the hit rate is 75% as illustrated earlier. For 8 KB pages (1024 words, k = 5), the hit rate is 96.875%. In Table I , we contrast these hit rates with the corresponding theoretical hit rates that would result from row-major and column-major layout. Notice that traversing the same array in column-major order would result in a swap of the row-major and column-major columns, but leave the hit rates for Morton layout unchanged. In Section 5, we show that this desirable property of Morton layout is conditional on choosing a suitable alignment for the base address of the array. In our conclusion in Section 7, we also point out that non-square cache blocks and pages lead to a more complicated picture.
Morton-order address calculation

Dilated arithmetic
Bit-interleaving is too complex to execute at every loop iteration. Wise et al. [20] explore an intriguing alternative: represent each loop control variable i as a "dilated" integer, where the bits of i are interleaved with zeroes. Define D_0 and D_1 such that

    B(D_0(i)) = 0 i_{n−1} 0 i_{n−2} … 0 i_1 0 i_0        and        B(D_1(i)) = i_{n−1} 0 i_{n−2} 0 … i_1 0 i_0 0

Now we can express the Morton address mapping as

    S_mz(i, j) = D_1(i) | D_0(j)

where "|" denotes bitwise-or. At each loop iteration we increment the loop control variable; this is fairly straightforward. Let "&" denote bitwise-and, and let Ones_0 and Ones_1 denote the constants whose set bits are exactly the even and the odd bit positions respectively. Then:

    D_0(i + 1) = ((D_0(i) | Ones_1) + 1) & Ones_0        (8)
    D_1(i + 1) = ((D_1(i) | Ones_0) + 1) & Ones_1        (9)

This is illustrated in Figure 4, which shows the ikj variant of matrix multiply. Note that the strength-reduction in equations 8 and 9 can only be used when an array is accessed in unit stride. Most current processors do not offer instruction-level support for dilated arithmetic; if Morton layout proves competitive in terms of spatial locality for application programs, this would be an interesting architectural feature to investigate.
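To make the scheme concrete, here is a minimal C sketch of dilated increments and a full Morton traversal. This is our own rendering under the definitions above, not code from the paper or from Wise et al.'s compiler; the mask values assume 32-bit dilated integers.

    #include <stdint.h>

    #define ONES_0 0x55555555u  /* ...0101: ones in the even bit positions */
    #define ONES_1 0xAAAAAAAAu  /* ...1010: ones in the odd bit positions  */

    /* Equations 8 and 9: setting the "gap" bits to 1 lets the carry from
       the +1 ripple across them; the final mask clears them again. */
    static uint32_t inc_d0(uint32_t d0) { return ((d0 | ONES_1) + 1) & ONES_0; }
    static uint32_t inc_d1(uint32_t d1) { return ((d1 | ONES_0) + 1) & ONES_1; }

    /* Sum a 2^n x 2^n Morton array in row-major order (n < 16).
       While i, j < 2^n, the dilated variables d1 = D_1(i) and
       d0 = D_0(j) stay below 2^(2n), so the bounds below are exact. */
    static double sum_morton(const double *A, unsigned n) {
        const uint32_t size = 1u << (2 * n);
        double s = 0.0;
        for (uint32_t d1 = 0; d1 < size; d1 = inc_d1(d1))       /* i loop */
            for (uint32_t d0 = 0; d0 < size; d0 = inc_d0(d0))   /* j loop */
                s += A[d1 | d0];            /* S_mz(i,j) = D_1(i) | D_0(j) */
        return s;
    }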
Morton-order address calculation using a lookup table
The dilated arithmetic approach works when the array is accessed using an induction variable which can be incremented using dilated addition. We found that a much simpler scheme often works nearly as well: we simply pre-compute a table for the two mappings D_0(i) and D_1(i); we illustrate this for the ikj variant of matrix multiply in Figure 5. Note that for programs with regular stride, the table accesses are very likely cache hits, as their range is small and the tables themselves are accessed in unit stride. One small but important detail: we use addition instead of logical "or". This may improve instruction selection, and it also allows the same loop to work on lexicographic layout using suitable tables. If the array is non-square, 2^n × 2^m with n < m, we construct the tables so that the j index is dilated only up to bit n. Figure 6 shows the performance of these two variants on a variety of computer systems: the dilated arithmetic implementation is almost always faster, but the difference is usually less than 20%. In the remainder of the paper, we use the table lookup scheme exclusively; we comment on this decision further in Section 7. With compiler support, many applications could benefit from the dilated arithmetic approach, leading in many cases to more positive conclusions.
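A sketch of the table scheme follows (our own helper names; the paper's Figure 5 shows its actual kernel). The tables are built once, and every element access becomes two loads and an add:

    #include <stdint.h>
    #include <stdlib.h>

    static uint32_t *MZ_i, *MZ_j;   /* MZ_i[i] = D_1(i), MZ_j[j] = D_0(j) */

    /* Pre-compute the dilation tables for indices 0..n-1 (n <= 2^16). */
    static void mz_build_tables(uint32_t n) {
        MZ_i = malloc(n * sizeof *MZ_i);
        MZ_j = malloc(n * sizeof *MZ_j);
        for (uint32_t x = 0; x < n; x++) {
            uint32_t d = 0;
            for (unsigned b = 0; b < 16; b++)
                d |= ((x >> b) & 1u) << (2 * b);  /* interleave with zeroes */
            MZ_j[x] = d;         /* bits in even positions: D_0 */
            MZ_i[x] = d << 1;    /* bits in odd positions:  D_1 */
        }
    }

    /* Element access; note addition rather than bitwise-or, as discussed
       above, so the same loop also works with lexicographic tables. */
    #define MZ(A, i, j) ((A)[MZ_i[(i)] + MZ_j[(j)]])

With suitable alternative tables (MZ_i[i] = N · i, MZ_j[j] = j), the identical loop body addresses a row-major array, which is the flexibility referred to above.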
Effect of Memory Alignment in Morton Layouts
With lexicographic layout, it is often important to pad the row (respectively column) length to avoid associativity conflicts [15] . With Morton layout, it turns out to be important to consider padding the base address of the array.
In our previous discussion of the cache hit rate resulting from Morton order arrays, we have implicitly assumed that the base address of the array is mapped to the start of a cache line. For a 32-byte cache line, i.e. a 2 × 2 block of double words, this means that the base address of the Morton array needs to be 32-byte aligned. As we illustrated in Section 3.6, such an allocation is unbiased towards any particular order of traversal. However, in Figure 7 we show that if the allocated array is offset from this "perfect" alignment, Morton layout may no longer be an unbiased compromise storage layout; furthermore, the actual average hit rates over the entire array can be significantly worse than with perfect alignment of the base address. In Figure 8, we consider the case where the size of a cache line does not match a square tile of array elements; this is the case, for example, with 64-byte cache lines and arrays of double-word floating-point numbers. As shown in the figure, the symmetry property of Morton order is then lost. It still appears, however, that perfect alignment of the base address of the Morton array (64-byte alignment in this case) leads to the best hit rates in both traversal orders. A similar effect is replicated at each level of the memory hierarchy.
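Page alignment of the base address can be requested at allocation time; the following sketch uses posix_memalign (the paper does not specify its actual allocation mechanism):

    #include <stdlib.h>
    #include <unistd.h>

    /* Allocate storage for a Morton array with a page-aligned base
       address, so that cache lines and pages hold complete, square
       Morton sub-blocks. */
    static double *alloc_morton(size_t n_elems) {
        void *p = NULL;
        size_t page = (size_t)sysconf(_SC_PAGESIZE);  /* e.g. 8 KB */
        if (posix_memalign(&p, page, n_elems * sizeof(double)) != 0)
            return NULL;   /* allocation failed */
        return (double *)p;
    }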
In our experimental evaluation, we have studied the impact on actual performance of the alignment of the base address of Morton arrays. For each architecture and each benchmark, we have measured the performance of Morton layout both when using the system's default alignment (i.e. addresses as returned by malloc()) and when aligning arrays to each significant size of memory hierarchy. The results, which are included in Figures 9-18 and discussed in more detail in the next section, broadly confirm our theoretical conclusion.
Experimental setup and experimental results
Benchmark kernels and architectures. To test experimentally our hypothesis that Morton layout, implemented using lookup tables, is a useful compromise between row-major and column-major layout, we have collected a suite of simple implementations of standard numerical kernels operating on two-dimensional arrays. The kernels are listed in Table II and the platforms in Table III.

Figure 6. Matrix multiply (ikj) performance in MFLOPs of dilated arithmetic Morton address calculation (see Figure 4) compared against the table-based Morton address calculation (see Figure 5). The graphs show performance in MFLOPs achieved by the implementations at each problem size on each system; details of the systems are given in Table III. On nearly all systems, the dilated-arithmetic implementation performs better than the table implementation, but the difference is usually less than 20%.

Figure 7. Underneath each diagram, we show the average theoretical hit rate for the entire Morton array for both row-major (RM) and column-major (CM) traversal. When an array is imperfectly aligned, in addition to losing the symmetry of the Morton layout, we also get worse spatial locality.
Table II. The benchmark kernels used in our study.

    MMijk     Matrix multiply, ijk loop nest order (usually poor due to large stride)
    MMikj     Matrix multiply, ikj loop nest order (usually best due to unit stride)
    Jacobi2D  Two-dimensional four-point stencil smoother
    ADI       Alternating-direction implicit kernel, ij,ij order
    Cholesky  K-variant (usually poor due to large stride)
Problem sizes. As mentioned in Section 2, our previous paper [19] reported performance results for power-of-two problem sizes. For this paper, we decided to carry out an exhaustive study, collecting performance data, where possible, for all problem sizes between 100 × 100 and 2048 × 2048. In some cases, the running time of the benchmarks meant that we were not yet able to collect data up to 2048 × 2048; in those cases, we report data up to 1024 × 1024, and we are continuing to collect measurements. In all cases, we used square arrays.
Experimental methodology. Most of the architectures we used for experiments were multi-user platforms. In the case of the x86 architectures (Pentium III, Pentium 4 and Athlon), we used clusters of identical teaching machines. The absence of a fully-controlled environment, and our desire to collect data for a full range of problem sizes (which implies running experiments for a very long time in total), led us to carefully design an experimental methodology aimed at minimising the impact of external interference on our results.
• During off-peak hours, we ran a script for collecting measurements on each available platform. In order to minimise the impact of any transient effects on particular ranges of experiments, the scripts are programmed to repeatedly make a random choice of benchmark kernel (from the list in Table II), array layout (i.e. row-major, column-major or Morton), alignment of the base address of the array (from a list of all significant sizes in the memory hierarchy, i.e. cache line lengths and page size) and problem size.
• Once a kernel, layout, alignment and problem size are chosen, the kernel is run once and the time recorded in a shared file structure using suitable locking.
• Over time, we accumulated a total of more than 23 million measurements. In our evaluation, we proceeded as follows: For each tuple of platform, experiment (kernel), layout, alignment and problem size, we gather all timing results obtained. Notice that due to the random choice of parameters, the number of samples for each point varies. We first use the Dixon Test [7] to eliminate up to one outlier. Following that, we calculate various statistical parameters, such as mean, standard deviation, median and 90% confidence intervals.
• The performance numbers we report in the following figures and tables are all based on the median of the measurements taken. The reason for this is that the median is less liable to interference from outliers than the mean [16]. Although we do not show these in the paper, we have calculated and plotted 90% confidence intervals for all data we report.

Performance results

Table IV shows the baseline performance achieved by each machine using standard row-major layout. Figures 9-18 show our results in detail. We make some comments on each graph directly in the figures. For each experiment / architecture pair, we give a broad characterisation of whether Morton layout is a useful compromise between row-major and column-major in this setting by annotating the figures with win, lose, etc. As an overview, we record wins for:

• Adi: Alpha, P3 and Sparc over column-major but not over row-major.
• Jacobi2D: Alpha, Sparc over column-major but not over row-major.
• MMikj: Alpha, Sparc over column-major but not over row-major.
• MMijk: Alpha, Sparc over both row-major and column-major.

Table IV. Baseline performance of various kernels on different systems. For each kernel, on each machine, we show the performance range in MFLOPs for row-major array layout over all problem sizes covered in our experiments (as shown in Figures 9-18).
Notice that for kernels with high spatial locality, such as MMikj and Jacobi2D, which run close to the machine's peak performance, bandwidth to L1 cache for table access may be a major factor. Our results also confirm that for row-major layout, padding the length of the rows of an array can significantly improve performance. The amount of padding required is small, but needs to be chosen very carefully. In Section 6.2, which follows the detailed performance graphs, we evaluate the competitiveness of Morton layout in overview by calculating the slowdown of Morton over the best lexicographic layout.
Competitive efficiency of Morton layout
If we know how an array is going to be accessed, and if there are no conflicting access patterns to that array, we could choose optimally between the two lexicographic layouts for that array. If we have only partial or no information about the access patterns, we would like to choose a layout which is competitive with the optimum layout. If we do not know how arrays in a program might be aliased, we probably need to choose one default layout for all arrays in that program, and we want this choice to compare well with an optimal decision. One way to evaluate the use of the Morton layout as such a default choice is by analogy with competitive online algorithms. Suppose we have an optimal array layout scheme OPT. Following [8], a memory layout scheme ALG is c-competitive (for a constant "efficiency" factor c) if there exists a constant α such that for all utilisation scenarios (or access sequences) σ,

    cost_ALG(σ) ≤ c · cost_OPT(σ) + α
It is not generally possible to find an optimum layout efficiently, specifically if there are conflicting access patterns for arrays. Therefore, in order to quantify our hypothesis that Morton layout implemented using table lookup is a competitive default choice, we compare against the better of the two lexicographic layouts. The constant α is designed so that a constant overhead cost could be taken into account (such as the cost of filling lookup tables); however, in our experiments, we have simply measured the time of the numerical kernels, and we therefore set α = 0 in the above equation.

Figure 9. ADI performance in MFLOPs on Alpha and Sparc. We compare row-major (RM), column-major (CM) and Morton implemented using lookup tables. For Morton, performance is shown for default alignment and for alignment to each significant size in the memory hierarchy (cache line lengths and page size).

Adi on Alpha: Win over Column-Major
• The fall-off in RM performance occurs at 725 × 725, when the total datasize exceeds the L2 cache size (4 MB, direct mapped); this assumes a working set of 725 × 725 doubles.
• RM below about 725 × 725 has a bimodal distribution.
• Notice the sharp drop in CM performance at around 360 × 360.
• Alignment does not change Morton performance; the three lines coincide.

Adi on Sparc: Win over Column-Major
• The fall-off in RM performance occurs at 1024 × 1024, when the total datasize exceeds the L2 cache size (8 MB, direct mapped); this assumes a working set of 1024 × 1024 doubles.
• Notice the drop in CM performance which occurs after 720 × 720.
• L1-aligned (32-byte) and L2-aligned (64-byte) Morton are faster than default- and page-aligned Morton.
In Figures 19-23 , we show the range of slowdown factors c over the best lexicographic layout. For each problem size, we select from our performance data the better of the two lexicographic layouts and then calculate the slowdown of Morton over this "optimal" time. We show the data as box-and-whisker plots, indicating maximum, minimum, median values and the range in which 50% of all data points lie. From these plots, we offer the following tentative conclusions about the competitiveness of Morton layout. 
Adi on Athlon: Conditional Win over Column-Major
• There is a cross-over between default Morton and page-aligned Morton at around 900 × 900.
• For large datasizes, page-aligned is the best Morton version.
• Notice some very bad performance drops on CM for individual problem sizes.

Adi on Pentium 4: Lose
• Morton generally performs no better than CM.
• Notice, however, some really bad drops in CM performance for some datasizes, which Morton does not suffer.
• For large problem sizes, L2-aligned is slightly faster than page-aligned Morton.

Jacobi2D on Pentium III: No Win
• L2- and page-aligned Morton is slightly better than default-aligned.

MMikj on Athlon: Lose
• Notice there is a cross-over between the default and L2-/page-aligned Morton versions at around 500 × 500.
• For large problem sizes, page-aligned Morton is very slightly better than default-aligned.

MMijk on Alpha: Win over both Row-Major and Column-Major
• Notice the upper limit is 1024 × 1024.
• Notice the sharp drop in RM and CM performance around 360 × 360.
• Page-aligned Morton is faster than default-aligned, but L2-aligned is faster than page-aligned.
• For Morton on problem sizes 832 (= 26 × 32) to 864 (= 27 × 32) and 992 (= 31 × 32) to 1024 (= 32 × 32) we see a noticeable drop in performance, presumably due to some self- or inter-array interference effect in the direct-mapped L2 cache.

MMijk on Sparc: Potential Win over both Row-Major and Column-Major
• Notice the sharp drop in RM and CM performance around 720 × 720.
• Notice that for large problem sizes Morton is faster than either lexicographic layout.

Cholesky on Alpha: Win
• For data sizes larger than around 550 × 550, Morton is faster than either column-major or row-major.

Figure 19. Range of slowdown factors c over the best lexicographic layout for Adi. For each machine / layout pair, we show the range of all c factors as a box-and-whisker plot: the grey box indicates the area where 50% of the datapoints lie, the "whiskers" indicate the maximum and minimum points, and the line in the middle of the grey box indicates the median. Most median slowdown factors for Adi are around or below 2. The very high maximum factors encountered on Athlon and P3 occur for small datasizes.

Figure 20. Range of slowdown factors c over the best lexicographic layout for Jacobi2D (box-and-whisker plot as in Figure 19). Using page-aligned arrays, the median slowdown factor on Alpha, P3 and Sparc is close to 2. The high maximum on Athlon for default alignment occurs for very small datasizes; note, however, that alignment improves this considerably.

Figure 21. Range of slowdown factors c over the best lexicographic layout for MMikj (box-and-whisker plot as in Figure 19). On i386 architectures, Morton suffers a disappointingly large slowdown. In particular, on the Pentium 4 the stride-1 traversal of arrays in the MMikj algorithm is handled very well by the memory system; in contrast, Morton almost certainly does not benefit from features such as the P4's hardware prefetching mechanism, which is based on detecting regular access strides.
Impact of Alignment of Morton Arrays.
For all experiments, except for MMijk and Cholesky on Athlon, our theoretical conclusions from Section 5 are supported by our experimental data: padding the base address of Morton order arrays to a significant size in the memory hierarchy, such as the cache line size or page size, can significantly improve performance. Considering spatial locality alone, we would expect alignment to the largest significant size, i.e. the page size, to have the greatest benefit. This is supported in most, but not all, cases by our experimental data; where it is not (such as MMijk on Alpha), we assume this is due to interference effects. In some cases, specifically Adi on Athlon, Jacobi on Sparc and MMikj on P4, the improvement with page alignment is quite significant.
Morton layout can outperform both lexicographic layouts. Notice that there are cases where the minimum, and occasionally (MMijk on Alpha) also the median slowdown is less than 1, meaning that Morton layout performs better than the best lexicographic layout.
Worst slowdown. Although we have shown that proper alignment can reduce the maximum slowdown factors found, and therefore increase the competitiveness of the basic Morton scheme, the maximum slowdown factors are still very high; the worst is probably around 13, for MMikj on P4. In our conclusion, we discuss some further ways in which we hope to improve these figures.

Figure 22. Range of slowdown factors c over the best lexicographic layout for MMijk (box-and-whisker plot as in Figure 19). The MMijk loop suffers from severe performance drops for specific datasizes on most architectures for both row-major and column-major layout. Morton layout does not incur this problem, and the minimum slowdown on all architectures here is less than 1, i.e. on all architectures there are problem sizes where Morton out-performs the better of the two lexicographic layouts. On Alpha, Morton out-performs both lexicographic layouts for all datasizes larger than 360 × 360, hence the median slowdown of less than 1.
Conclusions and directions for further research
Using a small suite of dense kernels working on two-dimensional arrays, we have studied the impact of poor array layout. On some machines, we found that Morton array layout, even implemented with a lookup table and with no compiler support, is remarkably competitive with both row-major and column-major layouts. We also found that using a lookup table for address calculation allows flexible selection of fine-grain non-linear array layout, while offering attractive performance on some architectures compared with lexicographic layouts on untiled loops. A number of interesting issues remain:
• Non-square cache blocks and pages
In our brief analysis of spatial locality using Morton layout (Section 3.6, Figure 3), we assumed that cache blocks and virtual memory pages are a square (even) power of two in size. Whether this holds depends on the array's element size, and it is often not the case. As we showed in Figure 8, row-major and column-major traversal of Morton layout then lead to differing spatial locality. A more subtle non-linear layout might address this.
• Unrolling
The results presented here are based on code which uses the lookup table for every address calculation. By strip-mining the innermost loop (which is always valid) by a small square power-of-two factor such as 4, it is possible to replace some of the table lookups with simple offsets from the base address of a 2 × 2 block, as in the sketch below. This should give higher performance for the Morton layout, at the loss of some of the addressing flexibility which the lookup table scheme allows. Since this paper was submitted, we have measured the effect of this optimisation [18].
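A sketch of the idea (our own construction, not the code measured in [18]), using the tables from Section 4.3: for i and j both even, the four elements of the aligned 2 × 2 block at (i, j) occupy consecutive Morton offsets, so one pair of table lookups serves four accesses.

    #include <stdint.h>

    extern uint32_t *MZ_i, *MZ_j;  /* dilation tables, Section 4.3 sketch */

    /* Strip-mined traversal of a Morton array (even dimensions assumed):
       the 2 x 2 block at (i, j) starts at offset MZ_i[i] + MZ_j[j]. */
    static void scale(double *A, uint32_t n_rows, uint32_t n_cols, double k) {
        for (uint32_t i = 0; i < n_rows; i += 2)
            for (uint32_t j = 0; j < n_cols; j += 2) {
                double *blk = &A[MZ_i[i] + MZ_j[j]];
                blk[0] *= k;   /* (i,     j)     */
                blk[1] *= k;   /* (i,     j + 1) */
                blk[2] *= k;   /* (i + 1, j)     */
                blk[3] *= k;   /* (i + 1, j + 1) */
            }
    }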
• Associativity conflicts within and between Morton arrays
Associativity conflicts have been studied extensively for lexicographic layouts (e.g. [15] ). Our results show evidence that associativity conflicts also impact performance with Morton layout.
• Cache contention between arrays and lookup tables
The lookup table scheme relies for its performance on the tables remaining resident in the first-level cache; their range is small and they are accessed in unit stride, but array accesses can nonetheless displace lookup table entries. We believe this effect may explain some features of our performance graphs, and we plan to investigate it.
• Prefetching
Most modern processors have both autonomous prefetching of uniform address streams and explicit prefetch instructions. With lexicographic layout, fixed-stride accesses are common and autonomous prefetching works well. With Morton layout, the access pattern is known but not uniform, so to sustain memory access bandwidth we need to issue prefetch instructions carefully; a sketch follows below.
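As an illustration, GCC's __builtin_prefetch can request the next 2 × 2 block one step ahead. This is a speculative sketch we have not measured, and the prefetch distance of one block is a guess that would need per-machine tuning:

    #include <stdint.h>

    extern uint32_t *MZ_i, *MZ_j;  /* dilation tables, Section 4.3 sketch */

    /* Morton traversal with explicit software prefetch of the next block. */
    static double sum_with_prefetch(const double *A,
                                    uint32_t n_rows, uint32_t n_cols) {
        double s = 0.0;
        for (uint32_t i = 0; i < n_rows; i += 2)
            for (uint32_t j = 0; j < n_cols; j += 2) {
                if (j + 2 < n_cols)                    /* next block in row */
                    __builtin_prefetch(&A[MZ_i[i] + MZ_j[j + 2]]);
                const double *blk = &A[MZ_i[i] + MZ_j[j]];
                s += blk[0] + blk[1] + blk[2] + blk[3];
            }
        return s;
    }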
• Performance analysis using performance counters
We plan to use performance counter instrumentation in order to better understand and analyse why we see the performance patterns that we have reported in this paper.
It seems unlikely that Morton layout can offer a competitive compromise for three-dimensional arrays, since a given lexicographic traversal would use only 2^k words of each 2^{3k}-word cache block.
