Abstract-The increasing disparity between memory latency and processor speed is a critical bottleneck in achieving high performance. Recently, several studies have been conducted on blocked data layouts, in conjunction with loop tiling to improve locality of references. In this paper, we further reduce cache misses, restructuring the memory layout of multi-dimensional arrays, so that array elements are stored in a blocked way, exactly as they are swept by the tiled instruction stream. A straightforward way is presented to easily translate multi-dimensional indexing of arrays into their blocked memory layout using quick and simple binary-mask operations. Actual experimental results on three hardware platforms, using 5 different benchmarks with various array sizes, illustrate that execution time is greatly improved when combining tiled code with blocked array layouts and binary, mask-based translation functions for fast indexing. Finally, simulations verify that our enhanced performance is due to the considerable reduction of cache misses in all levels of memory hierarchy, and especially due to their concurrent minimization, for the same tile size.
I. INTRODUCTION
As microprocessors become increasingly fast, memory system performance begins to determine overall performance [1] , since a large percentage of execution time is spent on memory stalls, even with large on-chip caches. Computer architects have been battling against this memory wall problem [2] by designing even larger but slightly faster caches.
In order to overcome the disparity between cpu and cache memory speeds, software techniques that exploit locality of references for iterative codes are widely used. These techniques can be categorized into control (code) transformations, data transformations, combination of control and data transformations, and, lately, transformations of the array layouts in memory (non-linear, hybrid linear and non-linear).
As far as control transformations are considered, they change loop iteration order and, thereby, data access order. Loop permutation, loop reversal and loop skewing attempt to restructure the control flow of the program to improve data locality. Loop unrolling and software pipelining are exploiting multiple registers and pipelined datapaths to improve temporal locality [3] . Such transformations, when used in combination with software-controlled prefetching [4] , [5] , [6] , [7] , help tolerating memory latency, as long as prefetching happens soon enough, but not too early so that prefetched data are evicted prior to their use. Loop fusion and loop distribution can indirectly improve reuse by enabling control transformations that were previously not legal [8] . Loop tiling, widely used in codes that include dense multidimensional arrays, operates on submatrices (tiles), instead of entire rows or columns of arrays, so that data loaded into the faster levels of memory hierarchy are reused. In this context, tile size selection [9] , [10] plays a critical role, since an efficient execution requires elimination of capacity and interference (both self and cross interference) cache misses, along with avoiding underutilization of the cache.
Locality analysis [11] allows compilers to identify and quantify reuse and locality, and is used to guide the search for the best sequence of transformations. Unimodular control transformations, described in the most cited work of Wolf and Lam [12] , and compound transformations of McKinley et al in [8] , attempt to find the best combination of control transformations which, when used with tiling, ensure the correct computation order, while increasing locality of accesses to cache memory.
In a data-centric approach [13] , array unification maps multiple arrays into a single array data space, in an interleaved way, grouping together arrays accessed in successive iterations, to reduce the number of conflict misses. Copying [14] is a method for reducing the intra-tile and inter-tile interference in the cache. However, it can not be used for free, as it incurs instruction overhead and additional cache misses, while data is copied from arrays to buffers. Padding [15] , [16] is a data alignment technique that involves the insertion of dummy elements in a data structure for improving cache performance by eliminating severe conflict misses.
Nevertheless, increasing the locality of references for a group of arrays may affect the number of cache hits for other referenced arrays. Combined loop and data transformations were proposed in [17] and [18] . Cierniak and Li in [17] present a cache locality optimization algorithm which combines both loop (control) and linear array layout transformations. Another unified, systematic method, presented by Kandemir et al in [19] , [20] , [21] , [18] , aims at utilizing spatial reuse in order to obtain good locality.
The previous approaches assumed linear array layouts. Programming languages provide with multidimensional arrays which are finally stored in a linear memory layout, either column-wise or row-wise (canonical order). However, such linear array memory layouts can produce unfavorable memory access patterns, that cause interference misses and increase memory system overhead. Wise et al in [22] and [23] used quad-tree layouts in combination with level-order, either Ahnentafel or Morton, indexing. Quad-tree layouts seem to work well with recursive algorithms due to their efficient element indexing. Nevertheless, no locality gain can be obtained when quad-tree layouts are applied to non-recursive codes. According to Lam et al in [24] , quad-tree (recursive) layouts need not to reach at the finest (element) level, since, if a tile is small enough to totally fit in cache, it can be organized in a canonical order, without any additional depth of recursion for its layout. In this way, the quad-tree decomposition can be pruned, well before reaching the element level and coexist with tiles organized in a canonical manner, thus creating hybrid layouts. Although quad-tree layouts are not efficient, their indexing is fast enough to be adopted for non-recursive layouts.
Hybrid quad-tree and canonical layouts were explored by Chatterjee et al in [25] and [26] . In these papers, the implementation cost was quantified in terms of execution time. Although they claimed for increasing execution-time performance, using four-dimensional arrays, any gain obtained by data locality due to blocked layouts seems to be counterbalanced by the slowdown caused when referring to four-dimensional array elements (in comparison to access time using two-or onedimensional arrays). Lin et al in [27] proposed that these fourdimensional arrays should be converted to two-dimensional ones, mapping array elements through Karnaugh representation scheme. However, indexing the right array elements still requires for tedious time calculations.
In order to quantify the benefits of adopting nonlinear layouts to reduce cache misses, there exist several different approaches. In [28] , Tseng considers all levels of memory hierarchy to reduce L2 cache misses as well, rather than reducing only L1 ones. He presents even fewer overall misses, however performance improvements are rarely significant. This is because L2 misses are not as many as L1 ones, although much more cycles are lost per L2 miss. Targeting only the L1 cache, nearly all locality benefits can be achieved. In another approach, TLB misses should also be considered, along with L1 & L2 misses. As problem sizes become larger, TLB thrashing may occur ( [29] ), so that the overall performance can be seriously degraded. In this context, TLB and cache misses should be considered in concert. Park et al in [30] analyze the TLB and cache performance for standard matrix access patterns, when tiling is used together with block data layouts. Such layouts with block size equal to the page size, seem to minimize the number of TLB misses. Considering both all levels of cache (L1 and L2) and TLB, a block size selection algorithm calculates a range of optimal block sizes.
As related work has shown, the automatic application of nonlinear layouts in real compilers is a really time tedious task. It does not suffice to identify the optimal layout either blocked or canonical one for each specific array. For blocked layouts, we also need an automatic and quick way to generate the mapping from the multidimensional iteration indices to the correct location of the respective data element in the linear memory. Blocked layouts are very promising, subject to an efficient address computation method. Any method of fast indexing for non-linear layouts will allow compilers to introduce such layouts along with row or column-wise ones, therefore further reducing memory misses.
In this paper, in order to facilitate the automatic generation of tiled code that accesses blocked array layouts, we propose a very quick and simple address calculation method of the array indices. We can adopt any out of four different proposed types of blocked layouts, and apply a dilated integer indexing, similar to Morton-order arrays. Thus, we combine additional data locality due to blocked layouts, with fast access per any array element, since simple boolean operations are used to find its right location in linear physical memory. Since array data are now stored block-wise, we provide the instruction stream with a straightforward indexing to access the correct elements. Our method is very effective at reducing cache misses, since the deployment of the array data in memory follows the exact order of accesses by the tiled instruction code, achieved at no extra runtime cost. Experimental results were conducted using the Matrix Multiplication, LU-decomposition, SSYR2K (Symmetric Rank 2k Update), SSYMM (Symmetric MatrixMatrix Operation) and STRMM (Product of Triangular and Square Matrix) codes from BLAS3 routines. We ran two types of experiments, actual execution of real codes and simulation of memory and cache usage using SimpleScalar. We compare our method with the ones proposed by Kandemir in [18] and by Cierniak in [17] , that propose control tiling and arbitrary but linear layouts, and show how overall misses are further reduced and thus final code performs faster. Comparing with Chatterjee's implementation, which uses non-linear fourdimensional array layouts in combination with control transformations, we prove that limiting cache misses is not enough, if address computation for these layouts is not efficient.
The remainder of the paper is organized as follows: Section 2 briefly discusses the problem of data locality using as example the typical matrix multiplication algorithm. Section 3 reviews definitions related to Morton ordering. Section 4 presents blocked data layouts and our efficient array indexing. Section 5 illustrates execution and simulation comparisons with so far presented methods, with results showing that our algorithm reduces cache misses and improves overall performance. Finally, concluding remarks are presented in Section 6.
II. THE PROBLEM: IMPROVING CACHE LOCALITY FOR ARRAY COMPUTATIONS
In this section, we elaborate on the necessity for both control (loop) and data transformations, to fully exploit data locality. We present, stepwise, all optimization phases to improve locality of references with the aid of the typical matrix multiplication kernel.
Loop Transformations: Figure 1a shows the typical, unoptimized version of the matrix multiplication code. Tiling (figure 1b) restructures the execution order, such that the number of intermediate iterations and, thus, data fetched between reuses, are reduced. Thus, useful data are not evicted from the register file or the cache before being reused. The tile size (step) should be selected accordingly, to allow reused data to fit in the specific memory hierarchy level (i.e. L1, L2, etc), that was selected to be optimized.
Unified Loop and linear Data Transformations: Since, loop transformations alone can not result in the best possible data locality, a unified approach that utilizes both control and data transformations becomes necessary. In figure 1b, loop k scans different rows of B. Given a row-order array layout, spatial reuse can not be exploited for B along the innermost loop k. Focusing on self-spatial reuse [18] (since self-temporal reuse can be considered as a subcase of selfspatial, while group spatial reuses are rare), the transformed code takes the form of figure 1c, which has proved to give the best performance, so far. Firstly, we fixed the layout of the LHS (Left Hand Side) array, namely array C, because, in every iteration the elements of this array are both read and written, while the elements of arrays A and B are only read. Choosing j to be the innermost loop, the fastest changing dimension of array C [i, j] should be controlled by this index, namely C should be stored by rows (Cr). Similarly, array B [k, j] should also be stored by rows (Br). Finally, placing loops in ikj order is preferable, because we exploit self-temporal reuse in the second innermost loop for array C. Thus, A [i, k] should also be stored by rows (Ar).
Unified Loop and non-linear Data Transformations: In order to evaluate the merit of non-linear data transformations, let us consider the code shown in figure 1d . We assume that the elements of all three arrays are stored exactly as they are swept by the program (we call this layout ZZ-order, as extensively presented in the following section). The loop ordering remains the same as in figure 1c , except that, tiling is also applied in loop i, so as to have homomorphic shaped tiles in all three arrays. This simplifies the computations needed to find the location for each array element.
III. MORTON ORDER (RECURSIVE) ARRAY LAYOUTS
In the sequel, in order to capture the notion of Morton order [31] , i.e. a recursive storage order of arrays, the basic elements of the dilated integer algebra are presented. Morton defined a way to index two-dimensional arrays and pointed out the conversion to and from cartesian indexing available through bit interleaving (figure 2).
The The hexadecimal evenBits = 0x55555555 has all even bits set and all odd bits cleared. Likewise, oddBits = evenBits << 1 has the value 0xaaaaaaaa.
Definition 3.2: The even-dilated representation of j = When using a dilated integer representation for the indexing of an array, if an index is only used as a column index, then its dilation is odd. Otherwise it is even-dilated. If used in both roles, then, doubling its even-dilated value, produces the odddilation, as needed.
So, after i and j are translated to their images, ← − i and − → j , the code of matrix multiplication will be as follows:
where rowsEven, colsOdd and rowsAEven are the bounds of arrays when transformed in dilated integers. Notice that k is used both as column and as row index. Therefore, it is translated to − → k and for the column indexing of array B, 2 * k is used, which represents ← − k .
IV. BLOCKED ARRAY LAYOUTS
In this section we introduce the non-linear, blocked array layout transformation. Since the performance of the proposed methods in [17] , [18] , [12] is better when tiling is applied, perhaps we could achieve even better locality, if also, array data are stored neither column-wise nor row-wise, but follow a blocked layout. Such layouts were used by Chatterjee et al in [25] , [26] , in order to obtain better interaction with cache 
of the tile offsets is a canonical one, namely column-major or row-major, according to the access order of array elements by the program code. The transformation function for the space L T : (t i , t j ) of the tile co-ordinates can be either canonical or follow the Morton ordering. However, referring to 4-dimensional arrays comes up with long assembly codes, thus, repetitive load and add instructions, which, as seen in experimental results, are too time consuming and, thus, degrade total performance.
A. Different Proposed Types of Blocked Layouts
We expand the notion of blocked layouts, storing array elements exactly in the same order as they are accessed when loop iterations are executed. This storage layout is presented in figure 3 for an 8×8 array which is split into tiles of size 4×4. The grey line shows the order by which the successive program iterations sweep the array data, while numbering illustrates the order, data are actually stored. We split arrays in tiles of the same size as those used by the program that scans the elements of the array. We denote the transformation of this example as "ZZ", because the inner of the tiles is scanned row-wise (in a Z-like order) and the shift from one tile to another is Zlike as well. The first letter (Z) of transformation denotes the shift from one tile to another, while the second indicates the sweeping within a tile. The sweeping of array data in ZZ-order is done using the first portion of code shown below:
In a similar way, the other three types of transformations are shown in figures 4, 5 and 6. For example, according to the "NN" transformation, both the sweeping of the array data within a tile and the shift from one tile to another are done columnwise (N-like order). The respective codes are found below.
"NZ sweeping" for (jj=0; jj < N; jj+=step) 
Since compilers support only linear layouts (either columnorder or row-order) and not blocked array layouts, sweeping the array data when stored in one of the four aforementioned ways, can be done using one-dimensional arrays and indexing them through dilated integers. Morton indexing [23] cannot be applied, since it is applicable only when we have a fully recursive tiling scheme, whereas, in our case, tiling is applied only once (1-level tiling). Thus, instead of oddBits and evenBits, we introduce binary masks. 
B. Mask Theory for Fast Address Computation
The form of the introduced masks, used to calculate the address of an element A [i, j] , for an array of size N c × N r and tile size step c × step r , is illustrated in table I.
The number of subsequent 0 and 1's that consist every part of each mask is defined by the functions m x = log(step x ) and t x = log 
C. Implementation
Let us store elements of array A [i, j] in an one-dimensional memory space, in the same order as they are swept by the instruction stream of a tiled code, similar to ZZ-sweeping of section IV-A. Such storage order is illustrated in figures 7 and 8, where array A is of size 32 × 32, tiled in blocks of size 8 × 8.
To access any element of an array stored according to a blocked layout, loop indices that control the element locations should take non-successive values. In our example (ZZ layout), the right values for indices that control columns and rows of the first tile (as shown in figure 8) These are the storage locations of the first column and the first row elements (figure 8). To find the storage location of an arbitrary element, e.g. A [4, 6] (5-th row and 7-th column), we can just add 6+32=38 (we selected the 5-th column index (=32) and the 7-th row index (=6)). Location #38 is the requested one. For an element that belongs in the 3-rd row of tiles and in the 4-th column of tiles, which is tile #11(=3+8, as shown in figure 7 ), the element position will have 1011 <2> (= 11 <10> ) as prefix.
The appropriate binary masks for the automatic calculation of such non-linear indices are given by the forms of section IV-B. For our example ( 32×32 array with 8×8 tiles), the masks are:
for the row index : 0011000111 for the column index : 1100111000 If values 0-31 of linear indices (which are the values given to row and column indices according to standard compiler array handling), are filtered, using these masks, the desired values will arise (figure 9).
As shown in figure 9 , element A [20, 30] which belongs to tile #11 <10> , is stored in position:
"20"+"30"=10 00 100 000 + 00 11 000 110 = 1011 100 110 = 742 <10> , which has indeed 1011 <2> as prefix.
D. Example: Matrix Multiplication
According to Kandemir et al in [18] , the best data locality is achieved when the code has the following form:
for (kk=0; kk < 32; kk+=8) for (jj=0; jj < 32; jj+=8)
We use a modified version of [18] (as explained in section II), with a nested loop of depth 6, instead of depth 5, since the implementation of our mask theory is simpler in this case. All three arrays A, B, C are scanned according to ZZ-sweeping.
In the nested code of our example, the three inner loops (i, j, k) control the sweeping within the tiles. The 6 least significant digits of the masks are adequate to sweep all iterations within these loops, as shown in figures 8 and 9. The three outer loops (ii, kk, jj) control the shifting from one tile to another, and the 4 most significant digits of the masks can define the movement from a tile to its neighboring one. Thus, i takes values in the range of xxx000 where x = (0 or 1) and ii takes values in the range of xx00000000, so i|ii = xx00xxx000 gives the desired values for the columns of matrix A and C. Similarly, j takes values in the range of 000xxx and jj in the range of 00xx000000, so j|jj = 00xx000xxx are the desired values for rows of B and C. Index k, does not control the same dimension in the arrays in which it is involved, as shown in table II. So, for array A, the proper mask is a row one, namely k A ∈ 000xxx and kk A ∈ 00xx000000. On the other hand, the mask for 
E. The Algorithm to select the Optimal Layout per Array
In order to succeed in finding the best possible transformation which maximizes performance, we present a method, based on the algorithm presented in [18] , adjusted to allow for blocked array layouts. Notice that our algorithm finds the best data layout of each array, taking into account all its instances throughout the whole program (instances of the same array in different loop nests are also included). This way, no replicates of an array are needed. Figure 10 gives a graphic representation of the following algorithm. As easily derived from the three loops of the flow chart, the worst case complexity is O(A · r · D), where A is the number of different array references, r is the number of dimensions each array reference contains and D is the loop depth (before tiling is being applied).
• Create a matrix R of size r × l, where r = number of different array references and l = nested loop depth (before tiling is applied). Columns of R are ordered from left to right, according to the loop nest order (from the outermost to the innermost loop). If there are two identical references to an array, there is no need for an extra line. We can just add an indication to the row representing this array, thus putting double priority to its optimization. For the example of matrix multiplication
Notice that the Left Hand Side (LHS) array of an expression (for our example this is array C) is more important, because in every such access, the referred element is both read and written. Each column of R represents one of the loop indices. So, in each row, array elements are set when the corresponding loop controls one of its dimensions. Otherwise array elements are reset.
• Optimize first the layout of the array with the greatest number of references in the loop body. (In our example this is array C). The applied loop transformations should be such that one of the indices that control an array dimension of it, is brought in the innermost position of the nested loop. Thus, the C row of R should come to the form (x, . . . , x, 1), where x = 0 or 1. To achieve this, swapping of columns can be involved. The chosen index has to be the only element of the stated dimension and should not appear in any other dimension of C. 
A. Execution Environment
In this section we present experimental results using 5 benchmarks: Matrix Multiplication, LU-decomposition, SSYR2K (Symmetric Rank 2k Udate), SSYMM (Symmetric Matrix-Matrix Operation) and STRMM (Product of Triangular and square Matrix). There are two types of experiments: actual execution times of optimized codes using non-linear layouts and simulations using the SimpleScalar toolkit [32] . Firstly, the experiments were performed on three platforms with different architectural characteristics: a Sun Enterprise 450 machine, an SGI/Cray Origin2000 multiprocessor, and an Athlon XP 2600+ PC. The Sun Enterprise has UltraSPARC II CPUs at 400MHz, each with a 16 KB 2-way set associative on-chip instruction L1 cache, a 16 KB direct-mapped on-chip data L1 cache, with 32 bytes cache line size and 8 cycles L1 miss latency, a direct-mapped L2 external cache of 4 MB, with 64 bytes cache line size and 84 cycles L2 miss latency, and a 64-entry data TLB with 8 KB page size and 51 cycles miss penalty. The SGI Origin has MIPS R10000 processors, each of which operating at 250MHz and has a 32 KB 2-way set associative on-chip instruction L1 cache, with 32 bytes line size. It also has a 32 KB 2-way set associative on-chip data L1 cache, with 64 bytes line size, a 4 MB 2-way set associative unified external cache, with 128 bytes cache line size and a 64-entry TLB, with 16 KB page size. The latency of the L1 data cache, L2 cache and TLB is 6, 100 and 57 cycles respectively. We used the Workshop cc compiler 5.0, first without any optimization flags (cc -xO0), in order to study the clear effect of different data layouts, avoiding any possible interference in the results due to any compiler optimizations. We then used the highest optimization level (-fast -xtarget=native), which includes memory alignment, loop unrolling, software pipelining and other floating point optimizations. The experiments were executed for various array dimensions (N ) ranging from 16 to 2048 elements, and tile sizes (step) ranging from 16 to N elements, to show the merit of our cache miss reduction method both on data sets that fit and do not fit in cache.
The Athlon XP CPU operates at 2,083GHz and has an 128KB, 2-way set associative on-chip L1 cache (64KB+64KB for data+instructions) with 64bytes line size and 3 cycles miss penalty. The unified L2 cache is also on-chip, with 64bytes line size, it is 16-way set associative, 256KB in capacity and has 20 cycles miss penalty. There are also two levels of TLBs: Instruction TLB has 24 entries in L1 and 256 entries in L2, while data TLB has 40 entries in L1 (with 3 cycles miss penalty and 4KB page size) and 256 entries in L2. In Athlon XP the gcc (version 3.3.4-13) compiler has been used, firstly without any optimization flags (gcc -O0) and then, at the highest optimization level (-O3).
We implemented 5 different versions per benchmark: our method, that is Blocked array Layouts using our Mask theory for fast address computation (MBaLt), L MO (Lmo) and L 4D (L4d) both from Chatterjee [25] , and the method [18] by Kandemir using both 2-dimensional arrays (tiled2D) and 1-dimensional arrays (tiled1D). Measured execution times do 
B. Execution Time Measurements
As far as the Matrix Multiplication benchmark is concerned, we compared our method (MBaLt) having selected the best tile size that achieves maximum performance, with the ones proposed in the literature [18] , where the respective optimal tile size has been selected as well. Optimal tile sizes are chosen experimentally, by making execution time measurements for various tile sizes. It is observed that tiles should fit in L1 cache and especially for blocked layouts step = √ L1 cache capacity. In figure 11 , the execution time of our method is almost 25% less than the one achieved using the optimal code from Kandemir et al for 2-dimensional arrays (tiled2D). Since current compilers do not support non-linear (blocked) array layouts, we used one-dimensional arrays throughout the implementation of our method (MBaLt). Storing elements in one-dimensional arrays, while two-dimensional ones are needed, required extra computations to find the desired array elements. Thus, a fair comparison should be done with one-dimensional arrays arrays of the tiled version (tiled1D). In this case, MBaLt over-exceeds in performance by 60%. Both Chatterjee's implementations (L MO (Lmo) and L 4D (L4d)) perform worse than tiled2D. Notice that the L 4D array layout is in practice the same as MBaLt, except for data are indexed through four-dimensional arrays instead of one-dimensional ones used in our approach. However, the performance degradation is due to the much more complicated memory location calculation scheme when dealing with four-dimensional arrays. Using the -fast optimization level ( figure 12 ), MBaLt gives a gentle slope curve as execution times increase smoothly when array sizes increase, due to independence on conflict misses. The tiled version of the benchmark, (especially in LUdecomposition and SSYR2K), is prone to such unpredictable types of misses, since steep fluctuations occur in the performance graph for specific matrix sizes. Furthermore, comparing figures 11 and 12, we conclude that the simpler the initial code is, the better optimization the -fast level can bring. Thus, standard compiler options are not enough when optimizing complex codes. Applying even more optimizations by hand (e.g. loop unrolling and data prefetching) in conjunction with -fast optimization level, proves to be the best technique for achieving top performance (see hand-optimized performance of MBaLt: MBaLt-hand opt in figure 17) . Extreme performance degradation for specific array sizes is possible due to self-conflict misses that take place when successive tile lines map to exactly identical cache locations.
Although the MBalt code is larger in terms of instruction lines (1.5 more instruction code lines, as seen in the example code of section IV-D), it takes less to execute, since boolean operations are used for address computations. Furthermore, array padding, which is necessary for rounding up array sizes in order to become a power of 2, does not affect performance, since execution time increases regularly, proportionally to the actual array sizes. The reader can verify that no sharp peaks occur in the execution time graphs. Delving into the assembly code, one-dimensional arrays are more efficiently implemented than greater dimension ones, since they require for fewer memory references to find the array elements. On the other hand, finding the storage location of an array element in nonlinear layouts usually needs a lot of computations, to find the right location. Consequently, what we achieved with the proposed method is handling one-dimensional arrays without introducing any additional delays due to complicated address computations.
The above results are also verified with the LUdecomposition, SSYR2K, SSYMM and STRMM benchmarks (figures 14 -25) . Additionally, we noticed that in LUdecomposition, the use of blocked array layouts in combination with our fast indexing (MBaLt version), not only provides with an average of 15% reduction in execution times (when no optimization is applied: figure 14) compared to the tiled version (tiled2D), but also smoothes any steep peaks that come up due to conflict misses in the simple tiled code. This reduction is even bigger (can reach even 30%) with -fast flag. After observing that tiled1D (in the matrix multiplication benchmark and LU-decomposition) outperforms MBaLt when -fast optimization (in SGI Origin platform) is applied for arrays smaller than 2048 × 2048, we conducted experiments with larger ones (sizes bigger than 2048 × 2048). Figure 17 illustrates that MBaLt in this case, for large matrices, is much more efficient also using -fast optimization level.
In the SSYR2K benchmark (figures 18 and 19), we searched for the best tile size among a wide range of values (not only for values of power of 2), when linear layouts are used (tiled2D-fine). We notice that, in all previous tiled codes experiments were using tile sizes equal to a power of 2, in order to be in accordance with the MBaLt implementation. The fact that graphs for tiled and tiled-fine are practically identical, proves that no delay is introduced when tile sizes are a power of 2.
Finally, as far as different tile sizes are considered, MBaLt versions perform better for tile sizes, that keep data mainly in L1 cache. Successive tile lines are mapped in sequential cache lines. As a result, conflict misses are minimized and optimal tile sizes are fixed for any array size. On the other hand, tile sizes that give minimum execution times in the tiled versions (tiled1D and tiled2D) are difficult to predict. Conflict misses can, in this case, severely degrade performance. To avoid conflict misses, complex cache alignment analysis is required or else an exhaustive search of all possible tile sizes.
C. Simulation results
In order to verify the results of measured execution times, we applied the binary codes of matrix-multiplication (MBaLt, Kand2D), LU-decomposition (MBaLt, tiled) and SSYR2K (MBaLt, tiled) to the SimpleScalar 2.0 toolkit, simulating both the cache characteristics of the Sun Enterprise 450 and the SGI Origin machines. We measured the data L1 (dl1), unified L2 (ul2) cache and data TLB (dtlb) misses, for various values (ULTRASPARC) for N , ranging from 16 to 1024, and various values for step, ranging from 8 to N , at the standard optimization level.
The · 100%. The results show that dl1 misses are reduced in matrix multiplication for tiles of size 64 × 64, because at this tile size, data of a whole tile for each one of the three arrays fits in the L1 cache. For the same reason, for step = 2048, ul2 misses increase sharply, compared to ul2 misses when step = 1024. The minimum value fluctuates with the problem size, because it depends on conflict misses which can not be easily foreseen, and thus avoided. Figure 26 shows the ratio of MBaLt misses vs. Kand2D misses for all memory hierarchy levels of the UltraSPARC architecture, for different array sizes. In all cases, when total height is less than 3, MBaLt has better cache performance, in all levels of memory hierarchy. The analytic formula used to calculate the height of bars in figures 26 to 29 is behaviour, as it can bring an extremely large number of misses. Thus, improvement in real time execution is achieved when the white bar of figures 26 to 29 is lower than 1, even if L2 and TLB misses increase. In any case, no L2 cache or TLB thrashing occurs (no major increase in the number of misses occurs), because we have to restrain the tile size so that L1 cache misses do not increase excessively. The total % improvement of memory hierarchy performance in MBaLt vs. tiled versions is tables III, IV (last column), is given by the formula:
(m1t−m1M )·p1+(m2t−m2M )·p2+(mTt−mTM )·pT m1t·p1+m2t·p2+mTt·pT
, where p 1 = D-L1 miss penalty, p 2 = L2 miss penalty and p T = TLB miss penalty in clock cycles. The reduction of misses in %, achieved in the MBaLt code, is shown in table III. Negative percentage means that misses increase in MBaLt. Such values are met mainly in L2 cache. However, data L1 cache misses are 2 orders of magnitude more than that of L2 cache. So, although a L1 miss has much less performance cost (latency) than a L2 one, the total percentage (calculated by taking into consideration the latency of each memory level), gives an average of 27% miss improvement, and an even better result as array dimension grow bigger. As a matter of fact, small array sizes have no need of data layout or any other optimization, because the working set, in this case, can totally fit even in the higher (faster) level of cache hierarchy. Complicated transformations can be efficient when cache and TLB misses really degrade performance, which happens for large array sizes. For such large array sizes (N > 512) MBaLt achieves to decrease misses in all memory levels significantly. The average miss improvement is over 70%. Furthermore, we observed that among the various tile sizes used in the matrix multiplication benchmark, in MBaLt code, the TLB misses remain within the same order of magnitude for all tile sizes. On the contrary, in Kand2D code, the best tile size (step) for TLB misses does not match to the one that minimizes the data L1 misses, while for other values of step, TLB misses increase rapidly.
The results are quite similar for the rest of the benchmarks. In the LU-decomposition benchmark, for example, the TLB misses of the masked version are almost zero compared to those that occur in the tiled version of the code, so, in figures 27 and 28, the TLB misses are, for the most array sizes, not worth mentioning. Figure 29 displays significantly reduced data L1 cache misses in MBaLt version. As already mentioned, L1 cache misses are the dominant factor on performance, because they are some orders of magnitude larger than misses in other memory levels.
Finally, notice that for the problematic array sizes (sizes equal to some power of 2 and N 2 > cache capacity) MBaLt achieves a major reduction of misses in all memory levels. Blocked data layouts manage an efficient placement of array elements, mapping elements accessed by nearby iterations to adjacent, but not identical cache lines. Otherwise, linear layouts, when array dimensions are equal to some power of 2, align successive rows to exactly the same cache line, thus, bringing many conflicts.
VI. CONCLUSION
Low locality of references, thus poor performance in algorithms which contain multidimensional arrays, is due to incompatibility of canonical array layouts with the pattern of memory accesses from tiled codes. In this paper, we have examined the effectiveness of blocked array layouts using 5 benchmarks and have provided with a fast addressing scheme that uses simple binary masks, accomplished at low cost. Experimental results illustrate the efficiency of the proposed fast address computation method. Simulations prove that performance improvement is due to excessive reduction, mainly of data L1 cache misses and additionally of TLB ones.
