Abstract-Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze the cache and TLB performance of such alternate layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight lower bound on TLB performance for standard matrix access patterns and show that block data layout and Morton layout achieve this bound. To improve cache performance, block data layout is used in concert with tiling. Based on the cache and TLB performance analysis, we propose a data block size selection algorithm that finds a tight range for the optimal block size. To validate our analysis, we conducted simulations and experiments using tiled matrix multiplication, LU decomposition, and Cholesky factorization. For matrix multiplication, simulation results using UltraSparc II parameters show that tiling and block data layout, with a block size given by our block size selection algorithm, reduce TLB misses by up to 93 percent compared with other techniques (copying, padding, etc.). The total miss cost is reduced considerably. Experiments on several platforms (UltraSparc II and III, Alpha, and Pentium III) show that tiling with block data layout achieves up to 50 percent performance improvement over other techniques that use conventional layouts. Morton layout is also analyzed and compared with block data layout. Experimental results show that matrix multiplication using block data layout is up to 15 percent faster than that using Morton data layout.
INTRODUCTION
THE increasing gap between memory latency and processor speed is a critical bottleneck in achieving high performance. The gap is typically bridged by a multilevel memory hierarchy that can hide memory latency. This hierarchy consists of multiple levels of cache, typically both on-chip and off-chip. To improve the effective performance of the memory hierarchy, various hardware solutions have been proposed [3], [7], [9], [10], [20]. Recent processors such as Intel Merced [25] provide increased programmer control over data placement and movement in a cache-based memory hierarchy, in addition to providing some memory streaming hardware support for media applications. To exploit these features, it is important to understand the effectiveness of control and data transformations.
Along with hardware solutions, compiler optimization techniques have received considerable attention [14], [15], [23]. As the memory hierarchy gets deeper, it is critical to manage data efficiently. One of the well-known optimizations for improving data access performance is tiling. Tiling transforms loop nests so that temporal locality can be better exploited for a given cache size. However, tiling focuses only on the reduction of capacity cache misses by decreasing the working set size. The cache in most state-of-the-art machines is either direct-mapped or of small set-associativity. Thus, it suffers from considerable conflict misses, thereby degrading the overall performance [6], [12]. To reduce conflict misses, copying [12], [26] and padding [16], [21] techniques combined with tiling have been proposed. However, most of these approaches target mainly cache performance, paying less attention to the Translation Look-Aside Buffer (TLB) performance. As problem sizes become larger, TLB performance becomes more significant. If TLB thrashing occurs, the overall performance is drastically degraded [24]. Hence, both the TLB and the cache must be considered in optimizing application performance.
Most previous optimizations, including tiling, concentrate on a single level of cache [6], [12], [16], [21], [26]. Multilevel memory hierarchy has been considered by a few researchers. For improving multilevel memory hierarchy performance, a compiler technique that transforms loop nests into recursive form is proposed in [28]. However, only multilevel caches were considered [22], [28], with no emphasis on the TLB. It was proposed in [13] that cache and TLB performance be considered in concert to select the tile size. In that analysis, the TLB and the cache were assumed to be fully set-associative. However, in most state-of-the-art platforms, the cache is direct-mapped or has small set-associativity.
Some recent work [4], [11], [12], [18], [19], [26] proposed changing the data layout to match the data access pattern, to reduce cache misses. It was proposed in [11] that both data and loop transformations be applied to loop nests for optimizing cache locality. In [4], conventional (row-major or column-major) layout is changed to a recursive data layout, referred to as Morton layout, which matches the access pattern of recursive algorithms. This data layout was shown to improve memory hierarchy performance. This was confirmed through experiments; we are not aware of any formal analysis.
The ATLAS project [27] automatically tunes several linear algebra implementations. It uses block data layout with tiling to exploit temporal and spatial locality. Input data, originally in column-major layout, is remapped into block data layout before the computation begins. The combination of block data layout and tiling has shown high performance on various platforms. However, the selection of the optimal block size is done empirically at compile time by running several tests with different block sizes.
In this paper, we study block data layout as a data transformation to improve memory hierarchy performance. In block data layout, a matrix is partitioned into submatrices called blocks. Data elements within one such block are mapped onto contiguous memory. These blocks are arranged in row-major order. First, we analyze the intrinsic TLB performance of block data layout. We then analyze the TLB and cache performance using tiling and block data layout. Based on the analysis, we propose a block size selection algorithm. Morton data layout is also discussed as a variant of block data layout. The contributions of this paper are as follows:
- We present a lower bound analysis of TLB performance. Further, we show that block data layout intrinsically has better TLB performance than row-major layout (Section 2). As an abstraction of matrix operations, the cost of accessing all rows and all columns is analyzed. Compared with row-major layout, we show that the number of TLB misses is improved by $O(\sqrt{P_v})$, where $P_v$ is the page size.
- We present TLB and cache performance analysis when tiling is used with block data layout (Sections 3.1 and 3.2). In tiled matrix multiplication, block data layout improves the number of TLB misses by a factor of $B$, where $B$ is the block size. Cache performance analysis is also presented. We validate our analysis through simulations using SimpleScalar [2].
- On the basis of our cache and TLB analysis, we propose a block size selection algorithm (Section 3.3). The best block sizes found by ATLAS fall in the range given by our algorithm.
- We validate our analysis through simulations and measurements using matrix multiplication, LU decomposition, and Cholesky factorization (Section 4).
- We compare the performance of block data layout and Morton data layout. Block size selection for Morton data layout is limited. This limitation causes the performance of Morton data layout to be worse than that of block data layout. Experimental results on UltraSparc II and Pentium III show that matrix multiplication and LU decomposition using block data layout were up to 15.8 percent faster than those using Morton data layout.

The rest of this paper is organized as follows: Section 2 describes block data layout and gives an analysis of its TLB performance. Section 3 discusses the TLB and cache performance when tiling and block data layout are used in concert. A block size selection algorithm is described based on this analysis. Section 4 presents simulation and experimental results. Concluding remarks are presented in Section 5.
BLOCK DATA LAYOUT AND TLB PERFORMANCE
In this paper, we assume the architecture parameters to be fixed (e.g., cache size, cache line size, page size, number of TLB entries). The following notations are used in this paper. $S_{tlb}$ denotes the number of TLB entries. $P_v$ denotes the virtual page size. It is assumed that the TLB is fully set-associative with a Least-Recently-Used (LRU) replacement policy. The block size is $B \times B$, where it is assumed that $B^2 = kP_v$. $S_{ci}$ is the size of the $i$th level cache and $L_{ci}$ denotes its line size. The cache is assumed to be direct-mapped.
In Section 2, we analyze the TLB performance of block data layout. We show that block data layout has better intrinsic TLB performance than conventional data layouts.
Block Data Layout
To support multidimensional array representations, most programming languages provide a mapping function, which converts an array index to a linear memory address. In current programming languages, the default data layout is row-major or column-major, denoted as canonical layouts [4]. Both row-major and column-major layouts have similar drawbacks. For example, consider a large matrix stored in row-major layout. Due to the large stride, column accesses can cause cache conflicts. Further, if every row in a matrix is larger than the size of a page, column accesses can cause TLB thrashing, resulting in drastic performance degradation.
In block data layout, a large matrix is partitioned into submatrices. Each submatrix is a $B \times B$ matrix and all elements in the submatrix are mapped onto contiguous memory locations. The blocks are arranged in row-major order. Another data layout of recent interest is Morton data layout [3]. Morton data layout divides the original matrix into four quadrants and lays out these submatrices contiguously in memory. Each of these submatrices is further recursively divided and laid out in the same way. At the end of the recursion, elements of the submatrix are stored contiguously. This is similar to the arrangement of elements of a block in block data layout. Morton data layout can thus be considered a variant of block data layout; the two differ only in the order of blocks. Fig. 1 shows block data layout and Morton data layout with block size $2 \times 2$. Due to the similarity, the following TLB analysis holds true for Morton data layout as well.
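As a concrete illustration (a minimal C sketch, not from the original implementation; the function names and the assumption that $N$ is a multiple of $B$ are ours), the mapping from a 2D index to a linear offset under block data layout, and the block-order change that distinguishes Morton data layout, can be written as:

```c
#include <stddef.h>

/* Linear offset of element (i, j) of an N x N matrix stored in block
 * data layout with B x B blocks (N assumed to be a multiple of B).
 * Blocks are laid out in row-major order; so are elements in a block. */
size_t bdl_offset(size_t i, size_t j, size_t n, size_t b)
{
    size_t bi = i / b, bj = j / b;       /* block coordinates         */
    size_t oi = i % b, oj = j % b;       /* position inside the block */
    return (bi * (n / b) + bj) * b * b   /* start of the block        */
         + oi * b + oj;                  /* offset within the block   */
}

/* Morton data layout differs only in the order of blocks: the block
 * index is the bit-interleaving (Z-order) of bi and bj. */
size_t morton_block_index(size_t bi, size_t bj)
{
    size_t idx = 0;
    for (unsigned k = 0; k < 4 * sizeof(size_t); k++) {
        idx |= ((bj >> k) & (size_t)1) << (2 * k);      /* even bits: bj */
        idx |= ((bi >> k) & (size_t)1) << (2 * k + 1);  /* odd bits:  bi */
    }
    return idx;
}
```

Replacing the block term `bi * (n / b) + bj` in `bdl_offset` with `morton_block_index(bi, bj)` yields the Morton variant; the within-block term is unchanged, which is why the TLB analysis below applies to both layouts.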
TLB Performance of Block Data Layout
In this section, we present a lower bound on the TLB misses for any data layout. We discuss the intrinsic TLB performance of block data layout using a generic access pattern. We give an analysis of the TLB performance of block data layout and show improved performance compared with conventional layouts. Throughout this section, we consider an $N \times N$ array.
A Lower Bound on TLB Misses
In general, most matrix operations consist of row and column accesses, or permutations of row and column accesses. In this section, we consider an access pattern where an array is accessed first along all rows and then along all columns. The lower bound analysis of TLB misses incurred in accessing the data array along all the rows and all the columns is as follows.

Theorem 1. For accessing an array along all the rows and then along all the columns, the asymptotic minimum number of TLB misses is given by $\frac{2N^2}{\sqrt{P_v}}$.
Proof. Consider an arbitrary mapping of array elements to pages. Let $A_k = \{i \mid \text{at least one element of row } i \text{ is in page } k\}$ and $B_k = \{j \mid \text{at least one element of column } j \text{ is in page } k\}$, and let $a_k = |A_k|$ and $b_k = |B_k|$. Since page $k$ holds $P_v$ elements and every element belongs to one of $a_k$ rows and one of $b_k$ columns, $a_k b_k \ge P_v$. Using the mathematical identity that the arithmetic mean is greater than or equal to the geometric mean ($\frac{a_k + b_k}{2} \ge \sqrt{a_k b_k}$), we have $a_k + b_k \ge 2\sqrt{P_v}$.

Let $x_i$ ($y_j$) denote the number of pages over which the elements of row $i$ (column $j$) are scattered. The number of TLB misses in accessing all rows consecutively and then all columns consecutively is given by

$$\sum_{i=1}^{N}\left(x_i - O(S_{tlb})\right) + \sum_{j=1}^{N}\left(y_j - O(S_{tlb})\right),$$

where $O(S_{tlb})$ is the number of page entries required for accessing row $i$ (column $j$) that are already present in the TLB. Page $k$ is accessed $a_k$ times by row accesses and $b_k$ times by column accesses, so $\sum_{i} x_i = \sum_{k} a_k$ and $\sum_{j} y_j = \sum_{k} b_k$, where $k$ ranges over all $\frac{N^2}{P_v}$ pages. Therefore, the total number of TLB misses is given by

$$\sum_{k=1}^{N^2/P_v}(a_k + b_k) - O(N \cdot S_{tlb}) \ \ge\ \frac{N^2}{P_v} \cdot 2\sqrt{P_v} - O(N \cdot S_{tlb}) \ =\ \frac{2N^2}{\sqrt{P_v}} - O(N \cdot S_{tlb}).$$

As the problem size ($N$) increases, the number of pages accessed along one row (column) becomes larger than the size of the TLB ($S_{tlb}$). Thus, the number of TLB entries that are reused between two consecutive row (column) accesses is reduced. Therefore, the asymptotic minimum number of TLB misses is given by $\frac{2N^2}{\sqrt{P_v}}$. □
We obtained a lower bound on TLB misses for any layout when data are accessed along all rows and then along all columns. This lower bound also holds when data are accessed along an arbitrary permutation of all rows and columns.

Corollary 1. For accessing an array along an arbitrary permutation of row and column accesses, the asymptotic minimum number of TLB misses is given by $\frac{2N^2}{\sqrt{P_v}}$.
TLB Performance
In this section, we consider the same access pattern as discussed in Section 2.2.1. Consider a given $N \times N$ array stored in a canonical layout. Without loss of generality, the canonical layout is assumed to be row-major. During the first pass (row accesses), the memory pages are accessed consecutively. Therefore, the number of TLB misses caused by row accesses is $\frac{N^2}{P_v}$. During the second pass (column accesses), the elements along a column belong to $N$ different pages. Hence, a column access causes $N$ TLB misses. Since $N \gg S_{tlb}$, all $N$ column accesses result in $N^2$ TLB misses. The total number of TLB misses caused by all row accesses and all column accesses is thus $\frac{N^2}{P_v} + N^2 \approx N^2$. Therefore, in canonical layout, TLB misses drastically increase due to column accesses.
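As a concrete illustration (our numbers, assuming 8KByte pages and double-precision elements, so $P_v = 1024$ elements), for $N = 2048$:

$$\underbrace{\frac{N^2}{P_v} + N^2}_{\text{canonical}} = \frac{2048^2}{1024} + 2048^2 \approx 4.2 \times 10^6, \qquad \underbrace{\frac{2N^2}{\sqrt{P_v}}}_{\text{lower bound}} = \frac{2 \cdot 2048^2}{32} = 262{,}144,$$

a gap of roughly a factor of 16, consistent with the improvement reported for large problem sizes in the simulation results below.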
Compared with canonical layout, block data layout has better TLB performance. The following theorem shows that block data layout minimizes the number of TLB misses.

Theorem 2. For accessing an array along all the rows and then along all the columns, block data layout with block size $\sqrt{P_v} \times \sqrt{P_v}$ minimizes the number of TLB misses.

To prove this theorem, we consider two cases: $B^2/P_v < 1$ and $B^2/P_v \ge 1$. For each case, we estimate the number of TLB misses by comparing the TLB size, the number of page entries in a row, and the number of pages in a block. The optimal block size is then derived from these estimations. A detailed proof of Theorem 2 is presented in [17]. In general, the number of TLB misses for a $B \times B$ block data layout (with $B^2 = kP_v$) is $\frac{(k+1)N^2}{B}$. It is reduced by a factor of $O\left(\frac{B}{k+1}\right)$ when compared with canonical layout. When $B = \sqrt{P_v}$ (i.e., $k = 1$), this number approaches the lower bound shown in Theorem 1.
Theorem 2 holds true even when data in block data layout are accessed along an arbitrary permutation of all rows and columns.

Corollary 2. For accessing an array along an arbitrary permutation of rows and columns, block data layout with block size $\sqrt{P_v} \times \sqrt{P_v}$ minimizes the number of TLB misses.
Similar to Theorem 2 and Corollary 2, the number of TLB misses is minimized when blocks are stored in Morton data layout and elements are accessed along rows and columns. To verify our analysis, simulations were performed using the SimpleScalar simulator [2]. It is assumed that the page size is 8KByte and that the data TLB is fully set-associative with 64 entries (similar to the data TLB in UltraSparc II). Double-precision data is assumed. The block size is set to 32. Table 1 compares the TLB misses using block data layout with those using canonical layout. Table 1a shows the TLB misses for the "first all rows and then all columns" access pattern. For small problem sizes, TLB misses with block data layout are considerably fewer than those with canonical layout. This is due to the fact that TLB entries used in a column (row) access are almost fully reused in the next column (row) access. For a problem size of $1024 \times 1024$, a 504.37 times improvement in the number of TLB misses is obtained with block data layout. The number of misses is much less than the lower bound obtained from Theorem 1, because the TLB entries are reused at this problem size. For larger problem sizes, the TLB entries cannot be reused and the total number of TLB misses approaches the lower bound. For these large problem sizes, TLB misses with block data layout are up to 16 times fewer compared with canonical layout.
To verify Corollaries 1 and 2, two sets of access patterns were simulated: an arbitrary permutation of all rows and columns, and an arbitrary permutation of all rows followed by an arbitrary permutation of all columns. With these access patterns, TLB entries referenced during one row (column) access are not reused when accessing the next row (column). The number of TLB misses with block data layout approaches the lower bound on TLB misses. The results are shown in Tables 1b and 1c. Morton data layout shows performance similar to block data layout.
Even though block data layout has better TLB performance than canonical layouts for a generic access pattern, it alone does not reduce cache misses. The data access pattern of tiling matches well with block data layout. In the following section, we discuss the TLB and cache performance improvement when block data layout is used in conjunction with tiling.
TILING AND BLOCK DATA LAYOUT
Tiling is a well-known optimization technique that improves cache performance. Tiling transforms the loop nest so that temporal locality can be better exploited for a given cache size. Consider an $N \times N$ matrix multiplication represented as $Z = XY$. The working set size for the usual 3-loop computation is $N^2 + 2N$. For large problems, the working set size is larger than the cache size, resulting in severe cache thrashing. To reduce cache capacity misses, tiling transforms the matrix multiplication into the 5-loop tiled matrix multiplication (TMM) shown in Fig. 2a. The working set size for this tiled computation is $B^2 + 2B$. To efficiently utilize block data layout, we consider the 6-loop TMM shown in Fig. 2b instead of the 5-loop TMM.
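The following C sketch illustrates the 6-loop TMM structure of Fig. 2b (outer block loops $ii, kk, jj$; inner loops $i, k, j$); it is our illustration, assuming $N$ is a multiple of $B$, that the matrices are already stored block-contiguously, and that $Z$ is zero-initialized by the caller:

```c
/* 6-loop TMM on block data layout: X, Y, Z are N x N matrices stored as
 * (N/B)^2 contiguous B x B blocks in row-major block order.
 * z must be zero-initialized by the caller. */
void tmm_bdl(int n, int b, const double *x, const double *y, double *z)
{
    int nb = n / b;          /* blocks per dimension */
    int bb = b * b;          /* elements per block   */
    for (int ii = 0; ii < nb; ii++)
        for (int kk = 0; kk < nb; kk++)
            for (int jj = 0; jj < nb; jj++) {
                const double *xb = &x[(ii * nb + kk) * bb];
                const double *yb = &y[(kk * nb + jj) * bb];
                double       *zb = &z[(ii * nb + jj) * bb];
                /* inner loops: one B x B block multiply over contiguous data */
                for (int i = 0; i < b; i++)
                    for (int k = 0; k < b; k++) {
                        double xv = xb[i * b + k];
                        for (int j = 0; j < b; j++)
                            zb[i * b + j] += xv * yb[k * b + j];
                    }
            }
}
```

Because each block occupies $B^2$ contiguous memory locations, the inner three loops touch exactly three contiguous regions, which is what the TLB and cache analysis below exploits.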
TLB Performance
In this section, we show the TLB performance improvement of block data layout with tiling. To illustrate the effect of block data layout on tiling, we consider a generic access pattern abstracted from tiled matrix operations. The access pattern is shown in Fig. 3, where the tile size is equal to $B$.
With canonical layout, TLB misses do not occur when accessing consecutive tiles in the same row if $B \le S_{tlb}$. Hence, the tiled accesses along the rows generate $\frac{N^2}{P_v}$ TLB misses. This is the minimum number of TLB misses incurred in accessing all the elements in a matrix. However, tiled accesses along columns cause considerable TLB misses. $B$ page table entries are necessary for accessing each tile. For all tiled column accesses, the total number of TLB misses is $\frac{N^2}{B^2} \cdot B = \frac{N^2}{B}$. It is reduced by a factor of $B$ compared with the number of TLB misses for all column accesses without tiling (see Section 2.2).
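For instance (our numbers, with $N = 1024$ and $B = 32$), tiled column accesses incur

$$\frac{N^2}{B} = \frac{1024^2}{32} = 32{,}768 \quad \text{TLB misses, versus} \quad N^2 = 1{,}048{,}576$$

for untiled column accesses, a 32-fold reduction.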
The total number of TLB misses is further reduced when block data layout is used in concert with tiling, as shown in Theorem 3. Throughout this paper, the block size of block data layout is assumed to be the same as the tile size, so that the tiled access pattern matches block data layout. In block data layout, a two-dimensional block is mapped onto one-dimensional contiguous memory locations. A block can extend over several pages; Fig. 4 shows an example. To analyze TLB misses for column accesses using block data layout, the average number of pages in a block is required.

Lemma 1. In block data layout with block size $B^2 = kP_v$, the average number of pages over which a block extends is $k + 1$.

Proof. For block size $B^2 = kP_v$, assume that $k = n + f$, where $n$ is a nonnegative integer and $0 \le f < 1$. The probability that a block extends over $n + 1$ contiguous pages is $1 - f$. The probability that a block extends over $n + 2$ contiguous pages is $f$. Therefore, the average number of pages per block in block data layout is given by:

$$(n + 1)(1 - f) + (n + 2)f = n + f + 1 = k + 1. \quad □$$
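For example (our numbers, with $P_v = 1024$ elements and $B = 40$), $k = \frac{1600}{1024} = 1.5625$, so $n = 1$ and $f = 0.5625$:

$$(1 - f)(n + 1) + f(n + 2) = 0.4375 \cdot 2 + 0.5625 \cdot 3 = 2.5625 = k + 1.$$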
Theorem 3. Assume that an $N \times N$ array is stored using block data layout. For tiled row and column accesses, the total number of TLB misses is $\left(2 + \frac{1}{k}\right)\frac{N^2}{P_v}$.

Proof. Blocks in block data layout are arranged in row-major order. So, a page overlaps between two consecutive blocks that are in the same row, and the page is accessed continuously. The number of TLB misses caused by all tiled row accesses is thus $\frac{N^2}{P_v}$, which is the minimum number of TLB misses. However, no page overlaps between two consecutive blocks in the same column. Therefore, each block along the same column goes through $k + 1$ different pages according to Lemma 1. The number of TLB misses caused by all tiled column accesses is thus $(k + 1)\frac{N^2}{B^2} = \left(1 + \frac{1}{k}\right)\frac{N^2}{P_v}$. Therefore, the total number of TLB misses caused by all row and all column accesses is $\left(2 + \frac{1}{k}\right)\frac{N^2}{P_v}$. □

For tiled access, the number of TLB misses using canonical layout is $\frac{N^2}{P_v} + \frac{N^2}{B}$ (Section 3.1). Using Theorem 3, compared with canonical layout, block data layout reduces the number of TLB misses by a factor of $O\left(\frac{P_v}{B}\right) = O(\sqrt{P_v})$.
To verify our analysis, simulations for tiled row and column accesses were performed using the SimpleScalar simulator. The simulation parameters are the same as those in Section 2. A $32 \times 32$ block size was considered; with double-precision data, the block size is the same as the page size ($k = 1$). Table 2 compares the number of TLB misses with our estimates. The results match the analysis, except for problem size 2,048. This is a special case in which each block starts on a new page.
A similar analytical result can be derived for real applications. Consider the 5-loop TMM with canonical layout in Fig. 2a. Array $Y$ is accessed in a tiled row pattern. On the other hand, arrays $X$ and $Z$ are accessed in a tiled column pattern. A tile of each array is used in the inner loops $(i, k, j)$. The number of TLB misses for each array is equal to the average number of pages per tile multiplied by the number of tiles accessed in the outer loops $(kk, jj)$. The average number of pages per tile is $B + \frac{B^2}{P_v}$. Therefore, the total number of TLB misses is given by approximately

$$\left(B + \frac{B^2}{P_v}\right)\left(\frac{N}{B}\right)^2\left(\frac{2N}{B} + 1\right) = O\!\left(\frac{N^3}{B^2}\right).$$

Consider the 6-loop TMM on block data layout shown in Fig. 2b. A $B \times B$ tile of each array is accessed in the inner loops $(i, k, j)$ with block layout. The number of TLB misses for each array is equal to the average number of pages per block multiplied by the number of blocks accessed in the outer loops $(ii, kk, jj)$. According to Lemma 1, the average number of pages per block is $k + 1$. Therefore, the total number of TLB misses is given by:

$$T_M = 3(k + 1)\left(\frac{N}{B}\right)^3. \qquad (1)$$
Compared with the 5-loop TMM with canonical layout, TLB misses decrease by a factor of $O(B)$ using the 6-loop TMM. Note that the 6-loop TMM uses block data layout. To verify our TLB miss estimation, simulations of the 6-loop TMM were performed. The problem size was fixed at $1024 \times 1024$. Simulation parameters were the same as those in Section 2. Fig. 5 compares our estimates (given by (1)) with the simulation results. Fig. 6 shows that block data layout reduces TLB misses considerably compared with tiling alone.
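For instance (our numbers), evaluating (1) with $N = 1024$, $B = 32$, and $P_v = 1024$ (so $k = 1$) gives

$$T_M = 3 \cdot 2 \cdot \left(\frac{1024}{32}\right)^3 = 196{,}608,$$

roughly an order of magnitude fewer misses than the corresponding 5-loop estimate above ($\frac{2B}{3(k+1)} \approx 11$ for these parameters).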
Cache Performance
For a given cache size, tiling transforms the loop nest so that temporal locality can be better exploited. This reduces capacity misses. However, since most state-of-the-art architectures have direct-mapped or small set-associative caches, tiling can suffer from considerable conflict misses that degrade the overall performance. Fig. 7a illustrates cache conflict misses. These conflict misses are determined by cache parameters such as cache size, cache line size, and set-associativity, and by runtime parameters such as array size and block size. The performance of tiled computations is thus sensitive to these runtime parameters.
If the data layout is reorganized from a canonical layout to a block layout (assuming the tile size is the same as the block size) before the tiled computation starts, the entire data accessed during a tiled computation is localized in a block. As shown in Fig. 7b, self-interference misses do not occur if the block is smaller than the cache, since all elements in a block are stored in contiguous memory locations.
In general, cache miss analysis for a direct-mapped cache with canonical layout is complicated because the self-interference misses cannot be quantified easily. Cache performance analysis of tiled algorithms was discussed in [12], where the cache performance of tiling with the copying optimization was also presented. We observe that the behavior of cache misses for tiled access patterns on block layout is similar to that of tiling with copying on canonical layout. We have derived the total number of cache misses for the 6-loop TMM (which uses block data layout); the detailed derivation can be found in [17]. For the $i$th level cache with line size $L_{ci}$ and cache size $S_{ci}$, and for $B^2 \le S_{ci}$, the total number of cache misses ($CM_i$) is approximately:

$$CM_i = \frac{2N^3}{BL_{ci}} + \frac{N^2}{L_{ci}}. \qquad (2)$$
To verify the cache miss estimations, we conducted simulations using SimpleScalar for the 6-loop TMM with block data layout. The problem size was fixed at $1024 \times 1024$. A 16KByte direct-mapped cache was assumed (similar to the L1 data cache in UltraSparc II). Fig. 8 compares our estimated values (given by (2)) with the simulation results.
Block Size Selection
To test the effect of block size, experiments were performed on several platforms. Fig. 9 shows the execution time of a 6-loop TMM of size $1024 \times 1024$ on UltraSparc II (400 MHz) as a function of block size. It can be observed that block size selection is significant for achieving high performance. With canonical data layout, the tiling technique is sensitive to problem and tile sizes. Several GCD-based tile size selection algorithms [6], [8], [12] have been proposed to optimize tiled computation. However, their performance is still sensitive to the problem size. In [12], TLB and cache performance were considered in concert. This approach showed better performance than algorithms that consider the cache or the TLB separately. However, all these approaches are based on canonical data layout. On the other hand, ATLAS [27] utilizes block data layout. However, the best block size is determined empirically at compile time by evaluating the actual performance of the code over a wide range of block sizes.
In a multilevel memory hierarchy system, it is difficult to predict the execution time ($T_{exe}$) of a program. However, $T_{exe}$ is proportional to the total miss cost of the TLB and caches. In order to minimize $T_{exe}$, we evaluate and minimize the total miss cost for the TLB and all $l$ levels of cache:

$$MC = \sum_{i=1}^{l} CM_i \cdot H_{i+1} + TM \cdot M_{tlb}, \qquad (3)$$

where $MC$ denotes the total miss cost, $CM_i$ is the number of misses in the $i$th level cache, $TM$ is the number of TLB misses, $H_i$ is the cost of a hit in the $i$th level cache, and $M_{tlb}$ is the penalty of a TLB miss. The $(l+1)$th level cache is the main memory. It is assumed that all data reside in the main memory ($CM_{l+1} = 0$). Using the derivative of $MC$ with respect to the block size, we can find the optimal block size that minimizes the overall miss cost. For a simple 2-level memory hierarchy that consists of only one level of cache and a TLB, the total miss cost (denoted $MC_{tc1}$) in (3) reduces to:

$$MC_{tc1} = CM_1 \cdot H_2 + TM \cdot M_{tlb}, \qquad (4)$$

where $H_2$ is the access cost of main memory. In the above estimation, $TM$ and $CM_1$ are substituted with (1) and (2), respectively. Using the derivative of $MC_{tc1}$ with respect to $B$, the optimal block size ($B_{tc1}$) that minimizes the total miss cost caused by L1 cache and TLB misses can be determined.
We now extend this analysis to determine a range for the optimal block size in a multilevel memory hierarchy that consists of a TLB and two levels of cache. The miss cost is classified into two groups: the miss cost caused by TLB and L1 cache misses, and the miss cost caused by L2 misses. Figs. 10a and 10b show the miss cost estimated through (1) and (2). Fig. 10a shows the separate TLB, L1, and L2 miss costs, using UltraSparc II parameters. Fig. 10b shows the variation of the estimated total miss cost as the ratio between the L1 cache miss penalty ($H_2$) and the L2 cache miss penalty ($H_3$) varies. Using (4), we discuss the total miss cost for three ranges of block size.

Lemma 2. For $B < B_{tc1}$, $MC(B) > MC(B_{tc1})$.
Proof. Using the derivatives of the TLB and cache miss estimates ((1) and (2)), it can be easily verified that $\frac{dMC_{tc1}}{dB} < 0$ and $\frac{dCM_2}{dB} < 0$ for $B < B_{tc1}$. This is shown in Fig. 10a. For $B < B_{tc1}$, the TLB, L1, and L2 miss costs all increase as the block size decreases, thereby increasing the total miss cost. Therefore, the optimal block size cannot be in the range $B < B_{tc1}$. □
Lemma 3. For $B > \sqrt{S_{c1}}$, $MC(B) > MC(\sqrt{S_{c1}})$.

Proof. In the range $B > \sqrt{S_{c1}}$, the TLB miss cost is optimized by tiling and block data layout; the change in TLB miss cost is negligible as the block size increases. Since the block is larger than the L1 cache, self-interference misses occur in this range, and the number of L1 cache misses drastically increases as shown in Fig. 10a. For $\sqrt{S_{c1}} < B < \sqrt{2S_{c1}}$, the ratio of the derivatives of (2) for L1 and L2 misses satisfies

$$\left|\frac{dCM_1/dB}{dCM_2/dB}\right| > \frac{H_3}{H_2}$$

in a general memory hierarchy system. Therefore,

$$\left|\frac{dCM_1}{dB}\right| \cdot H_2 > \left|\frac{dCM_2}{dB}\right| \cdot H_3.$$

Thus, although the number of L2 cache misses decreases ($\frac{dCM_2}{dB} < 0$), the total miss cost increases for $\sqrt{S_{c1}} < B < \sqrt{2S_{c1}}$ because the increase in L1 cache miss cost is larger than the decrease in L2 cache miss cost. For $B \ge \sqrt{2S_{c1}}$, there is no reuse in the L1 cache; thus, the L1 cache miss cost saturates. Fig. 10b shows the change in the total miss cost as the ratio $\frac{H_3}{H_2}$ varies. Even when the L2 miss penalty is 40 times that of the L1 miss penalty, $MC(B) > MC(\sqrt{S_{c1}})$ for $B > \sqrt{S_{c1}}$. Therefore, the optimal block size cannot be in the range $B > \sqrt{S_{c1}}$. □
Theorem 4. The optimal block size that minimizes the total miss cost lies in the range

$$B_{tc1} \le B \le \sqrt{S_{c1}}. \qquad (5)$$

Proof. This follows from Lemmas 2 and 3. Therefore, an optimal block size that minimizes the total miss cost is located in the range given by (5). We select a block size that is a multiple of $L_{c1}$ (the L1 cache line size) in this range. □
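A minimal sketch of this selection procedure is given below. It assumes the miss-cost model of (1), (2), and (4) as reconstructed above; all identifiers are ours, and a production version would use the full multilevel cost (3):

```c
#include <math.h>
#include <stddef.h>

typedef struct {          /* architecture parameters */
    size_t s_c1;          /* L1 cache size (elements)  */
    size_t l_c1;          /* L1 line size (elements)   */
    size_t p_v;           /* page size (elements)      */
    double h2;            /* L1 miss penalty (cycles)  */
    double m_tlb;         /* TLB miss penalty (cycles) */
} arch_t;

/* TLB misses of the 6-loop TMM, following (1): 3(k+1)(N/B)^3, k = B^2/P_v. */
static double tlb_misses(double n, double b, const arch_t *a)
{
    double k = b * b / (double)a->p_v;
    return 3.0 * (k + 1.0) * pow(n / b, 3.0);
}

/* L1 misses, following the capacity/cold terms of (2). */
static double l1_misses(double n, double b, const arch_t *a)
{
    return 2.0 * n * n * n / (b * (double)a->l_c1)
         + n * n / (double)a->l_c1;
}

/* Scan multiples of the L1 line size in [B_tc1, sqrt(S_c1)], per (5),
 * and return the block size minimizing the modeled miss cost (4). */
size_t select_block_size(size_t n, size_t b_tc1, const arch_t *a)
{
    size_t b_hi = (size_t)sqrt((double)a->s_c1);
    size_t b = ((b_tc1 + a->l_c1 - 1) / a->l_c1) * a->l_c1; /* round up */
    size_t best_b = b;
    double best_mc = HUGE_VAL;
    for (; b <= b_hi; b += a->l_c1) {
        double mc = l1_misses((double)n, (double)b, a) * a->h2
                  + tlb_misses((double)n, (double)b, a) * a->m_tlb;
        if (mc < best_mc) { best_mc = mc; best_b = b; }
    }
    return best_b;
}
```

With a 16KByte L1 cache holding 2,048 double-precision elements, $\sqrt{S_{c1}} \approx 45$, consistent with the 36-44 range for UltraSparc II discussed below.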
To verify our approach, we conducted simulations using UltraSparc II parameters (Table 3). Fig. 11 shows the simulation results for the 6-loop TMM using block data layout. As discussed, the number of TLB and L2 misses decreases as the block size increases. Also, the minimum number of L1 misses was obtained for $B = 36$; the number of L1 misses then drastically increases for $B > 45$. Fig. 12 shows the total miss cost. We also tested ATLAS on UltraSparc II. Through a wide search ranging from 16 to 44, ATLAS found 36 and 40 as the optimal block sizes. These block sizes lie in the range given by (5). We further tested the 6-loop TMM with respect to different problem and block sizes. For each problem size, we performed experiments with block sizes ranging from 8 to 80. In these tests, we found that the optimal block size for each problem size was in the range given by (5), as shown in Fig. 13. These experiments confirm that our approach proposes a reasonably good range for block size selection.
EXPERIMENTAL RESULTS
To verify our analysis, we performed simulations and experiments on the following applications: tiled matrix multiplication (TMM), LU decomposition, and Cholesky factorization (CF). The performance of tiling with block data layout (tiling + BDL) is compared with other optimization techniques: tiling with copying (tiling + copying) and tiling with padding (tiling + padding). For tiling + BDL, the tile size (of the tiling technique) is chosen to be the same as the block size of the block data layout. Input and output are in canonical layout. All the cost of performing data layout transformations (from canonical layout to block data layout and vice versa) is included in the reported results. As stated in [12], we observed that the copying technique cannot be applied efficiently to the LU and CF applications, since the copying overhead offsets the performance improvement. Hence, we do not consider tiling + copying for these applications. In all our simulations and experiments, the data elements are double-precision.
Simulations
To show the performance improvement of TLB and caches using tiling + BDL, simulations were performed using the SimpleScalar simulator [2]. The problem size was $1024 \times 1024$. Two sets of architecture parameters were used: UltraSparc II and Pentium III. The parameters are listed in Table 3.
Fig. 14 compares the TMM simulations of different techniques, based on UltraSparc II parameters. Tiling + BDL reduces L1 and L2 cache misses as well as TLB misses. Block size 32 leads to increased L1 and L2 cache misses for block data layout because of cache conflicts between different blocks. Tiling + BDL reduced TLB misses by 91-96 percent, as shown in Fig. 14a. This confirms our analysis presented in Section 3.1. Fig. 15 shows the total miss cost (calculated from (3)) for TMM using block size $40 \times 40$. L1, L2, and TLB miss penalties were assumed to be 6, 24, and 30 cycles, respectively. This figure shows that tiling + BDL results in the smallest total miss cost and that the TLB miss cost with tiling + BDL is negligible compared with the L1 and L2 miss costs. Though tiling + BDL has more L2 cache misses than tiling + padding, its total miss cost is smaller. Fig. 12 shows the effect of block size on the total miss cost for TMM using tiling + BDL. As discussed in Section 3.3, the best block size (44) is in the range 36-44 suggested by our approach. Fig. 16 presents simulation results for LU using Intel Pentium III parameters. Similar to TMM, the number of TLB misses for tiling + BDL was almost negligible compared with that for tiling + padding, as shown in Fig. 16a. For both techniques, L1 and L2 cache misses were reduced considerably because of the 4-way set-associativity. For tiling + padding, when the block size was larger than the L1 cache size, the padding algorithm in [16] suggested a pad size of 0. There is then essentially no padding effect, thereby drastically increasing L1 and L2 cache misses. Fig. 17 shows the effect of block size on the total miss cost using tiling + padding and tiling + BDL. Tiling + padding reduced L1 and L2 cache miss costs considerably. However, the TLB miss cost was still significantly high, affecting the overall performance. As discussed in Section 3.3, the suggested range for the optimal block size is 32-44. The simulations validate that the optimal block size achieving the smallest miss cost lies in the range selected by our approach.
Execution on Real Platforms
To verify our block size selection and the performance improvements using block data layout, we performed experiments on several platforms, as tabulated in Table 3. The gcc compiler was used in these experiments, with the optimization flags set to "-fomit-frame-pointer -O3 -funroll-loops." Execution time was the user processor time measured by the system call clock(). All the data reported here are the average of 10 executions. The problem sizes ranged from $1000 \times 1000$ to $1600 \times 1600$. Fig. 18 compares the execution time of tiling + BDL with other techniques. Tiling + TSS (the tile size selection algorithm in [6]) shown in this figure selects the block size based on a GCD computation. Tiling solves the cache capacity miss problem, but it cannot avoid conflict misses. Conflict misses are strongly related to the problem size and block size, which makes tiling sensitive to the problem size. As discussed in Section 3.2, block data layout greatly reduces conflict misses, resulting in smoother performance compared with the other techniques. As shown in Figs. 19, 20, and 21, the optimal block sizes for Pentium III, UltraSparc II, Sun UltraSparc III, and Alpha 21264 are 40, 44, 76, and 76, respectively. All these numbers are in the range given by our block size selection algorithm. For example, the range for the best block size on Alpha 21264 is 64-78. This confirms that our block size selection algorithm proposes a reasonable range. As discussed earlier, block sizes 32 and 64 should be avoided (for use with block data layout) because performance degrades due to conflict misses between blocks.
Figs. 22, 23, and 24 show the execution time comparison of tiling + BDL with tiling + copying and tiling + padding. In these figures, the block size for tiling + BDL was given by our algorithm discussed in Section 3.3. The tile size for the copying technique was given by the approach in [12]. The pad size was selected by the algorithm discussed in [16]. The tiling + BDL technique is faster than the other optimization techniques for almost all problem sizes and on all the platforms.
Block Data Layout and Morton Data Layout
Recently, nonlinear data layouts have been considered to improve memory hierarchy performance. One such layout is the Morton data layout (MDL) defined in Section 2.1. Similar to block data layout, elements within each block are mapped onto contiguous memory locations. However, Morton data layout uses a different order to map blocks, as shown in Fig. 1. This order matches the access pattern of recursive algorithms. In this section, we compare the performance of recursive algorithms using MDL (recursive + MDL) with iterative tiled algorithms using BDL (iterative + BDL), for matrix multiplication and LU decomposition. We show that the performance of recursive + MDL is comparable with that of iterative + BDL when the base block size of MDL falls inside the optimal range given by our approach; however, the recursive subdivision restricts MDL block sizes to $N/2^d$ for recursion depth $d$, which can force the block size outside this range. Our experimental results show that this degrades the overall performance.
Experiments using TMM and LU were performed on UltraSparc II and Pentium III. Table 4 shows the execution time comparison of MM using iterative + BDL with recursive + MDL. For iterative + BDL, we selected the block size according to the algorithm discussed in Section 3.3. For recursive + MDL, we tested various recursion depths (resulting in various basic block sizes) and used the best for comparison. For problem sizes $1280 \times 1280$ and $1408 \times 1408$, the optimal block sizes for recursive + MDL were 40 and 44, respectively, which are in the range given by our algorithm, 36-44. Both layouts showed competitive performance in these cases. For problem size $1600 \times 1600$, recursive + MDL was up to 15.8 percent slower than iterative + BDL. Among the possible choices of 25, 50, and 100, the performance of recursive + MDL was optimized at block size 25, where $25 = 1600/2^6$. The slowdown arises because this block size is outside the optimal range specified by our algorithm. Table 5 shows the execution time comparison of tiled LU decomposition using BDL and recursive LU decomposition [28] using MDL. These results confirm our analysis.
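The block-size constraint is visible directly in these numbers: a recursion depth $d$ forces a base block size of $N/2^d$, so

$$\frac{1280}{2^5} = 40, \qquad \frac{1408}{2^5} = 44, \qquad \frac{1600}{2^4} = 100, \quad \frac{1600}{2^5} = 50, \quad \frac{1600}{2^6} = 25;$$

for $N = 1600$, no recursion depth yields a block size inside the 36-44 range, whereas BDL can pick any block size in the range.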
CONCLUDING REMARKS
This paper studied a critical problem in understanding the performance of algorithms on state-of-the-art machines that employ a multilevel memory hierarchy. We showed that using block data layout, TLB misses as well as cache misses are reduced considerably. Further, we proposed a tight range for block size using our performance analysis. Our analysis matches closely with simulation-based as well as experimental results. This work is part of the Algorithms for Data IntensiVe Applications on Intelligent and Smart MemORies (ADVISOR) Project at USC [1]. In this project, we focus on developing algorithmic design techniques for mapping applications to architectures. Through this, we aim to understand and create a framework for application developers to exploit features of advanced architectures to achieve high performance.
