Tile-size selection is known to be a complex problem. This paper develops a new selection algorithm. Unlike previous algorithms, this new algorithm considers the effect of loop skewing on cache misses. It also estimates loop overhead and incorporates it into the execution cost model, which turns out to be critical to the decision between tiling a single loop level vs. tiling two loop levels. Our preliminary experimental results show a significant impact of these previously ignored issues on the execution time of tiled loops. In our experiments, we measured the cache miss rate and the execution time of five benchmark programs on a single processor and we compared our algorithm with previous algorithms. Our algorithm achieves an average speedup of 1.27 to 1.63 over all the other algorithms.
Introduction
Memory access latency has become the key performance bottleneck on modern microprocessors. In order to reduce the average memory reference latency, it is important to exploit data locality such that most memory references can be served by the fast memory, e.g. the cache, in the memory hierarchy. Tiling is a well-known compiler technique to enhance data locality such that more data can be reused before they are replaced from the cache [23]. Tiling transforms a loop nest by combining strip-mining and loop interchange. Loop skewing and loop reversal are often used to enable tiling [20]. Figure 1 shows SOR relaxation as an example.

Much previous work on tiling applies to perfectly-nested loops only [8, 20, 21, 23]. Recently, we proposed a new technique to tile a class of imperfectly-nested loops [17, 18]. Performance of a tiled loop nest can vary dramatically with different tile sizes [9]. How to select proper tile sizes is hence an important issue. In this paper, if loop skewing is applied before tiling, such a tiling is called skewed tiling. Non-skewed tiling results if loop skewing is not necessary for tiling. All previous work tacitly assumes non-skewed tiling [4, 6, 9, 12, 16, 22]. However, such an assumption may not be true, especially for loops which perform iterative relaxation computations [17, 18]. Another important factor ignored in previous work is the loop overhead in terms of the increased instruction counts due to the increased loop levels. Further, tiling a software-pipelined loop will also increase the dynamic count of load instructions. In this paper, we shall show that these previously ignored factors can have a significant effect on tile-size selection.

* This work is sponsored in part by National Science Foundation through grants CCR-9975309 and MIP-9610379, by Indiana 21st Century Fund, by Purdue Research Foundation, and by a donation from Sun Microsystems, Inc.
In our recent work [17], we present a memory cost model to estimate cache misses, assuming that only one loop level is tiled. In this paper, we present a more general scheme by considering two loop levels which may both be tiled. We present an algorithm to compute tile sizes such that during each tile traversal, capacity misses and self-interference misses are eliminated. Further, cross-interference misses are eliminated through array padding [15]. Given a tile size, we model the tiling cost based on both the number of cache misses and the loop overhead. To choose between tiling one loop level vs. tiling two loop levels, our algorithm computes their lowest costs and the respective tile sizes. We then choose the tiling level, and the corresponding best tile size, which yields the lowest cost. One can easily extend our discussion to higher loop levels, but such an extension does not seem useful for applications known to us.
In this paper, we consider data locality and performance enhancement on a single processor whose memory hierarchy includes cache memories at one or more levels. We have applied our tile-size selection algorithm to five numerical kernels, SOR, Jacobi, Livermore Loop No. 18 (LL18), tomcatv and swim, using a range of matrix sizes. We evaluate our algorithm on one processor of an SGI multiprocessor and on a SUN uniprocessor workstation. We compare our algorithm with TLI [3], TSS [4], LRW [9] and DAT [13]. Experiments show that our algorithm achieves an average speedup of 1.27 to 1.63 over all these previous algorithms.
In the rest of the paper, we first present a background in Section 2. We then present our memory cost model in Section 3. We model the execution time and present our tile-size selection algorithm in Section 4. We discuss related work in Section 5. In Section 6, we report experimental results and compare our algorithm with previous algorithms. Finally, we conclude in Section 7.
Background
In this section, we first define our program model and a few key parameters. We then discuss the issues of the memory hierarchy.
Tiling
Most previous research on tiling addresses perfectly-nested loops only [8, 20, 21, 23]. After tiling, the loops remain perfectly-nested. In our recent work [17, 18], we perform tiling on a class of imperfectly-nested loops.
Figure 2(a) shows the untiled program. After tiling, loop T is called the tile-sweeping loop, and loops JJ and II are called the tile-controlling loops [20]. Each combination of JJ and II defines a tile traversal. Two tiles are said to be consecutive within a tile traversal if the difference of the corresponding T values equals 1. In this paper, we assume the data dependences permit both 1-D and 2-D tiling. Choosing between 1-D and 2-D tiling will depend on the estimate of cache misses and loop overhead. As far as estimating cache misses is concerned, 1-D tiling can be viewed as a special case of 2-D tiling with the maximum tile height. However, 2-D tiling incurs higher loop overhead, which we want to take into account.
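The roles of the tile-controlling and tile-sweeping loops can be illustrated with a minimal sketch of skewed 1-D tiling. The kernel below is a hypothetical one-dimensional in-place relaxation, not one of the paper's benchmarks, and the names `relax_untiled`, `relax_skewed_tiled`, `b1` and `s1` are assumptions chosen to mirror $B_1$ and $S_1$. The tiled version is meant to reproduce the untiled result exactly, because the skewed index `js = j + s1 * t` makes every dependence distance non-negative.

```python
def relax_untiled(a, itmax):
    """Reference: in-place 3-point relaxation repeated itmax times."""
    a = list(a)
    n = len(a)
    for t in range(itmax):
        for j in range(1, n - 1):
            a[j] = (a[j - 1] + a[j] + a[j + 1]) / 3.0
    return a


def relax_skewed_tiled(a, itmax, b1, s1=1):
    """Skewed 1-D tiling of the same loop nest (assumes s1 >= 1).

    jj is the tile-controlling loop over the skewed index js = j + s1*t,
    t is the tile-sweeping loop, and b1 plays the role of the tile width.
    """
    a = list(a)
    n = len(a)
    js_max = (n - 2) + s1 * (itmax - 1)      # largest skewed index used
    for jj in range(1, js_max + 1, b1):      # tile-controlling loop
        for t in range(itmax):               # tile-sweeping loop
            lo = max(jj, 1 + s1 * t)
            hi = min(jj + b1 - 1, (n - 2) + s1 * t)
            for js in range(lo, hi + 1):
                j = js - s1 * t              # undo the skewing
                a[j] = (a[j - 1] + a[j] + a[j + 1]) / 3.0
    return a
```

Each consecutive tile within one `jj` traversal is shifted by `s1` columns, which is the shift that the memory cost model in Section 3 accounts for.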
Let $\gamma_1 = \min\{L_{i1} \mid 1 \le i \le m\}$, $\gamma_2 = \max\{U_{i1} \mid 1 \le i \le m\}$, $\eta_1 = \min\{L_{i2} \mid 1 \le i \le m\}$ and $\eta_2 = \max\{U_{i2} \mid 1 \le i \le m\}$. We call $S_1$ and $S_2$ the skewing factors corresponding to the $J_i$ and $I_i$ loops respectively. (The skewing factors are also called the slope in our previous work [17, 18].) If $S_1 = 0$, then loop skewing is not applied before tiling at the $J_i$ level. In this paper, we are interested only in skewed tiling at least at the $J_i$ level, thus $S_1 > 0$. $B_1$ is called the tile width and $B_2$ is called the tile height. $B_1$ and $B_2$ are called the tile size collectively. These parameters are used to define the bounds of the tile-controlling loops. For reference, Table 1 lists all the symbols used in this paper and their brief descriptions.
For simplicity, we assume all arrays are of two dimensions with the same column sizes. (We assume column-major storage.) Lower dimension variables can be ignored due to their lesser impact on cache misses in the relaxation programs which we are interested in. Let $n_a$ be the number of two-dimensional arrays for the given tiled loop nest. Within the innermost loop $I_i$, $1 \le i \le m$, of the untiled program in Figure 2(a), we assume array subscript patterns of $A_k(I_i + a, J_i + b)$, $1 \le k \le n_a$, where $a$ and $b$ are known integer constants.
Memory Hierarchy
The memory hierarchy includes registers, cache memories at one or more levels, the main memory and the secondary storage, as well as the TLB [7] .
The TLB translates a virtual address into a physical address. The TLB has two key parameters, namely the block count $T_c$ and the block size $T_b$. We call $T_s = T_c T_b$ the TLB size. In this paper, $T_b$ is the size of the virtual memory represented by each TLB entry in the number of data elements. We assume a fully-associative TLB with an LRU replacement policy.

For simplicity of presentation, we consider two levels of caches in this paper, namely the L1 and L2 caches, which are common in current practice. The L1 cache has several parameters, namely the cache size $C_{s1}$, the cache block size $C_{b1}$ and the set associativity $C_{a1}$. $C_{s1}$ and $C_{b1}$ are measured in the number of data elements. Similarly for the L2 cache, the cache size, cache block size and set associativity are $C_{s2}$, $C_{b2}$ and $C_{a2}$ respectively. The cache misses can be divided into three classes [7]: compulsory misses, capacity misses and conflict misses. Conflict misses can be attributed to self-interference misses of the same array and to cross-interference misses between different arrays.

A Memory Cost Model

In this section, we want to estimate the number of cache misses incurred by executing the loop nest in our program model after tiling.

Let $S_0$ represent the iteration space defined by $\gamma_1 \le J_i \le \gamma_2$ and $\eta_1 \le I_i \le \eta_2$ in Figure 2(a). (For simplicity, we also regard $S_0$ as the original iteration space defined by the $J_i$ and $I_i$ loops in Figure 2(a), as if all $J_i$ loops have the same loop bounds and all $I_i$ loops have the same loop bounds.) $S_0$ is illustrated in Figure 3(a) by the rectangle enclosed by the solid lines with the height $\eta$ and the width $\gamma$. Within each tile traversal, we define the base tile to be a tile with $T = 1$ and an advanced tile to be a tile with $T > 1$. The cache misses incurred by one tile traversal can be partitioned into those within the base tile and those within the advanced tiles. Note that only those base tiles and advanced tiles overlapping with $S_0$ will be executed, thus only they can contribute to the cache misses. In Figure 3(a), the base tile in the tile traversal tt1 resides outside $S_0$, while the base tile in tt2 resides within $S_0$.

We make the following two assumptions in our estimation of the number of cache misses:
• Assumption 1: There exists no cache reuse between different tile traversals.

Assumption 1 is reasonable if ITMAX is large, since it will be very likely for a tile traversal to overwrite cache lines whose old data could have been reused in the next tile traversal. Assumption 2 is reasonable because a large $B_1$ can easily cause an overflow in the TLB. As explained later in Section 4, our algorithm poses a constraint on $B_1$ such that the TLB should not overflow. If the tile size $(B_1, B_2)$ is chosen properly, there should be exactly one cache miss for each cache line accessed within a tile traversal. To be more specific, the following two properties should hold:
Property 1: No capacity and self-interference misses are generated within a tile traversal.
Property 2: No cross-interference misses are generated within a tile traversal.
In Section 4.2, we shall discuss how to preserve the above properties. For now, we assume they hold.
We first show how to compute the number of L1 cache misses caused by an advanced tile. Let $W$ represent the size of the data set accessed by the original loop nest in terms of the number of data elements. The average size of the data accessed by one tile is estimated to be $D = \frac{W}{\gamma\eta} B_1 B_2$. Figure 3(b) shows two consecutive tiles, tt3 and tt4, within a tile traversal, assuming that both tiles reside within $S_0$. The iteration subspace of tt4 is produced by shifting the iteration subspace of tt3 upwards by $S_2$ iterations and to the left by $S_1$ iterations. The L1 cache misses in tt4 occur either in Region ABCD or in Region DEFG. The total estimated L1 cache misses equal

$$(S_1 B_2 + S_2 B_1 - S_1 S_2) \cdot \frac{W}{\gamma\eta C_{b1}}.$$

(This estimate may not be exact because data accessed at the lower border of Region DEFG may or may not be in the cache already.)

We then show how to accumulate the number of L1 cache misses for all the tile traversals with the same JJ value. Figure 3(c) illustrates the idea. For a particular JJ value, let $t_1$, $t_2$, $t_3$ and $t_4$ be the base tiles of four tile traversals, and let $t_1'$, $t_2'$, $t_3'$ and $t_4'$ be the corresponding advanced tiles when $T$ increases by 1. In this particular illustration, the number of L1 cache misses caused collectively by $t_i$ ($1 \le i \le 4$) equals the sum of the number of L1 cache misses caused by each individual $t_i$, that is, $\frac{W B_1}{\gamma C_{b1}}$. Note that only the tiles overlapping with $S_0$ can contribute to L1 cache misses. Similarly, the number of L1 cache misses caused by the advanced tiles $t_i'$ ($1 \le i \le 4$) equals the sum of the number of L1 cache misses caused by each individual $t_i'$, that is, $\frac{W S_1}{\gamma C_{b1}} + 2 (B_1 - S_1) S_2 \cdot \frac{W}{\gamma\eta C_{b1}}$.

In general, the number of L1 cache misses caused by the advanced tiles with the same JJ value equals $\frac{W S_1}{\gamma C_{b1}} + r (B_1 - S_1) S_2 \cdot \frac{W}{\gamma\eta C_{b1}}$, where $r$ is the number of base tiles in $S_0$ for a particular JJ value.
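As a rough numerical check of the advanced-tile estimate, the sketch below evaluates the per-tile expression $(S_1 B_2 + S_2 B_1 - S_1 S_2) \cdot W / (\gamma\eta C_{b1})$ and the per-JJ accumulation. The expressions follow our reading of the partly garbled source formulas, so they should be treated as an assumption; the function and parameter names are hypothetical.

```python
def advanced_tile_l1_misses(w, gamma, eta, b1, b2, s1, s2, cb1):
    """Estimated L1 misses of one advanced tile lying fully inside S0.

    w / (gamma * eta) is the average number of data elements per iteration;
    the newly touched region has area s1*b2 + s2*b1 - s1*s2.
    """
    new_area = s1 * b2 + s2 * b1 - s1 * s2
    return new_area * w / (gamma * eta * cb1)


def advanced_tiles_per_jj(w, gamma, eta, b1, s1, s2, cb1, r):
    """Estimated L1 misses of the advanced tiles sharing one JJ value,
    where r is the number of base tiles in S0 for that JJ value."""
    return s1 * w / (gamma * cb1) + r * (b1 - s1) * s2 * w / (gamma * eta * cb1)
```

When the $r$ tiles exactly cover the height, that is $r B_2 = \eta$, summing the per-tile estimate over the $r$ tiles gives the per-JJ expression, which is a useful sanity check on the reconstruction.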
The value $\eta + S_2 (\text{ITMAX} - 1)$ is the maximum height of the iteration space after tiling. Any $B_2$ value greater than or equal to $\eta + S_2 (\text{ITMAX} - 1)$ results in no tiling at the $I_i$ loop level. With Assumptions 1 and 2, we can then accumulate L1 cache misses corresponding to different JJ values by considering three different cases:

Case 1: [condition illegible in the source]. This case is illustrated by Figure 4(a). In this case, the tile traversals defined by $JJ < \gamma_1 + \gamma - \text{NSTEP}$ will not execute to the ITMAX-th T-iteration. The tile traversal defined by $\gamma_1 + \gamma - \text{NSTEP} < JJ \le \gamma_1 + \gamma$ is the first to reach the ITMAX-th T-iteration. The tile traversals defined by $JJ > \gamma_1 + \gamma$ will start executing at $T > 1$. During the execution, the tile traversals defined by $JJ = \gamma_1$ will incur L1 cache misses of [expression illegible in the source]. The tile traversals defined by $JJ = \gamma_1 + B_1$ will incur L1 cache misses of [expression illegible in the source]. The L1 cache misses are then accumulated over three ranges of JJ values, one of which is $\gamma_2 - B_1 < JJ \le \gamma_2$. Adding up the three numbers, the total L1 cache misses in the tiled loop nest approximate [expression illegible in the source].

Case 2: [condition illegible in the source]. This case is illustrated by Figure 4(b). Similar to the computation in Case 1, the L1 cache misses are accumulated over three ranges of JJ values, the first of which is $JJ \le \gamma_2$. Adding up the three numbers, the total L1 cache misses in the tiled loop nest approximate [expression illegible in the source].

Case 3: [condition illegible in the source]. Similar to Case 2, the total L1 cache misses in the tiled loop nest approximate [expression illegible in the source].

Combining the above three cases and plugging in the estimate of $r$, the total number of L1 cache misses is approximately

[expression illegible in the source] (1)

Similarly, with Properties 1 and 2 standing, the number of L2 cache misses for 2-D tiling is approximately

[expression illegible in the source] (2)

With 1-D tiling (Figure 2(c)), the L1 cache temporal locality is not exploited across the T-loop iterations. The number of L1 cache misses is approximately

$$\text{ITMAX} \cdot \frac{W}{C_{b1}}. \quad (3)$$

The total number of cache misses for the L2 cache is approximately

$$\frac{W S_1 (\text{ITMAX} - 1)}{C_{b2} B_1}. \quad (4)$$
Tile-Size Selection
In this section, we first present an execution cost model for tiling with a given tile size, based on both the number of cache misses and the loop overhead. We then present our tile-size selection algorithm, followed by a running example to go through our algorithm.
An Execution Cost Model for Tiling
Loop tiling introduces loop overhead. To decide between 1-D tiling and 2-D tiling, the overhead of the tiled $I_i$ loops in Figure 2 must be taken into account. [Formulas (5) and (6), which express this overhead in terms of $n_1$ and $n_2$, are illegible in the source.]
From (5) and (6), if $n_1$ and $n_2$ are approximately equal, then a small $B_2$ will introduce large loop overhead. Let $n_3$ be the sum of the static number of instructions for the computation of all the $J_i$ loop bounds ($1 \le i \le m$). The loop overhead due to tiled $J_i$ loops can be measured by

$$n_3 \cdot \frac{\text{ITMAX} \cdot \gamma}{B_1}. \quad (7)$$

Enabled by scalar replacement [2], in a software-pipelined loop [1], loaded data can be reused in different iterations. The dynamic count of load instructions can hence be reduced. Let $n_4$ be the sum of the dynamic count of load instructions in the prologues and the epilogues of all the software-pipelined loops. Let $n_5$ be the sum of the number of load instructions divided by the unroll factor in the software-pipelined loop bodies. The unroll factor is one if the loop is not unrolled. The dynamic count of load instructions with 1-D tiling is approximately

$$(n_4 + n_5 \eta) \cdot \gamma \cdot \text{ITMAX}. \quad (8)$$

With 2-D tiling, the dynamic count of load instructions is approximately

[expression illegible in the source] (9)

Clearly, if $n_4$ is significantly greater than $n_5$ and $B_2$ is small, then the dynamic count of load instructions with 2-D tiling can be much greater than that with 1-D tiling.

Let $p_1$ be the penalty for an L1 cache miss and $p_2$ be the penalty for an L2 cache miss. By adding the penalty due to L1 cache misses in Formula (3), the penalty due to L2 cache misses in Formula (4), the loop overhead due to tiled $J_i$ loops in Formula (7), and the dynamic count of load instructions for software-pipelined innermost loops in Formula (8), we can model the execution cost for 1-D tiling by

$$p_1 \cdot \left(\text{ITMAX} \cdot \frac{W}{C_{b1}}\right) + p_2 \cdot \left(\frac{W S_1 (\text{ITMAX} - 1)}{C_{b2} B_1}\right) + n_3 \cdot \frac{\text{ITMAX} \cdot \gamma}{B_1} + (n_4 + n_5 \eta) \cdot \gamma \cdot \text{ITMAX}. \quad (10)$$

In the above formula, we assume a latency of one unit of time for each instruction, including a load instruction. From (10), with 1-D tiling, we want to maximize $B_1$ (subject to Properties 1 and 2 aforementioned) such that the number of L2 cache misses is minimized. By adding the penalty due to L1 cache misses in Formula (1), the penalty due to L2 cache misses in Formula (2), the dynamic count of load instructions for software-pipelined innermost loops in Formula (9), the loop overhead due to tiled $J_i$ loops in Formula (7), and the loop overhead due to the tiled innermost loop in Formula (5), the execution cost for 2-D tiling can be modeled by

[expression illegible in the source] (11)
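To make the structure of the 1-D cost concrete, the following sketch adds up the four components named above: the L1 penalty, the L2 penalty, the $J_i$ loop overhead and the dynamic load count, with unit latency per instruction. The individual terms follow our reading of the extracted Formula (10) and are therefore an assumption rather than a verified transcription; the function name `cost_1d` and its parameter names are hypothetical.

```python
def cost_1d(w, gamma, eta, itmax, b1, s1, cb1, cb2, p1, p2, n3, n4, n5):
    """Estimated execution cost of 1-D tiling with tile width b1.

    All four terms are assumptions reconstructed from the garbled source.
    """
    l1_misses = itmax * w / cb1                    # L1 misses, Formula (3)
    l2_misses = w * s1 * (itmax - 1) / (cb2 * b1)  # L2 misses, Formula (4)
    j_overhead = n3 * itmax * gamma / b1           # J_i loop-bound overhead
    loads = (n4 + n5 * eta) * gamma * itmax        # dynamic load count
    return p1 * l1_misses + p2 * l2_misses + j_overhead + loads
```

Only the L2 and loop-overhead terms depend on `b1`, and both shrink as `b1` grows, which matches the remark that $B_1$ should be maximized subject to Properties 1 and 2.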
Tile-Size Selection Algorithm
In this section, we first discuss how to preserve Properties 1 and 2. We then present our tile-size selection algorithm.
[Figure 5(a): Procedure EnumFPSize($C_s$, $C_b$, $N$)]

First, we discuss how to eliminate self-interference misses within a single tile. For any array $A_i$, let $R$ be the minimum rectangular array region which contains all the $A_i$ elements referenced within a tile $t$. We say that $A_i$'s footprint size within tile $t$ is $(F_1, F_2)$, where $F_1$ and $F_2$ are the numbers of columns and rows in $R$ respectively. We call $F_1$ ($F_2$) the array footprint width (height) for $A_i$ within tile $t$. Conversely, given a footprint size of $A_i$, the tile size can also be computed. Given the subscript patterns and the loop bounds, such a computation is straightforward and we omit the details. For the example of SOR (Figure 1(c)), assuming the array footprint size for $A$ to be $(\kappa_1, \kappa_2)$, the loop tile size should be $(\kappa_1 - 2, \kappa_2 - 2)$. For array $A_i$, if the footprint height $F_2$ is greater than the distance between the locations of two columns in the cache, then the columns accessed within the tile will conflict in the cache, creating self-interference misses [3]. More precisely, we have the following lemma:
Lemma 1. Given array footprint size $(F_1, F_2)$ for any $A_i$ ($1 \le i \le n_a$), a cache of size $C_s$ and cache line size $C_b$, if there exist no self-interference misses, then the distance between the starting cache locations of any two columns of $A_i$ within $F_1$ consecutive columns is either no smaller than $F_2$, or no greater than $C_s - F_2$. Conversely, there exist no self-interference misses if the distance between the starting cache locations of any two columns of $A_i$ within $F_1$ consecutive columns is either no smaller than $F_2 + C_b - 1$, or no greater than $C_s - F_2 - C_b + 1$.

Proof. Obvious.
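The sufficient condition of Lemma 1 can be sketched as a small check for a directly-mapped cache. The sketch assumes column-major storage with column size `n`, so two columns that are `k` apart start `(k * n) mod cs` cache locations apart, and it reads the two bounds of the lemma conjunctively (the distance must be at least $F_2 + C_b - 1$ measured in either direction around the cache), which is our interpretation. The function name is hypothetical and this is not the paper's EnumFPSize procedure.

```python
def satisfies_lemma1(f1, f2, n, cs, cb):
    """True if footprint (f1 columns, f2 rows) meets the sufficient
    condition of Lemma 1 for column size n, cache size cs, line size cb.
    All sizes are in data elements."""
    margin = f2 + cb - 1
    for k in range(1, f1):        # every pair of columns is k apart, 1 <= k < f1
        d = (k * n) % cs          # distance between their starting cache locations
        if not (margin <= d <= cs - margin):
            return False
    return True
```

For example, with `n` equal to the cache size, every column maps to the same cache location, so any footprint wider than one column fails the check.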
Given a directly-mapped cache of size $C_s$ and cache line size $C_b$, and given an array column size $N$, procedure EnumFPSize in Figure 5(a) enumerates all the footprint sizes $(F_1, F_2)$ which incur no self-interference misses, according to Lemma 1. We say that a footprint size $(F_1, F_2)$ of $A_i$ is maximal if increasing either $F_1$ or $F_2$ will introduce self-interference misses for $A_i$. In general, the maximal footprint size for array $A_i$ is not unique. According to EnumFPSize, the maximal footprint sizes for all arrays are the same if they have the same array column sizes. Our tile-size selection scheme will enumerate all array footprint sizes which are free of self-interference misses until the sizes become maximal. The scheme estimates and compares the execution cost for different $(F_1, F_2)$ in order to get the optimal tile size.
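The enumerate-and-compare step can be sketched as a plain search over footprint sizes. This is a simplified illustration, not the paper's EnumFPSize-based algorithm: the conflict test and the cost model are passed in as hypothetical callables `is_conflict_free` and `cost`, and the footprint-to-tile conversion uses the SOR relation $(\kappa_1 - 2, \kappa_2 - 2)$ through an assumed `halo` parameter.

```python
def select_tile_size(max_f1, max_f2, is_conflict_free, cost, halo=2):
    """Return (lowest_cost, (b1, b2)) over all conflict-free footprints,
    or None if no footprint qualifies.

    is_conflict_free(f1, f2) and cost(b1, b2) are caller-supplied
    stand-ins for the Lemma 1 test and the execution cost model.
    """
    best = None
    for f1 in range(halo + 1, max_f1 + 1):
        for f2 in range(halo + 1, max_f2 + 1):
            if not is_conflict_free(f1, f2):
                continue
            tile = (f1 - halo, f2 - halo)   # footprint size -> tile size
            c = cost(*tile)
            if best is None or c < best[0]:
                best = (c, tile)
    return best
```

A search restricted to maximal footprints would visit fewer candidates; the exhaustive loop is kept here only for clarity.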
Next, suppose the cache is not directly-mapped, and assume an LRU replacement policy.
9
Procedure EnumFPSize(C3 , C b , N) First, we discuss how to eliminate self-interference misses within a single tile. For any array Ai, let R be the minimum rectangular array region which contains all the Ai elements referenced wi thin a tile t. We say that Ai'S footprint size within tile tis (FI, F2) , where FI and F 2 are the numbers of columns and rows in R respectively. We call F 1 (F2) the array footprint width (height) for A within tile t. Reversely, given a footprint size of Ai, the tile size can also be computed. Given the subscript patterns and the loop bounds, such a computation is straightforward and we omit the details. For the example of SOR (Figure l(c) ), assuming the array footprint size for A to be (1\:1,1\:2), the loop tile size should be (It] -2 1 1\:2 -2). For array Ai, if the footprint height F 2 is greater than the distance between the locations of two columns in the cache, then the columns accessed within the tile will conflict in the cache, creating self-interference misses [3] . More precisely, we have the following lemma: Lemma 1 Given array footprint size (FI, F2) for any Ai (1~i~nil), a cache of size C s and cache line size G b , if there exist no self-interference misses, then the distance between the starting cache locations of any two columns of Ai witWn F I consecutive columns is either no smaller than F 2 , or no greater than C s -F2· ConverselYI there exist no self-interference misses if the distance between the starting cache locations of any two columns of Ai within FI consecutive columns is either no
Proof. Obvious. $\Box$
Given a directly-mapped cache of size $C_s$ and cache line size $C_b$, and given an array column size $N$, procedure EnumFPSize in Figure 5(a) enumerates all the footprint sizes $(F_1, F_2)$ which incur no self-interference misses, according to Lemma 1. We say that a footprint size $(F_1, F_2)$ of $A_i$ is maximal if increasing either $F_1$ or $F_2$ will introduce self-interference misses for $A_i$. In general, the maximal footprint size for array $A_i$ is not unique. According to EnumFPSize, the maximal footprint sizes for all arrays are the same if they have the same array column sizes. Our tile-size selection scheme will enumerate all array footprint sizes which are free of self-interference misses until the sizes become maximal. The scheme estimates and compares the execution cost for different $(F_1, F_2)$ in order to get the optimal tile size.
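To make the enumeration concrete, the following is a minimal brute-force sketch, not the EnumFPSize procedure of Figure 5(a), which is not reproduced here. It assumes column-major storage, a first column that maps to cache offset 0, sizes measured in array elements, and it ignores cache-line granularity ($C_b$). For each footprint width it reports the largest height for which the column segments occupy disjoint ranges of a direct-mapped cache. All function names are hypothetical.

```python
from itertools import combinations


def circular_distance(p, q, cache_size):
    """Distance between two cache offsets on a cache that wraps around."""
    d = abs(p - q) % cache_size
    return min(d, cache_size - d)


def max_height_for_width(width, column_size, cache_size):
    """Largest footprint height F2 for which `width` consecutive columns
    occupy disjoint ranges of a direct-mapped cache (0 if two columns collide)."""
    # Assumption: column j of the array starts at cache offset (j * N) mod C_s.
    offsets = [(j * column_size) % cache_size for j in range(width)]
    limit = min(column_size, cache_size)  # a column segment cannot exceed either
    for p, q in combinations(offsets, 2):
        limit = min(limit, circular_distance(p, q, cache_size))
    return limit


def enumerate_footprints(cache_size, column_size, max_width):
    """Return [(F1, F2)] with the largest conflict-free F2 for each width F1."""
    sizes = []
    for width in range(1, max_width + 1):
        height = max_height_for_width(width, column_size, cache_size)
        if height == 0:
            break  # two columns share an offset; wider footprints also collide
        sizes.append((width, height))
    return sizes
```

For a given width, every smaller height is also conflict-free, so the returned pairs bound the set of candidate footprint sizes that a cost model would then compare.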
Next, suppose the cache is not directly-mapped, and assume an LRU replacement policy. We show that the parameter $C_s$ in procedure EnumFPSize should not be the whole cache size.
Otherwise, self-interference misses will occur when the execution proceeds from one tile to the next. For clarity, instead of arguing formally for the general cases, we illustrate the cases of 2-way and fully-associative caches. Figure 5(b) shows two consecutive tiles $t_1$ and $t_2$. Suppose $C_s$ equals the whole cache size in procedure EnumFPSize and suppose the footprint size of $t_1$ is maximal. Tile $t_1$ accesses the cache from the least-recently referenced data segment to the most-recently referenced data segment in the memory, in the order of $D_1$, $D_2$, $D_3$ and $D_4$, which are separated by solid lines. If the cache associativity is $C_{a1} = 2$, then $D_2$ and $D_4$ will map to the same cache sets. The data accessed in the blank rectangle $A$ will replace segment $D_2$. If the cache is fully associative, $D_1$ will be replaced. However, part of the old data in segment $D_2$ (or $D_1$) could have been reused by tile $t_2$.
One solution to avoid the replacement of useful data is to reduce the footprint size within $t_1$ such that only a portion of the cache is used to compute the maximal footprint size in EnumFPSize. Figure 5(c) shows the case for a two-way set-associative cache. In this way, the data accessed in Regions $A$ and $C$ will replace the cache segment $D_2$ and part of segment $D_1$, whose old data are not reused by $t_2$. The reusable data in $D_3$ will be kept in the cache. Using the above idea, we let $C_s = \frac{C_{a1} - 1}{C_{a1}} C_{s1}$ in procedure EnumFPSize, for 2-way and fully-associative caches. The cases of other associativities are more complex, and they will not be discussed in this paper.
To eliminate capacity misses, the footprint size of each array $A_i$ can only be $(\lfloor F_1/n_a \rfloor, F_2)$, a fraction of $(F_1, F_2)$. Here, we choose to partition columns instead of rows, in order to preserve spatial locality. Assume that $(B_1^{(i)}, B_2^{(i)})$, $1 \le i \le n_a$, is the tile size such that the footprint size for array $A_i$ within a single tile is $(\lfloor F_1/n_a \rfloor, F_2)$. For 2-way and fully-associative caches, we choose the tile size for the tiled loop as $(B_1, B_2) = (\min_i B_1^{(i)}, \min_i B_2^{(i)})$. For directly-mapped caches, we choose $(B_1, B_2) = (\min_i B_1^{(i)} - S_1, \min_i B_2^{(i)} - S_2)$. One can prove that for directly-mapped, 2-way and fully-associative caches, Property 1 holds under the above treatment. For other set-associative caches, procedure EnumFPSize needs to be revised.
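The bookkeeping in the last two paragraphs can be sketched as follows. This is an illustrative reading, not the authors' implementation. It assumes the reduced cache size for 2-way and fully-associative caches is $\frac{C_{a1}-1}{C_{a1}} C_{s1}$, that a directly-mapped cache uses the full size, and that the per-array tile sizes $(B_1^{(i)}, B_2^{(i)})$ have already been derived from the footprint (for example $(F_1 - 2, F_2 - 2)$ for SOR). The function names and the `num_lines` parameter are hypothetical.

```python
def effective_cache_size(cs1, ca1, num_lines=None):
    """Cache size handed to the footprint enumeration (in array elements)."""
    if ca1 == 1:
        return cs1  # directly-mapped: assumed to use the whole cache
    if ca1 == 2 or (num_lines is not None and ca1 == num_lines):
        return (ca1 - 1) * cs1 // ca1  # 2-way or fully associative
    raise ValueError("other associativities are not covered by this sketch")


def per_array_footprint(f1, f2, n_a):
    """Split the footprint by columns so that n_a arrays share the cache."""
    return (f1 // n_a, f2)


def loop_tile_size(per_array_tiles, ca1, s1, s2):
    """Combine per-array tile sizes; subtract the skewing factors when the
    cache is directly-mapped."""
    b1 = min(t[0] for t in per_array_tiles)
    b2 = min(t[1] for t in per_array_tiles)
    if ca1 == 1:
        return (b1 - s1, b2 - s2)
    return (b1, b2)
```

For instance, with two arrays whose individual tile sizes are (10, 38) and (8, 41), the sketch yields (8, 38) on a 2-way cache and (7, 37) on a directly-mapped cache with $S_1 = S_2 = 1$.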
Preserving Property 2
We apply inter-array padding to eliminate cross-interference misses within a tile traversal. For simplicity of presentation, we assume that the array subscript patterns of one particular array $A_k$ cover all the array subscript patterns for all the other arrays $A_i$, $i \ne k$. The discussion in this section can be easily extended if such an assumption does not hold. Using inter-array padding, we let the starting address of array $A_i$ ($1 \le i \le n_a$) map to the same location in the cache as the starting address of the $(\lfloor F_1/n_a \rfloor \cdot (i - 1))$th column of array $A_1$. With such padding, cross-interference misses are eliminated within a single tile between $A_i$ and $A_j$ ($1 \le i, j \le n_a$, $i \ne j$).
Figure 6: An illustration of padding to eliminate cross-interferences
When the execution goes from one tile to the next, if the cache is directly-mapped, the newly accessed data for $A_i$ will map to cache locations previously unused in the tile traversal. If the cache is not directly-mapped, the newly accessed data for $A_i$ will map to cache locations which are either previously unused or will not be referenced again within the current traversal. Therefore, cross-interference misses are also eliminated within a tile traversal. Figure 6 illustrates an example for $F_1 = 4$ and $n_a = 2$, where the cache is directly mapped. Here, assuming the starting address for array $A_1$ to be 0, the padded number of data items, $x$, between arrays $A_1$ and $A_2$ can be determined from
$$(\mathrm{size}(A_1) + x) \bmod C_{s1} = (2N) \bmod C_{s1}. \tag{12}$$
We are ready to present our tile-size selection algorithm in the next section.
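As an illustration of the padding rule, the sketch below computes the pad before each array so that $A_i$ starts at the same cache location as column $\lfloor F_1/n_a \rfloor (i-1)$ of $A_1$. It assumes column-major arrays laid out back to back, $A_1$ starting at address 0, sizes counted in array elements, and it picks the smallest non-negative pad. These choices and the function name are assumptions made for this example only.

```python
def inter_array_padding(array_sizes, column_size, footprint_width, cache_size):
    """Return the pads (in elements) inserted before A_2, ..., A_{n_a}."""
    n_a = len(array_sizes)
    per_array_width = footprint_width // n_a
    paddings = []
    next_free = array_sizes[0]  # A_1 occupies addresses [0, size(A_1))
    for i in range(2, n_a + 1):
        # Cache location of the target column of A_1.
        target = (per_array_width * (i - 1) * column_size) % cache_size
        # Smallest non-negative pad that aligns A_i with that location.
        x = (target - next_free) % cache_size
        paddings.append(x)
        next_free += x + array_sizes[i - 1]
    return paddings
```

With $N = 300$, two $300 \times 300$ arrays, $F_1 = 4$ and a cache of 1024 elements, the sketch gives a pad of 712 elements, and $(90000 + 712) \bmod 1024 = 600 = (2 \cdot 300) \bmod 1024$, matching the form of Equation (12).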
Algorithm STS
Algorithm STS in Figure 7 selects the tile size by interleaving the operations in procedure EnumFPSize with the applications of Formulas (10) and (11) which compute the execution cost. We require $B_2$ to be no smaller than the cache line size $C_{b1}$. However, we do not require $B_2$ to be a multiple of $C_{b1}$, since such a requirement does not have much benefit when execution proceeds from one tile to the next. In addition to the conditions stated in procedure EnumFPSize, the array footprint width $F_1$ should be no greater than $\sigma$, which is the total number of array columns representable by the TLB minus the number of newly accessed array columns when the execution proceeds from one tile to the next.
Figure 7: Tile-size selection algorithm STS. Input: $S_1$, $S_2$, $C_{s1}$, $C_{a1}$, $C_{b1}$, $C_{s2}$, $C_{a2}$, $C_{b2}$, $n_1$, $n_3$, $n_4$, $n_5$, $n_a$, $N$, $\sigma$ (see Table 1).
STS makes the decision between 1-D and 2-D tiling based on their execution cost. For 1-D tiling, ComputeTileSize-1D tries to find tile width $B_1$ such that Properties 1 and 2 are preserved on the L2 cache and that Formula (10) is minimized. For 2-D tiling, ComputeTileSize-2D enumerates all tile sizes which are free of self-interference misses. The tile size with the lowest execution cost is selected. Between 1-D and 2-D tiling, the scheme with the lower execution cost is chosen.
STS needs a conversion from array footprint size $(F_1, F_2)$ to loop tile size $(B_1, B_2)$, as stated in Section 4.2.1. If the resulting tile width or tile height is nonpositive, 1-D tiling is chosen.
The complexity of STS is $O(N \cdot \min(C_{s1}, \sigma)) = O(N\sigma)$. (In practice, $\sigma$ is much smaller than the L1 cache size $C_{s1}$.)
A Running Example
We now take SOR (Figure 1) as an example. In the following, we show the steps of STS.
• Since $C_{a1} = 2$, ComputeTileSize-2D is called, and we have $B_1 = 38$, $B_2 = 43$. The execution cost for 2-D tiling is $M = 4171464893$ units based on Formula (11).
• No inter-array padding is applied since $n_a = 1$.
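The decision logic of STS described in this section can be sketched as below. Formulas (10) and (11) are not reproduced in this section, so the 1-D cost is passed in as a number and the 2-D cost as a hypothetical callable. Skipping nonpositive candidates and breaking ties in favor of 1-D tiling are simplifying assumptions of this sketch, not statements about the algorithm in Figure 7.

```python
def select_tiling(cost_1d, candidates_2d, cost_2d):
    """Choose between 1-D and 2-D tiling by estimated execution cost.

    cost_1d       -- estimated cost of the best 1-D tiling (stand-in for Formula (10))
    candidates_2d -- iterable of (B1, B2) tile sizes free of self-interference
    cost_2d       -- callable (B1, B2) -> cost (stand-in for Formula (11))
    """
    best_tile, best_cost = None, None
    for b1, b2 in candidates_2d:
        if b1 <= 0 or b2 <= 0:
            continue  # nonpositive tile width or height: not a usable 2-D tile
        cost = cost_2d(b1, b2)
        if best_cost is None or cost < best_cost:
            best_tile, best_cost = (b1, b2), cost
    if best_tile is None or cost_1d <= best_cost:
        return ("1-D", None, cost_1d)
    return ("2-D", best_tile, best_cost)
```

A caller would feed it the footprint sizes enumerated for the L1 cache, converted to loop tile sizes, together with the cost of the best 1-D tile found for the L2 cache.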
Related Work

Competing Tile-Size Selection Schemes
Chame and Moon present a tile size selection algorithm, called TLI, to simultaneously eliminate self-interference misses and minimize the summation of capacity misses and cross-interference misses [3]. Coleman and McKinley provide a tile size selection algorithm, TSS, based on the cache organization and the data layout [4]. TSS utilizes a gcd algorithm to maximize cache utilization while eliminating all self-interference misses. Rivera and Tseng present a variation of the TSS algorithm [16]. Lam et al. provide a tile size selection scheme, LRW, which tries to select a square tile size to eliminate the capacity and self-interference misses for a dominant array [9]. Panda et al. present DAT, which always chooses square tile sizes and tries to minimize the interferences by padding [13]. Unlike the work in this paper, these tile-size selection algorithms do not consider the effect of loop skewing, nor do they take loop overhead into account.
Other Related Work
Ghosh et al. estimate cache misses, given a tile size, for a perfect loop nest [6]. They also informally discuss a tile-size selection scheme using matrix multiplication as the example. No formal algorithm is presented, however. They do not discuss the estimation of cache misses for imperfectly-nested loops. Therefore, we are not able to compare with their method in our experiments. [...] [5]. Temam et al. derive an analytical method to estimate the number of self-interference misses [19]. McKinley et al. present a simple cost model to estimate the number of cache misses [11]. These methods do not consider the effect of loop skewing.
Rivera and Tseng present several padding algorithms to eliminate cache conflict misses [15, 16]. Manjikian and Abdelrahman use cache partitioning to scatter arrays evenly in the cache, such that cross-interference misses are minimized [10]. We use a different padding scheme which seems more suitable for our algorithm.
Experimental Evaluation
We apply our tile-size selection algorithm STS to three numerical kernels, SOR, Jacobi and Livermore Loop No. 18 (LL18), and two SPEC benchmarks, tomcatv and swim. We use reference inputs for tomcatv and swim. For SOR, Jacobi and LL18, we declare $N \times N$ double precision arrays, with randomly chosen $N$ based on a random number generator [14] with the following formula
$$z_{n+1} = (16807\, z_n) \bmod 2147483647. \tag{13}$$
Assuming that the array sizes under consideration range from $\tau_0$ to $\tau_1$, we select 200 array sizes, $a_n$, such that
$$a_n = \tau_0 + (z_n \bmod (\tau_1 - \tau_0)), \quad 1 \le n \le 200. \tag{14}$$
We use $z_1 = 9$ in all our experiments. Note that it would be too time-consuming to exhaustively test all array sizes within the range in our experiments.
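The selection of array sizes by Equations (13) and (14) can be reproduced with a few lines of code. The sketch below is illustrative; the function and parameter names are hypothetical, and the lower and upper bounds of the range are passed as `lower` and `upper`.

```python
def array_sizes(lower, upper, count=200, seed=9):
    """Generate `count` array sizes in [lower, upper) from the
    multiplicative congruential generator of Equation (13)."""
    sizes = []
    z = seed  # z_1
    for _ in range(count):
        sizes.append(lower + z % (upper - lower))  # Equation (14)
        z = (16807 * z) % 2147483647               # Equation (13)
    return sizes
```

For the range from 200 to 2000 used for SOR and Jacobi, the first three sizes produced are 209, 263 and 1594.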
We run the test programs on a SUN Ultra II uniprocessor workstation and on one MIPS R10K processor of an SGI Origin 2000 multiprocessor, with the tile sizes selected by five different algorithms, namely, STS, TLI [3], TSS [4], LRW [9] and DAT [13]. In order to handle several equally-important arrays, we make an obviously necessary modification on the original TSS and LRW algorithms such that the value of the initial tile size will meet the working set constraint. We also modify the TLI algorithm such that only the cache size divided by the number of equally-important arrays is used to compute the tile sizes which are free of self-interference misses. If any algorithm decides to choose the whole array column as the tile height, then we let $B_2 = \gamma_1 + S_2 \cdot (\mathrm{ITMAX} - 1)$ and tile the $J_i$ loops only (Figure 2(b)). Table 2 lists the machine parameters for the Ultra II and the R10K, assuming an array element size of 8 bytes. To accommodate the competition between instructions and data in the L2 cache both on the Ultra II and on the R10K, we only try to utilize 95% of the total L2 cache capacity. We use the machine counters on the R10K and the Ultra II to measure the cache miss rate. Currently, we obtain the values of $n_1$, $n_3$, $n_4$ and $n_5$ by examining the assembly code of the original program. A backend compiler can easily obtain such numbers.
On the R10K, the untiled codes are compiled using the native compiler with the "-O3" optimization switch set. On the R10K, we found that compiling the tiled code with the "-O2" switch can sometimes run faster than that with the "-O3" switch, regardless of the tile-size selection schemes. Therefore, we compile the tiled code with "-O2" or "-O3" depending on which produces shorter execution time. For all the tile-size selection schemes, we switch off loop tiling for the native compiler on the R10K when we compile the tiled source programs (for both 1-D and 2-D tiling). We switch off prefetching on the R10K when we compile 2-D tiled source codes since prefetching may increase cross-interference misses for smaller tile height $B_2$. We also switch off common block reorganization since the tile size selection algorithms already take care of memory layout. On the Ultra II, both the untiled and the tiled codes are compiled using the native compiler with the "-fast -xchip=ultra2 -xarch=v8plusa -fsimple=2" optimization switch, which is recommended by the vendor.
The SOR kernel
We fix ITMAX to 1050 and randomly choose 200 array sizes ranging from 200 to 2000, i.e., $(\tau_0, \tau_1) = (200, 2000)$ in Equation (14). The skewing factors are $S_1 = S_2 = 1$. We have $n_1 = n_3 = 11$, $n_4 = 9$ and $n_5 = 3$ on the R10K and $n_1 = n_3 = 22$, $n_4 = 34$, $n_5 = 4$ on the Ultra II. Table 3 summarizes the average speedup by STS over other schemes, and the average L1 and L2 cache miss rates for SOR on both the Ultra II and the R10K. The execution time is averaged by geometric mean, and the cache miss rates are averaged by arithmetic mean of the cache miss rates for individual array sizes. Specifically, Figures 8 and 11 show the execution time for various schemes on the Ultra II and on the R10K respectively. Figures 9 and 10 show the L1 cache and L2 cache miss rates respectively on the Ultra II. Figures 12 and 13 show the L1 cache and L2 cache miss rates respectively on the R10K.
Figure 9: L1 cache miss rate of SOR for various schemes on the Ultra II
We fix ITMAX to 500 and randomly choose 200 array sizes ranging from 200 to 2000. The skewing factors are $S_1 = S_2 = 1$. We have $n_1 = n_3 = 17$, $n_4 = 28$ and $n_5 = 10$ on the R10K and $n_1 = n_3 = 28$, $n_4 = 24$, $n_5 = 3$ on the Ultra II. Table 4 shows the average speedup by STS, and the average L1 and L2 cache miss rates for Jacobi on both the Ultra II and the R10K. Specifically, Figures 14 and 17 show the execution time of Jacobi for various schemes on the Ultra II and on the R10K respectively. Figures 15 and 16 show the L1 cache and L2 cache miss rates respectively on the Ultra II. Figures 18 and 19 show the L1 cache and L2 cache miss rates respectively on the R10K.
LL18 has 9 arrays, and the tiled version has 11 arrays after duplicating ZR and ZZ. Due to the relatively large number of arrays, the array sizes we used in SOR will produce extremely small tile sizes for all the tile-size selection schemes. Therefore, we reduce the array sizes and randomly choose 200 array sizes ranging from 200 to 500. We fix ITMAX to 300. The skewing factors are $S_1 = S_2 = 2$. We have $n_1 = n_3 = 75$, $n_4 = 100$ and $n_5 = 35$ on the R10K and $n_1 = n_3 = 87$, $n_4 = 14$, $n_5 = 8$ on the Ultra II.
tomcatv
tomcatv can only be tiled with one dimension [18], hence only STS can be applied for tile-size selection. We use two different reference inputs from SPEC92 and SPEC95 respectively. To verify whether STS produces nearly the best results, we run through a range of tile sizes, from 2 to twice the size selected by STS, for each version of tomcatv. Figures 26(a) [...]
swim
Similar to tomcatv, swim is tiled only with one dimension. We use three different reference inputs from SPEC92, SPEC95 and SPEC2000 respectively. Similar to tomcatv, we choose the tile sizes [...] show the results on the R10K. Similar to tomcatv, the padded version runs faster than the unpadded version in most cases for SPEC92 and SPEC95. Note that on the Ultra II, the TLB size is smaller than the L2 cache size, hence STS will result in an underutilization of the L2 cache. For SPEC2000, however, such an underutilization seems to have a negative effect on performance.
Discussion
In summary, Table 6 shows the speedup by STS over all the other schemes for all 600 cases for SOR, Jacobi and LL18, where "Both" stands for both the Ultra II and the R10K.
One interesting point is related to LRW. Considering the combination of each benchmark (SOR, Jacobi and LL18) and each machine (Ultra II and R10K), LRW produces equal or smaller average L1 cache misses in 5 out of 6 combinations compared with STS. However, this does not translate into a large performance saving. (The worst average speed ratio of STS over LRW is 0.98.) We found that in general LRW produces smaller tile sizes than STS, which potentially introduces more loop overhead. For LL18, LRW has greater average L2 cache miss rates than STS since STS exploits locality for the L2 cache in most cases due to the large number of arrays.
Conclusion
In this paper, we present a memory cost model to predict the cache misses after skewed tiling. Further, we model the execution cost by considering both the cache misses and the loop overhead, based on which we make a decision between tiling one loop level vs. two loop levels. We present Algorithm STS, which selects the tile size such that the capacity misses and self-interference misses within a tile traversal are eliminated. STS uses inter-array padding to eliminate cross-interference misses.
We also compare STS with four previous algorithms, TLI, TSS, LRW and DAT. Experiments show that STS achieves an average speedup of 1.27 to 1.63 over all the other four algorithms. We have previously implemented a cost model along with a number of tiling algorithms [18]. However, we are yet to implement the cost model presented in this paper. Ideally, our cost model should be incorporated in a backend compiler, which will be our future work.
