Impact of Tile-Size Selection for Skewed Tiling by Song, Yonghong & Li, Zhiyuan
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
2000 
Impact of Tile-Size Selection for Skewed Tiling 
Yonghong Song 
Zhiyuan Li 
Purdue University, li@cs.purdue.edu 
Report Number: 
00-018 
Song, Yonghong and Li, Zhiyuan, "Impact of Tile-Size Selection for Skewed Tiling" (2000). Department of 
Computer Science Technical Reports. Paper 1496. 
https://docs.lib.purdue.edu/cstech/1496 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 
IMPACT OF TILE-SIZE SELECTION 
FOR SKEWED TILING 
Yonghong Song 
Zhiyuan Li 
Department of Computer Sciences 
Purdue University
West Lafayette, IN 47907 
CSD TR #00-018 
December 2000 





Impact of Tile-Size Selection for Skewed Tiling * 
Yonghong Song Zhiyuan Li
Department of Computer Sciences
Purdue University
West Lafayette, IN 47907
{songyh,li}@cs.purdue.edu
Abstract 
Tile-size selection is known to be a complex problem. This paper develops a new selection
algorithm. Unlike previous algorithms, this new algorithm considers the effect of loop skewing
on cache misses. It also estimates loop overhead and incorporates it into the execution
cost model, which turns out to be critical to the decision between tiling a single loop level vs.
tiling two loop levels. Our preliminary experimental results show a significant impact of these
previously ignored issues on the execution time of tiled loops. In our experiments, we measured
the cache miss rate and the execution time of five benchmark programs on a single processor
and we compared our algorithm with previous algorithms. Our algorithm achieves an average
speedup of 1.27 to 1.63 over all the other algorithms.
1 Introduction 
Memory access latency has become the key performance bottleneck on modern microprocessors. In
order to reduce the average memory reference latency, it is important to exploit data locality such
that most memory references can be served by the fast memory, e.g. the cache, in the memory
hierarchy. Tiling is a well-known compiler technique to enhance data locality such that more data
can be reused before they are replaced from the cache [23]. Tiling transforms a loop nest by
combining strip-mining and loop interchange. Loop skewing and loop reversal are often used to
enable tiling [20]. Figure 1 shows SOR relaxation as an example. Figure 1(a) shows the original
loop nest in SOR, Figure 1(b) shows the tiled SOR in which loop J is skewed with respect to
loop T, and Figure 1(c) shows the tiled SOR in which loops J and I are skewed with respect to
loop T.
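The effect of skewed tiling on the SOR example can be checked with a small executable sketch. The block below is an illustration only, written in Python rather than the Fortran-style notation of Figure 1; the function names, the helper `make_grid`, and the test sizes are assumptions introduced here. It transcribes the three loop nests and lets one confirm that both tiled versions visit the iterations in an order that preserves the results of the original nest.

```python
def update(a, i, j):
    # One SOR relaxation step on element (i, j), same operand order everywhere.
    a[i][j] = (a[i][j] + a[i + 1][j] + a[i - 1][j]
               + a[i][j + 1] + a[i][j - 1]) / 5


def sor_original(a, n, itmax):
    """Untiled loop nest: T outermost, then J, then I (1-based indices 2..n-1)."""
    for t in range(1, itmax + 1):
        for j in range(2, n):
            for i in range(2, n):
                update(a, i, j)


def sor_skewed_1d(a, n, itmax, b1):
    """Loop J skewed with respect to T, then tiled with tile width b1."""
    for jj in range(2, n - 1 + itmax + 1, b1):
        for t in range(1, itmax + 1):
            for j in range(max(jj - t, 2), min(jj - t + b1 - 1, n - 1) + 1):
                for i in range(2, n):
                    update(a, i, j)


def sor_skewed_2d(a, n, itmax, b1, b2):
    """Loops J and I skewed with respect to T, tiled with size (b1, b2)."""
    for jj in range(2, n - 1 + itmax + 1, b1):
        for ii in range(2, n - 1 + itmax + 1, b2):
            for t in range(1, itmax + 1):
                for j in range(max(jj - t, 2), min(jj - t + b1 - 1, n - 1) + 1):
                    for i in range(max(ii - t, 2), min(ii - t + b2 - 1, n - 1) + 1):
                        update(a, i, j)


def make_grid(n):
    # Hypothetical test data; row/column 0 is unused so indices stay 1-based.
    return [[float((3 * i + 7 * j) % 11) for j in range(n + 1)]
            for i in range(n + 1)]


if __name__ == "__main__":
    n, itmax = 9, 4
    ref = make_grid(n)
    sor_original(ref, n, itmax)
    tiled = make_grid(n)
    sor_skewed_2d(tiled, n, itmax, 3, 2)
    print("2-D skewed tiling matches the original:", tiled == ref)
```

Because every element update uses the same operands in the same order, the comparison can be exact rather than within a tolerance. The sketch says nothing about performance; it only illustrates the iteration reordering.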
Much of the previous work on tiling applies to perfectly-nested loops only [8, 20, 21, 23]. Recently,
we proposed a new technique to tile a class of imperfectly-nested loops [17, 18]. Performance of
a tiled loop nest can vary dramatically with different tile sizes [9]. How to select proper tile sizes
is hence an important issue. In this paper, if loop skewing is applied before tiling, such a tiling
is called skewed tiling. Non-skewed tiling results if loop skewing is not necessary for tiling. All
previous work tacitly assumes non-skewed tiling [4, 6, 9, 12, 16, 22]. However, such an assumption
may not be true, especially for loops which perform iterative relaxation computations [17, 18].
Another important factor ignored in previous work is the loop overhead in terms of the increased
instruction counts due to the increased loop levels. Further, tiling a software-pipelined loop will also
*This work is sponsored in part by the National Science Foundation through grants CCR-9975309 and MIP-9610379,
by the Indiana 21st Century Fund, by the Purdue Research Foundation, and by a donation from Sun Microsystems, Inc.









(a) Before transformation

DO T = 1, ITMAX
  DO J = 2, N - 1
    DO I = 2, N - 1
      A(I, J) = (A(I, J) + A(I + 1, J) + A(I - 1, J)
                 + A(I, J + 1) + A(I, J - 1)) / 5
    END DO
  END DO
END DO

(b) After skewing and "1-D" tiling

DO JJ = 2, N - 1 + ITMAX, B1
  DO T = 1, ITMAX
    DO J = max(JJ - T, 2), min(JJ - T + B1 - 1, N - 1)
      DO I = 2, N - 1
        A(I, J) = (A(I, J) + A(I + 1, J) + A(I - 1, J)
                   + A(I, J + 1) + A(I, J - 1)) / 5
      END DO
    END DO
  END DO
END DO

(c) After skewing and "2-D" tiling

DO JJ = 2, N - 1 + ITMAX, B1
  DO II = 2, N - 1 + ITMAX, B2
    DO T = 1, ITMAX
      DO J = max(JJ - T, 2), min(JJ - T + B1 - 1, N - 1)
        DO I = max(II - T, 2), min(II - T + B2 - 1, N - 1)
          A(I, J) = (A(I, J) + A(I + 1, J) + A(I - 1, J)
                     + A(I, J + 1) + A(I, J - 1)) / 5
        END DO
      END DO
    END DO
  END DO
END DO

Figure 1: An example of tiling: SOR relaxation.
increase the dynamic count of load instructions. In this paper, we shall show that these previously 
ignored factors can have a significant effect on tile-size selection. 
In our recent work [17], we present a memory cost model to estimate cache misses, assuming 
that only one loop level is tiled. In this paper, we present a more general scheme by considering 
two loop levels which may both be tiled. We present an algorithm to compute tile sizes such that 
during each tile traversal, capacity misses and self-interference misses are eliminated. Further, 
cross-interference misses are eliminated through array padding [15]. Given a tile size, we model the 
tiling cost based on both the number of cache misses and the loop overhead. To choose between 
tiling one loop level vs. tiling two loop levels, our algorithm computes their lowest costs and the
respective tile sizes. We then choose the tiling level, and the corresponding best tile size, which
yields the lowest cost. One can easily extend our discussion to higher loop levels, but such an
extension does not seem useful for applications known to us. 
In this paper, we consider data locality and performance enhancement on a single processor 
whose memory hierarchy includes cache memories at one or more levels. We have applied our
tile-size selection algorithm to five numerical kernels, SOR, Jacobi, Livermore Loop No. 18 (LL18),
tomcatv and swim, using a range of matrix sizes. We evaluate our algorithm on one processor of
an SGI multiprocessor and on a SUN uniprocessor workstation. We compare our algorithm with
TLI [3], TSS [4], LRW [9] and DAT [13]. Experiments show that our algorithm achieves an average
speedup of 1.27 to 1.63 over all these previous algorithms.
In the rest of the paper, we first present a background in Section 2. We then present our memory 
cost model in Section 3. We model the execution time and present our tile-size selection algorithm 
in Section 4. We discuss related work in Section 5. In Section 6, we report experimental results 
and compare our algorithm with previous algorithms. Finally, we conclude in Section 7. 
2 Background 
In this section, we first define our program model and a few key parameters. We then discuss the 
issues of the memory hierarchy. 
2.1 Tiling 
Most of the previous research on tiling addresses perfectly-nested loops only [8, 20, 21, 23]. After
tiling, the loops remain perfectly-nested. In our recent work [17, 18], we perform tiling on a class
of imperfectly-nested loops. Figure 2(a) shows a representative loop nest before tiling, where the
T-loop body consists of m perfectly-nested loops. The depth of each perfectly-nested inner loop is
at least two. The loop bounds $L_{ij}$ and $U_{ij}$, $1 \le i \le m$, $j = 1, 2$, are T-invariant. We assume that the




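(a) Before tiling

DO T = 1, ITMAX
  DO J1 = L11, U11
    DO I1 = L12, U12
      ...
    END DO
  END DO
  ...
  DO Jm = Lm1, Um1
    DO Im = Lm2, Um2
      ...
    END DO
  END DO
END DO

(b) 1-D tiling

DO JJ = γ1, γ2 + S1*(ITMAX-1), B1
  DO T = f1(JJ), g1(JJ)
    DO J1 = L'11, U'11
      DO I1 = L12, U12
        ...
      END DO
    END DO
    ...
    DO Jm = L'm1, U'm1
      DO Im = Lm2, Um2
        ...
      END DO
    END DO
  END DO
END DO

(c) 2-D tiling

DO JJ = γ1, γ2 + S1*(ITMAX-1), B1
  DO II = η1, η2 + S2*(ITMAX-1), B2
    DO T = f2(JJ, II), g2(JJ, II)
      DO J1 = L''11, U''11
        DO I1 = L''12, U''12
          ...
        END DO
      END DO
      ...
      DO Jm = L''m1, U''m1
        DO Im = L''m2, U''m2
          ...
        END DO
      END DO
    END DO
  END DO
END DO

Figure 2: The program model before and after tiling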
iteration space determined by J and I remains unchanged over different T-loop index values. For
simplicity of presentation, we also assume that cache-line spatial locality is already fully exploited
in the innermost loops except on the loop boundaries. Figure 2(b) shows the code after tiling the
$J_i$ loops only (1-D tiling), and Figure 2(c) shows the code after tiling both $J_i$ and $I_i$ loops (2-D
tiling). In Figures 2(b) and 2(c), the iteration subspace defined by all $J_i$ and $I_i$ loops is called a tile.
Loop T is called the tile-sweeping loop, and loops JJ and II are called the tile-controlling loops [20].
Each combination of JJ and II defines a tile traversal. Two tiles are said to be consecutive within
a tile traversal if the difference of the corresponding T values equals 1. In this paper, we assume
the data dependences permit both 1-D and 2-D tiling. Choosing between 1-D and 2-D tiling will
depend on the estimate of cache misses and loop overhead. As far as estimating cache misses is
concerned, 1-D tiling can be viewed as a special case of 2-D tiling with the maximum tile height.
However, 2-D tiling incurs higher loop overhead, which we want to take into account.
Let $\gamma_1 = \min\{L_{i1} \mid 1 \le i \le m\}$, $\gamma_2 = \max\{U_{i1} \mid 1 \le i \le m\}$, $\eta_1 = \min\{L_{i2} \mid 1 \le i \le m\}$ and
$\eta_2 = \max\{U_{i2} \mid 1 \le i \le m\}$. We call $S_1$ and $S_2$ the skewing factors corresponding to the $J_i$ and $I_i$ loops
respectively. (The skewing factors are also called the slope in our previous work [17, 18].) If $S_1 = 0$,
then loop skewing is not applied before tiling at the $J_i$ level. In this paper, we are interested only
in skewed tiling at least at the $J_i$ level, thus $S_1 > 0$. $B_1$ is called the tile width and $B_2$ is called the
tile height. $B_1$ and $B_2$ are called the tile size collectively. These parameters are used to define the
bounds of the tile-controlling loops. For reference, Table 1 lists all the symbols used in this paper
and their brief descriptions.
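As a small illustration of how these parameters define the tile-controlling loops, the sketch below computes the four extreme bounds from a list of per-loop bounds and enumerates the JJ and II values, assuming the loop headers "JJ = γ1, γ2 + S1*(ITMAX-1), B1" and "II = η1, η2 + S2*(ITMAX-1), B2" of the program model. The function name and the input layout are assumptions made for this example only.

```python
def tile_controlling_ranges(bounds, itmax, s1, s2, b1, b2):
    """Return (gamma1, gamma2, eta1, eta2) and the JJ and II index ranges.

    bounds is a list with one entry per inner loop nest i = 1..m, each of the
    form ((L_i1, U_i1), (L_i2, U_i2)) with inclusive Fortran-style bounds.
    """
    gamma1 = min(j_bounds[0] for j_bounds, _ in bounds)
    gamma2 = max(j_bounds[1] for j_bounds, _ in bounds)
    eta1 = min(i_bounds[0] for _, i_bounds in bounds)
    eta2 = max(i_bounds[1] for _, i_bounds in bounds)
    # Inclusive upper bounds become exclusive in Python's range, hence the +1.
    jj_values = range(gamma1, gamma2 + s1 * (itmax - 1) + 1, b1)
    ii_values = range(eta1, eta2 + s2 * (itmax - 1) + 1, b2)
    return (gamma1, gamma2, eta1, eta2), jj_values, ii_values


if __name__ == "__main__":
    # Hypothetical single loop nest with J and I both running over 2..9.
    extremes, jj, ii = tile_controlling_ranges([((2, 9), (2, 9))], 4, 1, 1, 3, 4)
    print(extremes, list(jj), list(ii))
```

Each pair of one JJ value and one II value corresponds to one tile traversal, so the product of the two range lengths gives the number of tile traversals under 2-D tiling.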
For simplicity, we assume all arrays are of two dimensions with the same column sizes. (We
assume column-major storage.) Lower dimension variables can be ignored due to their lesser impact
on cache misses in the relaxation programs which we are interested in. Let $n_a$ be the number of
two-dimensional arrays for the given tiled loop nest. Within the innermost loop $I_i$, $1 \le i \le m$, of the
untiled program in Figure 2(a), we assume array subscript patterns of $A_k(I_i + a, J_i + b)$, $1 \le k \le n_a$,
where a and b are known integer constants.
2.2 Memory Hierarchy 
The memory hierarchy includes registers, cache memories at one or more levels, the main memory 
and the secondary storage, as well as the TLB [7]. 
The TLB translates a virtual address into a physical address. The TLB has two key parameters,
namely the block count $T_c$ and the block size $T_b$. We call $T_s = T_c T_b$ the TLB size. In this paper, $T_b$
is the size of the virtual memory represented by each TLB entry in the number of data elements.
We assume a fully-associative TLB with an LRU replacement policy.

Table 1: Description of symbols
For simplicity of presentation, we consider two levels of caches in this paper, namely the L1
and L2 caches, which are common in current practice. The L1 cache has several parameters,
namely the cache size $C_{s1}$, the cache block size $C_{b1}$ and the set associativity $C_{a1}$. $C_{s1}$ and $C_{b1}$
are measured in the number of data elements. Similarly for the L2 cache, the cache size, cache block
size and set associativity are $C_{s2}$, $C_{b2}$ and $C_{a2}$ respectively. The cache misses can be divided into
three classes [7]: compulsory misses, capacity misses and conflict misses. Conflict misses can be
attributed to self-interference misses of the same array and to cross-interference misses between
different arrays.
3 A Memory Cost Model 
In this section, we want to estimate the number of cache misses incurred by executing the loop nest 
in our program model after tiling. 
Let $S_0$ represent the iteration space defined by $\gamma_1 \le J_i \le \gamma_2$ and $\eta_1 \le I_i \le \eta_2$ in Figure 2(a). (For
simplicity, we also regard $S_0$ as the original iteration space defined by the $J_i$ and $I_i$ loops in Figure 2(a),
as if all $J_i$ loops have the same loop bounds and all $I_i$ loops have the same loop bounds.) $S_0$ is
illustrated in Figure 3(a) by the rectangle enclosed by the solid lines with the height $\eta = \eta_2 - \eta_1 + 1$ and the
width $\gamma = \gamma_2 - \gamma_1 + 1$. Within each tile traversal, we define the base tile to be a tile with T = 1 and an advanced
tile to be a tile with T > 1. The dashed lines in Figure 3(a) separate the base tiles of different
tile traversals. The two shaded areas illustrate two different tile traversals, tt1 and tt2, where each
shaded rectangle with solid-line boundaries represents an advanced tile. When the tile-sweeping
loop T increases the index by 1, the tiles can only overlap partially.
The cache misses incurred by one tile traversal can be partitioned into those within the base tile
and those within the advanced tiles. Note that only those base tiles and advanced tiles overlapping
with $S_0$ will be executed, thus only they can contribute to the cache misses. In Figure 3(a), the
base tile in the tile traversal tt1 resides outside $S_0$, while the base tile in tt2 resides within $S_0$.
We make the following two assumptions in our estimation of the number of cache misses:
Assumption 1: There exists no cache reuse between different tile traversals.
Symbol        Description
$\gamma_1$    The minimum lower bound of all $J_i$ loops
$\gamma_2$    The maximum upper bound of all $J_i$ loops
$\eta_1$      The minimum lower bound of all $I_i$ loops
$\eta_2$      The maximum upper bound of all $I_i$ loops
$S_1$         The skewing factor of the $J_i$ loops
$S_2$         The skewing factor of the $I_i$ loops
$B_1$         The tile width
$B_2$         The tile height
$n_a$         The number of arrays in the given loop nest
$N$           The array column size
$\gamma$      $\gamma_2 - \gamma_1 + 1$
$\eta$        $\eta_2 - \eta_1 + 1$
$T_c$         The number of TLB entries
$T_b$         The number of data elements each TLB entry can represent
$T_s$         The TLB size in the number of data elements
$C_{s1}$      The L1 cache size in the number of data elements
$C_{b1}$      The L1 cache line size in the number of data elements
$C_{a1}$      The L1 cache set associativity
$C_{s2}$      The L2 cache size in the number of data elements
$C_{b2}$      The L2 cache line size in the number of data elements
$C_{a2}$      The L2 cache set associativity
$\tau$        Defined in Section 3
$p_1$         The L1 cache miss penalty
$p_2$         The L2 cache miss penalty
(illegible)   The array footprint width constrained by the TLB (see Section 4.2.3)
$n_1$         The sum of the static number of instructions for the computation of all the $I_i$ loop bounds
$n_2$         The sum of the static number of instructions in the $I_i$ loop bodies
$n_3$         The sum of the static number of instructions computing the $J_i$ loop bounds
$n_4$         The sum of the dynamic number of load instructions in the prologues and the epilogues of all software-pipelined $I_i$ loops
$n_5$         The sum of the number of load instructions divided by the unroll factor in the software-pipelined loop bodies
$S_0$         The iteration space defined by $\gamma_1 \le J_i \le \gamma_2$ and $\eta_1 \le I_i \le \eta_2$ in Figure 2(a)
ITMAX         The maximum index value of the tile-sweeping loop
$W$           The working-set size of the loop nest (Figure 2(a)) in the number of data elements




Figure 3: Illustration of tile traversal 
Assumption 2: $B_1 \ll \gamma$ and $B_1 \ll (\mathit{ITMAX}-1) \cdot S_1$.
Assumption 1 is reasonable if ITMAX is large, since it will be very likely for a tile traversal to
overwrite cache lines whose old data could have been reused in the next tile traversal. Assumption 2
is reasonable because a large $B_1$ can easily cause an overflow in the TLB. As explained later in
Section 4, our algorithm poses a constraint on $B_1$ such that the TLB should not overflow. If the tile
size $(B_1, B_2)$ is chosen properly, there should be exactly one cache miss for each cache line accessed
within a tile traversal. To be more specific, the following two properties should hold:
Property 1: No capacity and self-interference misses are generated within a tile traversal. 
Property 2: No cross-interference misses are generated within a tile traversal. 
In Section 4.2, we shall discuss how to preserve the above properties. For now, we assume they 
hold. 
We first show how to compute the number of L1 cache misses caused by an advanced tile. Let
W represent the size of the data set accessed by the original loop nest in terms of the number of
data elements. The average size of the data accessed by one tile is estimated to be $D = \frac{W}{\gamma\eta} B_1 B_2$.
Figure 3(b) shows two consecutive tiles, tt3 and tt4, within a tile traversal, assuming that both
tiles reside within $S_0$. The iteration subspace of tt4 is produced by shifting the iteration subspace
of tt3 upwards by $S_2$ iterations and to the left by $S_1$ iterations. The L1 cache misses in tt4
occur either in Region ABCD or in Region DEFG. The total estimated L1 cache misses equal
$(S_1 B_2 + S_2 B_1 - S_1 S_2) \cdot \frac{W}{\gamma\eta C_{b1}}$. (This estimate may not be exact because data accessed at the lower
border of Region DEFG may or may not be in the cache already.)
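The area term in this estimate can be checked by brute force. The following sketch is an illustration under stated assumptions: tiles are modeled as plain rectangles of iteration points, the shift directions are chosen by convention, and the function names are introduced here. It counts the points of the shifted tile that the previous tile did not cover and converts that count into an estimated number of misses.

```python
def new_points_in_shifted_tile(b1, b2, s1, s2):
    """Count iteration points of the shifted tile not covered by the previous tile.

    Both tiles are b1 points wide (J direction) and b2 points high (I direction).
    Assumes 0 <= s1 <= b1 and 0 <= s2 <= b2.
    """
    previous = {(j, i) for j in range(b1) for i in range(b2)}
    # Shift by s1 along J and by s2 along I; only the magnitudes matter here.
    current = {(j - s1, i + s2) for j in range(b1) for i in range(b2)}
    return len(current - previous)


def advanced_tile_l1_misses(b1, b2, s1, s2, w, gamma, eta, cb1):
    """Estimated L1 misses of one advanced tile: newly touched area scaled by
    the average data per iteration point, W / (gamma * eta), per cache line."""
    new_area = s1 * b2 + s2 * b1 - s1 * s2
    return new_area * w / (gamma * eta * cb1)


if __name__ == "__main__":
    b1, b2, s1, s2 = 8, 6, 1, 2
    print(new_points_in_shifted_tile(b1, b2, s1, s2),
          s1 * b2 + s2 * b1 - s1 * s2)
```

The brute-force count and the closed form agree because the overlap of the two rectangles is $(B_1 - S_1)(B_2 - S_2)$ points, which leaves $S_1 B_2 + S_2 B_1 - S_1 S_2$ new points.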
We then show how to accumulate the number of L1 cache misses for all the tile traversals with
the same JJ value. Figure 3(c) illustrates the idea. For a particular JJ value, let $t_1$, $t_2$, $t_3$ and $t_4$ be
the base tiles of four tile traversals, and let $t'_1$, $t'_2$, $t'_3$ and $t'_4$ be the corresponding advanced tiles when
T increases by 1. In this particular illustration, the number of L1 cache misses caused collectively
by $t_i$ ($1 \le i \le 4$) equals the sum of the number of L1 cache misses caused by each individual
$t_i$, that is, $\frac{W B_1}{\gamma C_{b1}}$. Note that only the tiles overlapping with $S_0$ can contribute to L1 cache misses.
Similarly, the number of L1 cache misses caused by the advanced tiles $t'_i$ ($1 \le i \le 4$) equals the
sum of the number of L1 cache misses caused by each individual $t'_i$, that is, $\frac{W S_1}{\gamma C_{b1}} + 2 (B_1 - S_1) S_2 \cdot \frac{W}{\gamma\eta C_{b1}}$.
In general, the number of L1 cache misses caused by the advanced tiles with the same JJ value is

$$\frac{W S_1}{\gamma C_{b1}} + \tau (B_1 - S_1) S_2 \cdot \frac{W}{\gamma\eta C_{b1}},$$

where $\tau$ is estimated as

$$\tau = \begin{cases} \lceil \eta / B_2 \rceil & \text{if } 1 \le B_2 < \eta + S_2 (\mathit{ITMAX}-1), \\ 0 & \text{if } B_2 = \eta + S_2 (\mathit{ITMAX}-1). \end{cases}$$

The value $\eta + S_2 (\mathit{ITMAX}-1)$ is the maximum height of the iteration space after tiling. Any $B_2$
value greater than or equal to $\eta + S_2 (\mathit{ITMAX}-1)$ results in no tiling at the $I_i$ loop level.

Figure 4: Calculating cache misses under different scenarios
With Assumptions 1 and 2, we can then accumulate L1 cache misses corresponding to different
JJ values by considering three different cases:

Case 1: $\gamma = (\mathit{ITMAX}-1) \cdot S_1$.
This case is illustrated by Figure 4(a). In this case, the tile traversals defined by $JJ \le \gamma_1 + \gamma - B_1$
will not execute to the ITMAXth T-iteration. The tile traversal defined
by $\gamma_1 + \gamma - B_1 < JJ \le \gamma_1 + \gamma$ is the first to reach the ITMAXth T-iteration. The tile
traversals defined by $JJ > \gamma_1 + \gamma$ will start executing at T > 1. During the execution, the tile
traversal defined by $JJ = \gamma_1$ will incur L1 cache misses of $\frac{W B_1}{\gamma C_{b1}}$. The tile traversal defined
by $JJ = \gamma_1 + B_1$ will incur L1 cache misses of $\frac{W B_1}{\gamma C_{b1}} \cdot 2 + \tau (B_1 - S_1) S_2 \cdot \frac{B_1}{S_1} \cdot \frac{W}{\gamma\eta C_{b1}}$. Hence,
we have the following:
- The L1 cache misses in all the tile traversals defined by $JJ \le \gamma_2 - B_1$ amount to
$\frac{W B_1}{\gamma C_{b1}} \left(1 + 2 + \cdots + \left(\left\lceil \frac{\gamma}{B_1} \right\rceil - 1\right)\right) + \tau (B_1 - S_1) S_2 \cdot \frac{B_1}{S_1} \cdot \frac{W}{\gamma\eta C_{b1}} \left(1 + 2 + \cdots + \left(\left\lceil \frac{\gamma - B_1}{B_1} \right\rceil - 1\right)\right)$.
- The L1 cache misses in all the tile traversals defined by $\gamma_2 - B_1 < JJ \le \gamma_2$ amount to
$\frac{W B_1}{\gamma C_{b1}} \left\lceil \frac{\gamma}{B_1} \right\rceil + \tau (B_1 - S_1) S_2 \cdot \frac{B_1}{S_1} \cdot \frac{W}{\gamma\eta C_{b1}} \left\lceil \frac{\gamma - B_1}{B_1} \right\rceil$.
- The L1 cache misses in all the tile traversals defined by $\gamma_2 < JJ$ amount to
$\frac{W B_1}{\gamma C_{b1}} \left(1 + 2 + \cdots + \left(\left\lceil \frac{\gamma}{B_1} \right\rceil - 1\right)\right) + \tau (B_1 - S_1) S_2 \cdot \frac{B_1}{S_1} \cdot \frac{W}{\gamma\eta C_{b1}} \left(1 + 2 + \cdots + \left\lceil \frac{\gamma - B_1}{B_1} \right\rceil\right)$.
Adding up the three numbers above, the total L1 cache misses in the tiled loop nest
approximate $\frac{W}{C_{b1}} \cdot \frac{\gamma}{B_1} + \tau \frac{W}{C_{b1}} \cdot \frac{S_2 \gamma}{S_1 \eta}$.

Case 2: $\gamma < (\mathit{ITMAX}-1) \cdot S_1$.
This case is illustrated by Figure 4(b). Similar to the computation in Case 1, we have the
following:
- The L1 cache misses in all the tile traversals defined by $JJ \le \gamma_2$ amount to
$\frac{W B_1}{\gamma C_{b1}} \left(1 + 2 + \cdots + \left\lceil \frac{\gamma}{B_1} \right\rceil\right) + \tau (B_1 - S_1) S_2 \cdot \frac{B_1}{S_1} \cdot \frac{W}{\gamma\eta C_{b1}} \left(1 + 2 + \cdots + \left\lceil \frac{\gamma - B_1}{B_1} \right\rceil\right)$.
- The L1 cache misses in all the tile traversals defined by $\gamma_2 < JJ \le (\mathit{ITMAX}-1) \cdot S_1 + \gamma_1$ amount to
$\frac{W B_1}{\gamma C_{b1}} \left\lceil \frac{\gamma}{B_1} \right\rceil \left\lceil \frac{(\mathit{ITMAX}-1) \cdot S_1 - \gamma}{B_1} \right\rceil + \tau (B_1 - S_1) S_2 \cdot \frac{B_1}{S_1} \cdot \frac{W}{\gamma\eta C_{b1}} \left\lceil \frac{\gamma}{B_1} \right\rceil \left\lceil \frac{(\mathit{ITMAX}-1) \cdot S_1 - \gamma}{B_1} \right\rceil$.
- The L1 cache misses in all the tile traversals defined by $(\mathit{ITMAX}-1) \cdot S_1 + \gamma_1 < JJ$
amount to $\frac{W B_1}{\gamma C_{b1}} \left(1 + 2 + \cdots + \left(\left\lceil \frac{\gamma}{B_1} \right\rceil - 1\right)\right) + \tau (B_1 - S_1) S_2 \cdot \frac{B_1}{S_1} \cdot \frac{W}{\gamma\eta C_{b1}} \left(1 + 2 + \cdots + \left\lceil \frac{\gamma - B_1}{B_1} \right\rceil\right)$.
Adding up the three numbers above, the total L1 cache misses in the tiled loop nest
approximate $\frac{W S_1 (\mathit{ITMAX}-1)}{C_{b1} B_1} + \tau \frac{W S_2 (\mathit{ITMAX}-1)}{\eta C_{b1}}$.

Case 3: $\gamma > (\mathit{ITMAX}-1) \cdot S_1$.
Similar to Case 2, the total L1 cache misses in the tiled loop nest approximate
$\frac{W S_1 (\mathit{ITMAX}-1)}{C_{b1} B_1} + \tau \frac{W S_2 (\mathit{ITMAX}-1)}{\eta C_{b1}}$.
Combining the above three cases and plugging in the estimate of $\tau$, the total number of L1 cache
misses is approximately

$$\frac{W S_1 (\mathit{ITMAX}-1)}{C_{b1} B_1} + \frac{W S_2 (\mathit{ITMAX}-1)}{C_{b1} B_2}. \qquad (1)$$

Similarly, with Properties 1 and 2 standing, the number of L2 cache misses for 2-D tiling is
approximately

$$\frac{W S_1 (\mathit{ITMAX}-1)}{C_{b2} B_1} + \frac{W S_2 (\mathit{ITMAX}-1)}{C_{b2} B_2}. \qquad (2)$$
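To make the dependence on the tile size concrete, the two estimates for 2-D tiling can be evaluated numerically. The sketch below is only an illustration: the function name and the sample parameter values are assumptions, and the cache line size argument selects between the L1 estimate and the L2 estimate.

```python
def misses_2d(w, s1, s2, itmax, cb, b1, b2):
    """Estimated cache misses under 2-D tiling for a cache with line size cb.

    w is the working-set size in data elements, (s1, s2) are the skewing
    factors and (b1, b2) is the tile size. Pass the L1 line size for the L1
    estimate and the L2 line size for the L2 estimate.
    """
    width_term = w * s1 * (itmax - 1) / (cb * b1)
    height_term = w * s2 * (itmax - 1) / (cb * b2)
    return width_term + height_term


if __name__ == "__main__":
    # Hypothetical values: 1000 elements, unit skewing factors, 11 T-iterations.
    for b1, b2 in [(10, 20), (20, 40), (40, 20)]:
        print((b1, b2), misses_2d(1000, 1, 1, 11, 4, b1, b2))
```

Both terms are inversely proportional to one tile dimension, so under this model enlarging the tile always reduces the estimated misses. The limit on tile growth comes from Properties 1 and 2, not from these formulas.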
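With 1-D tiling (in Figure 2(b)), the L1 cache temporal locality is not exploited across the T-loop
iterations. The total number of L1 cache misses is approximately

$$\mathit{ITMAX} \cdot \frac{W}{C_{b1}}. \qquad (3)$$

The total number of cache misses for the L2 cache is approximately

$$\frac{W S_1 (\mathit{ITMAX}-1)}{C_{b2} B_1}. \qquad (4)$$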
4 Tile-Size Selection 
In this section, we first present an execution cost model for tiling with a given tile size, based on 
both the number of cache misses and the loop overhead. We then present our tile-size selection 
algorithm, followed by a running example to go through our algorithm. 
4.1 An Execution Cost Model for Tiling
Loop tiling introduces loop overhead. To decide between 1-D tiling and 2-D tiling, the overhead of
the tiled $I_i$ loops in Figure 2(c) needs to be measured. Let $n_1$ be the sum of the static number of
instructions for the computation of all the $I_i$ loop bounds ($1 \le i \le m$). The $I_i$ loop overhead due
to 2-D tiling, in terms of the dynamic count of instructions, is measured approximately by

$$n_1 \cdot \frac{\mathit{ITMAX} \cdot \gamma \cdot \eta}{B_2}. \qquad (5)$$

Let $n_2$ be the sum of the static number of instructions in the $I_i$ ($1 \le i \le m$) loop bodies. The
dynamic instruction count for the $I_i$ loop bodies is

$$n_2 \cdot \mathit{ITMAX} \cdot \gamma \cdot \eta. \qquad (6)$$
From (5) and (6), if $n_1$ and $n_2$ are approximately equal, then a small $B_2$ will introduce large
loop overhead. Let $n_3$ be the sum of the static number of instructions for the computation of all the
$J_i$ loop bounds ($1 \le i \le m$). The loop overhead due to tiled $J_i$ loops can be measured by

$$n_3 \cdot \frac{\mathit{ITMAX} \cdot \gamma}{B_1}. \qquad (7)$$
Enabled by scalar replacement [2], in a software-pipelined loop [1], loaded data can be reused
in different iterations. The dynamic count of load instructions can hence be reduced. Let $n_4$ be
the sum of the dynamic count of load instructions in the prologues and the epilogues of all the
software-pipelined loops. Let $n_5$ be the sum of the number of load instructions divided by the
unroll factor in the software-pipelined loop bodies. The unroll factor is one if the loop is not
unrolled. The dynamic count of load instructions with 1-D tiling is approximately

$$(n_4 + n_5 \gamma)\, \eta \cdot \mathit{ITMAX}. \qquad (8)$$

With 2-D tiling, the dynamic count of load instructions is approximately

$$(n_4 + n_5 B_2) \cdot \frac{\gamma}{B_2} \cdot \eta \cdot \mathit{ITMAX} = \left(n_4 \frac{\gamma}{B_2} + n_5 \gamma\right) \eta \cdot \mathit{ITMAX}. \qquad (9)$$

Clearly, if $n_4$ is significantly greater than $n_5$ and $B_2$ is small, then the dynamic count of load
instructions with 2-D tiling can be much greater than that with 1-D tiling.
Let $p_1$ be the penalty for an L1 cache miss and $p_2$ be the penalty for an L2 cache miss. By
adding the penalty due to L1 cache misses in Formula (3), the penalty due to L2 cache misses in
Formula (4), the loop overhead due to tiled $J_i$ loops in Formula (7), and the dynamic count of load
instructions for software-pipelined innermost loops in Formula (8), we can model the execution cost
for 1-D tiling by

$$p_1 \left(\mathit{ITMAX} \cdot \frac{W}{C_{b1}}\right) + p_2 \left(\frac{W S_1 (\mathit{ITMAX}-1)}{C_{b2} B_1}\right) + n_3 \frac{\mathit{ITMAX} \cdot \gamma}{B_1} + (n_4 + n_5 \gamma)\, \eta \cdot \mathit{ITMAX}. \qquad (10)$$

In the above formula, we assume the latency of one unit of time for each instruction, including a
load instruction. From (10), with 1-D tiling, we want to maximize $B_1$ (subject to Properties 1 and
2 aforementioned) such that the number of L2 cache misses is minimized. By adding the penalty
due to L1 cache misses in Formula (1), the penalty due to L2 cache misses in Formula (2), the
dynamic count of load instructions for software-pipelined innermost loops in Formula (9), the loop
overhead due to tiled $J_i$ loops in Formula (7), and the loop overhead due to the tiled innermost
loop in Formula (5), the execution cost for 2-D tiling can be modeled by

$$p_1 \left(\frac{W S_1 (\mathit{ITMAX}-1)}{C_{b1} B_1} + \frac{W S_2 (\mathit{ITMAX}-1)}{C_{b1} B_2}\right) + p_2 \left(\frac{W S_1 (\mathit{ITMAX}-1)}{C_{b2} B_1} + \frac{W S_2 (\mathit{ITMAX}-1)}{C_{b2} B_2}\right) + n_1 \frac{\mathit{ITMAX} \cdot \gamma \cdot \eta}{B_2} + n_3 \frac{\mathit{ITMAX} \cdot \gamma}{B_1} + \left(n_4 \frac{\gamma}{B_2} + n_5 \gamma\right) \eta \cdot \mathit{ITMAX}. \qquad (11)$$
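The comparison between the two execution cost models can be sketched in code. The block below is a minimal illustration, not the selection algorithm itself: it assumes the 1-D cost is the sum of the L1 and L2 miss penalties, the $J_i$ loop overhead and the load count described above, and that the 2-D cost additionally includes the $I_i$ loop overhead. The `Params` container, the function names and all sample values are assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Params:
    w: float      # working-set size W, in data elements
    s1: int       # skewing factor S1
    s2: int       # skewing factor S2
    itmax: int    # number of T-iterations (ITMAX)
    cb1: int      # L1 cache line size, in data elements
    cb2: int      # L2 cache line size, in data elements
    gamma: int    # width of the iteration space
    eta: int      # height of the iteration space
    p1: float     # L1 miss penalty
    p2: float     # L2 miss penalty
    n1: int       # static instructions computing I-loop bounds
    n3: int       # static instructions computing J-loop bounds
    n4: int       # loads in software-pipeline prologues and epilogues
    n5: int       # loads per iteration of the software-pipelined body


def cost_1d(p, b1):
    """Execution cost estimate for 1-D tiling with tile width b1."""
    l1 = p.itmax * p.w / p.cb1
    l2 = p.w * p.s1 * (p.itmax - 1) / (p.cb2 * b1)
    j_overhead = p.n3 * p.itmax * p.gamma / b1
    loads = (p.n4 + p.n5 * p.gamma) * p.eta * p.itmax
    return p.p1 * l1 + p.p2 * l2 + j_overhead + loads


def cost_2d(p, b1, b2):
    """Execution cost estimate for 2-D tiling with tile size (b1, b2)."""
    l1 = (p.w * p.s1 * (p.itmax - 1) / (p.cb1 * b1)
          + p.w * p.s2 * (p.itmax - 1) / (p.cb1 * b2))
    l2 = (p.w * p.s1 * (p.itmax - 1) / (p.cb2 * b1)
          + p.w * p.s2 * (p.itmax - 1) / (p.cb2 * b2))
    i_overhead = p.n1 * p.itmax * p.gamma * p.eta / b2
    j_overhead = p.n3 * p.itmax * p.gamma / b1
    loads = (p.n4 * p.gamma / b2 + p.n5 * p.gamma) * p.eta * p.itmax
    return p.p1 * l1 + p.p2 * l2 + i_overhead + j_overhead + loads


def choose_tiling(p, b1_for_1d, tile_for_2d):
    """Return the cheaper option as a (label, cost) pair."""
    c1 = cost_1d(p, b1_for_1d)
    c2 = cost_2d(p, *tile_for_2d)
    return ("1-D", c1) if c1 <= c2 else ("2-D", c2)


if __name__ == "__main__":
    # Hypothetical machine and loop parameters, chosen only for illustration.
    params = Params(w=1000, s1=1, s2=1, itmax=11, cb1=4, cb2=10,
                    gamma=100, eta=100, p1=10, p2=50, n1=2, n3=3, n4=5, n5=1)
    print(choose_tiling(params, 10, (10, 20)))
```

With these made-up values the loop overhead and the extra prologue loads outweigh the L1 misses saved by 2-D tiling, which is the kind of trade-off the cost model is meant to expose. Different penalties or instruction counts can reverse the outcome.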
4.2 Tile-Size Selection Algorithm 




Procedure EnumFPSize(Cs, Cb, N)
  for F2 ← 1 to N do
    F1 ← 1
    t ← (F1 * N) mod Cs
    while ((F2 + Cb - 1) ≤ t ≤ (Cs - F2 - Cb + 1))
      Record (F1, F2)
      F1 ← F1 + 1
      t ← (F1 * N) mod Cs
    end while
  end for

Figure 5: Procedure EnumFPSize and an illustration of utilizing portions of the cache by a single
tile
4.2.1 Preserving Property 1
First, we discuss how to eliminate self-interference misses within a single tile. For any array $A_i$, let
R be the minimum rectangular array region which contains all the $A_i$ elements referenced within
a tile t. We say that $A_i$'s footprint size within tile t is $(F_1, F_2)$, where $F_1$ and $F_2$ are the numbers
of columns and rows in R respectively. We call $F_1$ ($F_2$) the array footprint width (height) for $A_i$
within tile t. Conversely, given a footprint size of $A_i$, the tile size can also be computed. Given
the subscript patterns and the loop bounds, such a computation is straightforward and we omit
the details. For the example of SOR (Figure 1(c)), assuming the array footprint size for A to be
$(\kappa_1, \kappa_2)$, the loop tile size should be $(\kappa_1 - 2, \kappa_2 - 2)$. For array $A_i$, if the footprint height $F_2$ is greater
than the distance between the locations of two columns in the cache, then the columns accessed
within the tile will conflict in the cache, creating self-interference misses [3]. More precisely, we
have the following lemma:
Lemma 1 Given array footprint size $(F_1, F_2)$ for any $A_i$ ($1 \le i \le n_a$), a cache of size $C_s$ and cache
line size $C_b$, if there exist no self-interference misses, then the distance between the starting cache
locations of any two columns of $A_i$ within $F_1$ consecutive columns is either no smaller than $F_2$, or
no greater than $C_s - F_2$. Conversely, there exist no self-interference misses if the distance between
the starting cache locations of any two columns of $A_i$ within $F_1$ consecutive columns is either no
smaller than $F_2 + C_b - 1$, or no greater than $C_s - F_2 - C_b + 1$.
Proof. Obvious. □
Given a directly-mapped cache of size $C_s$ and cache line size $C_b$, and given an array column size
N, procedure EnumFPSize in Figure 5(a) enumerates all the footprint sizes $(F_1, F_2)$ which incur
no self-interference misses, according to Lemma 1. We say that a footprint size $(F_1, F_2)$ of $A_i$ is
maximal if increasing either $F_1$ or $F_2$ will introduce self-interference misses for $A_i$. In general, the
maximal footprint size for array Ai is not unique. According to EnumFPSize, the maximal footprint 
sizes for all arrays are the same if they have the same array column sizes. Our tile-size selection 
scheme will enumerate all array footprint sizes which are free of self-interference misses until the 
sizes become maximal. The scheme estimates and compares the execution cost for different (Fl, F2) 
in order to get the optimal tile size. 
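The enumeration can be transcribed almost line by line. The sketch below is a hedged Python rendering of procedure EnumFPSize as it appears in Figure 5, with both comparisons in the loop condition taken as non-strict. The helper `maximal_sizes` is an addition for this example and reads "maximal" as meaning that neither $(F_1 + 1, F_2)$ nor $(F_1, F_2 + 1)$ was recorded.

```python
def enum_fp_size(cs, cb, n):
    """Enumerate footprint sizes (f1, f2) for a cache of size cs, line size cb
    and array column size n, all measured in data elements."""
    sizes = []
    for f2 in range(1, n + 1):
        f1 = 1
        t = (f1 * n) % cs
        # Keep widening the footprint while the cache offset of the next
        # column stays clear of the rows already occupied.
        while f2 + cb - 1 <= t <= cs - f2 - cb + 1:
            sizes.append((f1, f2))
            f1 += 1
            t = (f1 * n) % cs
    return sizes


def maximal_sizes(sizes):
    """Keep the sizes that cannot grow by one in either dimension."""
    recorded = set(sizes)
    return [(f1, f2) for (f1, f2) in sizes
            if (f1 + 1, f2) not in recorded and (f1, f2 + 1) not in recorded]


if __name__ == "__main__":
    # Hypothetical tiny cache: 16 elements, 2-element lines, column size 6.
    all_sizes = enum_fp_size(16, 2, 6)
    print(all_sizes)
    print(maximal_sizes(all_sizes))
```

The inner loop always terminates, because the offset $(F_1 \cdot N) \bmod C_s$ reaches zero once $F_1$ is a multiple of $C_s$, and zero is below the lower limit. For realistic cache sizes the list can be long, so a selection scheme would typically evaluate the cost of each candidate as it is produced rather than store them all.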
Next, suppose the cache is not directly-mapped, and assume an LRU replacement policy. 
Figure 6: An illustration of padding to eliminate cross-interferences 
We show that the parameter $C_s$ in procedure EnumFPSize should not be the whole cache size.
Otherwise, self-interference misses will occur when the execution proceeds from one tile to the
next. For clarity, instead of arguing formally for the general cases, we illustrate the cases of 2-way
and fully-associative caches. Figure 5(b) shows two consecutive tiles $t_1$ and $t_2$. Suppose $C_s$ equals
the whole cache size in procedure EnumFPSize and suppose the footprint size of $t_1$ is maximal.
Tile $t_1$ accesses the cache from the least-recently referenced data segment to the most-recently
referenced data segment in the memory, in the order of $D_1$, $D_2$, $D_3$ and $D_4$, which are separated by
solid lines. If the cache associativity is $C_{a1} = 2$, then $D_2$ and $D_4$ will map to the same cache sets.
The data accessed in the blank rectangle A will replace segment $D_2$. If the cache is fully associative,
$D_1$ will be replaced. However, part of the old data in segment $D_2$ (or $D_1$) could have been reused
by tile $t_2$. One solution to avoid the replacement of useful data is to reduce the footprint size
within $t_1$ such that only a portion of the cache is used to compute the maximal footprint size in
EnumFPSize. Figure 5(c) shows the case for a two-way set-associative cache. In this way, the data
accessed in Regions A and C will replace the cache segment $D_2$ and part of segment $D_1$, whose old
data are not reused by $t_2$. The reusable data in $D_3$ will be kept in the cache. Using the above idea,
we let $C_s = \frac{C_{a1} - 1}{C_{a1}} C_{s1}$ in procedure EnumFPSize, for 2-way and fully-associative caches. The cases
of other associativities are more complex, and they will not be discussed in this paper.
To eliminate capacity misses, the footprint size of each array $A_i$ can only be $(\lfloor F_1/n_a \rfloor, F_2)$, a
fraction of $(F_1, F_2)$. Here, we choose to partition columns instead of rows, in order to preserve
spatial locality. Assume that $(B_1^{(i)}, B_2^{(i)})$, $1 \le i \le n_a$, is the tile size such that the footprint size
for array $A_i$ within a single tile is $(\lfloor F_1/n_a \rfloor, F_2)$. For 2-way and fully-associative caches, we choose
the tile size for the tiled loop as $(B_1, B_2) = (\min_i B_1^{(i)}, \min_i B_2^{(i)})$. For directly-mapped caches, we
choose $(B_1, B_2) = (\min_i B_1^{(i)} - S_1, \min_i B_2^{(i)} - S_2)$. One can prove that for directly-mapped, 2-way
and fully-associative caches, Property 1 holds under the above treatment. For other set-associative
caches, procedure EnumFPSize needs to be revised.
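The rule for combining the per-array tile sizes can be illustrated with a short sketch. The
function name and the sample tile sizes below are hypothetical and only serve the illustration;
the per-array tile sizes are assumed to have been derived already from the footprint size.

```python
def combine_tile_sizes(per_array_tiles, s1, s2, direct_mapped):
    """Combine per-array tile sizes (B1_i, B2_i) into one loop tile size (B1, B2).

    The component-wise minimum is taken over all arrays. For a directly-mapped
    cache the skewing factors S1 and S2 are subtracted, as described in the text.
    """
    b1 = min(tile[0] for tile in per_array_tiles)
    b2 = min(tile[1] for tile in per_array_tiles)
    if direct_mapped:
        b1 -= s1
        b2 -= s2
    return b1, b2


# Made-up per-array tile sizes for two arrays, with S1 = S2 = 1.
tiles = [(40, 50), (38, 47)]
print(combine_tile_sizes(tiles, 1, 1, direct_mapped=False))
print(combine_tile_sizes(tiles, 1, 1, direct_mapped=True))
```

Taking the minimum keeps the footprint of every array within its share of the cache, at the
price of a tile that may be smaller than some arrays would allow on their own.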
4.2.2 Preserving Property 2 
We apply inter-array padding to eliminate cross-interference misses within a tile traversal. For
simplicity of presentation, we assume that the array subscript patterns of one particular array $A_k$
cover all the array subscript patterns for all the other arrays $A_i$, $i \ne k$. The discussion in this
section can be easily extended if such an assumption does not hold. Using inter-array padding, we
let the starting address of array $A_i$ ($1 \le i \le n_a$) map to the same location in the cache as the
starting address of the $(\lfloor F_1/n_a \rfloor \cdot (i-1))$th column of array $A_1$. With such padding, cross-interference
misses are eliminated within a single tile between $A_i$ and $A_j$ ($1 \le i, j \le n_a$, $i \ne j$).
When the execution goes from one tile to the next, if the cache is directly-mapped, the newly
accessed data for $A_i$ will map to cache locations previously unused in the tile traversal. If the









Input: S1, S2, Cs1, Ca1, Cb1, Cs2, Ca2, Cb2, n1, n3, n4, n5, na, N, σ (see Table 1).
Output: Tile size (B1, B2) and the transformed array declaration.
Procedure:
    ...
        ComputeTileSize-2D(((Ca1 - 1) / Ca1) * Cs1)
        ComputeTileSize-1D(...)
    end if
    Apply inter-array padding (see Section 4.2.2).
    Return (B1, B2).

Procedure ComputeTileSize-1D(Cs)
    /* (TB1, TB2) is a temporary tile size. */
    Select the maximum tile width K such that the footprint of one tile can fit in both the TLB and the L2 cache.
    TB1 ← K - S1, TB2 ← η + S2 * (ITMAX - 1)
    Compute the execution cost, TM, based on (10).
    if (TM < M) then B1 ← TB1, B2 ← TB2, M ← TM end if

Procedure ComputeTileSize-2D(Cs)
    /* (TB1, TB2) is a temporary tile size. */
    M ← ∞
    for F2 ← Cb1 to N do
        F1 ← 1
        L ← (F1 * N) mod Cs
        while (F1 ≤ σ or (F2 + Cb1 - 1) ≤ L ≤ (Cs - F2 - Cb1 + 1)) do
            Convert array footprint size (F1, F2) to loop tile size (TB1, TB2) (see Section 4.2.1).
            if (Ca1 = 1) then TB1 ← TB1 - S1, TB2 ← TB2 - S2 end if
            if (TB1 > 0 and TB2 > 0) then
                Compute the execution cost, TM, based on (11).
                if (TM < M) then B1 ← TB1, B2 ← TB2, M ← TM end if
            end if
            F1 ← F1 + 1
            L ← (F1 * N) mod Cs
        end while
    end for
Figure 7: Tile-size selection algorithm STS
either previously unused or will not be referenced again within the current traversal. Therefore,
cross-interference misses are also eliminated within a tile traversal. Figure 6 illustrates an example
for $F_1 = 4$ and $n_a = 2$, where the cache is directly mapped. Here, assuming the starting address for
array $A_1$ to be 0, the padded number of data items, $x$, between arrays $A_1$ and $A_2$ can be determined
from
$$(\mathrm{size}(A_1) + x) \equiv (2 \cdot N) \pmod{C_{s1}}. \quad (12)$$
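As a worked illustration of Equation (12), the sketch below computes the smallest non-negative
padding that satisfies the congruence. The function name and the sample array and cache sizes
are hypothetical; all quantities are assumed to be counted in array elements, with N elements
per column and $A_1$ starting at address 0.

```python
def inter_array_padding(size_a1, target_column, n, cache_size):
    """Smallest x >= 0 with (size_a1 + x) congruent to target_column * n (mod cache_size).

    target_column is the column of A1 whose cache location the next array should
    start at (2 in the Figure 6 example, where F1 = 4 and n_a = 2).
    """
    return (target_column * n - size_a1) % cache_size


# Hypothetical numbers: a 1000 x 1000 array A1 and a cache of 4096 elements.
x = inter_array_padding(size_a1=1000 * 1000, target_column=2, n=1000, cache_size=4096)
print(x)
# The padded start of A2 maps to the same cache location as column 2 of A1.
print((1000 * 1000 + x) % 4096 == (2 * 1000) % 4096)
```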
We are ready to present our tile-size selection algorithm in the next section. 
4.2.3 Algorithm STS 
Algorithm STS in Figure 7 selects the tile size by interleaving the operations in procedure
EnumFPSize with the applications of Formulas (10) and (11), which compute the execution cost. We require
$B_2$ to be no smaller than the cache line size $C_{b1}$. However, we do not require $B_2$ to be a multiple
of $C_{b1}$, since such a requirement does not have much benefit when execution proceeds from one tile
to the next. In addition to the conditions stated in procedure EnumFPSize, the array footprint
width $F_1$ should be no greater than $\sigma$, which is the total number of array columns representable
by the TLB minus the number of newly accessed array columns when the execution proceeds from
one tile to the next.
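The TLB bound can be sketched as follows. This is only one reading of the definition above:
it assumes that the TLB covers `tlb_entries * page_size` array elements, that each array
column holds N elements, and that the number of newly accessed columns per tile step is
supplied by the caller. The function and parameter names are hypothetical. Under these
assumptions, the parameters of the running example in Section 4.3 (48 entries, pages of 4096
elements, N = 1000, one new column) give the value 195 quoted there.

```python
def tlb_column_bound(tlb_entries, page_size, n, new_columns):
    """Array columns representable by the TLB, minus the newly accessed columns.

    page_size and n are counted in array elements (an assumption of this sketch).
    """
    columns_in_tlb = (tlb_entries * page_size) // n
    return columns_in_tlb - new_columns


print(tlb_column_bound(tlb_entries=48, page_size=4096, n=1000, new_columns=1))
```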




STS makes the decision between 1-D and 2-D tiling based on their execution cost. For 1-D tiling,
ComputeTileSize-1D tries to find the tile width $B_1$ such that Properties 1 and 2 are preserved on the
L2 cache and that Formula (10) is minimized. For 2-D tiling, ComputeTileSize-2D enumerates all
tile sizes which are free of self-interference misses. The tile size with the lowest execution cost is
selected. Between 1-D and 2-D tiling, the scheme with the lower execution cost is chosen.
STS needs a conversion from array footprint size $(F_1, F_2)$ to loop tile size $(B_1, B_2)$, as stated in
Section 4.2.1. If the resulting tile width or tile height is nonpositive, 1-D tiling is chosen.
The complexity of STS is $O(N \cdot \min(C_{s1}, \sigma)) = O(N\sigma)$. (In practice, $\sigma$ is much smaller than
the L1 cache size $C_{s1}$.)
4.3 A Running Example 
We now take SOR (Figure 1) as an example to show how STS works, assuming the following
parameters: $N = 1000$, ITMAX $= 1050$, $C_{s1} = 4096$, $C_{b1} = 4$, $C_{a1} = 2$, $C_{s2} = 128 \cdot 1024$, $C_{b2} = 16$,
$C_{a2} = 2$, $T_b = 4096$ and $T_c = 48$, $n_1 = 15$, $n_3 = 15$, $n_4 = 20$, $n_5 = 3$, $p_1 = 6$, and $p_2 = 30$. Based on
the array subscripts and the loop bounds, we have $S_1 = S_2 = 1$, $\gamma = \eta = 999$, $W = N \cdot N = 1000000$
and $\sigma = 195$.
In the following, we show the steps of STS.
• Since $C_{a1} = 2$, ComputeTileSize-2D is called, and we have $B_1 = 38$, $B_2 = 43$. The
  execution cost for 2-D tiling is $M = 4171464893$ units based on Formula (11).
• ComputeTileSize-1D computes $TB_1 = 63$, $TB_2 = 2048$. The execution cost for 1-D
  tiling is $TM = 4764840588$ units based on Formula (10). In this case, STS favors 2-D tiling
  over 1-D tiling with the tile size (38, 43).
• No inter-array padding is applied since $n_a = 1$.
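The arithmetic of this example can be retraced with a few lines. The sketch only reuses the
numbers quoted above; it does not implement Formulas (10) and (11), whose results are taken
as given, and the variable names are chosen for this illustration.

```python
# Parameters quoted in the running example.
ITMAX = 1050
S2 = 1
eta = 999

# For 1-D tiling the tile height spans the whole skewed column:
# TB2 = eta + S2 * (ITMAX - 1), as in procedure ComputeTileSize-1D.
tb2 = eta + S2 * (ITMAX - 1)

# Candidate tile sizes with the execution costs reported in the text.
candidates = {
    "2-D": {"tile": (38, 43), "cost": 4171464893},
    "1-D": {"tile": (63, tb2), "cost": 4764840588},
}

# STS keeps the scheme with the lower execution cost.
best = min(candidates, key=lambda name: candidates[name]["cost"])
print(tb2, best, candidates[best]["tile"])
```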
5 Related Work 
5.1 Competing Tile-Size Selection Schemes 
Chame and Moon present a tile size selection algorithm, called TLI, to simultaneously eliminate
self-interference misses and minimize the summation of capacity misses and cross-interference misses [3].
Coleman and McKinley provide a tile size selection algorithm, TSS, based on the cache organization
and the data layout [4]. TSS utilizes a gcd algorithm to exploit maximum cache utilization while
eliminating all self-interference misses. Rivera and Tseng present a variation of the TSS algorithm [16].
Lam et al. provide a tile size selection scheme, LRW, which tries to select a square tile size to
eliminate the capacity and self-interference misses for a dominant array [9]. Panda et al. present
DAT, which always chooses square tile sizes and tries to minimize the interferences by padding [13].
Unlike the work in this paper, these tile-size selection algorithms do not consider the effect of loop
skewing, nor do they take loop overhead into account.
5.2 Other Related Work
Ghosh et al. estimate cache misses, given a tile size, for a perfect loop nest [6]. They also informally
discuss a tile-size selection scheme using matrix multiplication as the example. No formal algorithm
is presented, however. They do not discuss the estimation of cache misses for imperfectly-nested
loops. Therefore, we are not able to compare with their method in our experiments.















Table 2: Machine parameters
Ferrante et al. present an algorithm to estimate the number of distinct cache lines over a perfect
loop nest [5]. Temam et al. derive an analytical method to estimate the number of self-interference
misses [19]. McKinley et al. present a simple cost model to estimate the number of cache misses [11].
These methods do not consider the effect of loop skewing.
Rivera and Tseng present several padding algorithms to eliminate cache conflict misses [15, 16].
Manjikian and Abdelrahman use cache partitioning to scatter arrays evenly in the cache, such that
cross-interference misses are minimized [10]. We use a different padding scheme which seems more
suitable for our algorithm.
6 Experimental Evaluation
We apply our tile-size selection algorithm STS to three numerical kernels, SOR, Jacobi and
Livermore Loop No. 18 (LL18), and two SPEC benchmarks, tomcatv and swim. We use reference inputs
for tomcatv and swim. For SOR, Jacobi and LL18, we declare N x N double precision arrays, with




$$z_{n+1} = (16807\, z_n) \bmod 2147483647. \quad (13)$$
Cs2    Cb2
256K   8




Assuming that the array sizes under consideration range from $r_0$ to $r_1$, we select 200 array sizes,
$a_n$, such that










Tb
1K
4K
We use $z_1 = 9$ in all our experiments. Note that it would be too time-consuming to exhaustively test
all array sizes within the range in our experiments.
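The recurrence in Equation (13) is easy to reproduce. The sketch below generates the sequence
$z_1, z_2, \ldots$ starting from the seed 9 used in the experiments. The function name is chosen
for this illustration, and the mapping from $z_n$ to the array sizes $a_n$ is not shown here.

```python
MULTIPLIER = 16807
MODULUS = 2147483647  # 2**31 - 1


def lehmer_sequence(seed, count):
    """Return [z_1, ..., z_count] with z_{n+1} = (16807 * z_n) mod 2147483647."""
    values = []
    z = seed
    for _ in range(count):
        values.append(z)
        z = (MULTIPLIER * z) % MODULUS
    return values


sequence = lehmer_sequence(seed=9, count=200)
print(sequence[:3])
```

Because Python integers have arbitrary precision, the product `MULTIPLIER * z` cannot overflow;
a fixed-width implementation would need 64-bit intermediates or a decomposition of the modulus.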
We run the test programs on a SUN Ultra II uniprocessor workstation and on one MIPS
R10K processor of an SGI Origin 2000 multiprocessor, with the tile sizes selected by five
different algorithms, namely, STS, TLI [3], TSS [4], LRW [9] and DAT [13]. In order to handle
several equally-important arrays, we make an obviously necessary modification to the original
TSS and LRW algorithms such that the value of the initial tile size will meet the working set
constraint. We also modify the TLI algorithm such that only the cache size divided by the number
of equally-important arrays is used to compute the tile sizes which are free of self-interference
misses. If any algorithm decides to choose the whole array column as the tile height, then we let
$B_2 = \eta + S_2 \cdot (\mathrm{ITMAX} - 1)$ and tile the Ji loops only (Figure 2(b)).
Table 2 lists the machine parameters for the Ultra II and the R10K, assuming the size of an
array element is 8 bytes. To accommodate the competition between instructions and data in the
L2 cache both on the Ultra II and on the R10K, we only try to utilize 95% of the total L2 cache
capacity. We use the machine counters on the R10K and the Ultra II to measure the cache miss
rate. Currently, we obtain the values of $n_1$, $n_3$, $n_4$ and $n_5$ by examining the assembly code of the
original program. A backend compiler can easily obtain such numbers.
On the R10K, the untiled codes are compiled using the native compiler with the "-O3"
optimization switch set. On the R10K, we found that compiling the tiled code with the "-O2" switch

















































Figure 9: L1 cache miss rate of SOR for various schemes on the Ultra II
schemes. Therefore, we compile the tiled code with "-O2" or "-O3" depending on which produces
shorter execution time. For all the tile-size selection schemes, we switch off loop tiling for the native
compiler on the R10K when we compile the tiled source programs (for both 1-D and 2-D tiling).
We switch off prefetching on the R10K when we compile 2-D tiled source codes since prefetching
may increase cross-interference misses for smaller tile height $B_2$. We also switch off common block
reorganization since the tile size selection algorithms already take care of memory layout. On the
Ultra II, both the untiled and the tiled codes are compiled using the native compiler with the
"-fast -xchip=ultra2 -xarch=v8plusa -fsimple=2" optimization switch, which is recommended by
the vendor.
The SOR kernel
We fix ITMAX to 1050 and randomly choose 200 array sizes ranging from 200 to 2000, i.e., $(r_0, r_1) =
(200, 2000)$ in Equation (14). The skewing factors are $S_1 = S_2 = 1$. We have $n_1 = n_3 = 11$, $n_4 = 9$
and $n_5 = 3$ on the R10K and $n_1 = n_3 = 22$, $n_4 = 34$, $n_5 = 4$ on the Ultra II. Table 3 summarizes the
average speedup by STS over other schemes, and the average L1 and L2 cache miss rates for SOR on both
the Ultra II and the R10K. The execution time is averaged by geometric mean, and the cache miss
rates are averaged by the arithmetic mean of the cache miss rates for individual array sizes. Specifically,
Figures 8 and 11 show the execution time for various schemes on the Ultra II and on the R10K
respectively. Figures 9 and 10 show the L1 cache and L2 cache miss rates respectively on the Ultra
II. Figures 12 and 13 show the L1 cache and L2 cache miss rates respectively on the R10K.







Figure 10: L2 cache miss rate of SOR for various schemes on the Ultra II
Figure 11: Execution time of SOR for various schemes on the R10K
Figure 13: L2 cache miss rate of SOR for various schemes on the R10K
Figure 16: L2 cache miss rate of Jacobi for various schemes on the Ultra II
Figure 17: Execution time of Jacobi for various schemes on the R10K
Figure 18: L1 cache miss rate of Jacobi for various schemes on the R10K






Table 3: Speedup by STS and average cache miss rates for different schemes for SOR
Figure 19: L2 cache miss rate of Jacobi for various schemes on the R10K
The Jacobi Kernel
We fix ITMAX to 500 and randomly choose 200 array sizes ranging from 200 to 2000. The skewing
factors are $S_1 = S_2 = 1$. We have $n_1 = n_3 = 17$, $n_4 = 28$ and $n_5 = 10$ on the R10K and
$n_1 = n_3 = 28$, $n_4 = 24$, $n_5 = 3$ on the Ultra II. Table 4 shows the average speedup by STS, average
L1 and L2 cache miss rates for Jacobi on both the Ultra II and the R10K. Specifically, Figures 14
and 17 show the execution time of Jacobi for various schemes on the Ultra II and on the R10K
respectively. Figures 15 and 16 show the L1 cache and L2 cache miss rates respectively on the
Ultra II. Figures 18 and 19 show the L1 cache and L2 cache miss rates respectively on the R10K.



















Figure 20: Execution time of LL18 for various schemes on the Ultra II
Figure 21: L1 cache miss rate of LL18 for various schemes on the Ultra II
Figure 22: L2 cache miss rate of LL18 for various schemes on the Ultra II
Figure 23: Execution time of LL18 for various schemes on the R10K
Table 4: Speedup by STS and average cache miss rates for different schemes for Jacobi
Figure 24: L1 cache miss rate of LL18 for various schemes on the R10K









LL18 has 9 arrays, and the tiled version has 11 arrays after duplicating ZR and ZZ. Due to the
relatively large number of arrays, the array sizes we used in SOR will produce extremely small
tile sizes for all the tile-size selection schemes. Therefore, we reduce the array sizes and randomly
choose 200 array sizes ranging from 200 to 500. We fix ITMAX to 300. The skewing factors are
$S_1 = S_2 = 2$. We have $n_1 = n_3 = 75$, $n_4 = 100$ and $n_5 = 35$ on the R10K and $n_1 = n_3 = 87$,
$n_4 = 14$, $n_5 = 8$ on the Ultra II. Table 5 shows the average speedup by STS, average L1 and
L2 cache miss rates for LL18 on both the Ultra II and the R10K. Specifically, Figures 20 and 23
show the execution time of LL18 for various schemes on the Ultra II and on the R10K respectively.
Figures 21 and 22 show the L1 cache and L2 cache miss rates respectively on the Ultra II. Figures 24
and 25 show the L1 cache and L2 cache miss rates respectively on the R10K. Out of 200 cases,
STS chooses 1-D tiling on 186 cases on the Ultra II and on all 200 cases on the R10K. All the
other tiling schemes either choose 2-D tiling or no tiling if they fail to generate legal tile sizes.
Table 5: Speedup by STS and average cache miss rates for different schemes for LL18
Figure 26: Performance of tomcatv with different tile sizes on the Ultra II: (a) SPEC92, (b) SPEC95











tomcatv can only be tiled with one dimension [18], hence only STS can be applied for tile-size
selection. We use two different reference inputs from SPEC92 and SPEC95 respectively. To verify
whether STS produces nearly the best results, we run through a range of tile sizes, from 2 to twice
the size selected by STS, for each version of tomcatv. Figures 26(a) and (b) show the results
on the Ultra II, where the vertical bar indicates the tile size selected by STS. The original
programs from SPEC92 and SPEC95 run 5 and 174 seconds respectively on the Ultra II, and 4.0
and 115.0 seconds respectively on the R10K. Figures 28(a) and (b) show the results on the R10K.
STS chooses near-optimal tile sizes for both versions of the codes on both machines. To examine
how padding will affect STS, we also run both versions of tomcatv on both machines without
padding applied. Figures 27(a) and (b) show the results on the Ultra II, and Figures 29(a) and
(b) show the results on the R10K. Except for a few cases, the padded version runs significantly faster than
the unpadded version.









Similar to tomcatv, swim is tiled only with one dimension. We use three different reference inputs


































Figure 29: Performance of tomcatv with different tile sizes and without padding on the R10K: (a) SPEC92, (b) SPEC95
Figure 30: Performance of swim with different tile sizes on the Ultra II: (a) SPEC92, (b) SPEC95, (c) SPEC2000
Figure 31: Performance of swim with different tile sizes and without padding on the Ultra II: (a) SPEC92, (b) SPEC95, (c) SPEC2000
Figure 32: Performance of swim with different tile sizes on the R10K: (a) SPEC92, (b) SPEC95, (c) SPEC2000
Table 6: Summary of speedup of STS over other schemes 
from 2 to twice the size selected by STS for each version of swim. The original programs from
SPEC92, SPEC95 and SPEC2000 run 36, 157 and 930 seconds respectively on the Ultra II, and
21.2, 91.9 and 619.5 seconds respectively on the R10K. Figures 30(a), (b) and (c) show the results
on the Ultra II, and Figures 32(a), (b) and (c) show the results on the R10K. STS chooses
near-optimal tile sizes for all versions of the codes on both machines. Figures 31(a), (b) and (c)
show the results on the Ultra II for unpadded versions of swim, and Figures 33(a), (b) and (c) show
the results on the R10K. Similar to tomcatv, the padded version runs faster than the unpadded version in
most cases for SPEC92 and SPEC95. Note that on the Ultra II, the TLB size is smaller than the
L2 cache size, hence STS will result in an underutilization of the L2 cache. For SPEC2000, however,
such an underutilization seems to have a negative effect on performance.
6.1 Discussion 
In summary, Table 6 shows the speedup by STS over all the other schemes for all 600 cases for
SOR, Jacobi and LL18, where "Both" stands for both the Ultra II and the R10K.
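The "Both" row of Table 6 appears to be consistent with the geometric mean of the two
per-machine rows, which is what one would expect when execution times are averaged by
geometric mean over the same number of cases on each machine. The sketch below checks this
with the values listed in Table 6. How the row was actually computed is not stated in the
text, so this is an observation rather than a description of the authors' procedure.

```python
import math

# Speedups of STS over each scheme, as listed in Table 6.
schemes = ["ORG", "LRW", "TSS", "TLI", "DAT"]
ultra_ii = [2.24, 1.63, 1.95, 1.37, 1.37]
r10k = [2.28, 1.24, 1.36, 1.22, 1.17]
both = [2.26, 1.42, 1.63, 1.29, 1.27]


def geometric_mean(values):
    """Geometric mean of a sequence of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


for name, u, r, b in zip(schemes, ultra_ii, r10k, both):
    gm = geometric_mean([u, r])
    print(f"{name}: geometric mean {gm:.2f}, Table 6 'Both' {b:.2f}")
```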
One interesting point is related to LRW. Considering the combination of each benchmark
(SOR, Jacobi and LL18) and each machine (Ultra II and R10K), LRW produces equal or smaller
average L1 cache misses in 5 out of 6 combinations compared with STS. However, this does not
translate into large performance saving. (The worst average speed ratio of STS over LRW is 0.98.)
We found that in general LRW produces smaller tile sizes than STS, which potentially introduces
more loop overhead. For LL18, LRW has greater average L2 cache miss rates than STS since STS
exploits locality for the L2 cache in most cases due to the large number of arrays.
7 Conclusion 
In this paper, we present a memory cost model to predict the cache misses after skewed tiling.
Further, we model the execution cost by considering both the cache misses and the loop overhead, based
on which we make a decision between tiling one loop level vs. two loop levels. We present Algorithm
STS, which selects the tile size such that the capacity misses and self-interference misses within a
tile traversal are eliminated. STS uses inter-array padding to eliminate cross-interference misses.
We also compare STS with four previous algorithms, TLI, TSS, LRW and DAT. Experiments show
that STS achieves an average speedup of 1.27 to 1.63 over all the other four algorithms. We have
previously implemented a cost model along with a number of tiling algorithms [18]. However, we
are yet to implement the cost model presented in this paper. Ideally, our cost model should be
incorporated in a backend compiler, which will be our future work.
           ORG    LRW    TSS    TLI    DAT
Ultra II   2.24   1.63   1.95   1.37   1.37
R10K       2.28   1.24   1.36   1.22   1.17
Both       2.26   1.42   1.63   1.29   1.27
References
[1] Vicki Allan, Reese Jones, Randall Lee, and Stephen Allan. Software pipelining. ACM
Computing Surveys, 27(3):367-432, September 1995.
























[2] David Callahan, Steve Carr, and I<en Kennedy. Improving register allocation for subscripted 
variables. In Proceedings of A CM Sf GPLAN I990 Conference on Programming Language 
Design and Implementation, pages 53-65, White Plains, New York, June 1990. 
[3] Jacqueline Chame and Sungdo Moou. A tile selection algorithm for data locality and 
cache interference. In Proceedings of the thirteenth A CM International Conference on 
Supercomputing, pages 492-499, Rhodes, Greece, June 1999. 
[4] Stephanie Coleman and Kathryn S. McKinley. Tile size selection using cache organization and 
data layout. In Proceedings of A CM SIGPLAN conference on Programming Langvage Design 
and Implementation, pages 279-290, La JoUa, CA, June 1995. 
[5] J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. 
In Proceedings of 4th International Workshop on Languages and Cotnpzlers for Pamllel 
Computing, August 1991. Also in Lecture Notes in Computer Science, pp. 328-341, 
Springer-Verlag, Aug. 1991. 
[6] Somnath Ghosh, Ma~garet Martonosi, and Sharad Malik. Precise miss analysis for program 
transformations with caches of arbitrary associativity. In Proceedings of the 6th ACM 
Conference on Architectural Support for Programming Languages and Operating Sgstems, 
pages 228-239, Sau Jose, California, October 1998. 
[7] John Hennessy and David Patterson. Computer Architecture: A Quantitative Approach. 
Morgan Kaufmann Publishers, 1996. 
[8] Induprakas Kodukula, Nawaaz Ahmed, and Keshav Pingali. Data-centric multi-level 
blocking. In Proceedings of ACM SIGPLAN Conference on Programming Language Design 
and Implementation, pages 346-357, Las Vegas, NV, June 1997. 
[9] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf. The cache performance and 
optimizations of blocked algorithms. In Proceedings of the 4th International Conference on 
Architectural Support for Programming Languages and Operating Systems, pages 63-74, Santa 
Clara, CA, April 1991. 
[10] Naraig Manjikian and Tarek Abdelrahman. Fusion of loops for parallelism and locality. IEEE 
Transactions on Parallel and Distributed Systems, 8(2):193-209, February 1997. 
[11] Kathryn McKinley, Steve Carr, and Chau-Wen Tseng. Improving data locality with loop 
transformations. ACM Transactions on Programming Languages and Systems, 18(4):424-453, 
July 1996. 
[12] Nicholas Mitchell, Karin Högstedt, Larry Carter, and Jeanne Ferrante. Quantifying the 
multi-level nature of tiling interactions. International Journal of Parallel Programming, 
26(6):641-670, December 1998. 
[13] Preeti Panda, Hiroshi Nakamura, Nikil Dutt, and Alexandru Nicolau. Augmenting loop tiling 
with data alignment for improved cache performance. IEEE Transactions on Computers, 
48(2):142-149, February 1999. 
[14] Stephen Park and Keith Miller. Random number generators: Good ones are hard to find. 
Communications of the ACM, 31(10):1192-1201, October 1988. 
[15] Gabriel Rivera and Chau-Wen Tseng. Eliminating conflict misses for high performance 
architectures. In Proceedings of the 1998 ACM International Conference on Supercomputing, 
pages 353-360, Melbourne, Australia, July 1998. 
[16] Gabriel Rivera and Chau-Wen Tseng. A comparison of compiler tiling algorithms. In 
Proceedings of the 8th International Conference on Compiler Construction, Amsterdam, The 
Netherlands, March 1999. 
[17] Yonghong Song and Zhiyuan Li. A compiler framework for tiling imperfectly-nested loops. 
In the 12th International Workshop on Languages and Compilers for Parallel Computing, San 
Diego, CA, August 1999. 
[18] Yonghong Song and Zhiyuan Li. New tiling techniques to improve cache temporal locality. 
In Proceedings of ACM SIGPLAN Conference on Programming Language Design and 
Implementation, pages 215-228, Atlanta, GA, May 1999. 
[19] O. Temam, C. Fricker, and W. Jalby. Cache interference phenomena. In Proceedings of 
SIGMETRICS'94, pages 261-271, Santa Clara, CA, 1994. 
[20] Michael Wolf. Improving Locality and Parallelism in Nested Loops. PhD thesis, Department 
of Computer Science, Stanford University, August 1992. 
[21] Michael E. Wolf and Monica S. Lam. A data locality optimizing algorithm. In Proceedings of 
ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 
30-44, Toronto, Ontario, Canada, June 1991. 
[22] Michael E. Wolf, Dror E. Maydan, and Ding-Kai Chen. Combining loop transformations 
considering caches and scheduling. In Proceedings of the 29th Annual IEEE/ACM International 
Symposium on Microarchitecture, pages 274-286, Paris, France, December 1996. 
[23] Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley 




