Effective Use of the Level-Two Cache for Skewed Tiling (Extended Version) by Song, Yonghong & Li, Zhiyuan
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
2001 
Effective Use of the Level-Two Cache for Skewed 
Tiling (Extended Version) 
Yonghong Song 
Zhiyuan Li 
Purdue University, li@cs.purdue.edu 
Report Number: 
01-006 
Song, Yonghong and Li, Zhiyuan, "Effective Use of the Level-Two Cache for Skewed Tiling 
(Extended Version)" (2001). Department of Computer Science Technical Reports. Paper 1504. 
https://docs.lib.purdue.edu/cstech/1504 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 
Department of Computer Sciences
Purdue University
West Lafayette, IN 47907
CSD TR #01-006
April 2001
Effective Use of The Level-Two Cache
for Skewed Tiling
(Extended Version) *
Yonghong Song Zhiyuan Li
Department of Computer Sciences
Purdue University
West Lafayette, IN 47907
{songyh,li}@cs.purdue.edu
Abstract
Tiling is a well-known loop transformation technique to enhance temporal data locality. In
our previous work, we developed a skewed tiling technique for relaxation codes, which
requires loop skewing to be applied before loop tiling. In this paper, we study how to effectively use
the level-two cache for skewed tiling through a tile-size selection algorithm, STS. In particular,
we address two questions: (1) when to focus on enhancing locality for the L2 cache instead of
the L1 cache, and (2) how to improve the L2 cache locality such that the overall performance
can be improved. We address the first question by developing an execution cost model which
incorporates both the L1 and the L2 cache misses. We address the second question by applying
inter-array padding to minimize cross-interference misses.
We compare STS with several previously known algorithms. For certain test cases, STS is
significantly better than those previous algorithms because it effectively exploits the L2 cache
locality. For other cases, STS achieves comparable results because it also effectively exploits
the L1 cache locality. For two well-known SPEC benchmarks with different inputs on two different
machines, we also compare our inter-array padding algorithm with a previously proposed
padding algorithm. Our padding algorithm is significantly better.
1 Introduction
Memory access latency has become a key performance bottleneck on modern microprocessors.
An important approach to reducing the average latency is to exploit data locality on the cache
memories. Tiling is a well-known compiler technique to enhance data locality such that more
data can be reused before they are replaced from the cache [23]. Tiling transforms a loop nest
by combining strip-mining and loop interchange. Loop skewing and loop reversal are often used to
enable tiling [20]. Figure 1 shows SOR relaxation as an example. Figure 1(a) shows the original
loop nest in SOR, Figure 1(b) shows the tiled SOR in which loop J is skewed with respect to loop
T (1-D tiling), and Figure 1(c) shows the tiled SOR in which both loops J and I are skewed with
respect to loop T (2-D tiling). In this paper, we call tiling enabled by loop skewing skewed tiling.
Much of the previous work on tiling applies to perfectly-nested loops only [6, 20, 21, 23]. Since very
few programs are known to have perfectly-nested loops, we recently proposed a new skewed-tiling
technique to tile a class of imperfectly-nested loops, including typical loops that perform
iterative relaxation computations [15, 16].
* This work is sponsored in part by National Science Foundation through Grants CCR-9975309, ITR/ACI-0082834
and MIP-9610379, by Indiana 21st Century Fund, by Purdue Research Foundation, and by a donation from Sun
Microsystems, Inc.
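The effect of skewing followed by 1-D tiling can be illustrated with a small, runnable sketch. It is not the SOR code of Figure 1: the five-point in-place update, its coefficient 0.2, the 0-based indexing, and the helper names `make_grid`, `update`, `relax_untiled` and `relax_skewed_1d` are all assumptions made for illustration. The sketch only checks that sweeping the skewed column coordinate J + T in tiles of width B1 visits the same iterations in a dependence-respecting order, so the result matches the untiled loop nest.

```python
def make_grid(n):
    """Deterministic n-by-n test grid (the values are arbitrary)."""
    return [[float((7 * i + 3 * j) % 11) for j in range(n)] for i in range(n)]


def update(a, i, j):
    # Assumed five-point in-place update; the coefficient 0.2 is illustrative.
    a[i][j] = 0.2 * (a[i][j] + a[i + 1][j] + a[i - 1][j]
                     + a[i][j + 1] + a[i][j - 1])


def relax_untiled(a, itmax):
    """Reference version: itmax in-place sweeps over the grid interior."""
    n = len(a)
    for _t in range(1, itmax + 1):
        for j in range(1, n - 1):
            for i in range(1, n - 1):
                update(a, i, j)
    return a


def relax_skewed_1d(a, itmax, b1):
    """Same sweeps with loop J skewed by T and then tiled with width b1."""
    n = len(a)
    # jj walks the skewed coordinate j + t in steps of the tile width b1.
    for jj in range(1, n - 2 + itmax + 1, b1):
        for t in range(1, itmax + 1):
            lo = max(jj - t, 1)
            hi = min(jj - t + b1 - 1, n - 2)
            for j in range(lo, hi + 1):
                for i in range(1, n - 1):
                    update(a, i, j)
    return a


reference = relax_untiled(make_grid(9), 4)
print(relax_skewed_1d(make_grid(9), 4, 3) == reference)
```

Each tile traversal (one value of `jj`) slides one column to the left per T-step, which is why consecutive tiles overlap only partially.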
Performance of a tiled loop nest can vary dramatically with different tile sizes [7]. How to select
a proper tile size is hence an important issue. All previous publications on tile-size selection tacitly
assume non-skewed tiling [2, 4, 7, 10, 14, 22]. In particular, they assume that each tile repeatedly
accesses the same set of data. This is certainly not true for skewed tiling. Furthermore, they
consider only the L1, not the L2, cache misses.
Like the SOR code in Figure 1, many relaxation codes can be tiled with either 1-D or 2-D tiling.
If the L1 cache is the only target for locality enhancement, then 2-D tiling will invariably be chosen
over 1-D tiling, as is done in previous works. Our main claim in this paper is that the overall
performance can be improved by including the L2 cache in the cost model. Moreover, under such
a more comprehensive cost model, 1-D tiling may be chosen over 2-D tiling because it results in
fewer L2 cache misses. We support our claim by making the following analytical and experimental
contributions:
• We develop an execution cost model and utilize it in a tile-size selection algorithm called
STS [18]. Unlike previous algorithms, our model incorporates two performance factors,
namely cache misses (on both the L1 and L2 caches) and loop overhead, into a single
performance-estimation model. Moreover, when estimating the number of cache misses, the
effect of loop skewing is taken into account. The loop overhead is incorporated to avoid small
tile heights (B2 in Figure 1(c)), for which loop overhead could be significant. The choice
between 1-D and 2-D tiling and the choice of tile size are determined by a comparison between
the different execution costs.
• We adopt an inter-array padding [11] technique to reduce L2 cache misses due to interferences,
thus enhancing the L2 cache locality.
• We report experimental results of three relaxation kernels which can be tiled at two loop
levels. We compare our tile-size selection algorithm, STS, with previous algorithms. We
measure the execution time, as well as the L1 and L2 cache misses, on two machines. Our
results show that, for two programs, where the L1 cache locality can be exploited in most
cases, STS achieves comparable results. For one program, where the L2 cache locality is
exploited in most cases, STS is significantly better.
• We also report experimental results of two SPEC benchmarks which can be transformed at
one loop level only. For these two programs, some competing tile-size selection algorithms
no longer apply because they always generate square tile sizes. Compared to the remaining
applicable algorithms, STS is superior because it utilizes inter-array padding to improve the
performance on the L2 cache. We compare our inter-array padding algorithm with another
padding algorithm, GroupPad [13], which targets the L1 cache. We conclude that our
inter-array padding is more appropriate for STS. We also show that the results obtained by
STS are within 5% of the optimal (among a range of tile sizes) except in one test case where
STS underperforms the optimal by 13%.
This paper is a heavily-revised version of our previous technical report [17] in the following
aspects:
• Unlike [17], we drop the TLB consideration in STS. We justify our decision in Sections 6.4
and 6.6.













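(a) Original loop nest [listing not recoverable from the source]

DO JJ = 2, N - 1 + ITMAX, B1
  DO T = 1, ITMAX
    DO J = max(JJ - T, 2), min(JJ - T + B1 - 1, N - 1)
      DO I = 2, N - 1
        A(I, J) = A(I, J) + A(I + 1, J) ... [rest of statement and END DO lines lost]
(b) After skewing and "1-D" tiling

DO JJ = 2, N - 1 + ITMAX, B1
  DO II = 2, N - 1 + ITMAX, B2
    ... [loop headers lost]
          ... min(II - T + B2 - 1, N - 1)
            A(I, J) = A(I, J) + A(I + 1, J) + A(I - 1, J) ... [rest of statement and END DO lines lost]
(c) After skewing and "2-D" tiling

Figure 1: An example of tiling: SOR relaxation.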
• Unlike [17], we drop the consideration of software pipelining effects in STS because (1) the
assumption we made in [17] that each load instruction takes one cycle to execute in a software-
pipelined loop is not valid in most cases, and (2) software pipelining plays only a marginal
role in tile-size selection, i.e., without considering it, the performance does not change much.
We justify our decision in Section 6.4.
• Because of the above algorithm-level changes, our experimental results have changed substantially. In
this paper, we further evaluate the impact of loop overhead and give a deeper discussion.
In the rest of the paper, we first compare with previous work in Section 2. We present
background material in Section 3. We then present our memory cost model in Section 4. We model
the execution time and present our tile-size selection algorithm in Section 5. In Section 6, we
experimentally compare our algorithm with previous algorithms. Finally, we conclude in Section 7.
2 Previous Work
2.1 Competing Tile-Size Selection Schemes
Several tile-size selection algorithms have been proposed before STS, which are listed below.
• TLI by Chame and Moon [1] uses an execution cost model which includes the capacity misses
and the cross-interference misses. TLI enumerates a range of tile sizes which are free of
self-interference misses and selects the one with the smallest execution cost as the optimal
tile size.
• TSS by Coleman and McKinley [2] uses a GCD algorithm to select a number of candidate
tile sizes. Among such candidates, the one resulting in the largest array footprint is chosen
as the optimal. Rivera and Tseng [14] present a variation of TSS.
• LRW by Lam et al. [7] chooses the largest square tile size which is free of self-interference
misses.
• DAT by Panda et al. [11] chooses the maximum square tile size which is free of capacity
misses. It then applies inter-array padding to minimize cross-interference misses.
When computing the tile size, these algorithms consider the L1, but not the L2, cache misses.
They also ignore the effect of loop skewing even if skewed tiling is applied. By assuming non-skewed
tiling, previous algorithms can be applied to more general loop structures and array reference
patterns than STS. On the other hand, STS considers the effect of skewed tiling when considering
the L1 cache misses. Moreover, STS counts L2 cache misses, as well as the loop overhead, such
that a decision on loop levels for tiling can be correctly made.

Table 1: Comparison between various tile-size selection algorithms
                  LRW    TSS    TLI    STS    DAT
Loop skewing      No     No     No     Yes    No
L1 data cache     Yes    Yes    Yes    Yes    Yes
L2 cache          No     No     No     Yes    No
Dominant array    Yes    Yes    Yes    No     No
Padding           No     No     No     Yes    Yes
Loop overhead     No     No     No     Yes    No
Tile dimensions   2      1,2    1,2    1,2    2
Tile shape        squ.   rect.  rect.  rect.  squ.

Of all arrays in a tiled loop, LRW, TSS and TLI identify a dominant array and choose the
tile size to eliminate certain kinds of cache misses due to the dominant array. All other arrays
are ignored. In contrast, both DAT and STS consider all arrays. Among all tile-size selection
techniques, only DAT and STS utilize padding, a data transformation technique to eliminate certain
interference misses [13]. All tile-size selection algorithms discussed in this paper can be applied
to loop nests which can be tiled at two loop levels. In other words, they can be applied to two-
dimensional tiles. LRW and DAT require a square tile shape. In contrast, STS, TSS and TLI
permit rectangular shapes. For loop nests which can be tiled at one loop level only, LRW and
DAT no longer apply. However, STS, TSS and TLI still apply in such cases because, as far as
tile-size selection is concerned, a one-dimensional tile can be viewed as a degenerate case of a
two-dimensional rectangular tile, where the tile height equals the trip count of the inner loop.
Table 1 summarizes the characteristics of the various algorithms discussed here, where "squ." stands
for "square" and "rect." for "rectangular".
2.2 Other Previous Work
Rivera and Tseng present several intra-array padding algorithms, which increase the array column
sizes, to eliminate cache conflict misses [13, 14]. Their algorithms, however, cannot be applied
after LRW, TLI and TSS are applied because the array column size is used in these three tile-size
selection algorithms. Rivera and Tseng do not give an algorithm to determine the intra-array
padding size before tiling [14]. In [13], Rivera and Tseng also present an inter-array padding
algorithm, GroupPad, to exploit group reuses across the outer loops. Their padding algorithm
targets the L1 cache. We will compare our padding algorithm with theirs for the tiled codes in
Section 6.5.
Ghosh et al. estimate cache misses, given a tile size, for a perfect loop nest [4]. They also
informally discuss a tile-size selection scheme using matrix multiplication as the example. No formal
algorithm is presented, however. They do not discuss the estimation of cache misses for imperfectly-
nested loops. Therefore, we are not able to compare with their method in our experiments.
Ferrante et al. present an algorithm to estimate the number of distinct cache lines over a perfect
loop nest [3]. Temam et al. derive an analytical method to estimate the number of self-interference
misses [19]. McKinley et al. present a simple cost model to estimate the number of cache misses [9].
These methods do not consider the effect of loop skewing.
Manjikian and Abdelrahman use cache partitioning to scatter arrays evenly in the cache, such
that cross-interference misses are minimized [8]. We use a different padding scheme which seems
more suitable for our algorithm.
3 Background
In this section, we first define our program model and a few key parameters. We then discuss the
issues of the memory hierarchy.
3.1 Skewed Tiling
Most previous research on tiling addresses perfectly-nested loops only [6, 20, 21, 23]. After
tiling, the loops remain perfectly nested. In our recent work [15, 16], we perform tiling on a class
of imperfectly-nested loops. Figure 2(a) shows a representative loop nest before tiling, where the
T-loop body consists of $m$ perfectly-nested loops. The depth of each perfectly-nested inner loop is
at least two. The loop bounds $L_{ij}$ and $U_{ij}$, $1 \le i \le m$, $j = 1, 2$, are T-invariant. We assume that the
iteration space determined by J and I remains unchanged over different T-loop index values. For
simplicity of presentation, we also assume that cache-line spatial locality is already fully exploited
in the innermost loops except on the loop boundaries. Figure 2(b) shows the code after tiling the
$J_i$ loops only (1-D tiling), and Figure 2(c) shows the code after tiling both the $J_i$ and $I_i$ loops (2-D
tiling). In Figures 2(b) and 2(c), the iteration subspace defined by all $J_i$ and $I_i$ loops is called a tile.
Loop T is called the tile-sweeping loop, and loops JJ and II are called the tile-controlling loops [20].
Each combination of JJ and II defines a tile traversal. Two tiles are said to be consecutive within
a tile traversal if the difference of the corresponding T values equals 1.
Let $\gamma_1 = \min\{L_{i1} \mid 1 \le i \le m\}$, $\gamma_2 = \max\{U_{i1} \mid 1 \le i \le m\}$, $\eta_1 = \min\{L_{i2} \mid 1 \le i \le m\}$ and
$\eta_2 = \max\{U_{i2} \mid 1 \le i \le m\}$. We call $S_1$ and $S_2$ the skewing factors corresponding to the $J_i$ and $I_i$ loops
respectively. (The skewing factors are also called the slope in our previous work [15, 16].) If $S_1 = 0$,
then loop skewing is not applied before tiling at the $J_i$ level. In this paper, we are interested only
in skewed tiling at least at the $J_i$ level, thus $S_1 > 0$. $B_1$ is called the tile width and $B_2$ is called the
tile height. $B_1$ and $B_2$ are collectively called the tile size. These parameters are used to define the
bounds of the tile-controlling loops. For reference, Table 2 lists all the symbols used in this paper
and their brief descriptions.
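As a small illustration of how these parameters fit together, the following sketch derives the bounding values and the trip counts of the tile-controlling loops JJ and II, using the loop headers of Figure 2(c). The function name `tiling_parameters`, the tuple layout of `bounds`, and the sample numbers are assumptions made for this example; the widths $\gamma = \gamma_2 - \gamma_1 + 1$ and $\eta = \eta_2 - \eta_1 + 1$ follow the entries listed in Table 2.

```python
def tiling_parameters(bounds, s1, s2, itmax, b1, b2):
    """Derive the tiling parameters from the inner-loop bounds.

    bounds holds one tuple (L_i1, U_i1, L_i2, U_i2) per perfectly-nested
    inner loop: the J_i bounds first, then the I_i bounds.
    """
    gamma1 = min(b[0] for b in bounds)
    gamma2 = max(b[1] for b in bounds)
    eta1 = min(b[2] for b in bounds)
    eta2 = max(b[3] for b in bounds)
    # Trip counts of the tile-controlling loops:
    #   DO JJ = gamma1, gamma2 + S1*(ITMAX-1), B1
    #   DO II = eta1,   eta2   + S2*(ITMAX-1), B2
    jj_trips = len(range(gamma1, gamma2 + s1 * (itmax - 1) + 1, b1))
    ii_trips = len(range(eta1, eta2 + s2 * (itmax - 1) + 1, b2))
    return {
        "gamma": gamma2 - gamma1 + 1,
        "eta": eta2 - eta1 + 1,
        "jj_trips": jj_trips,
        "ii_trips": ii_trips,
    }


# Two hypothetical inner loop nests with slightly different bounds.
params = tiling_parameters([(2, 99, 2, 99), (3, 100, 2, 100)],
                           s1=1, s2=1, itmax=10, b1=20, b2=25)
print(params)
```

The number of tile traversals is the product of the two trip counts, since each combination of JJ and II defines one traversal.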
In this paper, we assume the data dependences permit both 1-D and 2-D tiling. Choosing
between 1-D and 2-D tiling will depend on the estimate of cache misses and loop overhead. For
simplicity, we assume all arrays are of two dimensions with the same column sizes. (We assume
column-major storage.) Lower-dimensional variables can be ignored due to their lesser impact on
cache misses in the relaxation programs we are interested in. Let $n_a$ be the number of two-
dimensional arrays in the given tiled loop nest. Within the innermost loop $I_i$, $1 \le i \le m$, of the
untiled program in Figure 2(a), we assume array subscript patterns of $A_k(I_i + a, J_i + b)$, $1 \le k \le n_a$,
where $a$ and $b$ are known integer constants.
Although we restrict our program model to either 1-D or 2-D tiling and restrict all arrays to
two dimensions, such restrictions can be relaxed. We can have n-D tiling by extending our
tiling technique in [16]. We can also allow high-dimensional arrays by extending our memory cost
model and execution cost model. However, such extensions to high-dimensional tiling and arrays
seem unnecessary for the applications we have encountered so far.
DO T = 1, ITMAX
  DO J1 = L11, U11
    DO I1 = L12, U12
      ...
    END DO
  END DO
  ...
  DO Jm = Lm1, Um1
    ... [remaining lines lost]
(a) Before tiling

DO JJ = γ1, γ2 + S1 * (ITMAX-1), B1
  DO T = f1(JJ), g1(JJ)
    DO J1 = L'11, U'11
      DO I1 = L12, U12
        ...
      END DO
    END DO
    ...
    DO Jm = L'm1, U'm1
      ... [remaining lines lost]
(b) After 1-D tiling

DO JJ = γ1, γ2 + S1 * (ITMAX-1), B1
  DO II = η1, η2 + S2 * (ITMAX-1), B2
    DO T = f2(JJ, II), g2(JJ, II)
      DO J1 = L'11, U'11
        DO I1 = L'12, U'12
          ...
        END DO
      END DO
      ...
      DO Jm = L'm1, U'm1
        ... [remaining lines lost]
(c) After 2-D tiling

Figure 2: The program model before and after tiling
Table 2: Description of symbols
Symbol   Description
γ1       The minimum lower bound of all J_i loops
γ2       The maximum upper bound of all J_i loops
η1       The minimum lower bound of all I_i loops
η2       The maximum upper bound of all I_i loops
S1       The skewing factor for J_i loops
S2       The skewing factor for I_i loops
B1       The tile width
B2       The tile height
n_a      The number of arrays in the given loop nest
N        The array column size
γ        γ2 - γ1 + 1
η        η2 - η1 + 1
C_s1     The L1 cache size in the number of data elements
C_b1     The L1 cache line size in the number of data elements
C_a1     The L1 cache set associativity
C_s2     The L2 cache size in the number of data elements
C_a2     The L2 cache set associativity
C_b2     The L2 cache line size in the number of data elements
ITMAX    The trip count for the tile-sweeping loop T
τ        Defined in Section 4
p1       The L1 cache miss penalty
p2       The L2 cache miss penalty
n1       The sum of the static number of instructions for the computation of all the I_i loop bounds
n2       The sum of the static number of instructions for the computation of all the J_i loop bounds
n3       The sum of the static number of instructions in the I_i loop bodies
S0       The iteration space defined by γ1 ≤ J_i ≤ γ2 and η1 ≤ I_i ≤ η2 in Figure 2(a)
W        The working-set size of the loop nest in Figure 2(a) in the number of data elements
g        The number of group reuses with a reuse distance greater than C_b1
α        The number of group reuses which generate L1 cache misses
3.2 Memory Hierarchy
The memory hierarchy includes registers, cache memories at one or more levels, the main memory
and the secondary storage [5].
For simplicity of presentation, we consider two levels of caches in this paper, namely the L1
and L2 caches, which are common in current practice. The L1 cache has several parameters,
namely the cache size $C_{s1}$, the cache block size $C_{b1}$ and the set associativity $C_{a1}$. $C_{s1}$ and $C_{b1}$
are measured in the number of data elements. Similarly for the L2 cache, the cache size, cache block
size and set associativity are $C_{s2}$, $C_{b2}$ and $C_{a2}$ respectively. The cache misses can be divided into
three classes [5]: compulsory misses, capacity misses and conflict misses. Conflict misses can be
attributed to self-interference misses of the same array and to cross-interference misses between
different arrays.
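To make the notion of cross-interference concrete, the sketch below maps element addresses to cache sets using the parameters just introduced (cache size and block size in data elements, plus associativity). It assumes the usual modulo placement of lines into sets; the function names `cache_set` and `share_a_set` and the cache dimensions are hypothetical, chosen only for this example.

```python
def cache_set(addr, cs, cb, ca):
    """Return the cache set that element address `addr` maps to.

    cs and cb are the cache size and block size in data elements, ca is the
    set associativity. Assumes cs is a multiple of cb * ca and that lines
    are placed into sets by simple modulo indexing.
    """
    num_sets = cs // (cb * ca)
    return (addr // cb) % num_sets


def share_a_set(base_x, base_y, offset, cs, cb, ca):
    """True if element `offset` of two arrays falls into the same cache set."""
    return (cache_set(base_x + offset, cs, cb, ca)
            == cache_set(base_y + offset, cs, cb, ca))


# Hypothetical direct-mapped cache: 4096 elements, 4 elements per block.
CS, CB, CA = 4096, 4, 1
# Two arrays laid out exactly one cache size apart collide element by element.
print(share_a_set(0, CS, 10, CS, CB, CA))
# Shifting the second array by one block, as inter-array padding would, avoids it.
print(share_a_set(0, CS + CB, 10, CS, CB, CA))
```

In a direct-mapped cache, two references that share a set evict each other on every alternation, which is the cross-interference pattern that inter-array padding is meant to break.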
4 A Memory Cost Model
In this section, we want to estimate the number of cache misses incurred by executing the loop nest



















Figure 3: Illustration of tile traversal
Let $S_0$ represent the iteration space defined by $\gamma_1 \le J_i \le \gamma_2$ and $\eta_1 \le I_i \le \eta_2$ in Figure 2(a). (For
simplicity, we also regard $S_0$ as the original iteration space defined by the $J_i$ and $I_i$ loops in Figure 2(a),
as if all $J_i$ loops have the same loop bounds and all $I_i$ loops have the same loop bounds.) $S_0$ is
illustrated in Figure 3(a) by the rectangle enclosed by the solid lines with the height $\eta$ and the
width $\gamma$. Within each tile traversal, we define the base tile to be a tile with T = 1 and an advanced
tile to be a tile with T > 1. The dashed lines in Figure 3(a) separate the base tiles of different
tile traversals. The two shaded areas illustrate two different tile traversals, tt1 and tt2, where each
shaded rectangle with solid-line boundaries represents an advanced tile. When the tile-sweeping
loop T increases the index by 1, the tiles can only overlap partially.
The cache misses incurred by one tile traversal can be partitioned into those within the base tile
and those within the advanced tiles. Note that only those base tiles and advanced tiles overlapping
with $S_0$ will be executed, thus only they can contribute to the cache misses. In Figure 3(a), the
base tile in the tile traversal tt1 resides outside $S_0$, while the base tile in tt2 resides within $S_0$.
We make the following two assumptions in our estimation of the number of cache misses:
• Assumption 1: There exists no cache reuse between different tile traversals.
• Assumption 2: $B_1 \ll \gamma$ and $B_1 \ll (\mathrm{ITMAX}-1) * S_1$.
Assumption 1 is reasonable if ITMAX is large, since it will be very likely for a tile traversal to
overwrite cache lines whose old data could have been reused in the next tile traversal. Assumption 2
is reasonable because a large $B_1$ can easily cause an overflow in the target cache, or result in a small
$B_2$ which may generate a large loop overhead. As explained later in Section 5.1, STS incorporates
loop overhead in execution cost estimation to prevent a very small $B_2$. If the tile size ($B_1$, $B_2$) is
chosen properly, there should be exactly one cache miss for each cache line accessed within a tile
traversal. To be more specific, the following two properties should hold:
• Property 1: No capacity and self-interference misses are generated within a tile traversal.
• Property 2: No cross-interference misses are generated within a tile traversal.
In Section 5.2, we shall discuss how to preserve the above properties. For now, we assume they
hold.
We estimate cache misses with the following four options:
• Option 1: 2-D tiling with the L1 cache targeted. The tile shape is illustrated in Figure 2(c).
The footprint of each tile fits in both the L1 cache and the L2 cache. Inter-array padding is
applied to satisfy Property 2 on the L1 cache. For this option, both Properties 1 and 2 are
satisfied on both the L1 cache and the L2 cache.
• Option 2: 1-D tiling with the L1 cache targeted. This is similar to Option 1, except for the
different tile shape (see Figure 2(b)). Both Properties 1 and 2 are satisfied on both the L1
and the L2 caches.
• Option 3: 2-D tiling with the L2 cache targeted. The tile shape is illustrated in Figure 2(c).
The footprint of each tile fits in the L2 cache, but not necessarily in the L1 cache. Inter-array
padding is applied to satisfy Property 2 on the L2 cache. For this option, both Properties 1
and 2 are satisfied on the L2 cache, but not necessarily on the L1 cache.
• Option 4: 1-D tiling with the L2 cache targeted. This is similar to Option 3, except for the
different tile shape (see Figure 2(b)). Both Properties 1 and 2 are satisfied on the L2 cache,
but not necessarily on the L1 cache.
4.1 Estimation of Cache Misses for Option 1
We first show how to compute the number of L1 cache misses caused by an advanced tile. Let $W$
represent the size of the data set accessed by the original loop nest in terms of the number of data
elements. The average size of the data accessed by one tile is estimated to be $D = \frac{W}{\gamma\eta} * B_1 B_2$.
Figure 3(b) shows two consecutive tiles, tt3 and tt4, within a tile traversal, assuming that both
tiles reside within $S_0$. The iteration subspace of tt4 is produced by shifting the iteration subspace
of tt3 upwards by $S_2$ iterations and to the left by $S_1$ iterations. The L1 cache misses in tt4
occur either in Region ABCD or in Region DEFG. The total estimated L1 cache misses equal
$(S_1 B_2 + S_2 B_1 - S_1 S_2) * \frac{W}{\gamma\eta C_{b1}}$. (This estimate may not be exact because data accessed at the lower
border of Region DEFG may or may not be in the cache already.)
We now show how to accumulate the number of L1 cache misses for all the tile traversals with the
same JJ value. Figure 3(c) illustrates the idea. For a particular JJ value, let $t_1$, $t_2$, $t_3$ and $t_4$ be the
base tiles of four tile traversals, and let $t'_1$, $t'_2$, $t'_3$ and $t'_4$ be the corresponding advanced tiles when
T increases by 1. In this particular illustration, the number of L1 cache misses caused collectively
by $t_i$ ($1 \le i \le 4$) equals the sum of the number of L1 cache misses caused by each individual
$t_i$, that is, $\frac{W B_1}{\gamma C_{b1}}$. Note that only the tiles overlapping with $S_0$ can contribute to L1 cache misses.
Similarly, the number of L1 cache misses caused by the advanced tiles $t'_i$ ($1 \le i \le 4$) equals the
sum of the number of L1 cache misses caused by each individual $t'_i$, that is,
$\frac{S_1 W}{\gamma C_{b1}} + 2 (B_1 - S_1) S_2 * \frac{W}{\gamma\eta C_{b1}}$.
In general, the number of L1 cache misses caused by the advanced tiles with the same JJ value
equals $\frac{S_1 W}{\gamma C_{b1}} + \tau (B_1 - S_1) S_2 * \frac{W}{\gamma\eta C_{b1}}$, where $\tau$ is the number of base tiles in $S_0$ for a particular JJ,
estimated by a two-case formula whose values are missing from the source: one value applies if
$1 \le B_2 < \eta + S_2 * (\mathrm{ITMAX}-1)$, and the other if $B_2 = \eta + S_2 * (\mathrm{ITMAX}-1)$.
The value $\eta + S_2 * (\mathrm{ITMAX}-1)$ is the maximum height of the iteration space after tiling. Any $B_2$
value greater than or equal to $\eta + S_2 * (\mathrm{ITMAX}-1)$ results in no tiling at the $I_i$ loop level.
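With Assumptions 1 and 2, we can then accumulate L1 cache misses corresponding to different
JJ values by considering three different cases:
• Case 1: $\gamma = (\mathrm{ITMAX}-1) * S_1$.
This case is illustrated by Figure 4(a). In this case, the tile traversals defined by $JJ \le
\gamma_1 + \gamma - \mathrm{NSTEP}$ will not execute to the ITMAXth T-iteration. The tile traversal defined
by $\gamma_1 + \gamma - \mathrm{NSTEP} < JJ \le \gamma_1 + \gamma$ is the first to reach the ITMAXth T-iteration. The tile
traversals defined by $JJ > \gamma_1 + \gamma$ will start executing at T > 1. During the execution, the tile
traversals defined by $JJ = \gamma_1$ will incur L1 cache misses of $\frac{W B_1}{\gamma C_{b1}}$. The tile traversals defined
by $JJ = \gamma_1 + B_1$ will incur L1 cache misses of $\frac{W B_1}{\gamma C_{b1}} * 2 + \tau (B_1 - S_1) S_2 * \lceil \frac{B_1}{S_1} \rceil * \frac{W}{\gamma\eta C_{b1}}$. Hence,
we have the following:
- The L1 cache misses in all the tile traversals defined by $JJ \le \gamma_2 - B_1$ amount to
$\frac{W B_1}{\gamma C_{b1}} * (1 + 2 + \ldots + \lceil \frac{\gamma}{B_1} \rceil) + \tau (B_1 - S_1) S_2 * \frac{B_1}{S_1} * \frac{W}{\gamma\eta C_{b1}} * (1 + 2 + \ldots + \lceil \frac{\gamma - B_1}{B_1} \rceil)$.
- The L1 cache misses in all the tile traversals defined by $\gamma_2 - B_1 < JJ \le \gamma_2$ amount to
$\frac{W B_1}{\gamma C_{b1}} * \lceil \frac{\gamma}{B_1} \rceil + \tau (B_1 - S_1) S_2 * \frac{B_1}{S_1} * \frac{W}{\gamma\eta C_{b1}} * \lceil \frac{\gamma}{B_1} \rceil$.
- The L1 cache misses in all the tile traversals defined by $\gamma_2 < JJ$ amount to
$\frac{W B_1}{\gamma C_{b1}} * (1 + 2 + \ldots + \lceil \frac{\gamma}{B_1} \rceil) + \tau (B_1 - S_1) S_2 * \frac{B_1}{S_1} * \frac{W}{\gamma\eta C_{b1}} * (1 + 2 + \ldots + \lceil \frac{\gamma - B_1}{B_1} \rceil)$.
Adding up the three numbers above, the total L1 cache misses in the tiled loop nest
approximate $\frac{W \gamma}{C_{b1} B_1} + \frac{W \gamma S_2 \tau}{C_{b1} S_1 \eta}$.
• Case 2: $\gamma < (\mathrm{ITMAX}-1) * S_1$.
This case is illustrated by Figure 4(b). Similar to the computation in Case 1, we have the
following:
- The L1 cache misses in all the tile traversals defined by $JJ \le \gamma_2$ amount to
$\frac{W B_1}{\gamma C_{b1}} * (1 + 2 + \ldots + \lceil \frac{\gamma}{B_1} \rceil) + \tau (B_1 - S_1) S_2 * \frac{B_1}{S_1} * \frac{W}{\gamma\eta C_{b1}} * (1 + 2 + \ldots + \lceil \frac{\gamma}{B_1} \rceil)$.
- The L1 cache misses in all the tile traversals defined by $\gamma_2 < JJ \le (\mathrm{ITMAX}-1) * S_1 + \gamma_1$
amount to $\frac{W B_1}{\gamma C_{b1}} * \lceil \frac{\gamma}{B_1} \rceil * \lceil \frac{(\mathrm{ITMAX}-1) * S_1 - \gamma}{B_1} \rceil + \tau (B_1 - S_1) S_2 * \frac{B_1}{S_1} * \frac{W}{\gamma\eta C_{b1}} * \lceil \frac{\gamma}{B_1} \rceil * \lceil \frac{(\mathrm{ITMAX}-1) * S_1 - \gamma}{B_1} \rceil$.
- The L1 cache misses in all the tile traversals defined by $(\mathrm{ITMAX}-1) * S_1 + \gamma_1 < JJ$
amount to $\frac{W B_1}{\gamma C_{b1}} * (1 + 2 + \ldots + \lceil \frac{\gamma - B_1}{B_1} \rceil) + \tau (B_1 - S_1) S_2 * \frac{B_1}{S_1} * \frac{W}{\gamma\eta C_{b1}} * (1 + 2 + \ldots + \lceil \frac{\gamma - B_1}{B_1} \rceil)$.
Adding up the three numbers above, the total L1 cache misses in the tiled loop nest
approximate $\frac{W S_1 (\mathrm{ITMAX}-1)}{C_{b1} B_1} + \frac{W S_2 (\mathrm{ITMAX}-1) \tau}{\eta C_{b1}}$.
• Case 3: $\gamma > (\mathrm{ITMAX}-1) * S_1$.
Figure 4: Calculating cache misses under different scenarios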




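Combining the above three cases and plugging in the estimate of $\tau$, the total number of L1 cache
misses is approximately
$$\frac{W S_1 (\mathrm{ITMAX}-1)}{C_{b1} B_1} + \frac{W S_2 (\mathrm{ITMAX}-1)}{C_{b1} B_2} \qquad (1)$$
Similarly, with Properties 1 and 2 standing, the number of L2 cache misses for 2-D tiling is
approximately
$$\frac{W S_1 (\mathrm{ITMAX}-1)}{C_{b2} B_1} + \frac{W S_2 (\mathrm{ITMAX}-1)}{C_{b2} B_2} \qquad (2)$$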
4.2 Estimation of Cache Misses for Option 2
Similar to the derivation in Section 4.1, for 1-D tiling with the L1 cache targeted, the total number
of L1 cache misses is approximately
$$\frac{W S_1 (\mathrm{ITMAX}-1)}{C_{b1} B_1} \qquad (3)$$
The total number of cache misses for the L2 cache is approximately
$$\frac{W S_1 (\mathrm{ITMAX}-1)}{C_{b2} B_1} \qquad (4)$$
4.3 Estimation of Cache Misses for Option 3
Similar to the derivation in Section 4.1, for 2-D tiling with the L2 cache targeted, the total number
of cache misses for the L2 cache is approximately
$$\frac{W S_1 (\mathrm{ITMAX}-1)}{C_{b2} B_1} + \frac{W S_2 (\mathrm{ITMAX}-1)}{C_{b2} B_2} \qquad (5)$$
To estimate the number of L1 cache misses, we need to consider both the temporal reuses across
different tiles and the group reuses [21] within the same tile. With the L2 cache targeted, the L1
cache temporal locality may not be exploited between different tiles. The locality due to certain
group reuses, however, may be exploited on the L1 cache. For example in Figure 1(c), there exist
group reuses between array references A(I + 1, J) and A(I, J), between A(I, J - 1) and A(I, J),
etc. All the group reuses with a reuse distance [21] less than the L1 cache block size are very likely
to be exploited due to spatial locality. (The reuse between A(I, J) and A(I + 1, J) is an example.)
Therefore, we assume that all such group reuses will result in L1 cache hits.
For group reuses with a distance greater than the L1 cache block size, we need to separate two
cases. If $n_a = 1$, these group reuses may still result in cache hits since no cross-interference misses
occur. For example, the group reuse between A(I, J - 1) and A(I, J) in Figure 1(c) is very likely
to generate an L1 cache hit for array reference A(I, J). If $n_a > 1$, some of these group reuses may not
generate cache hits because of cross-interference misses. Note that inter-array padding is applied
to exploit temporal locality for the L2 cache across different tiles. The padding sizes computed by
inter-array padding may not guarantee the L1 cache locality due to those group reuses.
Suppose $n_a > 1$. Let $g$ be the number of group reuses with a distance greater than the L1 cache
block size. If we assume all $g$ group reuses result in cache hits, the total number of cache misses
for the L1 cache will be
$$\mathrm{ITMAX} * \frac{W}{C_{b1}} \qquad (6)$$
If we assume all $g$ group reuses result in cache misses, the total number of cache misses for the L1
cache will be
$$(1 + \frac{g}{n_a}) * \mathrm{ITMAX} * \frac{W}{C_{b1}} \qquad (7)$$
In this paper, if $n_a > 1$, we choose to assume that half of the $g$ group reuses generate L1 cache
misses. Let $\alpha$ be the number of group reuses which will result in cache misses. Based on the above
discussion, we have the following:
$$\alpha = \begin{cases} 0 & \text{if } n_a = 1 \\ 0.5 * g & \text{otherwise} \end{cases}$$
The number of L1 cache misses can hence be estimated as
$$(1 + \frac{\alpha}{n_a}) * \mathrm{ITMAX} * \frac{W}{C_{b1}} \qquad (8)$$
4.4 Estimation of Cache Misses for Option 4
Following the derivations above, for 1-D tiling with the L2 cache targeted, the total
number of cache misses for the L1 cache is approximately
$$(1 + \frac{\alpha}{n_a}) * \mathrm{ITMAX} * \frac{W}{C_{b1}} \qquad (9)$$





In this section, we first present an execution cost model for tiling with a given tile size, based on both
the number of cache misses and the loop overhead. We then present our tile-size selection algorithm,
STS. Furthermore, we develop an execution cost model for GroupPad [13] and theoretically compare
it with inter-array padding. Finally, we give a running example to go through our algorithm.
5.1 An Execution Cost Model for Tiling
Loop tiling introduces loop overhead. To decide between 1-D tiling and 2-D tiling, the overhead of
the tiled I_i loops in Figure 2(c) needs to be measured. Let n1 be the sum of the static number of
instructions for the computation of all the I_i loop bounds (1 ≤ i ≤ m). The I_i loop overhead due
to 2-D tiling, in terms of the dynamic count of instructions, is measured approximately by
$$n_1 \cdot \mathrm{ITMAX} \cdot \gamma \cdot \frac{\eta}{B_2}. \tag{11}$$
Let n2 be the sum of the static number of instructions for the computation of all the J_i loop
bounds (1 ≤ i ≤ m). The loop overhead due to tiled J_i loops can be measured by
$$n_2 \cdot \mathrm{ITMAX} \cdot \frac{\gamma}{B_1}. \tag{12}$$
Let n3 be the sum of the static number of instructions in the I_i (1 ≤ i ≤ m) loop bodies. The
dynamic instruction count for the I_i loop bodies is
$$n_3 \cdot \mathrm{ITMAX} \cdot \gamma \cdot \eta. \tag{13}$$
From (11) and (13), the ratio of loop overhead to loop-body work is n1 / (n3 · B2). Hence, if n1 and
n3 are approximately equal, a small B2 will introduce large loop overhead.
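The loop-overhead terms can be tabulated with a short Python sketch. It assumes the reading of (11) to (13) in which γ and η are the iteration counts of the J and I loops; the function names are hypothetical and only illustrate how the overhead scales with the tile sizes.

```python
def i_loop_overhead(n1: float, ITMAX: int, gamma: int, eta: int, B2: int) -> float:
    """Dynamic instructions spent on I_i loop bounds under 2-D tiling, as in (11)."""
    return n1 * ITMAX * gamma * eta / B2


def j_loop_overhead(n2: float, ITMAX: int, gamma: int, B1: int) -> float:
    """Dynamic instructions spent on tiled J_i loop bounds, as in (12)."""
    return n2 * ITMAX * gamma / B1


def i_body_count(n3: float, ITMAX: int, gamma: int, eta: int) -> float:
    """Dynamic instruction count of the I_i loop bodies, as in (13)."""
    return n3 * ITMAX * gamma * eta


def overhead_to_body_ratio(n1: float, n3: float, B2: int) -> float:
    """Ratio of (11) to (13); it grows as the tile height B2 shrinks."""
    return n1 / (n3 * B2)
```

For example, with n1 = n3 the ratio is simply 1/B2, so a tile height of 4 spends about a quarter as many instructions on loop bounds as on the loop bodies.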
Similar to the classification for estimating cache misses, we have four different cases for execution
cost estimation. By summing up the cache misses and the loop overhead in Section 4 and in the
above respectively, we can model the execution cost as follows:






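• Execution Cost Estimation for Option 2:
$$p_1 \cdot \frac{W S_1 (\mathrm{ITMAX}-1)}{C_{b1} B_1} + p_2 \cdot \frac{W S_1 (\mathrm{ITMAX}-1)}{C_{b2} B_1} + n_2 \cdot \frac{\mathrm{ITMAX} \cdot \gamma}{B_1}. \tag{15}$$
• Execution Cost Estimation for Option 3:
$$p_1 \cdot \left(1 + \frac{\alpha}{n_a}\right) \cdot \left(\mathrm{ITMAX} \cdot \frac{W}{C_{b1}}\right) + p_2 \cdot \left(\frac{W S_1 (\mathrm{ITMAX}-1)}{C_{b2} B_1} + \frac{W S_2 (\mathrm{ITMAX}-1)}{C_{b2} B_2}\right) + n_1 \cdot \frac{\mathrm{ITMAX} \cdot \gamma \cdot \eta}{B_2} + n_2 \cdot \frac{\mathrm{ITMAX} \cdot \gamma}{B_1}. \tag{16}$$
• Execution Cost Estimation for Option 4:
$$p_1 \cdot \left(1 + \frac{\alpha}{n_a}\right) \cdot \left(\mathrm{ITMAX} \cdot \frac{W}{C_{b1}}\right) + p_2 \cdot \frac{W S_1 (\mathrm{ITMAX}-1)}{C_{b2} B_1} + n_2 \cdot \frac{\mathrm{ITMAX} \cdot \gamma}{B_1}. \tag{17}$$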
Note that B2 and S2 do not appear in formulas (15) and (17). For 1-D tiling, targeting the L1
cache produces a smaller B1 than targeting the L2 cache. From the above formulas, 1-D tiling
is preferable to 2-D tiling under two circumstances: (1) the skewing factor S2 is so large that the
tile height B2 in 2-D tiling must be maximized in order to reduce the cache misses; and (2) n1 · η is so
large that B2 must be maximized in order to reduce the loop overhead. In either case, it is simply
preferable that 2-D tiling degenerates to 1-D tiling.
Other than the above observation, the optimal tile size is not immediately clear from the
formulas, due to the constraints of Properties 1 and 2. Next, we develop our STS algorithm.
5.2 Tile-Size Selection Algorithm
In this section, we first discuss how to preserve Properties 1 and 2. We then present our tile-size
selection algorithm.
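Procedure EnumFPSize(C_s, C_b, N)
  for F2 ← 1 to N do
    F1 ← 1
    t ← (F1 * N) mod C_s
    while ((F2 + C_b − 1) ≤ t ≤ (C_s − F2 − C_b + 1)) do
      Record (F1, F2)
      F1 ← F1 + 1
      t ← (F1 * N) mod C_s
    end while
  end for
(a)
[Panels (b) and (c): two consecutive tiles t1 and t2 and the cache segments D1 to D4 they occupy; drawings not recoverable]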
Figure 5: Procedure EnumFPSize and an illustration of utilizing portions of the cache by a single
tile
5.2.1 Preserving Property 1
First, we discuss how to eliminate self-interference misses within a single tile. For any array A_i, let
R be the minimum rectangular array region which contains all the A_i elements referenced within
a tile t. We say that A_i's footprint size within tile t is (F1, F2), where F1 and F2 are the numbers
of columns and rows in R respectively. We call F1 (F2) the array footprint width (height) for A_i
within tile t. Conversely, given a footprint size of A_i, the tile size can also be computed. Given
the subscript patterns and the loop bounds, such a computation is straightforward and we omit
the details. For the example of SOR (Figure 1(c)), assuming the array footprint size for A to be
(κ1, κ2), the loop tile size should be (κ1 − 2, κ2 − 2). For array A_i, if the footprint height F2 is greater
than the distance between the locations of two columns in the cache, then the columns accessed
within the tile will conflict in the cache, creating self-interference misses [1]. More precisely, we
have the following lemma:
Lemma 1 Given array footprint size (F1, F2) for any A_i (1 ≤ i ≤ n_a), a cache of size C_s, and cache
line size C_b, if there exist no self-interference misses, then the distance between the starting cache
locations of any two columns of A_i within F1 consecutive columns is no smaller than F2 and
no greater than C_s − F2. Conversely, there exist no self-interference misses if the distance between
the starting cache locations of any two columns of A_i within F1 consecutive columns is no
smaller than F2 + C_b − 1 and no greater than C_s − F2 − C_b + 1.
Proof Obvious. □
Given a directly-mapped cache of size C_s and cache line size C_b, and given an array column size
N, procedure EnumFPSize in Figure 5(a) enumerates all the footprint sizes (F1, F2) which incur
no self-interference misses, according to Lemma 1. We say that a footprint size (F1, F2) of A_i is
maximal if increasing either F1 or F2 will introduce self-interference misses for A_i. In general, the
maximal footprint size for array A_i is not unique. According to EnumFPSize, the maximal footprint
sizes for all arrays are the same if they have the same array column sizes. Our tile-size selection
scheme will enumerate all array footprint sizes which are free of self-interference misses until the
sizes become maximal. The scheme estimates and compares the execution cost for different (F1, F2)
in order to get the optimal tile size.
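Procedure EnumFPSize translates almost line for line into Python. The sketch below is an illustration under two assumptions: the enumeration of F2 starts at 1, and the offset t is recomputed after each increment of F1, as in the corresponding loop of the tile-size selection algorithm. The function name is hypothetical.

```python
def enum_fp_sizes(C_s: int, C_b: int, N: int) -> list[tuple[int, int]]:
    """Enumerate footprint sizes (F1, F2) accepted by the sufficient
    condition of Lemma 1 for a cache of C_s elements, a line of C_b
    elements and an array column of N elements."""
    sizes = []
    for F2 in range(1, N + 1):
        F1 = 1
        t = (F1 * N) % C_s              # cache offset of column F1
        # Accept F1 while the column offset stays clear of the first column.
        while F2 + C_b - 1 <= t <= C_s - F2 - C_b + 1:
            sizes.append((F1, F2))
            F1 += 1
            t = (F1 * N) % C_s
    return sizes
```

The inner loop always terminates: the offset t eventually returns to 0, which lies below the lower bound F2 + C_b − 1 whenever F2 and C_b are at least 1.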




Figure 6: An illustration of padding to eliminate cross-interferences
We show that the parameter C_s in procedure EnumFPSize should not be the whole cache size.
Otherwise, self-interference misses will occur when the execution proceeds from one tile to the
next. For clarity, instead of arguing formally for the general cases, we illustrate the cases of 2-way
and fully-associative caches. Figure 5(b) shows two consecutive tiles t1 and t2. Suppose C_s equals
the whole cache size in procedure EnumFPSize and suppose the footprint size of t1 is maximal.
Tile t1 accesses the cache from the least-recently referenced data segment to the most-recently
referenced data segment in the memory, in the order of D1, D2, D3 and D4, which are separated by
solid lines. If the cache associativity is C_a = 2, then D2 and D4 will map to the same cache sets.
The data accessed in the blank rectangle A will replace segment D2. If the cache is fully associative,
D1 will be replaced. However, part of the old data in segment D2 (or D1) could have been reused
by tile t2. One solution to avoid the replacement of useful data is to reduce the footprint size
within t1 such that only a portion of the cache is used to compute the maximal footprint size in
EnumFPSize. Figure 5(c) shows the case for a two-way set-associative cache. In this way, the data
accessed in Regions A and C will replace the cache segment D2 and part of segment D1, whose old
data are not reused by t2. The reusable data in D3 will be kept in the cache. Using the above idea,
we let C_s = ((C_a − 1) / C_a) · C_s in procedure EnumFPSize, for 2-way and fully-associative caches. The cases
of other associativities are more complex, and they will not be discussed in this paper.
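The cache-size reduction can be written as a one-line helper. This is a sketch under the assumption that a directly-mapped cache (C_a = 1) keeps its full size at this step, since the text handles that case by shrinking the tile instead; the function name is hypothetical.

```python
def effective_cache_size(C_s: int, C_a: int) -> int:
    """Cache size passed to EnumFPSize.

    For C_a = 2 (and for a fully-associative cache, where C_a is the
    number of lines) only (C_a - 1) / C_a of the cache is used.
    Other associativities are not covered by the text.
    """
    if C_a == 1:
        return C_s          # assumption: direct-mapped caches are left unchanged here
    return (C_a - 1) * C_s // C_a
```

For the 2-way caches used later in the running example, a 4096-element L1 cache becomes 2048 elements and a 128 * 1024-element L2 cache becomes 64 * 1024 elements.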
To eliminate capacity misses, the footprint size of each array A_i can only be (⌊F1/n_a⌋, F2), a
fraction of (F1, F2). Here, we choose to partition columns instead of rows, in order to preserve
spatial locality. Assume that (B1^(i), B2^(i)), 1 ≤ i ≤ n_a, is the tile size such that the footprint size
for array A_i within a single tile is (⌊F1/n_a⌋, F2). For 2-way and fully-associative caches, we choose
the tile size for the tiled loop as (B1, B2) = (min_i B1^(i), min_i B2^(i)). For directly-mapped caches, we
choose (B1, B2) = (min_i B1^(i) − S1, min_i B2^(i) − S2). One can prove that for directly-mapped, 2-way
and fully-associative caches, Property 1 holds under the above treatment. For other set-associative
caches, procedure EnumFPSize needs to be revised.
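The choice of the common tile size can be sketched as follows. The per-array tile sizes are taken as given, because the conversion from footprint size to tile size depends on the subscript patterns; the function name and the sample values are hypothetical.

```python
def choose_tile_size(per_array_tiles, S1: int, S2: int, C_a: int):
    """Combine per-array tile sizes (B1_i, B2_i) into one loop tile size.

    The minimum over all arrays is taken in each dimension; for a
    directly-mapped cache (C_a == 1) the skewing factors are subtracted.
    """
    B1 = min(b1 for b1, _ in per_array_tiles)
    B2 = min(b2 for _, b2 in per_array_tiles)
    if C_a == 1:
        B1 -= S1
        B2 -= S2
    return B1, B2
```

For instance, per-array tile sizes (40, 45) and (38, 50) give (38, 45) on a 2-way cache and (37, 44) on a directly-mapped cache when S1 = S2 = 1.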
5.2.2 Preserving Property 2
We apply inter-array padding to eliminate cross-interference misses within a tile traversal. For
simplicity of presentation, we assume that the array subscript patterns of one particular array A_k
cover all the array subscript patterns for all the other arrays A_i, i ≠ k. The discussion in this
section can be easily extended if such an assumption does not hold. Using inter-array padding, we
let the starting address of array A_i (1 ≤ i ≤ n_a) map to the same location in the cache as the
starting address of the (⌊F1/n_a⌋ * (i − 1))th column of array A_1. With such padding, cross-interference
misses are eliminated within a single tile between A_i and A_j (1 ≤ i, j ≤ n_a, i ≠ j).
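For two arrays, the padding rule amounts to a modular computation. The sketch below assumes that A_1 starts at address 0, that A_2 is laid out immediately after A_1 plus the pad, and that all sizes are counted in array elements; the function name is hypothetical.

```python
def padding_between_a1_a2(size_a1: int, F1: int, n_a: int, N: int, C_s: int) -> int:
    """Smallest non-negative pad x such that the start of A_2 maps to the
    same cache location as column floor(F1 / n_a) of A_1."""
    target = ((F1 // n_a) * N) % C_s        # cache offset of that column of A_1
    return (target - size_a1) % C_s         # Python's % is non-negative here
```

With F1 = 4 and n_a = 2 the target column is column 2 of A_1, so the pad x satisfies (size(A_1) + x) mod C_s = (2 * N) mod C_s.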
When the execution goes from one tile to the next, if the cache is directly-mapped, the newly
accessed data for A_i will map to cache locations previously unused in the tile traversal. If the
cache is not directly-mapped, the newly accessed data for A_i will map to cache locations which are
either previously unused or will not be referenced again within the current traversal. Therefore,
cross-interference misses are also eliminated within a tile traversal. Figure 6 illustrates an example
for F1 = 4 and n_a = 2, where the cache is directly mapped. Here, assuming the starting address for
array A_1 to be 0, the padded number of data items, x, between arrays A_1 and A_2 can be determined
from
$$(\mathrm{size}(A_1) + x) \equiv (2 \cdot N) \pmod{C_s}. \tag{18}$$
We are ready to present our tile-size selection algorithm in the next section.

Input: S1, S2, C_s1, C_b1, C_a1, C_s2, C_b2, C_a2, n1, n2, n, N (see Table 2).
Output: Tile size (B1, B2) and the transformed array declaration.
Procedure:
  /* M is the minimum execution cost */
  M ← ∞
  if (C_a1 ≠ 1) then C_s1 ← ((C_a1 − 1) / C_a1) · C_s1 end if
  if (C_a2 ≠ 1) then C_s2 ← ((C_a2 − 1) / C_a2) · C_s2 end if
  ComputeTileSize-2D(C_s1, C_b1, C_a1)   /* Option 1 */
  ComputeTileSize-1D(C_s1, C_b1)         /* Option 2 */
  ComputeTileSize-2D(C_s2, C_b2, C_a2)   /* Option 3 */
  ComputeTileSize-1D(C_s2, C_b2)         /* Option 4 */
  Apply inter-array padding [17].
  Return (B1, B2).

Procedure ComputeTileSize-1D(C_s, C_b)
  /* Targeting the cache with the size C_s. (TB1, TB2) is a temporary tile size. */
  Select the maximum tile width κ such that the footprint of one tile can fit in the cache with size C_s.
  TB1 ← κ − S1
  TB2 ← η + S2 * (ITMAX − 1)
  Compute the execution cost, TM, based on (15) for Option 2 or (17) for Option 4.
  if (TM < M) then B1 ← TB1, B2 ← TB2, M ← TM end if

Procedure ComputeTileSize-2D(C_s, C_b, C_a)
  /* Targeting the cache with the size C_s. (TB1, TB2) is a temporary tile size. */
  for F2 ← C_b to N do
    F1 ← 1
    t ← (F1 * N) mod C_s
    while ((F2 + C_b − 1) ≤ t ≤ (C_s − F2 − C_b + 1)) do
      Convert array footprint size (F1, F2) to loop tile size (TB1, TB2) [17].
      if (C_a = 1) then TB1 ← TB1 − S1, TB2 ← TB2 − S2 end if
      if (TB1 > 0 and TB2 > 0) then
        Compute the execution cost, TM, based on (14) for Option 1 or (16) for Option 3.
        if (TM < M) then B1 ← TB1, B2 ← TB2, M ← TM end if
      end if
      F1 ← F1 + 1
      t ← (F1 * N) mod C_s
    end while
  end for
Figure 7: Tile-size selection algorithm - STS

5.2.3 Algorithm STS
Algorithm STS in Figure 7 selects the tile size by interleaving the operations in procedure
EnumFPSize with the applications of Formulas (17) and (14) which compute the execution cost. We require
B2 to be no smaller than the cache line size C_b1. However, we do not require B2 to be a multiple
of C_b1, since such a requirement does not have much benefit when execution proceeds from one tile
to the next.
STS makes the decision between 1-D and 2-D tiling based on their execution cost. For 1-D tiling,
ComputeTileSize-1D tries to find tile width B1 such that Properties 1 and 2 are preserved on the
L2 cache and that Formula (17) is minimized. For 2-D tiling, ComputeTileSize-2D enumerates all
tile sizes which are free of self-interference misses. The tile size with the lowest execution cost is
selected. Between 1-D and 2-D tiling, the scheme with the lower execution cost is chosen.
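The 1-D branch and the keep-the-cheapest rule can be illustrated with a small Python sketch. The tile width κ is treated as an input, because how it is derived from the cache size depends on the array footprint, and the execution costs are supplied by the caller rather than computed; the function names are hypothetical.

```python
def tile_size_1d(kappa: int, S1: int, S2: int, eta: int, ITMAX: int):
    """Temporary tile size for 1-D tiling: the width is the maximum tile
    width minus the skew, the height spans the whole skewed column range."""
    TB1 = kappa - S1
    TB2 = eta + S2 * (ITMAX - 1)
    return TB1, TB2


def keep_best(best, candidate):
    """best and candidate are (B1, B2, cost) triples; keep the cheaper one."""
    return candidate if candidate[2] < best[2] else best
```

As a usage example with assumed inputs, `tile_size_1d(64, 1, 1, 999, 1050)` yields a width of 63 and a height of 2048; a candidate whose cost exceeds the current minimum leaves the selected tile size unchanged.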
Here, we want to specially mention the conversion from array footprint size (F1, F2) to loop tile
size (TB1, TB2) in Procedure ComputeTileSize-2D in Figure 7. We have
F,




where n_a is the number of arrays within the T-loop body and c is a small nonnegative constant to
adjust the loop tile size in order to avoid capacity misses. If n_a is large, it is very likely that TB1 is no
greater than S1. As a result, the (F1, F2) pair will have no effect on the final tile-size selection.
5.3 Inter-Array Padding and GroupPad
In the STS algorithm in Figure 7, after all four options are evaluated, one optimal tile size with the
associated targeted cache is determined. Inter-array padding is then applied to satisfy Property 2
on the targeted cache.
For 1-D tiling, if the array column size is large, it is often difficult to exploit the L1 temporal
locality. For example in Figure 1(b), if N is greater than the L1 cache size, the L1 temporal locality
across the loop T cannot be exploited because no legal tile size can be found satisfying Properties 1
and 2 on the L1 cache. For 1-D tiling, if Option 4 is chosen, then the tile size should clearly be
chosen to make the footprint fit the L2 cache. We need to justify, however, why we choose to
perform padding for the L2, but not the L1, cache. Inter-array padding can preserve Property 2
on the L2 cache. However, due to the large footprint for the L1 cache, inter-array padding will not
be able to help preserve the group reuses for the L1 cache. Alternatively, we can apply another
padding algorithm, GroupPad, proposed by Rivera and Tseng [13], to exploit the L1 cache locality.
The padding sizes generated by the two algorithms are often not the same, unfortunately. Suppose
that, after applying GroupPad, the locality due to group reuses can be fully exploited on both
the L1 and the L2 caches. On the other hand, no temporal locality across different tiles is
exploited. For 1-D tiling with the L2 cache targeted, we will have an execution cost estimation, after
GroupPad, as follows:
$$p_1 \cdot \left(\mathrm{ITMAX} \cdot \frac{W}{C_{b1}}\right) + p_2 \cdot \left(\mathrm{ITMAX} \cdot \frac{W}{C_{b2}}\right) + n_2 \cdot \frac{\mathrm{ITMAX} \cdot \gamma}{B_1}. \tag{20}$$
In order to exploit the L2 cache locality, B1 is to be chosen as large as possible. On the other
hand, both α and S1 are small. Comparing (17) with (20), it is thus clear that inter-array padding
is preferred in the case of Option 4.
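The comparison can be made explicit by subtracting the two estimates. The following is a sketch that assumes the Option 4 cost has the form $p_1 (1 + \alpha/n_a)\,\mathrm{ITMAX}\,W/C_{b1} + p_2\,W S_1 (\mathrm{ITMAX}-1)/(C_{b2} B_1) + n_2\,\mathrm{ITMAX}\,\gamma/B_1$ and that the GroupPad cost has the form $p_1\,\mathrm{ITMAX}\,W/C_{b1} + p_2\,\mathrm{ITMAX}\,W/C_{b2} + n_2\,\mathrm{ITMAX}\,\gamma/B_1$; the loop-overhead terms cancel.

```latex
\begin{align*}
\text{cost}_{\text{GroupPad}} - \text{cost}_{\text{Option 4}}
  &= p_2 \cdot \frac{W}{C_{b2}} \left( \mathrm{ITMAX} - \frac{S_1 (\mathrm{ITMAX} - 1)}{B_1} \right)
   - p_1 \cdot \frac{\alpha}{n_a} \cdot \mathrm{ITMAX} \cdot \frac{W}{C_{b1}} .
\end{align*}
```

Under these assumptions the first term approaches $p_2 \cdot \mathrm{ITMAX} \cdot W / C_{b2}$ when $B_1$ is large relative to $S_1$, while the second term is small when $\alpha$ is small, so the difference is positive and inter-array padding has the lower estimated cost.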
5.4 A Running Example
We now take SOR (Figure 1) as an example to show how STS works, assuming the following
parameters: N = 1000, ITMAX = 1050, C_s1 = 4096, C_b1 = 4, C_a1 = 2, C_s2 = 128 * 1024, C_b2 = 16,
C_a2 = 2, n1 = n2 = 15, p1 = 6, p2 = 30, and α = 0. Based on the array subscripts and the loop
bounds, we have S1 = S2 = 1, γ = η = 999 and W = N * N = 1000000.
In the following, we show the steps of STS.
• Since C_a1 = 2, C_s1 = 4096 / 2 = 2048.
• Since C_a2 = 2, C_s2 = (128 * 1024) / 2 = 64 * 1024.
• ComputeTileSize-2D(C_s1, C_b1, C_a1) is called. After this call, we will have: B1 = 38, B2 = 43
and M = 541048960.
• ComputeTileSize-1D(C_s1, C_b1) is called. During this call, no legal tile sizes can be generated
because of the large N. Therefore, we still have the same optimal tile size and execution cost as
in the previous step.
• ComputeTileSize-2D(C_s2, C_b2, C_a2) is called. Although this call is able to generate legal tile
sizes, after this call, the tile size and execution cost in the previous step remain optimal.
• ComputeTileSize-1D(C_s2, C_b2) is called. This call generates TB1 = 63, TB2 = 2048 and TM
= 1608043520. Since TM > M, the tile size and execution cost in the previous step remain
optimal.
• At this point, STS has chosen 2-D tiling with the L1 cache targeted. The optimal tile size
is (38, 43).
• No inter-array padding is applied since n_a = 1.

Table 3: Machine parameters
Processors   C_s1   C_b1   C_a1   C_s2   C_b2   C_a2   p1   p2
Ultra II     2K     2      1      256K   8      1      6    45
R10K         4K     4      2      512K   16     2      9    68
6 Experimental Evaluation
In this section, we use experimental results from several benchmark programs to compare different
tile-size selection algorithms. In Section 6.1, we describe how we set up the experiments. In
Section 6.2, we examine the results from benchmarks whose loops can be tiled at two levels. All
tile-size selection algorithms are compared for these benchmarks. We evaluate the impact of loop
overhead in Section 6.3. In Section 6.4, we justify our decision to drop the consideration of
software pipelining and the TLB, compared with [17]. In Section 6.5, we examine the results from
two programs, tomcatv and swim, from the well-known SPEC benchmark suites. The loops in
these two programs can be tiled at one level only, allowing a comparison among STS, TSS and
TLI only. We also compare our inter-array padding algorithm with GroupPad [13] for the tiled
codes. Further, we exhaustively test different tile sizes to see how close STS is to the optimal
tile size. In Section 6.6, we explain several subtle points concerning the results.
6.1 Experimental Setup
We have implemented a stand-alone version of STS based on Figure 7. We have previously
implemented our skewed tiling algorithm in Panorama [16], which is a Fortran source-to-source
compiler. However, some special parameters to STS, for example, n1 and n2, can be easily obtained
only within a compiler backend. Therefore, we have not integrated STS into Panorama.
Currently, we obtain these special parameters by examining the assembly code of the original
program.
We apply STS to three numerical kernels, SOR, Jacobi and Livermore Loop No. 18 (LL18), and
two SPEC benchmarks, tomcatv and swim. These benchmarks are chosen because they require
skewed tiling. We use reference inputs for tomcatv and swim. For SOR, Jacobi and LL18, we declare
N x N double precision arrays, with randomly chosen N based on a random number generator [12]
with the following formula
$$z_{n+1} = (16807 \cdot z_n) \bmod 2147483647. \tag{21}$$
Assuming that the array sizes under consideration range from r0 to r1, we select 200 array sizes,
a_n, such that
$$a_n = r_0 + (z_n \bmod (r_1 - r_0)), \quad 1 \le n \le 200. \tag{22}$$
We use z1 = 9 in all our experiments. Note that it would be too time-consuming to exhaustively test
all array sizes within the range in our experiments.
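The array-size selection is easy to reproduce. The following Python sketch applies the recurrence z_{n+1} = (16807 · z_n) mod 2147483647 with z_1 = 9 and maps each z_n to r0 + (z_n mod (r1 − r0)); the function name is hypothetical, and the range 200 to 2000 in the usage note is taken as an example.

```python
def array_sizes(r0: int, r1: int, count: int = 200, z1: int = 9) -> list[int]:
    """Generate `count` array sizes in [r0, r1) from the multiplicative
    congruential generator with multiplier 16807 and modulus 2^31 - 1."""
    sizes = []
    z = z1
    for _ in range(count):
        sizes.append(r0 + z % (r1 - r0))   # a_n = r0 + (z_n mod (r1 - r0))
        z = (16807 * z) % 2147483647       # z_{n+1}
    return sizes
```

For the range 200 to 2000, the first two sizes are 209 and 263, since z_1 = 9 and z_2 = 151263.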
We run the test programs on a SUN Ultra II uniprocessor workstation and on one MIPS
R10K processor of an SGI Origin 2000 multiprocessor, with the tile sizes selected by five different
algorithms, namely, STS, TLI [1], TSS [2], LRW [7] and DAT [11]. In order to handle several
equally-important arrays, we make an obviously necessary modification to the original TSS and
LRW algorithms such that the value of the initial tile size will meet the working set constraint.
We also modify the TLI algorithm such that only the cache size divided by the number of
equally-important arrays is used to compute the tile sizes which are free of self-interference misses. For
1-D tiling, we also extend TSS and TLI to consider the L2 cache if the L1 cache size is too small
for locality enhancement in that case. If any algorithm decides to choose the whole array column
as the tile height, then we let B2 = η + S2 * (ITMAX − 1) and tile the J_i loops only (Figure 2(b)).
Table 3 lists the machine parameters for the Ultra II and the R10K, assuming the size of an
array element to be 8 bytes. The main memory size for the Ultra II is 128M bytes, and it is 16G
bytes for the R10K, of which 1G bytes are local to the processor. To accommodate the competition
between instructions and data in the L2 cache, we only try to utilize 95% of the total L2 cache
capacity. We use the machine counters on both machines to measure the cache miss rate.
On the R10K, the untiled codes are compiled using the native compiler with the "-O3"
optimization switch set. On the R10K, we found that the tiled code compiled with the "-O2" switch
can sometimes run faster than that compiled with the "-O3" switch, regardless of the tile-size selection
schemes. Therefore, we compile the tiled code with "-O2" or "-O3" depending on which produces
the shorter execution time. For all the tile-size selection schemes, we switch off loop tiling for the native
compiler on the R10K when we compile the tiled source programs (for both 1-D and 2-D tiling).
We switch off prefetching on the R10K when we compile 2-D tiled source codes since prefetching
may increase cross-interference misses for a smaller tile height B2. We also switch off common block
reorganization since the tile-size selection algorithms already take care of memory layout. On the
Ultra II, both the untiled and the tiled codes are compiled using the native compiler with the
"-fast -xchip=ultra2 -xarch=v8plusa -fsimple=2" optimization switch, which is recommended by
the vendor.
6.2 Loops Which Can Be Tiled at Two Levels
In this subsection, we compare different tile-size selection schemes through three numeric kernels,
SOR, Jacobi and LL18.
The SOR kernel
The SOR kernel is a perfectly-nested loop as shown in Figure 1(a). We fix ITMAX to 1050
[Figure 8: execution time of SOR for various schemes on the Ultra II, plotted against matrix size (200 to 2000)]
[Figure 9: L1 cache miss rate of SOR for various schemes on the Ultra II]
[Figure 10: L2 cache miss rate of SOR for various schemes on the Ultra II]
Figure 11: Execution time of SOR for various schemes on the R10K
[Figure 12: L1 cache miss rate of SOR for various schemes on the R10K]
Figure 13: L2 cache miss rate of SOR for various schemes on the R10K
Table 4: Speedup by STS and average cache miss rates over different schemes for SOR
Ultra II          ORG     LRW     TSS     TLI     STS     DAT
Speedup by STS    1.05    1.01    1.27    0.99    1.00    1.05
L1 Miss Rate      0.14    0.02    0.07    0.03    0.02    0.06
L2 Miss Rate      0.066   0.006   0.009   0.005   0.006   0.008
R10K              ORG     LRW     TSS     TLI     STS     DAT
Speedup by STS    1.30    1.02    1.09    1.01    1.00    1.00
L1 Miss Rate      0.113   0.006   0.024   0.012   0.006   0.031

Figure 14: Execution time of Jacobi for various schemes on the Ultra II
Equation (22). We have S1 = S2 = 1 and α = 0. We have n1 = n2 = 11 on the R10K and
n1 = n2 = 22 on the Ultra II. On the Ultra II, STS chooses 1-D tiling for 7 test cases and chooses
2-D tiling for 193 test cases. On the R10K, STS chooses 1-D tiling for 9 test cases and 2-D tiling
for 191 test cases. All test cases of 1-D tiling on both machines target the L2 cache.
Table 4 summarizes the average speedup by STS over other schemes, and the average L1 and L2 cache
miss rates for SOR on both the Ultra II and the R10K. The execution time is averaged by geometric
mean, and the cache miss rates are averaged by the arithmetic mean of the cache miss rates for individual
array sizes. On the Ultra II, STS performs slightly worse than TLI, by 1%. On the R10K, it performs
as well as or better than all other schemes. On the Ultra II, TSS often chooses a rectangular tile size
with a small B2. Such a tile size introduces a high loop overhead, which slows down the program
execution.
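The two kinds of averages can be computed as follows. This is a sketch only: it assumes that the average speedup is the geometric mean of the per-array-size ratios of execution times, which is one natural reading of the averaging described above, and the function names are hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    """Plain average, used here for per-array-size cache miss rates."""
    return sum(values) / len(values)


def average_speedup(times_other, times_sts):
    """Geometric mean of time_other / time_sts over all array sizes
    (assumed definition of the average speedup by STS)."""
    return geometric_mean([o / s for o, s in zip(times_other, times_sts)])
```

The geometric mean of the ratios equals the ratio of the geometric means of the two sets of execution times, so either order of computation gives the same average speedup.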
From Table 4, STS has the lowest average L1 cache miss rate, but it exhibits the worst L2
cache miss rate on the R10K. When combined with the low L1 cache miss rate, however, STS still
outperforms the others on the R10K, with a largest speedup of 1.09 over TSS.
Figures 8 and 11 show the execution time for various schemes on the Ultra II and on the R10K
respectively. Figures 9 and 10 show the L1 cache and L2 cache miss rates respectively on the Ultra
II. Figures 12 and 13 show the L1 cache and L2 cache miss rates respectively on the R10K.
The Jacobi Kernel
The Jacobi kernel is imperfectly nested such that the loop T contains two perfect nests (m = 2 in
Figure 2(a)). The T-loop body contains two arrays after tiling [16]. We fix ITMAX to 500 and
randomly choose 200 array sizes ranging from 200 to 2000. We have S1 = S2 = 1 and α = 1.0. We
have n1 = n2 = 17 on the R10K and n1 = n2 = 28 on the Ultra II. On the Ultra II, STS chooses
Figure 15: L1 cache miss rate of Jacobi for various schemes on the Ultra II
[Figure 16: L2 cache miss rate of Jacobi for various schemes on the Ultra II]
[Figure 17: execution time of Jacobi for various schemes on the R10K]
Figure 18: L1 cache miss rate of Jacobi for various schemes on the R10K
Figure 19: L2 cache miss rate of Jacobi for various schemes on the R10K
1-D tiling for 19 test cases and 2-D tiling for 181 test cases. All test cases of 1-D tiling, except one
on each machine, target the L2 cache.
Table 5 shows the average speedup by STS, and the average L1 and L2 cache miss rates for Jacobi on
both the Ultra II and the R10K. On both machines, STS performs as well as or better than all other
schemes.
From Table 5, STS has the lowest average L1 cache miss rate except when compared with LRW
on the R10K. However, STS achieves a good speedup over other schemes on both machines in most
cases.
Figures 14 and 17 show the execution time of Jacobi for various schemes on the Ultra II and on
the R10K respectively. Figures 15 and 16 show the L1 cache and L2 cache miss rates respectively
on the Ultra II. Figures 18 and 19 show the L1 cache and L2 cache miss rates respectively on the
R10K.
The LL18 Kernel
Similar to Jacobi, LL18 is also imperfectly nested and the loop T contains three perfect nests
(m = 3 in Figure 2(a)). LL18 has 9 arrays, and the tiled version has 11 arrays after duplicating
arrays ZR and ZZ. Due to the relatively large number of arrays, the array sizes we used in SOR
will produce extremely small tile sizes for all the tile-size selection schemes. Therefore, we reduce
the array sizes and randomly choose 200 array sizes ranging from 200 to 500. We fix ITMAX to
Figure 20: Execution time of LL18 for various schemes on the Ultra II
[Figure 21: L1 cache miss rate of LL18 for various schemes on the Ultra II]
Figure 22: L2 cache miss rate of LL18 for various schemes on the Ultra II
[Figure 23: execution time of LL18 for various schemes on the R10K]
[Figure 24: L1 cache miss rate of LL18 for various schemes on the R10K]
Figure 25: L2 cache miss rate of LL18 for various schemes on the R10K
Table 5: Speedup by STS and average cache miss rates over different schemes for Jacobi
Ultra II          ORG     LRW     TSS     TLI     STS     DAT
Speedup by STS    5.14    1.33    2.07    1.22    1.00    1.05
L1 Miss Rate      0.60    0.12    0.24    0.24    0.06    0.19
L2 Miss Rate      0.15    0.02    0.02    0.01    0.02    0.01
R10K              ORG     LRW     TSS     TLI     STS     DAT
Speedup by STS    5.66    1.01    1.24    1.19    1.00    1.00
L1 Miss Rate      0.234   0.022   0.062   0.144   0.038   0.082
L2 Miss Rate      0.169   0.066   0.0~3   0.006   0.104   0.010

Table 6: Speedup by STS and average cache miss rates over different schemes for LL18
Ultra II          ORG     LRW     TSS     TLI     STS     DAT
Speedup by STS    2.10    2.25    2.83    2.18    1.00    2.35
L1 Miss Rate      0.435   0.217   0.284   0.326   0.414   0.208
L2 Miss Rate      0.112   0.037   0.056   0.019   0.020   0.021
R10K              ORG     LRW     TSS     TLI     STS     DAT
Speedup by STS    1.83    2.11    2.11    1.72    1.00    1.80
L1 Miss Rate      0.173   0.072   0.096   0.122   0.214   0.066
L2 Miss Rate      0.128   0.049   0.075   0.010   0.004   0.026
Table 6 shows the average speedup by STS, and the average L1 and L2 cache miss rates for LL18 on
both the Ultra II and the R10K. Note that S2 = 2 suggests that we must have a large tile height
B2 to reduce the number of L1 cache misses for 2-D tiling. After tiling, the T-loop body contains
11 arrays. Such a large number of arrays often makes STS unable to eliminate the cache misses for
2-D tiling through padding (see Section 5.2.2).
STS correctly estimates the number of cache misses between 1-D and 2-D tiling, which is crucial
to the determination of the final tile sizes. Out of 200 cases, STS chooses 1-D tiling in 186 cases on
the Ultra II and in all 200 cases on the R10K. All test cases of 1-D tiling on both machines target
the L2 cache. All the other tiling schemes either choose 2-D tiling or no tiling if they fail to generate
legal tile sizes. From Table 6, STS achieves a significant speedup over other schemes, ranging
from 1.72 to 2.86 on both machines. Such a speedup is largely due to the more accurate modeling
of the effects of loop skewing and due to the consideration of both L1 and L2 cache misses.
From Table 6, STS has the worst average L1 cache miss rates on both machines. This is expected
because 1-D tiling tries to minimize the number of L2 cache misses. From Table 6, STS indeed has
much lower L2 cache miss rates than other schemes in most cases.
Figures 20 and 23 show the execution time of LL18 for various schemes on the Ultra II and on
the R10K respectively. Figures 21 and 22 show the L1 cache and L2 cache miss rates respectively
on the Ultra II. Figures 24 and 25 show the L1 cache and L2 cache miss rates respectively on the
R10K.
Overall, for SOR and Jacobi, where the L1 cache locality can be exploited in most cases, STS
achieves comparable results. For LL18, where the L2 cache locality is exploited in most cases, STS
is significantly better.
6.3 Impact of Loop Overhead
In this subsection, we evaluate the impact of loop overhead through the benchmarks SOR, Jacobi
[Figure 26: comparison with and without loop overhead for SOR; (a) on the Ultra II, (b) on the R10K]
[Figure 27: comparison with and without loop overhead for Jacobi]
Figure 28: Comparison with and without loop overhead for LL18
Table 7: Speedup by STS over the STS without consideration of loop overhead
Ultra II   STS    STS-loop     R10K     STS    STS-loop
SOR        1.00   0.98         SOR      1.00   1.00
Jacobi     1.00   1.03         Jacobi   1.00   1.04

(a) On the Ultra II (b) On the R10K
Figure 29: Execution time reduction by STS compared with STS not considering loop overhead
loop overhead. Specifically, "STS" stands for the tiled codes with STS applied, "STS-loop" for the
tiled codes with STS not considering loop overhead (i.e., considering cache misses only). Figure 26
shows the detailed results for SOR. Figure 27 shows the results for Jacobi. Figure 28 shows the
results for LL18.
From Table 7, the impact of loop overhead seems marginal. In Figure 29, we further show the
distribution of execution time reduction by STS, compared with STS not considering loop overhead.
From Figure 29, in most cases, the performance difference remains in the range [-2%, 2%]. However,
out of a total of 1200 test cases on both machines, by considering loop overhead, STS improves the
performance by 10% or more in 71 cases, and hurts the performance by more than
10% in 38 cases.
Without considering loop overhead, STS tends to choose 2-D tiling targeting the L1 cache,
even with a small B_2. Although such a small B_2 still tends to improve L1 cache locality, the
corresponding loop overhead could offset that benefit. On the other hand, converting to 1-D tiling
with the L2 cache targeted, which is sometimes caused by considering loop overhead, loses the ability to
exploit L1 cache locality, and therefore may actually hurt the performance. Table 7 has shown the
marginal gain (up to 4%) by considering loop overhead.
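To make the loop structure behind this trade-off concrete, the following is a minimal sketch of 1-D skewed tiling. It is a toy illustration only: the 1-D in-place relaxation, the function names, and the parameters `steps` and `tile` are assumptions made for this example and are not the benchmark codes or the STS algorithm. The sketch skews the space loop by the time step (ip = i + t), tiles the skewed loop, and checks that the tiled code computes the same values as the untiled code.

```python
def relax_untiled(a, steps):
    """Reference version: in-place 1-D relaxation swept `steps` times."""
    a = list(a)
    n = len(a)
    for t in range(steps):
        for i in range(1, n - 1):
            a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


def relax_skewed_tiled(a, steps, tile):
    """Same computation after skewing (ip = i + t) and tiling the ip loop."""
    a = list(a)
    n = len(a)
    # The skewed index ip = i + t runs from 1 to (n - 2) + (steps - 1).
    for start in range(1, n + steps - 2, tile):
        for t in range(steps):
            # These bound computations run once per (tile, t) pair, so a
            # smaller tile means they are executed more often.
            lo = max(start, 1 + t)
            hi = min(start + tile, n - 1 + t)
            for ip in range(lo, hi):
                i = ip - t  # undo the skew to recover the array index
                a[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return a


data = [float((7 * k) % 11) for k in range(40)]
assert relax_skewed_tiled(data, steps=5, tile=8) == relax_untiled(data, steps=5)
```

After skewing, every dependence distance in (t, ip) is non-negative, which is why the tiles can be executed one after another. The per-tile `max`/`min` bound computations are one source of the loop overhead discussed above.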
6.4 Impact of Software Pipelining and the TLB
In this subsection, we experimentally justify our decision to drop the consideration of software
pipelining and the TLB, compared with [17].
Table 8 shows the speedup of STS over the STS with consideration of software pipelining. We
can see that on the Ultra II, incorporating software pipelining into the STS can improve the
average performance for all three programs. However, the average performance drops for all three
programs on the R10K. The average performance gain from considering software pipelining is close
to 0%.
Table 8: Speedup by STS over the STS with consideration of software pipelining
Ultra II   STS    STS+swp      R10K     STS    STS+swp
SOR        1.00   0.94         SOR      1.00   1.03
Jacobi     1.00   0.97         Jacobi   1.00   1.03
LL18       1.00   0.98         LL18     1.00   1.06
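The "close to 0%" observation can be checked directly against the Table 8 entries. The short sketch below assumes an unweighted mean over the six STS+swp entries; the report does not state how the average was formed, so this is only an illustration of the arithmetic.

```python
# STS+swp entries from Table 8 (the STS column is normalized to 1.00).
sts_swp = {
    "Ultra II": {"SOR": 0.94, "Jacobi": 0.97, "LL18": 0.98},
    "R10K": {"SOR": 1.03, "Jacobi": 1.03, "LL18": 1.06},
}


def mean(values):
    """Unweighted arithmetic mean (an assumption of this sketch)."""
    values = list(values)
    return sum(values) / len(values)


per_machine = {machine: mean(row.values()) for machine, row in sts_swp.items()}
overall = mean(v for row in sts_swp.values() for v in row.values())

for machine, value in per_machine.items():
    print(f"{machine}: {value:.3f}")
print(f"overall: {overall:.3f}")
```

Under this assumption the two machines move in opposite directions and the overall mean stays within about 0.2% of 1.00.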
Even when the TLB is considered, the STS will generate the same tile sizes on the R10K [17].
However, it will generate smaller tile sizes on the Ultra II. Note that such smaller tile sizes will
cause under-utilization of the L2 cache. Table 9 shows the speedup of STS over the STS
considering the TLB. We will further justify our decision in Section 6.6.
6.5 Loops Which Can Be Tiled at One Level Only
In this subsection, we evaluate the quality of STS for loops which can be tiled at one level only,
using two SPEC benchmarks, tomcatv and swim. These two benchmarks can be tiled at one loop
level only. Furthermore, the L2 cache should be targeted because of the large number of arrays
within the loop body and the large array column size. We also evaluate the impact of padding
through these two benchmarks.
tomcatv
We use two different reference inputs for tomcatv from SPEC92 and SPEC95 respectively. To
[Plot panels, x-axis "Tile Size": tomcatv on the Ultra II and on the R10K, (a) SPEC92 and (b) SPEC95; swim on the Ultra II, (b) SPEC95]
Figure 33: Performance of tomcatv with different tile sizes and without padding on the R10K
Figure 35: Performance of swim with different tile sizes and without padding on the Ultra II
to three times the size selected by STS, for each version of tomcatv. Figures 30(a) and (b) show
the results on the Ultra II, where the vertical bar indicates the tile size selected by the STS. The
original programs from SPEC92 and SPEC95 run 5 and 174 seconds respectively on the Ultra II,
and 4.0 and 115.0 seconds respectively on the R10K. Figures 32(a) and (b) show the results on
the R10K. The results chosen by STS are at
most 5% worse when compared with the optimal solutions for the enumerated tile sizes for both
versions of the codes on both machines. To examine how padding affects the STS, we also run
both versions of tomcatv without padding applied. Figures 31(a) and (b) show the results on the
Ultra II, and Figures 33(a) and (b) show the results on the R10K. Except in a few cases, the padded
version runs significantly faster than the unpadded version, which demonstrates the effectiveness
of padding for STS.
swim
On the R10K, we use three different reference inputs for swim from SPEC92, SPEC95 and SPEC2000
respectively. On the Ultra II, however, because of the large data set size and the relatively small main
memory size, the SPEC2000 version of swim cannot be tiled with a positive tile size, i.e., it cannot
be tiled profitably. Hence, on the Ultra II, we use two different reference inputs from SPEC92
[Plot panels, x-axis "Tile Size": swim on the R10K, (a) SPEC92, (b) SPEC95, (c) SPEC2000]
Figure 37: Performance of swim with different tile sizes and without padding on the R10K
the size selected by STS for each version of swim. On the Ultra II, the original programs from
SPEC92 and SPEC95 run 36 and 157 seconds respectively. On the R10K, the original programs
from SPEC92, SPEC95 and SPEC2000 run 21.2, 91.9 and 1156 seconds respectively. Figures 34(a)
and (b) show the results on the Ultra II, and Figures 36(a), (b) and (c) show the results on the
R10K. When compared with the optimal solutions for the enumerated tile sizes for all versions
of the codes on both machines, the results chosen by STS are at most 5% worse, except that for
SPEC92 swim on the Ultra II the performance degrades by 13%. The tiled swim requires several
scalars to be expanded to 1-D arrays, which are ignored by our model (see Section 3). We suspect
the performance degradation is due to the cross-interference misses between 2-D arrays and these
ignored 1-D arrays. Figures 35(a) and (b) show the results on the Ultra II for unpadded versions
of swim, and Figures 37(a), (b) and (c) show the results on the R10K. Similar to tomcatv, the padded
version runs faster than the unpadded version in most cases for SPEC92 and SPEC95.
TLI and TSS can also be applied to tomcatv and swim by assuming the tile height to be the
trip count of the maximum range of the inner loop bounds. They generate the same tile sizes as
STS, but without padding applied. The vertical bars in Figures 31, 33, 35 and 37 indicate
the execution time for both TLI and TSS. From Figures 30 to 37, the versions with padding run
significantly faster than the ones without padding, with a speedup of 1.08 to 2.14. Therefore, STS
is superior to TLI and TSS for tomcatv and swim.
We also applied the GroupPad algorithm [13] to the tiled tomcatv and swim. Table 10 compares
the results with our inter-array padding. It is clear that padding for the L2 cache (using our
inter-array padding) is better than padding for the L1 cache (using GroupPad) in the case of 1-D
skewed tiling with the L2 cache targeted.
Table 10: Execution time comparison between inter-array padding and GroupPad in seconds
                      tomcatv (Ultra II)   swim (Ultra II)    tomcatv (R10K)     swim (R10K)
                      SPEC92   SPEC95      SPEC92   SPEC95    SPEC92   SPEC95    SPEC92   SPEC95   SPEC2000
Inter-array padding   4        92          26       70        3.5      72        19.8     59.6     581.6
GroupPad              5        140         28       130       3.6      117       20.0     119      685
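The comparison in Table 10 can be restated as per-case ratios of GroupPad time to inter-array padding time. The sketch below only re-derives those ratios from the table entries; the label strings and variable names are chosen for this illustration.

```python
# Execution times in seconds, copied from Table 10 in column order.
labels = [
    "tomcatv/Ultra II/SPEC92", "tomcatv/Ultra II/SPEC95",
    "swim/Ultra II/SPEC92", "swim/Ultra II/SPEC95",
    "tomcatv/R10K/SPEC92", "tomcatv/R10K/SPEC95",
    "swim/R10K/SPEC92", "swim/R10K/SPEC95", "swim/R10K/SPEC2000",
]
inter_array = [4, 92, 26, 70, 3.5, 72, 19.8, 59.6, 581.6]
group_pad = [5, 140, 28, 130, 3.6, 117, 20.0, 119, 685]

# A ratio above 1 means inter-array padding is faster than GroupPad.
ratios = {
    label: slow / fast
    for label, fast, slow in zip(labels, inter_array, group_pad)
}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}")
```

Every ratio is above 1, with the largest gaps on the SPEC95 inputs.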
Table 11: Summary of speedup of STS over other schemes
           GIle   LRW    TSS    TLI    DAT
Ultra II   2.25   1.63   1.95   1.38   1.37
R10K       2.38   1.30   1.~2   1.27   1.22
Both       2.31   1.46   1.66   1.32   1.29
6.6 Discussion
In summary, Table 11 shows the normalized execution time for all 600 cases for SOR, Jacobi
and LL18, where "Both" stands for both the Ultra II and the R10K. From Table 11, the major
competitors of STS are DAT and TLI, where DAT performs a little better than TLI. However,
because DAT ignores the L2 cache and only chooses square tile sizes, and TLI ignores the L2 cache
and does not apply inter-array padding to minimize cross-interference misses, they do not perform
as well as STS.
Comparison with LRW
One interesting point concerns LRW. Considering the combination of each benchmark (SOR,
Jacobi and LL18) and each machine (Ultra II and R10K), LRW produces fewer L1 cache misses
on average in 3 out of 6 combinations compared with STS. However, this does not translate into a large
performance saving. We found that in general LRW produces smaller tile sizes than STS, which
potentially introduces more loop overhead. For LL18, LRW has greater average L2 cache miss rates
than STS, since STS exploits locality for the L2 cache in most cases due to the large number of arrays.
TLB
Unlike [17], we drop the TLB constraint in the STS algorithm, so the array footprint
within one tile may no longer fit in the TLB. As demonstrated in Section 6.4, such a TLB
constraint may cause L2 cache under-utilization, thus hurting the performance in many cases.
In our execution cost model, we did not consider TLB misses. Although TLB misses could
be added to the execution cost, we consider them unimportant when compared
with the L1 and L2 cache misses, for the following reasons:
• Modern microprocessors often have a large TLB block size. For example, the Ultra
II supports a block size of 8KB, 16KB, etc., and the R10K supports 32KB, 64KB, etc. The
L2 cache block size, on the other hand, is much smaller.
• In general, the number of TLB entries is enough to execute one T-loop iteration without
TLB thrashing.
• For the relaxation codes we are targeting, the spatial locality is often fully exploited in the
innermost loop body. That is, the array elements are often accessed with stride one in the
innermost loop.
• Because of the above three factors, one TLB miss penalty can be amortized among a large
number of TLB hits, which makes the TLB misses less important than L2 cache misses. For
example, let us assume that each array element takes 8 bytes, the TLB block size is 32KB,
the L2 cache block size is 128B and the array is accessed with spatial locality fully utilized. A
TLB miss will be generated every 4K data references, while an L2 cache miss will be generated
every 16 data references.
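The arithmetic in the last bullet can be reproduced in a few lines. The constants are the ones assumed in the text (8-byte elements, 32KB TLB block, 128B L2 cache block, stride-one access); the variable names are chosen for this sketch.

```python
ELEMENT_BYTES = 8              # size of one array element
TLB_BLOCK_BYTES = 32 * 1024    # TLB block (page) size assumed in the text
L2_BLOCK_BYTES = 128           # L2 cache block size assumed in the text

# With stride-one access, each block is touched by consecutive references,
# so one miss is incurred per block's worth of elements.
refs_per_tlb_miss = TLB_BLOCK_BYTES // ELEMENT_BYTES
refs_per_l2_miss = L2_BLOCK_BYTES // ELEMENT_BYTES
l2_misses_per_tlb_miss = refs_per_tlb_miss // refs_per_l2_miss

print(refs_per_tlb_miss, refs_per_l2_miss, l2_misses_per_tlb_miss)
```

Under these assumptions there are 256 L2 cache misses for every TLB miss, which is the sense in which a TLB miss penalty is amortized.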
Miscellaneous
From Figures 4 to 6, the average L1 cache miss rates on the Ultra II are generally greater than
those on the R10K. This is because the Ultra II has a smaller L1 cache size. Also from Figures 4
to 6, STS performs better on the Ultra II than on the R10K in most cases. Such results suggest
that STS can have a bigger impact for smaller cache sizes.
Recommendation
Table 12 summarizes our general observation concerning 1-D vs. 2-D tiling. If the number of
arrays in the T-loop body is large, 1-D tiling is recommended because it is difficult to satisfy
Properties 1 and 2 (i.e., eliminating all interference misses). Otherwise, if the skewing factor S_2 is
small, 2-D tiling is preferred. If S_2 is large, then either 1-D or 2-D may be applied, depending on
the estimated execution cost. Both 1-D tiling and 2-D tiling can target either the L1 cache or the
L2 cache, depending on the respective execution costs.
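The recommendation above can be read as a small decision procedure. The sketch below is only an illustration of that reading: the function name and parameters are hypothetical, the text gives no numeric thresholds for "large" or "small" (so they are passed in as booleans), and the cost estimates stand in for the execution cost model described earlier in the report.

```python
def recommend_tiling(many_arrays, skew_is_small, cost_1d=None, cost_2d=None):
    """Return "1-D" or "2-D" following the three rules summarized in the text.

    many_arrays and skew_is_small are caller-supplied booleans because the
    text gives no thresholds. cost_1d and cost_2d are estimated execution
    costs in any consistent unit, needed only when the skewing factor is large.
    """
    if many_arrays:
        # Hard to eliminate all interference misses with many arrays.
        return "1-D"
    if skew_is_small:
        return "2-D"
    if cost_1d is None or cost_2d is None:
        raise ValueError("large skewing factor: both cost estimates are required")
    # Breaking ties toward 2-D is an arbitrary choice made for this sketch.
    return "1-D" if cost_1d < cost_2d else "2-D"


print(recommend_tiling(many_arrays=True, skew_is_small=False))
print(recommend_tiling(many_arrays=False, skew_is_small=False,
                       cost_1d=12.0, cost_2d=10.0))
```

Which cache each choice targets is decided separately from the respective execution costs, so it is left out of this sketch.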
7 Conclusion
In this paper, we address the issues of when and how to exploit the L2 cache locality for skewed
tiling. We present a tile-size selection algorithm, STS, based on an execution cost model which
incorporates both the L1 and the L2 cache misses to determine when the L2 cache locality should
be exploited. We apply inter-array padding to minimize the cross-interference misses so that
L2 cache locality can be achieved. The experimental results show that STS achieves an average
speedup of 1.29 to 1.66 over previous algorithms for three test programs with various data sizes.
For two SPEC benchmarks with different inputs, STS achieves a speedup of 1.08 to 2.14 over those
applicable previous algorithms. The results produced by STS are shown to be within 5% of the
optimal, except for one test case in which the STS underperforms by 13%. Overall, for skewed
tiling, the experimental results favor STS over previously-proposed algorithms.
In our experiments, we found that turning on the compiler switch for prefetching for the tiled
codes may degrade the performance. How to effectively combine tiling and prefetching seems an
interesting future research topic.
References
[1] Jacqueline Chame and Sungdo Moon. A tile selection algorithm for data locality and
cache interference. In Proceedings of the Thirteenth ACM International Conference on
Supercomputing, pages 492-499, Rhodes, Greece, June 1999.
[2] Stephanie Coleman and Kathryn S. McKinley. Tile size selection using cache organization and
data layout. In Proceedings of ACM SIGPLAN Conference on Programming Language Design
and Implementation, pages 279-290, La Jolla, CA, June 1995.
[3] J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. In
Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel
Computing, August 1991. Also in Lecture Notes in Computer Science, pp. 328-341,
Springer-Verlag, August 1991.
[4] Somnath Ghosh, Margaret Martonosi, and Sharad Malik. Precise miss analysis for program
transformations with caches of arbitrary associativity. In Proceedings of the Eighth ACM
Conference on Architectural Support for Programming Languages and Operating Systems,
pages 228-239, San Jose, California, October 1998.
[5] John Hennessy and David Patterson. Computer Architecture: A Quantitative Approach.
Morgan Kaufmann Publishers, 1996.
[6] Induprakas Kodukula, Nawaaz Ahmed, and Keshav Pingali. Data-centric multi-level
blocking. In Proceedings of ACM SIGPLAN Conference on Programming Language Design
and Implementation, pages 346-357, Las Vegas, NV, June 1997.
[7] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf. The cache performance and
optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on
Architectural Support for Programming Languages and Operating Systems, pages 63-74, Santa
Clara, CA, April 1991.
[8] Naraig Manjikian and Tarek Abdelrahman. Fusion of loops for parallelism and locality. IEEE
Transactions on Parallel and Distributed Systems, 8(2):193-209, February 1997.
[9] Kathryn McKinley, Steve Carr, and Chau-Wen Tseng. Improving data locality with loop
transformations. ACM Transactions on Programming Languages and Systems, 18(4):424-453,
July 1996.
[10] Nicholas Mitchell, Karin Hogstedt, Larry Carter, and Jeanne Ferrante. Quantifying the
multi-level nature of tiling interactions. International Journal of Parallel Programming,
26(6):641-670, December 1998.
[11] Preeti Panda, Hiroshi Nakamura, Nikil Dutt, and Alexandru Nicolau. Augmenting loop tiling
with data alignment for improved cache performance. IEEE Transactions on Computers,
48(2):142-149, February 1999.
[12] Stephen Park and Keith Miller. Random number generators: Good ones are hard to find.
Communications of the ACM, 31(10):1192-1201, October 1988.
[13] Gabriel Rivera and Chau-Wen Tseng. Eliminating conflict misses for high performance
architectures. In Proceedings of the 1998 ACM International Conference on Supercomputing,
pages 353-360, Melbourne, Australia, July 1998.
[14] Gabriel Rivera and Chau-Wen Tseng. A comparison of compiler tiling algorithms. In
Proceedings of the Eighth International Conference on Compiler Construction, Amsterdam,
The Netherlands, March 1999.
[15] Yonghong Song and Zhiyuan Li. A compiler framework for tiling imperfectly-nested loops. In
Proceedings of the Twelfth International Workshop on Languages and Compilers for Parallel
Computing, San Diego, CA, August 1999.
[16] Yonghong Song and Zhiyuan Li. New tiling techniques to improve cache temporal locality.
In Proceedings of ACM SIGPLAN Conference on Programming Language Design and
Implementation, pages 215-228, Atlanta, GA, May 1999.
[17] Yonghong Song and Zhiyuan Li. Impact of tile-size selection for skewed tiling. Technical
Report CSD-TR-00-0018, Department of Computer Science, Purdue University, 2000. Also
available at http://www.cs.purdue.edu/homes/songyh/academic.html.
[18] Yonghong Song and Zhiyuan Li. Effective use of the level-two cache for skewed tiling (extended
version). Technical Report CSD-TR-?, Department of Computer Science, Purdue University,
2001. Also available at http://www.cs.purdue.edu/homes/songyh/academic.html.
[19] O. Temam, C. Fricker, and W. Jalby. Cache interference phenomena. In Proceedings of
SIGMETRICS'94, pages 261-271, Santa Clara, CA, 1994.
[20] Michael Wolf. Improving Locality and Parallelism in Nested Loops. PhD thesis, Department
of Computer Science, Stanford University, August 1992.
[21] Michael E. Wolf and Monica S. Lam. A data locality optimizing algorithm. In Proceedings of
ACM SIGPLAN Conference on Programming Language Design and Implementation, pages
30-44, Toronto, Ontario, Canada, June 1991.
[22] Michael E. Wolf, Dror E. Maydan, and Ding-Kai Chen. Combining loop transformations
considering caches and scheduling. In Proceedings of the Twenty-Ninth Annual IEEE/ACM
International Symposium on Microarchitecture, pages 274-286, Paris, France, December 1996.
[23] Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley
Publishing Company, 1995.
