A Compiler Algorithm for Optimizing Locality in Loop Nests by Kandemir, Mahmut et al.
Syracuse University 
SURFACE 
Electrical Engineering and Computer Science College of Engineering and Computer Science 
1997 
A Compiler Algorithm for Optimizing Locality in Loop Nests 
Mahmut Kandemir 
Syracuse University 
J. Ramanujam 
Lousiana State University, ECE Department 
Alok Choudhary 
Northwestern University, ECE Department 
Follow this and additional works at: https://surface.syr.edu/eecs 
 Part of the Computer Sciences Commons 
Recommended Citation 
Kandemir, Mahmut; Ramanujam, J.; and Choudhary, Alok, "A Compiler Algorithm for Optimizing Locality in 
Loop Nests" (1997). Electrical Engineering and Computer Science. 31. 
https://surface.syr.edu/eecs/31 
This Working Paper is brought to you for free and open access by the College of Engineering and Computer 
Science at SURFACE. It has been accepted for inclusion in Electrical Engineering and Computer Science by an 
authorized administrator of SURFACE. For more information, please contact surface@syr.edu. 
A Compiler Algorithm for Optimizing Locality in Loop NestsM. Kandemir J. Ramanujam A. ChoudharyEECS Dept. ECE Dept. ECE Dept.Syracuse University Louisiana State University Northwestern UniversitySyracuse, NY, 13244 Baton Rouge, LA 70803 Evanston, IL, 60208-3118mtk@ece.nwu.edu jxr@ee.lsu.edu choudhar@ece.nwu.eduAbstractThis paper describes an algorithm to optimize cache localityin scientic codes on uniprocessor and multiprocessor ma-chines. A distinctive characteristic of our algorithm is thatit considers loop and data layout transformations in a uni-ed framework. We illustrate through examples that ourapproach is very eective at reducing cache misses and tile-size sensitivity of blocked loop nests; and can optimize nestsfor which optimization techniques based on loop transfor-mations alone are not successful. An important special caseis the one in which data layouts of some arrays are xed andcannot be changed. We show how our algorithm can handlethis case, and demonstrate how it can be used to optimizemultiple loop nests.1 IntroductionMinimizing the time spent in data accesses is an impor-tant issue in the ecient execution of nested loops on bothuniprocessors and multiprocessors. Although caches are ca-pable of reducing the average memory access time and op-timizing compilers are able to detect signicant parallelism,the performance of scientic programs on both uniprocessorsand multiprocessors can be rather poor due to not exploitingthe full potential locality in these programs [13].We present a compiler approach to enhance the cacheperformance of these programs on uniprocessors and multi-processors. In a unied framework, our approach considersmodifying array layouts in memory and transforming loopnests suitably to exploit locality. We simulate miss rates forseveral nests in order to demonstrate that our approach isvery eective at reducing number of cache misses, and reportexecution times on Sun SPARCstation 5, IBM RS/6000 andSGI Challenge. We also compare our optimization strategyto a representative method [10] from a class of approacheswhich consider only loop transformations to optimize local-ity and show that xing the memory layouts for all arrays|as in C and Fortran|limits performance that could other-wise have been obtained from the programs.A recent study shows that a group of highly parallelizedbenchmark programs spend 39% of their cycles stalled inmemory access [11]. In order to eliminate the memory bot-tleneck, spatial locality should be exploited. One way ofThis work is supported in part by NSF Young Investigator AwardCCR-9357840, NSF CCR-9509143. The work of J. Ramanujam is sup-ported in part by an NSF Young Investigator Award CCR-9457768.
achieving this is to transform the loop nest such that theinnermost loop exhibits unit-stride accesses for array ref-erences. While this approach produces satisfactory resultsfor several cases, we show in this paper that there is stillroom for signicant improvement, if the xed array layoutstrategy adopted by the conventional compilers is relaxed.In this paper we make the following contributions: We present a new algorithm to optimize the localitycharacteristics of nested loops. The algorithm applies bothdata and control transformations. We show that the known approaches considering onlycontrol transformations (e.g. loop permutations, tiling, etc.)are insucient for many cases. We demonstrate the eectiveness of our approach byboth simulation results and execution-time measurements.Since our approach increases the spatial locality and thepercentage of conict misses and reduces the percentageof capacity misses; it is generally more eective with largeblock (cache line) sizes and set-associative caches.This paper is organized as follows. Section 2 reviews ba-sic loop transformation theory. Section 3 presents relatedwork. Section 4 discusses the algorithm for optimizing lo-cality in a single loop nest. Section 5 extends this algorithmto multiple loop nests. Section 6 presents experimental re-sults which illustrate the ecacy of our approach. Section7 presents a discussion of our work on false sharing. Section8 presents summary and concludes the paper.2 PreliminariesThe memory layout for an h-dimensional array can be inone of the h! forms, each of which corresponding to layout ofdata on memory linearly by a nested traversal of the axes insome predetermined order. The innermost axis is called thefastest changing dimension. As an example for row-majormemory layout the second dimension is the fastest chang-ing dimension. We focus on loop nests where both arraysubscripts and loop bounds are ane functions of enclosingloop indices. A reference to an array X is represented byX(L~I +~b) where L is a linear transformation matrix calledarray reference matrix, ~b is oset vector and ~I is a columnvector representing the loop indices i1, i2,...,in starting fromthe outermost loop.Linear mappings between iteration spaces of loop nestscan be modeled by non-singular transformation matrices[10]. If ~I is the original iteration vector, after applying lin-ear transformation T , the new iteration vector is ~J = T ~I.Similarly if ~d is the distance/direction vector, on applyingT , T ~d is the new distance/direction vector. A transforma-tion is legal if and only if T ~d is lexicographically positive forevery ~d [15]. On the other hand, since L~I = LT 1 ~J , LT 1is the new array reference matrix after the transformation.We denote T 1 by Q. An important characteristic of our
algorithm is that using the array reference matrices, the en-tries of Q = [qij ] are derived systematically. For the rest ofthe paper, the reference matrix for array X will be denotedby LX whereas the ith row of the reference matrix for arrayX will be denoted by ~̀iX .3 Related Work3.1 Fixed Layout ApproachLoop transformations have been used for optimizing cachelocality in several papers [10, 14, 7]. Results have shownthat on several architectures the speedups achieved by looptransformations alone can be signicant.Li [10] describes a data reuse model and a compiler algo-rithm called height reduction to improve cache locality. Heintroduces the concept of a data reuse vector and denes itsheight as the number of dimensions from the rst non-zeroentry to the last entry. The non-zero entries of a reuse vectorindicate that there are reuses carried by the correspondingloops. The individual reuse vectors constitute reuse matriceswhich in turn constitute the global reuse matrix. The algo-rithm assigns priorities to reuse vectors depending on thenumber of times they occur, and tries to reduce the heightof the global reuse matrix starting from the reuse vector ofhighest priority. Apart from reducing the execution time,the height reduction algorithm serves two purposes: it reduces the sensitivity of tiling to the tile size; and it places the loops carrying reuse into innermost posi-tions; thus, when the outermost loops are parallelized, thechances of false sharing will be low.In comparison, our algorithm (Sections 4 and 5) triesto exploit the spatial locality by also considering dierentmemory layouts for dierent arrays. Since Li's approach isrepresentative of a class of algorithms that use only controltransformations to exploit locality [14, 7, 10], for the restof the paper we use Li's algorithm (denoted W-Opt) andcompare it with our algorithm.3.2 Data and Loop TransformationsFor programs that are not conducive to loop transforma-tions, data transformations should also be taken into ac-count. Only a few works have considered data and looptransformations together to optimize locality. Ju and Dietz[6] present a systematic approach that integrates data lay-out optimizations and loop transformations to reduce cachecoherence overhead. Anderson et al. [1] oer a simple algo-rithm to transform data layout to make the region accessedby each processor contiguous.Cierniak and Li [3] present a unied approach to opti-mize locality that employs both data and control transfor-mations. The notion of a stride vector is introduced and anoptimization strategy is developed for obtaining the desiredmapping vectors and transformation matrix. At the end,the following equality is obtained:T T v = ATmIn this formulation only A, data reference matrix, is known.The algorithm tries to nd T , the transformation matrix;m, a mapping vector which can assume h! dierent formsfor an h-dimensional array; and v, the desired stride vec-tor. Since this optimization problem is dicult to solve,the following heuristic is used: First it is assumed that thetransformation matrix contains only values 0 and 1. Second,
the value of the stride vector v is assumed to be known be-forehand. Then the algorithm constructs the matrix T rowby row by considering all possible legal mappings. Whencompared to our strategy (Sections 4 and 5), we argue thatour approach is more accurate, as it does not restrict thesearch space of possible loop transformations. Also our ap-proach is simpler to embed in a compilation system, since itdoes not require a priori knowledge of any vector such as v;and more importantly it does not depend on any new reuseabstraction. The approach presented by Cierniak and Li [3]is a heuristic whereas our approach for single nest is exactsolution which nds all possible optimized transformationsand memory layouts. This last point is important as will bedemonstrated in Section 6.4 Algorithm for Optimizing LocalitySince accessing data from memory is usually an order ofmagnitude slower than accessing data in cache, optimizingcompilers must reduce the number of memory accesses. Wepresent an algorithm which automatically transforms a givenloop nest to exploit spatial locality and assigns appropriatememory layouts for arrays, in a unied framework.The algorithm is shown in Figure 1. In the algorithm,C is the array reference on the LHS whereas A representsan array reference from the RHS. The symbol  denotesthe don't care condition. Let i1, i2,...,in be the loop indicesof the original nest and j1, j2,...,jn be the loop indices ofthe transformed nest, starting from outermost loop. Thefollowing is a brief explanation of our algorithm: Our transformation matrix should be such that theLHS array of the transformed loop has the innermost indexas the only element in one of the array dimensions and thatindex should not appear in any other dimension for thisarray. In other words, after the transformation, the LHSarray C should be of the form C(; ; :::; jn; :::; ; ) wherejn (the new innermost loop index) is in the rth dimensionand  indicates a term independent of jn. This means thatthe rth row of the transformed reference matrix for C is(0; 0; :::; 0; 1) and all entries of the last column, except theone in rth row, are zero. After that, the LHS array canbe stored in memory such that the rth dimension will bethe fastest changing dimension. This approach exploits thespatial locality for this reference. Notice that all possiblevalues for r should be considered. Then the algorithm works on one reference from theRHS at a time. If a row s in the data reference matrix isidentical to rth row of the original reference matrix of theLHS array, then the algorithm attempts to store this RHSarray in memory such that the sth dimension will be thefastest changing dimension. We note, however, that havingsuch a row s does not guarantee that the array will be storedon memory such that the sth dimension will be the fastestchanging dimension. If the condition above does not hold for a RHS array A,that means this array cannot be stored in memory such thatthe new innermost loop index appears only in the fastestchanging dimension. In that case the algorithm tries totransform the reference to A(; ; :::;F(jn 1); :::; ; ), whereF(jn 1) is an ane function of jn 1 and other indices ex-cept jn, and  indicates a term independent of both jn 1 andjn. This helps to exploit the spatial locality at the secondinnermost loop. If no such transformation is possible, thejn 2 is tried and so on. If all loop indices are tried unsuc-cessfully, then the remaining entries of Q are set arbitrarily,observing the data dependences and non-singularity.
Step 1 Initialize i = 1.Step 2 Set ~̀iC :Q = (0; 0; :::; 0; 1) and ~̀kC :Q = (;; :::;; 0) for each k 6= i.Step 3 Set memory layout for C such that ith index position will be the fastest changing dimension.Step 4 For each array reference A on the RHS that has ~̀lA = ~̀iC for some l, try to set memory layout for A such that the lth dimension will bethe fastest changing dimension.Step 5 Choose an array reference A for which the equality in Step 4 does not hold. Initialize j = 1.Step 6 Set ~̀jA:Q = (0; 0; :::; 1; 0) and ~̀kA:Q = (;; :::;; 0; 0) for each k 6= j. If this step is consistent with the previous steps go to Step 7,otherwise increment j and go to the beginning of this step. If there exist inconsistencies for all j values, then initialize j = 1, and set~̀jA:Q = (0; 0; :::; 1; 0; 0) and ~̀kA:Q = (;; :::;; 0; 0; 0) for each k 6= j, and repeat Step 6 and so on. If no T 1 is found then ll theremaining entries arbitrarily observing the dependences and non-singularity.Step 7 Repeat Step 6 for all reference matrices of a particular A (Of course, all reference matrices for a particular A should have the samememory layout).Step 8 Repeat Step 6 for all distinct array references.Step 9 Record the obtained transformation matrix. Also record, for each array, the loop index position which appears in the fastest changingposition for that array.Step 10 Increment i and go to Step 2 (try a dierent memory layout for the LHS array C).Step 11 Compare all the recorded transformation matrices and their associated layouts, and choose the best alternative.Figure 1: Algorithm for optimizing locality. After a transformation and corresponding memory lay-outs are found, they are recorded and the next alternativememory layout for the LHS is tried and so on. Among allfeasible solutions, the one which exploits most spatial local-ity in the innermost loop is chosen.5 Global Locality Optimization: Multiple Loop Nests5.1 General ProblemLet fN1;N2; :::;Nmg denote dierent loop nests in the pro-gram; and fA1;A2; :::;Akg denote dierent arrays. In gen-eral each nest can access a subset of the arrays. We assumethat our algorithm described before is run for each nest, anda number of possible optimized layout combinations are ob-tained for each nest. In [8], the authors show that problemof nding a global array layout combination that satises allthe nests is NP-complete even for the restricted case whereonly row-major and column-major arrays are considered.We present a heuristic for the global layout optimizationproblem.5.2 Locality Optimization under Layout ConstraintsDuring the compilation of a program it may be possible thatthe compiler, due to data dependences or some other con-straints, is not able to apply loop transformations or modifymemory layouts of some arrays. Each unmodiable informa-tion constitutes a constraint for the compiler. An importantcase is the explicitly parallel programs where loop transfor-mations are generally not possible since the programmer hasalready decided the parallelization [3].Any transformation matrix must have a full rank andshould not violate any data dependences. In the algorithm,after a candidate Q is built, it is checked against data de-pendences, and discarded if it violates any dependences orits rank is not full.We now focus on the problem of optimizing locality whensome or all array layouts are xed. We note that each xedlayout requires that the innermost loop index should be inthe appropriate array index position (dimension), depend-ing on layout of the array. For example, suppose that thememory layout for a h-dimensional array is such that the
dimension k1 is the fastest changing dimension, the dimen-sion k2 is the second fastest changing dimension, k3 is thethird etc. The algorithm should rst try to place the newinnermost loop index jn only to the k1th dimension of thisarray. If this is not possible, then it should try to placejn only to the k2th dimension and so on. If all dimensionsup to and including kh are tried unsuccessfully, then jn 1should be tried for the k1th dimension and so on. In the nextsubsection we show that this constrained layout algorithmis important for global locality optimization.5.3 Global Optimization AlgorithmIn this subsection we show how our algorithm can be ex-tended to work on multiple nests. Since a number of arrayscan be accessed by a number of nests and each of thesenests may require a dierent layout for a specic array, thealgorithm should nd a memory layout for that array thatsatises the majority of the nests.In the following we present a sketch of a simple heuris-tic. Our approach is based on the concept of most costlynest. Intuitively, this is the nest which takes the most time.Dierent methods can be adopted to choose this nest. Forexample the programmer can use compiler directives to givehints about this nest. We can also use a metric such as mul-tiplication of the number of loops and the number of arraysreferenced in the nest. The nest which has the largest result-ing value can be marked as the most costly nest. Then thealgorithm proceeds as follows: First, the most costly nestis optimized by using the algorithm presented in Figure 1.After this step, memory layouts for some of the arrays willbe determined. Then each of the remaining nests can beoptimized using the approach presented for the constrainedlayout case in the previous subsection. After each nest is op-timized, new layout constraints will be obtained, and thesewill be propagated for optimizing the remaining nests.6 Examples, Simulation Results and Experimental ResultsThis section presents several examples to illustrate the al-gorithm. Our experimental suit comprises of some kernelsand several representative nests extracted from NAS Bench-marks [2].
DO i = 1, n DO u = 1, n DO u = 1, n DO u = 1, nDO j = 1, n DO v = 1, n DO v = 1, n DO v = 1, nDO k = 1, n DO w =1 ,n DO w =1 ,n DO w =1 ,nDO l = 1, n DO y = 1, n DO y = 1, n DO y = 1, nA[i,j]+=B[k,i]+C[l,k] A[w,y]+=B[v,w]+C[u,v] A[y,u]+=B[v,y]+C[w,v] A[u,y]+=B[w,u]+C[v,w]ENDDO l ENDDO y ENDDO y ENDDO yENDDO k ENDDO w ENDDO w ENDDO wENDDO j ENDDO v ENDDO v ENDDO vENDDO i ENDDO u ENDDO u ENDDO u(A) (B) (C) (D)DO j = 1, n2 DO jj = 0, n2-1, nc DO k = f1(c), f2(c) DO k = 1, nzDO i = 1, n1 DO ii = 0, n1-1, nc DO i = f3(c), f4(c) DO j =1, nyY[j,i]=X[i,j] DO j = 1, nc DO j = f5(c), f6(c) buf[1,(k-1)*ny+j]=g[1,nx-1,j,k]ENDDO i DO i = 1, nc ..... .....ENDDO j Z[j,i]=X[i+ii,j+jj] cv[j]=vs[i,j,k,c] buf[5,(k-1)*ny+j]=g[5,nx-1,j,k](E) ENDDO i ..... buf[1,(k-1)*ny+j+ny*nz]=g[1,nx,j,k]ENDDO j ENDDO j .....DO i = 1, nc DO j = f7(c), f8(c) buf[5,(k-1)*ny+j+ny*nz]=g[5,nx,j,k]DO j = 1,nc lhs[i,j,k,1,c]=0.0 ENDDO jY[j+jj,i+ii]=Z[j,i] lhs[i,j,k,2,c]=g1(cv[j-1],rhoq[j-1]) ENDDO kENDDO j lhs[i,j,k,3,c]=g2(rhoq[j]) (H)ENDDO i lhs[i,j,k,4,c]=g1(cv[j+1],rhoq[j+1])ENDDO ii lhs[i,j,k,5,c]=0.0ENDDO jj ENDDO j(F) ENDDO iENDDO k(G)DO i3 = 2, n3-1 DO k = 1, nDO i2 = 2, n2-1 DO j = 1, n-1buff len=buff len+1 u[k,j]=0.0buff[buff len,buff id]=u[2,i2,i3] l[j,k]=0.0ENDDO i2 ENDDO jENDDO i3 l[k,k]=1.0..... DO j = k, nDO i3 = 2, n3-1 u[k,j]=a[k,j]DO i2 = 2, n2-1 DO p = 1, k-1buff len=buff len+1 u[k,j]-=l[k,p]*u[p,j]buff[buff len,buff id]=u[n1-1,i2,i3] ENDDO pENDDO i2 ENDDO jENDDO i3 IF (k  n-1) THEN..... DO i = k+1, nDO i2 = 1, n2 l[i,k]=a[i,k]DO i1 = 1, n1 DO p = 1, k-1buff len=buff len+1 l[i,k]-=l[i,p]*u[p,k]buff[buff len,buff id]=u[i1,i2,2] ENDDO pENDDO i2 l[i,k]=l[i,k]/u[k,k]ENDDO i3 ENDDO i(I) ENDIFENDDO k(J)Figure 2: (A) An example four-deep loop nest. (B) Optimized loop nest assuming xed row-major arrays. (C) Optimized loop nest. (D)Optimized loop nest. (E) An example from the FT benchmark. (F) An example from the FT benchmark. (G) An example from the SPbenchmark. (H) An example from the LU benchmark. (I) An example from the MG benchmark. (J) LU decomposition kernel.We demonstrate the simulation results obtained by usingan enhanced version of DineroIII [5], a trace-driven unipro-cessor cache simulator. We simulate the miss rates overa range of cache sizes (4K, 8K, 16K, 32K, 64K, 128K),block (cache line) sizes (8, 16, 32, 64, 128, 256) and set-associativities (direct-mapped, 2-way, 4-way, full-associative).Also presented are empirical results obtained on SPARC-station 5, RS/6000 and SGI Challenge. SPARCstation 5has a 16K direct-mapped data cache and a 32 MB memory.RS/6000 Model 590 has 256 KB data cache. SGI Challengehas a logically and physically shared memory system. It usessnoopy write-invalidate cache coherence. Each node has 1MB data cache attached to it. In SGI, during the multi-processor experiments static scheduling has been employed.Due to space concerns, we do not show the steps or parts ofsteps which lead to unsuccessful trials; and we only presenta subset of our simulation and experimental results.6.1 Example: A four-deep loop nestFigure 2:A shows a four-deep loop nest which can ben-et from the layout exibility. The array reference ma-trices for this nest are as follows. LA =   1 0 0 00 1 0 0 ,LB =   0 0 1 01 0 0 0  and LC =   0 0 0 10 0 1 0 .The algorithm works as follows:LA:Q =   0 0 0 1   0 . Therefore, q11 = q12 = q13 =q24 = 0 and q14 = 1.LB :Q =      00 0 0 1 . Therefore, q34 = 0.
LC :Q =     1 0  0 0 . Therefore, q33 = q44 = 0 andq43 = 1.At this point T 1 = Q =  0 0 0 1q21 q22 q23 0q31 q32 0 0q41 q42 1 0 . We setthe unknowns to the following values: q22 = q23 = q31 =q41 = q42 = 0 and q21 = q32 = 1 and obtain T 1 = Q = 0 0 0 11 0 0 00 1 0 00 0 1 0 . ArraysA and C are column-major whereasthe array B is row-major; and the resulting code is shownin Figure 2:C.Next the compiler tries the other alternative layout (row-major) for A.LA:Q =      00 0 0 1 . Therefore, q14 = q21 = q22 =q23 = 0 and q24 = 1.LB :Q =     1 0  0 0 . Therefore, q13 = q34 = 0 andq33 = 1.LC :Q =     0 0  1 0 . Therefore, q43 = q44 = 0.By setting q12 = q31 = q32 = q41 = 0 and q11 = q42 = 1,T 1 = Q =  1 0 0 00 0 0 10 0 1 00 1 0 0 . Arrays A and C are row-majorwhereas the array B is column-major; and the resulting codeis shown in Figure 2:D.6.2 Example: A constrained-layout nestWe revisit the example shown in Figure 2:A, this time as-suming xed row-major memory layouts for all arrays.
16 32 64128 16 32 64128 16 32 64128 16 32 64128
Block Size
0.00
0.10
0.20
0.30
0.40
M
is
s 
R
at
io
(A)
Direct Mapped, n=500
Unopt
W-Opt
Opt
4K Cache 8K Cache 16K Cache 32K Cache
16 32 64128 16 32 64128 16 32 64128 16 32 64128
Block Size
0.00
0.10
0.20
0.30
0.40
M
is
s 
R
at
io
(B)
Associativity=2, n=500
Unopt
W-Opt
Opt
4K Cache 8K Cache 16K Cache 32K Cache
Figure 3: (A) Simulation results of a four-deep nest for dierent block and cache sizes on a direct-mapped cache. (B) Simulation results of afour-deep nest for dierent block and cache sizes on a set-associative cache.
0.00
0.10
0.20
0.30
0.40
M
is
s 
R
at
io
Cache Size=8K, n=750, Block Size=32
Unopt W-Opt Opt
16
3264
128
256
16
3264128256
163264128
256
0.00
0.10
0.20
0.30
0.40
M
is
s 
R
at
io
Cache Size=8K, n=750, Block Size=64
Unopt W-Opt Opt
16
3264
128
256
1632
64128256
16 3264128
256
0.00
0.10
0.20
0.30
0.40
M
is
s 
R
at
io
Cache Size=8K, n=750, Block Size=128
Unopt W-Opt Opt
16
3264
128
256
16 3264128256
163264128256Figure 4: Tile size sensitivity for a four-deep nest.LA:Q =      00 0 0 1 . Therefore, q14 = q21 = q22 =q23 = 0 and q24 = 1.LB :Q =     0 0  1 0 . Therefore, q33 = q34 = 0 andq13 = 1.LC :Q =    0 0 0 1 0 0 . Therefore, q42 = q43 = q44 = 0and q32 = 1.We set the unknowns to the following values q11 = q12 =q31 = 0 and q41 = 1 and obtain T 1 = Q =  0 0 1 00 0 0 10 1 0 01 0 0 0 .The resulting code is shown Figure 2:B. Notice that thisis the nest that would be obtained for row-major memorylayouts, had we used the W-Opt.Figure 3:A demonstrates miss ratios for this nest with500  500 double arrays on a direct-mapped cache. We ranexperiments with three dierent versions: unoptimized ver-sion (Figure 2:A, Unopt), optimized version by xing row-major layouts for all arrays (Figure 2:B, W-Opt) and oneof the versions obtained by our approach (Figure 2:C, Opt).The results shown indicate that except for the 4K cache, ourapproach outperforms the W-Opt (Figure 2:B). In order tofurther understand the sources of the misses in the opti-mized programs we breakdown the misses into compulsory,capacity and conict misses. The results indicate that themajority of the misses in the optimized program are due toconicts which can, in principle, be eliminated by increas-ing the set-associativity. Figure 3:B shows the miss ratiosfor this example with 500500 double arrays on a 2-way setassociative cache. As expected, except for the 4K cache, ourapproach eliminates almost all misses; whereas the W-Optdoes not improve the performance at all for some cases.Tiling (also known as blocking) is a technique to improve
the locality, and is a combination of strip-mining and looppermutation [14, 15]. Due to interference misses it is di-cult to select a suitable tile size. In other words, unless thetile size is tailored according to the matrix size and cacheparameters, the performance of tiling may be rather poor[4, 9].Figure 4 illustrates the insensitivity of the optimized tiledversions to the tile size. The numbers above the bars denotethe tile sizes. Notice that while the miss ratio of the unopti-mized tiled version is very unstable, those of the optimizedversions (Figure 2:B and Figure 2:C) are stable. Notice alsothat our version outperforms the W-Opt for all tile, block(cache line) and cache sizes.Figures 6:A and B present execution times for this ex-ample with dierent input sizes on SPARCstation 5 and asingle node of SGI respectively. Opt-1 and Opt-2 denote theoptimized versions obtained by our algorithm (Figures 2:Cand D). Figures 6:C and D, on the other hand, show theexecution times on multiple nodes of SGI Challenge with150  150 and 200  200 double arrays respectively. It canbe seen that although the approach based on loop transfor-mations alone can improve the performance, our approachgives the best results on both uniprocessor and multipro-cessors. In SPARC, for example, with 250  250 doublematrices, our approach (Opt-2) runs in almost 800 secondsless than the W-Opt. On four nodes of the SGI Challenge,with 200  200 double arrays, our version (Opt-2) saves 36more seconds than W-Opt. This example clearly shows thatrelaxing the memory layouts can save substantial amountsof time for some nests.
DO i = 1, n DO j = 1, nDO j = 1, n DO i = 1, nA[i,j]=B[j,i]*C[i,j]+D[i,j]*LOG(E[j,i]) A[i,j]=B[j,i]*C[i,j]+D[i,j]*LOG(E[j,i])ENDDO j ENDDO iENDDO i ENDDO jDO i = 1, n DO i = 1, nDO j = 1, n DO j = 1, nB[i,j]=A[j,i]+E[i,j] B[i,j]=A[j,i]+E[i,j]ENDDO j ENDDO jENDDO i ENDDO i(A) (B)DO i = 1, n DO j = 1, n DO u = 1, nDO j = 1, n DO i = 1, n DO v = 1, nDO k = 1, n DO k = 1, n DO w = 1, nC[i,j]+=A[i,k]*B[k,j] C[i,j]+=A[i,k]*B[k,j] C[u,w]+=A[u,v]*B[v,w]ENDDO k ENDDO k ENDDO wENDDO j ENDDO i ENDDO vENDDO i ENDDO j ENDDO uDO i = 1, n DO j = 1, n DO u = 1, nDO j = 1, n DO i = 1, n DO v = 1, nDO k = 1, n DO k = 1, n DO w = 1, nF[i,j]+=E[i,k]*C[k,j] F[i,j]+=E[i,k]*C[k,j] F[u,w]+=E[u,v]*C[v,w]ENDDO k ENDDO k ENDDO wENDDO j ENDDO i ENDDO vENDDO i ENDDO j ENDDO u(C) (D) (E)Figure 5: (A) A simple benchmark. (B) Optimized version of (A). (C) MxMxM program. (D) A transformed version of (C). (E) Transformedversion of (C) by our approach.
50 100 150 200 250
problem size (n)
0.0
1500.0
3000.0
4500.0
6000.0
7500.0
9000.0
10500.0
12000.0
e
xe
cu
tio
n
 t
im
e
 (
se
c)
(A)
Unopt
W-Opt
Opt-1
Opt-2
50 100 150 200 250
problem size (n)
0.0
500.0
1000.0
1500.0
e
xe
cu
tio
n
 t
im
e
 (
se
c)
(B)
Unopt
W-Opt
Opt-1
Opt-2
1 2 4 8
number of processors (p)
0.0
100.0
200.0
300.0
e
xe
cu
tio
n
 t
im
e
 (
se
c)
(C)
Unopt
W-Opt
Opt-1
Opt-2
1 2 4 8
number of processors (p)
0.0
200.0
400.0
600.0
800.0
e
xe
cu
tio
n
 t
im
e
 (
se
c)
(D)
Unopt
W-Opt
Opt-1
Opt-2
Figure 6: Execution times (A) on SPARCstation 5. (B) on a single node of SGI. (C) on multiple nodes of SGI with 150  150 double arrays.(D) on multiple nodes of SGI with 200 200 double arrays.6.3 Example nests from NAS BenchmarksThe NAS Parallel Benchmarks [2] are a set of programs de-signed to help evaluate the performance of parallel super-computers. To utilize the cache eectively, the benchmarksgenerally access data with unit stride. Default layout forthe nests is column-major. It should be stressed that theexamples considered here are only representative nests. FT benchmark uses simple-transpose and complicated-transpose nests shown in Figure 2:E and F respectively. InFigure 2:E, spatial locality for the array Y is poor; our al-gorithm attaches row-major layout for Y and column-majorlayout for X, retaining the original loop order. In Figure 7:Athe leftmost group of bars show the performance improve-ment obtained by our approach for dierent block sizes ona 16K direct-mapped cache. Notice that the eectivenessof the approach increases with larger block sizes. In Fig-ure 2:F, on the other hand, the reference Z[j; i] in the rstloop has poor locality. Our locality optimization algorithmattaches row-major layouts for Z and Y , and column-majorlayout for X; and interchanges the loops in the second nestplacing the i-loop into innermost position. The middle andrightmost bar-charts in Figure 7:A show the improvementobtained by our approach for nc = 64 and nc = 150, respec-tively. Since when nc = 64, the data used by the innermostloop ts in the cache anyway, our algorithm does not addmuch. An example nest from the SP benchmark is given inFigure 2:G. In order to apply our algorithm, we rst dis-
tribute the second j-loop over the individual statements.Then our approach attaches layouts for the arrays vs andlhs such that the second dimension in both arrays will be thefastest changing dimension exploiting the spatial locality inthe innermost loops. Figure 7:B demonstrates the reductionin cache misses. Figure 2:H presents an example nest from the LUbenchmark. After distributing the j-loop, our algorithmoers two alternatives: retain the original loop order andmake the third dimension of the array g the fastest chang-ing dimension; or apply the loop interchange and make thefourth dimension of the array g the fastest changing dimen-sion. The performance improvement is similar for both thealternatives. Figure 7:C shows the performance improve-ment. With a block (cache line) size of 128, more than halfof the misses are eliminated. Three typical loop nests from the MG benchmark areshown in Figure 2:I. In the rst nest, since i2 is the in-nermost loop, our global locality algorithm makes seconddimension of u the fastest changing dimension. This choiceis appropriate for the second nest as well; and for the thirdnest our algorithm interchanges the loops i2 and i1. Theperformance improvement illustrated in Figure 7:D is sub-stantial.We ran experiments on RS/6000 and SPARCstation 5.Due to space concerns, we only present the execution timesfor simple-transpose nest, in Figures 8:A and B for RS/6000and SPARCstation 5 respectively. When n1 = n2 = n =
16 32 64 128 16 32 64 128 16 32 64 128
Block Size
0.00
0.20
0.40
0.60
0.80
M
is
s 
R
at
io
(A)
Cache Size=16K, n=500
Unopt
Opt
nc=64 nc=150
16 32 64 128
Block Size
0.00
0.10
0.20
0.30
0.40
M
is
s 
R
at
io
(B)
Cache Size=16K, n=60
Unopt
Opt
16 32 64 128
Block Size
0.00
0.20
0.40
0.60
0.80
M
is
s 
R
at
io
(C)
Cache Size=16K, n=60
Unopt
Opt
16 32 64 128
Block Size
0.00
0.20
0.40
0.60
0.80
M
is
s 
R
at
io
(D)
Cache Size=16K, n=250
Unopt
Opt
Figure 7: (A) Miss ratios for two example nests from FT. (B) Miss ratios for an example nest from the SP. (C) Miss ratios for an example nestfrom the LU. (D) Miss ratios for an example nest from the MG.
1000 2000 3000 4000
problem size (n)
0.0
10.0
20.0
30.0
40.0
50.0
60.0
ex
ec
ut
io
n 
tim
e 
(s
ec
)
(A)
Unopt
Opt
1000 2000 3000 4000
problem size (n)
0.0
20.0
40.0
60.0
80.0
100.0
ex
ec
ut
io
n 
tim
e 
(s
ec
)
(B)
Unopt
Opt
16 32 64 128
Block Size
0.00
0.05
0.10
0.15
0.20
M
is
s 
R
at
io
(C)
Direct-mapped, Cache Size=8K, n=400
Unopt
Opt
16 32 64 128
Block Size
0.00
0.05
0.10
0.15
0.20
M
is
s 
R
at
io
(D)
Associativity=2,Cache Size=8K, n=400
Unopt
Opt
Figure 8: (A) Execution times for simple-transpose on RS/6000. (B) Execution times for for simple-transpose on SPARCstation 5. (C) Missratios for the LU decomposition. (D) Miss ratios for the LU decomposition.4000, on RS/6000, there is 45% performance improvement.6.4 LU DecompositionFigures 2:J shows an LU decomposition algorithm. Ourglobal locality algorithm identies the nests containing theinnermost p-loops as the most costly nests, and attachesrow-major layout for the array l and column-major layoutfor the array u. Figures 8:C and D show the miss rates ofthe unoptimized and optimized nests for an 8K data cachewith direct-mapping and a set-associativity of size 2 respec-tively. As can be seen, our algorithm reduces the originalmiss rates by 7% to 40%.6.5 Other examplesFigure 9:A shows the normalized miss rates for the dgemmroutine from BLAS. This routine performs the followingoperation: C = f(A)f(B) + C, where f(X) = X orXT , and  and  are scalars. Both the unoptimized andthe optimized versions have been called four times, each ofwhich with dierent operation, and the average miss rateshave been computed. Below each pair of bars is given thetriple cache size, block size, associativity. In the simulation500  500 double precision matrices are used.Figure 9:B demonstrates the performance improvementon dtrsl, a routine from LINPACK which solves the systemsof the form Tx = b or T Tx = b where T is a triangularmatrix of order n. While for optimizing the dgemm bothdata and loop transformation are used, for dtrsl only datatransformations are used.Finally, we show the impact of our algorithm on twoprograms from [3]. The program shown in Figure 5:A isa simple benchmark. Figure 9:C shows the improvementobtained by our approach. For each cache size, the threebars from left to right correspond to unoptimized versionwith column-major layouts, unoptimized version with row-major layouts and version optimized by our approach. In
the optimized version, the loops in the rst nest are inter-changed; and the following layouts are assigned: A, C, andD are column-major; B and E are row-major. With theseoptimizations, the spatial locality for every reference is ex-ploited in the innermost loop and the optimized program isgiven in Figure 5:B. 400 400 double matrices are used.Figure 5:C shows a program named MxMxM from [3].This program computes the product of three square matri-ces. The version obtained by using the method in [3] ispresented in Figure 5:D. The layouts for A and E are row-major, whereas those for the other arrays are column-major.For the four of the total eight references, the spatial localityis exploited in the innermost loop; and for the remainingfour references it is exploited in the second loop. In com-parison, our global approach transform this program to theone shown in Figure 5:E. All arrays are row-major (a ver-sion with all column-major arrays is also possible). For sixof the total eight references, the spatial locality is exploitedin the innermost loop; for the two references it is exploitedin the second loop. The normalized miss rates for an 8 KBdirect-mapped cache is are shown in Figure 9:D. These re-sults reveal that our approach is better in the sense that itnds all possible transformations and layouts; and selectsthe most optimal one.7 Impact on False SharingIn shared-memory multiprocessors when processors makereferences to dierent data items within the same cache line,false sharing occurs [12]. Since cache coherence is main-tained on a cache block (line) basis, when one processormodies a data item, it causes an invalidation in the otherprocessors' cache. It is well known that one of the maincauses of the false sharing is the parallelization of a loopthat carries spatial reuse [10, 15]. On the other hand, thelarger the granularity of parallelism the better it is; becausethe synchronization overhead will diminish with the increas-
16K,32,1 32K,64,1 64K,32,2 128K,64,4
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
M
is
s 
R
at
io
(A)
dgemm from BLAS
Unopt
Opt
8K,32 8K,64 32K,32 32K,64
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
M
is
s 
R
at
io
(B)
dtrsl from LINPACK
Unopt
Opt
8K 16K 32K 64K 128K
Cache Size
0.1
0.2
0.3
0.4
0.5
0.0
0.6
0.7
0.8
0.9
1.0
M
is
s 
R
at
io
(C)
simple benchmark, Block Size=32
Unopt 
Unopt 
Opt
32 64 128 256
Block Size
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
M
is
s 
R
at
io
(D)
Cache Size = 8K
W-Opt
Opt
Figure 9: (A) Normalized miss rates for dgemm. (B) Normalized miss rates for dtrsl. (C) Normalized miss rates for a simple benchmark. (D)Normalized miss rates for the MxMxM.ing parallelism granularity. A recent study shows that apartfrom aecting the synchronization cost, the granularity ofapplication parallelism is also an important determinant ofapplications' memory behavior [13].To summarize, in order to optimize the locality for par-allel machines, the maximum granularity of parallelism isobtained and the outermost parallelized loops should notcarry any spatial reuse. Since our algorithm tries to achievethis goal by both changing the loop orders and memory lay-outs, we believe that it will be very eective at eliminatingfalse sharing on multiprocessors.Let us consider the loop shown in Figure 2:A, this timeassuming column-major layouts. Notice that applying theapproaches like in [7] and [10] for this nest result in thesame nest; that is this loop order is the most desirable onefor the xed column-major layouts. If the outermost loop(i) is parallelized then the reference a[i; j] will cause falsesharing. In comparison, for both of our optimized versions(Figure 2:C and Figure 2:D), the outermost loop (u) cannow safely parallelized, without an apparent danger of falsesharing.8 Conclusions and Future WorkWe designed a compiler algorithm which transforms the loopnests and changes the memory layouts of arrays in a uniedframework. Our algorithm can either be employed alone, orcan be combined with low-level locality optimizations suchas tiling [9].We have shown that our approach is more eective in re-ducing sensitivity of tiling to the tile size and at eliminatingfalse sharing than the approaches based on loop transfor-mations alone. In fact, when the memory layouts are xed,our approach produces the same results as other approachessuch as those of Li [10] and Wolf and Lam [14], if temporallocality is not considered. Both our simulation results andempirical results provide encouraging evidence that our ap-proach is likely to be successful on both uniprocessors andmultiprocessors. Work is in progress on evaluating the per-formance of our approach on full NAS benchmarks [2] onmulticomputers. We also intend to apply our technique tothe other levels of memory hierarchy.References[1] J. M. Anderson, S. P. Amarasinghe, and M. S. Lam. Dataand computation transformations for multiprocessors. InProc. Fifth ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming, July 1995.[2] D. Bailey, T. Harris, W. Saphir, R. van der Winjngaart, A.Woo, and M. Yarrow. The NAS Parallel Benchmarks 2.0.,
Technical Report NAS-95-020, NASA Ames Research Cen-ter, CA, December 1995.[3] M. Cierniak and W. Li. Unifying Data and ControlTransformations for Distributed Shared Memory Machines.Proc. SIGPLAN '95 Conference on Programming LanguageDesign and Implementation, La Jolla, California, June 1995.[4] S. Coleman and K.S. McKinley. Tile Size Selection UsingCache Organization and Data Layout, In Proc. SIGPLAN'95 Conference on Programming Language Design and Im-plementation, La Jolla, CA, June 1995.[5] M. D. Hill and A. J. Smith. Evaluating Associativity in CPUCaches, IEEE Trans. on Computers, C-38, 12, December1989, pages 1612{1630.[6] Y.-J. Ju and H. Dietz. Reduction of Cache Coherence Over-head by Compiler Data Layout and Loop Transformations.Proc. 4th Workshop on Languages and Compilers for Par-allel Computing, Santa Clara, CA, August 1991.[7] K. McKinley, S. Carr, and C.W. Tseng. Improving DataLocality with Loop Transformations. ACM Transactions onProgramming Languages and Systems, 1996.[8] M. Kandemir, J. Ramanujam, and A. Choudhary. A Com-piler Algorithm for Optimizing Locality in Loop Nests. Tech-nical Report, Northwestern University, Evanston, IL, April1997.[9] M. S. Lam, E. Rothberg and M. E. Wolf. The CachePerformance and Optimizations of Blocked Algorithms. InProc. 4th International Conference on Architectural Supportfor Programming Languages and Operating Systems, April1991.[10] W. Li. Compiler Optimizations for Cache Locality and Co-herence. Technical Report 504, Dept. of Computer Science,University of Rochester, NY, April 1994.[11] C.-W. Tseng, J. Anderson, S. Amarasinghe, and M. Lam.Unied Compilation Techniques for Shared and DistributedAddress Space Machines. In Proc. 1995 International Con-ference on Supercomputing (ICS'95), Barcelona, Spain, July1995.[12] J. Torrellas, M. S. Lam, and J. L. Hennessy. False sharingand spatial locality in multiprocessor caches. IEEE Trans-actions on Computers, 43(6):651{663, June 1994.[13] E. Torrie, C-W. Tseng, M. Martonosi, and M. W. Hall. Eval-uating the impact of advanced memory systems on compiler-parallelized codes. In Proc. International Conference onParallel Architectures and Compilation Techniques (PACT),June 1995.[14] M. Wolf and M. Lam. A Data Locality Optimizing Algo-rithm. In Proc. ACM SIGPLAN 91 Conf. ProgrammingLanguage Design and Implementation, pages 30{44, June1991.[15] M. Wolfe. High Performance Compilers for Parallel Comput-ing. Addison-Wesley Publishing Company, CA, 1996.
