Performance Enhancement by Memory Reduction by Song, Yonghong et al.
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
2000 





Purdue University, li@cs.purdue.edu 
Report Number: 
00-016 
Song, Yonghong; Xu, Rong; Wang, Cheng; and Li, Zhiyuan, "Performance Enhancement by Memory 
Reduction" (2000). Department of Computer Science Technical Reports. Paper 1494. 
https://docs.lib.purdue.edu/cstech/1494 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 











Performance Enhancement by Memory Reduction *†
Yonghong Song Rong Xu Cheng Wang
Department of Computer Sciences
Purdue University




In this paper, we propose a technique to reduce the virtual
memory required to store program data. Specifically, we
present an optimal algorithm to combine loop shifting, loop
fusion and array contraction to reduce the temporary array
storage required to execute a collection of loops. Memory
reduction is formulated as a network flow problem, which is
solved by the proposed algorithm in polynomial time. When
applied to 20 benchmark programs on two platforms, our
technique reduces the memory requirement, counting both
the data and the code, by 50% on average. The transformed
programs gain a speedup of 1.57 on average, due to the
reduced working set and, consequently, the improved data
locality. In the best case, a maximum speedup of 41.3 is
achieved for one of the benchmark programs.
1. INTRODUCTION
Compiler techniques, such as tiling [29, 30], to exploit
temporal data locality within a single loop nest have been
studied extensively. However, how to effectively exploit temporal
locality between different loop nests remains unclear. This
state of the art makes it important to seek locality
enhancement techniques beyond tiling.
In this paper, we approach the locality issue by reducing the
virtual memory required to store program data. In particular,
we seek opportunities to contract the number of dimensions
of arrays. For example, a two-dimensional array may
be contracted to a single dimension, or a whole array may
be contracted to a scalar. A significant potential benefit,
among others, of such a reduction in data size is the
increased reuse of the cached data due to the reduced working
set.
*Technical report CSD-TR-00-016, Department of Computer
Sciences, Purdue University, November, 2000.
†This work is sponsored in part by National Science Foundation
through grants CCR-9975309 and MIP-9610379, by
Indiana 21st Century Fund, by Purdue Research Foundation,
and by a donation from Sun Microsystems, Inc.
This paper focuses on reducing the temporary array storage
required to execute a collection of loops. The opportunities
for such reduction exist often because the most natural
way to specify a computation task may not be the most
memory-efficient, and because the programs written in array
languages such as F90 and HPF are often memory inefficient.
Consider an extremely simple example (Example 1 in
Figure 1(a)), where array A is assumed dead after loop L2.
After right-shifting loop L2 by one iteration (Figure 1(b)), L1
and L2 can be fused (Figure 1(c)). Array A can then be
contracted to two scalars, a1 and a2, as Figure 1(d) shows. (As
a positive side-effect, temporal locality of array E is also
improved.) The aggressive fusion proposed here also improves
temporal data locality between different loop nests.
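The effect of shifting, fusion and contraction on Example 1 can be checked with a small sketch. It is written in Python rather than Fortran, and it assumes that the body of L2 is E(I) = A(I), which is what the shifted statement E(I-1) = A(I-1) in the figure suggests. The function names and the scalar bookkeeping are illustrative only and are not taken from the paper.

```python
def original(E, N):
    """Loops L1 and L2 run one after the other, using a full temporary array A."""
    E = list(E)
    A = [0] * (N + 1)              # A(1..N); index 0 is unused
    for i in range(1, N + 1):      # L1
        A[i] = E[i] + E[i - 1]
    for i in range(1, N + 1):      # L2 (assumed body: E(I) = A(I))
        E[i] = A[i]
    return E


def fused_contracted(E, N):
    """L2 shifted right by one iteration, fused with L1, A replaced by scalars."""
    E = list(E)
    a1 = 0                         # plays the role of A(I-1)
    for i in range(1, N + 2):
        if i == 1:
            a1 = E[1] + E[0]
        elif i == N + 1:
            E[N] = a1
        else:
            a2 = E[i] + E[i - 1]   # A(I), computed before E(I-1) is overwritten
            E[i - 1] = a1          # shifted L2 body: E(I-1) = A(I-1)
            a1 = a2
    return E


data = [3, 1, 4, 1, 5, 9]          # E(0..5), so N = 5
result = fused_contracted(data, 5)
```

Both versions produce the same final contents of E, while the second one keeps only two scalars alive instead of N array elements.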
For a collection of loops defined later in this paper, we
formulate the memory reduction problem as a network flow
problem, which is optimally solvable in polynomial time.
Additional loop transformations, such as loop interchange
and circular loop skewing [30], are used to create
opportunities for aggressive fusion.
We have implemented our memory reduction technique in
our research compiler. We apply our technique to 20 bench-
mark programs on two platforms in the experiments. On
average, the memory requirement for those benchmarks is
reduced by 50%, counting both the code and the data, us-
ing the arithmetic mean. The transformed programs have
an average speedup of 1.57 (using the geometric mean). A
speedup of 41.3 is achieved for one of the benchmarks.
In the rest of this paper, we will present some preliminaries
in Section 2. We formulate the network flow problem and
prove its complexity in Section 3. We present controlled
fusion and discuss enabling techniques in Section 4. Section




We consider a collection of loop nests, $L_1, L_2, \ldots, L_m$,
$m \ge 1$, in their lexical order, as shown in Figure 2(a). The
label $L_i$ denotes a perfect nest of loops with indices $L_{i,1}$,
$L_{i,2}, \ldots, L_{i,n}$, $n \ge 1$, starting from the outmost loop. (In
Example 1, i.e. Figure 1(a), we have m = 2 and n = 1.)
Loop $L_{i,j}$ has the lower bound $l_{i,j}$ and the upper bound
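(a) Original loops
L1: DO I = 1, N
      A(I) = E(I) + E(I - 1)
    END DO
L2: DO I = 1, N
      E(I) = A(I)
    END DO

(b) After right-shifting L2 by one iteration
L1: DO I = 1, N
      A(I) = E(I) + E(I - 1)
    END DO
L2: DO I = 2, N + 1
      E(I - 1) = A(I - 1)
    END DO

(c) After loop fusion
    DO I = 1, N + 1
      IF (I.EQ.1) THEN
        A(I) = E(I) + E(I - 1)
      ELSE IF (I.EQ.(N + 1)) THEN
        E(I - 1) = A(I - 1)
      ELSE
        A(I) = E(I) + E(I - 1)
        E(I - 1) = A(I - 1)
      END IF
    END DO

(d) After contracting array A to the scalars a1 and a2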
Figure 1: Example 1
(a) Original loop nests, for i = 1, ..., m:
    L_i: DO L_{i,1} = l_{i,1}, u_{i,1}
           DO L_{i,2} = l_{i,2}, u_{i,2}
             ...
               DO L_{i,n} = l_{i,n}, u_{i,n}
(b) After loop shifting, for i = 1, ..., m:
    L_i: DO L_{i,1} = l_{i,1} + p^1(L_i), u_{i,1} + p^1(L_i)
           DO L_{i,2} = l_{i,2} + p^2(L_i), u_{i,2} + p^2(L_i)
             ...
               DO L_{i,n} = l_{i,n} + p^n(L_i), u_{i,n} + p^n(L_i)
(c) After loop coalescing, for i = 1, ..., m:
    L_i: DO L_i = l_i, u_i
(d) After loop fusion:
    DO L = min_{1<=i<=m} l_i, max_{1<=i<=m} u_i
(e) After loop fusion without loop coalescing:
    DO J_1 = min_{1<=i<=m} (l_{i,1} + p^1(L_i)), max_{1<=i<=m} (u_{i,1} + p^1(L_i))
      DO J_2 = min_{1<=i<=m} (l_{i,2} + p^2(L_i)), max_{1<=i<=m} (u_{i,2} + p^2(L_i))
        ...
          DO J_n = min_{1<=i<=m} (l_{i,n} + p^n(L_i)), max_{1<=i<=m} (u_{i,n} + p^n(L_i))
Figure 2: The original and the transformed loop nests
$u_{i,j}$ respectively, where $l_{i,j}$ and $u_{i,j}$ are loop invariants. For
simplicity of presentation, all the loop nests $L_i$, $1 \le i \le m$,
are assumed to have the same nesting level n. If they do
not have the same nesting level, we can apply our technique
to different loop levels incrementally. Figure 3(a) shows a
simple example, where the nesting level is 2 for loops $L_1$ and
$L_2$ and is 1 for loop $L_3$. We first apply our technique to fuse
loops $L_1$, $L_2$ and $L_3$ at the outmost level only, resulting in the
loop nest shown in Figure 3(b). We then apply the technique
to loops $L_1$ and $L_2$, resulting in the loop nest in Figure 3(c).
If there exist dangling statements between two loop nests,
we move them before or after the loop sequence if permitted
by the dependences. Otherwise, we perform techniques such
as code sinking [30] to move such statements into one of
the adjacent loop nests. Alternatively, we can patch these
statements to the first or the last iteration of their adjacent
loop nest. In the interest of keeping our fundamental idea
clear, we do not follow the aforementioned generalizations in
this paper. We stay within the model in Figure 2(a) instead.
The array regions referenced in the given collection of loops
are divided into three classes:
• An input array region is upwardly exposed to the
beginning of $L_1$.
• An output array region is live after $L_m$.
• A local array region does not intersect with any input
or output array regions.
By utilizing the existing dependence analysis, region analysis
and live analysis techniques [4, 11, 12, 18], we can compute
input, output and local array regions efficiently. Note
that input and output regions can overlap with each other.
In Example 1 (Figure 1(a)), E[0 : N] is both the input array
region and the output array region, and A[1 : N] is the local
array region. Figure 4(a) shows a more complex example
(Example 2), which resembles one of the well-known Livermore
loops. In Example 2, where m = 4 and n = 2, each declared
array is of dimension [JN+1, KN+1]. ZP, ZR, ZQ,
ZZ, ZA[1,2:KN], ZB[2:JN,KN+1] are input array regions.
ZP, ZR, ZQ, ZZ are output array regions. ZA[2:JN,2:KN]
and ZB[2:JN,2:KN] are local array regions.
Figure 2(b) shows the code form after loop shifting but
before loop fusion, where $p^j(L_i)$ represents the shifting factor
for loop $L_{i,j}$. In the rest of this paper, we assume that
loops $L_i$ are coalesced into single-level loops [30, 26]¹ after
loop shifting but before loop fusion. Figure 2(c) shows
the code form after loop coalescing but before loop fusion,
and Figure 2(d) shows the code form after loop fusion. The
loops are coalesced to ease code generation for general cases.
However, in most common cases, loop coalescing is unnecessary
[26]. Figure 2(e) shows the code form after loop fusion
without loop coalescing applied. Array contraction will then
be applied to the code shown in either Figure 2(d) or in
Figure 2(e).
¹Unlike [30], we do not perform loop normalization after
coalescing a multi-level loop nest to a single-level one.

Figure 3: Applying to loops not with the same nesting level

(a)
L1: DO K = 2, KN
      DO J = 2, JN
        ZA(J,K) = ZP(J - 1,K + 1) + ZR(J - 1,K - 1)
      END DO
    END DO
L2: DO K = 2, KN
      DO J = 2, JN
        ZB(J,K) = ZQ(J - 1,K) + ZZ(J,K)
      END DO
    END DO
L3: DO K = 2, KN
      DO J = 2, JN
        ZP(J,K) = ZP(J,K) + ZA(J,K)
        ...
        ZQ(J,K) = ZQ(J,K) + ZA(J,K)
                + ZA(J - 1,K) + ZB(J,K) + ZB(J,K + 1)
      END DO
    END DO
Figure 4: Example 2 and its original and simplified
loop dependence graphs
2.2 Loop Dependence Graph
We extend the definitions of the traditional dependence dis-
tance vector and dependence graph [14] to a collection of
loops as follows.
Definition 1. Given a collection of loop nests, $L_1, \ldots, L_m$,
as in Figure 2(a), if a data dependence exists from
iteration $(i_1, i_2, \ldots, i_n)$ of loop $L_1$ to iteration $(j_1, j_2, \ldots, j_n)$ of
loop $L_2$, we say the distance vector is
$(j_1 - i_1, j_2 - i_2, \ldots, j_n - i_n)$.




Definition 2. Given a collection of loop nests, $L_1, L_2, \ldots, L_m$,
a loop dependence graph (LDG) is a directed graph
G = (V, E) such that each node in V represents a loop
nest $L_i$, $1 \le i \le m$. (We denote $V = \{L_1, L_2, \ldots, L_m\}$.)
Each directed edge, $e = \langle L_i, L_j \rangle$, in E represents a data
dependence (flow, anti- or output dependence) from $L_i$ to
$L_j$. The edge e is annotated by a distance vector² $\vec{dv}(e)$.
For each dependence edge e, if its distance vector is not
constant, we replace it with a set of edges as follows. Let
S be the set of dependence distances e represents. Let $\vec{d_0}$
be the lexicographically minimum distance in S. Let
$S_1 = \{\vec{d_1} \mid \vec{d_1} \succeq \vec{d},\ \vec{d_1} \in S \wedge (\forall \vec{d} \in S \wedge \vec{d} \ne \vec{d_1})\}$.
For any vector $\vec{d_1}$ in $S_1$ (also in S), there exists no other vector in S which
is no smaller than $\vec{d_1}$. We replace the original edge e with
$(|S_1| + 1)$ edges, annotated by $\vec{d_0}$ and $\vec{d_i}$ ($\vec{d_i} \in S_1$,
$1 \le i \le |S_1|$) respectively.
Figure 4(b) shows the loop dependence graph for the
example in Figure 4(a), without showing the array regions.
As an example, the flow dependence from $L_1$ to $L_3$ with
$\vec{dv} = (0,0)$ is due to array region ZA[2:JN,2:KN]. In
Figure 4(b), where multiple dependences of the same type
(flow, anti- or output) exist from one node to another, we
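²From [29, 30], $\vec{u} = (u_1, u_2, \ldots, u_n)$, $\vec{v} = (v_1, v_2, \ldots, v_n)$,
$\vec{u} + \vec{v} = (u_1 + v_1, u_2 + v_2, \ldots, u_n + v_n)$,
$\vec{u} - \vec{v} = (u_1 - v_1, u_2 - v_2, \ldots, u_n - v_n)$,
$\vec{u} \succ \vec{v}$ ($\vec{u}$ is lexicographically greater than $\vec{v}$)
if $\exists\, 0 \le k \le n-1$, $(u_1, \ldots, u_k) = (v_1, \ldots, v_k) \wedge u_{k+1} > v_{k+1}$,
$\vec{u} \succeq \vec{v}$ if $\vec{u} \succ \vec{v}$ or $\vec{u} = \vec{v}$,
$\vec{u} \ge \vec{v}$ if $u_k \ge v_k$ ($1 \le k \le n$).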
use one arc to represent them all in the figure. All associated
distance vectors are then marked on this single arc.
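For references whose subscripts are a loop index plus a constant, the distance vectors on the arcs can be computed mechanically. The sketch below is an illustration only: the helper name and the dictionary encoding of subscripts are assumptions, and the two reads used in the example are the ZA(J-1,K) and ZB(J,K+1) references that appear in Example 2.

```python
def distance_vector(write_offsets, read_offsets, loop_order):
    """Distance (sink iteration minus source iteration) between a write and a
    later read of the same array, when every subscript has the form
    `loop index + constant`.

    write_offsets / read_offsets map a loop index name to its constant offset;
    loop_order lists the loop indices from the outermost to the innermost.
    """
    return tuple(write_offsets[v] - read_offsets[v] for v in loop_order)


# ZA(J,K) is written; a later loop reads ZA(J-1,K). Loop order is (K, J).
d_za = distance_vector({"J": 0, "K": 0}, {"J": -1, "K": 0}, ("K", "J"))

# ZB(J,K) is written; a later loop reads ZB(J,K+1).
d_zb = distance_vector({"J": 0, "K": 0}, {"J": 0, "K": 1}, ("K", "J"))

# Python compares tuples lexicographically, which matches the ordering used
# for distance vectors in the footnote.
lexicographic_min = min(d_za, d_zb)
```

The two results, (0, 1) and (-1, 0), are the kind of annotations that end up on the arcs of the loop dependence graph.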
2.3 Assumptions
We make the following three assumptions in order to sim-
plify our formulation in Section 3.
Assumption 1. The loop trip counts for perfect nests $L_i$
and $L_j$ are equal at the same corresponding loop level h,
$1 \le h \le n$. This can be also stated as
$u_{i,h} - l_{i,h} + 1 = u_{j,h} - l_{j,h} + 1$, $1 \le i, j \le m$, $1 \le h \le n$.
To enforce Assumption 1, one could either partition the
iteration spaces of certain loops into equal pieces, or apply
loop peeling.
Throughout this paper, we use $\beta^{(h)}$ to denote the loop trip
count of loop $L_i$ at level h, which is constant or symbolically
constant w.r.t. the program segment under consideration.
Denote $\vec{\beta} = (\beta^{(1)}, \ldots, \beta^{(n)})$. We let $\sigma^{(n)} = 1$ and
$\sigma^{(h)} = \sigma^{(h+1)}\beta^{(h+1)}$, $1 \le h \le n-1$.
Let $\vec{\sigma} = (\sigma^{(1)}, \sigma^{(2)}, \ldots, \sigma^{(n)})$.
In this paper, we also denote $\tau_i$ as the number of static
write references due to local array regions³ in loop $L_i$. We
arbitrarily assign each static write reference in $L_i$ a number
$1 \le k \le \tau_i$ in order to distinguish them. Taking the loops in
Figure 4(b) as an example, we have $\vec{\beta} = (KN - 1, JN - 1)$,
$\vec{\sigma} = (JN - 1, 1)$, $\tau_1 = \tau_2 = 1$ and $\tau_3 = \tau_4 = 0$.
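The weight vector defined above (innermost weight 1, each outer weight equal to the next inner weight times the next inner trip count) is simply the set of strides that linearize the iteration space. A minimal sketch follows; the function name and the concrete values of JN and KN are chosen only for illustration.

```python
def coalescing_weights(trip_counts):
    """Given per-level trip counts (outermost first), return the weights
    sigma with sigma[n-1] = 1 and sigma[h] = sigma[h+1] * trip_counts[h+1]."""
    n = len(trip_counts)
    sigma = [1] * n
    for h in range(n - 2, -1, -1):
        sigma[h] = sigma[h + 1] * trip_counts[h + 1]
    return tuple(sigma)


JN, KN = 6, 9                       # illustrative sizes only
trip_counts = (KN - 1, JN - 1)      # loops run over K = 2..KN and J = 2..JN
sigma = coalescing_weights(trip_counts)
```

With these sizes the result is (JN - 1, 1), which agrees with the values stated for Example 2.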
³In the rest of this paper, the term "a static write
reference" means "a static write reference due to local array
regions".
We make the following assumption about the dependence
distance vectors.
Assumption 2. The sum of the absolute values of all
dependence distances at loop level h in loop dependence graph
G = (V, E) should be less than one-fourth of the trip count
of a loop at level h. This assumption can also be stated
as $\sum_{k=1}^{|E|} |\vec{dv}(e_k)| < \frac{1}{4}\vec{\beta}$ for all $e_k \in E$ annotated with the
dependence distance vector $\vec{dv}(e_k)$.
Assumption 2 is reasonable because for most programs, the
constant dependence distances are generally very small. If
non-constant dependence distances exist, the techniques
discussed in Section 4.2, such as loop interchange and circular
loop skewing, may be utilized to reduce such dependence
distances.
Assumption 3. For each static write reference r, each
instance of r writes to a distinct memory location. No
IF-statement guards the statement which contains the
reference r. Different static write references write to different
portions of main memory.
If a static write reference does not write to a distinct mem-
ory location in each loop iteration, we apply scalar or array
expansion to this reference [30]. Later on, our technique
should minimize the total size of the local array regions.
In case of IF statements, we assume both branches will be
taken. In [26], we discussed the case where the regions written
by two different static write references are the same or
overlap with each other.
2.4 LDG Simplification
The loop dependence graph can be simplified by keeping
only dependence edges necessary for memory reduction. The
simplification process is based on the following three claims.
Claim 1. Any dependence from $L_i$ to itself is automatically
preserved after loop shifting, loop coalescing and loop
fusion. This is because we are not reordering the computation
within any loop $L_i$.
Claim 2. Among all dependence edges from $L_i$ to $L_j$,
$i \ne j$, suppose that the edge e has the lexicographically
minimum dependence distance vector. After loop shifting
and coalescing, if the dependence distance associated with
e is nonnegative, it is legal to fuse loops $L_i$ and $L_j$. This is
because after loop shifting and coalescing, the dependence
distances for all other dependence edges remain equal to or
greater than that for the edge e and thus remain nonnegative.
In other words, no fusion-preventing dependences exist.
We will prove this claim in Section 3 through Lemma 3.
Claim 3. The amount of memory needed to carry a computation
is determined by the lexicographically maximum
flow-dependence distance vectors which are due to local array
regions. We will discuss this claim further in Section 2.5.
During the simplification, we classify all edges into two classes:
L-edges and M-edges. The L-edges are used to determine
the legality of loop fusion. The M-edges will determine the
minimum memory requirement. All M-edges are flow
dependence edges. But an L-edge could be a flow, an anti- or
an output dependence edge. It is possible that one edge is
both an L-edge and an M-edge. The simplification process
is as follows.
• Based on Claims 1 and 3, for each combination of
the node $L_i$ and the static write reference r in $L_i$ where
$\tau_i > 0$, among all dependence edges from $L_i$ to itself
due to r, we keep only the one whose flow dependence
distance vector is lexicographically maximum. This
edge is an M-edge.
• Based on Claims 1 and 3, for each node $L_i$ such
that $\tau_i = 0$, we remove all dependence edges from $L_i$
to itself.
• Based on Claims 2 and 3, for each node $L_i$ where
$\tau_i > 0$, among all dependence edges from $L_i$ to $L_j$ ($j \ne i$),
we keep only one dependence edge for legality such
that its dependence distance vector is lexicographically
minimum. This edge is an L-edge. For any static write
reference r in $L_i$, among all dependence edges from $L_i$
to $L_j$ ($j \ne i$) due to r, we keep only one flow dependence
edge whose distance vector is lexicographically
maximum. This edge is an M-edge.
• Based on Claims 2 and 3, for each node $L_i$ where
$\tau_i = 0$, among all dependence edges from $L_i$ to $L_j$ ($j \ne i$),
we keep only the dependence edge whose dependence
distance vector is lexicographically minimum.
This edge is an L-edge.
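The four rules above can be sketched as a single pass over the dependence edges. The edge encoding below (dictionaries with `src`, `dst`, `kind`, `dv` and `ref` keys) is a hypothetical representation chosen for this illustration, not the data structure of the authors' compiler. The sample edges are the ones the text attributes to Example 2.

```python
def simplify_ldg(edges):
    """Keep one L-edge per ordered node pair and one M-edge per
    (node pair, static write reference).

    Each edge is a dict with keys
      src, dst : loop-nest names
      kind     : "flow", "anti" or "output"
      dv       : distance vector as a tuple (compared lexicographically)
      ref      : id of the static write reference to a local array region
                 that causes a flow dependence, otherwise None
    """
    l_edges = {}   # (src, dst)      -> lexicographically minimum distance
    m_edges = {}   # (src, dst, ref) -> lexicographically maximum flow distance
    for e in edges:
        src, dst, dv = e["src"], e["dst"], e["dv"]
        if src != dst:                       # self edges never act as L-edges
            key = (src, dst)
            if key not in l_edges or dv < l_edges[key]:
                l_edges[key] = dv
        if e["kind"] == "flow" and e["ref"] is not None:
            key = (src, dst, e["ref"])
            if key not in m_edges or dv > m_edges[key]:
                m_edges[key] = dv
    return l_edges, m_edges


sample = [
    {"src": "L1", "dst": "L3", "kind": "flow", "dv": (0, 0), "ref": "ZA"},
    {"src": "L1", "dst": "L3", "kind": "flow", "dv": (0, 1), "ref": "ZA"},
    {"src": "L2", "dst": "L3", "kind": "flow", "dv": (-1, 0), "ref": "ZB"},
    {"src": "L2", "dst": "L3", "kind": "flow", "dv": (0, 0), "ref": "ZB"},
]
l_edges, m_edges = simplify_ldg(sample)
```

A self edge without a local write reference contributes to neither dictionary, which corresponds to removing it.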
The above process simplifies the program formulation and
makes graph traversal faster. Figure 4(c) shows the loop
dependence graph after simplification of Figure 4(b). In
Figure 4(c), we do not mark the classes of the dependence
edges. As an example, the dependence edge from $L_1$ to $L_3$
marked with (0,0) is an L-edge, and the one marked with
(0,1) is an M-edge. The latter edge is associated with the
static write reference ZA(J, K).
2.5 Reference Windows
Loop shifting is applied before loop fusion in order to honor
all the dependences. We associate one integer vector $\vec{p}(L_i)$
with each loop nest $L_i$ in the loop dependence graph.
Denote $\vec{p}(L_j) = (p^1(L_j), \ldots, p^n(L_j))$ where $p^k(L_j)$ is the
shifting factor of $L_j$ at loop level k (Figure 2(b)). For each
dependence edge $\langle L_i, L_j \rangle$ with the distance vector $\vec{dv}$, the
new distance vector is $\vec{p}(L_j) + \vec{dv} - \vec{p}(L_i)$. Our memory
minimization problem, therefore, reduces to the problem of
determining the shifting factor, $p^j(L_i)$, for each loop $L_{i,j}$,
such that the total temporary array storage required is
minimized after all loops are coalesced and legally fused.
In [9], Gannon et al. use a reference window to quantify the
minimum cache footprint required by a dependence with a
loop-invariant distance. We shall use the same concept to
quantify the minimum temporary storage to satisfy a flow
dependence.
Definition 3. (from [9]) The reference window, $W(\pi_X)_t$,
for a dependence $\pi_X : S_1 \to S_2$ on a variable X at time t,
is defined as the set of all elements of X that are referenced
by $S_1$ at or before t and will also be referenced (according
to the dependence) by $S_2$ after t.
In Figure 1(a), the reference window due to the flow
dependence (from $L_1$ to $L_2$ due to array A) at the beginning of
each loop $L_2$ iteration is { A(I), A(I + 1), ..., A(N) }. Its
reference window size ranges from 1 to N. In Figure 1(c),
the reference window due to the flow dependence (caused by
array A) at the beginning of each loop iteration is { A(I-1)
}. Its reference window size is 1.
Next, we extend Definition 3 for a set of flow dependences
as follows.
Definition 4. Given flow dependence edges $e_1, e_2, \ldots, e_s$,
suppose their reference windows at time t are $W_1, W_2,
\ldots, W_s$ respectively. We define the reference window of
$\{e_1, e_2, \ldots, e_s\}$ at time t as $\bigcup_{j=1}^{s} W_j$.
Since the reference window characterizes the minimum memory
required to carry a computation, the problem of minimizing
the memory required for the given collection of loop
nests is equivalent to the problem of choosing loop shifting
factors such that the loops can be legally coalesced and fused
and that, after fusion, the reference window size of all flow
dependences due to local array regions is minimized. Given
a collection of loop nests which can be legally fused, we
need to predict the reference window after loop coalescing
and fusion.
Definition 5. For any loop node $L_i$ (in an LDG) which
writes to local array regions R, suppose iteration $(i_1, \ldots, i_n)$
becomes iteration i after loop coalescing and fusion. We
define the predicted reference window of $L_i$ in iteration $(i_1, \ldots, i_n)$
as the reference window of all flow dependences due to R
in the beginning of iteration i of the coalesced and fused
loop. Suppose the predicted reference window with iteration
$(i_1, \ldots, i_n)$ has the largest size of those due to R. We
define its size as the predicted reference window size of the entire
loop $L_i$ due to R. We define the predicted reference window
due to a static write reference r in $L_i$ as the predicted reference
window of $L_i$ due to the array regions written by r.
(For convenience, if $L_i$ writes to nonlocal regions only, we
define its predicted reference window to be empty.)
Based on Definition 5, we have the following lemma:
LEMMA 1. The predicted reference window size for the
kth static write reference r in $L_i$ must be no smaller than
the predicted reference window size for any flow dependence
due to r.
PROOF. This is because the predicted reference window
size for any flow dependence should be no smaller than the
minimum required memory size to carry the computation for
that dependence. The predicted reference window size for
the kth static write reference r in $L_i$ should be no smaller
than the memory size to carry the computation for all flow
dependences due to r. □
THEOREM 1. Minimizing memory requirement is equivalent
to minimizing the predicted reference window size for
all flow dependences due to local array regions.
PROOF. By Definition 5 and Lemma 1. □
In this paper, $\vec{u}\vec{v}$ denotes the inner product of $\vec{u}$ and $\vec{v}$.
Given a dependence $\pi$ with the distance vector
$\vec{dv} = (d_1, d_2, \ldots, d_n)$ after loop shifting, $\vec{\sigma}\vec{dv}$ is the dependence distance
for $\pi$ after loop coalescing but before loop fusion, which we
also call the coalesced dependence distance. Due to
Assumption 3, $\vec{\sigma}\vec{dv}$ also represents the predicted reference window
size of $\pi$ both in the coalesced iteration space and in the
original iteration space.
LEMMA 2. Loop fusion is legal if and only if all coalesced
dependence distances are nonnegative.
PROOF. This is to preserve all the original dependences. □
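The coalesced dependence distance and the legality condition of Lemma 2 can be illustrated with a short sketch. The helper names, the shift values and the choice JN = 6 (so that the weight vector is (5, 1)) are assumptions made for this example; the two distance vectors are the ones the text gives for the edges from L2 to L3.

```python
def shifted_distance(p_src, p_dst, dv):
    """New distance vector after loop shifting: p(dst) + dv - p(src)."""
    return tuple(pd + d - ps for ps, pd, d in zip(p_src, p_dst, dv))


def coalesced_distance(sigma, dv):
    """Inner product of the weight vector sigma and a distance vector."""
    return sum(s * d for s, d in zip(sigma, dv))


def fusion_is_legal(sigma, shifts, edges):
    """True if every coalesced distance is nonnegative after shifting.
    edges is an iterable of (src, dst, dv) triples."""
    return all(
        coalesced_distance(sigma, shifted_distance(shifts[s], shifts[t], dv)) >= 0
        for s, t, dv in edges
    )


sigma = (5, 1)                                   # (JN - 1, 1) with JN = 6
edges = [("L2", "L3", (-1, 0)), ("L2", "L3", (0, 0))]

no_shift = {"L2": (0, 0), "L3": (0, 0)}
shift_l3 = {"L2": (0, 0), "L3": (1, 0)}          # delay L3 by one outer iteration

legal_before = fusion_is_legal(sigma, no_shift, edges)
legal_after = fusion_is_legal(sigma, shift_l3, edges)
```

Without shifting, the edge with distance (-1, 0) has coalesced distance -5 and prevents fusion; shifting L3 by (1, 0) makes both coalesced distances nonnegative.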
We now use loop node $L_2$ in Figure 4(c) to illustrate how
to compute the size of the predicted reference window for
one particular static write reference. In this example, the
predicted reference window size of $L_2$ due to the static write
reference ZB(J, K) is the same as the predicted reference
window size of $L_2$. There exist two dependence edges from
$L_2$ to $L_3$, one L-edge and one M-edge, with distance vectors
(-1,0) and (0,0). There also exist two dependence edges
from $L_2$ to $L_4$, one L-edge and one M-edge. Let
Note that loop shifting and coalescing is always legal. To
make loop fusion legal, the following constraint is enforced:
(5)
The constraint (5) guarantees that the coalesced dependence
distance is nonnegative for all dependences after loop shifting
and coalescing but before loop fusion.
$\vec{\sigma}\vec{dv_3}$ represents the predicted reference window size for the
flow dependence from $L_2$ to $L_3$, and $\vec{\sigma}\vec{dv_4}$ for the predicted
reference window size for the flow dependence from $L_2$ to
$L_4$. The size of the predicted reference window of $L_2$ can
be computed by taking the greater one of the above two
reference window sizes, i.e., $\max(\vec{\sigma}\vec{dv_3}, \vec{\sigma}\vec{dv_4})$, according to
Lemma 1.
Next, we formulate the objective function for memory re-
duction to minimize the size of local array regions.
3. OBJECTIVE FUNCTION
In this section, we first formulate a graph-based system to
minimize the predicted reference window size, thus mini-
mizing the total memory requirement. We then transform
our problem to a network flow problem, which is solvable in
polynomial time.
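Given a loop dependence graph G, the objective function to
minimize the size of the predicted reference windows for all
loop nests can be formulated as follows. ($e = \langle L_i, L_j \rangle$ is
an edge in G.)
$$\min \sum_{i=1}^{m} \sum_{k=1}^{\tau_i} \vec{\sigma}\vec{M}_{i,k} \qquad (6)$$
subject to
$$\vec{\sigma}(\vec{p}(L_j) + \vec{dv}(e) - \vec{p}(L_i)) \ge 0, \quad \forall\ \text{L-edge } e \qquad (7)$$
$$\vec{\sigma}\vec{M}_{i,k} \ge \vec{\sigma}(\vec{p}(L_j) + \vec{dv}(e) - \vec{p}(L_i)), \quad \forall\ \text{M-edge } e,\ 1 \le k \le \tau_i \qquad (8)$$
We call the above defined system Problem 1. In (6),
$\vec{\sigma}\vec{M}_{i,k}$ represents the predicted reference window size for
the local array regions due to the kth static write reference
in $L_i$.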
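To make Problem 1 concrete, the sketch below evaluates a given choice of shifting factors: it checks constraint (7) on every L-edge and, for every static write reference, takes the smallest window size allowed by constraint (8), namely the maximum coalesced distance over its M-edges. It only evaluates a candidate solution and does not search for the optimum. The dictionary encodings and JN = 6 are assumptions made for the illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def evaluate_problem1(sigma, shifts, l_edges, m_edges):
    """Return the total predicted reference window size for the given shifts,
    or None if some L-edge violates constraint (7).

    l_edges: {(src, dst): dv}
    m_edges: {(src, dst, ref): dv}, ref naming a static write reference of src
    """
    def new_dv(src, dst, dv):
        return tuple(pd + d - ps
                     for ps, pd, d in zip(shifts[src], shifts[dst], dv))

    for (src, dst), dv in l_edges.items():
        if dot(sigma, new_dv(src, dst, dv)) < 0:
            return None                      # fusion would be illegal
    window = {}                              # (src, ref) -> window size
    for (src, dst, ref), dv in m_edges.items():
        size = dot(sigma, new_dv(src, dst, dv))
        key = (src, ref)
        window[key] = max(window.get(key, size), size)
    return sum(window.values())


sigma = (5, 1)                               # (JN - 1, 1) with JN = 6
l_edges = {("L1", "L3"): (0, 0), ("L2", "L3"): (-1, 0)}
m_edges = {("L1", "L3", "ZA"): (0, 1), ("L2", "L3", "ZB"): (0, 0)}

zero = {"L1": (0, 0), "L2": (0, 0), "L3": (0, 0)}
shifted = {"L1": (0, 0), "L2": (0, 0), "L3": (1, 0)}

cost_zero = evaluate_problem1(sigma, zero, l_edges, m_edges)
cost_shifted = evaluate_problem1(sigma, shifted, l_edges, m_edges)
```

For this small subgraph the unshifted loops cannot be fused, while shifting L3 by (1, 0) gives a total window of 11 elements (6 for ZA and 5 for ZB).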
Constraint (7) says that the coalesced dependence distance
must be nonnegative for all L-edges after loop coalescing but
before loop fusion. Constraint (8) says that the predicted
reference window size, $\vec{\sigma}\vec{M}_{i,k}$, must be no smaller than the
predicted reference window size for every M-edge originated
from $L_i$ and due to the kth static write reference in $L_i$.
Combining the constraint (7) and Assumption 2, the following
lemma says that the coalesced dependence distance is
also nonnegative for all M-edges.
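LEMMA 3. If the constraint (7) holds,
$\vec{\sigma}(\vec{p}(L_j) + \vec{dv}(e) - \vec{p}(L_i)) \ge 0$ holds for all M-edges $e = \langle L_i, L_j \rangle$ in G.
PROOF. If i = j, we have $\vec{\sigma}(\vec{p}(L_j) + \vec{dv}(e) - \vec{p}(L_i)) = \vec{\sigma}\vec{dv}(e)$.
If $\vec{dv}(e) = \vec{0}$, then $\vec{\sigma}\vec{dv}(e) = 0$ holds. Otherwise,
assume that the first non-zero component of $\vec{dv}(e)$ is the
hth component. Based on Assumption 2, we have
$\vec{\sigma}\vec{dv}(e) \ge \vec{\sigma}(0, \ldots, 0, 1, -\frac{1}{4}\beta^{(h+1)} + 1, \ldots, -\frac{1}{4}\beta^{(n)} + 1) > 0$.
For an M-edge $e_2 = \langle L_i, L_j \rangle$, $i \ne j$, there must exist an
L-edge $e_1 = \langle L_i, L_j \rangle$. The constraint (7) guarantees that
$\vec{\sigma}(\vec{p}(L_j) + \vec{dv}(e_1) - \vec{p}(L_i)) \ge 0$ holds. We have
$\vec{\sigma}(\vec{p}(L_j) + \vec{dv}(e_2) - \vec{p}(L_i)) = \vec{\sigma}(\vec{p}(L_j) + \vec{dv}(e_1) - \vec{p}(L_i)) + \vec{\sigma}(\vec{dv}(e_2) - \vec{dv}(e_1)) \ge \vec{\sigma}(\vec{dv}(e_2) - \vec{dv}(e_1))$.
By the definition of L-edges and M-edges, we have
$\vec{dv}(e_2) - \vec{dv}(e_1) \succeq \vec{0}$. Similar to the proof for the case of i = j in the
above, we can prove that $\vec{\sigma}(\vec{dv}(e_2) - \vec{dv}(e_1)) \ge 0$ holds. □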
From the proof of Lemma 3, we can also see that for any
dependence $\pi$ which is eliminated during our simplification
process in Section 2.4, its coalesced dependence distance is
also nonnegative, given that the constraint (7) holds. Hence,
the coalesced dependence distances for all the original
dependences (before simplification in Section 2.4) are nonnegative,
after loop shifting and coalescing but before loop fusion.
Loop fusion is legal according to Lemma 2.
In Section 2.4, we know that for any flow dependence edge
$e_3$ from $L_i$ to $L_j$ due to the static write reference r which
is eliminated during the simplification process, there must
exist an M-edge $e_4$ from $L_i$ to $L_j$ due to r. From the proof of
Lemma 3, $\vec{\sigma}(\vec{p}(L_j) + \vec{dv}(e_4) - \vec{p}(L_i)) \ge \vec{\sigma}(\vec{p}(L_j) + \vec{dv}(e_3) - \vec{p}(L_i))$
holds. Hence, the constraint (8) computes the predicted
reference window size, $\vec{\sigma}\vec{M}_{i,k}$, over all flow dependences
originated from $L_i$ due to the kth static write reference
in the unsimplified loop dependence graph (see
Section 2.2). According to Lemma 1, the constraint (8)
correctly computes the predicted reference window size, $\vec{\sigma}\vec{M}_{i,k}$.
3.1 Transforming the Original Problem
We define a new problem, Problem 2, by adding the following
two constraints to Problem 1. ($e = \langle L_i, L_j \rangle$ is an
edge in G.)
$$\vec{p}(L_j) + \vec{dv}(e) - \vec{p}(L_i) \succeq \vec{0}, \quad \forall\ \text{L-edge } e \qquad (9)$$
$$\vec{M}_{i,k} \succeq \vec{p}(L_j) + \vec{dv}(e) - \vec{p}(L_i), \quad \forall\ \text{M-edge } e,\ 1 \le k \le \tau_i \qquad (10)$$
In the following, we show that given an optimal solution
for Problem 1, we can construct an optimal solution for
Problem 2 with the same value for the objective function
(6), and vice versa.
LEMMA 4. Given any optimal solution for Problem 1,
we can construct an optimal solution for Problem 2, with
the same value for the objective function (6).
PROOF. The search space of Problem 2 is a subset of
that of Problem 1. Given an LDG G, the optimal objective
function value (6) for Problem 2 must be equal to or greater
than that for Problem 1. Given any optimal solution for
Problem 1, we find the shifting factor ($\vec{p}$) and $\vec{M}_{i,k}$ values
for Problem 2 as follows.
1. Initially let the $\vec{p}$ and $\vec{M}_{i,k}$ values from Problem 1 be the
solution for Problem 2. In the following steps, we
will adjust these values so that all the constraints for
Problem 2 are satisfied and the value for the objective
function (6) is not changed.
2. If all $\vec{p}$ values satisfy the constraint (9), go to step 4.
Otherwise, go to step 3.
3. This step finds $\vec{p}$ values which satisfy the constraint
(9).
Following the topological order of nodes in G, find the
first node $L_i$ such that there exists an L-edge
$e = \langle L_i, L_j \rangle$ where the constraint (9) is not satisfied. (Here
we ignore self cycles since they must represent M-edges
in G.) Suppose $\vec{dv} = \vec{p}(L_j) + \vec{dv}(e) - \vec{p}(L_i) = (0, \ldots, 0, c_1, \ldots)$
where $c_1 < 0$ is the sth and the first
nonzero component of $\vec{dv}$. Let
$\vec{\delta} = (0, \ldots, 0, -c_1, c_1\beta^{(s+1)}, 0, \ldots)$ where the only two nonzero
components are the sth and the (s + 1)th. Change $\vec{p}(L_j)$ by
$\vec{p}(L_j) = \vec{p}(L_j) + \vec{\delta}$. Because of $\vec{\sigma}\vec{\delta} = 0$, the new $\vec{p}$
values, including $\vec{p}(L_j)$, satisfy the constraints (7) and
(8). The value for the objective function (6) is also
not changed.
If $\vec{p}(L_j) + \vec{dv}(e) - \vec{p}(L_i)$ is still lexicographically
negative, we can repeat the above process. Such a process
will terminate within at most n times since otherwise
the constraint (7) would not hold for the optimal
solution of Problem 1.
Note that the node $L_i$ is selected based on the topological
order and the shifting factor $\vec{p}(L_j)$ is increased
compared to its original value. For any L-edge with the
destination node $L_j$, if the constraint (9) holds before
updating $\vec{p}(L_j)$, it still holds after the update. Such a
property will guarantee our process to terminate.
Go to step 2.
4. This step finds $\vec{M}_{i,k}$ values which satisfy the constraint
(10).
Given $1 \le i \le m$ and $1 \le k \le \tau_i$, find the $\vec{M}_{i,k}$
value which satisfies the constraint (10) such that the
constraint (10) becomes equal for at least one edge.
If the $\vec{M}_{i,k}$ achieved above satisfies the constraint (8),
we are done. Otherwise, we increase the nth component
of the $\vec{M}_{i,k}$ value until the constraint (8) holds
and becomes equal for at least one edge.
Find all $\vec{M}_{i,k}$ values. The value for the objective
function (6) is not changed.
With such $\vec{p}$ and $\vec{M}_{i,k}$ values, the value for the objective
function (6) for Problem 2 is the same as that for
Problem 1. Hence, we get an optimal solution for Problem 2
with the same value for the objective function (6). □
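Step 3 of the proof can be illustrated numerically. The sketch below builds the correction vector that moves one unit-weighted amount from level s+1 to level s, so that its inner product with the weight vector is zero, and repeats it until the distance vector is lexicographically nonnegative. The function names and the sample trip counts are assumptions chosen for the illustration; the inputs deliberately ignore the bound of Assumption 2 to keep the numbers small.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def zero_cost_adjustment(dv, trip_counts):
    """Return delta = (0, ..., -c1, c1 * trip_counts[s+1], 0, ...) where c1 < 0
    is the first nonzero component of dv, at position s (0-based)."""
    s = next((h for h, c in enumerate(dv) if c != 0), None)
    if s is None or dv[s] > 0:
        raise ValueError("dv is already lexicographically nonnegative")
    if s == len(dv) - 1:
        # Cannot happen when the coalesced distance of dv is nonnegative.
        raise ValueError("no inner level left to borrow from")
    delta = [0] * len(dv)
    delta[s] = -dv[s]
    delta[s + 1] = dv[s] * trip_counts[s + 1]
    return tuple(delta)


def make_lex_nonnegative(dv, trip_counts):
    """Apply the adjustment repeatedly; returns (new_dv, number_of_rounds)."""
    rounds = 0
    while dv < tuple([0] * len(dv)):          # lexicographic comparison
        delta = zero_cost_adjustment(dv, trip_counts)
        dv = tuple(a + b for a, b in zip(dv, delta))
        rounds += 1
    return dv, rounds


trip_counts = (4, 3, 2)
sigma = (6, 2, 1)                             # weights for these trip counts
start = (-1, 2, 3)                            # coalesced distance 1, lex negative
first_delta = zero_cost_adjustment(start, trip_counts)
fixed, rounds = make_lex_nonnegative(start, trip_counts)
```

Each round leaves the coalesced distance unchanged, so constraints (7) and (8) and the objective value are unaffected, while the first nonzero position moves at least one level inward.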
THEOREM 2. Any optimal solution for Problem 2 is also
an optimal solution for Problem 1.
PROOF. Given any optimal solution of Problem 2, we
take its $\vec{p}$ and $\vec{M}_{i,k}$ values as the solution for Problem 1.
Such $\vec{p}$ and $\vec{M}_{i,k}$ values satisfy the constraints (7)-(8), and
the value for the objective function (6) for Problem 1 is
the same as that for Problem 2. Such a solution must be
optimal for Problem 1. Otherwise, we can construct from
Problem 1 another solution of Problem 2 which has lower
value for the objective function (6), according to Lemma 4. □






Figure 5: The transformed graph ($G_1$) for Figure 4(c)
Based on Theorem 2, given an optimal solution for
Problem 2, we immediately have an optimal solution for
Problem 1. In the rest of this section, we try to solve Problem
2 instead.
By expanding the vectors in Problem 2, an integer
programming (IP) problem results. General solutions for IP
problems, however, do not take the LDG graphical
characteristics into account. Instead of solving the IP problem,
we transform it into a network flow problem, as discussed in
the rest of this section.
3.2 Transforming Problem 2
Given a loop dependence graph G, we generate another
graph $G_1 = (V_1, E_1)$ as follows.
• For any node $L_i \in G$, create a corresponding node $\hat{L}_i$
in $G_1$.
• For any node $L_i \in G$, if $L_i$ has an outgoing M-edge,
let the weight of $\hat{L}_i$ be $w(\hat{L}_i) = -\tau_i\vec{\sigma}$. For each static
write reference $r_k$ ($1 \le k \le \tau_i$) in $L_i$, create another
node $\hat{L}_i^{(k)}$ in $G_1$, which is called the sink of $\hat{L}_i$ due to
$r_k$. Let the weight of $\hat{L}_i^{(k)}$ be $w(\hat{L}_i^{(k)}) = \vec{\sigma}$.
• For any node $L_i \in G$ which does not have an outgoing
M-edge, let the weight of $\hat{L}_i$ be $\vec{0}$.
• For any M-edge $\langle L_i, L_j \rangle$ in G due to the static
write reference $r_k$, suppose its distance vector is $\vec{dv}$. Add
an edge $\langle \hat{L}_j, \hat{L}_i^{(k)} \rangle$ to $G_1$ with the distance vector
$-\vec{dv}$.
• For any L-edge $\langle L_i, L_j \rangle$ in G, suppose its distance
vector is $\vec{dv}$. Add an edge $\langle \hat{L}_i, \hat{L}_j \rangle$ to $G_1$ with the
distance vector $\vec{dv}$.
For the original graph in Figure 4(c), Figure 5 shows the
transformed graph.
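The construction of the transformed graph can be sketched as follows. Node weights are stored as the scalar factor c of w(v) = c·σ, and a weight of zero is assumed for nodes without an outgoing M-edge, which is consistent with the later statement that c is -τ_i, 0 or 1. Sink nodes are represented here as `(node, k)` tuples; this encoding and the function name are illustrative assumptions.

```python
def transform(nodes, tau, l_edges, m_edges):
    """Build the transformed graph from a simplified loop dependence graph.

    nodes  : list of loop-nest names
    tau    : {node: number of static write references to local regions}
    l_edges: {(src, dst): dv}
    m_edges: {(src, dst, k): dv}, k in 1..tau[src]
    Returns (weights, edges); weights hold the scalar c of w(v) = c * sigma,
    edges are (tail, head, distance) triples.
    """
    has_m_edge = {src for (src, _dst, _k) in m_edges}
    weights = {}
    for v in nodes:
        if v in has_m_edge:
            weights[v] = -tau[v]
            for k in range(1, tau[v] + 1):
                weights[(v, k)] = 1          # sink of v due to reference k
        else:
            weights[v] = 0
    edges = []
    for (src, dst, k), dv in m_edges.items():
        # An M-edge becomes an edge from the reader to the sink, negated.
        edges.append((dst, (src, k), tuple(-d for d in dv)))
    for (src, dst), dv in l_edges.items():
        edges.append((src, dst, dv))
    return weights, edges


nodes = ["L1", "L2", "L3"]
tau = {"L1": 1, "L2": 1, "L3": 0}
l_edges = {("L1", "L3"): (0, 0), ("L2", "L3"): (-1, 0)}
m_edges = {("L1", "L3", 1): (0, 1), ("L2", "L3", 1): (0, 0)}
weights, edges = transform(nodes, tau, l_edges, m_edges)
```

By construction the weights of each node and its sinks cancel, so the total weight of the transformed graph is zero.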
We associate a vector $\vec{q}$ to each node in $G_1$ as follows.
• For each node $\hat{L}_i$ in $G_1$, $\vec{q}_i = \vec{p}(L_i)$.
• For each node $\hat{L}_i^{(k)}$, $\vec{q}_i^{(k)} = \vec{M}_{i,k} + \vec{p}(L_i)$.
The new system, which we call Problem 3, is defined as
follows. ($e = \langle v_i, v_j \rangle$ is an edge in $G_1$ annotated by $\vec{dv}$.)
subject to





THEOREM 3. Problem 3 is equivalent to Problem 2.
PROOF. We have
$$\sum_{i=1}^{|V_1|} w(v_i)\,\vec{q}_i = \sum_{i}\sum_{k=1}^{\tau_i} \vec{\sigma}\,\vec{M}_{i,k},$$
since the term $-\tau_i\vec{\sigma}\,\vec{p}(L_i)$ of each node cancels against the $\vec{p}(L_i)$ terms of its $\tau_i$ sinks.
Hence the objective function (6) is equivalent to (11).
For each edge $e = \langle \bar{L}_i, \bar{L}_j \rangle$ in $G_1$, the inequality (12) is
equivalent to
$$\vec{p}(L_j) - \vec{p}(L_i) + \vec{dv}(e_l) \succeq \vec{0}, \qquad (14)$$
where $e_l$ is an L-edge in $G$ from $L_i$ to $L_j$. Inequality (14) is
equivalent to (9), hence inequality (12) is equivalent to (9).
For each edge $e = \langle \bar{L}_j, \bar{L}_i^{(k)} \rangle$ in $G_1$, the inequality (12) is
equivalent to
$$\vec{M}_{i,k} + \vec{p}(L_i) - \vec{p}(L_j) - \vec{dv}(e_l) \succeq \vec{0}, \qquad (15)$$
where $e_l$ is an M-edge in $G$ from $L_i$ to $L_j$ due to the $k$th
static write reference in $L_i$. Inequality (15) is equivalent to
(10), hence inequality (12) is equivalent to (10).
Similarly, it is easy to show that the constraints (7) and (8)
are equivalent to constraint (13). □
Note that one edge in $G$ could be both an L-edge and an M-edge,
which corresponds to two edges in $G_1$. Assumption 2
If we consider the vector as the basic computation unit,
Problem 3 is a nonlinear system, due to the constraint
(13). As with Problem 2, such a nonlinear system can
be solved by linearizing the vector representation so that
the original problem becomes an integer programming problem,
which, in its general form, is NP-complete. In what follows,
however, we show that we can achieve an optimal solution
in polynomial time for Problem 3 by utilizing the network
flow property.
We develop optimality conditions to solve Problem 3. We
utilize the network flow property. A network flow consists
of a set of vectors such that each vector $f(e_i)$ corresponds
to an edge $e_i \in E_1$ and, for each node $v_i \in V_1$, the sum
of flow values from all the in-edges is equal to $w(v_i)$
plus the sum of flow values from all the out-edges. That is,
$$\sum_{e_k = \langle \cdot, v_i \rangle \in E_1} f(e_k) = w(v_i) + \sum_{e_k = \langle v_i, \cdot \rangle \in E_1} f(e_k), \quad 1 \le i \le |V_1|, \qquad (17)$$
where $e_k = \langle \cdot, v_i \rangle$ represents an in-edge of $v_i$ and
$e_k = \langle v_i, \cdot \rangle$ represents an out-edge of $v_i$.
LEMMA 5. Given $G_1 = (V_1, E_1)$, there exists at least one
legal network flow.
PROOF. Find a spanning tree $T$ of $G_1$. Assign the flow
value to be $\vec{0}$ for all the edges not in $T$. Hence, if we can
find a legal network flow for $T$, the same flow assignment is
also legal for $G_1$.
We assign flow values to the edges in $T$ in reverse topological
order. Since the total weight of the nodes in $T$ is equal to $\vec{0}$,
a legal network flow exists for $T$. □
For any node $v \in V_1$, we have $w(v) = c\vec{\sigma}$, where $c = -\tau_i$,
0 or 1. For our network flow algorithm, we abstract
out the factor $\vec{\sigma}$ from $w(v)$ such that $w(v)$ is represented
by $c$ only. Such an abstraction will give each flow value the
form $f(e_k) = c_k\vec{\sigma}$, where $c_k$ is an integer constant.
Based on equation (17), given a legal network flow, we have
$$\sum_{i=1}^{|V_1|} w(v_i)\,\vec{q}_i = \sum_{k=1}^{|E_1|} f(e_k)\,(\vec{q}_j - \vec{q}_i), \qquad (18)$$
where $e_k = \langle v_i, v_j \rangle \in E_1$.
Suppose $f(e_k) \succeq \vec{0}$ for the edge $e_k \in E_1$, which is equivalent
to $c_k \ge 0$. With the constraint (13), we have
$$f(e_k)\,(\vec{q}_j - \vec{q}_i + \vec{d}_k) \ge 0,$$
where $e_k \in E_1$ is annotated with the dependence distance
vector $\vec{d}_k = \vec{dv}(e_k)$.
Therefore, with the equation (18), if $f(e_k) \succeq \vec{0}$, we have
$$\sum_{i=1}^{|V_1|} w(v_i)\,\vec{q}_i \ge -\sum_{k=1}^{|E_1|} f(e_k)\,\vec{d}_k. \qquad (21)$$
Collectively, we have the optimality conditions stated as
the following theorem such that if they hold, the inequality
(21) becomes the equality and the optimality is achieved
for Problem 3.
THEOREM 4. If the following three conditions hold,
1. Constraints (12) and (13) are satisfied, and
2. A legal network flow $f(e_k) = c_k\vec{\sigma}$ exists such that $c_k \ge 0$
for $1 \le k \le |E_1|$, and
3. $\sum_{i=1}^{|V_1|} w(v_i)\,\vec{q}_i = -\sum_{k=1}^{|E_1|} f(e_k)\,\vec{d}_k$ holds, i.e., inequality
(21) becomes an equality,
Problem 3 achieves an optimal solution $-\sum_{k=1}^{|E_1|} f(e_k)\,\vec{d}_k$.
PROOF. Obvious from the above discussion. □
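The chain of steps behind equation (18) and inequality (21) can be written out explicitly. The following is a sketch in the notation used above, assuming that each edge $e_k = \langle v_i, v_j \rangle$ carries a flow $f(e_k) = c_k\vec{\sigma}$ with $c_k \ge 0$ and that a product of two vectors denotes their inner product.

```latex
\begin{align*}
\sum_{i=1}^{|V_1|} w(v_i)\,\vec{q}_i
  &= \sum_{i=1}^{|V_1|} \Bigl( \sum_{e_k=\langle \cdot, v_i \rangle} f(e_k)
       - \sum_{e_k=\langle v_i, \cdot \rangle} f(e_k) \Bigr)\,\vec{q}_i
     && \text{flow property at each node} \\
  &= \sum_{k=1}^{|E_1|} f(e_k)\,(\vec{q}_j - \vec{q}_i)
     && \text{each } e_k=\langle v_i, v_j \rangle \text{ enters } v_j \text{ and leaves } v_i \\
  &\ge -\sum_{k=1}^{|E_1|} f(e_k)\,\vec{d}_k
     && \text{since } c_k\,\vec{\sigma}\,(\vec{q}_j - \vec{q}_i + \vec{d}_k) \ge 0 .
\end{align*}
```

Equality holds exactly when every edge with $c_k > 0$ satisfies $\vec{\sigma}\,(\vec{q}_j - \vec{q}_i + \vec{d}_k) = 0$, which is the situation described by the third condition of Theorem 4.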
3.4 Solving Problem 3
Here, let us consider each vector $w(v_i)$, $\vec{q}_i$ and $\vec{d}_k$ as a single
computation unit. Based on the duality theory [24, 2],
Problem 3, excluding the constraint (13), is equivalent to
$$\max\; -\sum_{k=1}^{|E_1|} f(e_k)\,\vec{d}_k \qquad (22)$$
subject to
$$\sum_{e_k = \langle \cdot, v_i \rangle \in E_1} f(e_k) = w(v_i) + \sum_{e_k = \langle v_i, \cdot \rangle \in E_1} f(e_k), \quad 1 \le i \le |V_1|, \qquad (23)$$
$$f(e_i) \succeq \vec{0}, \quad 1 \le i \le |E_1|. \qquad (24)$$
The constraint (13) is mandatory for the equivalence between
Problem 3 and its dual problem, following the development
of optimality conditions in Section 3.3 [1]. The
constraint (23) in the dual system precisely defines a flow
property, where each edge $e_i$ is associated with a flow vector
$f(e_i)$. We define Problem 4 as the system by (11)-(12) and
(22)-(24). Similar to $w(v_i)$, the vector $f(e_k)$ is represented
by $c_k$ where $f(e_k) = c_k\vec{\sigma}$. Although apparently the search
space of Problem 4 encloses that of Problem 3, Problem
4 has correct solutions only within the search space defined
by Problem 3.
Based on the property of duality, Problem 4 achieves an
optimal solution if and only if
• The constraints (12), (23) and (24) hold, and
• The objective function values for (11) and (22) are
equal, i.e., $\sum_{i=1}^{|V_1|} w(v_i)\,\vec{q}_i = -\sum_{k=1}^{|E_1|} f(e_k)\,\vec{d}_k$ holds.
If we can prove that the constraint (13) holds for the optimal
solution of Problem 4, such a solution must also be optimal
for Problem 3, according to Theorem 4.
There exist plenty of algorithms to solve Problem 4 [1,
2]. Although those algorithms are targeted to the scalar
system (the vector length equals 1), some of them can
be directly adapted to our system by vector summation,
subtraction and comparison operations. In [2], the authors
present a network simplex algorithm, which can be directly
utilized to solve our system. The algorithmic complexity,
however, is exponential in the worst case in terms of the
number of nodes and edges in $G_1$. In [1], the authors present
several graph-based polynomial-time algorithms, for example,
the successive shortest path algorithm with the complexity
$O(|V_1|^3)$, the double scaling algorithm with the complexity
$O(|V_1||E_1|\log|V_1|)$, and so on. From [1], the current fastest
polynomial-time algorithm for solving the network flow problem
is the enhanced capacity scaling algorithm with the complexity
$O((|E_1|\log|V_1|)(|E_1| + |V_1|\log|V_1|))$. For these algorithms, we
have the following lemma.
LEMMA 6. For any optimal solution of $\vec{q}_i$ in Problem
4, there exists a spanning tree $T$ in $G_1$ such that each edge
$e = \langle v_i, v_j \rangle$ in $T$ satisfies $\vec{q}_j - \vec{q}_i + \vec{d}_k = \vec{0}$.
PROOF. This is true due to the foundation of the simplex
method [2]. □
Let $T$ be the spanning tree in Lemma 6. If we fix any $\vec{q}$ to
be $\vec{0}$, all $\vec{q}_i$, $1 \le i \le |V_1|$, can be determined uniquely. With
such uniquely-determined $\vec{q}_i$, we have
(25)
For any $e = \langle v_i, v_j \rangle \in E_1$ with annotation $\vec{d}_k$, with the
inequality (25), we have
(26)
For the inequality (26), based on the inequality (16), we
have
$$|\vec{q}_j - \vec{q}_i + \vec{d}_k| < \vec{\beta}, \quad e = \langle v_i, v_j \rangle \in E_1 \text{ is annotated with } \vec{d}_k. \qquad (27)$$
LEMMA 7. $\vec{\sigma}\,(\vec{q}_j - \vec{q}_i + \vec{d}_k) \ge 0$, where $e = \langle v_i, v_j \rangle \in E_1$
is annotated with $\vec{d}_k$, subject to the constraints (12) and
(27).
PROOF. If $\vec{q}_j - \vec{q}_i + \vec{d}_k = \vec{0}$, then $\vec{\sigma}\,(\vec{q}_j - \vec{q}_i + \vec{d}_k) \ge 0$
holds.
Otherwise, assume the first non-zero component is the $h$th
for $\vec{q}_j - \vec{q}_i + \vec{d}_k$. Then, $q_j^{(s)} - q_i^{(s)} + d_k^{(s)} = 0$, $1 \le s \le h-1$,
and $q_j^{(h)} - q_i^{(h)} + d_k^{(h)} > 0$.
With the constraint (27), we have
$$\begin{aligned}
\vec{\sigma}\,(\vec{q}_j - \vec{q}_i + \vec{d}_k)
 &\ge \vec{\sigma}\,(0, \ldots, 0,\; q_j^{(h)} - q_i^{(h)} + d_k^{(h)},\; -\beta^{(h+1)} + 1,\; \ldots,\; -\beta^{(n)} + 1) \\
 &= \sigma^{(h)}(q_j^{(h)} - q_i^{(h)} + d_k^{(h)}) - \sigma^{(h+1)}\beta^{(h+1)} + \sigma^{(h+1)} - \cdots - \sigma^{(n)}\beta^{(n)} + \sigma^{(n)} \\
 &= \sigma^{(h)}(q_j^{(h)} - q_i^{(h)} + d_k^{(h)} - 1) + \sigma^{(n)} \\
 &> 0. \qquad \Box
\end{aligned}$$
Hence, Inequality (16) guarantees that the constraint (13)
always holds when the optimality of Problem 4 is achieved.
The optimal solution for Problem 4 is also an optimal
solution for Problem 3.
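The argument of Lemma 7 can be checked by brute force on small cases. The sketch below is illustrative only and rests on an assumption that is not stated in this section: it takes $\sigma^{(n)} = 1$ and $\sigma^{(h)} = \sigma^{(h+1)}\beta^{(h+1)}$, which is the relation the telescoping step of the proof relies on. The function names are hypothetical.

```python
from itertools import product


def sigma_from_beta(beta):
    """Assumed relation: sigma[n-1] = 1 and sigma[h] = sigma[h+1] * beta[h+1]."""
    n = len(beta)
    sigma = [1] * n
    for h in range(n - 2, -1, -1):
        sigma[h] = sigma[h + 1] * beta[h + 1]
    return sigma


def lex_nonnegative(x):
    """True if x is the zero vector or its first non-zero component is positive."""
    for c in x:
        if c != 0:
            return c > 0
    return True


def lemma7_holds(beta):
    """Enumerate every x with |x[s]| < beta[s] and check that
    lexicographic non-negativity implies sigma . x >= 0."""
    sigma = sigma_from_beta(beta)
    ranges = [range(-b + 1, b) for b in beta]
    for x in product(*ranges):
        if lex_nonnegative(x) and sum(s * c for s, c in zip(sigma, x)) < 0:
            return False
    return True
```

With `beta = (3, 4, 5)` the assumed weights are `[20, 5, 1]`, and the tightest case `(1, -3, -4)` still gives a positive inner product of 1.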
Input: $G_1 = (V_1, E_1)$
Output: $q_i$, $1 \le i \le |V_1|$
Procedure:
  $f'(e_k) = 0$ for $1 \le k \le |E_1|$, $q_i = 0$ for $1 \le i \le |V_1|$.
  $e(v_i) = w'(v_i)$ for $1 \le i \le |V_1|$.
  Initialize the sets $E = \{v_i \mid e(v_i) < 0\}$ and $D = \{v_i \mid e(v_i) > 0\}$.
  while ($E \ne \emptyset$) do
    Select a node $v_k \in E$ and $v_l \in D$.
    Determine shortest path distances $\pi_j$ from node $v_k$ to all
      other nodes in $G_1$ with respect to the residue costs
      $c_{ij} = d_{ij} - q_i + q_j$, where the edge $\langle v_i, v_j \rangle$
      is annotated with $d_{ij}$ in $G_1$.
    Let $P$ denote a shortest path from $v_k$ to $v_l$.
    Update $q_i = q_i - \pi_i$, $1 \le i \le |V_1|$.
    $\delta = \min(-e(v_k),\, e(v_l),\, \min\{r_{ij} \mid \langle v_i, v_j \rangle \in P\})$, where $r_{ij}$ is
      the flow value in the residue network flow graph.
    Augment $\delta$ units of flow along the path $P$.
    Update $f'(e_k)$, $E$, $D$, $e(v_i)$ and the residue graph.
  end while
Figure 6: The successive shortest path algorithm
3.5 Successive Shortest Path Algorithm
We now briefly present one network flow algorithm, the successive
shortest path algorithm [1], which can be used to solve
Problem 4.
The algorithm is depicted in Figure 6. We let $f(e_k) =
f'(e_k)\vec{\sigma}$ and $w(v_i) = w'(v_i)\vec{\sigma}$, where $f'(e_k)$ and $w'(v_i)$ are
scalars. After the first while iteration, the algorithm always
maintains feasible shifting factors and nonnegativity of flow
values by satisfying the constraints (12) and (24). It adjusts
the flow values such that the constraint (23) holds for all
edges in $G_1$ when the algorithm ends. For the complete
description of the algorithm, including the concept of reduced
cost and residue network flow graph, the semantics of sets $E$
and $D$, etc., please refer to [1].
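As a rough illustration of the idea, the following Python sketch implements a successive shortest path scheme for the scalar case (vector length 1). It is not the paper's implementation: it assumes integer node weights that sum to zero, uncapacitated edges and no negative-cost cycles, it searches the residual graph with Bellman-Ford on the raw costs in every iteration, and it derives the node values $q_i$ once at the end instead of updating them inside the loop as Figure 6 does. All function names are hypothetical.

```python
INF = float("inf")


def bellman_ford(n, edges, flow, dist):
    """Relax residual arcs until stable. dist is updated in place.
    Returns pred, where pred[v] = (edge index, previous node, +1 or -1)."""
    pred = [None] * n
    for _ in range(n + 1):
        changed = False
        for k, (i, j, d) in enumerate(edges):
            if dist[i] + d < dist[j]:                    # forward arc, unlimited capacity
                dist[j], pred[j], changed = dist[i] + d, (k, i, +1), True
            if flow[k] > 0 and dist[j] - d < dist[i]:    # backward arc, capacity flow[k]
                dist[i], pred[i], changed = dist[j] - d, (k, j, -1), True
        if not changed:
            return pred
    raise ValueError("negative-cost cycle in the residual graph")


def successive_shortest_path(n, edges, w):
    """n nodes 0..n-1, edges = [(i, j, d)], w[i] = in-flow minus out-flow at node i.
    Returns (flow, q) with q_j - q_i + d >= 0 on every edge."""
    if sum(w) != 0:
        raise ValueError("total node weight must be zero")
    flow = [0] * len(edges)
    e = list(w)                                          # e[v] < 0: v must still send -e[v] units
    while any(x < 0 for x in e):
        src = next(v for v in range(n) if e[v] < 0)
        dist = [INF] * n
        dist[src] = 0
        pred = bellman_ford(n, edges, flow, dist)
        dst = next((v for v in range(n) if e[v] > 0 and dist[v] < INF), None)
        if dst is None:
            raise ValueError("no legal network flow")
        # Walk the shortest path backwards and find the bottleneck delta.
        path, v, delta = [], dst, min(-e[src], e[dst])
        while v != src:
            k, u, sign = pred[v]
            if sign < 0:
                delta = min(delta, flow[k])
            path.append((k, sign))
            v = u
        for k, sign in path:                             # augment delta units along the path
            flow[k] += sign * delta
        e[src] += delta
        e[dst] -= delta
    # Node values from shortest distances in the final residual graph.
    dist = [0] * n
    bellman_ford(n, edges, flow, dist)
    return flow, [-x for x in dist]
```

For a three-node chain with weights `[-1, 0, 1]` and edge costs `0` and `-3`, one unit of flow traverses both edges, and the resulting node values make the two objective values of the duality condition coincide.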
We have developed a code generation scheme as well as three
linear-time heuristics for fast compilation. Figure 7 shows
the transformed code for Example 2 after memory reduction.
See [26] for details.
4. REFINEMENTS
4.1 Controlled Fusion
Although array contraction after loop fusion will decrease
the overall memory requirement, fusing too many loops can
potentially increase the working set size of the loop body,
hence it can potentially increase register spilling and cache
misses. This is particularly true if a large number of loops
are under consideration. To control the number of fused
loops, after computing the shifting factors to minimize the
memory requirement, we use a simple greedy heuristic,
Pick_and_Reject (see Figure 8), to incrementally select loop
nests to be actually fused. If a new addition will cause
the estimated cache misses and register spills to be worse
than before fusion, then the loop nest under consideration
will not be fused. The heuristic then continues to select
fusion candidates from the remaining loop nests. The loop
nests are examined in an order such that the loops whose
fusion saves memory most are considered first. We estimate
register spilling by using the approach in [22] and estimate
cache misses by using the approach in [7].
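The greedy selection can be sketched as follows. This is a minimal illustration, not the compiler's code: the function `pick_and_reject`, its parameters, and the `estimate` callback (standing in for the register-spill and cache-miss estimators of [22] and [7]) are all hypothetical names introduced here.

```python
def pick_and_reject(loops, saving, touches, estimate, np_orig, nm_orig):
    """Greedy fusion-candidate selection.

    loops:    loop-nest identifiers (the initial candidate set)
    saving:   saving[l] = size of the local array region written in l
              minus the size of its predicted reference window
    touches:  touches[l] = loop nests that reference the region written in l
    estimate: callable(set of loop nests) -> (register spills, cache misses)
              estimated after fusing that set and contracting arrays
    np_orig, nm_orig: estimated spills and misses of the original loop nests
    """
    fused, candidates = set(), set(loops)
    while candidates:
        # Pick the loop nest whose written region can be reduced most.
        li = max(sorted(candidates), key=lambda l: saving[l])
        related = (set(touches[li]) | {li}) & candidates
        spills, misses = estimate(fused | related)
        if spills <= np_orig and misses <= nm_orig:
            fused |= related            # accept: fuse all loops touching the region
            candidates -= related
        else:
            candidates.discard(li)      # reject this loop nest only
    return fused
```

With three loop nests, a toy estimator that reports one spill per fused loop, and a budget of two spills, the sketch accepts the pair with the largest saving and rejects the third loop.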
It may also be important to avoid fusing at too many loop
REAL*8 ZA(2:KN), za0, za1, ZB0(2:JN), ZB1(2:JN), zb
DO J = 2, JN
  ...
  za1 = ZP(J-1,K) + ZR(J-1,K-2)
  zb = ZQ(J-1,K) + ZZ(J,K)
  IF (J.EQ.2) THEN
    ZP(J,K-1) = ZP(J,K-1) + za1 - ZA(K-1)
                - ZB1(J) + zb
    ZQ(J,K-1) = ZQ(J,K-1) + za1 + ZA(K-1)
                + ZB1(J) + zb
  ELSE
    ZP(J,K-1) = ZP(J,K-1) + za1 - za0
                - ZB1(J) + zb
    ZQ(J,K-1) = ZQ(J,K-1) + za1 + za0
                + ZB1(J) + zb
  END IF
  ...
  za1 = ZP(J-1,KN+1) + ZR(J-1,KN-1)
  IF (J.EQ.2) THEN
    ZP(J,KN) = ZP(J,KN) + za1 - ZA(KN)
               - ZB1(J) + ZB0(J)
    ZQ(J,KN) = ZQ(J,KN) + za1 + ZA(KN)
               + ZB1(J) + ZB0(J)
  ELSE
    ZP(J,KN) = ZP(J,KN) + za1 - za0
               - ZB1(J) + ZB0(J)
  ...





Figure 7: The transformed code for Figure 4(a) after
memory reduction
levels if loops are shifted. This is because, after loop shifting,
fusing too many loop levels can potentially increase the
number of operations due to the IF-statements added in the
loop body or due to the effect of loop peeling. Coalescing,
if applied, may also introduce more subscript computation
overhead. Although all such costs tend to be less significant
than the costs of cache misses and register spills, we carefully
control the fusion of innermost loops. If the rate of increased
operations after fusion exceeds a certain threshold, we only
fuse the outer loops.
4.2 Enabling Loop Transformations
We use several well-known loop transformations to enable
effective fusion. Long backward data-dependence distances
make loop fusion ineffective for memory reduction. Such
long distances are sometimes due to incompatible loops [27],
which can be corrected by loop interchange. Long backward
distances may also be due to circular data dependences,
which can be corrected by circular loop skewing [27].
Furthermore, our technique applies loop distribution to a node
$L_i$ if the dependence distance vectors originating from $L_i$
are different from each other. In this case, distributing the
loop may allow different shifting factors for the distributed
loops, potentially yielding a more favorable result.
4.3 Tiling vs. Reduction
Suppose the collection of loops in Figure 2(a) are embedded




Input: (1) a collection of $m$ loop nests, (2) $\vec{M}_{i,k}$
($1 \le i \le m$, $1 \le k \le \tau_i$), (3) the estimated number of register
spills $np$ and the estimated number of cache misses $nm$, both in
the original loop nests.
Output: A set of loop nests to be fused, FS.
Procedure:
1. Initialize FS to be empty. Let OS initially contain all the $m$
loop nests.
2. If OS is empty, return FS. Otherwise, select a loop nest $L_i$
from OS such that the local array regions $R$ written in $L_i$ can
be reduced most, i.e., the difference between the size of $R$ and
the size of the predicted reference window for $L_i$ is no smaller
than that for any other loop nest in OS. Let TR be the set of
loop nests in OS which contain references to $R$. Estimate $a$,
the number of register spills, and $b$, the number of cache misses,
after fusing the loops in both FS and TR and after performing
array contraction for the fused loop. If ($a \le np \wedge b \le nm$), then
FS ← FS ∪ TR, OS ← OS − TR. Otherwise, OS ← OS − {$L_i$},
and go to step 2.
Figure 8: Procedure Pick_and_Reject
print is the same in every T iteration. It is possible then
to perform tiling on the whole T loop nest so as to exploit
temporal locality across different T iterations [27]. On the
other hand, after loop shifting plus fusion, the T loop and
the fused loops form an (n + 1)-level perfectly-nested loop
nest. This resulting loop nest would appear to be a perfect
candidate for tiling, since many tiling algorithms apply to
perfectly-nested loops only. However, the kind of shifting
required for memory reduction often introduces very long
backward dependences, which actually prevent profitable tiling.
(On the other hand, partial memory reduction, which may
not minimize the memory requirement, may allow profitable
tiling. The interaction between partial memory reduction
and tiling seems an interesting topic for our future research.)
Where tiling and memory reduction can be performed
separately, but not simultaneously, we need to make a choice,
and we do so based on simple estimations of the cache miss
penalty.
Let $C_{b1}$ be the L1-cache line size and $C_{b2}$ be the L2-cache
line size, both measured in the number of data elements.
Let $p_1$ be the L1-cache miss penalty and $p_2$ be the L2-cache
miss penalty. Further, let $W$ represent the footprint of the
original loop body (in the number of data elements). We
estimate the average cache miss penalty for each T-iteration
in the original code by
$$\frac{p_1 W}{C_{b1}} + \frac{p_2 W}{C_{b2}}. \qquad (28)$$
Likewise, let $W_1$ represent the footprint of the fused loop
body after memory reduction. We estimate the average
cache miss penalty for each T-iteration in the fused code
by
$$\frac{p_1 W_1}{C_{b1}} + \frac{p_2 W_1}{C_{b2}}. \qquad (29)$$
Obviously, $W_1$ is expected to be smaller than $W$. Under
extreme circumstances, $W_1$ may completely fit in a certain
cache, say the L2 cache; the estimation is then revised to
remove the miss penalty on that cache.
To tile the T loop nest, certain arrays may be duplicated
[27]. Let $W_2$ represent the array footprint size of the loop
body after the array duplication phase (in the number of
data elements) but before tiling. ($W_2$ may then be greater
than $W$.) The average cache miss penalty for each T-iteration
after tiling will depend on the number of inner-loop levels
which are tiled. Based on a detailed calculation [26], we
derive the average miss penalty per T-iteration under two-level
tiling as
$$\frac{p_1 S_1 W_2}{C_{b1} B_1} + \frac{p_2 S_2 W_2}{C_{b2} B_2}, \qquad (30)$$
where $S_1$ and $S_2$ represent the skew factors of tiling and
$(B_1, B_2)$ represent the tile size.
The average miss penalty per T-iteration under one-level
tiling is estimated as
$$\frac{p_1 W_2}{C_{b1}} + \frac{p_2 S_1 W_2}{C_{b2} B_1}. \qquad (31)$$
Which transformation to choose is then determined by a
comparison of the estimated cache miss penalties. Our
experimental results will show that these simple cost models
work quite well.
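The comparison can be illustrated with a small Python sketch. The function names, the example footprints, skew factor and tile size are assumptions made for illustration only, and the tiling formulas follow the reading of the estimates for two-level and one-level tiling given above. The penalties and line sizes in the usage note are the Ultra II figures quoted in Section 5, with line sizes converted to 8-byte elements.

```python
def penalty_untiled(p1, p2, cb1, cb2, w, fits_l1=False, fits_l2=False):
    """Estimated miss penalty per T-iteration for a footprint of w elements.
    A cache that holds the whole footprint contributes no penalty."""
    l1 = 0.0 if fits_l1 else p1 * w / cb1
    l2 = 0.0 if fits_l2 else p2 * w / cb2
    return l1 + l2


def penalty_two_level_tiling(p1, p2, cb1, cb2, w2, s1, s2, b1, b2):
    """Two-level tiling estimate with skew factors (s1, s2) and tile size (b1, b2)."""
    return p1 * s1 * w2 / (cb1 * b1) + p2 * s2 * w2 / (cb2 * b2)


def penalty_one_level_tiling(p1, p2, cb1, cb2, w2, s1, b1):
    """One-level tiling estimate: only the L2 term benefits from the tile."""
    return p1 * w2 / cb1 + p2 * s1 * w2 / (cb2 * b1)


def choose(estimates):
    """Return the name of the transformation with the smallest estimated penalty."""
    return min(estimates, key=estimates.get)
```

For example, with `p1 = 6`, `p2 = 45`, `cb1 = 2` and `cb2 = 8`, an original footprint of 1000 elements gives an estimate of 8625 cycles per T-iteration. A hypothetical reduced footprint of 500 elements gives 4312.5, while one-level tiling of a hypothetical 1200-element footprint with skew factor 1 and tile size 16 gives 4021.875, so this sketch would pick tiling for those made-up inputs.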
5. EXPERIMENTAL RESULTS
We have implemented our memory reduction technique in a
research compiler, Panorama [12]. We implemented a network
flow algorithm, the successive shortest path algorithm [1].
The loop dependence graphs in our experiments are relatively
simple. The successive shortest path algorithm takes
less than 0.06 seconds for each of the benchmarks. To
measure its effectiveness, we tested our memory reduction
technique on 20 benchmarks on a SUN Ultra II uniprocessor
workstation and on a MIPS R10K processor within an SGI
Origin 2000 multiprocessor. The Ultra II processor has a
16KB directly-mapped L1 data cache with a 16-byte cache
line, and it has a 2MB directly-mapped unified L2 cache with
a 64-byte cache line. The cache miss penalty is 6 machine
cycles for the L1 data cache and 45 machine cycles for the L2
cache. The MIPS R10K has a 32KB 2-way set-associative
L1 data cache with a 32-byte cache line, and it has a 4MB
2-way set-associative unified L2 cache with a 128-byte cache
line. The cache miss penalty is 9 machine cycles for the L1
data cache and 68 machine cycles for the L2 cache.
5.1 Benchmarks and Memory Reduction
Table 1 lists the benchmarks used in our experiments, their
descriptions and their input parameters. In the table, "m/n"
represents the number of loops in the loop sequence (m) and
the maximum loop nesting level (n). Note that the array size
and the iteration counts are chosen arbitrarily for LL14, LL18
and Jacobi. To differentiate the two versions of swim in SPEC95
and SPEC2000, we call the SPEC95 version swim1 and
the SPEC2000 version swim2. Swim2 is almost identical
to swim1 except for its larger data size. For combustion, we
change the array size (N1 and N2) from 1 to 10, so the
execution time will last for several seconds. Programs climate,
laplace-jb, laplace-gs and all the Purdue set problems






















Table 1: Test programs
Benchmark    Description
LL14         Livermore Loop No. 14
LL18         Livermore Loop No. 18
Jacobi       Jacobi kernel w/o convergence test
tomcatv      A mesh generation program from SPEC95fp
swim1        A weather prediction program from SPEC95fp
swim2        A weather prediction program from SPEC2000fp
hydro2d      An astrophysical program from SPEC95fp
lucas        A primality test from SPEC2000fp
mg           A multigrid solver from NPB2.3-serial benchmark
combustion   A thermochemical program from Chaos group
purdue-02    Purdue set problem02
purdue-03    Purdue set problem03
purdue-04    Purdue set problem04
purdue-07    Purdue set problem07
purdue-08    Purdue set problem08
purdue-12    Purdue set problem12
purdue-13    Purdue set problem13
climate      A two-layer shallow water climate model from Rice
laplace-jb   Jacobi method of Laplace from Rice
laplace-gs   Gauss-Seidel method of Laplace from Rice
Input Parameters
N = 1001, ITMAX



























Figure 9: Memory sizes before and after transfor-
mation on the Ultra II
For each of the benchmarks, we examine three versions of
the code, i.e. the original one, the one after loop fusion but
before array contraction, and the one after array contraction.
For all versions of the benchmarks, we use the native
Fortran compilers to produce the machine codes. On the
Ultra II, we follow the recommendations from SUN's optimizing
compiler group and use the following optimization flags.
For the original tomcatv code, we use "-fast -xchip=ultra2
-xarch=v8plusa -xpad=local:23". For all versions of swim1
and swim2, we use "-fast -xchip=ultra2 -xarch=v8plusa
-xpad=common:15". For all versions of combustion, we
simply use "-fast" because it produces better-performing codes







21]. Except for lucas, all the other benchmarks are written
in F77. We manually apply our technique to lucas, which is
written in F90. Among the 20 benchmark programs, our
algorithm finds that all purdue-set programs, lucas, LL14 and
combustion do not need to perform loop shifting. For each
of the benchmarks in Table 1, all m loops are fused together.
For swim1, swim2 and hydro2d, where n = 2, only the outer
loops are fused. For all other benchmarks, all n loop levels
are fused.
Figure 10: Memory sizes before and after transformation on the R10K









Data size for the original programs (unit: KB)
Benchmark    Data size
LL14         [illegible]
LL18         11520
Jacobi       193110
tomcatv      14750
swim1        1~794
swim2        191000
hydro2d      1\405
lucas        142000
mg           8300
combustion   [illegible]
purdue-02    4198
purdue-03    4198
purdue-04    4194
purdue-07    [illegible]
purdue-08    4720
purdue-12    [illegible]
purdue-13    [illegible]
climate      [illegible]
laplace-jb   [illegible]
laplace-gs   [illegible]








Figure 11: Performance before and after transformation on the Ultra II
"-fast -xchip=ultra2 -xarch=v8plusa -fsimple=2". When we
compare the best results of different versions, we switch on
and off prefetching (i.e. the "-xprefetch" flag) and pick the
better result for each version. On the R10K, we simply use
the optimization flag "-O3" except with the following
adjustments. We switch off prefetching for laplace-jb, software
pipelining for laplace-gs, and loop unrolling for purdue-03.
For swim1 and swim2, the native compiler fails to insert
prefetch instructions in the innermost loop body after memory
reduction. We manually insert prefetch instructions into
the three key innermost loop bodies, following exactly the
same prefetching patterns used by the native compiler for
the original codes.
Figure 9 compares the code sizes and the data sizes of the
original and the transformed codes on the Ultra II. The data
size shown for each original program is normalized to 100.
The actual data sizes vary greatly among the benchmarks
and are listed in the table associated with the figure.
Similarly, Figure 10 compares the data sizes and the code sizes
on the R10K. For mg and climate, the memory requirement
differs little before and after the program transformation.
This is due to the small size of the contractable local array.
For all other benchmarks, our technique reduces the memory
requirement noticeably on both machines. The arithmetic
mean of the reduction rate, counting both the data and the
code, is 50% for all benchmarks on both machines.
Specifically, the arithmetic mean is 49% on the Ultra II alone, and
51% on the R10K. For several small purdue benchmarks,
the reduction rate is almost 100%.
5.2 Performance
Figure 11 compares the normalized execution time on the
Ultra II, where "Mid" represents the execution time of the
codes after loop fusion but before array contraction, and
"Final" represents the execution time of the codes after array
contraction. Similarly, Figure 12 compares the normalized
execution time on the R10K. The geometric mean of
speedup after memory reduction is 1.57 for all benchmarks
running on both machines. The geometric mean is 1.73 on
the Ultra II alone, and it is 1.40 on the R10K alone.
The best speedup is achieved for program purdue-03, which
is 5.67 on the R10K and 41.3 on the Ultra II. This program
contains two local arrays, A(1024, 1024) and P(1024), which
carry values between three adjacent loop nests. Our technique
is able to reduce both arrays into scalars and to fuse
three loops into one. After comparing the assembly codes
on both machines, we found the reason for the less dramatic
speedup on the R10K. Prefetching instructions inserted by
the native compiler hide memory latency quite well, better
than those inserted by the Ultra II's compiler in this case.
Excluding program purdue-03 on the Ultra II, the geometric
mean of speedup after memory reduction is 1.41 for all
other combinations of benchmarks and machines.
Figure 12: Performance before and after transformation on the R10K
We see that three programs are actually slowed down slightly
after memory reduction. The execution time of both purdue-13
and laplace-gs on the Ultra II is increased by 2%. The
execution time of purdue-08 on the R10K is increased by 1%.
Both purdue-08 and purdue-13 make several math library
function calls which dominate the execution time. For
laplace-gs, loop peeling is applied, which may reduce the
effectiveness of scalar replacement and increase the number
of total memory references.
Gao et al. propose to perform array contraction enabled
by loop fusion only [10]. With their technique, the geometric
mean of speedup after array contraction is 1.30 for all
benchmarks on both machines.
5.3 Memory Reference Statistics
To further understand the effect of memory reduction on
the performance, we examined the cache behavior of different
versions of the tested benchmarks. We measured the
reference count (dynamic load/store instructions), the miss
count of the L1 data cache, and the miss count of the L2
unified cache on both machines. We use the perfex package
on the MIPS R10K and the perfmon package on the
Ultra II to get the cache statistics. Figures 13 and 14 compare
such statistics on the Ultra II, where the total reference
counts in the original codes are normalized to 100. Similarly,
Figures 15 and 16 compare the statistics on the R10K.
Figure 13: Cache statistics before and after transformation on the Ultra II
(Original, Mid and Final are from left to right for each benchmark)
Figure 16: Cache statistics before and after transformation on the R10K (cont.)
(Original, Mid and Final are from left to right for each benchmark)
When arrays are contracted to scalars, register reuse is often
increased. Figures 13 to 16 show that the number of
total references is decreased in most of the cases. The total
number of references, counting all benchmarks on
both machines, is reduced by 21.1% after memory reduction.
Specifically, the reduction rate is 20.0% on the Ultra
II alone, and it is 22.3% on the R10K alone. However, in a
few cases, the total reference counts are increased instead.
We examined the assembly codes and found a number of
reasons:
1. The fused loop body contains more scalar references in
a single iteration than before fusion. This increases the
register pressure and sometimes causes more register
spilling.
2. The native compilers can perform scalar replacement [3]
for references to noncontracted arrays. The fused loop








Figure 14: Cache statistics before and after transformation on the Ultra II (cont.)
(Original, Mid and Final are from left to right for each benchmark)
Figure 15: Cache statistics before and after transformation on the R10K
• If register pressure is high in a certain loop, the
native compiler may choose not to perform scalar
replacement.
• After loop fusion, the array dataflow may become
more complex, which then may defeat the native
compiler in its attempt to perform scalar replacement.
3. Loop peeling may decrease the effectiveness of scalar
replacement since fewer loop iterations benefit from
it.
Despite the possibility of increased memory reference counts
in a few cases due to the above reasons, Figures 13 to 16 show
that cache misses are generally reduced by memory reduction.
The total number of cache misses, counting all benchmarks
on both machines, is reduced by 58.0% after memory
reduction. Specifically, the reduction rate is 28.6% on the
Ultra II alone, and it is 63.8% on the R10K alone. The total
number of L1 data cache misses, counting all benchmarks
on both machines, is reduced by 57.3% after memory reduction.
Specifically, the reduction rate is 27.5% on the Ultra
II alone, and it is 63.0% on the R10K alone. The improved
cache performance often seems to have a bigger impact on
execution time than the total reference counts.
Figure 17: Performance of the original programs w/ and w/o prefetching on the Ultra II
Figure 18: Performance of the transformed programs w/ and w/o prefetching on the Ultra II
5.4 Interaction with Other Compiler Optimiza-
tions
In this subsection, we examine how our memory reduction
technique affects prefetching, software pipelining, register al-
location and unroll-and-jam. The issue of concern is whether
the memory reduction makes other compiler optimizations
suffer. A performance comparison with loop tiling is also
presented.
Prefetching and Software Pipelining
On the R10K, we compared the performance impact of prefetching
and software pipelining on both the original codes and
the transformed codes. On the Ultra II, we compared the
performance impact of prefetching only, since we cannot
specifically switch off software pipelining alone for the native
compiler.
Figures 17 and 18 show the normalized execution time with
and without prefetching, on the Ultra II, for the original
programs and the transformed programs respectively. Prefetching
affects the performance little for the transformed codes






Figure 19: Performance of the original programs w/











Figure 20: Performance of the original programs w/







Figure 21: Performance of the transformed pro-
grams wi and w/o prefetching and software pipelin-




Figure 22: Performance of the transformed programs w/ and w/o prefetching and software pipelining on the R10K (cont.)
Figure 23: Performance of the code with (unroll-and-jam, scalar replacement) on the Ultra II
Figure 24: Performance of the code with (unroll-and-jam, scalar replacement) on the R10K
execution time for the original programs, with and without
prefetching and software pipelining, on the R10K. Figures
21 and 22 show the normalized execution time for the
transformed programs. Software pipelining and prefetching
improve the performance for the transformed codes in
most cases. One exception is laplace-jb with prefetching,
where prefetching makes performance worse by 49.4%.
A close look at cache statistics with perfex shows that
prefetching increases the L1 cache miss count by 50% compared
with the code without prefetching. Another exception
is laplace-gs with software pipelining, where software
pipelining makes performance worse by 20.7%. Based on the
results from perfex, software pipelining generates 59% more
floating point instructions than without software pipelining
(536M vs. 337M).
Register Allocation
As stated earlier in this section, loop fusion may potentially
increase register pressure and thus may potentially reduce
register reuse. Figures 13 and 14 show that, in 5 of the
20 codes transformed for the Ultra II, slightly more memory
references are issued than in the original codes. Figures 15
and 16 show that, in just two of the 20 codes transformed for
the R10K, slightly more memory references are issued than
in the original codes. Loop fusion seems to have degraded
register reuse somewhat in those codes. However, we should
point out that, except in three cases (laplace-gs on the
Ultra II and swim1 and swim2 on the R10K), all those
transformed codes in question actually run faster than their
original codes. Nonetheless, it is useful to examine the register
matter further.
One interesting question is whether the seemingly degraded
register utilization is truly due to the increased register
pressure. Alternatively, it might be due to the native
compiler's inability to properly perform scalar replacement and
unroll-and-jam on the fused loop body. (These two techniques
are important to good register allocation.) To find
the answer, we manually applied unroll-and-jam and scalar
replacement to the codes of concern. We experimented with
unroll factors from 1 to 4 (a factor of 1 meaning no
unrolling), and we applied scalar replacement where possible.
We then picked the best results. Figures 23 and 24 show the
results on the Ultra II and on the R10K respectively, where
"Org" stands for the original code, "Org-Unroll" for the
original code with unroll-and-jam plus scalar replacement
manually applied, "Trans" for the transformed code, and
"Trans-Unroll" for the transformed code with unroll-and-jam
plus scalar replacement applied. From these figures, we
conclude that loop fusion indeed increases register pressure
somewhat, as unroll-and-jam and scalar replacement applied
manually do not seem to make much difference, before or
after memory reduction.
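For readers unfamiliar with the two transformations, the following is a minimal sketch of unroll-and-jam with an unroll factor of 2 followed by scalar replacement. It is written in Python for brevity, and the matrix-vector kernel and the function names `matvec_original` and `matvec_unroll_jam` are assumptions made for illustration; they are not taken from the benchmark codes.

```python
def matvec_original(a, x):
    """y[i] = sum over j of a[i][j] * x[j], as a plain two-level loop nest."""
    n, m = len(a), len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] = y[i] + a[i][j] * x[j]
    return y


def matvec_unroll_jam(a, x):
    """Same computation after unroll-and-jam (factor 2) on the i loop
    and scalar replacement of y[i], y[i+1] and x[j]."""
    n, m = len(a), len(x)
    y = [0.0] * n
    i = 0
    while i + 1 < n:
        y0 = 0.0  # scalar standing in for y[i]
        y1 = 0.0  # scalar standing in for y[i+1]
        for j in range(m):
            xj = x[j]  # loaded once, reused by both unrolled rows
            y0 = y0 + a[i][j] * xj
            y1 = y1 + a[i + 1][j] * xj
        y[i] = y0
        y[i + 1] = y1
        i += 2
    if i < n:  # remainder iteration when n is odd
        y0 = 0.0
        for j in range(m):
            y0 = y0 + a[i][j] * x[j]
        y[i] = y0
    return y
```

Each additional unrolled row adds one more live scalar in the inner loop, which is how larger unroll factors, like fused loop bodies, raise register pressure.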
Comparison with Tiling
As stated in Section 4.3, for certain loop sequences, both
tiling and memory reduction may be applied profitably. In
our benchmarks, LL18, Jacobi, tomcatv, swim1 and
swim2 can be tiled profitably. Table 2 compares the
performance of memory reduction and tiling. In this
table, Jacobi is tiled at two loop levels. The other four
programs are tiled at one loop level only. Even though LL18
can be legally tiled at two levels, its performance is then
poorer than with 1-level tiling.
Table 2: Performance of memory reduction vs. tiling (in seconds)

Benchmarks  |             Ultra II               |               R10K
            | Mem-Red      Tiling        Ratio   | Mem-Red      Tiling        Ratio
LL18        | (illegible)  (illegible)   1.11    | 4.79         (illegible)   0.64
Jacobi      | 88.2         32.9          2.20    | 68.58        39.64         1.73
tomcatv     | 80.1         96.0          0.83    | 70.26        72.73         0.97
swim1       | 118.4        74.5          1.50    | (illegible)  58.50         1.47
swim2       | (illegible)  (illegible)   1.09    | (illegible)  (illegible)   1.27

Using the cost estimation in Section 4.3, our research
compiler chooses 2-level tiling for Jacobi on both machines. It
chooses 1-level tiling for LL18, swim1 and swim2, and chooses
memory reduction for tomcatv, also on both machines. This
turns out to be correct in 9 out of the 10 cases. The
exception is LL18 on the R10K. The tiled assembly code of
LL18 on the R10K shows that the loop index variables of
the tile-controlling loop and the time-step loop (i.e. the T
loop) are spilled heavily, thus introducing significantly more
load/store instructions than the code with memory
reduction.
6. RELATED WORK
The work by Fraboulet et al. is the closest to our mem-
ory reduction technique [8]. Given a perfectly-nested loop,
they use loop alignment to adjust the iteration space for in-
dividual statements such that the total buffer size can be
minimized. Unlike our work, they formulate the optimization
problem as a network flow problem only for the 1-D case,
in a form different from ours. For the multi-dimensional case,
they apply the 1-D formulation loop level by loop level. They
do not present any experimental results, and they do not
consider the effect of memory reduction on cache behavior
and execution speed.
Callahan et al. present unroll-and-jam and scalar replace-
ment techniques to replace array references with scalar vari-
ables to improve register allocation [3]. However, they only
consider the innermost loop in a perfect loop nest. They do
not consider loop fusion, nor do they consider partial array
contraction. Gao and Sarkar present collective loop
fusion [10]. They perform loop fusion to replace arrays with
scalars, but they do not consider partial array contraction.
They do not perform loop shifting, therefore they cannot
fuse loops with fusion-preventing dependences. Sarkar and
Gao perform loop permutation and loop reversal to enable
collective loop fusion [23]. These enabling techniques can
also be used in our framework.
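To make the role of loop shifting concrete, the following is a minimal sketch in Python of a hypothetical two-loop kernel (not drawn from the benchmarks; the function names are assumptions made for illustration). The second loop reads a[i+1], which the first loop has not yet produced at iteration i, so the two loops cannot be fused directly. Shifting the second loop by one iteration removes this fusion-preventing dependence, after which the temporary array a contracts to two scalars.

```python
def two_loops_original(b):
    """Two separate loops communicating through a temporary array a."""
    n = len(b)
    a = [0.0] * n
    for i in range(n):            # loop L1 produces a
        a[i] = b[i] * 2.0
    c = [0.0] * max(n - 1, 0)
    for i in range(n - 1):        # loop L2 consumes a[i] and a[i+1]
        c[i] = a[i] + a[i + 1]
    return c


def shifted_fused_contracted(b):
    """L2 is shifted by one iteration, fused with L1, and the array a
    is contracted to the two scalars prev and cur."""
    n = len(b)
    c = [0.0] * max(n - 1, 0)
    prev = 0.0
    for i in range(n):
        cur = b[i] * 2.0          # L1 at iteration i (was a[i])
        if i >= 1:
            c[i - 1] = prev + cur  # L2 at iteration i-1 (was a[i-1] + a[i])
        prev = cur
    return c
```

In this sketch the temporary storage drops from n elements to two scalars, while the values written to c are unchanged.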
Lam et al. reduce memory usage for highly-specialized
multi-dimensional integral problems where array subscripts are
loop index variables [15]. Their program model does not
allow fusion-preventing dependences. Lewis et al. propose
to apply loop fusion and array contraction directly at the array
statement level for array languages such as F90 [16].
The same result can be achieved if the array statements
are transformed into various loops and loop fusion and array
contraction are then applied at the scalar level. They do
not consider loop shifting in their formulation. Strout et al.
consider the minimum working set which permits tiling for
loops with a regular stencil of dependences [28]. Their method
applies to perfectly-nested loops only. In [6], Ding indicates
the potential of combining loop fusion and array contraction
through an example. However, he does not apply loop
shifting and does not provide formal algorithms or evaluations.
Gannon et al. introduce the concept of a reference window,
using it to estimate the cache hit rate and to guide program
optimization for a software-controlled cache [9]. They do
not address the memory reduction problem.
There is a large body of work related to loop fusion. To name
a few, Kennedy and McKinley prove that maximizing data
locality by loop fusion is NP-hard [13]. They provide two
polynomial-time heuristics. Singhai and McKinley present
parameterized loop fusion to improve parallelism and cache
locality [25]. They do not perform memory reduction or
loop shifting. Megiddo and Sarkar use mixed integer
programming to optimize weighted loop fusion for parallel
programs [19]. Recently, Darte analyzes the complexity of loop
fusion [5] and claims that the problem of maximum fusion
of parallel loops with constant dependence distances is
NP-complete when combined with loop shifting. None of these
works addresses the issue of minimizing the memory requirement
for a collection of loops, and their techniques are very
different from ours. Manjikian and Abdelrahman present
shift-and-peel [17]. They shift the loops in order to enable fusion.
However, they do not consider array contraction.
7. CONCLUSION
In this paper, we present a locality enhancement technique,
memory reduction, which is a combination of loop shifting,
loop fusion and array contraction. We reduce the memory
reduction problem to a network flow problem, which
is solved optimally. (The current fastest algorithm has the
complexity O((|E| log |V|)(|E| + |V| log |V|)), where G = (V, E)
is the loop dependence graph.) We propose controlled fusion
to prevent excessive register spilling and cache misses
which may be caused by excessive loop fusion. We develop
a simple memory cost model for memory reduction. For
a loop nest where both tiling and memory reduction can
apply, the scheme having the smaller cost is chosen.
Experimental results so far show that our technique can reduce
the memory requirement significantly. At the same time,
it speeds up program execution by a factor of 1.57 on
average. Furthermore, memory reduction does not seem
to create difficulties for a number of other back-end
compiler optimizations. We also believe that memory
reduction by itself is vitally important to computers which are
severely memory-constrained and to applications which are
extremely memory-demanding.
8. REFERENCES
[1] R. Ahuja, T. Magnanti, and J. Orlin. Network Flows:
Theory, Algorithms, and Applications. Prentice-Hall
Inc., Englewood Cliffs, New Jersey, 1993.
[2] M. S. Bazaraa, J. J. Jarvis, and H. D. Sherali. Linear
Programming and Network Flows. Wiley, New York,
1990.
[3] D. Callahan, S. Carr, and K. Kennedy. Improving
register allocation for subscripted variables. In
Proceedings of ACM SIGPLAN 1990 Conference on
Programming Language Design and Implementation,
pages 53-65, White Plains, New York, June 1990.
[4] B. Creusillet and F. Irigoin. Interprocedural array
region analyses. International Journal of Parallel
Programming, 24(6):513-546, December 1996.
[5] A. Darte. On the complexity of loop fusion. In
Proceedings of International Conference on Parallel
Architecture and Compilation Techniques, pages
149-157, Newport Beach, California, October 1999.
[6] C. Ding. Improving Effective Bandwidth Through
Compiler Enhancement of Global and Dynamic Cache
Reuse. PhD thesis, Department of Computer Science,
Rice University, January 2000.
[7] J. Ferrante, V. Sarkar, and W. Thrash. On estimating
and enhancing cache effectiveness. In Proceedings of
4th International Workshop on Languages and
Compilers for Parallel Computing, August 1991. Also
in Lecture Notes in Computer Science, U. Banerjee,
D. Gelernter, A. Nicolau, and D. Padua, eds., pp.
328-341, Springer-Verlag, Aug. 1991.
[8] A. Fraboulet, G. Huard, and A. Mignotte. Loop
alignment for memory accesses optimization. In
Proceedings of the 12th International Symposium on
System Synthesis, Boca Raton, Florida, November
1999.
[9] D. Gannon, W. Jalby, and K. Gallivan. Strategies for
cache and local memory management by global
program transformation. Journal of Parallel and
Distributed Computing, 5(5):587-616, October 1988.
[10] G. R. Gao, R. Olsen, V. Sarkar, and R. Thekkath.
Collective loop fusion for array contraction. In the fifth
Workshop on Languages and Compilers for Parallel
Computing. Also in U. Banerjee, D. Gelernter,
A. Nicolau, and D. Padua, editors, No. 757 in Lecture
Notes in Computer Science, pages 281-295.
Springer-Verlag, 1992.
[11] T. Gross and P. Steenkiste. Structured dataflow
analysis for arrays and its use in an optimizing
compiler. Software-Practice and Experience, 20(2),
February 1990.
[12] J. Gu, Z. Li, and G. Lee. Experience with efficient
array data flow analysis for array privatization. In
Proceedings of the 6th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming,
pages 157-167, Las Vegas, NV, June 1997.
[13] K. Kennedy and K. S. McKinley. Maximizing loop
parallelism and improving data locality via loop fusion
and distribution. In Springer-Verlag Lecture Notes in
Computer Science, 768. Proceedings of the sixth
Workshop on Languages and Compilers for Parallel
Computing, Portland, Oregon, August 1993.
[14] D. J. Kuck. The Structure of Computers and
Computations, volume 1. John Wiley & Sons, 1978.
[15] C.-C. Lam, D. Cociorva, G. Baumgartner, and
P. Sadayappan. Optimization of memory usage and
communication requirements for a class of loops
implementing multi-dimensional integrals. In the
twelfth International Workshop on Languages and
Compilers for Parallel Computing, San Diego, CA,
August 1999.
[16] E. C. Lewis, C. Lin, and L. Snyder. The
implementation and evaluation of fusion and
contraction in array languages. In Proceedings of the
1998 ACM SIGPLAN Conference on Programming
Language Design and Implementation, pages 50-59,
Montreal, Canada, June 1998.
[17] N. Manjikian and T. Abdelrahman. Fusion of loops
for parallelism and locality. IEEE Transactions on
Parallel and Distributed Systems, 8(2):193-209,
February 1997.
[18] D. Maydan, S. Amarasinghe, and M. Lam. Array
data-flow analysis and its use in array privatization. In
Proceedings of ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages, pages 2-15,
Charleston, SC, January 1993.
[19] N. Megiddo and V. Sarkar. Optimal weighted loop
fusion for parallel programs. In Proceedings of the
ninth Annual ACM Symposium on Parallel Algorithms
and Architecture, pages 282-291, Newport, Rhode
Island, USA, June 1997.
[20] A. G. Mohamed, G. C. Fox, G. von Laszewski,
M. Parashar, T. Haupt, K. Mills, Y.-H. Lu, N.-T. Lin,
and N.-K. Yeh. Applications benchmark set for
fortran-d and high performance fortran. Technical
Report CRPS-TR92260, Center for Research on
Parallel Computation, Rice University, June 1992.
[21] J. Rice and J. Jing. Problems to test parallel and
vector languages. Technical Report CSD-TR-1016,
Department of Computer Science, Purdue University,
1990.
[22] V. Sarkar. Optimized unrolling of nested loops. In
Proceedings of the ACM International Conference on
Supercomputing, pages 153-166, Santa Fe, NM, May
2000.
[23] V. Sarkar and G. R. Gao. Optimization of array
accesses by collective loop transformations. In
Proceedings of 1991 ACM International Conference on
Supercomputing, pages 194-205, Cologne, Germany,
June 1991.
[24] A. Schrijver. Theory of Linear and Integer
Programming. John Wiley & Sons, 1986.
[25] S. K. Singhai and K. S. McKinley. A parameterized
loop fusion algorithm for improving parallelism and
cache locality. The Computer Journal, 40(6), 1997.
[26] Y. Song. Compiler Algorithms for Efficient Use of
Memory Systems. PhD thesis, Department of
Computer Sciences, Purdue University, November
2000.
[27] Y. Song and Z. Li. New tiling techniques to improve
cache temporal locality. In Proceedings of ACM
SIGPLAN Conference on Programming Language
Design and Implementation, pages 215-228, Atlanta,
GA, May 1999.
[28] M. Strout, L. Carter, J. Ferrante, and B. Simon.
Schedule-independent storage mapping for loops. In
Proceedings of the 8th International Conference on
Architectural Support for Programming Languages and
Operating Systems, pages 24-33, San Jose, CA,
October 1998.
[29] M. Wolf. Improving Locality and Parallelism in Nested
Loops. PhD thesis, Department of Computer Science,
Stanford University, August 1992.
[30] M. Wolfe. High Performance Compilers for Parallel
Computing. Addison-Wesley Publishing Company,
1995.

