Syracuse University

SURFACE
College of Engineering and Computer Science Former Departments, Centers, Institutes and
Projects

College of Engineering and Computer Science

1996

A Unified Tiling Approach for Out-Of-Core Computations
Rajesh Bordawekar
California Institute of Technology and CACR

Alok Choudhary
Northwestern University

J. Ramanujam
Louisiana State University

Mahmut Kandemir
Syracuse University

Follow this and additional works at: https://surface.syr.edu/lcsmith_other
Part of the Computer Sciences Commons

Recommended Citation
Bordawekar, Rajesh; Choudhary, Alok; Ramanujam, J.; and Kandemir, Mahmut, "A Unified Tiling Approach
for Out-Of-Core Computations" (1996). College of Engineering and Computer Science - Former
Departments, Centers, Institutes and Projects. 31.
https://surface.syr.edu/lcsmith_other/31

This Article is brought to you for free and open access by the College of Engineering and Computer Science at
SURFACE. It has been accepted for inclusion in College of Engineering and Computer Science - Former
Departments, Centers, Institutes and Projects by an authorized administrator of SURFACE. For more information,
please contact surface@syr.edu.

A Unified Tiling Approach for Out-Of-Core Computations*
M. Kandemir

CIS Dept., Syracuse University, Syracuse, NY 13244
mtk@top.cis.syr.edu

R. Bordawekar

CACR, Caltech, Pasadena, CA 91125
rajesh@cacr.caltech.edu

A. Choudhary†

ECE Dept., Northwestern University, Evanston, IL 60208-3118
choudhar@ece.nwu.edu

J. Ramanujam

ECE Dept., Louisiana State University, Baton Rouge, LA 70803
jxr@ee.lsu.edu

Abstract
This paper describes a framework by which an out-of-core stencil program written in a data-parallel
language can be translated into node programs for a distributed-memory message-passing machine with
explicit I/O and communication. We focus on a technique called Data Space Tiling to group data
elements into slabs that can fit into the memories of processors. Methods to choose legal tile shapes under
several constraints and deadlock-free scheduling of tiles are investigated. Our approach is unified in the
sense that it can be applied to both FORALL loops and loops that involve flow-dependences.

1 Introduction and Related Work
Since almost every processor today has some kind of memory hierarchy organized into layers with different
costs, compiler optimizations that reduce memory access costs are important. Tiling, one such optimization,
was first used by Abu-Sufah et al. [Abu81] to optimize loop nests in a paging-memory system. Later
applications generally targeted cache memories and registers [Wol89, WL91]. In [RS92] a number of loop
iterations were aggregated into tiles that execute atomically without any synchronization on a
distributed-memory message-passing machine. Irigoin and Triolet [IT88] introduced tiles that are atomic,
identical, and bounded. Within this context, different (and sometimes contradictory) optimization criteria
have been offered for choosing the best tile shape and size [RS92, SD90, BDRR93].
The orientation of this paper differs from that of previous work in one important aspect: we tile
the data that reside on disks; that is, we address the so-called out-of-core problem. The primary data
structures of the programs reside on disks, and the programs explicitly read from and write to disks. We
call the unit of transfer between disk and memory a Data Tile and the technique used to schedule reads and
writes Data Space Tiling. We demonstrate the tradeoffs in choosing good tile shapes for FORALL loops [KLS94]
and for loops that contain flow-dependences. The notion of Extra File I/O is introduced, and scheduling
techniques to eliminate it are presented.
The rest of the paper is organized as follows: Section 2 describes the problem and the underlying model.
Section 3 discusses reuse vectors and chain vectors. Section 4 discusses how tiling parameters are
determined. Section 5 studies the scheduling of data tiles, and we conclude in Section 6.

* This work was supported in part by NSF Young Investigator Award CCR-9357840, NSF CCR-9509143 and in part by
the Scalable I/O Initiative, contract number DABT63-94-C-0049 from Defense Advanced Research Projects Agency (DARPA)
administered by US Army at Fort Huachuca. The work of J. Ramanujam was supported in part by an NSF Young Investigator
Award CCR-9457768 and by the Louisiana Board of Regents through contract LEQSF(1991-94)-RD-A-09.
† A longer version of this paper may be obtained from http://web.ece.nwu.edu/choudhar.

2 Problem Description and Our Model
The problem addressed in this paper is compiling applications that use very large amounts of data for
distributed-memory message-passing architectures. A computation is called Out-Of-Core (OOC) if the data
it uses cannot fit in memory; that is, parts of the data reside in files.
In the rest of the paper, a message-passing distributed-memory machine is assumed. Out-of-core arrays
are divided by the programmer into local out-of-core arrays, each of which is stored on a logical disk attached
to a processor. We call this model the Local Placement Model (LPM). Each processor has its local out-of-core
files stored on the logical disk attached to it, and data sharing is performed by explicit message passing [Bor96].
During the course of the program, parts of the local out-of-core file, called tiles or slabs, are fetched into
memory, the new values are computed, and the tile is stored back (if necessary) into the appropriate locations
in the local file. The Computation Volume of a tile is the number of data points it contains.
Our programming model is based on the data-parallel programming paradigm. In this model, parallelism
is achieved by decomposing data among processors. We assume a fixed set of distribution patterns, e.g.,
row-block, column-block, and block-block. Array partitioning results in each processor storing in memory a
local array associated with each array. In an out-of-core program, local arrays are also out-of-core.

3 Preliminaries
3.1 Reuse Vectors and Reuse Matrices
We assume that loop bounds and array subscripts are affine functions of the enclosing loop indices, and that
all statements are inside the deepest loop. Data reuse in a program can be expressed by an integer vector
called the Data Reuse Vector [WL91, Li94, Wol96]. If $\vec{i}_1$ and $\vec{i}_2$ are two iterations that access
the same data element, the reuse vector for this access can be defined as $\vec{r} = \vec{i}_2 - \vec{i}_1$.
A Reuse Matrix [Li94] is a matrix every column of which is a reuse vector.
Suppose that $X(\vec{f}(\vec{i}))$ is a reference to an array $X$ on the LHS of an assignment and
$X(\vec{g}(\vec{i}))$ is a reference to the same array on the RHS of the assignment, where
$\vec{f}(\vec{i}) = A\vec{i} + \vec{c}_1$ and $\vec{g}(\vec{i}) = A\vec{i} + \vec{c}_2$. Here $A$ is the
access (reference) matrix and $\vec{c}_1$ and $\vec{c}_2$ are constant vectors. Data reuse between these two
references can be found by solving $A\vec{i}_1 + \vec{c}_1 = A\vec{i}_2 + \vec{c}_2$. Then the temporal reuse
vector $\vec{r} = \vec{i}_2 - \vec{i}_1$ can be found from $A\vec{r} = \vec{c}_1 - \vec{c}_2$. Note that the
reuse matrix captures both flow-dependences and anti-dependences. Since in a FORALL statement all dependences
are resolved as anti-dependences, the reuse matrix then contains only anti-dependences. It is convenient to
represent the reuse matrix $R$ as $R = [D; S]$ where $D$ and $S$ denote the matrices that contain
flow-dependence and anti-dependence vectors respectively. $D$ is frequently called the Data Dependence
Matrix [ZC90, Wol96].

3.2 Chain Vectors and Chain Matrices
An r-dimensional array defines an r-dimensional polyhedron. The vectors that define the relation between
data points (array elements) are called Chain Vectors [RS89]. For example, the relation between data points
for X(i) = X(i-1) + X(i+1) can be represented by two chain vectors for each value of i. One of them is in
direction 1 whereas the other one is in direction -1. Informally, this corresponds to the statement that, in
order to compute the new value of X(i), both X(i-1) and X(i+1) are needed. It should be emphasized that
chain vectors can, in general, span different arrays^1. In this paper we consider only stencil applications,
and therefore we assume that only one array is referenced in the nest and that all references to it have the
same access matrix.^2 In that case chain vectors can be represented in a graph called the Data Space Graph (DSG).
Suppose that $X(\vec{f}(\vec{i}))$ and $X(\vec{g}(\vec{i}))$ are two references to the same array $X$, where
the former occurs on the LHS and the latter on the RHS. Let $\vec{f}(\vec{i}) = A\vec{i} + \vec{a}_1$ and
$\vec{g}(\vec{i}) = A\vec{i} + \vec{a}_2$, where $A$ is the access matrix and $\vec{a}_1$ and $\vec{a}_2$ are
constant vectors.
A data reuse exists between two iterations $\vec{i}_1$ and $\vec{i}_2$ iff
$A\vec{i}_1 + \vec{a}_1 = A\vec{i}_2 + \vec{a}_2 \Rightarrow A(\vec{i}_2 - \vec{i}_1) = \vec{a}_1 - \vec{a}_2 \Rightarrow A\vec{r} = \vec{a}_1 - \vec{a}_2$,
where $\vec{r}$ is the reuse vector (it may be a dependence or an anti-dependence vector). On the other hand,
we can define a chain vector $\vec{t}$ for this reference pair as follows:
$\vec{t} = (A\vec{i} + \vec{a}_1) - (A\vec{i} + \vec{a}_2) = \vec{a}_1 - \vec{a}_2$. From these two equations
we obtain an important relation between $\vec{t}$ and $\vec{r}$:

$$\vec{t} = A\vec{r}$$

where $A$ is the access matrix, $\vec{r}$ is the reuse vector and $\vec{t}$ is the chain vector. If
$\vec{r} \in D$ we call $\vec{t}$ an Effective Chain Vector. In other words, an effective chain vector is a
chain vector implied by a flow-dependence. A Chain Matrix $T$ is a matrix every column of which is a chain
vector. An Effective Chain Matrix $U$, on the other hand, is a chain matrix every column of which is an
effective chain vector. We now have the following relation between $U$ and $D$:

$$U = AD \qquad (1)$$

Each column $\vec{u}$ of $U$ and the corresponding column $\vec{d}$ of $D$ satisfy $\vec{u} = A\vec{d}$.
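We assume that the access matrix $A$ is square and invertible. That is, the dimensionality of the data space
is equal to that of the iteration space. As an example, suppose that a 2-deep nest has the statement
X(i,i+j)=X(i-1,i+j)+X(i-1,i+j+2). Then

$$A = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}, \quad D = \begin{pmatrix} 1 & 1 \\ -1 & -3 \end{pmatrix}, \quad \text{and} \quad U = \begin{pmatrix} 1 & 1 \\ 0 & -2 \end{pmatrix}.$$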
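The example can be checked mechanically. The following Python sketch solves $A\vec{r} = \vec{c}_1 - \vec{c}_2$ for each RHS reference of X(i,i+j)=X(i-1,i+j)+X(i-1,i+j+2) and then forms the chain vectors $\vec{t} = A\vec{r}$. The helper names `solve_2x2` and `mat_vec` are illustrative and do not come from the paper; the constant offsets are read off the statement.

```python
from fractions import Fraction


def solve_2x2(A, b):
    """Solve A x = b for a 2x2 invertible integer matrix A (Cramer's rule).

    Returns an integer vector, or None when the solution is not integral,
    i.e. when no pair of iterations touches the same element.
    """
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("the access matrix is assumed to be invertible")
    x1 = Fraction(b[0] * a22 - a12 * b[1], det)
    x2 = Fraction(a11 * b[1] - a21 * b[0], det)
    if x1.denominator != 1 or x2.denominator != 1:
        return None
    return (int(x1), int(x2))


def mat_vec(A, v):
    """Return the matrix-vector product A v as a tuple."""
    return tuple(sum(a * x for a, x in zip(row, v)) for row in A)


A = ((1, 0), (1, 1))            # X(i, i+j): data index = A (i, j)^T
c_lhs = (0, 0)                  # offset of the LHS reference X(i, i+j)
c_rhs = [(-1, 0), (-1, 2)]      # offsets of X(i-1, i+j) and X(i-1, i+j+2)

reuse_vectors = [
    solve_2x2(A, (c_lhs[0] - c[0], c_lhs[1] - c[1])) for c in c_rhs
]
chain_vectors = [mat_vec(A, r) for r in reuse_vectors]

print("columns of D:", reuse_vectors)   # [(1, -1), (1, -3)]
print("columns of U:", chain_vectors)   # [(1, 0), (1, -2)]
```

The two printed lists are the columns of $D$ and $U$ for this statement, so the relation $U = AD$ holds column by column.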

4 Constraints on the Tile Shape and Size
Let $n$ be the dimension of the iteration (and data) space and $m$ be the number of columns of the effective
chain matrix $U$. Every tile shape can be succinctly described in one of two ways. Let $P_d$ be an
$n \times n$ matrix every column of which corresponds to a tile boundary in that dimension. It is known that
the columns of $P_d$ constitute an extreme vector set for the effective chain matrix $U$ [RS92]. The second
way to define a tile is by a matrix $H_d$ every row of which is a vector perpendicular to the tile boundary
along that dimension. The relation between $P_d$ and $H_d$ is $P_d = H_d^{-1}$ [IT88, BDRR93]. In the rest of
the paper $P_d$ and $H_d$ are called tiling matrices.
In general, there are four factors that determine the shape and size of a tile on the data space:

• Computation Constraint. Tile shape must be legal in the sense that all effective chain vector traffic
between two tiles must be in one direction (i.e. from one tile to the other).
• I/O Constraint. Tile shape must be compatible with the file layout.
• Communication Constraint. The number of chain vectors (effective or not) going from one tile to the
others and the number of tiles that a tile communicates with should be minimized.
• Memory Constraint. Tile size cannot be larger than the size of the memory of a node.

Of course, for a loop that contains only anti-dependences, the computation constraint does not exist.

1 The chain vectors that span different arrays are called Cross Chain Vectors whereas the chain vectors defined
entirely on a single array are called Regular Chain Vectors. In this paper we concentrate only on regular chain vectors.
2 The references defined that way form a Uniformly Generated Reference Set [GJK88, WL91].

4.1 Computation Constraint (Legality)
Arbitrary clustering of data space points into tiles might result in effective chain vector cycles between
tiles. Tile shapes that produce effective chain vector cycles are called illegal tiles. The reason for
illegality is that processing of tiles must be atomic in the sense that a tile must take all the data it
requires from outside before computation begins, and all the data required by other tiles should be available
after computation terminates. Allowing effective chain vector cycles among tiles violates this requirement.
As an example, for the data space graph shown in Figure 1:(A), tiles (1) and (2) are legal, while tile (3) is
illegal. The arrows in the figure represent effective chain vectors.
Let $\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_n$ be the rows of the $H_d$ matrix and
$\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_m$ be the columns of the effective chain matrix $U$. The legality
condition can now be stated as follows [IT88, RS92, BDRR93]:

$$\vec{h}_i \cdot \vec{u}_j \ge 0 \qquad (2)$$

for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$. We note that $H_d$ is not necessarily unique.
Figure 1: Different Tile Shapes. Panel (A) shows tiles (1), (2), and (3) on a data space graph; panel (B)
shows two further tile shapes.
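Condition (2) is easy to test numerically. The sketch below is a minimal illustration, assuming the effective chain vectors $(1, 0)^T$ and $(1, -2)^T$ of the Section 3.2 example and two hypothetical tiling matrices (one axis-aligned, one skewed). The function names are illustrative, not from the paper.

```python
from fractions import Fraction


def inverse_2x2(P):
    """Return the exact inverse of a 2x2 integer matrix given as rows."""
    (a, b), (c, d) = P
    det = a * d - b * c
    if det == 0:
        raise ValueError("tiling matrix must be non-singular")
    return (
        (Fraction(d, det), Fraction(-b, det)),
        (Fraction(-c, det), Fraction(a, det)),
    )


def is_legal(P_d, U_columns):
    """Check h_i . u_j >= 0 for every row h_i of H_d = P_d^{-1}."""
    H_d = inverse_2x2(P_d)
    return all(
        h[0] * u[0] + h[1] * u[1] >= 0
        for h in H_d
        for u in U_columns
    )


U_columns = [(1, 0), (1, -2)]     # effective chain vectors (Section 3.2 example)

rectangular = ((1, 0), (0, 1))    # hypothetical axis-aligned unit tile
skewed = ((1, 0), (-2, 1))        # hypothetical tile with a skewed first edge

print(is_legal(rectangular, U_columns))  # False: h_2 . (1, -2) = -2 < 0
print(is_legal(skewed, U_columns))       # True
```

Exact rational arithmetic is used so that the sign test is not affected by rounding when $P_d$ has a determinant other than 1.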

4.2 I/O Constraint
The I/O cost of a tile is determined by the number of file accesses (I/O calls) required to read it from disk.
Under a column-major layout assumption, minimizing the number of file accesses means minimizing the number of
(sub-)column reads. This may not always be possible in practice, as minimizing I/O calls can lead to illegal
tiles, or communication requirements may impose lower bounds on some edges of the tile. In
Figure 1:(A), tile (1) and tile (2) have the same computation volume (8 data points); however, the I/O cost
(number of sub-columns) of tile (1) is 4 while that of tile (2) is 2. Everything else being equal, tile (2) is a
better choice than tile (1). Note that this constraint is unique to data space tiling.
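The sub-column count can be sketched in a few lines of Python. The two rectangular tiles below are hypothetical stand-ins with 8 points each (the exact shapes in Figure 1 are not reproduced here), and one I/O call per maximal run of consecutive rows within a column is an assumption of this sketch.

```python
def io_calls(points):
    """Count sub-column reads for a tile under a column-major file layout.

    `points` is an iterable of (row, column) data-space coordinates. Each
    maximal run of consecutive rows inside one column is assumed to cost one
    I/O call.
    """
    rows_by_column = {}
    for row, col in set(points):
        rows_by_column.setdefault(col, []).append(row)

    calls = 0
    for rows in rows_by_column.values():
        rows.sort()
        breaks = sum(1 for a, b in zip(rows, rows[1:]) if b != a + 1)
        calls += 1 + breaks
    return calls


def rectangle(n_rows, n_cols):
    """All points of an n_rows x n_cols rectangular tile anchored at (0, 0)."""
    return [(r, c) for r in range(n_rows) for c in range(n_cols)]


wide = rectangle(2, 4)   # 8 points spread over 4 columns
tall = rectangle(4, 2)   # 8 points spread over 2 columns

print(len(wide), io_calls(wide))  # 8 4
print(len(tall), io_calls(tall))  # 8 2
```

Both tiles have the same computation volume, but the tile that extends further along the column direction needs half as many I/O calls.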

4.3 Communication Constraint
The Communication Volume [BDRR93] of a tile is the number of chain vectors (effective or not) going from one
tile to the others. Communication volume may be reduced if the set of extreme vectors for the chain vectors
(i.e. the columns of $P_d$) is a subset of the chain vectors. An important observation is that minimizing the
communication volume (cost), in some cases, leads to an increase in the I/O cost. This situation is shown in
Figure 1:(B) (taken from [BDRR93]). The tile on the left leads to 5 communications whereas the one on the
right leads to 4 communications. On the other hand, the left one needs 2 I/O calls while the one on the right
needs 3 I/O calls. This example clearly demonstrates the tradeoff between the I/O and communication constraints.
There is another aspect of communication as well. For any tile, the tile size along each dimension must
be larger than the maximum magnitude of the corresponding components of the chain vectors. This
ensures that all communicating tiles are neighbors.

4.4 Memory Constraint
Tile size cannot be larger than the size of node memory. We can state this constraint as
$M \ge V_{comp}(T)$, where $M$ is the size of the memory of a single processor and $V_{comp}(T)$ is the
computation volume of data tile $T$.
Let $H_i$ and $P_i$ denote iteration space (IS) tiling matrices while $H_d$ and $P_d$ denote data space (DS)
tiling matrices. The proofs of the following theorems can be found elsewhere [KBCR96].
Theorem: It is always possible to find an $H_d$ and a $P_d$ in order to tile the data space in a deadlock-free
manner provided that $A$ is invertible.
Theorem: For any legal $H_d$ (or $P_d$) on the DSG, there is a corresponding legal $H_i$ (or $P_i$) on the
iteration space provided that $A$ is invertible.
Theorem: If the access matrix $A$ is unimodular, then the number of integer points in the data tile and that of
the corresponding iteration tile are equal; moreover, there is a one-to-one correspondence between the points
of these two tiles.
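The role of unimodularity in the last theorem can be illustrated with a small experiment. The sketch below maps every point of a hypothetical 4 x 8 data tile back to the iteration space through $A^{-1}$ and counts the integer preimages, with constant offsets taken to be zero. The function name and the non-unimodular matrix are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import product


def integer_preimages(A, data_points):
    """Return the integer iterations i with A i = d for each data point d.

    A is a 2x2 invertible integer matrix given as rows; constant offsets are
    taken to be zero for simplicity.
    """
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    iterations = []
    for d1, d2 in data_points:
        i1 = Fraction(a22 * d1 - a12 * d2, det)
        i2 = Fraction(a11 * d2 - a21 * d1, det)
        if i1.denominator == 1 and i2.denominator == 1:
            iterations.append((int(i1), int(i2)))
    return iterations


data_tile = list(product(range(4), range(8)))     # hypothetical 4 x 8 data tile

unimodular = ((1, 0), (1, 1))                     # det = 1
stretched = ((2, 0), (0, 1))                      # det = 2, not unimodular

print(len(data_tile))                                        # 32
print(len(set(integer_preimages(unimodular, data_tile))))    # 32
print(len(set(integer_preimages(stretched, data_tile))))     # 16
```

With the unimodular matrix every data point has exactly one integer iteration, whereas with determinant 2 only half of the data points are touched by any iteration.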

4.5 Choosing Tile Shape and Size
The overall cost of a data tile ($T_{cost}$) has two components: I/O cost ($T_{I/O}$) and communication cost
($T_{comm}$). Consider now the tile and the effective chain vector shown in Figure 2:A. The computation volume
of this tile is $V_{comp} = |\vec{p}_1\ \vec{p}_2|$ and the communication volume can be approximated as
$V_{comm} \approx |\vec{p}_1\ \vec{u}| + |\vec{u}\ \vec{p}_2|$ [BDRR93], where $|\vec{a}\ \vec{b}|$ denotes the
absolute value of the determinant of the matrix with columns $\vec{a}$ and $\vec{b}$.
First we need a few definitions.

$C_{I/O}$ = startup cost for an I/O read (write), $t_{I/O}$ = cost of reading (writing) an element from (into)
a file, $C_{comm}$ = startup cost for communication, $t_{comm}$ = cost of communicating an element, and
$K$ = maximum message length of the machine.

The overall cost (file read and send communication) is

$$T_{cost} = p_{11} C_{I/O} + V_{comp}\, t_{I/O} + \left\lceil \frac{V_{comm}}{K} \right\rceil C_{comm} + V_{comm}\, t_{comm}$$

The optimization problem is then to minimize $T_{cost}$ under the constraints $V_{comp} \le M$ and
$P_d^{-1} U \ge 0$. Moreover, all entries of $P_d$ are restricted to be integers. This problem is in general
difficult to solve. In the following we present a heuristic that works for restricted cases of $\vec{u}$ (or
$U$ in general).
Since $C_{I/O}$ is the most costly term in the overall cost expression, we believe that the communication cost
of a tile is of secondary importance compared with its I/O cost.
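To make the cost expression concrete, the following Python sketch evaluates it for two tiles with the same computation and communication volumes. The machine parameters are hypothetical values chosen only so that the I/O startup cost dominates, and the function names are illustrative rather than taken from the paper.

```python
def det2(a, b):
    """Determinant of the 2x2 matrix whose columns are a and b."""
    return a[0] * b[1] - a[1] * b[0]


def tile_cost(p1, p2, u, C_io, t_io, C_comm, t_comm, K):
    """Evaluate T_cost for the tile spanned by p1 and p2 and one chain vector u."""
    v_comp = abs(det2(p1, p2))
    v_comm = abs(det2(p1, u)) + abs(det2(u, p2))
    messages = -(-v_comm // K)                 # integer ceiling of v_comm / K
    io_cost = p1[0] * C_io + v_comp * t_io     # p1[0] is the entry p_11
    comm_cost = messages * C_comm + v_comm * t_comm
    return v_comp, v_comm, io_cost + comm_cost


# Hypothetical machine parameters, chosen only so that I/O startup dominates.
params = dict(C_io=1000, t_io=1, C_comm=100, t_comm=1, K=16)

print(tile_cost((4, -4), (0, 8), (4, 0), **params))   # (32, 48, 4380)
print(tile_cost((8, -8), (0, 4), (4, 0), **params))   # (32, 48, 8380)
```

Both tiles hold 32 elements and exchange the same amount of data, yet the second costs almost twice as much because its larger $p_{11}$ doubles the number of I/O startups.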
Recall that the condition for legality of a data tile represented by tiling matrix $H_d$ is

$$H_d U \ge 0$$

where $H_d$ is the tiling matrix for the data space and $U$ is the effective chain matrix. We now consider a
restricted version of $U$: we assume that every column $\vec{u}$ of $U$ is lexicographically positive.
Theorem: In the two-dimensional case, if the columns of $U$ are lexicographically positive, it is always
possible to choose a tiling matrix $P_d$ of the form

$$P_d = \begin{pmatrix} 1 & 0 \\ -e & 1 \end{pmatrix} \qquad (3)$$

where $e > 0$, so that $H_d U \ge 0$ (see [RS92] for the proof).
Figure 2: (A) A Data Tile defined by $\vec{p}_1$ and $\vec{p}_2$ and an Effective Chain Vector $\vec{u}$.
(B) Legal Tile Shapes. (C) Entries of $P_d$ on the Data Tile. (D) Legal Tile for e = 1 on the Data Space.
(E) Legal Tile Shapes for e = 1 and e = 2. In the figure, $P = [\vec{p}_1\ \vec{p}_2]$ has entries
$p_{11}, p_{12}, p_{21}, p_{22}$ and $\vec{u} = (u_1, u_2)^T$.
A few legal tile shapes that are specified by such a $P_d$ matrix are shown in Figure 2:B. Re-scaling
$\vec{p}_1$ by $\alpha$ and $\vec{p}_2$ by $\beta$ gives the following re-scaled tiling matrix:

$$P_d = \begin{pmatrix} \alpha & 0 \\ -\alpha e & \beta \end{pmatrix}$$

We now impose our constraints on this specific type of $P_d$ (see Figure 2:C).

• Legality Constraint. $e \ge \max(\max_i(-u_{2i}/u_{1i}),\ 1)$, $(u_{1i} \ne 0)$, where
$\vec{u}_i = (u_{1i}, u_{2i})^T$ is the i-th effective chain vector.
• I/O Constraint. $\alpha$ should be minimized.
• Communication Constraint. $\alpha \ge \max_j(|t_{1j}|)$ and $\beta \ge \max_j(|t_{2j}|)$, where
$\vec{t}_j = (t_{1j}, t_{2j})^T$ is the j-th chain vector.
• Memory Constraint. $\alpha\beta \le M$ where $M$ is the memory size of a node.

Considering the I/O and communication constraints together, it is clear that $\alpha = \max_j(|t_{1j}|)$. Then,
considering the communication and memory constraints together, we have

$$\max_j(|t_{2j}|) \le \beta \le \frac{M}{\alpha}$$

From this last expression and from the legality constraint, $e$ can be set to appropriate values.

As an example, suppose that $U = T = \begin{pmatrix} 4 & 4 \\ 0 & -4 \end{pmatrix}$ and $M = 32$. Using the
constraints given above, $\alpha = 4$, $e \ge 1$ and $4 \le \beta \le 8$. In order to utilize available memory
as much as possible, we should set $\beta = 8$. Different values for $e$ now give different solutions, all of
which have the minimum I/O cost. For example, if $e = 1$ we have
$P_d = \begin{pmatrix} 4 & 0 \\ -4 & 8 \end{pmatrix}$; if $e = 2$, on the other hand, we obtain
$P_d = \begin{pmatrix} 4 & 0 \\ -8 & 8 \end{pmatrix}$. Figure 2:E shows I/O optimized legal tiles for $e = 1$
and $e = 2$, and Figure 2:D shows the former one on the data space graph of the example.
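The heuristic can be sketched as a short Python routine. The sketch assumes the re-scaled form of $P_d$ with a horizontal scale, a vertical scale, and a skew $e$ (named `alpha`, `beta`, and `e` here), takes the chain vectors $(4, 0)^T$ and $(4, -4)^T$ with $M = 32$ as inputs, and picks the largest vertical scale that fits in memory. Function names are illustrative and not from the paper.

```python
import math
from fractions import Fraction


def tiling_parameters(U_columns, T_columns, M):
    """Return (alpha, beta, e_min) for a tile of the re-scaled form of (3)."""
    alpha = max(abs(t[0]) for t in T_columns)          # I/O + communication
    beta_min = max(abs(t[1]) for t in T_columns)       # communication
    beta_max = M // alpha                              # memory
    if beta_max < beta_min:
        raise ValueError("no tile of this form fits in memory")
    ratios = [Fraction(-u[1], u[0]) for u in U_columns if u[0] != 0]
    e_min = max([math.ceil(r) for r in ratios] + [1])  # legality
    return alpha, beta_max, e_min


def tiling_matrix(alpha, beta, e):
    """Build P_d with columns alpha * (1, -e) and beta * (0, 1)."""
    return ((alpha, 0), (-alpha * e, beta))


def is_legal(P_d, U_columns):
    """Check H_d U >= 0 with H_d = P_d^{-1}, using exact arithmetic."""
    (a, b), (c, d) = P_d
    det = a * d - b * c
    H_d = ((Fraction(d, det), Fraction(-b, det)),
           (Fraction(-c, det), Fraction(a, det)))
    return all(h[0] * u[0] + h[1] * u[1] >= 0 for h in H_d for u in U_columns)


U_columns = T_columns = [(4, 0), (4, -4)]
alpha, beta, e_min = tiling_parameters(U_columns, T_columns, M=32)
print(alpha, beta, e_min)                 # 4 8 1

for e in (e_min, e_min + 1):
    P_d = tiling_matrix(alpha, beta, e)
    print(e, P_d, is_legal(P_d, U_columns))
```

For these inputs the routine yields the tiling matrices with second rows (-4, 8) and (-8, 8), and both pass the legality test, whereas an unskewed tile ($e = 0$) does not.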

5 Optimizing I/O in Stencil Codes
5.1 Translated Code
The translated node program of processor k for a 2-deep nest is as follows^3:

    DO IT = LB_IT, UB_IT, S_IT
      DO JT = LB_JT, UB_JT, S_JT
        read data tile DT defined by f_k(IT,JT) from local file
        compute the corresponding iteration tile CT for data tile DT
        handle communication and storage of boundary data
        DO IE = LB_IE, UB_IE, S_IE
          DO JE = LB_JE, UB_JE, S_JE
            perform computation on DT to compute the new data values
          ENDDO JE
        ENDDO IE
        handle communication and storage of boundary data
        write data tile DT defined by f_k(IT,JT) into local file
      ENDDO JT
    ENDDO IT

In the translated loop nest, loops IT and JT are called tiling loops and loops IE and JE are called element
loops. Throughout this section we try to find a suitable scheduling function $f_k$ for a given computation
such that overall I/O cost is minimized and potential parallelism is exploited. Note that $LB_{IE}$,
$UB_{IE}$, $LB_{JE}$, and $UB_{JE}$ depend on the iteration tile CT. In other words, the out-of-core compiler
first fetches the data tile, then computes the corresponding iteration tile, and after that executes the
iterations in the iteration tile using the elements of the data tile. Note also that the scheduling function
$f_k$ may be different for each processor, and this fact is used to find schedules that eliminate all
unnecessary I/O and maximize parallelism.
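The tiling-loop and element-loop structure can be mimicked on a single processor with a few lines of Python. This is only a sketch under strong assumptions: a nested list stands in for the local out-of-core file, the update is a pointwise doubling so that no boundary communication is needed, and a pair of signs (`schedule`) is a hypothetical stand-in for the scheduling function $f_k$.

```python
def run_tiled(local_file, tile_rows, tile_cols, schedule=(1, 1)):
    """Process `local_file` tile by tile and return the visiting order.

    A negative entry in `schedule` reverses the traversal of that dimension.
    """
    n_rows, n_cols = len(local_file), len(local_file[0])
    starts_i = list(range(0, n_rows, tile_rows))
    starts_j = list(range(0, n_cols, tile_cols))
    if schedule[0] < 0:
        starts_i.reverse()
    if schedule[1] < 0:
        starts_j.reverse()

    visited = []
    for it in starts_i:                       # tiling loops
        for jt in starts_j:
            rows = range(it, min(it + tile_rows, n_rows))
            cols = range(jt, min(jt + tile_cols, n_cols))
            # "read" the data tile into an in-memory buffer
            tile = [[local_file[r][c] for c in cols] for r in rows]
            for ie in range(len(tile)):       # element loops
                for je in range(len(tile[0])):
                    tile[ie][je] = 2 * tile[ie][je]
            # "write" the data tile back to the local file
            for ie, r in enumerate(rows):
                for je, c in enumerate(cols):
                    local_file[r][c] = tile[ie][je]
            visited.append((it, jt))
    return visited


data = [[4 * r + c for c in range(4)] for r in range(4)]
order = run_tiled(data, tile_rows=2, tile_cols=2, schedule=(-1, 1))
print(order)        # [(2, 0), (2, 2), (0, 0), (0, 2)]
print(data[3][3])   # 30
```

Changing `schedule` changes only the order in which tiles are brought into memory, not the result, which is the degree of freedom that the scheduling discussion in this section exploits.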

5.2 Extra File I/O Problem
In this subsection we concentrate on how an out-of-core compiler should automatically schedule tile
reads/writes to access data in an efficient manner in a distributed-memory environment. First we consider the
case with cross-processor anti-dependences only.
Definition: Any file I/O performed for any purpose other than local computation is termed Extra File I/O
[Bor96].
3 Translated code for the FORALL statement is very similar, so it is omitted (see [Bor96]).

Since extra file I/O is pure overhead, it should be eliminated whenever possible. In out-of-core
computations, communication cost is, in general, negligible compared with the I/O cost in most cases.
But inappropriate communication methods can cause extra file I/O; in other words, communication
may necessitate extra I/O. The proposed technique tries to eliminate the types of communication that cause
extra file I/O. Within this context, the efficacy of a scheduling method will be tested by 1) the amount of
overhead it incurs, and 2) the amount of extra memory needed to hold the data to be transferred between
tiles. The overhead involved can be divided into two groups: extra file I/O cost and communication cost.
Let us now elaborate on extra file I/O cost. When non-local data is requested by a processor during
computation, there are two possibilities: 1) the data has been brought into memory by the owner processor
prior to the request, or 2) the data has not been brought into memory yet. In the first case, if a processor
finds out that data will later be requested by another processor, it can keep the data in memory until it is
requested. This approach does not cause extra file I/O (if there is enough memory), but it requires extra
memory allocation for the data to be transferred. In the second case, however, the owner processor must read
the requested data by issuing I/O call(s) and send it to the requesting processor. Our scheduling strategy
must prevent the occurrence of this second case whenever possible.
Figure 3: Different Scheduling Strategies. Panels (1) through (5) each show a schedule over processors 0
through 3.
Consider now Figure 3 to see the overheads of different scheduling strategies for a square tile of size
$S \times S$ (dashed arrows represent the direction of anti-dependence whereas solid arrows indicate execution
order). Schedules (1) and (4) do not cause extra file I/O, but require extra memory. Schedule (2) leads to
extra file I/O. This is because when, for example, processor 1 needs boundary data from processor 0, processor
0 must issue file read requests in order to read the data. Schedules (3) and (5) do not involve extra
file I/O and do not require any extra storage for the data to be transferred. Note also that if there were
two anti-dependences in opposite directions instead of one, then schedules (1) and (4) would involve
extra file I/O, as schedule (2) does. The next section discusses methods to eliminate extra file I/O.

5.3 Scheduling Tiles
We assume two-dimensional data sets and a two-dimensional processor grid for illustrative purposes. In
such a grid, every processor has a direction vector with respect to each of its neighbors. As an example,
consider Figure 5:A. The direction vector for processor 2 with respect to processor 1 is $(-1, 1)^T$. In
general, the direction vector for processor i with respect to processor j is denoted by $\vec{v}_{ij}$.
Each tile can be represented by a size matrix $T_s$. It is a square diagonal matrix whose diagonal elements are
the sizes of the tile in the corresponding dimensions. Let $O$ be a similar diagonal matrix representing the
size of the local out-of-core array. These two matrices can be used to compute an important parameter $f$
called the Degree of Freedom and a matrix $F$ called the Freedom Matrix. Given a $T_s$ and an $O$, the freedom
matrix can be computed as $F = \lceil T_s^{-1} O \rceil$. The number of non-unit diagonal elements of $F$
gives the degree of freedom $f$. The scheduling matrix $G_i$, for processor i, determines the order in which
the data tiles are brought into memory for processing. For a nested loop that contains only anti-dependences
(e.g. a FORALL statement^4), tile access patterns and their corresponding scheduling matrices for $f = 1$ and
$f = 2$ are given in Figure 4:A and Figure 4:B respectively.

Figure 4: Tile Access Patterns and Scheduling Matrices for a loop that contains only anti-dependences.
(A) $f = 1$, with schedules $(1, 0)^T$, $(-1, 0)^T$, $(0, -1)^T$, and $(0, 1)^T$. (B) $f = 2$, with
scheduling matrices $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $\begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}$,
$\begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}$, and $\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$.

At the heart of the scheduling algorithm is the following theorem, the proof of which, as well as a
generalization to arbitrary grid sizes, may be found in [BCR95].
Theorem: In a two by two processor grid, schedules $G_i$ and $G_j$ eliminate extra file I/O between
processors i and j iff they satisfy the following equality:

$$G_i^T \vec{v}_{ij} = G_j^T \vec{v}_{ji}$$
Given the theorem above, the scheduling algorithm for a nested loop that contains only anti-dependences
consists of three steps:

• Assign a symbolic schedule matrix to each processor. The dimension of the schedule matrix is
equal to the degree of freedom of the tiles.
• Compute the schedule equations for every processor pair using $G_i$, $G_j$, $\vec{v}_{ij}$ and $\vec{v}_{ji}$.
• Initialize the schedule matrix of one processor with an arbitrary schedule from Figure 4 and compute the
corresponding schedules of the remaining processors by solving the schedule equations.
Consider the following elliptic solver example that uses five-point relaxation.
FORALL(i=2:n-1,j=2:n-1)
X(i,j)=(X(i,j)+X(i+1,j)+X(i-1,j)+X(i,j+1)+X(i,j-1))/5.0
ENDFORALL

Suppose $f = 2$ and $F = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$ as shown in Figure 5:B. Let
$\begin{pmatrix} a_i & b_i \\ c_i & d_i \end{pmatrix}$ denote a symbolic schedule matrix for processor i.
After obtaining the necessary equations using the preceding theorem, if we let the scheduling matrix for
processor 0 be $\begin{pmatrix} a_0 & b_0 \\ c_0 & d_0 \end{pmatrix} = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}$,
then the remaining schedules are $G_1 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$,
$G_2 = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}$, and $G_3 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$.
As shown in Figure 5:C, these values define a scheduling which does not involve any extra file I/O.

4 It should be noted that if a FORALL loop contains multiple statements, there may be flow-dependences among
different statements, and they should be accounted for.
Theorem: If a loop contains only anti-dependences, then using LPM, it is always possible to schedule
tiles such that all extra file I/O is eliminated [BCR95, Bor96].
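The three-step algorithm can be sketched for the 2x2 grid as a small search in Python. The sketch assumes grid coordinates in which processors 0 and 1 form the bottom row and processors 2 and 3 the top row (so that the direction vector of processor 2 with respect to processor 1 is $(-1, 1)^T$), restricts candidates to the four $f = 2$ scheduling matrices, and replaces symbolic solving by brute-force enumeration. All names are illustrative.

```python
from itertools import product

# Assumed (x, y) grid coordinates of processors 0..3.
POS = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}

# The four f = 2 scheduling matrices, given as rows.
CANDIDATES = [
    ((1, 0), (0, 1)),
    ((-1, 0), (0, -1)),
    ((-1, 0), (0, 1)),
    ((1, 0), (0, -1)),
]


def direction(i, j):
    """Direction vector v_ij of processor i with respect to processor j."""
    return (POS[i][0] - POS[j][0], POS[i][1] - POS[j][1])


def transpose_times(G, v):
    """Return G^T v for a 2x2 matrix G given as rows."""
    return (G[0][0] * v[0] + G[1][0] * v[1],
            G[0][1] * v[0] + G[1][1] * v[1])


def satisfies(G_i, G_j, i, j):
    """Schedule equation G_i^T v_ij = G_j^T v_ji for one processor pair."""
    return transpose_times(G_i, direction(i, j)) == transpose_times(G_j, direction(j, i))


def solve_schedules(G_0):
    """Fix G_0 and search for G_1..G_3 satisfying all pairwise equations."""
    for G_1, G_2, G_3 in product(CANDIDATES, repeat=3):
        G = {0: G_0, 1: G_1, 2: G_2, 3: G_3}
        if all(satisfies(G[i], G[j], i, j) for i in G for j in G if i < j):
            return G
    return None


schedules = solve_schedules(((-1, 0), (0, -1)))
for k in sorted(schedules):
    print(k, schedules[k])
```

Starting from $G_0 = -I$, the search returns the schedules diag(1, -1), diag(-1, 1), and the identity for processors 1, 2, and 3, matching the elliptic solver example.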
Let us now consider computations that involve flow-dependences, considering only rectangular tiles.
A Tiling Space Graph (TSG) is defined to indicate the chain vectors among the tiles. We are dealing with
the two-dimensional case where each component of the effective chain vectors between two tiles is 0 or 1. As
a first step, the tile access patterns that are associated with scheduling matrices are re-defined as shown in
Figure 6. The numbers indicate the order of execution. We discuss a technique called minimal perturbation.
Suppose that an effective chain vector $\vec{u}_i$ in the TSG is a member of the set
$\mathrm{span}\{(1, 0)^T\}$. The practical significance of this is that any scheduling in the opposite
direction $(-1, 0)^T$ is not acceptable. In order to avoid that, after all schedule matrices $G_i$ are
obtained (as in the previous case), if the first column of any of them is $(-1, 0)^T$, it is converted to
$(1, 0)^T$ and the other column is left as it is. So, in order to reach legitimate schedules we apply minimal
changes to the schedule matrices, hence the name minimal perturbation.
The scheduling algorithm for a nested loop that may contain flow-dependences is the same as that of
a loop that contains only anti-dependences, except that the initial schedule should be legal (observe the
effective chain vectors) and maximize (pipeline) parallelism. As a last step of the algorithm, we apply
minimal perturbation. Consider the following example that illustrates the technique (see Figure 5:D).
Figure 5: (A) A 2x2 Processor Grid. (B) Data Tiles with f=2. (C) A scheduling that eliminates extra
file I/O. (D) Data Tiles with flow-dependence. (E) A scheduling that eliminates extra file I/O for the
flow-dependence case. (F) A scheduling that leads to extra file I/O. (G) A scheduling that requires extra
storage. (H-I) Schedulings that sequentialize computation. In panel (A), processors 2 and 3 form the top row
of the grid and processors 0 and 1 the bottom row. The legend distinguishes data tiles, flow-dependences, and
execution order.
DO i = 2,n-1
DO j = 1,n-1
X(i,j)=(X(i+1,j)+X(i,j+1)+X(i-1,j))/3.0
ENDDO j
ENDDO i





This nest contains an effective chain vector in the $(1, 0)^T$ direction. If we write down the schedule
equations as in the previous example and initialize $G_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ (to
maximize pipeline parallelism), after the fourth step of the algorithm we obtain
$G_1 = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}$, $G_2 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$,
and $G_3 = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix}$.

The schedules for processor 1 and processor 3 violate the effective chain vector and are not acceptable. So
we apply minimal perturbation to correct them at the last step of the algorithm and we obtain
$G_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $G_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$,
$G_2 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$, and $G_3 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$.
These schedules are shown in Figure 5:E. It is easy to see that there is no extra file I/O involved. Also
notice that the locality between processors 0 and 2 (similarly between 1 and 3) is exploited. The reason is
that the minimal perturbation preserves the localities originating from anti-dependence relations as much as
possible. The scheduling in Figure 5:F, on the other hand, leads to extra file I/O (because of the reference
X(i, j+1)), and the scheduling in Figure 5:G requires extra storage of size n/2 per processor. From the
discussion above we can conclude the following, the proof of which is omitted due to lack of space but follows
closely the preceding discussion.
Theorem: By scheduling data tiles appropriately, it is always possible using LPM to eliminate extra file
I/O.
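Minimal perturbation itself is a one-line transformation, sketched below in Python for an effective chain vector in the $(1, 0)^T$ direction. The input schedules are those obtained for the flow-dependence example starting from $G_0 = I$; the function name and the matrix encoding (tuples of rows) are assumptions of this sketch.

```python
def minimal_perturbation(G):
    """Replace a first column of (-1, 0) by (1, 0); keep the second column."""
    first_column = (G[0][0], G[1][0])
    if first_column == (-1, 0):
        return ((1, G[0][1]), (0, G[1][1]))
    return G


# Schedules before perturbation, one 2x2 matrix (as rows) per processor.
before = {
    0: ((1, 0), (0, 1)),
    1: ((-1, 0), (0, 1)),
    2: ((1, 0), (0, -1)),
    3: ((-1, 0), (0, -1)),
}

after = {k: minimal_perturbation(G) for k, G in before.items()}
for k in sorted(after):
    print(k, after[k])

changed = sorted(k for k in before if before[k] != after[k])
print("perturbed processors:", changed)   # [1, 3]
```

Only the schedules of processors 1 and 3 change, and their second columns are untouched, which is why the pairs (0, 2) and (1, 3) keep the tile orders they had before the perturbation.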
Figure 6: Tile Access Patterns and Scheduling Matrices for loop nests containing flow-dependences
(panels (A) and (B); the numbers in each pattern give the order of execution).
In a nested loop that contains only anti-dependences, it is not important what the initial schedule for a
processor is. But in a nest that contains flow-dependences, the initial schedule must be legal (it must not
violate any effective chain vector). In addition, some of the candidate schedules might be preferable
to others, especially when parallelism is considered. To see this, consider Figure 5:H and Figure 5:I.
Although both schedules are perfectly legal, neither of them exploits the available parallelism.

6 Conclusion and Future Work
In this paper, we have presented a technique to decompose an out-of-core array stored on disk(s) into
partitions called data tiles. We have discussed the extra file I/O problem and proven that it is always
possible to find tile schedules so that extra file I/O is eliminated completely in stencil computations.
We are currently working on the feasibility of the data space tiling approach for loop nests that contain
arbitrary computations on out-of-core arrays in multicomputers.

References
[Abu81] W. Abu-Sufah. On the Performance Enhancement of Paging Systems Through Program Analysis
and Transformations. IEEE Transactions on Computers, C-30(5), pages 341-355, May 1981.
[Bar94] R. K. Barua. Global Partitioning of Parallel Loops and Data Arrays for Caches and Distributed
Memory in Multiprocessors. Master's Thesis, Dept. of Electrical Engineering and Computer Science,
MIT, May 1994.
[BC95] R. Bordawekar and A. Choudhary. Communication Strategies for Out-Of-Core Programs on Distributed
Memory Machines. In Proc. International Conference on Supercomputing, pages 395-403, Barcelona, July 1995.
[BCR95] R. Bordawekar, A. Choudhary, and J. Ramanujam. Automatic Optimization of Communication in
Out-Of-Core Stencil Codes. Technical Report, Scalable I/O Initiative, Center for Advanced Computing
Research, Caltech, November 1995.
[BDRR93] P. Boulet, A. Darte, T. Risset, and Y. Robert. (Pen)-ultimate tiling? Technical Report 93-36,
Ecole Normale Superieure de Lyon, 46, Allee d'Italie, 69364 Lyon Cedex 07, France, November 1993.
[Bor96] R. Bordawekar. Techniques for Compiling I/O Intensive Parallel Programs. Ph.D. Thesis, Dept. of
Electrical and Computer Eng., Syracuse University, April 1996.
[GJK88] D. Gannon, W. Jalby, and K. Gallivan. Strategies for Cache and Local Memory Management by
Global Program Transformations. Journal of Parallel and Distributed Computing, 5:587-616, 1988.
[IT88] F. Irigoin and R. Triolet. Supernode Partitioning. In Proc. 15th Annual ACM Symp. Principles of
Programming Languages, pages 319-329, San Diego, CA, January 1988.
[KBCR96] M. Kandemir, R. Bordawekar, A. Choudhary, and J. Ramanujam. Data Space Tiling for Out-of-Core
Computations. Technical Report, ECE Dept., Northwestern University, Evanston, IL, September 1996.
[KLS94] C. Koelbel, D. Loveman, R. Schreiber, G. Steele, and M. Zosel. High Performance Fortran Handbook.
MIT Press, 1994.
[Li94] W. Li. Compiler Optimizations for Cache Locality and Coherence. Technical Report 504, Dept. of
Computer Science, University of Rochester, April 1994.
[RS89] J. Ramanujam and P. Sadayappan. A Methodology for Parallelizing Programs for Multicomputers
and Complex Memory Multiprocessors. In Proc. Supercomputing '89, pages 637-646, Reno, NV,
November 1989.
[RS92] J. Ramanujam and P. Sadayappan. Tiling Multidimensional Iteration Spaces for Multicomputers.
Journal of Parallel and Distributed Computing, 16(2):108-120, October 1992.
[SD90] R. Schreiber and J. Dongarra. Automatic Blocking of Nested Loops. Technical Report, RIACS, May
1990.
[WL91] M. Wolf and M. Lam. A Data Locality Optimizing Algorithm. In Proc. ACM SIGPLAN '91
Conf. Programming Language Design and Implementation, pages 30-44, June 1991.
[Wol89] M. Wolfe. More Iteration Space Tiling. In Proc. Supercomputing '89, pages 655-664, Reno, NV,
November 1989.
[Wol96] M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, CA, 1996.
[ZC90] H. Zima and B. Chapman. Supercompilers for Parallel and Vector Computers. ACM Press Frontier
Series, 1990.

