Generating efficient tiled code for distributed memory machines by Peiyi Tang
Generating Efﬁcient Tiled Code for Distributed Memory Machines
Peiyi Tang
￿
and Jingling Xue
￿
Department of Mathematics and Computing
University of Southern Queensland, Toowoomba, QLD 4350, Australia
￿
School of Computer Science and Engineering
The University of New South Wales, Sydney, NSW 2351, Australia
￿
Abstract — Tiling can improve the performance of nested loops on distributed memory machines by ex-
ploiting coarse-grain parallelism and reducing communication overhead and frequency. Tiling calls for a
compilation approach that performs ﬁrst computation distribution and then data distribution, both possibly
on a skewed iteration space. This paper presents a suite of compiler techniques for generating efﬁcient SP-
MD programs to execute rectangularly tiled iteration spaces on distributed memory machines. The following
issues are addressed: computation and data distribution, message-passing code generation, memory man-
agement and optimisations, and global to local address translation. Methods are developed for partitioning
arbitrary iteration spaces and skewed data spaces. Techniques for generating efﬁcient message-passing code
for both arbitrary and rectangular iteration spaces are presented. A storage scheme for managing both local
and nonlocal references is developed, which leads to the SPMD code with high locality of references. T-
wo memory optimisations are given to reduce the amount of memory usage for skewed iteration spaces and
expanded arrays, respectively. The proposed compiler techniques are illustrated using a simple running exam-
ple and ﬁnally analysed and evaluated based on experimental results on a Fujitsu AP1000 consisting of 128
processors.
Index Terms. Nested Loops, Tiling, Distributed Memory Machines, SPMD, Memory Optimisa-
tions.
1 Introduction
Distributed memory machines such as IBM SP-2, TMC CM-5 and Fujitsu AP1000 rely on mes-
sage passing for interprocessor communication and synchronisation. To obtain good performance
on these machines, it is important to manage communication between processors efﬁciently. For
many applications, exploiting coarse-grain parallelism can amortise the impact of communication
overhead by reducing communication overhead and frequency.
Iteration space tiling, also known as blocking, is an important loop optimisation technique for
regularly structured computations where the most of execution time is spent in nested loops. Ex-
ample applications include some dense matrix algebra and regular grid-based ﬁnite-difference and
ﬁnite-element algorithms. The data dependences of these algorithmsusually span the entire iteration
space, making it impossible to create coarse-grain DOALL parallelism. The loop nests of this kind
are known as DOACROSS loop nests [37].
Tiling aggregates small loop iterations into tiles. By executing tiles as atomic units of compu-
tation, communication takes place per tile instead of per iteration. By adjusting the size of tiles,
tiling can achieve coarse-grain DOACROSS parallelism while reducing communication overhead
and frequency on distributed memory machines [13, 26, 29, 30, 33, 38]. While a lot of work has
been done on tiling or other related optimisations [10, 32], little research efforts in the literature are
1A Rectangularly Tiled Iteration Space
Generate Sequential Tiled Code
Generate SPMD Code
￿ Computation Distribution
￿ Data Distribution
￿ Message-Passing Code Generation
￿ Memory Management and Optimisations
Message-Passing SPMD Code
Figure 1: Generating SPMD code for tiled iteration spaces
found on automatic generation of efﬁcient blocked code for distributed memory machines, which is
the subject of this paper.
Several prototype compilers, such as Fortran-D [19] and Vienna Fortran [2], have been devel-
oped for compiling HPF-like data-parallel languages. These systems ﬁrst ﬁnd the data distribution
(usually speciﬁed by the programmer) and then infer the computation distribution (using the owner-
computes rule [19] or others [3]). When compiling a tiled iteration space, however, the proposed
approach will carry out these two phases in the reverse order. This is because once the iteration
space is tiled, the computation distribution is partially determined. According to the so-called atom-
ic tiles constraint [20, 39], each tile is an atomic unit of work to be scheduled on a processor, and
once scheduled, a tile runs to completion without preemption. Therefore, when generating code for
tiled iteration spaces, it is natural to determine ﬁrst the computation distribution and then the data
distribution.
This paper is concerned with compiling tiled iteration spaces for efﬁcient execution on distribut-
ed memory machines. Given a rectangularly tiled iteration space, this paper provides a suite of
compiler techniques for generating a message-passing SPMD program to be run on each processor.
Our compilationapproach is illustrated in Figure 1. In our SPMD executionmode, all processors ex-
ecute the same program but on different data sets, and in addition, each processor computes only the
values of data it owns. This means that the owner-computes rule [19] is still enforced in our gener-
ated SPMD programs, although it is not used in deriving the computation distribution. We introduce
compiler techniques to solve the following problems: computation distribution, data distribution,
message-passing code generation, memory management, memory optimisations, and global to local
address translation. By exploiting the regularity of data dependences, both the proposed compiler
techniques and the generated SPMD code are simple and efﬁcient.
This paper contains the following main results. First, we propose methods for deriving the com-
putation and data distribution for a tiled iteration space. The iteration space is assumed to be convex
and of any arbitrary shape. The tiles are distributed cyclically in some dimensions and collapsed
2in the remaining dimensions. This tends to achieve good load balance for non-rectangular iteration
spaces and overlap computation and communication. By adjusting the tile size, the general block-
cyclic distribution can be implemented. Given the computation distribution, the data distribution is
found using the “computer-owns” rule, by which a processor owns the data it computes. For skewed
iteration spaces, an array is partitioned in a skewed block-cyclic fashion. Such data distribution
cannot be speciﬁed by HPF data mapping directives [22].
Second, we present techniques for generating efﬁcient message-passing code for both arbitrary
and rectangular iteration spaces. The problem is nontrivial when iteration spaces have arbitrary
shapes and when empty tiles can be enumerated. In our communication scheme, small messages
are aggregated into large messages, amortising the large startup overhead on distributed memory
machines. In addition, each message comprises the elements from a regular array section of an
array, yielding efﬁcient code for packing and unpacking messages.
Third, we describe a memory compression technique for allocating local memory for distributed
arrays. Both the local and nonlocal data of a distributed array are stored in the same local array.
The nonlocal data are stored using an extended concept of overlap areas [15]: they are organised
according to the loop indices at which they are computed rather than their array indices. This
storage scheme leads to high locality of references for the computations within tiles regardless of
any sparse array references that may appear in the original program. We provide various formulas
for translating between global and local addresses. We generate efﬁcient SPMD code by avoiding
unnecessary division and modulo operations.
Fourth, we have developed two optimisation techniques to reduce the amount of memory usage.
In particular, address folding is used to deal with skewed iteration spaces and memory recycling with
expanded arrays. If the iteration space is rectangular and then skewed, the address folding ensures
that no extra memory will be allocated. In the case of an expanded array, the extra memory needed
in an expanded dimension is increased only by the size of the tile (not the size of the array) along
that dimension.
Finally, we have validated and evaluated all proposed compiler techniques. We present some
experimental results based on the SOR example and identify some performance-critical factors that
impact the performance of tiled code on distributed memory machines.
The rest of this paper is organised as follows. Section 2 provides the background information
for the paper. In particular, we describe our approach to generating sequential tiled code given a
tiled iteration space. We present our compiler techniques in Sections 3 – 8, with an emphasis on
communication code generation and memory management. In Section 9, we evaluate our compiler
techniques based on our experiments on an Fujitsu AP1000 and identify future research areas for
improvingthe performance of blocked code. Section 10 discussesrelated work, and ﬁnally the paper
is concluded in Section 11.
2 Background
2.1 Notation and Terminology
A single up-case letter in bold font is used to represent a set. The notation
￿
￿ stands for a vector,
and
￿
￿
￿
￿
￿
￿
￿
denotes the subvector of
￿
￿ containing the entries
￿
￿
￿
￿
￿
￿
￿
￿
￿ in that order.
￿
￿
and
￿
￿
3denote all-zero and all-one vectors, respectively. If
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ and
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿, we deﬁne
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. The integer modulo operator
￿
￿
￿ is deﬁned such that
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿.
Thus,
￿
￿
￿
￿
￿
￿
￿
whenever
￿
￿
￿
. The operators such as
￿,
￿
￿
￿,
￿
￿ and
￿
￿ on two vectors
are component-wise. As is customary,
￿ and
￿ represent the lexicographic order “less than” and
“less than or equal to”, respectively. Let
￿
￿
￿
￿
￿
. The notation
￿
￿
￿
￿
￿ denotes the lexicographic
minimum of all integer points in the set
￿:
￿
￿
￿
 
￿
￿
￿
￿
!
￿
￿
￿
￿
￿
￿
￿
"
￿
￿
￿
￿
￿
￿
￿
￿.
2.2 Program Model
Tiling focuses on very structured computations expressed in perfectly nested loops with constant
data dependences. Without complicating the presentation unduly, we consider a simple program
model deﬁned as follows:
#
$
%
&
’
￿
(
)
￿
*
’
￿
+
,
￿
*
’
￿++
-
.
.
.
#
$
%
&
’
￿
(
)
￿
*
’
￿
+
,
￿
*
’
￿++
-
/
&
0
&
1
’
-
-
(
2
&
/
&
0
&
1
’
3
1
4
￿
-
-
5
.
.
.
5
/
&
0
&
1
’
3
1
4
6
-
-
-
*
However,withsomenotationalextensions(by addingmoresubscripts,forexample),all thecompiler
techniques proposed in the paper can be used in the case of multiple array variables appearing
in multiple statements, each of which may possibly be guarded by conditionals. The only real
restriction is that the program must have constant dependence vectors.
It is assumed that all dependences are ﬂow (i.e., true) dependences, implying that the program
has single assignment semantics. Methods for removing anti and output dependences include array
expansion [14], node splitting [27], array privatisation [17], and others [4]. The absence of output
dependences makes it easy to enforce the owner-computes rule when the computation distribution
is performed before the data distribution. For reasons of symmetry, the compiler techniques we pro-
pose for dealing with ﬂow dependences can be used to deal with anti dependences as well. But anti
dependences can besatisﬁed bycollectivecommunicationfasterthan point-to-pointcommunication.
Many stencil-based programs like SOR consist of an outermost time step that performs iterative-
ly the computations described by the inner loops. The output data dependences in these programs
can be removed simply by array expansion [14]. We use a memory recycling technique to minimise
the amount of memory usage for expanded arrays (Section 9).
In our program model, all loops have stride 1 and the loop bounds
7
8 and
9
8 are afﬁne expres-
sions of outer loop variables
￿
￿
￿
￿
￿
￿
￿
￿
8
:
￿. Thus, the iteration space is deﬁned by:
;
￿
<
￿
￿
=
￿
>
￿
7
8
?
￿
8
?
9
8
@
where
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ denotes an iteration vector. Each read reference has the form
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿,
where
￿
C
 
￿
￿
￿
is a constant dependence vector. The subscript function
B is an afﬁne expression of
loop variables; it is injective over the iteration space
;
because the program has single assignment
semantics.
D denotes the set of all dependence vectors in the program:
D
￿
<
￿
C
￿
￿
￿
￿
￿
￿
￿
C
6
@. Because
wedealwithDOACROSS loopnests,thedependencevectorsareassumedtospantheentireiteration
space, i.e.,
E
F
G
￿
￿
D
￿
￿
￿
￿
￿
. (This implies that
H
￿
I
.)
The following program is used as a running example to illustrate various compiler techniques
used in the entire compilation process.
4EXAMPLE 1 Consider the following double loop:
#
$
%
&
’
(
￿
*
’
+
￿
*
’
++
-
#
$
%
&
￿
(
￿
*
￿
+
￿
*
￿
++
-
/
&
’
5
￿
￿
-
(
/
&
’
5
￿
￿
3
￿
-
￿
/
&
’
3
￿
5
￿
￿
3
￿
-
By putting the statement into the following form:
A
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
A
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
A
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
we can see easily that all data dependences in the program are constant:
D
￿
<
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@,
and the subscript function
B, deﬁned as
B
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
, is injective over the iteration space
;
￿
<
￿
￿
￿
￿
￿
=
￿
?
￿
?
￿
￿
￿
?
￿
?
￿
@ depicted in Figure 2(a).
2.3 Iteration Space Tilings
Section 2.3.1 deﬁnes a rectangular tiling as a loop transformation from
￿
￿
￿
to
￿
￿
￿
￿
. Section 2.3.2
calculates the dependences between tiles useful for generating message-passing code. Section 2.3.3
focuses on generating the sequential tiled code, which will be reﬁned gradually into an SPMD
program.
2.3.1 Rectangular Tilings
Tiling divides the iteration space into uniform tiles and traverses the tiles to cover the entire iteration
space. For example, as shown in Figure 2(b), the iteration space in Figure 2(a) has been divided
into 10 tiles, each of which is a
￿
￿
￿
rectangle. In general, the size of a tile can be characterised
by an
I
-dimensional vector, called the tile size vector,
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿, where
￿
8 is the number
of iterations in dimension
>. The number of iterations in a full tile is, thus,
￿
￿
￿
￿
￿
￿
￿
￿
￿, but the
number of iterations can be less for tiles at the boundaries of the iteration space. In Figure 2(b), a
full tile contains
￿
￿
￿
￿
￿ iterations, but the two boundary tiles each contain two iterations only.
To ﬁnd the mapping from iterations to nonnegative tile indices, we need the largest integer vec-
tor
￿
￿
￿
￿
￿
such that
￿
￿
￿
￿
￿
?
￿
￿
for all iterations
￿
￿
in the iteration space. That is,
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
<
￿
8
=
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
 
;
@. This vector can be found using Fourier-Motzkin elimination [31, p. 49] or the PIP
system [12]. For the iteration space in Example 1, we have
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿.
A rectangular tiling on an
I
-dimensional iteration space can be modeled as a mapping to a
￿
I
-
dimensional tiled iteration space deﬁned as follows:
￿
￿
;
￿
￿
;
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
 
￿
!
￿
:
!
￿
"
#
$
!
%
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
& (1)
where the tiled iteration space is the range of the mapping:
;
￿
￿
<
￿
￿
￿
￿
￿
=
￿
￿
 
;
@. The ﬁrst component
￿
￿
￿
￿
!
￿
:
!
￿
"
#
$
!
%
￿ identiﬁes the index of the tile that contains the iteration
￿
￿
and the second component
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
gives the position of
￿
￿
relative to the tile origin. The origin of tile
￿
￿
is the
5￿
￿
1 2 3 4 5 6 7 8 9
1
2
3
4
￿
￿
￿
￿
￿
￿
1 2 3 4 5 6 7 8 9
1
2
3
4
0 1 2 3 4
0
1
(a) Iteration space (b)
￿
￿
￿
Tiling ( depicting tile origins)
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@
￿
￿
￿
￿
￿
￿
<
￿
￿
￿
￿
@
￿
￿
￿
￿
PE0 PE1 PE0 PE1 PE0
(c) Tile dependences (d) Computation distribution
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
(e) Data distribution (f) Read-only data
￿
￿
￿
￿
￿
￿
￿
￿
￿ at PE0
￿
￿
￿
￿
￿
￿
PE0 PE1 PE0 PE1 PE0
(g) Read-only data
￿
￿
￿
￿
￿
￿
￿
￿
￿ at PE1 (h) Communication scheme
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
1 2 3 4 5 6 7 8 9
1
2
3
4
0 1 2 3 4
0
1
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
1 2 3 4 5 6 7 8 9
1
2
3
4
0 1 2 3 4
0
1
(i) Communication sets (one per tile) (j) Nonlocal access sets
Figure 2: A running example.
6unique iteration
￿
￿
such that
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
. As an example, the origins of all tiles in Figure 2(b) are
depicted in solid circles. Our deﬁnition of
￿
￿
￿ ensures that
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
.
The tile space, denoted
￿, is the set of all tile index vectors, which is the projection of the tiled
iteration space over its ﬁrst
I
coordinates:
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
=
￿
￿
 
;
￿
The tile space for the example in Figure 2 is
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
=
￿
?
￿
￿
?
￿
￿
￿
?
￿
￿
?
￿
@.
A rectangular tiling is an iteration-reordering transformation. To be legal, all dependence vectors
must be nonnegative [20, 39]. If a loop nest has dependence vectors with negative entries, the loop
nest can be skewed to make all dependence vectors nonnegative [37]. The skewed loop nest can
then be tiled by rectangles in the normal manner. However, some parallelepiped tilings cannot be
implemented this way. A generalisation of our compile techniques in this case is beyond the scope
of this paper.
In the following discussions, all dependence vectors are assumed to be nonnegative.
2.3.2 Dependences between Tiles
To generate message-passing code, it is necessary to ﬁnd out the data dependences between tiles
called the tile dependences. A tile depends on another tile if there exists a path of data dependences
from the latter to the former. In general, the tile dependences can be calculated from the data
dependences in the program and the tile size used [39]. Unlike a unimodular transformation, a
tiling transformation can transform a single dependence vector between iterations into several tile
dependence vectors. Precisely, the dependence vector
￿
C
￿
￿
C
￿
￿
￿
￿
￿
￿
C
￿
￿
 
D is mapped to the
following set of tile dependence vectors:
￿
￿
D
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
. . .
C
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
. . .
C
￿
￿
￿
￿
￿
!
￿
￿
￿
=
C
￿
8
 
<
￿
C
8
￿
8
￿
￿
￿
C
8
￿
8
￿
@
￿
￿
￿
￿
￿
In some trivial cases such as when the iteration space has only two iterations, some tile dependence
vectors in
￿
￿
￿
C
￿ may not exist. But
￿
￿
￿
C
￿ contains all tile dependence vectors possible.
Let
￿ be the set of all tile dependence vectors. We have:
￿
￿
￿
!
￿
￿
￿
￿
￿
￿
C
￿ (2)
As illustrated in Figure 2(c),
￿
￿
￿
￿
￿
￿
￿
￿
￿
<
￿
￿
￿
￿
￿
@ and
￿
￿
￿
￿
￿
￿
￿
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@. So
the dependence vector
￿
￿
￿
￿
￿ alone induces the dependences with all its neighbouring tiles. Thus,
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@.
For practical nested loops with constant dependences, the tile size is sufﬁciently larger than the
magnitudes of distance vectors [16]. We assume
￿
>
￿
C
8
?
￿
8 holds for every dependence vector
￿
C
in the program. This leads to the following theorem.
THEOREM 1 If
￿
￿
￿
C
 
D
￿
￿
C
?
￿
￿
￿, then
￿
￿
￿
<
￿
￿
￿
@
￿
￿
￿
￿
￿.
72.3.3 Sequential Tiled Code
Once a tiling transformation is found, we need to generate a sequential tiled program to execute the
points in the tiled iteration space lexicographically. This program serves as the basis from which
the ﬁnal SPMD program is gradually derived. Thus, we want the sequential code to be simple and
efﬁcient so that the SPMD code can also be simple and efﬁcient.
The sequential tiled code consists of
￿
I
loops that are perfectly nested, where the outer
I
tile
loops step between tiles and the inner
I
element loops enumerate the iterations within a tile. The tile
loops are constructed using Fourier-Motzkin elimination [31] or its extensions [12, 28].
According to the tiling transformation
￿ given in (1), all iterations that satisfy
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
are mapped to the same tile
￿
￿
. Thus, the set of iterations in the tile can be deﬁned by:
;
!
￿
￿
<
￿
￿
 
;
=
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@ (3)
Then, the tile space can be expressed as:
￿
￿
<
￿
￿
=
￿
￿
￿
 
;
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@ (4)
The above inequalities deﬁne a
￿
I
-dimensional convex polyhedron in terms of tile indices
￿
￿
￿
￿
￿
￿
￿
￿
￿
and loop indices
￿
￿
￿
￿
￿
￿
￿
￿
￿. By eliminating successively the loop indices
￿
￿
￿
￿
￿
￿
￿
￿
￿ from these in-
equalities using Fourier-Motzkin elimination [31, p. 49] or other extensions [12, 28], we obtain a
convex polyhedron deﬁned as follows:
￿
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
=
￿
>
￿
7
￿
8
?
￿
8
?
9
￿
8
@
where
7
￿
8 and
9
￿
8 are afﬁne expressions of
￿
￿
￿
￿
￿
￿
￿
￿
8
:
￿ and problem size parameters and can be
simpliﬁed as follows when the iteration space is rectangular:
7
￿
8
￿
￿
9
￿
8
￿
￿
￿
￿
:
￿
￿
%
￿
￿ (5)
These inequalities give rise to the tile loops as shown in Figure 3. Thus
￿
￿
is the iteration space of
the tile loops and will be called the effective tile space. By construction,
￿
￿
￿
￿ but
￿
￿
￿
￿ may
not be true. As a result, some empty tiles (containing no iterations) may be enumerated by the tile
loops.
The tile loops are said to be exact if
￿
￿
￿
￿
, i.e., if no empty tiles are enumerated.
The element loops shown in Figure 3 are obtained straightforwardly. Note that the original
loop bounds
7
8 and
9
8 are used in the loop bounds of these element loops to ensure that only the
iterations in the original iteration space are executed; they are redundant when all tiles are full tiles.
In general, the tile space
￿ is not a convex polyhedron. Thus, the tile loops may be inexact.
While it is possible to generate multiple loop nests to enumerate the tile space exactly [28], one
single perfect loop nest is preferred for several reasons. First, our tile loops are expected to be exact
for real programs. Second, the message passing code and its generation are simple. Third, the ﬁnal
8#
$
%
&
￿
￿
(
)
￿
￿
*
￿
￿
+
,
￿
￿
*
￿
￿++
-
.
.
.
#
$
%
&
￿
￿
(
)
￿
￿
*
￿
￿
+
,
￿
￿
*
￿
￿++
-
#
$
%
&
’
￿
(
￿
￿
￿
&
)
￿
5
￿
￿
￿
￿
￿
’
￿
￿
￿
￿
￿
-
*
’
￿
+
￿
￿
￿
&
,
￿
5
￿
￿
&
￿
￿
￿
￿
-
3
￿
￿
’
￿
￿
￿
￿
￿
-
*
’
￿++
-
.
.
.
#
$
%
&
’
￿
(
￿
￿
￿
&
)
￿
5
￿
￿
￿
￿
￿
’
￿
￿
￿
￿
￿
-
*
’
￿
+
￿
￿
￿
&
,
￿
5
￿
￿
&
￿
￿
￿
￿
-
3
￿
￿
’
￿
￿
￿
￿
￿
-
*
’
￿++
-
/
&
0
&
1
’
-
-
(
2
&
/
&
0
&
1
’
3
1
4
￿
-
-
5
.
.
.
5
/
&
0
&
1
’
3
1
4
6
-
-
-;
Figure 3: Sequential tiled code.
SPMD code derived from the tiled code is simple. We must emphasise that although
￿
￿
￿
￿
cannot
be guaranteed and the tile loops in Figure 3 may traverse empty tiles, the iterations of the original
program traversed by the element loops are exact.
The tiled code for the rectangular tiling in Figure 2(b) can be obtained as follows:
#
$
%
&
￿
￿
(
￿
*
￿
￿
+
￿
*
￿
￿++
-
#
$
%
&
￿
￿
(
￿
*
￿
￿
+
￿
*
￿
￿++
-
#
$
%
&
’
(
￿
￿
￿
&
￿
5
￿
￿
￿
￿
￿
-
*
’
+
￿
￿
￿
&
￿
5
￿
￿
￿
￿
￿
-
*
’
++
-
#
$
%
&
￿
(
￿
￿
￿
&
￿
5
￿
￿
￿
￿
￿
-
*
￿
+
￿
￿
￿
&
￿
5
￿
￿
￿
￿
￿
-
*
￿
++
-
/
&
’
5
￿
￿
-
(
/
&
’
5
￿
￿
3
￿
-
￿
/
&
’
3
￿
5
￿
￿
3
￿
-
(6)
where the lower bounds of
￿
and
￿
can be simpliﬁed to
￿
￿
￿
￿
￿
and
￿
￿
￿
￿
￿
, respectively, and the
upper bound of
￿
to
￿
￿
￿
￿
￿
.
2.4 Machine Model
Parallel processors of a distributed memory machine are arranged as an
￿-dimensional mesh:
￿
￿
<
￿
￿
?
￿
￿
￿
￿
￿
@
where
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ is a positive integer vector such that
￿
8
￿
￿
represents the number of
processors in the
>-th dimension. Each processor is uniquely identiﬁed by a processor index vector
￿
￿
 
￿
. The total number of processors in the processor mesh is
￿
￿
￿
￿
￿
￿
￿
￿
￿
. Each processor may
directly communicate with any processor in the target machine via message passing.
3 Computation Distribution
The computationdistribution is to determine which tiles are computed at which processors. Because
a tile is an atomic unit of computation, all iterations in the same tile are computed at the same
processor.
The tiles are allocated to processors cyclically according to the tile allocation function:
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
(7)
9This can be speciﬁed in HPF as
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ [22]. That is, the ﬁrst
￿
dimensions of the tile space are distributed cyclically to processors and the last
I
￿
￿ are collapsed.
Theiterationsinthe
>-thdistributeddimensionoftheiterationspacearedistributedusingcyclic(
￿
8),
where
￿
8 is the tile size in the
>-th dimension.
Since the loops are fully permutable when all dependence vectors are nonnegative [36]. Any
combination of
￿ dimensions can be distributed once the corresponding loops are permuted to
become the outer
￿ loops.
Let
￿
!
￿ be the set of tiles allocated to processor
￿
￿. We have:
￿
!
￿
￿
<
￿
￿
=
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@
Based on (4),
￿
!
￿ can be deﬁned by the following set of linear inequalities:
￿
!
￿
￿
<
￿
￿
=
￿
￿
￿
 
;
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@
where the modulo constraint can be made linear by introducing extra variables, a standard technique
used in HPF compilers [11].
The iterations allocated to processor
￿
￿ are the iterations contained in all tiles that are allocated
to that processor. The portion of the iteration space allocated to the processor
￿
￿, called the local
iteration space and denoted
;
!
￿, is deﬁned by the same inequalities as
￿
!
￿:
;
!
￿
￿
<
￿
￿
 
;
=
￿
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@
Suppose the tiles in Figure 2(c) are distributed over a linear arrangement of two processors. That
is,
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
@ and
￿
￿
￿
￿
￿
￿. Then the ﬁrst dimension of the tile space is cyclically distributed
to the two processors as shown in Figure 2(d) in that dimension, and the tiles in the other dimension
are all allocated to the same processor. The tiles allocated to processor
￿
￿
￿
￿
￿
￿
￿, where
￿
?
￿
￿
?
￿
,
can be calculated as:
￿
￿
￿
￿
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
=
￿
￿
￿
￿
￿
￿
G
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
G
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@
The local iteration spaces for PE0 and PE1 are speciﬁed by the same inequalities as above:
;
￿
￿
￿
￿
￿
<
￿
￿
￿
￿
￿
=
￿
￿
￿
￿
￿
￿
￿
￿
G
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
G
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@ (8)
In both cases,
￿
￿
￿
￿
￿
￿
￿
￿
￿ can be rewritten as
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
, where
￿
is a new variable.
By distributing tiles to processors cyclically in
￿-dimensions, we tend to minimise load im-
balance and maximise parallelism. This is especially signiﬁcant for skewed iteration spaces as
conﬁrmed by our experimental results. By collapsing tiles in the remaining dimensions, the compu-
tation in some processors can be overlapped with the communication that takes place in some other
processors. Furthermore, since the entries of tile dependence vectors are
￿
or
￿
(Theorem 1), our
computation distribution requires only nearest-neighbour communication between processors.
10Receive and Unpack the Read-Only Data From the Host
*
#
$
%
&
￿
￿
(
)
￿
￿
￿
&
￿
￿
3
)
￿
￿
-
￿
$
￿
￿
￿
*
￿
￿
+
,
￿
￿
*
￿
￿+=
￿
￿)
.
.
.
#
$
%
&
￿
￿
(
)
￿
￿
￿
&
￿
￿
3
)
￿
￿
-
￿
$
￿
￿
￿
*
￿
￿
+
,
￿
￿
*
￿
￿
+=
￿
￿
)
#
$
%
&
￿
￿
￿
￿
(
)
￿
￿
￿
￿
*
￿
￿
￿
￿
+
,
￿
￿
￿
￿
*
￿
￿
￿
￿++
-
.
.
.
#
$
%
&
￿
￿
(
)
￿
￿
*
￿
￿
+
,
￿
￿
*
￿
￿++
-
Receive and Unpack Messages for Tiles from Neighbouring PEs
*
#
$
%
&
’
￿
(
￿
￿
￿
&
)
￿
5
￿
￿
￿
￿
￿
’
￿
￿
￿
￿
￿
-
*
’
￿
+
￿
￿
￿
&
,
￿
5
￿
￿
&
￿
￿
￿
￿
-
3
￿
￿
’
￿
￿
￿
￿
￿
-
*
’
￿++
-
.
.
.
#
$
%
&
’
￿
(
￿
￿
￿
&
)
￿
5
￿
￿
￿
￿
￿
’
￿
￿
￿
￿
￿
-
*
’
￿
+
￿
￿
￿
&
,
￿
5
￿
￿
&
￿
￿
￿
￿
-
3
￿
￿
’
￿
￿
￿
￿
￿
-
*
’
￿++
-
/
&
0
&
1
’
-
-
(
2
&
/
&
0
&
1
’
3
1
4
￿
-
-
5
.
.
.
5
/
&
0
&
1
’
3
1
4
6
-
-
-
*
Pack and Send Messages for Tiles to Neighbouring PEs
*
Pack and Send the Result Back to the Host
*
Figure 4: SPMD code for processor
￿
￿ (after computation distribution).
Once tiles are partitioned across the processors, the ﬁrst
￿ tile loops in Figure 3 are modiﬁed,
as shown in Figure 4, so that each processor computes its own share of tiles. For loop nests with
rectangular iteration spaces, all the modulo operations are redundant because
7
￿
8
￿
￿
. The message-
passing code generation as indicated in the four boxes is discussed in Sections 4 and 5.
In the case of our running example, if the tiles are distributed to PE0 and PE1 as shown in
Figure 2(d), the outer tile loop given in (6) can be modiﬁed as indicated below:
#
$
%
&
￿
￿
(
￿
￿
*
￿
￿
+
￿
*
￿
￿+=
￿
-
#
$
%
&
￿
￿
(
￿
*
￿
￿
+
￿
*
￿
￿++
-
#
$
%
&
’
(
￿
￿
￿
&
￿
5
￿
￿
￿
￿
￿
-
*
’
+
￿
￿
￿
&
￿
5
￿
￿
￿
￿
￿
-
*
’
++
-
#
$
%
&
￿
(
￿
￿
￿
&
￿
5
￿
￿
￿
￿
￿
-
*
￿
+
￿
￿
￿
&
￿
5
￿
￿
￿
￿
￿
-
*
￿
++
-
(9)
so that each processor computes only the tiles it is allocated to.
4 Computer-Owns Rule, Data Distribution and I/O Code Gen-
eration
The data distribution is to determine the home processors where the data elements of an array are
permanently stored. Given a computation distribution, the data distribution is found using the so-
called computer-owns rule, by which a processor owns the data it computes. This rule has two
desirable implications. First, the owner-computes rule is still enforced in our SPMD programs
because the program has single assignment semantics. Second, the read-only data of an array are
not owned by any slaveprocessors but owned by the host (i.e., master) processor. We describe below
the inequalities that specify the read-only data owned by the host and the computed data owned by
11each processor. These inequalities are used to generate the message-passing code to perform the
I/O as indicated in the top and bottom boxes in Figure 4. For single assignment programs, only the
read-only data are required to be sent initially from the host to the slave processors.
The data allocated to a particular processor
￿
￿ are the data computed by that processor. The
portion of the data space of the array
A allocated to processor
￿
￿, called the local data space and
denoted
￿
!
￿, can be calculated as follows:
￿
!
￿
￿
B
￿
;
!
￿
￿
￿
<
B
￿
￿
￿
￿
=
￿
￿
 
;
!
￿
@ (10)
This rule of data distribution is called the computer-owns rule, because an element is owned by the
processor that computes its value. Due to the single assignment assumption, we have
￿
!
￿
￿
￿
￿
!
￿
￿
￿
￿
for any two processors
￿
￿
￿ and
￿
￿
￿ since
;
!
￿
￿
￿
;
!
￿
￿
￿
￿. This implies that the owner-computes rule is
implicitly enforced.
Let
￿
￿
￿
￿
￿
be the set of the read-only data accessed by all processors and
￿
￿
￿
￿
￿
￿
!
￿ be the set of the
read-only data accessed at the processor
￿
￿, respectively:
￿
￿
￿
￿
￿
￿
<
B
￿
￿
￿
￿
=
￿
￿
￿
 
;
!
￿
￿
￿
￿
￿
C
 
D
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
￿
!
 
;
￿
@
￿
￿
￿
￿
￿
￿
!
￿
￿
￿
￿
!
￿
￿
￿
<
B
￿
￿
￿
￿
￿
C
￿
=
￿
￿
 
;
!
￿
@
￿
￿
￿
￿
￿
￿
￿ (11)
Figures 2(e) – (g) depict the data distribution and the read-only data accessed at two processors
for our running example. The data owned by processor
￿
￿
￿
￿ are in the set
￿
￿
￿
￿
￿
￿
B
￿
;
￿
￿
￿
￿
￿, where
;
￿
￿
￿
￿ is given in (8). The inequalities for specifying
￿
￿
￿
￿
￿
,
￿
￿
￿
￿
￿
￿
￿
￿
￿ and
￿
￿
￿
￿
￿
￿
￿
￿
￿ are omitted.
Given
￿
￿
￿
￿
￿
￿
!
￿, we use the Omega Calculator [28] to generate the top code section in Figure 4.
Thehostusesthesameloopstructuretosendthecorrespondingread-only datatoprocessors. Forour
running example, the code in the global address space for receiving the read-only data at processor
￿
￿
￿
￿, where
￿
?
￿
￿
?
￿
, is:
if (
￿
￿ == 0)
for(j = 0; j <= 3; j++)
/* receive A(0,2j) */
if (
￿
￿ >= 0 &&
￿
￿ <= 1)
for(i = 0; i <= 9; i++)
￿
if (-2*
￿
￿+i-2 <= 4*
￿
:
￿
￿
￿
￿
￿
￿
:
￿
￿
￿))
/* receive A(i,0) */
if ((-2*
￿
￿+i)
￿
$
￿
￿
- == 0)
/* receive A(i,0) */
￿
The bottom code section in Figure 4 is derived similarly based on
￿
!
￿.
If the iteration space is skewed, the data space for an array is also skewed. In this case, the data
distribution cannot be speciﬁed by HPF’s block or cyclic data mapping directives [22]. But the data
distribution can be understood as being block-cyclically distributed on a skewed data space
5 Interprocessor Communication Code Generation
This section focuses on implementing the middle two code sections in Figure 4. That is, we discuss
how to generate message-passing code to communicate the required data between processors in
12order to satisfy all cross-processor dependences between tiles. Given a processor, all read-only data
accessed by the processor are received initially from the host, and therefore, only the data computed
nonlocally and accessed at the processor need to be communicated.
Let
￿ be the projection of
￿ over the
￿ distributed dimensions:
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
!
￿
￿
￿
=
￿
￿
 
￿
@
where every
￿
￿
 
￿, called a data link, indicates that two processors
￿
￿
￿ and
￿
￿
￿ such that
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
will communicate to each other. In general, a processor
￿
￿ sends messages to the processors in
<
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
=
￿
￿
 
￿
@ and receives messages from the processors in
<
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
=
￿
￿
 
￿
@.
Many communication schemes are possible depending on how messages are merged. Their
development and understanding can be facilitated by considering a generic tile
￿
￿
and the two sets:
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
<
￿
￿
￿
￿
￿
=
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
<
￿
￿
￿
￿
￿
=
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@
Recall that the tile
￿
￿
is allocated to the processor
￿
￿
￿
￿
￿. The tiles, called the successor tiles, in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ are allocated to the processor
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
and consume the data produced in
￿
￿
.
Reciprocally, thetiles, called the predecessor tiles, in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ are allocated to theprocessor
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
and produce the data consumed by
￿
￿
. The two sets for the running example are:
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@
Note that for the running example
￿
￿
<
￿
￿
￿
@ because
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@.
The following two communication schemes have been evaluated and compared:
1. The data produced in the predecessor tiles
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ and consumed in tile
￿
￿
are combined
and sent in one message. This implies that the data produced in tile
￿
￿
and consumed in the
successor tiles in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ will be sent in separate messages. As a result, the send code is
more complex than the receive code. This scheme is feasible because
￿
￿
cannot be computed
unless all the predecessor tiles in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ have been computed.
2. The data produced in tile
￿
￿
and consumed in the successor tiles in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ are combined and
sent in one message. This implies that the data produced in the predecessor tiles in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
and consumed in tile
￿
￿
will be received in separate messages. As a result, the receive code is
more complex than the send code.
Both schemes naturally encompasses the three classic optimisations: message vectorisation, coa-
lescing and aggregation [19]. Both schemes are the mirror image of each other. However, we have
adopted the second scheme because the property of sending the data produced in a tile in one single
message can be exploited to reduce memory cost for expanded arrays (Section 8.2).
Figure 2(h) illustratesthesecond scheme, i.e., the communicationscheme presented in thepaper.
Take tile
￿
￿
￿
￿
￿ allocated to PE0 for example. The two data elements depicted in solid dots are
accessed in the two tiles
￿
￿
￿
￿
￿ and
￿
￿
￿
￿
￿ allocated to PE1. PE0 will send to PE1 the two elements
13in one message at the completion of tile
￿
￿
￿
￿
￿. PE1 will not start executing tile
￿
￿
￿
￿
￿ until it has
received this message from PE0.
In Section 5.1, we derive the inequalities that specify the data to be communicated between pro-
cessors. In Section 5.2, we discuss the generation of the required message-passing code. Section 5.3
contains simpliﬁed message-passing code for rectangular iteration spaces.
5.1 Communication Sets
In our communication scheme, the communication set represents the set of data produced in one tile
at oneprocessor and consumed by all the tilesin anotherprocessor. Precisely, thecommunicationset
￿
!
￿
￿
!
￿ is the set of the data produced in tile
￿
￿
at processor
￿
￿
￿
￿
￿ and consumed by all tiles in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
at the processor
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
. Both the sending and receiving processor identiﬁers are implicit
in the notation
￿
!
￿
￿
!
￿.
Each tile
￿
￿
generates exactly
=
￿
=
communicate sets, where
=
￿
=
?
￿
￿
￿
￿
.
If we use
￿
!
￿
￿
!
￿ to represent the set, called the nonlocal access set, of the data produced in tile
￿
￿
and consumed in tile
￿
￿
￿
￿
￿
, then
￿
!
￿
￿
!
￿ is simply the union of the nonlocal access sets corresponding
to all tiles in the successor set
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ deﬁned below:
￿
!
￿
￿
!
￿
￿
￿
￿
!
￿
￿
￿
￿
!
￿
￿
￿
￿
￿
￿
￿
!
￿
￿
!
￿
￿
!
￿ (12)
Consider the running example illustrated in Figure 2. PE0 and PE1 communicate only through
the data link
￿
￿
￿
￿
￿
￿. Because
=
￿
=
￿
￿
, each tile generates one communication set
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ as
depicted in Figure 2(i).
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ contains one element if tile
￿
￿
￿
￿
￿
￿
￿ is located in the top row and
two elements otherwise. In the proposed approach, the communication set will be approximated
as containing two elements. Each tile
￿
￿
￿
￿
￿
￿
￿ generates two nonlocal access sets,
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ and
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿, one element per set, as depicted in Figure 2(j). These two nonlocal access sets are
accessed in tiles
￿
￿
￿
￿
￿
￿
￿
￿
￿ and
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿, respectively. Of course, if
￿
￿
￿
￿
￿
￿
￿ is a tile in the top
row,
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ will not be accessed because tile
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ does not exist.
Recall that
￿
￿
￿
C
￿ contains all tile dependence vectors induced by the dependence vector
￿
C
. Let
D
!
￿
￿
<
￿
C
=
￿
C
 
D
￿
￿
￿
 
￿
￿
￿
C
￿
@
Thus
D
!
￿ contains all dependence vectors in the program that induce the tile dependence vector
￿
￿
.
The nonlocal access set
￿
!
￿
￿
!
￿ contains the data produced in tile
￿
￿
and read by all possible refer-
ences
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿ in tile
￿
￿
￿
￿
￿
, where
￿
C
 
D
!
￿. So it can be expressed as follows:
￿
!
￿
￿
!
￿
￿
<
B
￿
￿
￿
￿
=
￿
￿
 
;
!
￿
￿
￿
￿
￿
C
 
D
!
￿
￿
￿
￿
￿
￿
￿
C
￿
 
;
!
￿
￿
!
￿
￿
@
which is equivalent to:
￿
!
￿
￿
!
￿
￿
￿
!
￿
￿
￿
￿
￿
<
B
￿
￿
￿
￿
=
￿
￿
 
;
!
￿
￿
￿
￿
￿
￿
￿
C
￿
 
;
!
￿
￿
!
￿
@
By using (12),
￿
!
￿
￿
!
￿ can be expressed as:
￿
!
￿
￿
!
￿
￿
￿
￿
!
￿
￿
￿
￿
!
￿
￿
￿
￿
￿
￿
￿
!
￿
￿
￿
!
￿
￿
￿
￿
￿
<
B
￿
￿
￿
￿
=
￿
￿
 
;
!
￿
￿
￿
￿
￿
￿
￿
C
￿
 
;
!
￿
￿
!
￿
@
￿
14Based on (3), the inequalities
￿
￿
 
;
!
￿
￿
￿
￿
￿
￿
￿
C
￿
 
;
!
￿
￿
!
￿ can be made more explicit:
￿
!
￿
￿
!
￿
￿
￿
￿
!
￿
￿
￿
￿
!
￿
￿
￿
￿
￿
￿
￿
!
￿
￿
￿
!
￿
￿
￿
￿
￿
<
B
￿
￿
￿
￿
=
￿
￿
 
;
￿
￿
￿
￿
￿
￿
C
￿
 
;
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
C
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
C
￿
￿
￿
@
￿
If
￿
￿
 
;
￿
￿
￿
￿
￿
￿
C
￿
 
;
is ignored, this set is over-approximated only for boundary tiles. We obtain:
￿
!
￿
￿
!
￿
￿
￿
￿
!
￿
￿
￿
￿
!
￿
￿
￿
￿
￿
￿
￿
!
￿
￿
￿
!
￿
￿
￿
￿
￿
<
B
￿
￿
￿
￿
=
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
C
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
C
￿
￿
￿
@
￿
(13)
which can be simpliﬁed to:
￿
!
￿
￿
!
￿
￿
￿
!
￿
￿
￿
￿
￿
￿
<
B
￿
￿
￿
￿
=
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
￿
￿
￿
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿
@
(14)
where
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿.
Let
￿
C
!
￿
￿
￿
￿
￿ and
￿
C
!
￿
￿
￿
￿
￿ be deﬁned such that their
>-th entries are:
C
!
￿
￿
￿
￿
￿
￿
8
￿
￿
G
￿
<
C
8
=
￿
C
 
D
!
￿
￿
@
C
!
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
<
C
8
=
￿
C
 
D
!
￿
￿
@
(15)
The communication set
￿
!
￿
￿
!
￿ is further over-approximated as the smallest enclosing rectangle:
￿
!
￿
￿
!
￿
(
￿
0
&
1
’
-
￿
1
￿
+
&
1
’
3
1
’
￿
￿
￿
3
1
￿
￿
1
￿
-
+
1
￿
3
1
￿
￿
3
1
4
!
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
+
&
1
’
3
1
’
￿
￿
￿
-
￿
￿
￿
￿
￿
3
1
￿
￿
￿
￿
￿
￿
￿
&
1
￿
￿
￿
￿
￿
￿
￿
1
￿
-
+
&
1
￿
3
1
4
!
￿
￿
￿
￿
￿
-
￿
￿
￿
￿
￿
3
1
￿
￿ (16)
Here,
￿
￿
,
￿
C
!
￿
￿
￿
￿
￿ and
￿
C
!
￿
￿
￿
￿
￿ are all compile-time constants,
￿
￿
￿
￿
￿
is a symbolic constant as a function
of problem size parameters and
￿
￿
consists of the
I
tile indices. Thus, the communication set is a
convex polyhedron in terms of the
I
loop indices
￿
￿
￿
￿
￿
￿
￿
￿
￿. Therefore, the compiler can use these
inequalities to generate a set of perfectly nested loops to pack and unpack messages.
To ﬁnd outtheinequalitiesthat specify thecommunicationset
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ for therunningexample
depicted in Figure 2(i), we simply substitute all known parameters into (16). By using
￿
￿
￿
￿
￿
￿
￿
￿,
￿
￿
￿
￿
￿
￿,
￿
C
!
￿
￿
￿
￿
￿
￿
￿
C
!
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ and
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿, we obtain (with simpliﬁcations):
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
<
￿
￿
￿
￿
￿
￿
=
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
?
￿
￿
￿
￿
￿
@
This communication set is exact for all tiles except those on the top of the iteration space (Fig-
ure 2(i)). For a tile on the top, one element is transferred redundantly.
151
#
$
%
1
￿
￿
￿
2
￿
#
&
￿
1
￿
￿
￿
￿
￿
￿
&
1
￿
5
1
￿
-
￿
￿
￿
￿
￿
￿
&
1
￿
-
(
#
￿
￿
￿
￿
-
￿
$
￿
￿
￿
￿
￿
￿
*
3
￿
￿
￿
￿
(
￿;
4
#
$
%
1
’
￿
￿
!
￿
￿
!
￿ in lexicographic order
￿
5
￿
￿
#
&
￿
￿
￿
++
-
￿
(
/
&
0
&
1
’
-
-;
6
￿
￿
￿
￿
&
&
1
￿
￿
1
￿
-
￿
$
￿
1
￿
5
￿
￿
#
5
￿
￿
￿
-;
(a) Send code for “Pack and Send Messages between Tiles”
1
#
$
%
1
￿
￿
￿
2
#
$
%
1
￿
￿
￿
%
￿
￿
&
1
￿
5
1
￿
- in lexicographic order
￿
3
￿
#
￿
￿
￿
￿
￿
&
1
￿
-
(
#
￿
￿
￿
￿
￿
$
￿
￿
￿
￿
￿
￿
*
4
￿
#
1
￿
(
￿
￿
￿
￿
￿
￿
￿
&
1
￿
5
1
￿
-
￿
￿
￿
￿
5
%
￿
￿
￿
&
&
1
￿
3
1
￿
-
￿
$
￿
1
￿
5
￿
￿
#
5
￿
￿
￿
￿
$
#
&
￿
￿
#
-
-;
6
￿
￿
￿
￿
(
￿
*
7
#
$
%
1
’
￿
￿
!
￿
￿
!
￿ in lexicographic order
￿
8
/
&
0
&
1
’
-
-
￿
(
￿
￿
#
￿
￿
￿
￿
++
￿;
(b) Receive code for “Receive and Unpack Messages between Tiles”
Figure 5: Two message-passing code sections for tiles.
5.2 Message-Passing Code
A tile
￿
￿
is said to be valid if it is enumerated by the tile loops, i.e., if it is in the effective tile space
￿
￿
.
So an empty tile is also regarded as being valid if it is enumerated by the tile loops. It is possible to
solve an integer programming problem to check at run-time if every tile being enumerated is empty
or not [28]. But this will be too expensive and unnecessary because the tile loops are expected to
be exact for real programs, and if not, the number of empty tiles should be very small. On the other
hand, it is efﬁcient to check at run-time whether a tile
￿
￿
is valid or not by simply evaluating the
conditional expression
￿
￿
 
￿
￿
.
Let
￿
￿
￿
￿
￿
￿
￿
￿
￿ be a boolean function that returns true if
￿
￿
is a valid tile and false otherwise.
During the execution of our SPMD code, a processor does not distinguish the empty tiles from
those that are not. However, due to the cyclic computation and data distribution, a processor has to
verify the validity of some tiles at run-time in order to send and receive messages correctly.
Let
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ be the lexicographically minimum valid tile in the successor set
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿:
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
<
￿
￿
=
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@ (17)
Figure 5 gives both the send and receive code for implementing the middle two message-passing
code sections from Figure 4. In the send code, each tile
￿
￿
executed at a processor
￿
￿ generates exactly
=
￿
=
communication sets (i.e., messages)
￿
!
￿
￿
!
￿, one for each neighbouring processor
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
such that
￿
￿
 
￿. This message is delivered as long as there is one valid successor tile in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿.
In the receive code, a processor
￿
￿ must ensure that all messages generated from the predecessor tiles
16in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ for every
￿
￿
 
￿ have been received before it can execute tile
￿
￿
. Let
￿
￿
￿
￿
￿
be a tile that is
also allocated to the processor
￿
￿ and that depends also on some of the predecessor tiles in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿.
Thus, the messages generated in the predecessor tiles of
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ that
￿
￿ depends must be received
even before
￿
￿ is executed. So the key in the construction of the receive code is to decide when a
message should be received. The answer is that a message should be received just before the ﬁrst
tile that depends on it is executed. This ﬁrst tile is identiﬁed by
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ in line 4 of the receive
code shown in Figure 5(b). In line 2 of the receive code, the
￿
￿
￿ loop processes all the predecessor
tiles in lexicographic order and receives the communication sets generated in these tiles if tile
￿
￿
is
the ﬁrst consumer of these messages.
The receive code is constructed to work for any arbitrary iteration space and can be simpliﬁed for
commonly occurring iteration spaces such as rectangular and triangular iteration spaces. Section 5.3
demonstrates how this can be achieved when the iteration space is rectangular.
THEOREM 2 The communicationcode in Figure 5 ensures that all cross-processortiledependences
are satisﬁed before a tile is executed.
Proof. Suppose processor
￿
￿ is ready to execute tile
￿
￿
. The tile
￿
￿
consumes the data produced at the
neighbouring processor
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
from all valid tiles
￿
￿ in the predecessor set
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. Let
the tiles
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
6
in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ be ordered such that
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
6
. These tiles are executed in
lexicographic order. Therefore, the communication sets
￿
!
￿
￿
￿
!
￿
￿
￿
￿
￿
￿
￿
!
￿
￿
￿
!
￿ generated in these tiles
are sent to the processor
￿
￿ in that order; so they should be received in the same order (line 2). Of
course, if a tile in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ is invalid – which can happen when
￿
￿
is a close to the boundaries of the
effective tile space, then there is nothing to receive (line 3). Consider the
>-th such communication
set
￿
!
￿
￿
￿
!
￿. By thedeﬁnitionof(16), itcarries thedatatobeconsumedbyallvalidtilesinthesuccessor
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿ that are executed at the processor
￿
￿, and tile
￿
￿
is among these consumer tiles. Clearly,
the processor must receive this
>-th communication set when
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿ (line 4), i.e., when
￿
￿
is the ﬁrst tile that accesses the data in the message
￿
!
￿
￿
￿
!
￿. If
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
instead, then the
message
￿
!
￿
￿
￿
!
￿ was received before tile
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿ is executed. Therefore, all cross-processor
dependences will be ﬁrst satisﬁed before a tile can be executed.
In both the send and receive code, the test
￿
￿
￿
￿
￿
￿
￿
￿
￿, can be replaced with
￿
￿
 
￿
￿
"
￿
￿
 
￿
￿
and
simpliﬁed symbolically using the techniques promoted in the Omega Calculator [28]. In practice,
the predicate we use in line 2 of the send code is
￿
￿
 
￿
￿
"
￿
￿
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
!
 
￿
￿
￿ after being
symbolically simpliﬁed. In line 4 of the receive code,
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ for all predecessor tiles
￿
￿ of tile
￿
￿
are found efﬁciently in one-pass as follows. Let
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. We have
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
<
￿
￿
￿
￿
￿
￿
=
￿
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@ for every
￿
￿ in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. The successor set
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ contains at most
￿
￿
:
￿
elements. For instance,
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ contains at most two elements when a 3D iteration space
is distributed to a 2D processor space, where
I
￿
￿ and
￿
￿
￿
. The compiler can generate code to
check if each tile in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ is valid or not, and ﬁnd
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ for all
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ in one
pass. Our experimental results indicate that the time taken by this part of the code is negligible.
171
#
$
%
1
￿
￿
￿
2
￿
#
&
￿
￿
￿
￿
￿
￿
,
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
,
￿
￿
- continue;
3
￿
￿
￿
￿
(
￿;
4
#
$
%
1
’
￿
￿
!
￿
￿
!
￿ in lexicographic order
￿
5
￿
￿
#
&
￿
￿
￿
++
-
￿
(
/
&
0
&
1
’
-
-;
6
￿
￿
￿
￿
&
&
1
￿
￿
1
￿
-
￿
$
￿
1
￿
5
￿
￿
#
5
￿
￿
￿
-;
(a) Simpliﬁed send code
1
#
$
%
1
￿
￿
￿
2
￿
#
&
)
￿
￿
￿
￿
￿
3
￿
￿
￿
￿
￿
￿
￿
)
￿
￿
￿
￿
￿
3
￿
￿
-
￿
$
￿
￿
￿
￿
￿
￿
*
3
%
￿
￿
￿
&
&
1
￿
3
1
￿
-
￿
$
￿
1
￿
5
￿
￿
#
5
￿
￿
￿
￿
$
#
&
￿
￿
#
-
-;
4
￿
￿
￿
￿
(
￿
*
5
#
$
%
1
’
￿
￿
!
￿
:
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
!
￿ in lexicographic order
￿
6
/
&
0
&
1
’
-
-
￿
(
￿
￿
#
￿
￿
￿
￿
++
￿;
(b) Simpliﬁed receive code
Figure 6: Simpliﬁed message-passing code for rectangular iteration spaces.
5.3 Simpliﬁed Message-Passing Code for Rectangular Iteration Spaces
If the iteration space
;
is rectangular, the effective tile space
￿
￿
will be rectangular as well. We
describe how to simplify the message-passing code when the effective tile space is rectangular. The
techniques we use generalise well to other regular iteration spaces.
Being rectangular, the effective tile space
￿
￿
has the form:
￿
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
=
￿
￿
?
>
?
I
￿
￿
?
￿
8
?
9
￿
8
@
The send code can be simpliﬁed immediately to the one given in Figure 6(a).
To simplify lines 2 – 4 of the receive code, we rely on the following lemma.
LEMMA 1 Suppose
￿
￿
is rectangular. Let
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿, where
￿
￿
 
￿
￿
. Then
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
"
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
where the sign of the equality holds if and only if
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿.
Proof. The proof constructs explicitly
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. By the deﬁnition of the predecessor set, we
have
￿
￿
￿
￿
￿
￿
￿
￿
for some
￿
￿
 
￿ such that
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
. By construction,
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
I
￿
￿. Since
￿
￿
is valid, we have
￿
￿
?
>
?
￿
￿
￿
?
￿
8
?
9
￿
8.
Since
￿
￿ is valid, we must also have:
￿
￿
￿
￿
?
>
?
I
￿
￿
?
￿
8
￿
￿
8
?
9
￿
8. This implies that
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
 
￿
￿
. By deﬁnition,
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
I
￿
￿
￿
￿
￿
,
where the sign of the equality holds if and only if
￿
￿
￿
￿
￿
￿
￿
I
￿
￿
￿
￿
, i.e.,
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿.
Based on this lemma, the receive code of Figure 5 is ﬁrst simpliﬁed to:
18￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
+=
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
++
￿
/
￿ Receive message
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿/
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
;
￿
￿
￿
 
￿
￿
￿
￿
!
￿
￿
"
￿
#
￿
$
%
￿
￿
$
&
￿
’
￿
￿
￿
￿
%
￿
￿
￿
￿;
(
￿
￿
)
￿
￿
￿
￿
￿
￿
￿
*
￿
￿
￿
￿
￿
*
￿
￿
￿
￿
￿
*
++
￿
￿
￿
￿
￿
+
￿
￿
￿
￿
,
￿
￿
+
￿
￿
￿
￿
,
￿
￿
+
++
￿
-
￿
*
$
￿
+
￿
￿
%
￿
￿
.
(
￿
￿
++
/
￿
￿
￿
￿
￿
*
￿
0
1
2
￿
￿
$
￿
￿
￿
,
￿
￿
￿
*
￿
0
3
4
￿
5
$
￿
￿
￿
,
￿
￿
￿
*
++
￿
￿
￿
￿
￿
+
￿
0
1
2
￿
￿
$
￿
￿
￿
,
￿
￿
￿
+
￿
0
3
4
￿
￿
$
￿
￿
￿
,
￿
￿
￿
+
++
￿
-
￿
*
$
￿
+
￿
￿
-
￿
*
$
￿
+
!
￿
￿
,
-
￿
*
!
￿
$
￿
+
!
￿
￿
/
￿ Send message
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿/
￿
￿
￿
￿
6
7
￿
￿
￿
￿
￿
￿
￿
￿
￿
(
￿
￿
)
￿
￿
￿
￿
￿
￿
￿
*
￿
￿
￿
￿
,
￿
￿
*
￿
￿
￿
￿
,
￿
￿
*
++
￿
￿
￿
￿
￿
+
￿
￿
￿
￿
,
￿
￿
+
￿
￿
￿
￿
,
￿
￿
+
++
￿
%
￿
￿
.
(
￿
￿
++
/
￿
-
￿
*
$
￿
+
￿
￿
&
￿
￿
#
￿
￿
￿
￿
,
￿
￿
"
￿
#
￿
$
%
￿
￿
$
(
￿
￿
￿;
Figure 7: SPMD program with communication code for Example 1.
1
￿
￿
￿
￿
￿
 
￿
3
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
5
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
9
￿
￿
￿
￿
￿
:
￿
￿
￿
￿
9
￿
￿
￿
￿;
6
￿
￿
￿
￿
￿
￿
8
7
￿
￿
￿
￿
￿
 
￿
!
￿
￿
!
￿ in lexicographic order
￿
8
A
￿
B
￿
￿
￿
￿
￿
￿
￿
9
￿
￿
￿
￿
￿
￿++
￿
;
and then to the one given in Figure 6(b) with the lines re-numbered.
By inserting the the communication code into the program given in (9), we obtain the SPMD
program for our running program given in Figure 7, where the I/O code is not shown.
6 Memory Management
Because we create SPMD programs, all processors must possess the same array declarations. There-
fore, each processor will execute in its own local address space. For regularly structured computa-
tions, we use a storage scheme to store both the local and nonlocal data of a distributed array in
the same local array. To hold the nonlocal data, we use an extension of overlap areas [15]: the
nonlocal data are stored according to the loop indices at which they are computed rather than their
array indices. This storage scheme leads to high locality of references for the computations within
tiles regardless of any sparse array references that may appear in the original program.
Let us use the program given in Figure 7 to illustrate our storage scheme. Figure 8(a) depicts
the portion of the data space of
A, where the elements accessed at PE1 are indicated with their
19￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
￿
5
￿
1,1
1,1
1,1
1,1
1,0
1,0
1,0
1,0
3,1
3,1
3,1
3,1
3,0
3,0
3,0
3,0
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
/
&
￿
5
￿
-
(a) The elements of
-
accessed at PE1 (b) The elements of
-
stored in the local array
￿
Figure 8: Local memory allocation for PE1 for the program in Figure 7.
array indices. The superscripts denote the tile indices at which the local elements are computed
(see Figure 2(b)). The shaded entries are nonlocal elements: those in light gray are the read-only
data received from the host and the others are the nonlocally computed data received from PE0. In
Figure 8(b), the elements distributed to PE1 are compressed to yield a compact local array
￿. The
nonlocal elements are stored close to where they are used. The arrows indicate the two dependence
vectors
￿
￿
￿
￿
￿ and
￿
￿
￿
￿
￿ between the elements in the local array.
In general, the local memory space in
￿ is organised as follows. The data computed within a
tile are stored in an array section of size
￿
￿
￿
￿
￿
￿
￿
￿
￿. The nonlocal data (including the read-
only data) accessed in a distributed dimension are stored in the overlap area of the array section on
the side towards the negative direction of the axis; the other side is not expanded because only the
ﬂow dependences are considered. To store the read-only data initially received from the host in a
non-distributed dimension, each local array is also expanded at the beginning in that dimension.
The local data space
￿
!
￿ allocated to a processor is non-contiguous because the data distribution
is cyclic and can be sparse because array references can be sparse. For instance, the array reference
function in Example 1, as illustrated in Figure 8, is sparse. In our storage scheme, both kinds of
sparsity are removed.
Our storage scheme can be summarised in one sentence: Any two dependent elements
A
￿
B
￿
￿
￿
￿
￿
and
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿ are mapped to the local array elements
￿
￿
￿
￿
￿ and
￿
￿
￿
￿
￿
￿
C
￿, respectively, where
￿
￿
is an array index vector for the local array
￿. Regrettably, the various address translation functions
required for achieving this objective can be quite involved both technically and notationally.
In Section 6.1, we present a memory compression technique to determine the minimal amount of
local memory allocated for a distributed array. Section 6.2 discusses the conversion between global
and local loop indices. In Section 6.3, we describe the required translations between global and
local array indices to ﬁnd out the addresses to store (a) locally computed data, (b) the read-only data
received from the host, and (c) the nonlocal data received from neighbouring processors.
Let
￿
C
￿
￿
￿ be deﬁned such that its
>-th entry
C
￿
￿
￿
￿
8 is the maximum of the
>-th entries of all
dependence vectors in
D:
C
￿
￿
￿
￿
8
￿
￿
G
￿
<
C
8
=
￿
C
 
D
@.
206.1 Local Array Allocation
For each array
A in the program, all processors use the same local array
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ to
store both the local and nonlocal elements.
Let
￿
￿
￿
￿
￿ be the smallest integer vector such that
￿
￿
￿
￿
￿
￿
￿
￿
for all iterations
￿
￿
in the iteration space.
That is,
￿
￿
￿
￿
￿
8
￿
￿
G
￿
￿
￿
8
=
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
 
;
￿. Just like its counterpart
￿
￿
￿
￿
￿
,
￿
￿
￿
￿
￿ can be found using
Fourier-Motzkin elimination [31, p. 49] or the PIP system [12].
In the
>-th dimension,themaximal numberoftiles is
￿
￿
"
￿
￿
￿
￿
:
￿
"
#
$
￿
￿
￿
￿
%
￿
￿. Two cases are considered.
If
> is a distributed dimension, where
￿
?
>
?
￿, then the maximal number of these tiles allocated
to any processor is at most
￿
￿
"
￿
￿
￿
￿
:
￿
"
#
$
￿
￿
￿
￿
%
￿
￿
￿
￿. To accommodate both the local data and the nonlocal
data (including the read-only data), the size of the
>-th distributed dimension required is at most
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
"
￿
￿
￿
￿
:
￿
"
#
$
￿
￿
￿
￿
%
￿
￿
￿
￿.
If the
>-th dimension is not distributed, where
￿
￿
>
?
I
, the size of
￿ along that dimension is
￿
8
￿
￿
"
￿
￿
￿
￿
:
￿
"
#
$
￿
￿
￿
￿
%
￿
￿
￿
C
￿
￿
￿
￿
8, where
C
￿
￿
￿
￿
8 is needed to store the read-only data from the host.
Let
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. The upper bounds in
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ are deﬁned as follows:
￿
8
￿
￿
￿
￿
￿
￿
￿
?
>
?
￿
￿
￿
￿
8
￿
￿
"
￿
￿
￿
￿
:
￿
"
#
$
￿
￿
￿
￿
%
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
"
￿
￿
￿
￿
:
￿
"
#
$
￿
￿
￿
￿
%
￿
￿
￿
C
￿
￿
￿
￿
8
￿
￿
￿
(18)
In the case of the SPMD program given in Figure 7, we have
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿,
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿,
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿,
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿,
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. So the local array declarations used is
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿.
6.2 Global v.s. Local Loop indices
A function that translates between global loop indices to local loop indices is given. This function
will be used to deﬁne various address translation functions to map both the local and nonlocal
elements of an array to the local memory of a processor.
For notational convenience, let
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
 
￿
￿
￿
.
The global loop indices are mapped to the local loop indices as follows:
￿
￿
;
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
￿ (19)
The conversion from global to local loop indices is independent of the processor identiﬁer. Let
;
￿
￿
<
￿
￿
￿
￿
￿
=
￿
￿
 
;
@. The inverse of this function is found to be:
￿
:
￿
!
￿
￿
;
￿
￿
￿
;
￿
￿
:
￿
!
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
(20)
The inverse is dependent on the processor identiﬁer
￿
￿.
The division and modulo operations are expensive, especially if they are performed frequently
inside the innermost loop of the SPMD code. When generating SPMD code, the following fact will
21be exploited. Let
￿
￿
￿
￿
￿
￿
￿
￿
￿. By noting that
￿
￿
￿
￿
!
￿
:
!
￿
"
#
$
!
%
￿ and
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
, we
can simplify the function
￿
by avoiding the modulo operation:
￿
￿
;
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
￿ (21)
The division and modulo operations are unnecessary for a non-distributed dimension
>, where
￿
￿
>
?
I
. In this case,
￿
￿
8
￿
￿
8 and
￿
￿
8
￿
￿
. Thus, the
>-th component of the function
￿
can be
simpliﬁed to:
￿
￿
￿
￿
￿
8
￿
￿
8
￿
￿
￿
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8 (22)
and similarly, the
>-th component of the function
￿
:
￿
!
￿ can be simpliﬁed to:
￿
:
￿
!
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
8
6.3 Global v.s. Local Array indices
Given an array
A, this section provides the formulas that a processor
￿
￿ can use to ﬁnd out the local
addresses to store the three types of data elements for
A: (a) the local data
￿
!
￿ deﬁned in (10); (b) the
read-only data
￿
￿
￿
￿
￿
￿
!
￿ deﬁned in (11), which are received initiallyfrom the host; and (c) the nonlocal
data
￿
!
￿
￿
!
￿ deﬁned in (16), which are received from the neighbouring processor
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
.
Asmentionedearlier, alladdresstranslationfunctionsaredeﬁned sothattwodependentelements
A
￿
B
￿
￿
￿
￿
￿ and
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿ are mapped to the local array elements
￿
￿
￿
￿
￿ and
￿
￿
￿
￿
￿
￿
C
￿, respectively.
6.3.1 Mapping Locally Computed Data
A locally computed element
A
￿
￿
￿
￿ in
￿
!
￿ is mapped to
￿
￿
￿
￿
￿
￿
￿
￿ as follows:
￿
￿
￿
￿
￿
￿
￿
￿
B
:
￿
￿
￿
￿
￿
￿ (23)
The inverse of this function given below depends on the processor identiﬁer
￿
￿:
￿
:
￿
!
￿
￿
￿
￿
￿
￿
￿
B
￿
￿
:
￿
!
￿
￿
￿
￿
￿
￿
￿
The effect of
B on the sparsity of array accesses is eliminated since for every write reference
A
￿
￿
￿
￿, there must exist an iteration
￿
￿
such that
￿
￿
￿
B
￿
￿
￿
￿. This means that
￿
￿
￿
￿
￿
￿
￿
￿
B
:
￿
￿
￿
￿
￿
￿
￿
￿
￿
B
:
￿
￿
B
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. So itistheiterationvector
￿
￿
ratherthan thearray indexvector
￿
￿ that determines
the actual mapping of
A
￿
￿
￿
￿ to the local processor memory. Once the effect of
B on array references
is eliminated, the function
￿
has been designed to compress the loop indices resulting from a cyclical
computation distribution into contiguous local array indices.
This function tells where the results of the computation are. Thus, the compiler can use this
function to generate the bottom code section in Figure 4 to send the results to the host.
226.3.2 Mapping the Read-Only Data
Each read-only element
A
￿
￿
￿
￿ in
￿
￿
￿
￿
￿
￿
!
￿ is mapped to
￿
￿
￿
￿
￿
￿
￿
￿ according to the function:
￿
￿
￿
￿
￿
￿
￿
￿
B
:
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
￿
C
￿where
￿
C
 
D such that
B
:
￿
￿
￿
￿
￿
￿
￿
C
 
;
!
￿ (24)
Here,
B
:
￿
￿
￿
￿
￿
￿
￿
C
gives the iteration at which
A
￿
￿
￿
￿ is accessed.
This function is used in the top code section in Figure 4 to store the read-only data appropriately.
6.3.3 Mapping Nonlocally Computed Data
Supposethat processor
￿
￿ has received the nonlocaldata elements in the communicationset
￿
!
￿
￿
!
￿ from
the processor
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
. Each nonlocal element
A
￿
￿
￿
￿ in
￿
!
￿
￿
!
￿ can be mapped to
￿
￿
￿
￿
￿
￿
￿
￿ using
exactly the same function
￿ for mapping the read-only data:
￿
￿
￿
￿
￿
￿
￿
￿
B
:
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
￿
C
￿where
￿
C
 
D such that
B
:
￿
￿
￿
￿
￿
￿
￿
C
 
;
!
￿ (25)
In practice, however, the nonlocally computed data are not mapped this way to the local memory
of the receiving processor
￿
￿. Because each processor executes in its own local address space, the
nonlocal data contained in a messageis expressed in thelocal address space of the sending processor
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
. These nonlocal data must be mapped to the local address space of the receiving
processor
￿
￿.
By applying the function
￿ to the communication set
￿
!
￿
￿
!
￿ in the global address space, we obtain
the following communication set in the local address space of the sending processor
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
:
￿
!
￿
￿
!
￿ =
￿
1
’
￿
￿
1
￿
+
1
’
￿
3
1
￿
￿
￿
!
￿
!
￿
￿
￿
3
1
4
￿
￿
￿
+
1
￿
3
1
￿
3
1
4
!
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
+
&
1
’
￿
3
1
￿
￿
￿
!
￿
!
￿
￿
￿
3
1
4
￿
￿
￿
-
￿
￿
￿
￿
￿
3
1
￿
￿
￿
￿
￿
￿
￿
1
￿
+
1
￿
￿
￿
￿
￿
￿
3
1
4
!
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
3
1
￿
￿ (26)
Each communication set now represents a regular array section of the local array
￿, yielding more
efﬁcient packing and unpacking code. When some boundary tiles are executed, some extra elements
of
￿ transferred may not be associated with any elements in the distributed array
A. But these
elements will never be accessed.
After having received the communication set
￿
!
￿
￿
!
￿ from
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
, the processor
￿
￿ uses the
following function to map a data element
￿
￿
￿
￿
￿
￿ in
￿
!
￿
￿
!
￿ to
￿
￿
￿
￿
￿
￿
￿
￿
￿ in its own local memory:
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
where the ﬁrst
￿ entries
￿
￿
￿
8 are deﬁned as follows:
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
8
￿
￿
(27)
23LEMMA 2 Both
￿ and
￿ map a nonlocal element received to the same local memory location of a
receiving processor.
Proof. Assumethat
A
￿
B
￿
￿
￿
￿
￿ computed at the processor
￿
￿ accesses the non-local element
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿
computed at the processor
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
. According to the function
￿ in (25),
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿ is
mapped to
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿ in the local address space of the receiving processor
￿
￿. We show that the
function
￿ in (27) will also map
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿ to
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿.
Because
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿ is a locally computed element at the sending processor
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
,
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿ is allocated to
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿ in the local address space of
￿
￿ according to the
function
￿ in (23). The element
￿
￿
￿
￿
￿
￿ is contained in the message
￿
!
￿
￿
!
￿ sent from
￿
￿ to
￿
￿. On arriving
at the receiving processor
￿
￿,
￿
￿
￿
￿
￿
￿ is mapped to
￿
￿
￿
￿
￿
￿
￿ in the local address space of
￿
￿ according to
the function
￿ in (27).
The rest of the proof is to establish that
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
, i.e.,
￿
>
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
8
￿
C
8. Note that
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿. Consider the
>-th component of
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿. If
> is a non-distributed
dimension, where
￿
￿
>
?
I
, then
￿
￿
￿
8
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
C
￿
8
￿
￿
￿
￿
￿
￿
8
￿
C
8 holds due to (22). Let us
assume that
> is a distributed dimension, i.e.,
￿
?
>
?
￿. Since
￿
￿
￿
￿
C
 
;
!
￿, the iteration
￿
￿
￿
￿
C
is
contained in tile
￿
￿
. From (21), we obtain:
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
C
￿
8
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
8
￿
8
￿
￿
￿
8
￿
C
8
￿
￿
8
￿
8
￿
￿
￿
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
Since
￿
￿
￿
￿
C
 
;
!
￿, there must exist
￿
￿
 
￿, where
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
, such that
￿
￿
 
;
!
￿
￿
!
￿. From (21), we
obtain:
￿
￿
￿
￿
￿
8
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
8
￿
￿
8
￿
8
￿
￿
￿
8
￿
￿
8
￿
￿
8
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
If
￿
8
￿
￿
,
￿
￿
￿
8
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
8
￿
C
8 holdstrivially. If
￿
8
￿
￿
, thentwocases areconsidered. If
￿
8
￿
￿
, we
have
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. Clearly,
￿
￿
￿
8
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
8
￿
C
8 holds.
If
￿
8
!
￿
￿
, we have
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿. Again
￿
￿
￿
8
￿
￿
￿
8
￿
￿
8
￿
￿
￿
￿
￿
￿
8
￿
C
8
holds.
Based onthislemma, weshowthattheaddresstranslationfunctionsintroducedaboveimplement
the scheme as illustrated in Figure 8.
THEOREM 3 In local address space, the loop body
A
￿
B
￿
￿
￿
￿
￿
￿
￿
￿
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
A
￿
B
￿
￿
￿
￿
￿
C
6
￿
￿
￿
given in Figure 4 becomes
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
6
￿
￿.
Proof. The write reference
A
￿
B
￿
￿
￿
￿
￿ is mapped to
￿
￿
￿
￿
￿
￿
￿
￿ according to (23). It sufﬁces to show that
each read reference
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿ is mapped to
A
￿
￿
￿
￿
￿
￿
￿
￿
C
￿. Three cases are distinguished depending
on whether
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿ is (a) a local element, (b) a read-only element received from the host, or (c) a
nonlocal element received from another processor. In case (a),
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿ is mapped to
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
according to (23). By using the assumption that
￿
>
￿
C
8
?
￿
8, we can show by an algebraic
manipulation that
￿
￿
￿
￿
￿
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
. In case (b),
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿ is read-only, it is mapped
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
directly according to (24). In case (c), due to Lemma 2, the same function
￿ used for mapping the
read-only data is used for mapping the nonlocal data. If
A
￿
B
￿
￿
￿
￿
￿
C
￿
￿ is a non-local element, it is
mapped
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿ directly according to (25).
247 SPMD Code in Local Address Space
Figure 9 gives the SPMD code that each processor executes in its own local address space.
Receive and Unpack the Read-Only Data From the Host
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
,
￿
￿
￿
!
￿
￿
￿
￿
"
￿
#
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
+=
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
,
￿
￿
￿
!
￿
￿
￿
￿
"
￿
#
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
+=
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
++
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
++
￿
/
￿ Receive and Unpack Messages for Tiles from Neighbouring PEs
￿/
1
￿
￿
￿
￿
￿
￿
￿
2
￿
￿
￿
￿
￿
￿
￿
￿
￿
#
￿
￿
￿
$
￿
￿
￿ in lexicographic order
￿
3
￿
￿
 
￿
(
￿
#
￿
￿
￿
￿
￿
￿
￿
(
&
￿
￿
￿
￿
￿
￿
￿
￿
￿
;
4
￿
￿
￿
￿
￿
"
￿
￿
&
￿
￿
￿
￿
￿
￿
$
￿
￿
￿
￿
￿
￿
￿
5
￿
￿
￿
 
￿
￿
￿
￿
!
￿
￿
￿
"
￿
#
￿
￿
$
%
￿
￿
$
&
￿
’
￿
￿
￿
￿
%
￿
￿
￿
￿;
6
(
￿
￿
)
￿
￿
￿
7
￿
￿
￿
￿
*
￿
￿
￿
￿
￿
￿
￿ in lexicographic order
￿
8
￿
￿
￿
￿
￿
*
￿
￿
)
￿
%
￿
￿
.
(
￿
￿
++
/;
*
￿
￿
0
1
2
￿
￿
￿
$
￿
￿
￿
￿
,
*
￿
￿
￿
￿
￿
￿;
￿
￿
￿
0
3
4
￿
￿
￿
$
￿
￿
￿
￿
￿
,
￿
￿
!
￿
,
*
￿
￿
￿
￿
￿
￿;
*
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
,
*
￿
!
￿
￿
￿
￿
!
*
￿
￿
￿
￿
￿
,
 
￿
!
"
￿
￿
￿
￿
￿
￿
*
￿
￿
￿
￿
*
￿
￿
￿
￿
￿
*
￿
++
$
*
￿
￿++
￿
￿
￿
￿
*
￿
￿
0
1
2
￿
￿
￿
$
￿
￿
￿
￿
,
*
￿
￿
￿
￿
￿
￿;
￿
￿
￿
0
3
4
￿
￿
￿
$
￿
￿
￿
￿
￿
,
￿
￿
!
￿
,
*
￿
￿
￿
￿
￿
￿;
*
￿
￿
￿
￿
￿
￿
￿
￿
#
￿
￿
#
￿
,
*
￿
!
￿
￿
￿
￿
!
*
￿
￿
￿
￿
￿
,
 
￿
!
"
￿
￿
￿
￿
￿
￿
*
￿
￿
￿
￿
*
￿
￿
￿
￿
￿
*
￿
++
$
*
￿
￿++
￿
￿
￿
￿
*
￿
￿
￿
$
￿
￿
￿
￿
*
￿
!
￿
 
￿
￿
$
￿
￿
￿
$
￿
￿
￿
*
￿
!
￿
 
%
￿
￿
￿
/
￿ Pack and Send Messages for Tiles to Neighbouring PEs
￿/
1
￿
￿
￿
￿
￿
￿
￿
2
￿
￿
￿
&
￿
￿
￿
&
￿
￿
￿
￿
￿
￿
$
￿
￿
￿
)
 
￿
(
￿
#
￿
￿
￿
￿
￿
￿
￿
(
&
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
;
3
(
￿
￿
)
￿
￿
;
4
￿
￿
￿
￿
*
￿
￿
￿
￿
￿
￿
￿ in lexicographic order
￿
5
%
￿
￿
￿
(
￿
￿
++
￿
)
￿
￿
￿
￿
*
￿;
6
&
￿
￿
#
￿
￿
￿
￿
,
￿
￿
￿
"
￿
#
￿
￿
$
%
￿
￿
$
(
￿
￿
￿;
Pack and Send the Result to the Host
￿
Figure 9: SPMD code in local address space.
In the send code, the communication set
￿
!
￿
￿
!
￿ consists of the data elements in the local address
space of each sending processor. On arriving at a receiving processor, these data elements are
mapped to the local memory of the receiving processor using the function
￿ given in (27).
The rewriting of the loop body is straightforward due to Theorem 3.
The tile loops are the same as those in Figure 4.
25The element loops are obtained as follows. To avoid expensive conversions from global to local
loop indices, an efﬁcient solution is developed to maintain both the global and local loop indices.
Note that all
I
equations in (21) are independent of each other. Therefore, we maintain the original
>-th element loop to enumerate the iterations in global indices along that dimension and perform the
translation from
￿
8 to
￿
￿
8 inside the
>-th loop. As a ﬁrst approximation, we get:
￿
￿
￿
￿
￿
8
￿
￿
G
￿
￿
7
8
￿
￿
8
￿
8
￿
￿
￿
￿
￿
￿
8
￿
8
￿
8
?
￿
￿
￿
￿
9
8
￿
￿
8
￿
￿
8
￿
>
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
8
￿
8++
￿
￿
￿
8
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
8
￿
8
￿
￿
￿
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
8
where
￿
￿
8 is a linear function of
￿
8 with coefﬁcient 1. We only need to compute
￿
￿
8 once at the
beginning of the
￿
8 loop and increment
￿
￿
8 at the end of each iteration. This gives rise to the element
loops shown in Figure 9. Note that for the
>-th non-distributed dimension, where
￿
￿
8
￿
￿
8 and
￿
￿
8
￿
￿
, we always have
￿
￿
￿
￿
>
?
I
￿
￿
￿
8
￿
￿
8
￿
￿
￿
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿.
The SPMD code can be optimised further by using standard techniques such as constant propa-
gation, common subexpression elimination and loop invariant motion. These optimisations can be
performed by a sequential compiler.
By maintaining both global and local loop variables, we have avoided expensive modulo op-
erations that would otherwise occur due to the translation between global and local indices. The
modulo operations in the ﬁrst
￿ tile loops are unavoidable for arbitrary iteration spaces. When the
iteration space is rectangular, the modulo operations can be eliminated from the loop bounds and
and the ﬂoor operations
￿
￿
￿
￿
￿
￿ can be eliminated by normalising the
￿ outer tile loops to have stride
1 and then performing appropriate variable substitutions.
The program from Figure 7 in local address space is given Figure ??. Note that the tile loops
have been normalised. The new tile loop variables are
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ and
￿
￿
￿
￿
￿
￿.
8 Memory Optimisations
If a program has dependence vectors with negative entries, the iteration space must be skewed to
make all dependence vectors nonnegative so that a rectangular tiling can be legally applied. In
addition, all anti and output dependences must be removed, which may be done by array expansion
[14]. Two memory optimisations are developed to reduce the amount of memory usage: address
folding deals with skewed iteration spaces and memory recycling with expanded arrays.
8.1 Address Folding
The basic idea is introduced using an example shown in Figure 20. Suppose
A is an array deﬁned
in the entire iteration space, where
;
￿
<
￿
￿
￿
￿
￿
=
￿
?
￿
?
￿
￿
￿
￿
G
￿
￿
￿
￿
￿
￿
￿
￿
?
￿
?
￿
￿
￿
@. This
means that
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ and
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. Assume that
￿
￿
￿
￿
￿
￿
￿
￿,
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿ and
￿
￿
￿
￿
￿
￿.
According to (18), the local array declared will be
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. Since both the iteration and
data spaces are skewed, many elements of
￿ are not used. The size of the second dimension of
this array can be roughly halved if we exploit the fact that any chain of tiles allocated to the same
processor consists of at most ﬁve tiles. In other words, the number of elements of
￿ accessed for a
ﬁxed value in the ﬁrst dimension is at most
￿
￿
￿
￿
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. Thus, a smaller local
26/
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ declared here
￿/
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
by (27)
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
?
￿
￿
:
￿
￿
￿
￿
8
￿
￿
￿++
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
?
￿
8
￿
￿
￿++
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿;
/
￿ Receive message
￿
￿
￿
￿
￿
￿
￿
￿
￿
:
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿/
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
9
￿
￿
￿
￿
￿
:
￿
￿
￿
￿
9
￿
￿
￿
￿;
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
:
￿
￿
￿
￿
￿
￿
8
￿
￿
?
￿
￿
￿
￿
￿
￿
￿
￿
￿
:
￿
￿
￿
￿
￿
￿
8
￿
￿
++
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
?
￿
￿
￿
￿
￿
￿
8
￿
￿
++
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
9
￿
￿
￿
￿
￿
￿++
￿
8
￿
￿
￿
G
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
8
￿
?
￿
￿
8
￿
++
￿
￿
￿
++
￿
￿
￿
￿
G
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
8
￿
￿
￿
￿
8
￿
?
￿
￿
8
￿
++
￿
￿
￿
++
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿;
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿;
/
￿ Send message
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿/
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
?
￿
￿
￿
￿
￿
￿
8
￿
￿
++
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
?
￿
￿
￿
￿
￿
￿
8
￿
￿
++
￿
9
￿
￿
￿
￿
￿
￿++
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
9
￿
￿
￿
￿
￿
￿
￿;
Figure 19: SPMD program for Example 1 in local address space.
array
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
can be declared instead. In this two-dimensional case, the address folding can
be deﬁned as a mapping from the local indices to folded local indices:
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿,
The SPMD code can be modiﬁed appropriately to reﬂect this mapping.
In general, let
￿
8
￿
￿
G
￿
<
=
￿
8
￿
￿
8
=
￿
￿
=
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
￿
￿
 
;
@ (28)
where
￿
8 represents the maximal number of iterations (or range) on any line that is within the
iteration space and parallel to the
>-th elementary vector. Then, the local array declaration
￿
￿
￿
￿
￿
27￿
￿
￿
￿
0 1 2 3 4 5 6
0
1
2
3
4
5
6
7
8
9
PE0
PE1
Figure 20: Address folding for a skewed iteration space.
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ will be declared, where the upper bounds are given by:
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
?
>
?
￿
￿
￿
￿
8
￿
￿
￿
￿
%
￿
%
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
where
￿
￿
￿
￿
%
￿
%
￿
￿ gives the maximal number of tiles along that dimension in the worst case.
To calculate
￿
8, it is infeasible if all the points in the iteration space have to be checked. We use
the properties of convex polyhedra to reduce the number of points to be checked. Only the vertices
of the iteration space
;
need to be checked, as the maximum ranges occur where either or both end
points are the vertices, which can be found efﬁciently for regularly structured computations.
The function that folds the local indices of
￿ to those of
￿
￿ can be deﬁned as follows:
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
This mapping is injective by construction.
The modulo operations on
￿
￿
8 can be coded by exploiting the fact that
￿
￿
8 increments by 1 at each
iteration of the
>-th element loop in Figure 9. So
￿
￿
8
￿
￿
￿
￿
￿
￿
8
￿
￿
￿ is performed only once at the
beginning of the loop and set to 0 once it reaches
￿
￿
8
￿
￿
.
8.2 Memory Recycling
Many applications such as stencil-based computations will require array expansion to be performed
on the original programs. Frequently, the outermost dimension of the loop nest would cause a new
array dimension, in the case of SOR this would represent time steps. Each hyperplane (constructed
from all but the time dimension) would then contain the results at a particular time step. Only the
last hyperplane is required for the results. All others are only used to store intermediate results.
Introducing dimensions into the data space increases the amount of memory required. Memory
recycling is a technique of allocating less memory than required for the complete problem and
reusing that memory as the computations proceed. The array needs only to hold enough information
to perform its current computations. All previous information is old and can be overwritten.
An important question is: How many hyperplanes do we need to perform the computations if
the hyperplanes that were unused are reused for other computations?
28Memory recycling consists of deﬁning a recycling function that maps the elements in a local
array into an array of a reduced size. It must be designed so that all three types of data elements of
an array are handled correctly: (a) the locally computed data, (b) the nonlocally computed data, and
(c) the read-only data. The two key observations are as follows:
￿ All useful data produced in a tile accessed by other processors are sent immediately after the
tile is executed (see the send code of Figure 5).
￿ A message (i.e., a communication set) arriving at a processor will be received and unpacked
into the local memory of the processor only just before the ﬁrst tile that depends on the mes-
sage is executed (see line 4 of the receive code of Figure 5).
Therefore, if
> is an expanded dimension and distributed, we wish to reduce the size of the local
array to
￿
8
￿
C
￿
￿
￿
￿
8 in that dimension. The local array declared will be
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿,
where
￿
￿
8
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
and the other upper bounds
￿
￿
￿, where
￿
!
￿
>, are declared as in (18).
This is because the information that is needed for a tile to perform its computations only extends
back
C
￿
￿
￿
￿
8 iterations in an expanded dimension. Then, the local indices along those dimensions are
simply folded appropriately using the modulo operations:
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
:
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ (29)
The locations where the read-only data are stored can also be recycled. Since a processor
￿
￿
knows the read-only data it accesses, which are in
￿
￿
￿
￿
￿
￿
!
￿ given in (11). Therefore, the SPMD
program is modiﬁed so that the read-only data are saved in a separate data structure and then used
to perform the required initialisations appropriately. This will be explained using the SOR example
in Section 9.
The key for memory recycling to work is to ensure that the nonlocal data are accessed correctly.
LEMMA 3 Let the
>-th distributed dimension be recycled. Then
￿
!
￿
￿
!
￿ and
￿
!
￿
￿
￿
!
￿
￿ are mapped to the
same local memory region if and only if
￿
￿
￿
￿
￿
￿
and
￿
￿
!
￿
>
￿
￿
￿
￿
￿
￿
￿.
Proof. Follows from (26) and (29).
THEOREM 4 Let the distributed and expanded dimension
> be recycled. Assume that all read-only
data are read correctly. The results computed are the same as in the non-recycled case.
Proof. It sufﬁces to show that the space for the nonlocal data is not recycled when the nonlocal
data will still be accessed. Let tile
￿
￿
be the next one to be executed by processor
￿
￿. Let
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ be a valid tile, where
￿
￿
￿
￿
￿
￿
￿
￿
such that
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
. Note that
￿
￿ is executed in the
processor
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
. We show that the space for the communication set
￿
!
￿
￿
!
￿ will not be recycled
before tile
￿
￿
is executed. Let
￿
￿
￿
be a tile such that (a)
￿
￿
￿
￿
￿
￿
, (b)
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
and (c)
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
. By Lemma 3, it sufﬁces to show that the communication set
￿
!
￿
￿
￿
!
￿ will not be
stored where the communication set
￿
!
￿
￿
!
￿ is. When (a) – (c) are true, the ﬁrst
￿ entries of
￿
￿
￿
must be
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. Since the
>-th dimension is distributed, where
￿
?
>
?
￿,
the spaces for storing both
￿
!
￿
￿
￿
!
￿ and
￿
!
￿
￿
!
￿ cannot overlap according to Lemma 3.
29A non-distributed expanded dimension can also be recycled if it is the only non-distributed di-
mension or if the data dependences in the program exhibit some desired patterns. In this case, the
local indices along the recycled non-distributed dimension
> can be folded as follows
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
8
￿
￿
￿
￿
￿
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
￿
￿
￿
￿
8
￿
C
￿
￿
￿
￿
8
￿
In addition, the data produced in a tile along the
>-th dimension must be copied to where the read-
only data are stored in the
>-th dimension.
Memory recycling is signiﬁcant because often the expanded dimension represents time steps.
By applying memory recycling, the size of the local array along the time dimension depends on the
size of tiles in that dimension not the number of time steps the program is executed.
The performance results of the two memory optimisations are presented in Figure 22(b).
9 Experimental Results
The objectives of our experiments are twofold. One is to evaluate all proposed compiler techniques
and the other is to identify some performance-critical factors for running tiled code on distributed
memory machines. Therefore, SOR serves as a good example to achieve the two objectives. All
experiments were carried out on a Fujitsu AP1000 consisting of 128 SPARC cells each consisting
of a SPARC processor running at 25MHz with 16MB RAM. All processors are connected by a
2D torus network called the T-net. The T-net provides the interprocessor communication using the
wormhole routing with a 25MB/sec of bandwidth. All I/O is performed on the host.
We start with the following ﬁve-point SOR program:
#
$
%
&
￿
(
￿
*
￿
+
￿
3
￿
*
￿
++
-
#
$
%
&
’
(
￿
*
’
+
￿
*
’
++
-
#
$
%
&
￿
(
￿
*
￿
+
￿
*
￿
++
-
￿
&
’
5
￿
-
(
￿
￿
&
￿
&
’
3
￿
5
￿
-
￿
￿
&
’
5
￿
3
￿
-
￿
￿
&
’
￿
￿
5
￿
-
￿
￿
&
’
5
￿
￿
￿
-
-
￿
&
￿
3
￿
-
￿
&
’
5
￿
-
(30)
where
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ is a 2D array. The dependences for the program are:
D
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@, where
￿
￿
￿
￿
￿
￿
￿ is caused due to a self output de-
pendence on
￿
￿
￿
￿
￿
￿. Next, we eliminate this output dependence by expanding
￿ into a 3D array
A
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ as follows:
#
$
%
&
￿
(
￿
*
￿
+
￿
*
￿
++
-
#
$
%
&
’
(
￿
*
’
+
￿
*
’
++
-
#
$
%
&
￿
(
￿
*
￿
+
￿
*
￿
++
-
/
&
￿
5
’
5
￿
-
(
￿
￿
&
/
&
￿
5
’
3
￿
5
￿
-
￿
/
&
￿
5
’
5
￿
3
￿
-
￿
/
&
￿
3
￿
5
’
￿
￿
5
￿
-
￿
/
&
￿
3
￿
5
’
5
￿
￿
￿
-
-
￿
&
￿
3
￿
-
/
&
￿
3
￿
5
’
5
￿
-
The read-only data of
A are initialised as follows:
A
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
(31)
30RECYCLING NO RECYCLING
dim 1 (
￿
￿)
￿
￿
￿
￿
&
￿
￿
￿
￿
-
￿
￿
￿
￿
%
￿
￿
￿
￿
3
￿
FOLDING NO FOLDING
dim 2 (
￿
￿)
&
￿
￿
￿
￿
-
￿
￿
￿
%
￿
%
￿
￿
￿
￿
3
￿
&
￿
￿
￿
￿
-
￿
￿
￿
￿
:
￿
%
￿
￿
￿
￿
3
￿
dim 3 (
￿
￿
)
￿
￿
￿
￿
￿
%
￿
%
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
:
￿
%
￿
￿
￿
￿
Figure 21: Declaration of local array
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿.
The results of the computations can be found in
A
￿
￿
￿
￿
￿
￿
￿.
We apply
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ to skew the expanded program to make all dependences nonnegative:
￿
$
￿
￿
(
￿
5
￿
￿
$
’
￿
(
￿
￿
￿
￿
5
￿
￿
￿
￿
￿
$
￿
￿
(
￿
￿
￿
￿
￿
5
￿
￿
￿
￿
￿
￿
(
￿
￿
5
’
(
3
￿
￿
￿
’
￿
5
￿
(
3
￿
￿
￿
￿
￿
￿
*
/
&
￿
5
’
5
￿
-
(
￿
￿
&
/
&
￿
5
’
3
￿
5
￿
-
￿
/
&
￿
5
’
5
￿
3
￿
-
￿
/
&
￿
3
￿
5
’
￿
￿
5
￿
-
￿
/
&
￿
3
￿
5
’
5
￿
￿
￿
-
-
￿
&
￿
3
￿
-
/
&
￿
3
￿
5
’
5
￿
-
The dependence set becomes:
D
￿
￿
<
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
@. To make all
dependences nonnegative, it is sufﬁcient to replace the 2 in the skewing matrix with 1. However,
skewing the inner two loops with different skewing factors creates more irregular iteration spaces so
that the performance of our compiler techniques can be tested more thoroughly.
At this point, we calculate that
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿,
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ and
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿. The skewed iteration space is tiled with rectangles of size
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ and the tiled
code is obtained per Section 2.3. The ﬁrst two dimensions of the three tile loops are distributed
cyclically to a 2D processor mesh. The SPMD code is then generated following Figure 9. Since the
skewed iteration space is no longer rectangular — one of the main reasons why SOR is used in our
experiments, the general message-passing code for arbitrary iteration spaces in Figure 5 is adopted.
The message-passing code for I/O is generated as discussed at the beginning of Section 5.
The speedup of an SPMD program is measured as:
￿
￿
￿
￿
￿
￿
￿
where
￿
￿ is the execution time of the (untiled) sequential program given in (30) on a single AP1000
processor and
￿
￿ is the execution time of the SPMD program on
￿ AP1000 processors.
Our experimental results are given in Figure 22. We describe the rationale behind each experi-
ment and draw conclusions where appropriate.
Tiling Cost If a loop nest is skewed before being tiled, more complex loop bounds in the skewed
code introduce into the tiled code even more complex loop bounds involving
￿
￿
￿ and
￿
G
￿.
31To measure its effect on the execution time, we have designed a test case, presented in Fig-
ure 22(a), that achieves the perfect balance such that each processor is allocated roughly the
same number of tiles containing exactly
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
iterations. To reduce the impact
of boundary tiles and due to the memory limit, we have used only
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
processors. The two memory optimisations are not used so that their interference is avoided.
In this test case, the sequential execution time is
￿
￿
￿
￿
￿
￿
￿
￿ secs. To measure the time spent
only on the computations, we commented out the communication code corresponding to the
two middle message-passing code sections in Figure 4, assuming ideally 0 as the communi-
cation cost. The parallel execution time on 25 processors is measured to be
￿
￿
￿
￿
￿
￿
￿
￿ secs.
This imposes an upper limit
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
on the best speedup possible.
While the tiled code can be optimised by such optimisations as common subexpression e-
limination and loop invariant motion, it may be worthwhile considering special-purpose tech-
niques for optimising the tiled code.
Communication Cost Our communication scheme is reasonably efﬁcient since it combines small
messages into large messages, which encompasses the three classic optimisations: message
vectorisation, coalescing and aggregation [19]. Consider the test case in Figure 22(a) we
analysedabove,whereallprocessorsaregivenexactly
￿
￿
￿
￿
￿
￿
￿
iterationsand
￿
￿
￿
￿
￿
￿
￿
￿ secs.
The parallel execution time is
￿
￿
￿
￿
￿
￿
￿
￿ secs. This translates to a speedup of
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
on 25 processors.
Validity Test Due to the cyclic computation and data distribution, a processor is required to test
at run time whether some tiles are valid or not (line 2 of the send code and lines 2 – 4 of
the receive code in Figure 5). By applying the optimisations discussed in Section 5.2, the
time elapsed on this part of the program is within 3% of the total execution time among all
experiments we have conducted and is thus negligible.
Memory Optimisations The time dimension is distributed and can be recycled according to The-
orem 4. To fold the other two dimensions, note that the number of iterations along each of
the two dimensions is
￿
￿
￿
￿
￿
￿
￿
according to (28). This leads to Figure 21. Address
folding is very useful for skewed iteration space. In the case when
￿
￿
￿
, for example,
the local memory allocated is reduced to one sixth. Memory recycling is crucial for expand-
ed arrays. By recycling the time dimension, the size of the local array in that dimension is
￿
￿
￿
C
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
, which is independent of
￿
, the number of time steps used. However,
future research is still needed to reduce the memory cost even further.
In our SPMD mode of execution, all read-only data are initially received before the tiles are
executed, When a dimension is recycled, the locations where the read-only data are supposed
to be stored permanently can be recycled, too. Therefore, these read-only data are saved in a
separate array and then used to provide the required initialisation. In the SOR example, the
read-only data are speciﬁed in (31).
As shown in Figure 22(b), the memory space saved by the two optimisations is achieved at the
expense of CPU cycles. While further optimisations may be sought for division and modulo
operations, the overhead is expected to be small for large computation tasks.
32￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
T
i
m
e
(
s
e
c
s
)
Processor (IDs)
comp
recv
send
M = N = 200
P1 = P2 = 5
B1 = 10, B2 = 20
B3 = 20
Neither
R
F
R + F
M = 100, N = 600
P1 = P2 = 10
B1 = B2 = 10
S
p
e
e
d
u
p
Tile Size B3
5 10 30 60
|
|
|
|
|
|
|
|
|
|
0
5
10
15
20
25
30
35
40
45
50
(a) Impact of tiling and comm cost (b) Impact of memory optimisations
M = N = 600
P1 = P2 = 10
B1 x B2 x B3 = 2560
R + F
S
p
e
e
d
u
p
Tile shape
8x8x40 40x8x8 8x40x8 4x8x80 16x16x10 16x32x5
|
|
|
|
|
|
|
|
0
5
10
15
20
25
30
35
40
M = N = 600
P1 x P2 = 128
B1 = B2 = 8, B3 = 40
R + F
S
p
e
e
d
u
p
Processor layout
2x64 64x2 4x32 32x4 8x16 16x8
|
|
|
|
|
|
|
|
0
5
10
15
20
25
30
35
40
(c) Impact of tile shape on performance (d) Impact of processor layout on performance
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿ 0
5
10
15
20
25
30
35
40
45
T
i
m
e
(
s
e
c
s
)
Processor (IDs)
comp
recv
send
￿
￿
￿
0 32 64 96
0
5
10
15
20
25
30
35
40
45
T
i
m
e
(
s
e
c
s
)
Processor (IDs)
comp
recv
send
(e) Case
￿
￿
￿
￿
￿
￿ in (c) (f) Case
￿
￿
￿
￿ in (d)
0 20 40 60 80 100
0
5
10
15
20
25
30
35
40
45
T
i
m
e
(
s
e
c
s
)
Processor (IDs)
comp
recv
send
￿
￿
￿
0 32 64 96
0
5
10
15
20
25
30
35
40
45
T
i
m
e
(
s
e
c
s
)
Processor (IDs)
comp
recv
send
(g) Case
￿
￿
￿
￿
￿
￿ in (c) (h) Case
￿
￿
￿
￿ in (d)
Figure 22: Experimental results of SOR on AP1000.
33Granularity (i.e., Computation and Communication Overlap) Figure 22(b) also serves to illus-
trate the importance of reducing communication overhead by overlapping computation and
communication. In all four test cases, the tile sizes
￿
￿ and
￿
￿ in the two distributed dimen-
sions are ﬁxed, implying that each processor has the same workload. By adjusting the tile
size
￿
￿
in the non-distributed dimension, we change the granularity of tiles. The performance
varies mainly due to the degree at which computation and communication is overlapped. The
smaller the
￿
￿
is, the larger the parallelism is but the larger the communication cost will be.
The converse is true as
￿
￿
is gradually increased. Previous work on determining tile sizes
[26, 29, 30, 38] usually assumes rectangular iteration spaces with nonnegative dependence
vectors and does not consider a concrete computation distribution on a speciﬁc machine.
Load Balance Poor performance usually results if the processors do not have balanced workload.
We ran the program by ﬁxing the granularity of a tile at
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
and the
processor layout at
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
. We change the workload of a processor by adjusting
the aspect ratio of a tile, which affects the performance as shown in Figure 22(c). The best
speedup among the six test cases is obtained when all processors compute roughly the same
number of iterations as shown in Figure 22(e). In the worst case, as displayed in Figure 22(g),
the ﬁrst 50 processors are assigned about twice as many iterations as the other 50 processors
due to the tile size
￿
￿
￿
￿
￿
used.
Processor Assignment and Parallelism On the other hand, we are not guaranteed to achieve good
performance even if all processors are assigned the same workload. Figure 22(d) indicates the
sensitivity of the performance to the processor layout given a ﬁxed number of processors. In
both the best and worst of the six test cases displayed in Figures 22(f) and (h), respectively,
all processors are assigned roughly the same number of iterations. However, in the worst case
shown in Figure 22(h), all processors spend about 80% of the total run time waiting at the
receiving statement. But the speedup more than doubles when the two processor dimensions
are swapped (see the second bar in Figure 22(d)). This is because the second distributed di-
mension is swept in the second tile loop in each processor. When this dimension is distributed
to
￿
￿ processors, the ﬁrst few processors are given
￿
￿
￿
%
￿
%
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
tiles along that di-
mension. These few processors cannot execute any tiles along the ﬁrst dimension unless they
have executed these two tiles. This has almost serialised the execution of the entire program.
10 Related Work
Given a tiling transformation, this paper presents compiler techniques for generating a SPMD pro-
gram to execute a rectangularly tiled iteration space. Some closely related work is reviewed below.
Previous work on tiling includes: improving the performance of a memory hierarchy [6, 7,
23, 25, 35, 40], determining the sizes and shapes of tile to minimise communication overhead on
distributedmemorymachines [10, 29, 30, 33, 38]and determiningthetilesizeto minimiseexecution
time on distributed memory machines [1, 5, 26]. To integrate our techniques with a data-parallel
compiler, compiler techniques for selecting a tiling transformation for commonly occurring iteration
space shapes and for selecting a processor layout remain to be developed. We are not aware of
any work studying the impact of processor layout on the performance of tiled code. Almost all
34previous work on time-minimal tilings focuses on ﬁnding good rectangular tiles for rectangular
iteration spaces without taking cache optimisation on a single processor into account [1, 5, 26]. The
problem of selecting a tiling that minimises the total execution time by considering many factors
simultaneously such as communication overhead and cache performance is difﬁcult. Some initial
attempt can be found in [25].
Methods for removing anti and output dependences and for transforming programs into single
assignment form are many: array expansion [14], node splitting [27], array privatisation [17] and
others[4]. Recently, apartialarray expansiontechniqueisproposedtoreducetheamountofmemory
usage [24]. Removing anti and output dependences for SOR-like programs with an outermost time
loop is simple.
General techniques for compiling data-parallel languages can be found in [2, 19] and the survey
paper [11]. In compiling HPF-like data parallel languages, the data distribution is usually speciﬁed
by the programmer. Essentially, the data space of an array is ﬁrst tiled by rectangular tiles and
the tiles are then distributed cyclically to the processors for some dimensions of the data space and
collapsed to the same processor in the other dimensions. After the data distribution is performed,
the computation distribution is inferred by the compiler. In compiling tiled code, these two phases
are reversed in this paper. When a tile is treated as an atomic unit of computation, the computation
distributionis partially speciﬁed. Therefore, the approach discussed in this paper ﬁrst determines the
computation distributionand then infers the data distribution. When a rectangular tiling is applied to
a skewed iteration space, the data space of an array can be viewed as being tiled by parallelepipeds
which are not necessarily rectangles. Since the program has single-assignment semantics after array
expansion, the owner-computes rule is still enforced.
After both data distribution and computation distribution are performed, message-passing code
must be generated. The problem of deriving communication sets is addressed in [8, 18, 21, 34]. In
[11], three approaches for enumerating regular communication sets are identiﬁed: array sections,
ﬁnite state machines, and polyhedra and lattices. This paper manipulates polyhedra represented by
afﬁne constraints and generates a set of loops to enumerate a message approximated by an array
section. Our communication scheme naturally encompasses such optimisation techniques as mes-
sage vectorisation, coalescing and aggregation for optimising communication as discussed in [19].
Several storage schemes for managing nonlocal data, such as buffers, hash tables, software paging
and overlap areas are discussed in [11, 15]. This paper uses an extended concept of overlap areas to
store the nonlocal data.
Our earlier paper [33] addresses the communication code generation for tiled code in the spe-
cial case when the tiled iteration spaces are rectangular. Our technical report [41] contains some
preliminary ideas of address folding and memory recycling. Our previous results on measuring the
performance of AP1000 can be found in [9, 33].
The Omega Calculator [28] has been particularly useful for generating packing and unpacking
code given a set of inequalities described using Presburger formulas.
11 Conclusions
In this paper, we have presented compiler techniques for generating an SPMD program for efﬁcient
execution on a distributed memory machine given a tiled iteration space. We use a cyclic compu-
35tation distribution to allocate tiles to processors to reduce load imbalance and overlap computation
and communication. We use the computer-owns rule to derive the data distribution from the com-
putation distribution. We provide techniques for generating all required communication code. In
particular, we have presented efﬁcient message-passing code for both arbitrary and rectangular it-
eration spaces. We introduce a storage scheme for managing both the local and nonlocal data. By
extending the concept of overlap areas slightly, our SPMD code exhibits high locality of references
despite of any sparse array references used in the program. We have developed two memory optimi-
sations to reduce the amount of memory usage for skewed iteration spaces and expanded arrays. We
provideall the require translation functions between global and local indices. We present a complete
SPMD code that a processor executes in its own local address space. By maintaining both global
and local loop variables, we have avoided expensive runtime translation between and global local
indices.
We have evaluated and validated all compiler techniques in a Fujitsu AP1000. Our performance
results indicate that tiling is a useful performance-improving optimisation for distributed memory
machines and our compiler techniques can generate efﬁcient SPMD programs for tiled iteration s-
paces. We have identiﬁed several research problems that need to be further investigated. These
include the problem of generating tiled code with efﬁcient loop structure and loop bounds and the
problem of selecting tile sizes and shapes to overlap computation and communication and to min-
imise load imbalance.
While presented in the context of compiling for one single loop nest, our compiler techniques
are general enough to be extended for multiple loop nests. To integrate these techniques with a data-
parallel compiler, it is desirable to choose a tiling transformation in such a way that the implied data
distributioncorresponds to theuser-speciﬁed data distribution. Thismay beachieved in combination
withan appropriatechoiceofthetileallocationfunctiondiscussedin Section3, as oneofthereferees
suggests. This topic will be investigated in future work.
12 Acknowledgements
WearegratefultoAustralianNationalUniversityforprovidingustheaccesstotheirFujitsuAP1000.
We also wish to thanks the referees for their comments and suggestions. The ﬁrst author acknowl-
edges gracefully the support of an Australian Research Council Grant A49232251. The second
author would like to thank Professor David Padua of University of Illinois at Urbana-Champaign
for discussions during his sabbatical at Urbana-Champaign. The second author also gracefully ac-
knowledges the support of an Australian Research Council Grant A49600987.
References
[1] R. Andonov and S. Rajopadhye. Optimal orthogonal tiling of 2-D iterations. Journal of Parallel and
Distributed Computing, 45(2):159–165, Sept. 1997.
[2] S. Benkner, B. M. Chapman, and H. P. Zima. Vienna Fortran 90. In Scalable High Performance
Computing Conference, pages 51–59, Apr. 1992.
36[3] V. Bouchitt´ e, P. Boulet, A. Darte, and Y. Robert. Evaluating array expressions on massively parallel ma-
chines with communication/computation overlap. International Journal of Supercomputer Applications
and High Performance Computing, 9(3):205–219, 1995.
[4] P. Y. Calland, A. Darte, Y. Robert, and F. Vivien. On the removal of anti and output dependences.
Technical Report 96–04, Ecole Normale Sup´ erieure de Lyon, Feb. 1996.
[5] P.Y. Calland, J. Dongarra, and Y.Robert. Tiling with limited resources. In L.Thiele, J. Fortes, K.Vissers,
V.Taylor, T.Noll, and J. Teich, editors, 1997 Application Speciﬁc Systems, Architectures and Processors,
pages 229–238. IEEE Computer Society Press, 1997.
[6] S. Carr and K. Kennedy. Compiler blockability of numerical algorithms. In Supercomputing ’92, pages
114–124, Minneapolis, Minn., Nov. 1992.
[7] L. Carter, J. Ferrante, and S. Flynn Hummel. Efﬁcient parallelism via hierarchical tiling. In SIAM Conf.
on Parallel Processing for Scientiﬁc Programming, New York, 1995.
[8] Siddhartha Chatterjee, John R. Gilbert, Fred J. E. Long, Robert Schreiber, and Shung-Hua Teng. Gen-
erating local addresses and communication sets for data-parallel programs. In 4th ACM SIGPLAN Sym-
posium on Principles & Practice of Parallel Programming, pages 149–158, San Diego, Calif., May
1993.
[9] S. Chen and J. Xue. Issues of tiling double loops on distributed memory machines. In 5th Australian
Parallel and Real-Time Systems, pages 377–388, Aukland, 1998.
[10] Y.-S. Chen, S.-D. Wang, and C.-M. Wang. Tiling nested loops into maximal rectangular blocks. J. of
Parallel and Distributed Computing, 35(2):108–120, 1996.
[11] F. Coelho, C. Germain, and J. L. Pazat. State of the art in compiling HPF. In G. R. Perrin and A. Darte,
editors, Data Parallel Programming Model: Foundations, HPF Realization and Scientiﬁc Applications,
Lecture Notes in Computer Science 1132, pages 659–664. Elsevier (North-Holland), 1996.
[12] J.-F. Collard, T. Risset, and P. Feautrier. Construction of DO loops from systems of afﬁne constraints.
Technical Report 93–15, Ecole Normale Sup´ erieure de Lyon, May. 1993.
[13] A. Darte and F. Vivien. Combining retiming and scheduling techniques for loop parallelization and loop
tiling. Technical Report 96–34, Ecole Normale Sup´ erieure de Lyon, November 1996.
[14] P. Feautrier. Dataﬂow analysis for array and scalar references. International Journal of Parallel Pro-
gramming, 20(1):23–53, Feb. 1991.
[15] M. Gerndt. Updating distributed variables in local computations. Concurrency: Practice and Experi-
ence, 2(3):171–193, 1990.
[16] Gina Goff, Ken Kennedy, and Chau-Wen Tseng. Practical dependence testing. In ACM SIGPLAN’91
Conf. on Programming Language Design and Implementation, pages 15–29, Toronto,, Jun. 1991.
[17] J. Gu, Z. Li, and G. Lee. Symbolic array dataﬂow analysis for array privatization and program paral-
lelization. In Supercomputing ’95. ACM Press, 1995.
[18] S. K. S. Gupta, S. D. Kaushik, S. Mufti, S. Sharma, Chua-Huang Huang, and P. Sadayappan. On com-
piling array expressions for efﬁcient execution on distributed-memory machines. In 1993 International
Conference on Parallel Processing, volume II, pages 301–305, St. Charles, Ill., August 1993.
37[19] S. Hiranandani, K. Kennedy, and C. W.-Tseng. Compiling Fortran D for MIMD distributed memory
machines. Communications of the ACM, 35(8):66–80, Aug. 1992.
[20] F. Irigoin and R. Triolet. Supernode partitioning. In 15th Annual ACM Symposium on Principles of
Programming Languages, pages 319–329, San Diego, California., Jan. 1988.
[21] C. Koelbel. Compile-time generation of regular communication patterns. In Supercomputing ’91, pages
101–110. ACM Press, 1991.
[22] C. H. Koelbel, D. B. Loveman, R. S. Schreiber, G. L. Steele Jr., and M. E. Zosel. The High Performance
Fortran Handbook. The MIT Press, Cambridge, 1994.
[23] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked
algorithms. In 2nd International Conference on Architectural Support for Programming Languages and
Operating Systems, pages 63–74, Santa Clara, California, Apr. 1991.
[24] V. Lefebvre and P. Feautrier. Optimizing storage size for static control programs in automatic paralleliz-
ers. In 1997 European Parallel Processing Conference, 1997.
[25] N. Mitchell, L. Carter, J. Ferrante, and K. H¨ ogstedt. Quantifying the multi-level nature of tiling interac-
tions. International Journal of Parallel Programming, 26(26):641–670, 1998.
[26] H. Ohta, Y. Saito, M. Kainaga, and H. Ono. Optimal tile size adjustment in compiling for general
DOACROSS loop nests. In 1995 ACM International Conference on Supercomputing, pages 270–279.
ACM Press, 1995.
[27] D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communication
of the ACM, 29(12):1184–1201, Dec. 1986.
[28] W. Pugh. The Omega test: A fast and practical integer programming algorithm for dependence analysis.
Communications of the ACM, 35(8):102–114, Aug. 1992.
[29] J. Ramanujam and P. Sadayappan. Tiling multidimensional iteration spaces for multicomputers. Journal
of Parallel and Distributed Computing, 16(2):108–230, Oct. 1992.
[30] R. Schreiber and J. J. Dongarra. Automatic blocking of nested loops. Technical Report 90.38, RIACS,
May 1990.
[31] A. Schrijver. Theory of Linear and Integer Programming. Series in Discrete Mathematics. John Wiley
& Sons, 1986.
[32] W. Shang and Jose A. B. Fortes. Independent partitioning of algorithms with uniform dependencies.
IEEE Transaction on Computers, 41(2):190–206, Feb. 1992.
[33] P. Tang and J. N. Zigman. Reducing data communication overhead for DOACROSS loop nests. In 1994
ACM International Conference on Supercomputing, pages 44–53. ACM Press, 1994.
[34] A. Thirumalai and J. Ramanujam. Fast address sequence generation for data-parallel programs using
integer lattices. In 9th Workshop on Languages and Compilers for Parallel Computing, pages 291–301,
Aug 1996.
[35] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In ACM SIGPLAN’91 Conference on
Programming Language Design and Implementation, pages 30–44, Toronto, Ont., Jun. 1991.
38[36] M. E. Wolf and M. S. Lam. A loop transformation theory and an algorithm to maximize parallelism.
IEEE Transactions on Parallel and Distributed Systems, 2(4):452–471, Oct. 1991.
[37] M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[38] J. Xue. Communication-minimal tiling of uniform dependence loops. Journal of Parallel and Distribut-
ed Computing, 42(1):42–59, 1997.
[39] J. Xue. On tiling as a loop transformation. Parallel Processing Letters, 7(4):409–424, 1997.
[40] J. Xue and C.-H. Huang. Reuse-driven tiling for data locality. International Journal of Parallel Pro-
gramming, 26(6):671–696, 1998.
[41] J. N. Zigman and P. Tang. Implementing global address space in distributed memory machines. Tech-
nical Report TR-CS-94-10, Department of Computer Science, The Australian National University, Oct.
1994.
39