More flexible, less coherent: NATO after Lisbon by Noetzel, T & Schreer, Benjamin
1
Reducing Data Communication
Overhead for Doacross Loop
Nests
Peiyi Tang and John N. Zigman
Technical Report
ANU-TR-CS-93-16
December 23, 1993
Reducing Data Communication Overhead
for Doacross Loop Nests
Peiyi Tang and John N. Zigman
Department of Computer Science
The Australian National University
Canberra ACT 2601 Australia
December 23, 1993
Abstract
If the loop iterations of a loop nest cannot be partitioned into independent
sets, the data communication for data dependences are inevitable in order to
execute them on parallel machines. This kind of loop nests are referred to as
Doacross loop nests.
This paper is concerned with compiler algorithms for parallelizing
Doacross loop nests for distributed-memory multicomputers. We present a
method that combines loop tiling, chain-based scheduling and indirect mes-
sage passing to generate ecient message-passing parallel codes. We present
our experiment results on Fujitsu AP1000 which show that low communica-
tion overhead and high speedup for Doacross loop nests on multicomputers
can be achieved by tuning these techniques.
Keywords: Doacross loop nests, chain-based scheduling, loop tiling, in-
direct message passing, distributed-memory multicomputers
This work was supported in part by the Australian Research Council under Grant No.
A49232251 and Fujitsu Laboratories, Ltd., Japan
ii
1 Introduction
If the loop iterations of a loop nest cannot be partitioned into independent sets, the data
communication for data dependences are inevitable in order to execute the loop nest on
parallel processors. This kind of loop nests are known as Doacross loop nests.
An example of Doacross loop nest is shown in Figure 1. Although the four loops
in the loop nest can be interchanged, it can never be transformed to a loop nest with
outermost Doall loops.1 Many loop nests for solving dierential equations using the
nite dierence method are Doacross loop nests.
This paper is concerned with compiler algorithms to generate ecient parallel codes
for Doacross loop nests on multicomputers. Distributed-memory multicomputer sys-
tems such as Fujitsu AP1000 and TMC CM-5 rely on message passing (point-to-point and
aggregate) for interprocessor data communication and synchronization. The overhead of
message passing in multicomputers is large.
To generate ecient parallel codes for Doacross loop nests on multicomputers, the
time delay caused by the large overhead of message passing must be reduced. This paper
presents a method that combines chain-based scheduling [1], loop tiling [2,3] and indirect
message passing to reduce the data communication overhead.
for i1 = 0 to 127
for i2 = 0 to 127
for i3 = 0 to 127
for i4 = 0 to 127
a(i1 + 1,i2 + 1,i3 + 1,i4 + 1) =
0.25*(a(i1 + 1,i2 + 1,i3,i4 + 1) + a(i1 + 1,i2,i3 + 1,i4 + 1) +
a(i1 + 1,i2 + 1,i3,i4) + a(i1 + 1,i2,i3,i4) +
a(i1,i2 + 1,i3,i4) + a(i1 + 1,i2,i3,i4) +
a(i1,i2 + 1,i3 + 1,i4 + 1) + a(i1 + 1,i2 + 1,i3 + 1,i4 + 1) );
Figure 1. An Example of Doacross Loop Nest
The wave-front method was proposed to parallelize Doacross loop nests for vector
machines [4,5]. More recently, Wolf and Lam [3] showed that any Doacross loop nest
can be transformed to a loop nest with Doall loops enclosed in the outermost serial
Do loop. The code of the transformed loop nest generated for multicomputers is of the
following form:
1A Do loop is a Doall loop if there are no data dependences between its iterations and, therefore,
they can be executed independently in parallel.
1
do loop
doall loops
...
loop body computation;
...
enddoall;
barrier;
aggregate data communication;
enddo
Obviously, this wave-front method on multicomputers does not provide the overlap be-
tween computation and data communication. It also requires barrier synchronization to
enforce Doall loops enclosed in the outermost serial loop. The barrier synchronization
could be very expensive in large systems. The chain-based scheduling of Doacross loop
nests [1] allows the data communication to be overlapped with the computation and
eliminates the barrier synchronization.
To amortize the large startup overhead of message passing, we also incorporated loop
tiling [2,3] into the chain-based scheduling. By tuning the size of the tiles, we can control
the granularity of the parallel tasks to reduce the impact of message-passing overhead.
To further reduce the data communication overhead, we invent a novel message pack-
ing scheme called indirect message passing. The indirect message passing reduces the
number of messages without loss of parallelism of the chain-based parallel codes.
To show the eectiveness of these techniques we parallelized and coded theDoacross
loop nest in Figure 1 and run it on a Fujitsu AP1000 with dierent congurations of
processor array and loop tiling. On the 128-processor AP1000, we can get the speedup
as high as 121.9 by tuning the tile size and the shape of processor array.
In Section 2, we describe the loop transformation for loop tiling and chain-based
scheduling. In Section 3, we describe the algorithms to generate messages for data
communication. In Section 4, we present the method of indirect message passing. The
experiment results on AP1000 are presented in Section 5. Section 6 completes the paper
with conclusions.
2 Loop Tiling and Chain-Based Scheduling
In this section, we rst give the denition of Doacross loop nests. Then we present the
algorithms for loop tiling and chain-based scheduling.
2.1 Tiling Iteration Space of Doacross Loop Nests
The index set of an n-dimensional loop nest with loop steps 1 can be modeled by integer
grids within a convex polyhedron determined by the loop bounds. For example, the index
2
set of the loop nest in Figure 1 can be represented by
I4 = f~i 2 Z4jC~i  ~cg
where
C =
26666666666664
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
 1 0 0 0
0  1 0 0
0 0  1 0
0 0 0  1
37777777777775
and ~c =
26666666666664
127
127
127
127
0
0
0
0
37777777777775
In general, the loop bounds can be ane functions of index variables of the enclosing
loops and each loop except the outermost one can have more than one upper or lower
bounds. The matrix C has the row-echelon shape if the rows for the same loop are
grouped together.
There is a data dependence from index ~i1 2 I
n to ~i2 2 I
n if (1) ~i1 is lexicographically
less than ~i2, (2) a data element is accessed by both iterations, ~i1 and ~i2 with at least one
access is write operation and (3) there is no index ~i3 between ~i1 and ~i2 that writes to the
same data element. The vector ~d = ~i2   ~i1 is called distance vector . The dependence
distance vector ~d is constant if the dependence exits for every pair of ~i1; ~i2 2 I
n such that
~d = ~i2   ~i1. In this paper, we only consider the loop nests with constant dependence
distance vectors. The distance vectors form a distance matrix:
D =
h
~d1;    ; ~dm
i
where each ~dj is a distance vector. The loop nest in Figure 1 has 7 distance vectors and
it's distance matrix is
D =
26664
0 0 0 0 1 0 1
0 1 0 1 0 1 0
1 0 1 0 1 1 0
0 0 1 1 1 1 0
37775
We use the rank of distance matrix D to distinguish Doall and Doacross loop
nests.
Denition 1 (Doacross Loop Nest) A loop nest with n loops is a Doacross loop
nest if the rank of its dependence distance matrix is n; otherwise, the loop nest is a
Doall loop nest.
The rationale behind Denition 1 is that if the rank of D is less than n, the loop nest can
be transformed to have at least one outermost Doall loops. If the rank of the matrix
of distance vectors is n, it is still possible to convert it to a loop nest with outermost
Doall loops. But, the loop range of these Doall loops in this case are limited by the
distance vectors which are very small in real programs. This limited Doall parallelism
3
may not be sucient for large parallel systems. In this paper, we simply put it in the
category of Doacross loop nests.
The number of data elements generated in one iteration to be used in another iteration
of a Doacross loop nest is usually very small. Therefore, the rst step to reduce the
run-time data communication overhead is to group a number of iterations to form a larger
task and generate larger messages at the end of each task. The technique to do that is
called loop tiling , an extension of strip-mining transformation in vectorizing compilers.
The general theory for loop tiling can be found in [2]. For the purpose of this paper, the
simple rectangular tiling is sucient.
Let us use
DO(i1;    ; in)((L1; U1);    ; (Ln; Un))fBODY g
to denote a n-dimensional loop nest with Lj and Uj being the lower and upper bounds
of the j-th loop of index ij , 1  j  n. The index set of the loop nest is
In = f~i 2 ZnjC~i  ~cg
where ~i = (i1;    ; in)
2 and matrix C and vector ~c which dene the convex polyhedron
are derived from the loop bounds ((L1; U1);    ; (Ln; Un)). A tile with origin ~i0 is the
subset of the index set
T (~i0) = f~i 2 I
nj~0 ~i ~i0  ~k   ~1g
where ~k = (k1;    ; kn) is the size vector of the tiling and each kj (1  j  n) is a positive
integer. The origins ~i0 form a lattice in the index set and can be generated by an integer
matrix L:
~i0 = L~t
In the simple rectangular tiling, it is
L =
26664
k1 0    0
0 k2    0
           
0 0    kn
37775
and ~t = (t1;    ; tn) is called tile index of the corresponding tile.
Since L~t = (k1t1;    ; kntn), another form for the origin ~i0 is ~i0 = ~k  ~t. Therefore the
tile with tile index ~t, denoted by T~t, is
T~t = T (
~k  ~t) = f~i 2 Inj~k  ~t ~i  ~k  ~t + ~k   ~1g
Tile index set , denoted by T n, is the set of all tile indices whose tiles are non-empty
2All vectors in this paper are column vectors, although we use tuples to denote them in the text.
4
dened as follows:
T n = f~t 2 Znj9~i 2 In s:t: ~k  ~t ~i  ~k  ~t+ ~k   ~1g
or
T n = f~t 2 Znj9~i 2 Zn s:t: ~k  ~t ~i  ~k  ~t+ ~k   ~1 ^ C~i  ~cg
Notice that the above inequalities dene a convex polyhedron in the 2n-dimensional space
spanned by ~t and ~i. The tile index set T n is an integer programming projection onto
the n-dimensional subspace spanned by ~t. Using the row-echelon algorithm described in
[2], compilers can generate the loop bounds to scan T n accurately. Let the loop bounds
obtained to scan the tile index sets be ((TL1; TU1);    ; (TLn; TUn)).
For example, the following Doacross loop nest
DO (i1; i2) ((0, 7),(0,7))
f
a(i1; i2) = 0.25 * ( a(i1; i2) + a(i1   1; i2) + a(i1; i2   1) + a(i1   1; i2   1) )
g
Suppose we use the size vector ~k = (2; 2) = ~2 to tile the index set of this loop nest. The
tile with index ~t is
T~t = f~i 2 Z
2j~0 ~i  ~7 ^ ~2  ~t ~i  ~2  ~t + ~1g
The tile index set is
T 2 = f~t 2 Z2j~0  ~t  ~3g
Loop tiling changes the execution order of the iterations to nish all iterations in a
tile before executing the next one. The result of loop tile is the loop nest with 2n loops
as follows:
DO(t1;    ; tn)((TL1; TU1);    ; (TLn; TUn))
f
DO(i1;    ; in) ((max(L1; t1k1);min(U1; (t1 + 1)k1   1));    ;
(max(Ln; tnkn);min(Un; (tn + 1)kn   1)))
f
BODY
g
g
The results of loop tiling the example in this section is
DO(t1; t2)((0; 0); (3; 3))
f
DO(i1; i2) ((max(0; 2t1);min(7; (2t1 + 1)); (max(0; 2t2);min(7; 2t2 + 1)))
5
f
a(i1; i2) = 0.25 * ( a(i1; i2) + a(i1   1; i2) + a(i1; i2   1) + a(i1   1; i2   1) )
g
g
Figure 2 illustrates the loop tiling of the example. The small circles represent the it-
erations of the loop nest. Each shaded square in the gure represents a tile. Arrows
represents the data dependences between the loop iterations.
t2=0
t2=1
t2=2
0
4
6
2t1=0 t1=1 t1=2 t1=364
2
i2
i1
t2=3
Figure 2. Tiled Loop Nest
Since the loop tiling changes the order of execution of the iterations, it is not always
legal. However, if all the dependence distance vectors are non-negative, the rectangular
tiling does not change the original execution order between any pair of iterations related
by the dependences and, thus, is legal. If the distance vectors for the dependencies have
negative elements, we can always skew the loop index set using unimodular transforma-
tion. An algorithm to calculate the unimodular matrix to skew the iteration space to
makes the distances vectors of all dependencies non-negative can be found in [1]. In this
paper, we assume that the distance vectors of all data dependencies are non-negative.
That is, We have
8~d 2 D : ~d  ~0
Each iteration ~i in the loop index set In belongs to only one tile in the tile index set
T n. For the rectangular tiling, the total function from In onto T n can be expressed as
follows:
t : In ! T n; t(~i) =~i div ~k
where opertator div applies to the corresponding elements of ~i and ~k.
6
The purpose of loop tiling is to control the granularity of the parallel execution of
the loop nest. The computation of the loop iterations of a tile is meant to be executed
sequentially by a single processor. A processor issues send and receive commands for
data communication only between the computations. The message size is determined by
the tile size.
To execute the nested loop in parallel, we need to allocate and schedule the tiles to
parallel processors. The basic ideas in the so-called chain-based scheduling are as follows:
1. To overlap the computation and data communication, each processor needs to be
allocated sequences of tiles, called chains.
2. The chains of the tiles need to be allocated cyclically to parallel processors so that
the full bandwidth of the multicomputer network can be utilized.
In the example of this section, we can allocate tiles with the same rst tile index,
i.e.,the tiles of tile indices ~t = (t1; ), to the same processor. The chains are, then,
allocated to processors cyclically as shown in the Figure 3. The thick arrows represent
t2=0
t2=1
t2=2
0
4
6
2t1=0 t1=1 t1=2 t1=364
2
i1
t2=3
i2
PE0 PE1 PE0 PE1
Figure 3. Chain-Based Scheduling of Nested Loop
the interprocessor communication due to the data dependences.
In general, parallel processors are assumed to be organized as anm-dimensional mesh
(1  m < n). Formally, the processor space is
Pm = f~p 2 Zmj~0  ~p  ~P   ~1g
where ~P = (P1;    ; Pm) and Pj > 1 (1  j  m) is the size of the j-dimension. The
total number of processors is P1   Pm.
7
The tile allocation is a total function from tile index set T n to processor space Pm
dened as follows:
a : T n ! Pm; a(~t) = ~tj1;m mod ~P
where ~tj1;m is the subvector ~t composed of the rst m elements of ~t and operator mod
applies to the corresponding elements of ~tj1;m and ~P . In this paper, we use ~zjp;q (1  p <
q  n) to denote the subvector (zp;    ; zq) of vector ~z = (z1;    ; zn).
The tiles allocated to the same processor are executed in the lexicographic order of
the tile index ~t. The parallel code for each processor ~p 2 Pm is shown in Figure 4. For
the example of this section, we have one-dimensional processor array P 1 and the size of
the processor array is P1 = 2. The parallel code for processor p 2 f0; 1g (i.e.,PE0 or
PE1) is as follows:
DO t1 = p; 3; 2
DO t2 = 0; 3
Receive and Unpack Messages;
DO i1 = 2t1; 2t1 + 1
DO i2 = 2t2; 2t2 + 1
a(i1; i2) = 0.25 * ( a(i1; i2) + a(i1   1; i2) + a(i1; i2   1) + a(i1   1; i2   1) )
ENDDO
ENDDO
Pack and Send Messages;
ENDDO
ENDDO
The parallel code in Figure 4 also shows that whenever a processor nishes the com-
putation of a tile, it sends the data generated by it to other processors. Similarly, each
processor has to receive the data it needs before it starts the computation of a new tile.
In the next two sections, we discuss how to generate and consume these messages for
data communication.
3 Data Communication
There are two code sections left undened in Figure 4: Pack and Send Messages, and
Receive and Unpack Messages. Given a loop nest parallelized as shown in Figure 4, we
want to increase the message size and reduce the number of messages as much as possible
to amortize the large startup overhead of message passing. In this section, we answer the
following questions:
 How many messages does a processor needs to send (receive) after (before) the
computation of each tile?
 How to generate code for packing and unpacking each message?
8
DO t1 = TL1 + (p1   TL1 mod P1) mod P1, TU 1, P1
...
DO tm = TLm + (pm   TLm mod Pm) mod Pm, TUm, Pm
DO tm+1 = TLm+1, TUm+1
...
DO tn = TLn, TUn
Receive and Unpack Messages;
DO i1 = max(L1; t1k1), min(U1; t1k1 + k1   1)
...
DO in = max(Ln; tnkn), min(Un; tnkn + kn   1)
...
...
ENDDO
...
ENDDO
Pack and Send Messages;
ENDDO
...
ENDDO
ENDDO
...
ENDDO
Figure 4. Chain-Based Code for Processor ~p
9
It is well known that the output and anti data dependencies are caused by memory
reuse of the data elements in programs. In distributed-memory multicomputers, the
global arrays have to be implemented in local memories of parallel processors. If a data
element is used in two dierent processors, it must have two copies in the two processors.
In multicomputers, it makes no sense to enforce output and anti data dependences while
physical memory locations of data elements on dierent processors have to be duplicated.
In this paper, we assume that all the arrays and scalars in the loop nest are expanded
[6] and every dependence in D is a ow dependences.
To simplify the presentation, we concentrate on the ow dependences raised from a
single left-hand reference of an array, say array A and used DA  D to denote the set
of dependences concerned. We use f to denote the subscript function of the left-hand
side reference of DA. For example, the elements of array A written by this left-hand side
reference in tile ~t is ff(~i)j~i 2 T~tg. Since all arrays are expanded, array A must have the
same dimensionality as the index set and subscript function f is a one-to-one function.
In real programs, distance vectors are small non-negative integer vectors. When tiling
a loop nest, the size vector, ~k, is usually far larger than distance vectors. We assume
that
8~d 2 DA : ~k  ~d
Subsequently, we can be assured that the data produced in a tile is only used by its
neighboring tiles. Figure 5 shows the data communication for data dependence ~d = (2; 1)
between the tiles in a two-dimensional index set with tile size ~k = (4; 4). Arrows in the
gure show the ow data dependencies across tiles. The loop iterations in tile ~t = (0; 0)
producing the data to be used by its neighboring tiles (1; 0), (0; 1) and (1; 1), are marked
with W1 = W(0;0);(1;0);~d, W2 = W(0;0);(0;1);~d, and W3 = W(0;0);(1;1);~d, respectively. In
general, a tile ~t 2 T n may produce data used by its 2n  1 neighboring tiles ~t+~b , where
~b is a vector in Bn = B     B with B = f0; 1g.
In the following discussion, we rst concentrate on the data communication between
tiles for a single dependence ~d 2 DA. Then we formulate the data sets for messages
between processors. Finally, we show how to merge data sets of messages for multiple
dependences in DA.
We will focus on message packing and sending rst. The data sets for message un-
packing and receiving can be derived from those for message packing and sending.
Given a tile ~t in T n and a distance vector ~d in DA, we dene the remote write set to
the neighboring tile ~t+~b as follows:
Denition 2 (Remote Write Set) The remote write set of tile ~t to the neighboring
tile ~t + ~b with respect to distance vector ~d, denoted by W~t;~b;~d, is the set of elements of
array A written by tile ~t and read by tile ~t+~b, i.e.
W~t;~b;~d = ff(
~i) j ~i 2 T~t ^~i+
~d 2 T~t+~bg
To enable compilers to generate code to pack and send messages for data commu-
nication between tiles, we need the inequality constraints for W~t;~b;~d. From
~i 2 T~t and
10
t1
t2
0
0
1
W1
W3
1
W2
PE0 PE1
Figure 5. Remote Write Set for a dependence
~i+ ~d 2 T~t+~b, we have
~0  ~i  ~t  ~k  ~k   ~1
~0  (~i + ~d)  (~t +~b)  ~k  ~k   ~1
This leads to
max(~0;~b  ~k   ~d) ~i  ~t  ~k  ~k   ~1 +min(~0;~b  ~k   ~d)
Recall that ~b is a vector in Bn with B = f0; 1g and ~d  ~k. Based on these facts, it is not
dicult to show that the following is true:
max(~0;~b  ~k   ~d) = ~b  (~k   ~d)
min(~0;~b  ~k   ~d) = (~b  ~1)  ~d
The inequality constraints for W~t;~b;~d are summarized in the following lemma:
Lemma 1 The inequality constraints for W~t;~b;~d are as follows:
W~t;~b;~d = ff(
~i) j C~i  ~c ^ C(~i+ ~d)  ~c ^
~b  (~k   ~d) ~i  ~t  ~k  ~k   ~1  (~1 ~b)  ~dg
11
In the example of Figure 5, we have ~k = (4; 4) and ~d = (2; 1). From Lemma 1,M(0;0);(1;0);~d
is the set of data elements accessed by the iterations ~i 2 In (i.e.,ff(~i)g) such that 
2
0
!
~i 
 
3
2
!
which is exactly the same as indicated in Figure 5.
The remote write sets described above provide a basic logical structure of potential
data communication between tiles. Whether the interprocessor data communication for
them is needed depends on whether the two neighboring tiles are allocated to dierent
processors. If the tiles are allocated to the same processors, the data value can be passed
by accessing the local memory of the processor. For example, if tiles are allocated to
processors by columns as shown in Figure 5, the data communication represented by
W(0;0);(0;1);~d does not need message passing. Moreover, the messages to the tiles that are
allocated to the same processor can be merged. In the same example in Figure 5, the
remote write sets W1 =W(0;0);(1;0);~d and W3 =W(0;0);(1;1);~d can be merged.
According to the tile allocation function a in the chain-based scheduling described in
Section 2, tiles ~t and ~t +~b are allocated to the same processor if and only if ~bj1;m = ~0
3.
That is,
a(~t) = a(~t+~b); i ~bj1;m = ~0 (1)
Consider the remote write sets W~t;~b1;~d and W~t;~b2;~d of tile
~t to its two neighboring tiles
~t +~b1 and ~t +~b2. If ~b1j1;m = ~b2j1;m, the tiles ~t +~b1 and ~t +~b2 are allocated to the same
processor:
a(~t +~b1) = a(~t +~b2)
The messages W~t;~b1;~d and W~t;~b2;~d, therefore, can be merged.
Now we can dene the messages for data communication between processors based
on the above observation.
Denition 3 (Send Set) Given a distance vector ~d and a vector ~v 2 Bm, the send set
of tile ~t to its neighboring processor a(~t+~v), denoted by M s~t;~v;~d, is the union of the remote
write sets to the neighboring tiles allocated to the same processor:
M s~t;~v;~d =
[
~bj1;m=~v
W~t;~b;~d
Note that vector ~v in the above denition is anm-vector from Bm wherem is the number
of dimensions of the processor array. When ~v is added to the n-vector ~t, it is promoted
to an n-vector by appending extra n m 0s to itself.
There are 2m   1 send sets to the 2m   1 neighboring processors for each tile. To
obtain the inequality constraints for M s~t;~v;~d, consider the constraints in Lemma 1 on the
3In general, a(~t) = a(~t + ~y) if and only if ~yj1;m mod ~P = ~0. Since we assumed ~P > ~1, we have
~bj1;m mod ~P = ~bj1;m
12
j-th dimension with m+ 1  j  n. If bj = 1, we have
kj   dj  ij   tjkj  kj   1
If bj = 0, we have
0  ij   tjkj  kj   dj   1
Merging these two cases gives the following inequality:
0  ij   tjkj  kj   1
The inequality constraints for M s
~t;~v;~d
, thus, can be summarized in the following theo-
rem:
Theorem 1 The inequality constraints for send set M s
~t;~v;~d
is as follows:
M s~t;~v;~d = ff(
~i) j C~i  ~c ^ C(~i+ ~d)  ~c ^
~v  (~kj1;m   ~dj1;m) ~ij1;m   ~tj1;m  ~kj1;m  ~kj1;m   ~1  (~1  ~v)  ~dj1;m ^
~0 ~ijm+1;n   ~tjm+1;n  ~kjm+1;n  ~kjm+1;n   ~1g
We have described the data sets of communication for a single dependence in DA.
The data sets of multiple dependences in DA should be further merged because all the
data related to dependences in DA are from the same left-hand reference. The merged
message of tile ~t for all dependences in DA to its neighboring processors a(~t+~v), denoted
by M s~t;~v, can be dened as follows:
M s~t;~v =
[
~d2DA
M s~t;~v;~d
Let ~dmax and ~dmin be the maximum and minimum distance vectors dened as follows:
~dmax = max
~d2DA
(~d)
~dmin = min
~d2DA
(~d)
We have
~dmin  ~d  ~dmax
for all ~d in DA.
The inequality constraints for exact M s~t;~v can be complicated. Instead, we dene a
superset of it, denoted as cM s~t;~v, by taking the minimum of the lower bounds and the
maximum of the upper bounds in each dimension as follows:
cM s~t;~v = ff(~i) j C~i  ~c ^ C(~i+ ~d)  ~c ^
13
~v  (~kj1;m   ~d
maxj1;m) ~ij1;m   ~tj1;m  ~kj1;m  ~kj1;m   ~1  (~1  ~v)  ~d
minj1;m ^
~0 ~ijm+1;n   ~tjm+1;n  ~kjm+1;n  ~kjm+1;n   ~1g
It is not dicult to prove the following theorem:
Theorem 2 M s~t;~v is a subset of
cM s~t;~v, i.e.,M s~t;~v  cM s~t;~v.
Compilers use cM s~t;~v to generate code to pack and send messages to neighboring 2m 1
processors.
The code generated for message packing and sending for all dependences in DA are
as follows:
begin
for ~v in Bm do
pack a message according to the inequalities for cM s~t;~v;
send the message to processor a(~t+ ~v);
endfor;
end
This code is to be put in the code section \Pack and Sending Messages" in Figure 4.
Since the processor executing ~t ~v sends message cM s~t ~v;~v to its neighboring processor
a(~t), the later should receive the same data set and unpack it before the computation of
~t. Given a tile ~t, let the message to be received from the neighboring processor a(~t   ~v)
(~v 2 Bm) be denoted by cM r~t;~v. Obviously, we have
cM r~t;~v = cM s~t ~v;~v
By substituting ~t  ~v for ~t in the inequality constraints for cM s~t;~v, we can have
cM r~t;~v = cM s~t ~v;~v = ff(~i) j C~i  ~c ^ C(~i+ ~d)  ~c ^
 ~v  ~dmaxj1;m) ~ij1;m   ~tj1;m  ~kj1;m  (~1  ~v)  (~kj1;m   ~d
minj1;m)  ~1 ^
~0 ~ijm+1;n   ~tjm+1;n  ~kjm+1;n  ~kjm+1;n   ~1g
The code to be put in the code section \Receive and Unpack Messages" in Figure 4
should be as follows:
14
begin
for ~v in Bm do
receive a message from processor a(~t   ~v);
unpack the message to the memory according to cM r~t;~v;
endfor;
end
The messages passing described in this section is called direct message passing .
4 Indirect Message Passing
We have shown that for each dependence, a tile may have 2m  1 messages to send to its
neighboring processors, where m is the number of dimension of the processor array. We
have also provided the inequality constraints for compiler to generate codes to pack and
send these messages.
In this section, we present a scheme to further reduce the number of messages between
processors. The scheme is called indirect message passing which reduces the number of
messages from 2m   1 to m.
t1
t2
10
0
1
Figure 6. Indirect Message Passing
We use a simple example to illustrate the idea rst. Suppose that we have a 3-
dimensional loop index set (n = 3) and a 2-dimensional processor array (m = 2). Assume
that the size of the loop index set is 8x8x8. The tile size vector is ~k = (4; 4; 4). Therefore,
the size of the tile index set is 2x2x2. Assume that the size vector of the processor array
15
is ~P = (2; 2). According to the allocation function, each processor is allocated one
chain. For instance, processor (0,0) is allocated chain (0,0) and chain (0,0) consists
of tile (0,0,0) and tile (0,0,1). Figure 6 shows the top view of the 3-dimensional loop
index set after tiling. The square boxes show the rst tiles of the four chains. The
circles in the square boxes represent the loop iterations. Assume that the dependence
vector is ~d = (2; 1; 0). Therefore, tile (0,0,0) has three messages to send: M s
(0;0;0);(1;0);~d
to
processor (1,0), M s
(0;0;0);(0;1);~d
to processor (0,1) and M s
(0;0;0);(1;1);~d
to processor (1,1). All
these messages are illustrated by dashed rectangular boxes and arrows in Figure 6. The
idea is to merge M s
(0;0;0);(1;1);~d
with M s
(0;0;0);(1;0);~d
to be sent to processor (1,0) rst. Then,
M s
(0;0;0);(1;1);~d
is merged with M s
(1;0;0);(0;1);~d
and sent to processor (1,1) by processor (1,0).
The direct passing of M s
(0;0;0);(1;1);~d
from processor (0,0) to processor (1,1) is eliminated.
Note that this does not cause any loss of parallelism because tile (1,1,0) will not start
until it receives M s
(1;0;0);(1;0);~d
from processor (1,0). The merged messages are illustrated
by the solid rectangular boxes and arrows in Figure 6.
To simplify the presentation of indirect message passing in this section, we use the
data sets to be received for each tile in direct message passing in the previous section.
The receive set for a tile ~t from its neighboring processors a(~t   ~v) is dened as follows:
Denition 4 (Receive Set) Given a distance vector ~d 2 DA and a vector ~v 2 B
m, the
receive set of tile ~t from its neighboring processor a(~t  ~v), denoted by M r~t;~v;~d, is the send
set by tile ~t  ~v from processor a(~t   ~v) to processor a(~t). That is,
M r~t;~v;~d =M
s
~t ~v;~v;~d
Note that the m-vector ~v is promoted to an n-vector with n m 0s appended when
it is subtracted from the n-vector ~t in the above notations.
By substituting ~t   ~v for ~t in the inequality constraints in Theorem 1, we can have
the inequality constraints for M r~t;~b;~d summarized in the following theorem:
Theorem 3 The inequality constraints for receive set M r~t;~v;~d are as follows:
M r~t;~v;~d = ff(
~i) j C~i  ~c ^ C(~i  ~d)  ~c ^
 ~v  ~dj1;m ~ij1;m   ~tj1;m  ~kj1;m  (~1  ~v)  (~kj1;m   ~dj1;m)  ~1 ^
 ~djm+1;n ~ijm+1;n   ~tjm+1;n  ~kjm+1;n  ~kjm+1;n   ~djm+1;n   ~1g
4.1 Message Routing and Merging
Given a tile ~t and a dependence ~d 2 DA, using the direct message passing described in
Section 3 the processor executing ~t is supposed to receive 2m   1 messages M r~t;~v;~d from
its neighboring processors a(~t   ~v). We distinguish those messages whose ~v 2 Bm are
elementary vectors and call them elementary messages:
16
Denition 5 (Elementary Message) A message represented by a receive set, M r
~t;~v;~d
,
is elementary if ~v = ~ej (1  j  m), where ~ej is an elementary vector in B
m, i.e.,~e1 =
(1; 0;    ; 0), ~e2 = (0; 1;    ; 0) and ~em = (0; 0;    ; 1).
The messages represented by M r~t;~v;~d that are not elementary are called forward mes-
sages.
In the indirect message passing scheme, only elementary messages are directly passed
between processors. The forward messages are merged with elementary messages and
routed to the destination processor indirectly.
Consider a forward message M r
~t;~v;~d
from tile ~t   ~v at processor a(~t   ~v) to tile ~t at
processor a(~t). Assume that ~v = ~ej1 +   +~ejr where 2  r  m and 1  j1 <    < jr 
m. The message is to be forwarded to processor a(~t   ~v + ~ej1) rst, then to processor
a(~t ~v+~ej1+~ej2) and so on. After r steps, the message reaches the destination processor
a(~t   ~v + ~ej1 + ~ej2 +    + ~ejr).
In each of these steps, the forward messageM r
~t;~v;~d
is merged with one of the elementary
messages along the route. At step 1, it is merged with M r~t ~v+~ej1 ;~ej1 ;~d
. At step 2, it is
merged with M r
~t ~v+~ej1+~ej2 ;~ej2 ;
~d
and so on.
From the inequality constraints from Theorem 3, it is not dicult to see that for the
forward message M r
~t;~v;~d
with ~v = ~ej1 +    + ~ejr to be non-empty, elements dj1;    ; djr
of ~d must be all greater than 0, i.e.,djk > 0 (k = 1;    ; r). For the same reason, all
the elementary messages in the above algorithms, namely M r~t ~v+~ej1 ;~ej1 ;~d
, M r~t ~v+~ej1+~ej2 ;~ej2 ;~d
and so on, are all non-empty. This means that forward messages are always merged with
existing elementary messages and no extra messages are introduced.
4.2 Merged Receive and Send Sets
For a given tile ~t and a dependence vector ~d, the elementary message along the q-th
dimension merged with forward messages consists of three parts:
1. The q-th elementary message itselfM r~t;~eq ;~d
from processor a(~t ~eq) to processor a(~t).
2. The forward messages that reach the processor a(~t) as their nal destination. Let
the forward message be M r~t;~v;~d. From the routing mechanism described above, we
must have:
 vq = 1;
 ~vj1;q 1 6= ~0;
 ~vjq+1;m = ~0
3. The forward messages that pass processor a(~t) as one of the intermediate processors.
Let the forward message be from tile ~t1 at processor a(~t1) to tile ~t2 = ~t1 + ~v at
processor a(~t2). According to the routing mechanism, we have ~t1 + ~vj1;q = ~t and
~t+ ~vjq+1;m = ~t2. Since a(~t) 6= a(~t + ~vjq+1;m) and hence, ~t 6= ~t2, we have:
17
 vq = 1;
 ~vjq+1;m 6= ~0
The union of all the messages in the rst and second parts can be expressed as[
vq=1^~vjq+1;m=~0
M r~t;~v;~d (2)
The union of the messages of the third part can be expressed as[
vq=1^~vjq+1;m 6=~0
M r~t+~vjq+1;m;~v;~d (3)
Because the quantifying condition ~vjq+1;m = ~0 in (2), the union of all the messages in
parts 1 and 2 can be rewritten in a form similar to that of (3) as follows:[
vq=1^~vjq+1;m=~0
M r~t+~vjq+1;m;~v;~d (4)
Unifying expressions 3 and 4 gives rise to a single expression for the entire data set of
the merged elementary message along the q-th dimension:
Theorem 4 (Merged Receive Message of q   th Dimension) Given a dependence
~d, the merged receive set for tile ~t from the neighboring processor a(~t   ~eq)(1  q  m),
denoted as IM r~t;q;~d, can be expressed as follows:
IM r~t;q;~d =
[
~eq~v~1
M r~t+~vjq+1;m;~v;~d
To obtain the inequality constraints for IM r~t;q;~d, we use the inequality constraints for
M r
~t;~v;~d
in Theorem 3. We only need to consider the inequalities for the 1-st up to m-th
dimensions. Consider the j-th dimension where 1  j < q rst. If vj = 1, the constraint
is
 dj  ij   tjkj   1
If vj = 0, we have
0  ij   tjkj  kj   dj   1
Merging them gives rise to the following constraint:
 dj  ij   tjkj  kj   dj   1
Now consider j-th dimension for q  j  m. If vj = 1, we have
 dj  ij   (tj + 1)kj   1
18
which can be transformed to
kj   dj  ij   tjkj  kj   1
If vj = 0, we have
0  ij   (tj + 0)kj  kj   dj   1
Merging them gives rise to the following constraint:
0  ij   tjkj  kj   1
For the q-th dimension, we have
 dq  iq   tqkq   1
Putting it all together, we have the following inequality constraints for the merged
receive set:
Theorem 5 The inequality constraints for the merged receive set IM r~t;q;~d is as follows:
IM r~t;q;~d = ff(
~i) j C~i  ~c ^ C(~i+ ~d)  ~c ^
 ~dj1;q 1 ~ij1;q 1   ~tj1;q 1  ~kj1;q 1  ~kj1;q 1   ~dj1;q 1   ~1 ^
 dq  iq   tqkq   1 ^
~0 ~ijq+1;n   ~tjq+1;n  ~kjq+1;n  ~kjq+1;n   ~1 ^
 ~djm+1;n ~ijm+1;n   ~tjm+1;n  ~kjm+1;n  ~kjm+1;n   ~djm+1;n   ~1g
Using the same technique in Section 3 for merging IM r~t;q;~d for all
~d in DA, we can
have the merged receive set for DA as follows:
dIM r~t;q = ff(~i) j C~i  ~c ^ C(~i+ ~d)  ~c ^
 ~dmaxj1;q 1 ~ij1;q 1   ~tj1;q 1  ~kj1;q 1  ~kj1;q 1   ~d
minj1;q 1   ~1 ^
 dmaxq  iq   tqkq   1 ^
~0 ~ijq+1;n   ~tjq+1;n  ~kjq+1;n  ~kjq+1;n   ~1 ^
 ~djm+1;n ~ijm+1;n   ~tjm+1;n  ~kjm+1;n  ~kjm+1;n   ~djm+1;n   ~1g
We can also easily prove that [
~d2DA
IM r~t;q;~d 
dIM r~t;q
The inequalities for IM r~t;q;~d are used to receive and unpack the merged message for
DA from the neighboring processors a(~t   ~eq). The code to be generated for the section
\Receive and Unpack Messages" in Figure 4 is
19
begin
for q = 1 to m do
receive the merged message from processor a(~t   ~eq);
unpack the message to the memory according to dIM r~t;q;
endfor;
end
Similarly, we use dIM s~t;q to denote the merged message for DA to send to the neigh-
boring processor a(~t + ~eq). Obviously, we have
dIM s~t;q = dIM r~t+~eq;q
The inequality constraints for dIM s~t;q are
dIM s~t;q = ff(~i) j C~i  ~c ^ C(~i+ ~d)  ~c ^
 ~dmaxj1;q 1 ~ij1;q 1   ~tj1;q 1  ~kj1;q 1  ~kj1;q 1   ~d
minj1;q 1   ~1 ^
kq   d
max
q  iq   tqkq  kq   1 ^
~0 ~ijq+1;n   ~tjq+1;n  ~kjq+1;n  ~kjq+1;n   ~1 ^
 ~djm+1;n ~ijm+1;n   ~tjm+1;n  ~kjm+1;n  ~kjm+1;n   ~djm+1;n   ~1g
The compiler-generated code to be put in the code section \Pack and Send Messages"
in Figure 4 becomes
begin
for q = 1 to m do
pack the merged message according to dIM s~t;q;
send the message to processor a(~t + ~eq);
endfor;
end
5 Experiment Results on AP1000
We have presented the mechanism to form large messages to amortize the startup over-
head of message passing for nested Doacross loops executed on multicomputers. The
basic techniques are loop tiling, chain-based scheduling (described in Sections 2 and 3)
and indirect message passing (described in Section 4).
20
To investigate the eectiveness of these techniques, we rst study the parallelism of
chain-based Doacross loop nests on an ideal machine called Parallel Random Access
Machine (PRAM). A PRAM is a parallel machine with an innite shared random ac-
cess memory. The data communication between processor is done through the shared
memory. The time of memory access is assumed to be zero. In other words, the data
communication in PRAM takes zero time and the speedup is restricted only by data
dependences between tiles and the shape and size of the processor array. By comparing
the speedup of Doacross on PRAM and real distributed-memory multicomputers, we
can nd the impact of data communication overhead and the eectiveness of loop tiling
and indirect message passing.
Given a Doacross loop nest, a tile size vector ~k and a processor array Pm of PRAM
as described in Section 2, we want to calculate the speedup of the parallel code in Figure 4
on the PRAM. To simplify the model, we assume that lower bounds of the n-dimensional
Doacross loop nest are all zeros. The upper bound of the j-th loop is Nj 1. According
to the rectangular loop tiling (see Section 2), there are S1   Sn tiles where Sj = Nj=kj
is the number of tiles along the j-th dimension of the tile index set T n. It is assumed
that Nj is an integer multiple of kj for 1  j  n. We also assume that Sj is an integer
multiple of Pj, the number processors along the j-th (1  j  m) dimension of the
processor array Pm.
It is obvious that there are S1    Sm chains. Each chain contains Sm+1    Sn tiles.
We use g = k1    kn to model the execution time of a tile. The execution time of each
chain is
(Sm+1    Sn)g
We dene
Lm+1 = Sm+1    Sn
to be the length of the chain. The chains are allocated to parallel processors cyclically.
Suppose we have a 3-dimensional tile index set (n = 3) with (S1 = 4; S2 = 9 and
S3 = 2 to be allocated on a 2-dimensional (m = 2) processor array with P1 = 2 and
P2 = 3. Figure 7, which is the top view of the 3-dimensional tile set, illustrates the
processor allocation of the chains to the processor array. Each square in the gure
represents the rst tile of the chain and arrows represents the data dependences among
them. The blank squares are the rst tiles of the chains allocated to processor (0,0) and
the darkest squares are the rst tiles of the chains allocated to processor (2,3). The six
processors will execute the chains in the box at the lower-left corner in Figure 7 rst.
Then the chains in the box on the right are executed and so on. In particular, processor
(0,0) executes chain (0,0) rst, and then chain (0, 3) and so on. After nishing chain
(0,6), it comes back to execute chain (2,0) as illustrated in Figure 7. The starting time
of the rst tile of chain (0,0) is obviously 0. The start time of chain (0,3) by the same
processor (0,0) is constrained by to the following to conditions:
1. Due to the data dependences along the second dimension of the chains, the starting
time should be at least as large as P2g = 3g.
21
t2
t1
1 3 7
0
1
2
3
2 4 5 6 80
Figure 7. Chain-based Processor Allocation of the Example
2. Since processor (0,0) cannot start the rst tile of chain (0,3) until it completes the
entire chain (0,0), the start time should be at least as large as L3g = S3g = 2g
That is, the start time of chain (0,3) should be
max(P2; L3)g
If P2 was less than or equal to L3, processor (0,0) can start chain (0,3) right after it
nishes chain (0,0) and there is no time of busy-waiting between them. If P2 > L3, as
the case in our example of Figure 7, processor (0,0) has to busy-wait for (P2   L3)g = g
after each chain along the second dimension except the last one. Therefore, the total
time for processor (0,0) to complete all the chains along the second dimension is
L2g = (
S2
P2
L3 + (
S2
P2
  1)(P2   L3)
+)g = 8g
where (P2 L3)
+ is the notation for the positive part of the integer P2 L3. The positive
part of an integer x is dened by
x+ =
(
x if x  0
0 otherwise
Similarly, the start time of the rst tile of chain (2,0) by processor (0,0) is constrained
by: (1) the data dependences along the rst dimension of the chains, and (2) the total
time for processor (0,0) to complete the chains along the second dimension, L2. It is
max(P1; L2)g
For the same reason, the total time for processor (0,0) to complete its all chains along
22
the rst and second dimensions is
L1g = (
S1
P1
L2 + (
S1
P1
  1)(P1   L2)
+)g = 16g
Consider the start time of the chains allocated to processor (2,3). Due to the data
dependences of among the rst tiles of the chains the start time of the rst tile of a chain
by processor (2,3) is always
(P1 + P2   2)g = 3g
later than that of the chain started by processor (0,0) in the same box in Figure 7. It is
clear now that the total time to complete all chains of the loop nest by the six processors
is
TP = [L1 + (P1 + P2   2)]g = 19g
The sequential execution time of the loop nest by one processor is
TS = S1S2S3g = 72g
The speedup of the parallel execution is
SP = TS=TP = 3:80
In general, the speedup of parallel execution over sequential execution can be calcu-
lated as follows:
1. Calculate
Lm+1 = Sm+1    Sn;
Lm =
Sm
Pm
Lm+1 + (
Sm
Pm
  1)(Pm   Lm+1)
+;
...
L1 =
S1
P1
L2 + (
S1
P1
  1)(P1   L2)
+
2. Compute speedup
SP =
S1    Sn
L1 + (P1 +   Pm  m)
If all the (Pj   Lj+1)
+ = 0, 1  j  m, we can have the following closed formula for
the speedup:
SP =
P1   Pm
1 + (P1 +    + Pm  m)(P1    Pm)=(S1    Sn)
We parallelized the Doacross loop nest in Figure 1 by hand and run it on a Fujitsu
AP1000 with dierent tile sizes and processor arrays. The AP1000 we use has 128 Sparc
processors called cells connected by a two-dimensional (2D) mesh/torus network called
23
Fujitsu AP1000 PRAM
direct indirect
tile size T apP (sec) H(%) S
ap
P T
ap
P (sec) H(%) S
ap
P S
pram
P
4x8x2x2 67.424 27.85 92.2 61.312 20.66 101.4 127.8
4x8x4x4 54.362 10.16 114.4 54.036 9.62 115.1 127.3
4x8x8x8 52.910 6.22 117.5 52.853 6.12 117.6 125.3
4x8x16x16 55.122 4.34 112.8 55.197 4.81 112.6 117.9
4x8x32x32 67.599 3.50 92.0 67.676 3.60 91.9 95.3
8x16x2x2 54.435 10.28 114.2 53.650 8.97 115.9 127.3
8x16x4x4 51.329 3.34 121.1 51.002 2.71 121.9 125.3
8x16x8x8 53.477 1.40 116.3 53.309 1.08 116.6 117.9
8x16x16x16 65.440 0.31 95.0 65.030 -0.32 95.6 95.3
8x16x32x32 114.262 -0.95 54.4 113.873 -1.29 54.6 53.9
Table 1. Results on 2D (16x8) Processor Array
T net . The message passing primitives use T net for interprocessor data communication.
The AP1000 Cell Operating Systems allows users to dene a logical processor array of
any number of dimensions, although the physical topology of T net is the 2D mesh/torus.
To get the sequential execution time of the program in Figure 1 on a single processor,
we have to reduce the problem size from 128x128x128x128 to 32x32x32x32, because each
sparc processor in the AP1000 has only 16Mbyte memory. The sequential execution
time of this small size problem on a single processor is 24.285 seconds. We approximate
the sequential execution time of the problem size 128x128x128x128 by multiplying this
number with 1284=324, giving the approximate sequential execution time TS = 6216:96
seconds.
We use two processor array congurations in our experiments: 16x8 and 8x4x4. Both
contain 128 processors. For the 2D processor array (16x8), we x the rst and second
components of the tile size vector ~k and change the third and fourth components from
2x2, 4x4,   , up to 32x32. The xed part of the size vector has two congurations: 4x8
and 8x16. We coded and run the parallel programs using both direct and indirect message
passing. The results are shown in Table 1. T apP in the table is the parallel execution time
in seconds measured from the AP1000. SapP is the speedup calculated by
SapP =
TS
T apP
The speedup on the PRAM, SpramP , calculated by the algorithm above is also listed in
the table.
24
The communication overhead H listed in the table is calculated by
H =
T apP   T
pram
P
T apP
where T pramP is the parallel execution time of the PRAM. Because the PRAM is an ideal
parallel machine without communication overhead, the fraction dened above gives the
percentage of the parallel execution time on the AP1000 caused by the data communica-
tion. The PRAM actually does not have parallel execution time, because it is not a real
machine. For the purpose of comparison, it is calculated by
T pramP =
TS
SpramP
The speedup of the AP1000 with direct and indirect message passing as well as the
speedup of the PRAM for 2D processor array are plotted in Figure 8. The communication
overheads H are plotted in Figure 9.
For the 3D processor array (8x4x4), we x the rst three components of the tile size
vector and change the fourth component from 2, 4,   , up to 64. Table 2 lists the results
for the two congurations of the rst three components of the tile size vector: 4x8x8 and
8x16x16. The speedup and the communication overhead for the 3D processor array are
Fujitsu AP1000 PRAM
direct indirect
tile size T apP (sec) H(%) S
ap
P T
ap
P (sec) H(%) S
ap
P S
pram
P
4x8x8x2 66.235 26.44 93.9 62.818 22.44 99.0 127.6
4x8x8x4 60.738 19.53 102.4 59.093 17.29 105.2 127.2
4x8x8x8 58.637 16.12 106.0 57.605 14.62 107.9 126.4
4x8x8x16 57.543 13.43 108.0 56.992 12.59 109.1 124.8
4x8x8x32 58.236 12.35 106.8 57.898 11.84 107.4 121.8
4x8x8x64 98.924 9.05 62.8 99.030 9.15 62.8 69.1
8x16x16x2 56.477 11.80 110.1 55.871 10.84 111.3 124.8
8x16x16x4 55.688 8.34 111.6 55.314 7.72 112.4 121.8
8x16x16x8 57.572 7.07 108.0 57.312 6.65 108.5 116.2
8x16x16x16 62.298 6.21 99.8 62.074 5.87 100.2 106.4
8x16x16x32 72.305 5.51 86.0 76.542 10.74 81.2 91.0
8x16x16x64 117.604 4.58 52.9 117.396 4.41 53.0 55.4
Table 2. Results on 3D (8x4x4) Processor Array
plotted in Figure 10 and Figure 11, respectively.
From the performance data presented, we can have the following observations:
25
Speedup: From Figures 8 and 10, we rst notice that the speedup on the ideal PRAM
is aected by the tile size and shape. While SpramP approaches to 128 with small
tile sizes, it can drop to as low as around 55 for the large tile sizes. In general, as
the tile size increases, the dierence between the speedups of the AP1000 and the
PRAM decrease and the speedup of the AP1000 is restricted mainly by the limited
parallelism available. As the tile size decreases, the data communication overhead
increases and it pulls the speedup of the AP1000 down.
In each conguration, there is a tile size that gives the best speedup for the AP1000.
The tile sizes that give the best speedup are marked by arrows in the plots. In-
terestingly, the both tile sizes for the best speedup on the 2D processor array are
equal to 211. For the 3D processor array, the tile sizes for the best speedup are
around 212 to 213.
The highest speedup on the AP1000 observed is 121.9 with tile size 8x16x4x4 and
indirect message passing on the 2D (16x8) processor array.
Communication Overhead: From Figures 9 and 11, we see that communication over-
head decreases as the tile size increases. The communication overhead on 3D pro-
cessor array is larger than that on 2D processor array. This is probably due to the
two facts: (1) there are more messages in the 3D processor array than the 2D pro-
cessor array and (2) the 3D processor array is actually simulated by the AP1000
Operating system on top of the 2D physical mesh/torus network. After all, the
communication overhead is below 10% for the reasonably large tile sizes in most
congurations.
Indirect Message Passing: When the tile size is small, the communication overhead
of direct message passing is larger than that of indirect message passing. The gap
between them increases as the tile size decreases. For some large tile sizes, the
communication overhead of indirect message passing is larger than that of direct
message passing (see tile size 4x8x16x16 on the 2D processor array and tile size
8x16x16x32 for the 3D processor array). This is because the large tile size causes
more data to be forwarded in indirect message passing. After all, indirect message
passing does help reduce the communication overhead in most congurations, but
the reduction is limited.
There are a couple of anomalies in the speedups. We notice that, for tile sizes
8x16x32x32 and 8x16x16x16 (indirect message passing), the speedups of the AP1000
on the 2D processor array are larger than those of the PRAM by the small margin. That
is why we observe the negative (small though) communication overhead in these con-
gurations. This is probably due to the fact that we do not have the exact sequential
execution of the program for the size 128x128x128x128.
We conclude this section with the following remarks:
 By combining and tuning loop tiling, chain-base scheduling and indirect message
passing, compilers can generate ecient parallel codes on multicomputers with high
speedup and low data communication overhead for Doacross loop nests.
26
 The parallelism of the chain-based parallel programs with tiling depends on the
sizes and the shapes of the tile index set and the processor array. The algorithm to
calculate the speedup of the PRAM can be used to search for the best conguration.
 The right tile size for the low communication overhead and high speedup depends
on the machine used. On the AP1000, the right tile size for the Doacross loop
nests like the one in Figure 1 is about 211 to 212.
6 Conclusion
We have presented a compiler method that combines chain-based scheduling, loop tiling
and indirect message passing to parallelize Doacross loop nests for distributed-memory
multicomputers. The eectiveness of the method has been demonstrated by the experi-
ment results on a 128-processor Fujitsu AP1000.
The experiment results show that the high speedup (about 120) can be obtained by
tuning the tile size and processor array properly. By comparing the performances on the
AP1000 and an ideal parallel machine without communication cost, we show that the
data communication overhead in our method are very low (below 10%) with moderately
large grain sizes.
We contribute the high speedup and the low communication overhead to the combined
eect of chain-based scheduling, loop tiling and indirect message passing: chain-based
scheduling makes it possible to overlap the data communication with the computation;
loop tiling controls the grain size of the computation to amortize the message passing
overhead and, nally, the indirect message passing further reduces the number of mes-
sages.
References
[1] P. Tang, \Chain-Based Scheduling: Part I { Loop Transformations and Code Genera-
tion," Department of Computer Science, The Australian National University, Technical
Report TR-CS-92-09, June 1992.
[2] C. Ancourt and F. Irigoin, \Scanning Polyhedra with DO Loops," in Proceedings of the
Third ACM SIGPLAN Symposium on Principle and Practice of Parallel Programming ,
Wiliamsburg, Virginia, April 21-24, 1991, pp. 39{50.
[3] M. E. Wolf and M. S. Lam, \A Loop Transformation Theory and an Algorithm to
Maximize Parallelism," IEEE Transactions on Parallel and Distributed Systems, vol.
2, no. 4, pp. 452{471, October 1991.
[4] L. Lamport, \The Parallel Execution of DO Loops," Communications of the ACM ,
vol. 17, no. 2, pp. 83{93, February 1974.
[5] D. A. Padua and M. J. Wolfe, \Advanced Compiler Optimization for Supercomputers,"
Communications of the ACM , vol. 29, no. 12, pp. 1184{1201, December 1986.
27
[6] P. Feautrier, \Array Expansion," in Proceedings of the 1988 ACM International Con-
ference on Supercomputing , Malo France, July 1988, pp. 429{441.
28
50
60
70
80
90
100
110
120
130
4x8x2x2 4x8x4x4 4x8x8x8 4x8x16x16 4x8x32x32
S
pe
ed
up
Tile Size 
Speedup For Tile Size (4,8,x,x)
PRAM
AP (Direct)
AP (Indirect)
50
60
70
80
90
100
110
120
130
8x16x2x2 8x16x4x4 8x16x8x8 8x16x16x16 8x16x32x32
S
pe
ed
up
Tile Size 
Speedup For Tile Size (8,16,x,x)
PRAM
AP (Direct)
AP (Indirect)
Figure 8. Speedup on 2D (16x8) Processor Array
29
0
5
10
15
20
25
30
4x8x2x2 4x8x4x4 4x8x8x8 4x8x16x16 4x8x32x32
O
ve
rh
ea
d 
(%
)
Tile Size 
Communication Overhead For Tile Size (4,8,x,x)
AP (Direct)
AP (Indirect)
0
5
10
15
20
25
30
8x16x2x2 8x16x4x4 8x16x8x8 8x16x16x16 8x16x32x32
O
ve
rh
ea
d 
(%
)
Tile Size 
Communication Overhead For Tile Size (8,16,x,x)
AP (Direct)
AP (Indirect)
Figure 9. Communication Overhead on 2D (16x8) Processor Array
30
50
60
70
80
90
100
110
120
130
4x8x8x2 4x8x8x4 4x8x8x8 4x8x8x16 4x8x8x32 4x8x8x64
S
pe
ed
up
Tile Size 
Speedup For Tile Size (4,8,8,x)
PRAM
AP (Direct)
AP (Indirect)
50
60
70
80
90
100
110
120
130
8x16x16x2 8x16x16x4 8x16x16x8 8x16x16x16 8x16x16x32 8x16x16x64
S
pe
ed
up
Tile Size 
Speedup For Tile Size (8,16,16,x)
PRAM
AP (Direct)
AP (Indirect)
Figure 10. Speedup on 3D (8x4x4) Processor Array
31
0
5
10
15
20
25
30
4x8x8x2 4x8x8x4 4x8x8x8 4x8x8x16 4x8x8x32 4x8x8x64
O
ve
rh
ea
d 
(%
)
Tile Size 
Communication Overhead For Tile Size (4,8,8,x)
AP (Direct)
AP (Indirect)
0
5
10
15
20
25
30
8x16x16x2 8x16x16x4 8x16x16x8 8x16x16x16 8x16x16x32 8x16x16x64
O
ve
rh
ea
d 
(%
)
Tile Size 
Communication Overhead For Tile Size (8,16,16,x)
AP (Direct)
AP (Indirect)
Figure 11. Communication Overhead on 3D (8x4x4) Processor Array
32
