Chain-Based Scheduling: Part I - Loop Transformations and Code Generation

Peiyi Tang
Department of Computer Science
The Australian National University
Canberra ACT 2601 Australia
July 21, 1992
Abstract
Chain-based scheduling [1] is an efficient partitioning and scheduling
scheme for nested loops on distributed-memory multicomputers. The idea
is to take advantage of the regular data dependence structure of a nested
loop to overlap and pipeline the communication and computation.
Most partitioning and scheduling algorithms proposed for nested loops
on multicomputers [1,2,3] are graph algorithms on the iteration space of the
nested loop. The graph algorithms for partitioning and scheduling are too
expensive (at least O(N), where N is the total number of iterations) to be
implemented in parallelizing compilers. Graph algorithms also need large
data structures to store the result of the partitioning and scheduling.
In this paper, we propose compiler loop transformations and the code
generation to generate chain-based parallel codes for nested loops on multi-
computers. The cost of the loop transformations is O(nd), where n is the
number of nesting loops and d is the number of data dependences. Both
n and d are very small in real programs. The loop transformations and
code generation for chain-based partitioning and scheduling enable paralleliz-
ing compilers to generate parallel codes which contain all partitioning and
scheduling information that the parallel processors need at run time.

This work is supported in part by the Australian Research Council under Grant No. S6600132
and by the donation of Fujitsu Laboratories LTD.
Keywords: nested loops, partitioning, scheduling, multicomputers, loop
transformations, chain-based scheduling.
1 Introduction
Chain-based partitioning and scheduling [1] is an efficient scheme for executing nested
loops with constant data dependences on distributed address space multicomputers. The
idea is to take advantage of the regular data dependence structure to partition and
schedule loop iterations in such a way that communication and computation can be
overlapped and pipelined.
Due to the large context switch overhead of operating systems, partitioning and
scheduling at the loop iteration level of a program need to be managed by compilers.
Like most other partitioning and scheduling algorithms proposed for nested loops on
multicomputers [2,3], the chain-based scheduling algorithm [1] was originally a graph
algorithm on the iteration space of the nested loop. The cost of the graph-based algorithms
is at least O(N), where N is the total number of loop iterations. Note that the cost of
the nested loop itself is also O(N). In other words, the time for the compiler to run a
graph-based partitioning and scheduling algorithm is at least as much as the sequential
execution time of the nested loop itself. More importantly, graph-based partitioning and
scheduling algorithms need a large data structure to store the result of the scheduling for
each processor. The time cost of retrieving these large data structures from the disk at
run time will be prohibitively high.
The solution to the problems of graph-based partitioning and scheduling lies in com-
piler program transformations. In order to be practically useful in massively parallel
systems, any partitioning and scheduling scheme has to be incorporated in program
transformations. For nested loops, the most time-consuming parts of numerical scientific
programs, the partitioning and scheduling as well as the generation of communication
primitives should be done through loop transformations.
In this report, we present a series of loop transformations and code generation for
chain-based partitioning and scheduling. A parallelizing compiler can use these trans-
formations to generate SPMD (Single Program Multiple Data) style parallel programs,
in which all the partitioning and scheduling information is incorporated. When the
generated programs are executed by the parallel processors, the computation and the
communication are overlapped and pipelined.
In Section 2, the program model of nested loops with constant data dependences is
introduced. Section 3 presents the program transformations for chain-based partition-
ing and scheduling. They include loop skewing, loop tiling, loop normalization and
chain-based code generation. Section 4 concludes the paper with a brief discussion of
performance of the generated chain-based parallel codes.
2 Nested Loops and Constant Data Dependences
Nested loops consume most of the CPU time in scientific and engineering supercomputing.
To speed up the execution of nested loops on massively parallel computers like the Intel
Touchstone Delta System and the Fujitsu AP1000, the computation needs to be parallelized,
partitioned and scheduled to parallel processors.
    DO i_1 = L_1, U_1
      ...
        DO i_n = L_n, U_n
          B(i_1, ..., i_n)
        ENDDO
      ...
    ENDDO
Figure 1. Program model of uniform recurrence
In this paper, we concentrate on a class of nested loops known as uniform recurrences.
This class covers many important scientific computations, including iterative methods
for the linear systems that arise in solving partial differential equations.
A uniform recurrence is a perfectly-nested loop with constant data dependences be-
tween the loop iterations. Figure 1 shows the program model of a uniform recurrence.
Each loop bound, $L_k$ or $U_k$, can be a linear function of the indices of the surrounding
loops, i.e.
$$L_k = a_{k,0} + a_{k,1} i_1 + \cdots + a_{k,k-1} i_{k-1}$$
$$U_k = b_{k,0} + b_{k,1} i_1 + \cdots + b_{k,k-1} i_{k-1}$$
The loop body of the nested loop, denoted as $B(i_1, \cdots, i_n)$ in Figure 1, is a sequence
of assignments without if or exit statements. The nested loop in Figure 1 defines an
iteration space $I \subset Z^n$:
$$I = \{(i_1, \cdots, i_n) \in Z^n : L_k \le i_k \le U_k,\ k = 1, \cdots, n\}$$
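As an illustration of this definition, the following minimal Python sketch enumerates an iteration space in sequential execution order. The function name `iteration_space` and the encoding of each bound as a coefficient list $(a_{k,0}, a_{k,1}, \cdots, a_{k,k-1})$ are assumptions made for this example only; they are not part of the paper.

```python
def iteration_space(lower, upper):
    """Enumerate I in lexicographic (sequential execution) order.

    lower[k] and upper[k] hold the coefficients of the affine bounds of
    loop k+1: [constant, coeff of i_1, ..., coeff of i_k] (hypothetical encoding).
    """
    def bound(coeffs, outer):
        # constant term plus linear combination of the surrounding loop indices
        return coeffs[0] + sum(c * i for c, i in zip(coeffs[1:], outer))

    def walk(k, outer):
        if k == len(lower):
            yield tuple(outer)
            return
        for ik in range(bound(lower[k], outer), bound(upper[k], outer) + 1):
            yield from walk(k + 1, outer + [ik])

    return list(walk(0, []))


# Triangular example: DO i_1 = 0, 3 / DO i_2 = i_1, 3
triangle = iteration_space(lower=[[0], [0, 1]], upper=[[3], [3, 0]])
```

For the triangular example the sketch yields ten index vectors, starting at (0, 0) and ending at (3, 3).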
A data dependence exists from iteration $\vec{i} = (i_1, \cdots, i_n)$ to iteration
$\vec{i}' = (i'_1, \cdots, i'_n)$ if
1. $\vec{i} \ne \vec{i}'$ and $\vec{i}$ executes before $\vec{i}'$ in the temporal order of the nested loop, and
2. both iterations $\vec{i}$ and $\vec{i}'$ access the same scalar or array element and at least one of
the accesses is a write, and
3. there is no other iteration between $\vec{i}$ and $\vec{i}'$ that writes to the same data element.
If $\vec{i}$ writes and $\vec{i}'$ reads the data element, the data dependence is a true dependence
(also called a flow dependence), because it reflects a data flow between the two iterations.
If $\vec{i}$ reads and $\vec{i}'$ writes the data element, the data dependence is called an anti
dependence. If both $\vec{i}$ and $\vec{i}'$ write the data element, it is called an output
dependence. Anti and output dependences are caused by reuse of memory storage and, therefore,
are not essential. They can be eliminated by scalar renaming and scalar or array expansion.
For this reason, we consider only flow data dependences in this paper, assuming that all the
anti and output data dependences in original programs, if any, have been eliminated by
renaming and expansion.

    do i_1 = 0, 4
      do i_2 = 0, 4
        a(i_1, i_2) = f_1(c(i_1, i_2 - 1), b(i_1, i_2));
        b(i_1, i_2) = f_2(a(i_1 - 1, i_2 + 1), c(i_1, i_2));
        c(i_1, i_2) = f_3(b(i_1 - 1, i_2), a(i_1, i_2));
      enddo;
    enddo;
    (a) Program
    (b) Iteration space (axes i1 and i2, each from 0 to 4)
Figure 2. An Example of Nested Loop
If there is a data dependence from $\vec{i}$ to $\vec{i}'$, the vector $\vec{d} = \vec{i}' - \vec{i}$ is
called the distance vector of the dependence. A distance vector is always lexicographically
positive, i.e., not all of its elements are zero and its first non-zero element is positive,
because $\vec{i}'$ executes later than $\vec{i}$. The dependence is constant if the distance of the
dependence is independent of the source, i.e., it exists between every pair
$\vec{i}, \vec{i}' \in I$ such that $\vec{i} + \vec{d} = \vec{i}'$. A uniform recurrence is a perfectly
nested loop whose data dependences are all constant. Figure 2 shows
an example of a uniform recurrence and its iteration space with data dependences.
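The definitions above can be applied by brute force to the loop of Figure 2(a). The sketch below is an illustration only: it traces the reads and writes of the 5 x 5 loop in sequential order and records, for every read of an element written by an earlier iteration, the difference between the reading and the writing iteration. The function name `flow_distance_vectors` is hypothetical.

```python
from itertools import product


def flow_distance_vectors(lo=0, hi=4):
    """Trace the loop of Figure 2(a) and collect flow-dependence distances."""
    last_writer = {}   # (array name, element) -> iteration that wrote it last
    distances = {}     # array name -> set of distance vectors
    for i1, i2 in product(range(lo, hi + 1), repeat=2):  # lexicographic order
        # Statements in textual order: (array written, elements read)
        statements = [
            ("a", [("c", (i1, i2 - 1)), ("b", (i1, i2))]),
            ("b", [("a", (i1 - 1, i2 + 1)), ("c", (i1, i2))]),
            ("c", [("b", (i1 - 1, i2)), ("a", (i1, i2))]),
        ]
        for written, reads in statements:
            for array, element in reads:
                src = last_writer.get((array, element))
                # Only dependences between different iterations are recorded
                if src is not None and src != (i1, i2):
                    d = (i1 - src[0], i2 - src[1])
                    distances.setdefault(array, set()).add(d)
            last_writer[(written, (i1, i2))] = (i1, i2)
    return distances


vectors = flow_distance_vectors()
```

Each array contributes exactly one distance vector in this trace, (1, -1) for a, (1, 0) for b and (0, 1) for c, so the dependences of the example are constant in the sense defined above.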
There are three constant data dependences with distance vectors $\vec{d}_1 = (1, -1)$,
$\vec{d}_2 = (1, 0)$ and $\vec{d}_3 = (0, 1)$ associated with arrays a, b and c, respectively, all of
which are flow dependences. The data dependences of a uniform recurrence can be represented
by the set of distance vectors, $D = \{\vec{d}_1, \cdots, \vec{d}_m\}$. If the rank of D, r, is less
than n, the number of loops in the nested loop, we can transform the nested loop to a new
nested loop of n loops in which the $n - r$ outermost loops are DOALL loops [4]. A DOALL loop
is a parallel loop without data dependences across its iterations. Outer DOALL loops can be
efficiently executed on multicomputers without interprocessor communication [5]. If the rank
of D is equal to n, the outermost loop cannot be transformed to a DOALL loop. In this paper,
we concentrate on this kind of nested loop. Using the hyperplane method [4,6,7], such a
nested loop can be transformed to a nested loop with n loops in which the outermost loop
is a sequential loop and all the inner $n - 1$ loops are DOALL loops. However, the
overhead of the fork-join synchronization for the outermost sequential loop could be very
large. In this paper, a different parallelization approach is used. We use interprocessor
communication to enforce the data dependences if the related iterations are allocated to
different processors. A parallel loop with dependences between iterations enforced by
synchronization (as in shared address space multiprocessors) or communication (as in
distributed address space multicomputers) is called a DOACROSS loop. Chain-based
partitioning and scheduling spreads the iterations of DOACROSS loops across parallel
processors and uses communication to enforce data dependences between the loop iterations.

Algorithm 1 (Skewing Matrix)
inputs:
    D: set of dependence vectors;
outputs:
    E: set of transformed dependence vectors;
    T: unimodular matrix;

T = identity matrix;
for i = 2 to n do
    A = identity matrix;
    for each $\vec{d} \in D$ do
        if $\sum_{j=1}^{i-1} a_{ij} d_j + d_i < 0$ then
            let k be the smallest in $\{1, \cdots, i-1\}$ such that $d_k > 0$;
            $a_{ik} = \max(a_{ik}, \lceil (-d_i - \sum_{j=k+1}^{i-1} a_{ij} d_j) / d_k \rceil)$;
        endif;
    endfor;
    for each $\vec{d} \in D$ do
        $d_i = \sum_{j=1}^{i} a_{ij} d_j$;
    endfor;
    T = AT;
endfor;
E = D;
return(E, T);
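Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `skewing_matrix` is hypothetical, indices are 0-based, the input vectors are assumed to be lexicographically positive, and the ceiling is computed with integer arithmetic as `-(-x // y)`.

```python
def skewing_matrix(D):
    """Return (E, T): T is lower triangular with unit diagonal and
    E = [T d for d in D] has only non-negative elements."""
    n = len(D[0])
    E = [list(d) for d in D]  # working copy, updated row by row
    T = [[int(r == c) for c in range(n)] for r in range(n)]
    for i in range(1, n):             # 0-based row i is loop i+1 of the paper
        a = [0] * i                   # a[j] plays the role of a_{ij}, j < i
        for d in E:
            if sum(a[j] * d[j] for j in range(i)) + d[i] < 0:
                # smallest k with d_k > 0; exists for lexicographically
                # positive vectors because earlier elements are already >= 0
                k = next(j for j in range(i) if d[j] > 0)
                need = -d[i] - sum(a[j] * d[j] for j in range(k + 1, i))
                a[k] = max(a[k], -(-need // d[k]))   # integer ceiling
        for d in E:
            d[i] += sum(a[j] * d[j] for j in range(i))
        # T = A T: only row i changes, by adding a_{ij} times row j of T
        T[i] = [T[i][c] + sum(a[j] * T[j][c] for j in range(i))
                for c in range(n)]
    return [tuple(d) for d in E], T


# Dependence vectors of the example in Figure 2
E, T = skewing_matrix([(1, -1), (1, 0), (0, 1)])
```

For the example, the sketch produces T with rows (1, 0) and (1, 1) and the transformed vectors (1, 0), (1, 1) and (0, 1). Only the inner loops over D and over the columns of A do any work, which reflects the claim that the cost depends on n and the number of dependences rather than on the number of iterations.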
3 Program Transformations for Chain-Based Scheduling
In this section, we present program transformations to generate chain-based parallel
programs for nested loops on multicomputers. The transformations are: (1) loop skewing,
(2) loop tiling, (3) loop normalization and (4) chain-based partitioning and scheduling.
3.1 Loop Skewing
The purpose of loop skewing, as the first step of the series of transformations, is
twofold:
1. to facilitate the transformation for loop tiling;
2. to guarantee deadlock-free execution of tiles.
The time to pass a message of n data items (integers or floating-point numbers) from one
processor to another in a multicomputer can be expressed as
$$T_n = W + n\tau \qquad (1)$$
where W is the startup delay of the message and $\tau$ is the transfer time per data item
(the inverse of the data transfer rate). In current multicomputers, W is usually two orders
of magnitude longer than $\tau$. It would be extremely inefficient to send messages with a
single datum. Therefore, the computation of iterations of the nested loop needs to be grouped
so that the data to be passed between processors can be aggregated to form larger messages.
Each group is called a tile and it usually is a set of neighbouring iterations in the
iteration space. The computation of a tile is atomic with respect to message passing. In
other words, the processor receives all the data it needs before it starts the computation
of the tile. It does not send data to other processors until the computation of the tile is
finished.
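The effect of aggregation follows directly from equation (1). The short sketch below compares n single-datum messages with one message of n data items. The numeric values of W and the per-item transfer time are assumptions chosen only to reflect the stated ratio of about two orders of magnitude; they are not measurements from any machine.

```python
def message_time(n, W, tau):
    """Equation (1): time to pass one message carrying n data items."""
    return W + n * tau


W, tau = 100.0, 1.0   # assumed values: startup delay 100 times the per-item time
n = 64                # number of data items to transfer

one_by_one = n * message_time(1, W, tau)   # n messages of a single datum
aggregated = message_time(n, W, tau)       # one message of n data items
speedup = one_by_one / aggregated
```

With these assumed values the single-datum strategy costs 6464 time units against 164 for the aggregated message, since the startup delay is paid once instead of n times.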
Irigoin and Triolet suggested using hyperplanes to tile the iteration space [7]. Given
an iteration space $I \subset Z^n$, a set of hyperplane families
$$H = \{(\vec{h}_1, s_1), \cdots, (\vec{h}_n, s_n)\}$$
can be used to partition the iteration space into tiles. In particular, each
$(\vec{h}_k, s_k) \in H$ is a family of parallel hyperplanes with normal vector $\vec{h}_k$, and
$s_k$ is the distance between them. A tile is the set of iterations between the parallel
hyperplanes defined by H, i.e. a tile with index $(q_1, \cdots, q_n)$ is
$$\{\vec{i} \in I : \lfloor \vec{h}_k \cdot \vec{i} / s_k \rfloor = q_k,\ k = 1, \cdots, n\}$$
Since tiles are atomic with respect to message passing, deadlocks are possible [7]. To
prevent deadlocks, the following condition is sufficient:
$$\forall i \in \{1, \cdots, n\}, \forall j \in \{1, \cdots, m\} : \vec{h}_i \cdot \vec{d}_j \ge 0$$
Tiling with hyperplanes with arbitrary normal vectors is too general to be used by a
parallelizing compiler to generate simple and efficient parallel codes. After the loop
skewing transformation, we can use orthogonal hyperplane families to tile the transformed
iteration space into rectangular tiles. Compilers can generate simple and deadlock-free
parallel codes for the rectangular tiles, as will be seen shortly.
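The tile index and the sufficient deadlock condition can be checked mechanically. The sketch below uses hypothetical helper names (`dot`, `tile_index`, `satisfies_deadlock_condition`) and represents each hyperplane family as a pair (normal vector, distance). The vectors in D are the distance vectors of the Figure 2 example as derived from its array subscripts, and E holds the same vectors after skewing the second loop by the first.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def tile_index(i, H):
    """Index (q_1, ..., q_n) of the tile containing iteration i."""
    # Python's // is floor division, also for negative values
    return tuple(dot(h, i) // s for h, s in H)


def satisfies_deadlock_condition(H, D):
    """Sufficient condition: h_i . d_j >= 0 for every family and dependence."""
    return all(dot(h, d) >= 0 for h, _ in H for d in D)


# Orthogonal families with tile size 2 in both dimensions (assumed sizes)
H = [((1, 0), 2), ((0, 1), 2)]
D = [(1, -1), (1, 0), (0, 1)]   # before skewing
E = [(1, 0), (1, 1), (0, 1)]    # after skewing

before = satisfies_deadlock_condition(H, D)
after = satisfies_deadlock_condition(H, E)
```

Here `before` is False because (0, 1) . (1, -1) is negative, while `after` is True. This is the reason skewing precedes orthogonal tiling: rectangular tiles satisfy the condition only once all elements of the dependence vectors are non-negative.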
    DO j_1 = j_1^{min}, j_1^{max}
      ...
        DO j_n = G_n, H_n
          B'(j_1, ..., j_n)
        ENDDO
      ...
    ENDDO
Figure 3. Nested loop after skewing
Loop skewing is a loop transformation originally used to improve vectorization [8].
Loop skewing and two other loop transformations, loop interchange and loop reversal,
have recently been unified into a single loop transformation called unimodular
transformation [4,9]. An $n \times n$ integer matrix T is unimodular if its determinant is
$\pm 1$, i.e. $|\det(T)| = 1$. A unimodular matrix T transforms each index vector
$\vec{i} \in I$ to an index vector in a new iteration space
$J = \{\vec{j} \in Z^n : \exists \vec{i} \in I \text{ such that } T\vec{i} = \vec{j}\}$ (see footnote 1).
Under a unimodular transformation, the new iteration space J is the set of index vectors
$\vec{j} = (j_1, \cdots, j_n)$ confined in a convex hull determined by the new loop bounds. The
methods of calculating these new loop bounds can be found in [4,10]. Since $|\det(T)| = 1$,
the inverse of T is also an integer matrix. Therefore, for each $\vec{j}$ within the loop
bounds, there is an $\vec{i} \in I$ such that $\vec{i} = T^{-1}\vec{j}$. In other words, there is a
one-to-one mapping between the two iteration spaces and there are no "holes" in iteration
space J. The unimodular matrix T also transforms each dependence distance vector
$\vec{d} \in D$ to a new distance vector $T\vec{d}$. A unimodular transformation T is legal if and
only if, for all $\vec{d} \in D$, $T\vec{d}$ is lexicographically positive. Under a legal unimodular
transformation, all of the data dependences will be satisfied when the iterations in
iteration space J are traversed in lexicographic order.
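The two tests described here, unimodularity and legality, are simple enough to sketch directly. The helper names below (`det`, `lex_positive`, `is_unimodular`, `is_legal`) are hypothetical, and the determinant uses plain cofactor expansion, which is adequate only because the number of nested loops is small.

```python
def det(M):
    """Determinant of a small integer matrix by cofactor expansion."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** c * M[0][c]
               * det([row[:c] + row[c + 1:] for row in M[1:]])
               for c in range(len(M)))


def lex_positive(v):
    """True if the first non-zero element of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False  # the zero vector is not lexicographically positive


def is_unimodular(T):
    return abs(det(T)) == 1


def is_legal(T, D):
    """T is legal if every transformed distance vector is lex. positive."""
    transformed = [[sum(t * x for t, x in zip(row, d)) for row in T] for d in D]
    return all(lex_positive(e) for e in transformed)


D = [(1, -1), (1, 0), (0, 1)]      # distance vectors of the Figure 2 example
skew = [[1, 0], [1, 1]]            # skew the second loop by the first
interchange = [[0, 1], [1, 0]]     # swap the two loops
```

Both matrices are unimodular, but only the skewing matrix is legal for this D: interchanging the loops maps (1, -1) to (-1, 1), which is not lexicographically positive.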
In this paper, loop skewing is used to transform a nested loop to be fully permutable.
A nested loop is fully permutable if all elements of every dependence vector are
non-negative. Given a nested loop with arbitrary data dependences represented by D, we
want to find a unimodular matrix T such that each element of
$\vec{e} = (e_1, \cdots, e_n) = T\vec{d}$ is non-negative for all $\vec{d} \in D$. Algorithm 1 shows
how to construct such a unimodular matrix T. In Algorithm 1, T is constructed as a product
of a sequence of $n - 1$ unimodular matrices $A_i$ ($i = n, \cdots, 2$),
$T = A_n A_{n-1} \cdots A_2$. Define $T_i$ ($2 \le i \le n$) as $A_i T_{i-1}$, where $T_1$ is the
identity matrix. It can be seen from Algorithm 1 that, for all $\vec{d} \in D$, the first i
elements of $T_i \vec{d}$ are non-negative. $T_n$ gives the matrix T required. In Algorithm 1,
the elements on the diagonal of the matrix $A_i$ are all 1. All the other elements are 0,
except the first $i - 1$ elements of the i-th row, $a_{ij}$, $1 \le j \le i - 1$. Assume that,
after the $(i-1)$-th iteration, the first $i - 1$ elements of each
$\vec{d}^{i-1} = T_{i-1}\vec{d}$, $\vec{d} \in D$, are non-negative. In iteration i, the elements
$a_{ij}$, $1 \le j \le i - 1$, of $A_i$ are determined to make
$$\sum_{j=1}^{i-1} a_{ij} d^{i-1}_j + d^{i-1}_i \ge 0$$
for every $\vec{d}^{i-1} = T_{i-1}\vec{d}$, $\vec{d} \in D$. Note that the $a_{ik}$, $1 \le k \le i - 1$,
never decrease and all of $d^{i-1}_1, \cdots, d^{i-1}_{i-1}$ are non-negative. It is clear that
when the vectors $\vec{d}^{i-1} = T_{i-1}\vec{d}$ are processed one at a time in the inner loop of the
algorithm, the work done for previous vectors can never be undone. Since the first element
of every $\vec{d}^1 = T_1\vec{d} = \vec{d}$, $\vec{d} \in D$, is always non-negative, the above inductive
reasoning leads to the correctness of the algorithm.

Footnote 1. Here, both $\vec{i}$ and $\vec{j}$ are column vectors. In this paper, we use
comma-separated tuples to denote both column and row vectors and their usage can be figured
out from the context of the formulae.

    do j_1 = 0, 4
      do j_2 = j_1, j_1 + 4
        a(j_1, j_2 - j_1) = f_1(c(j_1, j_2 - j_1 - 1), b(j_1, j_2 - j_1));
        b(j_1, j_2 - j_1) = f_2(a(j_1 - 1, j_2 - j_1 + 1), c(j_1, j_2 - j_1));
        c(j_1, j_2 - j_1) = f_3(b(j_1 - 1, j_2 - j_1), a(j_1, j_2 - j_1));
      enddo;
    enddo;
    (a) Program
    (b) Skewed iteration space (axes j1, from 0 to 4, and j2, from 0 to 8)
Figure 4. The example after loop skewing
Once the skewing matrix T is obtained, the nested loop in Figure 1 can be transformed
to the nested loop shown in Figure 3. The transformation consists of two parts:
1. transformation of the loop body;
2. transformation of the loop bounds.
The transformation of the loop body is simple. Since $\vec{i} = T^{-1}\vec{j}$, replacing each
$i_k$, $k = 1, \cdots, n$, with the corresponding expression of $j_1, \cdots, j_n$ in the loop body
produces the new loop body $B'(j_1, \cdots, j_n)$. The new loop bounds can be obtained by the
method described in [4]. The lower and upper bounds of the first loop are the minimum and
maximum possible values of $j_1$, $j_1^{min}$ and $j_1^{max}$, respectively. The loop bounds of
the other loops are of the form ($k = 2, \cdots, n$):
$$G_k = \max(G_k^1, G_k^2, \cdots)$$
$$H_k = \min(H_k^1, H_k^2, \cdots)$$
where each $G_k^p$ or $H_k^p$ is a linear function of the outer loop index variables,
$j_1, \cdots, j_{k-1}$, as follows:
$$G_k^p = \lceil (g^p_{k,0} + \sum_{i=1}^{k-1} g^p_{k,i} j_i) / g^p_{k,k} \rceil$$
$$H_k^p = \lfloor (h^p_{k,0} + \sum_{i=1}^{k-1} h^p_{k,i} j_i) / h^p_{k,k} \rfloor$$

    DO k_1 = j_1^{min}, j_1^{max}, s_1
      ...
      DO k_n = j_n^{min}, j_n^{max}, s_n
        DO j_1 = k_1, min(k_1 + s_1 - 1, j_1^{max})
          ...
          DO j_n = max(k_n, G_n^1, G_n^2, ...),
                   min(k_n + s_n - 1, H_n^1, H_n^2, ...)
            B'(j_1, ..., j_n)
          ENDDO
          ...
        ENDDO
      ENDDO
      ...
    ENDDO
Figure 5. Nested loop after tiling

    DO k_1 = 0, ceil((j_1^{max} - j_1^{min} + 1) / s_1) - 1
      ...
      DO k_n = 0, ceil((j_n^{max} - j_n^{min} + 1) / s_n) - 1
        DO j_1 = j_1^{min} + k_1 s_1,
                 min(j_1^{min} + k_1 s_1 + s_1 - 1, j_1^{max})
          ...
          DO j_n = max(j_n^{min} + k_n s_n, G_n^1, G_n^2, ...),
                   min(j_n^{min} + k_n s_n + s_n - 1, H_n^1, H_n^2, ...)
            B'(j_1, ..., j_n)
          ENDDO
          ...
        ENDDO
      ENDDO
      ...
    ENDDO
Figure 6. Nested loop after normalization
The unimodular matrix for skewing the nested loop in Figure 2 is
$$T = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$$
The dependence vectors become $\vec{e}_1 = (1, 0)$, $\vec{e}_2 = (1, 1)$ and $\vec{e}_3 = (0, 1)$. The
skewed nested loop and its iteration space with data dependences are shown in Figure 4.
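The example can be verified numerically. The sketch below, with the hypothetical helper `apply`, checks that the loop bounds of Figure 4(a) visit exactly the images of the original iterations under T, that the integer inverse of T recovers the original indices, and that the distance vectors are transformed as stated. The original distance vectors are taken from the array subscripts of Figure 2(a).

```python
def apply(M, v):
    """Matrix-vector product for small integer matrices."""
    return tuple(sum(m * x for m, x in zip(row, v)) for row in M)


T = ((1, 0), (1, 1))        # skewing matrix of the example
T_inv = ((1, 0), (-1, 1))   # integer inverse: i_1 = j_1, i_2 = j_2 - j_1

# Iterations of Figure 2(a) and of Figure 4(a), both in execution order
original = [(i1, i2) for i1 in range(0, 5) for i2 in range(0, 5)]
skewed = [(j1, j2) for j1 in range(0, 5) for j2 in range(j1, j1 + 5)]

image = [apply(T, i) for i in original]        # T applied to every iteration
recovered = [apply(T_inv, j) for j in skewed]  # subscripts used in B'

D = [(1, -1), (1, 0), (0, 1)]
E = [apply(T, d) for d in D]
```

In this example `image` equals `skewed` element by element, so the skewed loop also preserves the original execution order, and every element of the vectors in E is non-negative.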
3.2 Loop Tiling
Let E be the set of transformed dependence vectors, $E = \{T\vec{d} : \vec{d} \in D\}$, so that
$\vec{e} \ge 0$ for all $\vec{e} \in E$ (see footnote 2). Now we can tile iteration space J with
the set of orthogonal hyperplane families:
$$H = \{(\vec{h}_1, s_1), \cdots, (\vec{h}_n, s_n)\}$$
where
$$\vec{h}_1 = (1, 0, \cdots, 0)$$
$$\vec{h}_2 = (0, 1, \cdots, 0)$$
$$\vdots$$
$$\vec{h}_n = (0, 0, \cdots, 1)$$
and $\vec{s} = (s_1, \cdots, s_n)$ is called the size vector of the tiling. For every
$\vec{e} = (e_1, \cdots, e_n) \in E$, $\vec{h}_k \cdot \vec{e} = e_k \ge 0$, $k = 1, \cdots, n$. Therefore,
the tiling is legal and deadlock-free.

Footnote 2. $\vec{e} \ge 0$ if each element of $\vec{e}$ is greater than or equal to 0.

    do k_1 = 0, 2
      do k_2 = 0, 4
        do j_1 = 2k_1, min(2k_1 + 1, 4)
          do j_2 = max(2k_2, j_1), min(2k_2 + 1, j_1 + 4)
            a(j_1, j_2 - j_1) = f_1(c(j_1, j_2 - j_1 - 1), b(j_1, j_2 - j_1));
            b(j_1, j_2 - j_1) = f_2(a(j_1 - 1, j_2 - j_1 + 1), c(j_1, j_2 - j_1));
            c(j_1, j_2 - j_1) = f_3(b(j_1 - 1, j_2 - j_1), a(j_1, j_2 - j_1));
          enddo;
        enddo;
      enddo;
    enddo;
    (a) Program
    (b) Tiled iteration space (axes j1, from 0 to 4, and j2, from 0 to 8; tile columns
        k1 = 0, 1, 2 and tile rows k2 = 0, ..., 4)
Figure 7. The example after tiling and normalization
The loop transformation of the orthogonal tiling above is straightforward. We only
need to introduce n additional loops to control the execution of the tiles as shown in
Figure 5. These new k-loops for tiles are called controlling loops. The execution of
iterations within a tile is still controlled by the j-loops. In Figure 5, the lower and upper
bounds of loop $k_l$ ($l = 1, \cdots, n$) are $j_l^{min}$ and $j_l^{max}$, the minimum and the
maximum possible values of $j_l$ in iteration space J, respectively. The method to determine
the lower and upper bounds of the controlling k-loops as well as the inner j-loops can be
found in [4].
To facilitate the chain-based partitioning and scheduling, the controlling k-loops need
to be normalized [11]. The normalized controlling loops are shown in Figure 6. Notice
that normalizing the controlling loops does not affect the loop body $B'(j_1, \cdots, j_n)$,
because the index variables of the controlling loops are only used in the loop bounds of the
inner j-loops.
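For the running example, the normalized loop nest of Figure 6 can be written out and checked against the skewed loop. The sketch below assumes the tile sizes $s_1 = s_2 = 2$ used in Figure 7 and the skewed bounds of Figure 4 ($0 \le j_1 \le 4$, $j_1 \le j_2 \le j_1 + 4$); the function names are hypothetical.

```python
def tiled_iterations(s1=2, s2=2):
    """Iterations of the example in tiled, normalized execution order."""
    j1_min, j1_max, j2_min, j2_max = 0, 4, 0, 8
    n_k1 = -(-(j1_max - j1_min + 1) // s1)   # ceil((max - min + 1) / s)
    n_k2 = -(-(j2_max - j2_min + 1) // s2)
    order = []
    for k1 in range(n_k1):                   # controlling loops
        for k2 in range(n_k2):
            lo1 = j1_min + k1 * s1
            for j1 in range(lo1, min(lo1 + s1 - 1, j1_max) + 1):
                lo2 = j2_min + k2 * s2
                # inner bounds combine the tile with the skewed loop bounds
                for j2 in range(max(lo2, j1), min(lo2 + s2 - 1, j1 + 4) + 1):
                    order.append((j1, j2))
    return order


def respects_dependences(order, E):
    """Check that the source of every dependence executes before its sink."""
    position = {j: t for t, j in enumerate(order)}
    for j in order:
        for e in E:
            src = (j[0] - e[0], j[1] - e[1])
            if src in position and position[src] >= position[j]:
                return False
    return True


skewed = [(j1, j2) for j1 in range(0, 5) for j2 in range(j1, j1 + 5)]
tiled = tiled_iterations()
E = [(1, 0), (1, 1), (0, 1)]
```

The tiled order is a permutation of the skewed iterations, and `respects_dependences(tiled, E)` holds because all elements of the vectors in E are non-negative. Tiles that fall outside the skewed space, such as $(k_1, k_2) = (0, 3)$, simply execute no iterations.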
After tiling and normalization, the nested loop of the example and its tile space are
shown in Figure 7. Note that some tiles are incomplete or empty. The loop bounds of
the inner j-loops will guarantee that only the relevant iterations will be executed [4].

    do k_1 = pid, 2, 2
      do k_2 = 0, 4
        receive message from PE_{(pid - 1) mod 2};
        do j_1 = 2k_1, min(2k_1 + 1, 4)
          do j_2 = max(2k_2, j_1), min(2k_2 + 1, j_1 + 4)
            a(j_1, j_2 - j_1) = f_1(c(j_1, j_2 - j_1 - 1), b(j_1, j_2 - j_1));
            b(j_1, j_2 - j_1) = f_2(a(j_1 - 1, j_2 - j_1 + 1), c(j_1, j_2 - j_1));
            c(j_1, j_2 - j_1) = f_3(b(j_1 - 1, j_2 - j_1), a(j_1, j_2 - j_1));
          enddo;
        enddo;
        send message to PE_{(pid + 1) mod 2};
      enddo;
    enddo;
    (a) Chain-based parallel code
    (b) Communication between chains (tile space of Figure 7(b); chains k1 = 0 and k1 = 2
        on PE0, chain k1 = 1 on PE1)
Figure 8. Chain-based Scheduling of the Example
3.3 Chain-Based Partitioning and Scheduling
After loop skewing, loop tiling and normalization of the controlling loops, we can form
chains of tiles and schedule them to the parallel processors. Let us first use the
transformed nested loop in Figure 7 to illustrate the idea of chain-based partitioning and
scheduling.
Suppose that there are p = 2 processors, $PE_0$ and $PE_1$, in the multicomputer system.
To execute the tiles shown in Figure 7(b), we can allocate all the tiles with $k_1 = 0$ to
$PE_0$ and all the tiles with $k_1 = 1$ to $PE_1$. After $PE_0$ finishes the tiles with
$k_1 = 0$, it can proceed to execute all the tiles with $k_1 = 2$. The tiles with the same
$k_1$ index form a chain. A chain is a sequence of tiles connected by data dependences.
The partitioning and scheduling of the example are shown in Figure 8. Figure 8(a)
shows the code for $PE_{pid}$. The processor identity is denoted by pid. In this example,
$pid \in \{0, 1\}$. After a processor finishes the computation of a tile, it packs a message
and sends it to the next processor. It receives a message from the previous processor
before the computation of the tile starts. The thick arrows in Figure 8(b) represent the
messages passed between processors.
The rationale of chain-based partitioning and scheduling is as follows:
• Allocating all the tiles of a chain to the same processor can save the message passing
for data dependences between the tiles. For example, the data dependences from
the tile $(k_1, k_2) = (0, 0)$ to the tile $(k_1, k_2) = (0, 1)$ can be satisfied by executing
the former first and using the local memory for data passing.
• Due to the regular dependence structure of the computation, the interprocessor
communication between chains can be aligned with each other and overlapped
with the computation of chains in a pipeline fashion.
• The code generation for such partitioning and scheduling is simple, as shown in
Figure 8(a).
In this example, we use the inner controlling loop to form chains and spread the
iterations of the outer controlling loop across the parallel processors. That is, the outer
controlling loop is executed as a DOACROSS loop.
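The pipelining effect of this schedule can be estimated with a small timing model. The sketch below is a simplification under stated assumptions, not a model taken from the paper: every tile takes the same time `t_tile`, a message sent after tile $(k_1, k_2)$ becomes available to tile $(k_1 + 1, k_2)$ after a latency `t_comm`, sending does not occupy the sender, and dependences within a chain are satisfied by the execution order on one processor. The function name `chain_schedule` and the parameter values are hypothetical.

```python
def chain_schedule(n_chains=3, chain_len=5, p=2, t_tile=1.0, t_comm=0.5):
    """Finish time of every tile (k1, k2) when chain k1 runs on PE (k1 mod p)."""
    finish = {}
    pe_free = [0.0] * p                  # time at which each PE becomes idle
    for k1 in range(n_chains):           # chains in increasing order
        pid = k1 % p
        for k2 in range(chain_len):      # tiles of a chain run back to back
            start = pe_free[pid]
            if k1 > 0:
                # wait for the message sent after tile (k1 - 1, k2)
                start = max(start, finish[(k1 - 1, k2)] + t_comm)
            finish[(k1, k2)] = start + t_tile
            pe_free[pid] = finish[(k1, k2)]
    return finish


finish = chain_schedule()                # tile space of Figure 7: 3 chains of 5 tiles
makespan = max(finish.values())
sequential = 3 * 5 * 1.0                 # one processor, no communication
```

Under these assumptions $PE_1$ starts its first tile at time 1.5, while $PE_0$ is still working on the chain $k_1 = 0$, and the whole tile space finishes at time 10 compared with 15 for sequential execution. In this model, by the time $PE_0$ reaches the chain $k_1 = 2$ the messages from $PE_1$ have already arrived, so the communication latency is hidden behind computation.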
In general, partitioning and scheduling tiles is a mapping M from tile space K to
processor space P, i.e., $M : K \to P$. A multicomputer has a network with a fixed topology,
such as a hypercube or a mesh, to connect processors. However, we can use an appropriate
logical structure for processor space P for partitioning and scheduling and then embed
the logical processor space into the physical parallel processors of the system. For a
tile space with n controlling loops, we need at least one innermost loop to form chains.
Assume that we are going to use $q \ge 1$ inner controlling loops to form chains. This
means that we are going to allocate all iterations of the q inner controlling loops to a
single processor. The $n - q$ outer controlling loops are DOACROSS loops and represent an
$(n - q)$-dimensional chain space, C, which contains all the chains. The data dependences
between chains are regular and can be satisfied by communication between processors.
The best way to pipeline the communication and computation among the processors is
to fully interleave chains across the processors. Therefore, the logical processor space we
need is an $(n - q)$-dimensional mesh defined as follows:
$$P = \{(p_1, \cdots, p_{n-q}) : 0 \le p_k \le P_k - 1,\ k = 1, \cdots, n - q\}$$
where $P_k$ is the size of the processor space in the k-th dimension and
$P_1 \times \cdots \times P_{n-q}$ is the total number of processors used for the computation.
Here we assume that the $(n - q)$-dimensional mesh can be embedded into the physical topology
of the network of the system.
The chain-based partitioning can now be represented by the following function
$M : K \to P$:
$$M(k_1, \cdots, k_{n-q}, \cdots, k_n) = (k_1 \bmod P_1, \cdots, k_{n-q} \bmod P_{n-q})$$
According to the above partitioning function, the chain with
$k_1 = w_1, \cdots, k_{n-q} = w_{n-q}$ will be executed by the processor
$(w_1 \bmod P_1, \cdots, w_{n-q} \bmod P_{n-q})$. In other words, the set of chains allocated to
processor $(p_1, \cdots, p_{n-q})$ is
$$\{(k_1, \cdots, k_{n-q}) \in C : k_l \equiv p_l \pmod{P_l},\ l = 1, \cdots, n - q\}$$
The processor will execute these chains in the increasing lexicographic order of
$(k_1, \cdots, k_{n-q})$. Therefore, the SPMD code for processor $(p_1, \cdots, p_{n-q})$ is the
program shown in Figure 9. The program contains all the partitioning and scheduling
information that each processor needs at run time.

    DO k_1 = p_1, ceil((j_1^{max} - j_1^{min} + 1) / s_1) - 1, P_1
      ...
      DO k_{n-q} = p_{n-q}, ceil((j_{n-q}^{max} - j_{n-q}^{min} + 1) / s_{n-q}) - 1, P_{n-q}
        DO k_{n-q+1} = 0, ceil((j_{n-q+1}^{max} - j_{n-q+1}^{min} + 1) / s_{n-q+1}) - 1
          ...
          DO k_n = 0, ceil((j_n^{max} - j_n^{min} + 1) / s_n) - 1
            Receive and Unpack messages;
            DO j_1 = j_1^{min} + k_1 s_1,
                     min(j_1^{min} + k_1 s_1 + s_1 - 1, j_1^{max})
              ...
              DO j_n = max(j_n^{min} + k_n s_n, G_n^1, G_n^2, ...),
                       min(j_n^{min} + k_n s_n + s_n - 1, H_n^1, H_n^2, ...)
                B'(j_1, ..., j_n)
              ENDDO
              ...
            ENDDO
            Pack and Send messages;
          ENDDO
          ...
        ENDDO
      ENDDO
      ...
    ENDDO
Figure 9. SPMD chain-based parallel code
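The partitioning function and the strided outer loops of Figure 9 can be sketched as follows. The names `owner` and `chains_of` are hypothetical, and `extents` is an assumed parameter that gives the number of tiles along each of the $n - q$ outer controlling loops.

```python
from itertools import product


def owner(k, P):
    """M(k_1, ..., k_n): logical processor that executes tile k.

    zip stops after len(P) = n - q elements, so the inner q tile
    indices do not influence the result.
    """
    return tuple(kl % Pl for kl, Pl in zip(k, P))


def chains_of(p, P, extents):
    """Chains executed by processor p, in increasing lexicographic order.

    Mirrors the outer loops of Figure 9: k_l runs from p_l with stride P_l.
    """
    ranges = [range(pl, ext, Pl) for pl, Pl, ext in zip(p, P, extents)]
    return list(product(*ranges))


# Running example: 3 chains (k_1 = 0, 1, 2) on a 1-dimensional mesh of 2 PEs
pe0 = chains_of((0,), (2,), (3,))
pe1 = chains_of((1,), (2,), (3,))

# A 2-dimensional chain space of 3 x 4 chains on a 2 x 2 logical mesh
mesh = {p: chains_of(p, (2, 2), (3, 4)) for p in product(range(2), range(2))}
```

For the running example this gives chains 0 and 2 to $PE_0$ and chain 1 to $PE_1$, matching Figure 8. In the two-dimensional case every chain appears in the list of exactly one processor, and that processor is the one returned by `owner`.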
4 Concluding Remarks
We have presented a series of loop transformations to generate chain-based parallel
programs for nested loops on multicomputers. They are loop skewing, loop tiling and
normalization, and chain-based partitioning and scheduling. We are currently implementing
these transformations in our prototype parallelizing compiler for the Fujitsu AP1000.
We have been concentrating on the loop transformations for chain-based partitioning
and scheduling. The code generation for message packing and unpacking in the chain-
based codes will be discussed elsewhere.
There are a number of factors that will affect the performance of the generated
chain-based parallel codes:
Size of tiles. The size of tiles determines the granularity of the parallel computation.
The tile size should be chosen so that the execution time of a tile is roughly
equal to the communication time of a message passed between processors. This
balance between computation and communication hides the communication
latency and offers good pipelining for the whole parallel execution.
Regularity of chain length. Since the tile space is a trapezoid instead of a rectangle
due to the possible loop skewing, choosing different controlling loops to form chains
will affect the pipelining of interprocessor communication. In fact, all the n
controlling loops are interchangeable and we can move any of them inwards to form
chains. The question is how many and which loops should be moved inwards and
used to form chains to achieve the best performance.
Processor space. We need to decide the number of processors needed and the shape of
the logical processor space to give the best performance. This problem is directly
related to the size of tiles and the shape of the chain space.
A thorough discussion of these issues goes beyond the scope of this paper and we will
address them in the next report: "Chain-Based Scheduling: Part II - Granularity and
Performance".
References
[1] P. Tang and G. Michael, "Chain-Based Partitioning and Scheduling of Nested Loops
for Multicomputers," Proceedings of the 1991 International Conference on Parallel
Processing, vol. II, pp. 243-246, August 1991.
[2] C.-T. King, W.-H. Chou and L. M. Ni, "Pipelined Data-Parallel Algorithms: Part
II - Design," IEEE Transactions on Parallel and Distributed Systems, vol. 1, no. 4,
pp. 486-499, October 1990.
[3] J.-P. Sheu and T.-H. Tai, "Partitioning and Mapping Nested Loops on Multiprocessor
Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 2, no. 4,
pp. 430-439, October 1991.
[4] M. E. Wolf and M. S. Lam, "A Loop Transformation Theory and an Algorithm to
Maximize Parallelism," IEEE Transactions on Parallel and Distributed Systems, vol.
2, no. 4, pp. 452-471, October 1991.
[5] G. Michael and P. Tang, "Parallel Loop Code Generation for AP1000," in Proceedings
of the Second Fujitsu-ANU CAP Workshop, Canberra, Australia, November 1991,
pp. D.1-D.11.
[6] L. Lamport, "The Parallel Execution of DO Loops," Communications of the ACM,
vol. 17, no. 2, pp. 83-93, February 1974.
[7] F. Irigoin and R. Triolet, "Supernode Partitioning," in Conference Record of the
Fifteenth Annual ACM Symposium on Principles of Programming Languages, San Diego,
CA, January 1988, pp. 319-329.
[8] M. Wolfe, "Loop Skewing: the Wavefront Method Revisited," Kuck and Associates,
Inc., 1987.
[9] U. Banerjee, "Unimodular Transformations of Double Loops," in Advances in Languages
and Compilers for Parallel Processing: Proceedings of the Third (1990) Workshop
on Languages and Compilers for Parallel Computing, A. Nicolau, D. Gelernter,
T. Gross and D. Padua, Eds. London, UK: Pitman, pp. 192-219, 1991.
[10] K. G. Kumar, D. Kulkarni and A. Basu, "Generalized Unimodular Loop Transformations
for Distributed Memory Multiprocessors," Center for Development of Advanced
Computing, FG-TR-014, January 1991.
[11] H. Zima and B. Chapman, Supercompilers for Parallel and Vector Computers.
Addison-Wesley Publishing Company, 1991.
