Parallelizing non-vectorizable loops for MIMD machines by Kim, Ki-chang & Nicolau, Alexandru
UC Irvine
ICS Technical Reports
Title
Parallelizing non-vectorizable loops for MIMD machines
Permalink
https://escholarship.org/uc/item/4xw088kv
Authors
Kim, Ki-chang
Nicolau, Alexandru
Publication Date
1990
 
Peer reviewed
Parallelizing Non-Vectorizable Loops for MIMD machines
Ki-chang Kim and Alexandru Nicolau
Department of Information and Computer Science
University of California, Irvine
Irvine, CA. 92717 
Technical Report #90-01 
January 1990 
Parallelizing Non-Vectorizable Loops for MIMD machines
Ki-chang Kim and Alexandru Nicolau 
Department of Information and Computer Science 
University of California, Irvine 
Irvine, CA. 92717 
Abstract 
Parallelizing a loop for MIMD machines can be described as a process of partitioning it 
into a number of relatively independent subloops. Previous approaches to partitioning non-
vectorizable loops were mainly based on iteration pipelining which partitioned a loop based 
on iteration number and exploited parallelism by overlapping the execution of iterations.
However, the amount of parallelism exploited this way is limited because the parallelism inside 
iterations has been ignored. In this paper, we present a new loop partitioning technique which 
can exploit both forms of parallelism - inside and across iterations. While inspired by the 
VLIW approach, our method is designed for more general, asynchronous, MIMD machines. 
In particular, our schedule takes the cost of communication into account, and attempts to 
balance it with respect to parallelism. We show our method is correct, efficient, and produces 
better schedules than previous iteration level approaches. 
1 Introduction 
To utilize the power of multiple processors in asynchronous MIMD machines, we need to 
decompose a task into parallel subtasks. Parallelization of a task could be done by the human 
programmer, or by a parallelizing compiler. Our interest is in the latter. The major concern of 
this paper is loop parallelization - partitioning a loop into a number of relatively independent 
subtasks. Loop partitioning is different from graph partitioning, in that the former deals with 
a potentially infinite graph due to the number of iterations which in general is not known at 
compile time. Since we assume non-vectorizable loops (implying the presence of loop-carried 
dependences), the problem is how to partition, efficiently, a graph which contains a number 
of arbitrarily long paths that are entangled together.
A dominant technique for loop parallelization (for non-vectorizable loops) is iteration 
pipelining, e.g., Dopipe [Padua79] or DOACROSS [Cytron86]. It partitions a loop based on 
indices; the destination processor of any operation is determined solely by its iteration num-
ber. A typical way of partitioning is interleaving: the indices are partitioned into p groups, 
where the ith partition contains those iterations whose indices satisfy (x mod p) = i, where 
x is the iteration number. The subloops formed in this way are distributed to processors 
and executed concurrently. Of course, since dependences may exist between iterations, all 
such potential dependences need to be identified at compile time, and skewing between the 
parallel iterations needs to be introduced. This skewing can be obtained on asynchronous 
multiprocessors by inserting synchronization code at appropriate points. This synchroniza-
tion will have some cost, and for the iteration pipelining technique to work well this cost 
should be relatively small. Such synchronization can be effectively achieved in a variety of 
machines, hence the popularity of this technique. However, it does not, in itself, attempt 
to optimize execution by taking into account communication cost. Furthermore, since the 
unit of scheduling is an iteration, all parallelism that might have existed inside iterations is 
ignored, and only the parallelism across iterations is being exploited. 
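The interleaving rule described above can be sketched in a few lines of Python. The function name and the example sizes are illustrative assumptions, not part of the original report; the sketch only shows which iterations land on which processor and leaves out the synchronization code.

```python
def interleave_iterations(num_iterations, p):
    """Assign iteration x to partition (x mod p); returns a list of p lists."""
    partitions = [[] for _ in range(p)]
    for x in range(num_iterations):
        partitions[x % p].append(x)
    return partitions


if __name__ == "__main__":
    # Hypothetical sizes: 8 iterations spread over 3 processors.
    for i, part in enumerate(interleave_iterations(8, 3)):
        print(f"processor {i}: iterations {part}")
```

Because the destination depends only on the iteration number, every operation of an iteration ends up on the same processor, which is why parallelism inside an iteration is not exploited by this scheme.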
Another technique for dealing with non-vectorizable loops is Perfect Pipelining [AiNi88a] 
[AiNi88b]. This technique was targeted for statically scheduled, synchronous architectures 
(e.g., VLIW's) [FiDo84], and, thus, its purpose is to find as many parallel operations as possible,
regardless of iteration boundaries, to fill the long instruction word. When we assume
zero communication/synchronization delay, the loop parallelization problems in MIMD's and
VLIW's become similar. Given enough processors for MIMD machines, and sufficient functional
units in VLIW machines, the optimal schedule based on (compile time) data dependences
for both architectures can be obtained by scheduling each operation at the earliest
time it can be executed.
Since the number of iterations in the loop is, in general, not known at compile time,
scheduling every operation at the earliest time it can be executed seems impossible. However,
[AiNi88a] has found that when every operation is scheduled as early as possible, the resulting
schedule shows a repeating pattern.¹ An example of a pattern is found in Figure 3(b). It is
obtained by sorting the graph in Figure 3(a) topologically subject to data dependences, which
corresponds to scheduling the operations in the figure as early as possible. We underlined
a set of repeating operations (with a finite difference in index value, 1 in this case), which 
we call a pattern. The importance of this pattern is that we can reproduce the optimal 
schedule of the loop merely by repeating its pattern. Perfect Pipelining is based on this 
concept of pattern. It identifies the pattern and replaces the loop body with it, yielding 
an optimal schedule (given compile time data dependences) for a multiprocessor with zero 
communication time and enough processors. 
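To make the idea of a repeating pattern concrete, the following Python sketch computes the earliest start cycle of every operation instance of an unwound loop under the idealized assumptions (unit latency, zero communication time, enough processors). The five-node loop body, the function name, and the edge encoding (source, destination, dependence distance) are hypothetical choices made for illustration only.

```python
def asap_levels(nodes, edges, iterations):
    """Earliest start cycle of every (node, iteration) instance.

    edges holds (src, dst, distance) with distance 0 or 1; every node
    takes one cycle and communication is free (the idealized setting).
    """
    start = {}
    for it in range(iterations):
        remaining = list(nodes)
        while remaining:
            for v in remaining:
                preds = [(s, it - d) for s, t, d in edges
                         if t == v and it - d >= 0]
                if all(pr in start for pr in preds):
                    start[(v, it)] = max((start[pr] + 1 for pr in preds),
                                         default=0)
                    remaining.remove(v)
                    break
            else:
                raise ValueError("cycle among distance-0 edges")
    return start


# Hypothetical loop body: A -> B -> C and D -> E inside an iteration,
# with loop-carried links A -> A, E -> A, D -> D and C -> D.
NODES = ["A", "B", "C", "D", "E"]
EDGES = [("A", "B", 0), ("B", "C", 0), ("D", "E", 0),
         ("A", "A", 1), ("E", "A", 1), ("D", "D", 1), ("C", "D", 1)]

if __name__ == "__main__":
    levels = asap_levels(NODES, EDGES, 6)
    for v in NODES:
        print(v, [levels[(v, i)] for i in range(6)])
```

For this hypothetical loop every node repeats with the same offset: the start cycle of an instance two iterations later is always five cycles larger, so a block covering two iterations and five cycles reproduces the whole idealized schedule.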
In this paper we extend the concept of pattern to the case of non-zero communication 
time. We prove that a pattern emerges in the resulting schedule even when each operation in 
the loop is assigned to the first available processor, that is, the first processor that can execute 
the operation at the earliest time, considering the cost of communication. This assignment 
destroys the ideal pattern of Perfect Pipelining due to the introduction of communication 
delays but we show that the resulting schedule produces a new pattern of its own. Thus our 
scheduling algorithm trades off parallelism and inter-processor communication in an effort to 
optimize overall performance on MIMD machines with non-zero communication time. 
To conform with the inability of general purpose MIMD machines to execute multi-way 
jumps of the kind supported by VLIW's, we will assume the input loop is either without 
conditional statements or is if-converted [AlKe83]. This will also make comparison with
conventional iteration based methods for MIMD machines meaningful, as this technique does 
not deal with in-loop conditional jumps. 
The rest of the paper is organized as follows: Section 2 explains the scheduling technique 
we have developed and proves its correctness; Section 3 gives several examples to highlight 
various points in the scheduling process; Section 4 reports the results of experimentations 
we have performed to test the performance and robustness of our algorithm; and Section 5 
summarizes our conclusions. 
¹ More accurately, each operation shows a repeating pattern (i.e., repeats with a fixed frequency). The
details of how an overall pattern can be detected and the proof of its existence are in [AiNi88a].
Figure 1: A classification example. 
2 Scheduling in the presence of communication constraints. 
2.1 Modeling the structure of a loop 
Before we present our algorithm, we need to introduce our model of a loop. This model is
useful in simplifying the following discussion.
We assume two things: the data dependence graph of the loop is a connected one, and all
dependence distances are one or zero.² If the graph is not connected, we can simply separate
the graph into several connected ones and apply our scheduling algorithm to each of them
independently. Also if the dependence distances are greater than one, we can reduce them
down to one or zero by unwinding the loop properly, as explained in [MuSi87].
A loop is viewed as a five-tuple, <V, E, Flow-in, Cyclic, Flow-out>. V is a set of
nodes, where a node represents a unit of computation - it could be a single operation or a
whole procedure.³ E is a set of two-tuples, <v1, v2>, where each two-tuple represents a data
dependence link from node v1 to node v2. Together, V and E define the data dependence
graph for this loop. Flow-in, Cyclic, and Flow-out are disjoint subsets of V satisfying the
following conditions: a node is in Flow-in if it has no predecessors or all of its predecessors
are in Flow-in; a node is in Flow-out if it is not in Flow-in, and has no successors or all of
its successors are in Flow-out; a node is in Cyclic if it is neither in Flow-in nor in Flow-out.⁴
² The precise definitions of data dependence graph and dependence distance used here conform to the
standard ones as described in [Padua79].
³ Granularity should be chosen depending on machines, to make the execution time of a node within the
same order of magnitude as communication cost.
⁴ So, to identify these subsets, the Flow-in subset should be identified first, then Flow-out subset, and then
In Figure 1, for example, nodes (A,B,C,D,F) are in Flow-in, nodes (G,H,J) are in Flow-out,
and nodes (E,I,K,L) are in Cyclic.
The reason for this classification is based on the observation that the Cyclic nodes, nodes
belonging to the Cyclic subset, are the ones which really determine the execution time of the
given loop (assuming enough resources are provided). Flow-in and Flow-out nodes have little
impact on the total execution time. The scheduling of Flow-in nodes is limited only by the
latest time they can be scheduled, and the scheduling of Flow-out nodes is limited only by
the earliest time they can be scheduled. Note that if there are no Cyclic nodes, the loop is a
DOALL loop.
Below we present two lemmas related to the Cyclic subset which will be used later in
Section 2.3. 
Lemma 1. There is at least one strongly connected subgraph⁵ in a Cyclic subset. (Examples
of strongly connected subgraphs are (E,I) and (L) in Figure 1.)
Proof: If there is no strongly connected subgraph, there is no cycle in the Cyclic
subset. This means all nodes in the Cyclic subset are Flow-in nodes by definition,
because starting from the roots of the graph we can classify all nodes as Flow-in nodes.
This is a contradiction since a Cyclic subset cannot contain Flow-in nodes; therefore,
a Cyclic subset contains at least one strongly connected component.
Lemma 2. If a loop which consists of a single Cyclic subset is unwound m times, there
exists a path of length at least m - 1 in the unwound graph.
Proof: By Lemma 1 there is at least one strongly connected subgraph in the original graph,
and hence a cycle. Any such cycle must include a loop-carried dependence link, so following
it links each unwound iteration to the next; unwinding m times, we should therefore have a
path of length at least m - 1.
The algorithm classification in Figure 2 is used to identify each subset. Its time complexity
is O(m), where m is the number of dependence links in the input data dependence
graph, because each edge (i.e., dependence link) in the input graph cannot be visited more
than once. In terms of N, the number of nodes, it is O(N^2) in the worst case.
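The three-way classification can be sketched in Python as follows. This is a minimal illustration that follows the definitions of Flow-in, Flow-out, and Cyclic directly; it uses a simple fixed-point iteration rather than the worklist formulation of algorithm classification, so it does not achieve the O(m) bound. The function name and the small example graph are assumptions made for illustration.

```python
def classify(nodes, edges):
    """Split nodes into (flow_in, cyclic, flow_out) sets.

    edges is a collection of (src, dst) dependence links, including
    the loop-carried ones.
    """
    preds = {v: set() for v in nodes}
    succs = {v: set() for v in nodes}
    for s, t in edges:
        succs[s].add(t)
        preds[t].add(s)

    # Flow-in: no predecessors, or all predecessors already in Flow-in.
    flow_in = set()
    changed = True
    while changed:
        changed = False
        for v in nodes:
            if v not in flow_in and preds[v] <= flow_in:
                flow_in.add(v)
                changed = True

    # Flow-out: not in Flow-in, and no successors or all in Flow-out.
    flow_out = set()
    changed = True
    while changed:
        changed = False
        for v in nodes:
            if (v not in flow_in and v not in flow_out
                    and succs[v] <= flow_out):
                flow_out.add(v)
                changed = True

    cyclic = set(nodes) - flow_in - flow_out
    return flow_in, cyclic, flow_out


# Hypothetical graph: A -> B -> E, a cycle E <-> I, and I -> G.
EXAMPLE_NODES = ["A", "B", "E", "I", "G"]
EXAMPLE_EDGES = [("A", "B"), ("B", "E"), ("E", "I"), ("I", "E"), ("I", "G")]
```

On the hypothetical graph, A and B are Flow-in, G is Flow-out, and the cycle (E, I) forms the Cyclic subset. A graph without any cycle yields an empty Cyclic subset, which corresponds to the DOALL case.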
2.2 Algorithm 
The basic strategy of our algorithm is to extract the Cyclic nodes from the loop, which form 
the central part of the schedule, and schedule them utilizing the concept of pattern, and then 
Cyclic subset. Also, since we don't deal with conditional jumps inside the loop in any special way, we ignore
them in the scheduling process, and thus a data dependence graph alone is enough to represent the loop
unambiguously.
⁵ A strongly connected graph is one in which every node can be reached from every other node.
Algorithm. classification
Input.  Data Dependence Graph of a loop
Output. Flow-in, Cyclic, Flow-out subsets of the loop
Method.
  0. Flow-in = Cyclic = Flow-out = {}
  1. buffer1 = {nodes which have no predecessors}.
  2. If buffer1 is empty, go to 5.
     Else add the nodes in buffer1 to Flow-in.
  3. buffer2 = {}.
     For each node x in buffer1
       for each successor y of x
         if all predecessors of y are in Flow-in
           include y in buffer2.
  4. buffer1 = buffer2. go to 2.
  5. buffer1 = {nodes which are not in Flow-in and have no successors}.
  6. If buffer1 is empty, go to 9.
     Else add the nodes in buffer1 to Flow-out.
  7. buffer2 = {}.
     For each node x in buffer1
       for each predecessor y of x
         if y is not in Flow-in and all successors of y are in Flow-out
           include y in buffer2.
  8. buffer1 = buffer2. go to 6.
  9. Cyclic = {nodes which are in neither Flow-in nor Flow-out}.
Figure 2
include the schedule of non-Cyclic nodes. For now, suppose we have a loop which contains
only Cyclic nodes (see Figure 3(a)).
The natural schedule that DOACROSS will produce for this loop is in the left two columns
of Figure 3(c). However, we can produce a better schedule as shown in the right two columns
of the same figure.⁶ There we are exploiting parallelism inside as well as across iterations,
while in DOACROSS only the latter form of parallelism is exploited. The issue is: can we
exploit both forms of parallelism in the presence of a large or unknown loop bound, while
factoring in communication cost?
As pointed out in the introduction section, our approach is based on a generalization of
the concept of pattern first developed in [AiNi88a]. Our algorithm utilizes the concept of
pattern in two ways. It first obtains the idealized pattern of Perfect Pipelining which does
not take into account communication delays. Then, it schedules the nodes in the pattern
one by one⁷ to the processor which can execute it at the earliest time, when taking into
account not only operational latencies, but also communication cost.⁸ By doing this, we are
distorting (skewing) the idealized pattern to accommodate communication cost. Since the
skewing we introduce is based on consistent (fixed) communication cost estimates, we expect
that another pattern will emerge from the resulting schedule.
In the right two columns of Figure 3(c), we show one example of such a pattern emerging
(enclosed with a box in the figure). In Section 2.3 we prove the existence of such patterns in
general.
The algorithm for scheduling the Cyclic subset is in Figure 4. Its time complexity is
O(M * P * N^2 + M^3 * N^3), where M is the expected number of unrollings to find a pattern, N
is the number of operations in the loop body, and P is the number of processors. We have a
total of M * N nodes to schedule. Most of the computing time is consumed in step 2. Its first
sub-step is processor selection, the second pattern detection, and the third executable nodes
collection. For each node, v, finding the destination processor takes O(N * P), because in the
worst case we have to compute P T(v,Pj)'s, one for each processor, and computing a T(v,Pj)
⁶ In Figure 3(c) the subscripts show the iteration numbers. That is, A0 implies an instance of A from
iteration 0. In this example the execution time of each node and the cost of communication are both assumed
to be one cycle.
⁷ Since the (idealized) pattern shows only a partial ordering of nodes due to topological sorting, we need
to enforce a fixed order for each set of parallel nodes in it to ensure the emergence of a new pattern. Any
ordering (e.g., lexicographical ordering) is acceptable as long as it is consistent.
⁸ In actual implementation, as can be seen later in algorithm Cyclic-sched, these two steps, finding an
idealized pattern and scheduling it, are not separated. In algorithm Cyclic-sched, the data dependence graph
of the loop is topologically traversed while at the same time each node visited is being scheduled.
[Figure 3(a): data dependence graph of the example loop, with nodes A through G; drawing not recoverable.]
(b) Topologically sorted node list: C0 A0 D0 B0 F0 E0 G0 C1 A1 D1 B1 F1 E1 G1 C2 A2 D2 B2 F2 E2 G2 ...
[Figure 3(c): schedules for steps 0 through 23 on PE0 and PE1, by DOACROSS (left two columns) and by our algorithm (right two columns); table contents not recoverable.]
Figure 3: A scheduling example.
Algorithm. Cyclic-sched.
Input.  Data dependency graph of the Cyclic subset.
Output. A schedule for the Cyclic subset.
Method.
  1. Initialize the task queue with all the nodes which do not
     have predecessors.
  2. For each node, v, in the task queue
       /* Note that the task queue can never become empty since we
          are scheduling a Cyclic subset with unbounded unwinding.
          So, this loop exits only upon finding a pattern. */
       schedule it to Pi, processor i, such that T(v,Pi) is the first
       minimum in the list (T(v,P1), ..., T(v,Pp)), where p is the
       total number of processors, and T(v,Pj) is the cycle in
       the resulting schedule at which v would be scheduled if it is
       assigned to Pj.
       Check if a pattern has emerged.
       If pattern found, exit.
       /* A pattern can be detected by checking if there is a
          configuration repeating. The meaning of this configuration
          and the proof that it correctly signals the emergence of a
          pattern will be given in Section 2.3. */
       For each successor w of v,
         decrease the number of predecessors of w by one.
         if number of predecessors for w = 0
           add w to the task queue
       endfor
     endfor
Figure 4
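The processor-selection sub-step can be illustrated with the following Python sketch. It is a simplification under stated assumptions, not the algorithm of Figure 4 itself: the loop is unwound a fixed number of times instead of running until a pattern is detected, instances are visited iteration by iteration in a fixed node order (which must respect the distance-0 links), every node takes one cycle, every cross-processor link costs exactly k cycles, and a node is only appended after the last node already placed on a processor. All names are hypothetical.

```python
def greedy_schedule(nodes, edges, iterations, num_procs, k):
    """Place each unwound node instance on the first processor that can
    start it earliest. Unit latency; k cycles per cross-processor link.

    nodes must be listed in an order that respects distance-0 edges.
    edges holds (src, dst, distance) with distance 0 or 1.
    Returns {(node, iteration): (processor, start_cycle)}.
    """
    placed = {}
    free_at = [0] * num_procs  # next idle cycle of each processor
    for it in range(iterations):
        for v in nodes:
            preds = [(s, it - d) for s, t, d in edges
                     if t == v and it - d >= 0]
            best_proc, best_time = 0, None
            for pj in range(num_procs):
                # T(v, Pj): earliest cycle at which v could start on Pj.
                t = free_at[pj]
                for pr in preds:
                    proc, start = placed[pr]
                    ready = start + 1 + (0 if proc == pj else k)
                    t = max(t, ready)
                # Strict '<' keeps the first minimum in processor order.
                if best_time is None or t < best_time:
                    best_proc, best_time = pj, t
            placed[(v, it)] = (best_proc, best_time)
            free_at[best_proc] = best_time + 1
    return placed


# Hypothetical five-node loop body with loop-carried links (distance 1).
NODES = ["A", "B", "C", "D", "E"]
EDGES = [("A", "B", 0), ("B", "C", 0), ("D", "E", 0),
         ("A", "A", 1), ("E", "A", 1), ("D", "D", 1), ("C", "D", 1)]

if __name__ == "__main__":
    result = greedy_schedule(NODES, EDGES, 4, 2, 2)
    for (v, it), (proc, start) in sorted(result.items(), key=lambda x: x[1][1]):
        print(f"cycle {start}: {v}{it} on P{proc}")
```

The tie-breaking rule matters: always taking the first minimum is one way to keep the placement decisions consistent from one iteration to the next, which is what allows a new pattern to emerge.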
Algorithm. Flow-in-sched.
Input.  Flow-in subset.
Output. A schedule for Flow-in subset.
Method.
  1. Prepare p = Ceiling(L/H) free processors,
     where L is the size of the Flow-in subset, and H is the height
     of the pattern obtained from algorithm Cyclic-sched. Call
     them the 0th, 1st, ..., (p-1)th processor.
  2. For each iteration, i,
       assign the Flow-in subset of iteration i to the (i mod p)th processor.
     endfor.
Figure 5
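The two steps of Flow-in-sched amount to a ceiling division followed by a modulo assignment, as the following Python sketch shows. The function name and the sizes used in the usage line are hypothetical and are not taken from the examples in this report.

```python
def flow_in_assignment(flow_in_size, pattern_height, iterations):
    """Return (p, mapping) where p = ceiling(L / H) and mapping sends
    iteration i to processor i mod p."""
    # Integer ceiling division avoids floating-point rounding.
    p = -(-flow_in_size // pattern_height)
    return p, {i: i % p for i in range(iterations)}


if __name__ == "__main__":
    # Hypothetical values: L = 7, H = 3, five iterations.
    print(flow_in_assignment(7, 3, 5))
```

With p chosen this way, each processor receives a new Flow-in subset of size L only once every p pattern periods, so it has at least p * H >= L cycles available per subset.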
takes O(N) since we need to look at all the predecessors of node v, the number of which is
bounded by N. Collecting executable nodes into the task queue also takes O(N) since in the
worst case, again, the node can have N successors. Therefore, the time complexity of the
first and third sub-steps is O(M * P * N^2). However, in real loops, most nodes have only
small numbers of successors/predecessors, which allows us to reduce the time complexity for
these two sub-steps to O(M * P * N) in realistic situations.
For each node scheduled, we check whether a pattern has been formed (the second sub-step).
The number of nodes to inspect is O(x^2), where x is the number of already scheduled
nodes at the time of inspection. Since x could range from 0 to M * N, the time complexity
for pattern detection is O(M^3 * N^3). However, again, this is a worst case scenario. M is
typically very small, less than 10 in all the examples we ran (see Sections 3 and 4). Also, there
is no need to check the pattern from the beginning of the scheduling process. By detecting
the pattern after the schedule is stabilized, we can reduce the time complexity for pattern
detection considerably. In fact, for all the examples in Sections 3 and 4, the behaviour of the
algorithm for pattern detection approached O(N).
The scheduling algorithm for the Flow-in subset is in Figure 5, and the final scheduling
algorithm is in Figure 6, where algorithm Flow-out-sched is virtually the same as Flow-in-sched.
Algorithm.
Input.  Data Dependence Graph of a loop.
Output. A schedule for it.
Method.
  1. Identify Flow-in, Cyclic, and Flow-out subsets (using algorithm
     classification).
  2. Schedule the Cyclic subset (using Cyclic-sched).
  3. Schedule the Flow-in subset (using Flow-in-sched).
  4. Schedule the Flow-out subset (using Flow-out-sched).
Figure 6
2.3 Proofs 
Algorithm Cyclic-sched can terminate successfully only if a pattern is detected in the resulting
schedule. We now prove the existence of that pattern. We assume that the number of
processors, p, is sufficient to accommodate the resulting schedule, and the largest communication
cost is k; each communication edge can have a different cost, but k is the upper
bound of this cost.⁹ Note that since we are proving the correct termination of algorithm
Cyclic-sched, we only need to look at the Cyclic nodes of a loop.
The proof can be visualized by imagining an infinite schedule resulting from full unwinding
and a window drawn on it, with width p and height k + 1. We will refer to the portion of
the schedule surrounded by it as a configuration. We slide the window down the schedule
and watch the configuration in it changing. If we find a configuration that has been observed
before, we stop the sliding, locate the position of the previous twin configuration, draw
another window on it, and restart the sliding but this time with two windows at the same
speed. If we see the two windows show the same sequence of configurations as they slide
down, we know we have found a pattern. So, the proof consists of two things: first, we
prove that there exist two distinct configurations which are identical, and then that once two
configurations are identical, the following two configuration sequences after them should be
⁹ This second assumption is only used in the process of scheduling. It does not in any way affect the
correctness of the execution of the resulting schedule. In fact, as we will see in Section 3, the actual execution
time can vary quite dramatically from that assumed in the scheduling process.
the same. 
Definition 1. A shifted form of a set of nodes, (n_0, n_1, ..., n_k), by d is the same set with
the indices shifted by d, (n_d, n_{1+d}, ..., n_{k+d}).
Definition 2. Two configurations are identical if the set of nodes for one is a shifted form
of the other, and the schedules for them are exactly the same.
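The notion of identical configurations suggests a simple way to look for a repeating window, sketched below in Python. The schedule representation (a list of schedule lines, each a tuple with one cell per processor holding either None or a (node, iteration) pair), the function names, and the tiny example schedule are all assumptions made for illustration. Two windows are compared after rewriting their iteration numbers relative to the smallest iteration they contain, which makes a window and its shifted form compare equal.

```python
def normalize(window):
    """Rewrite iteration numbers relative to the smallest one in the window."""
    its = [cell[1] for row in window for cell in row if cell is not None]
    base = min(its) if its else 0
    return tuple(
        tuple(None if cell is None else (cell[0], cell[1] - base)
              for cell in row)
        for row in window
    )


def find_repeated_configuration(schedule, height):
    """Return (first_row, second_row) for the earliest window that matches
    an earlier window up to an iteration shift, or None if there is none.
    Note that two completely empty windows also compare equal."""
    seen = {}
    for r in range(len(schedule) - height + 1):
        key = normalize(schedule[r:r + height])
        if key in seen:
            return seen[key], r
        seen[key] = r
    return None


# Hypothetical two-processor schedule that repeats every two cycles.
EXAMPLE_SCHEDULE = [
    (("A", 0), None),
    (("B", 0), ("C", 0)),
    (("A", 1), None),
    (("B", 1), ("C", 1)),
    (("A", 2), None),
    (("B", 2), ("C", 2)),
]
```

For the hypothetical schedule and a window of height 2, the window starting at row 2 is a shifted form (d = 1) of the window starting at row 0, so the rows in between form the repeating block.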
Lemma 3. Any two nodes, v and w, if they are in the same configuration, should be within
a finite number of iterations from each other. That is, if v is from iteration i, and w
from iteration j, then |i - j| is finite.
Proof: Suppose d = |i - j| is arbitrarily large. We assume i is smaller than or equal
to j, without loss of generality. From Lemma 2, we know there is a path of length at
least d from iteration i to iteration j. Let the start node of this path be V, and the end
node of it W. Also let the cycle of V in the resulting schedule be t_V, and that of W
be t_W. Since d is arbitrarily large, the number of cycles between V and W, t_W - t_V,
in the resulting schedule should be arbitrarily large, too. This means the number of
cycles between v and w in the resulting schedule is also arbitrarily large because they
are from the same iteration as V and W respectively, and thus should be scheduled
within a finite number of cycles from V and W each.¹⁰ This is a contradiction because
since v and w are in the same configuration, the number of cycles between them in the
schedule cannot be greater than k.
Lemma 4. The number of non-identical configurations in the schedule is finite.
Proof: Again imagine a window sliding down the schedule. We will prove that the number
of non-identical configurations that this window can show is finite. From Lemma
3, we know the number of consecutive iterations that any configuration can contain
is bounded by some number, say M. Suppose at some point our window selects its
nodes from iterations (i_{1+d}, ..., i_{M+d}), where d is an offset. We observe that the set of
configurations that this window can possibly show from iterations (i_{1+d}, ..., i_{M+d}) is
exactly the same as that it would from iterations (i_1, ..., i_M), because every configuration
from the former iterations has an identical matching configuration from the latter
with a shifting distance d (see Definitions 1 and 2). By generalizing, this means the
¹⁰ Any two nodes with the longest path between them having a length of l should be scheduled within (k+1)l
cycles from each other, assuming a sufficient number of processors. Obviously, the longest paths between v
and V and between w and W both have finite lengths since we assume the original data dependence graph is
a connected one.
kinds of configurations that this window can show is limited by the kinds that iterations
(i_1, ..., i_M) can supply. Since the number of nodes in this iteration range is finite,
and the size of the configuration window is finite too, so is the number of configurations
that this window can show; therefore, the total number of non-identical configurations
in the schedule is finite.
Lemma 5. There exist two identical configurations separate in location in the resulting
schedule.
Proof: Imagine a configuration window is sliding down this schedule. As the window
slides down, the contents in it will change, but from Lemma 4 the number of possible
configurations that it can show is finite; therefore, eventually it will repeat some
configuration which has appeared before.
Lemma 6. If two configurations are identical, the two respective following configurations
are identical, too.
Proof: Let the two configurations be C and D, and the schedule lines¹¹ right after
each be l_1 and l_2 (see Figure 9(c) for an example). First we observe that any node in
l_1 should have at least one of its direct predecessors in configuration C (the same thing
can be said for the nodes in l_2 with respect to configuration D). Otherwise its direct
predecessors are all located before configuration C, and, therefore, it should have been
scheduled within configuration C or before it.¹² This, in turn, means that the nodes
that can come in l_1 and l_2 are among the direct successors of the nodes in configuration
C and D respectively. Let the set of direct successors of the nodes in configuration C
be S_C, and that of the nodes in configuration D be S_D. Since C and D are identical,
S_C and S_D also should be identical. This means that the algorithm, right after the
completion of configuration C, will look at the same sequence of nodes to schedule as
it will after it has completed configuration D, as far as the schedule of line l_1 and l_2 is
concerned. Then since the schedule in line l_1 and l_2 is completely determined by the
configuration C and D respectively, the schedules of the two lines should be the same.
Since the schedules of l_1 and l_2 are identical, by moving the two surrounding windows
for configuration C and D one cycle down, we can see two identical succeeding
¹¹ A schedule line is the schedule of all processors at some fixed cycle.
¹² A node can always be executed within k+1 cycles after its last direct predecessor is executed because we
assume a sufficient number of processors. Note that k is the largest possible communication time. Since the
height of a configuration is k+1, a node all of whose predecessors have been executed before the configuration
should be executable at least at the bottom (schedule line) of the configuration.
configurations. 
Lemma 7. If two configurations are identical, the sequences of configurations following them
are the same.
Proof: Let the two configurations be C_0 and D_0 and the following sequences C_i and
D_i (i >= 1). The proof is by induction. If C_{i-1} and D_{i-1} are identical, we can say C_i
and D_i are identical from Lemma 6. Since C_0 and D_0 are identical, through induction,
we know C_i and D_i are identical for all i.
Theorem 1. Cyclic-sched produces a schedule which shows a pattern.
Proof: From Lemma 5, a configuration eventually repeats itself. Once it is repeated,
it keeps appearing regularly by Lemma 7; so, we have a repeating pattern between the
first and (not including) second configuration.¹³
3 Examples 
The first example (see Figure 7(a)-(e) and Figure 8(a) and 8(b)) shows the nontriviality of
loop partitioning. The code is given in Figure 7(a), and its data dependence graph is in
Figure 7(b). We note there is only one kind of node, Cyclic. The latency vector lv shows
the estimated execution time of the nodes. Figure 7(c) shows the topological sorting of the
nodes. By scheduling each node from this list one by one, with the communication time
(k = 2, in this example) taken into consideration, we get Figure 7(d). Here, we can see that
each processor is repeating some pattern of its own; and in effect, each iteration is completed
every three cycles. Finally in Figure 7(e), we show the transformed loop where the original
loop is partitioned into two subloops. DOACROSS will produce the schedule in Figure 8(a);
it is the same as the schedule of a sequential execution (by collapsing the columns in the figure
into PE0 column and removing all empty cycles) because no pipelining is possible due to the
(E,A) dependence link. Even with an optimal reordering, as in Figure 8(b) which is obtained
by an exhaustive search¹⁴, DOACROSS would still yield no performance improvement, in
this case, since no parallelism is achievable at the iteration level when synchronization cost is
taken into account. The percentage parallelism obtained for this example, which we define as
in [Cytron84] to be Sp = ((s - p) / s) * 100, where s and p are sequential and parallel execution
time respectively, is 40 by our algorithm, while that by DOACROSS is 0.
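The percentage parallelism of this example can be checked with a short Python computation. As an assumption about how s and p are measured, the sketch uses per-iteration times: five unit-latency nodes give s = 5 cycles per iteration sequentially, and the schedule of Figure 7(d) completes one iteration every three cycles, so p = 3. The function name is illustrative.

```python
def percentage_parallelism(s, p):
    """Sp = ((s - p) / s) * 100 for sequential time s and parallel time p."""
    return 100 * (s - p) / s


if __name__ == "__main__":
    print(percentage_parallelism(5, 3))  # schedule of Figure 7(d)
    print(percentage_parallelism(5, 5))  # no speedup over sequential
```

The first call reproduces the value of 40 quoted above; the second corresponds to a schedule that is no faster than sequential execution, which gives 0.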
¹³ Note the difference between a configuration and a pattern. In Figure 9(c), for example, window C shows
a configuration, while the pattern is enclosed by a box with height 6 below it.
¹⁴ In general, optimal reordering of nodes is NP-hard [Cytron86][MuSi87].
(a)
FOR I = 1 TO N
  A: A[I] = A[I-1] * E[I-1]
  B: B[I] = A[I]
  C: C[I] = B[I]
  D: D[I] = D[I-1] * C[I-1]
  E: E[I] = D[I]
ENDFOR

(b) [Data dependence graph of nodes A, B, C, D, E; drawing not recoverable.]
lv = (1,1,1,1,1) for A, B, C, D, E in that order.

(c) A1 D1 B1 E1 C1 D2 A2 E2 B2 C2 A3 D3 B3 E3 C3 D4 A4 E4 B4 C4 A5 D5 B5 ...

(d)
  step:  0   1   2   3   4   5   6   7   8   9   10  11  12  13
  PE0:   A1  B1  C1  D2  E2  -   A3  B3  C3  D4  E4  -   A5  B5
  PE1 (in order of execution; step alignment not recoverable): D1 E1 A2 B2 C2 D3 E3 A4 B4 C4 D5 E5

(e)
PARBEGIN (N IS ASSUMED TO BE AN EVEN NUMBER.)
PE0: A[1] = A[0] * E[0]
     (SEND A[1] TO PE1)
     B[1] = A[1]
     C[1] = B[1]
     (RECEIVE D[1] FROM PE1)
     D[2] = D[1] * C[1]
     (SEND D[2] TO PE1)
     E[2] = D[2]
     FOR I1 = 3 TO N-1 STEP 2
       (RECEIVE A[I1-1] FROM PE1)
       A[I1] = A[I1-1] * E[I1-1]
       (SEND A[I1] TO PE1)
       B[I1] = A[I1]
       C[I1] = B[I1]
       (RECEIVE D[I1] FROM PE1)
       D[I1+1] = D[I1] * C[I1]
       (SEND D[I1+1] TO PE1)
       E[I1+1] = D[I1+1]
     ENDFOR
PE1: D[1] = D[0] * C[0]
     (SEND D[1] TO PE0)
     E[1] = D[1]
     (RECEIVE A[1] FROM PE0)
     A[2] = A[1] * E[1]
     (SEND A[2] TO PE0)
     B[2] = A[2]
     C[2] = B[2]
     FOR I2 = 3 TO N-1 STEP 2
       (RECEIVE D[I2-1] FROM PE0)
       D[I2] = D[I2-1] * C[I2-1]
       (SEND D[I2] TO PE0)
       E[I2] = D[I2]
       (RECEIVE A[I2] FROM PE0)
       A[I2+1] = A[I2] * E[I2]
       (SEND A[I2+1] TO PE0)
       B[I2+1] = A[I2+1]
       C[I2+1] = B[I2+1]
     ENDFOR
PAREND
Figure 7: A non-trivial scheduling example. 
[The assignment of nodes to the four processor columns is not recoverable; steps 17 through 19 are empty in both schedules.]
  step:  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
  (a):   A1  B1  C1  D1  E1  -   A2  B2  C2  D2  E2  -   A3  B3  C3  D3  E3
  (b):   A1  B1  D1  E1  C1  -   A2  B2  D2  E2  C2  -   A3  B3  D3  E3  C3
Figure 8: Schedules by DOACROSS for Figure 7(b). Compare it with Figure 7(d). 
The second example is from [Cytron86] (see Figure 9(a)-(c) and Figure 10). As in the first
example, we show the code, data dependence graph, and the schedule. However, in step 1
of our algorithm, the Flow-in buffer will contain nodes {6,7,8,9,10,11,12,13,14,15,16}. There
are no Flow-out nodes. The rest of the nodes are all Cyclic, as determined by algorithm
classification. Note that the latency of the operations is not unique. Using algorithm Cyclic-sched,
we can generate Figure 9(c).¹⁵ We can see processor 0 is repeating nodes 3 and 5, while
processor 1 is repeating nodes 0, 1, 2, and 4. Again we assume the communication time is k = 2
arbitrarily. Figure 10 shows the final transformed loop after the Flow-in nodes are distributed
into three processors, and synchronization code inserted. In algorithm Flow-in-sched, for this
case, L, the size of the Flow-in subset, is 11, and H, the height of the pattern from algorithm
Cyclic-sched, is 6. Therefore, p, the number of needed free processors, is 3. In the result, we
have partitioned the original loop into five subloops. The Flow-in nodes are distributed to
three processors so as not to delay the execution time of the Cyclic subset. For this case,
the percentage parallelism obtained by our algorithm is 72.7%, and that by DOACROSS is
31.8%.
We give two more examples, one from the 18'11 Livermore Loop (Figure ll{a.)-(d)), the 
other from a. fifth order elliptic filter [PaKn89] (Figure 12(a) and Figure 12(b)). Figure ll(a.) 
is the original da.ta. dependence graph for the first example. We extracted Cyclic nodes from 
15 In Figure 9(c), each node is represented by two things: its name and its iteration number. For example,
(3,11) means the instance of node 3 from iteration 11.
[Figure 9, panels (a)-(c): the loop, its data dependence graph with latency vector lv, and the schedule with the repeating pattern (height H) enclosed by a box.]
Figure 9: An example from [Cytron86]. 
[Figure 10 listing: a PARBEGIN ... PAREND block (N is assumed to be a multiple of 3) containing five subloops, PE0 through PE4. PE0 and PE1 execute the Cyclic nodes and exchange values through SEND/RECEIVE pairs. The other three subloops execute the Flow-in nodes, each stepping through the iterations by 3, and also communicate through SEND/RECEIVE pairs.]
Figure 10: The parallelized loop for Figure 9(a). 
it, resulting in Figure 11(b), and re-labeled the nodes as shown in Figure 11(c). The schedule
is shown in Figure 11(d) with the pattern enclosed in a box. Figure 12(a) is the data
dependence graph for the second example, and Figure 12(b) is its schedule for the Cyclic
nodes. In both cases, most of the nodes are Cyclic; in the first example, only 8 nodes,
{1,2,3,6,9,10,11,14} in Figure 11(a), are non-Cyclic (they are Flow-in nodes), while in
the second, only node 34 is non-Cyclic (a Flow-out node). In such cases, scheduling
non-Cyclic nodes separately may cause low processor utilization. One way of avoiding this
waste is to schedule these non-Cyclic nodes onto one of the relatively idle processors: processor
0 in the first example and processor 1 in the second. In both cases, the non-Cyclic
nodes can be included with only a small amount of delay. The strategy is simple: after
the schedule of Cyclic nodes is completed, if there is a relatively idle processor with idle time
slots wide enough to accommodate the non-Cyclic nodes with little or no additional delay,
combine the non-Cyclic nodes into the idle processor. This heuristic can be easily combined
with our algorithm.
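
The heuristic above can be illustrated with a deliberately simplified sketch. It assumes that one pattern of height H covers one iteration, that only the total idle time of a processor matters (it ignores whether the idle slots are contiguous and ignores dependences), and that the function name, its parameters, and the numbers in the usage example are hypothetical.

```python
def place_non_cyclic(busy_per_pe, pattern_height, non_cyclic_latency, max_delay=0):
    """Pick a processor whose idle time can absorb the non-Cyclic nodes.

    busy_per_pe        -- cycles each processor is busy within one pattern
    pattern_height     -- H, the number of steps in one pattern
    non_cyclic_latency -- total latency of the non-Cyclic nodes per iteration
    max_delay          -- extra cycles per pattern that are still acceptable

    Returns the index of the chosen processor, or None if the non-Cyclic
    nodes should be scheduled on processors of their own.
    """
    idle = [pattern_height - busy for busy in busy_per_pe]
    best = max(range(len(idle)), key=idle.__getitem__)   # most idle processor
    if idle[best] + max_delay >= non_cyclic_latency:
        return best
    return None


if __name__ == "__main__":
    # Hypothetical pattern of height 6: PE0 is busy 2 cycles, PE1 all 6.
    print(place_non_cyclic([2, 6], 6, 3))
    # No processor has enough idle time unless some delay is tolerated.
    print(place_non_cyclic([5, 6], 6, 3))
    print(place_non_cyclic([5, 6], 6, 3, max_delay=2))
```

A fuller version would have to place each non-Cyclic node into a specific idle slot while respecting its dependences; the sketch only captures the decision of whether merging is worthwhile.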
In both cases, the loops are partitioned into two relatively independent subloops (see
Figure 11(d) and Figure 12(b)), and these partitionings produce higher percentage
parallelism than those of DOACROSS. The percentage parallelism achieved by our algorithm
is 49.4% and 30.9% for the two examples, while that of DOACROSS is 12.6% and 0%. Again, we
assumed k = 2, where k is the communication cost.
4 Experimentation 
The above examples show superior results for our algorithm. However, we have assumed
that the communication cost is fixed, which means there is no unpredictable fluctuation
in communication time. Also, the dependence pattern in the examples may have favored
our algorithm. To thoroughly test the performance of our algorithm and its robustness under unstable
communication traffic and complex dependence graphs, we generated 25
random loops and tested our algorithm under various traffic conditions.
We generated each random loop as follows. First, we fixed the number of nodes in
the loop at 40, and the number of loop carried dependences (lcd's) and simple dependences
(sd's) at 20 each. The execution time of each node is randomly chosen from 1 to 3 cycles
using a random number generator. Then, again using the random number generator, we
generated the actual dependence links, 20 for lcd's and another 20 for sd's. After this was done,
we extracted only the Cyclic nodes from the graph. The effect is that we have generated a random
loop, which contains only Cyclic nodes whose latencies vary from 1 to 3 cycles, with less than
[Figure 11 panels: (a) original data dependence graph; (b) graph after extracting the Cyclic nodes; (c) re-labeled graph; (d) schedule, with the pattern enclosed by a box.]
Figure 11: Scheduling the 18th Livermore Loop. 
[Figure 12 panels: (a) data dependence graph; (b) schedule for the Cyclic nodes.]
Figure 12: Scheduling the fifth order elliptic filter.
presence of unpredictable communication cost. 
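
The percentage parallelism used throughout the comparison is the sequential execution time minus the parallel execution time, divided by the sequential execution time. A minimal sketch of that calculation follows; the function names and the sample times are hypothetical, and the conversion to a conventional speedup ratio is added here only for orientation.

```python
def percentage_parallelism(sequential_time, parallel_time):
    """Percentage of the sequential execution time saved by the schedule."""
    return 100.0 * (sequential_time - parallel_time) / sequential_time


def speedup_from_percentage(percentage):
    """Conventional speedup (sequential time / parallel time) implied by a
    given percentage parallelism; not a quantity reported in the paper."""
    return 1.0 / (1.0 - percentage / 100.0)


if __name__ == "__main__":
    # Hypothetical times: a schedule that halves the execution time.
    print(percentage_parallelism(100, 50))
    # A schedule that is no faster than sequential execution scores 0,
    # which is how the 0.0 entries for DOACROSS arise.
    print(percentage_parallelism(40, 40))
    print(speedup_from_percentage(50.0))
```

Note that the "factor of speed-up over DOACROSS" in Table 1(b) is a ratio of two average percentages, not a ratio of execution times.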
5 Conclusion 
In this paper we have presented a new technique for scheduling non-vectorizable loops on MIMD
machines, which can produce higher percentage parallelism than conventional iteration-based
pipelining techniques. We have proved that our algorithm is correct and compared its performance
against a conventional iteration-based pipelining technique. The results show that our approach
can achieve higher performance, even when the estimate of the communication cost is
far off the mark and the actual cost of communication is relatively high (7 times the basic
node execution time). Thus our approach shows a great deal of robustness under adverse
circumstances.
or equal to 40 nodes and less than or equal to 20 lcd's and sd's. We repeated the same
process with different seeds (1 to 25), producing 25 different loops.
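
The generation procedure can be sketched as follows. This is an illustration under assumptions, not the generator used for the experiments: simple dependences are drawn as forward edges so that a single iteration stays acyclic, loop carried dependences may connect any pair of nodes (including a node with itself), and a node is treated as Cyclic when it lies on a dependence cycle, which may be narrower than the classification defined earlier in the paper. All function names are hypothetical.

```python
import random


def random_loop(seed, n_nodes=40, n_lcd=20, n_sd=20):
    """Return (latency, sd, lcd) for one random loop."""
    rng = random.Random(seed)
    latency = [rng.randint(1, 3) for _ in range(n_nodes)]    # 1 to 3 cycles
    sd = set()
    while len(sd) < n_sd:
        u, v = sorted(rng.sample(range(n_nodes), 2))
        sd.add((u, v))           # forward edge: one iteration stays acyclic
    lcd = set()
    while len(lcd) < n_lcd:
        lcd.add((rng.randrange(n_nodes), rng.randrange(n_nodes)))
    return latency, sd, lcd


def cyclic_nodes(n_nodes, edges):
    """Nodes that can reach themselves along one or more dependence edges."""
    succ = {u: [] for u in range(n_nodes)}
    for u, v in edges:
        succ[u].append(v)

    def reaches_itself(start):
        stack, seen = list(succ[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(succ[node])
        return False

    return {u for u in range(n_nodes) if reaches_itself(u)}


def extract_cyclic(latency, sd, lcd):
    """Keep only the Cyclic nodes and the dependences among them."""
    keep = cyclic_nodes(len(latency), sd | lcd)
    inside = lambda edge: edge[0] in keep and edge[1] in keep
    return ({u: latency[u] for u in keep},
            {e for e in sd if inside(e)},
            {e for e in lcd if inside(e)})


if __name__ == "__main__":
    for seed in range(1, 26):                  # seeds 1 to 25, as in the text
        lat, sd, lcd = extract_cyclic(*random_loop(seed))
        print(seed, len(lat), len(sd), len(lcd))
```

After extraction each loop has at most 40 nodes and at most 20 dependences of each kind, which is the property the text relies on.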
For each loop generated, we extracted only the Cyclic nodes16, and scheduled them using
our algorithm and DOACROSS. The resulting schedules were executed on a simulated multiprocessor.
We assumed fully overlapped communication, and the estimated communication
time for our algorithm was k = 3 cycles. To model the fluctuation in the actual communication
time and the asynchrony of the processors, we used a varying factor mm. With this varying
factor, the run time cost of each communication link varied between k and k + mm - 1. We
compared our algorithm with DOACROSS under three different values of mm: mm = 1 (no fluctuation),
mm = 3 (maximum 67% of delay in communication time), and mm = 5 (maximum
130% of delay in communication time). Thus the schedule our algorithm produces is based
on the estimated k, while at run time all communication takes k + mm - 1 cycles, clearly a
worst case scenario.
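
The arithmetic behind the three settings of mm can be written out directly. The sketch below assumes only the cost model stated above (estimated cost k, run time cost k + mm - 1); the function names are hypothetical.

```python
def worst_case_comm(k, mm):
    """Run time cost of one communication under fluctuation factor mm."""
    return k + mm - 1


def relative_delay(k, mm):
    """Extra communication time relative to the estimate k."""
    return (worst_case_comm(k, mm) - k) / k


if __name__ == "__main__":
    k = 3                                        # estimate used by the scheduler
    for mm in (1, 3, 5):
        cost = worst_case_comm(k, mm)
        print(mm, cost, round(100 * relative_delay(k, mm)), round(cost / k, 1))
```

For k = 3 the run time costs are 3, 5, and 7 cycles. The delay for mm = 3 is 2/3 of the estimate (about 67%), and for mm = 5 it is 4/3 of the estimate, which the text quotes as 130%. The ratio 7/3 is the underestimation factor of about 2.3 mentioned in the discussion of the results.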
The results of the performance comparison are in Table 1(a). For each loop, we ran the simulated
multiprocessor and measured the parallel execution time. By subtracting it from the
sequential execution time and dividing the result by the sequential execution time, we calculated
the percentage parallelism. Each entry in Table 1(a) shows the percentage parallelism,
obtained this way, for one loop. When mm = 1, our algorithm produced better schedules
than DOACROSS in all loops. The average percentage parallelism of our algorithm is about
a factor of 2.9 higher than that of DOACROSS (see Table 1(b)). When mm = 3, our algorithm
produced a worse schedule than DOACROSS in only one of the 25 loops; when
mm = 5, in only two. Even so, in both cases the average percentage
parallelism of our algorithm is higher than that of DOACROSS, by a factor of about 3.0 (mm = 3)
and 3.2 (mm = 5), as shown in Table 1(b). One thing to note is that
mm = 5 implies the communication cost was underestimated by a factor of 2.3, which will
happen only under very unstable asynchronous traffic. Even in this unpredictable situation,
our algorithm exploits more parallelism than DOACROSS on average. In fact, despite our
expectation that our algorithm's performance would worsen under such adverse conditions,
Table 1(b) shows that in the presence of unstable communication cost, our relative performance
versus DOACROSS actually improves (see the factor of speed-up over DOACROSS in
the table). This suggests that careful scheduling can be both robust and profitable in the
16 Non-Cyclic nodes would not increase the parallel execution time in our case, since the critical path in the schedule
is formed only by the Cyclic nodes. The execution time in DOACROSS would not be delayed considerably by
them either, if we properly separate them from the Cyclic nodes through reordering of operations. Thus, we can
put aside non-Cyclic nodes for the purpose of comparison between our algorithm and DOACROSS.
(a) Percentage parallelism for each loop:

             mm = 1            mm = 3            mm = 5
    loop    x   DOACROSS      x   DOACROSS      x   DOACROSS
      0   51.8    26.8      51.3    23.7      45.2    18.6
      1   36.1     0.0      26.0     0.0      15.2     0.0
      2   55.8    38.7      50.9    33.0      45.8    27.9
      3   41.2    19.0      34.2    13.8      26.6     8.0
      4   68.5    11.4      62.1     9.2      55.7     7.2
      5   39.8    10.5      28.5     6.8      18.2     0.8
      6   48.6    16.9      40.9    12.4      33.2     8.3
      7   42.0    14.2      30.8     9.2      15.2     6.2
      8   65.7    40.7      60.5    37.7      56.7    33.1
      9   21.2    15.3       6.0    11.3       0.0     7.1
     10   48.5    15.7      44.1    13.4      39.4     8.6
     11   56.0    31.1      52.3    27.5      47.8    24.2
     12   66.0    20.0      61.4    16.2      57.1    11.1
     13   55.6    10.5      47.7     7.7      36.4     4.6
     14   36.6    31.1      32.3    28.3      23.3    26.9
     15   44.3    13.1      31.6    10.4      22.0     5.9
     16   34.1     0.0      22.5     0.0      12.8     0.0
     17   36.1    11.5      25.0     8.5      13.8     5.5
     18   56.7    11.7      43.5     7.5      30.0     2.9
     19   36.4    25.3      30.3    21.2      18.7    17.3
     20   47.3     0.0      39.2     0.0      29.8     0.0
     21   42.9    18.8      30.6    14.1      16.7     8.7
     22   34.4     3.7      29.2     1.2      17.7     0.0
     23   49.3     9.6      41.9     5.8      34.5     1.1
     24   61.3    11.1      52.7     6.5      44.1     2.2

(b) Average percentage parallelism:

                            mm = 1     mm = 3     mm = 5
    x                       47.4046    39.0674    30.2776
    DOACROSS                16.3135    13.0623     9.4823
    Factor of speed-up
    over DOACROSS             2.9        3.0        3.2

Table 1: Comparison of performance between our algorithm (denoted by x) and DOACROSS. 
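
The factors in Table 1(b) are the ratios of the two average percentages in each column. The following sketch recomputes them from the averages reported in the table; the variable and function names are hypothetical.

```python
# Average percentage parallelism from Table 1(b): mm -> (x, DOACROSS).
AVERAGES = {
    1: (47.4046, 16.3135),
    3: (39.0674, 13.0623),
    5: (30.2776, 9.4823),
}


def speedup_factors(averages):
    """Ratio of the average of x to the average of DOACROSS, per mm."""
    return {mm: round(x / doacross, 1) for mm, (x, doacross) in averages.items()}


if __name__ == "__main__":
    print(speedup_factors(AVERAGES))   # {1: 2.9, 3: 3.0, 5: 3.2}
```

The ratio grows from 2.9 to 3.2 as mm increases, which is the sense in which the relative performance versus DOACROSS improves under unstable communication cost, even though both averages fall in absolute terms.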
References 
[AiNi88a] Aiken, A. and Nicolau, A. 1988. Optimal loop parallelization. In Proceedings of the
1988 ACM SIGPLAN Conference on Programming Language Design and Implementation,
June.
[AiNi88b] Aiken, A. and Nicolau, A. 1988. Perfect Pipelining: A new loop parallelization
technique. In Proceedings of the 1988 European Symposium on Programming. Springer
Verlag Lecture Notes in Computer Science no. 300, March.
[AlKe83] Allen, J.R., Kennedy, K., Porterfield, C. and Warren, J. 1983. Conversion of control
dependence to data dependence. In Proceedings of the 1983 Symposium on Principles of
Programming Languages, pp. 177-189, January.
[Cytron84] Cytron, R.G. 1984. Compile-time Scheduling and Optimization for Asynchronous
Machines. PhD Thesis, University of Illinois at Urbana-Champaign.
[Cytron86] Cytron, R.G. 1986. Doacross: Beyond Vectorization for Multiprocessors. In Proceedings
of the 1986 International Conference on Parallel Processing, St. Charles, IL, pp.
836-844, August.
[FiDo84] Fisher, J.A. and O'Donnell, J.J. 1984. VLIW machines: Multiprocessors we can
actually program. In Proceedings of CompCon Spring 84, pp. 299-305. IEEE Computer
Society, February.
[MuSi87] Munshi, A.A. and Simons, B. 1987. Scheduling Sequential Loops on Parallel Processors.
In Proceedings of the 1987 International Conference on Parallel Processing, St.
Charles, Illinois, August.
[Padua79] Padua, D.A. Multiprocessors: discussion of some theoretical and practical problems.
PhD Thesis, University of Illinois at Urbana-Champaign.
[PaKn89] Paulin, P.G. and Knight, J.P. 1989. Force-directed scheduling for the Behavioral
Synthesis of ASIC's. In IEEE Transactions on Computer-Aided Design, Vol. 8, No. 6, June.
